buzz.typo3.org: Spam Protecting Your TYPO3 E-mail Addresses With a Special Twist

January 15, 2007

Spam Protecting Your TYPO3 E-mail Addresses With a Special Twist

By: Ron Hall

With a little adjustment in code you can improve TYPO3's capabilities in obscuring e-mail addresses.

First, a note to all you TYPO3 veterans: if you are familiar with how to obscure e-mail addresses with TypoScript, you may want to skip down to the section where I introduce "The Twist" and the extra JavaScript needed to pull this off.

The Challenge

Spam bots are continually crawling your website trying to harvest your e-mail address so that they can send you boatloads of junk mail. It is always a challenge to obscure addresses from these bots and still make the site easy for users to use and editors to edit.

I am going to first present the code TYPO3 already has to address the problem. After that I will introduce a twist that I came up with through experimentation, which I believe better hides the addresses from bots while maintaining usability. The e-mail address that I will be using in the examples is myname@mydomain.com. Without any attempt at obscuring, this linked address will look like this to the front end user:

myname@mydomain.com

and look like this to spam bots in the source code:

<a href="mailto:myname@mydomain.com" class="mail">
myname@mydomain.com</a>

Obviously, the address is entirely exposed to the dreaded spam bots.

Basic TYPO3 Approach for Protecting E-mail Addresses

There is a very simple way to obscure linked e-mail addresses in your TYPO3 site. Add code like this to your TYPO3 template.

config {
	spamProtectEmailAddresses = -3
	spamProtectEmailAddresses_atSubst = [at]
	spamProtectEmailAddresses_lastDotSubst = [dot]
}

After putting this code in the template and clearing the cache, our address now looks like this to our site visitor:

myname[at]mydomain[dot]com

and like this in the source code:

<a href="javascript:linkTo_UnCryptMailto('jxfiql7jvkxjbXjvaljxfk+zlj');"
class="mail">myname[at]mydomain[dot]com</a>

What does the TypoScript do?

The value of "spamProtectEmailAddresses" is the offset used to scramble the address in the mailto link and can be a number from -5 to 1 (yes, that is negative 5 to positive 1).
The value of "spamProtectEmailAddresses_atSubst" is what the viewer sees in place of the "@" in the address, and it can be any text, including "@" or its equivalent HTML entity.
The value of "spamProtectEmailAddresses_lastDotSubst" is what the viewer sees in place of the last "." in the address.

The address functions as normal, meaning users can still click on it and have their e-mail client open with the address inserted into a new message. Of course, this approach depends on JavaScript being available in the user's browser, but that is pretty much universal these days. You can read more about these settings on page 57 of TSref.
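The scrambling can be sketched as a simple character offset. The following is a minimal illustration, not TYPO3's own code: the function names are hypothetical, and the three character ranges are an assumption chosen because they reproduce the encoded string shown above for an offset of -3.

```javascript
// Assumed character ranges; each character is shifted inside its own range.
const RANGES = [
  [0x2b, 0x3a], // + , - . / 0-9 :
  [0x40, 0x5a], // @ A-Z
  [0x61, 0x7a], // a-z
];

// Shift one character code by `offset`, wrapping around within its range.
function shiftChar(code, offset) {
  for (const [start, end] of RANGES) {
    if (code >= start && code <= end) {
      const size = end - start + 1;
      return start + ((((code - start + offset) % size) + size) % size);
    }
  }
  return code; // anything outside the ranges passes through unchanged
}

// Shift every character of a string (hypothetical helper name).
function shiftString(text, offset) {
  return Array.from(text, (ch) =>
    String.fromCharCode(shiftChar(ch.charCodeAt(0), offset))
  ).join('');
}

const encoded = shiftString('mailto:myname@mydomain.com', -3);
console.log(encoded);                 // jxfiql7jvkxjbXjvaljxfk+zlj
console.log(shiftString(encoded, 3)); // mailto:myname@mydomain.com
```

Shifting by the opposite offset restores the original link, which is roughly what has to happen in the browser when the visitor clicks the address.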

So, what have we accomplished so far? We have done a good job of hiding the address contained in the "mailto" part of the source code. And we have made an attempt to hide the text from bots by substituting "[at]" for "@" and substituting "[dot]" for "." However, it is becoming more common for people to substitute alternate text for these parts of an e-mail address. Bots can easily be programmed to see through this.
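To show how little effort it takes to see through text placeholders, here is a minimal sketch of what a harvester might do. The function name is hypothetical and the pattern only covers bracketed "at" and "dot" variants; it is an illustration, not a description of any particular bot.

```javascript
// Hypothetical harvester-side helper: turn common text placeholders
// such as [at], (at), [dot] or (dot) back into "@" and ".".
function undoPlaceholders(text) {
  return text
    .replace(/\s*[\[({]\s*at\s*[\])}]\s*/gi, '@')
    .replace(/\s*[\[({]\s*dot\s*[\])}]\s*/gi, '.');
}

console.log(undoPlaceholders('myname[at]mydomain[dot]com'));     // myname@mydomain.com
console.log(undoPlaceholders('myname (at) mydomain (dot) com')); // myname@mydomain.com
```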

What have we sacrificed? Nothing for the editor, as they still enter the address in the backend as myname@mydomain.com and link it. But for the web site user we have definitely sacrificed some level of usability, especially for those who are less web savvy and are confused by our odd-looking text.

There is, however, a better way.

The Twist

First, let me qualify a couple of things. I came up with this technique on my own; however, it is quite possible that others have figured this out as well. Second, for all you JavaScript programmers, please withhold your laughter. I am sure the JavaScript code can be much cleaner. Basically, I quickly looked stuff up in a JavaScript book, put it together and tested it. The function names are written for clarity in this post and not for compactness of code.

This approach came from asking myself, "I wonder if TypoScript will let me do this?" and ended with "Isn't it cool that TypoScript allows this to be done?"

The basic concept is to have TypoScript substitute JavaScript code for "@" and "." in the e-mail instead of substituting text.

Step One

You need a couple of JavaScript functions available to your page. One way to do this is to add the following code to the page object in your site template (of course, the example assumes your page object is called "page" and that there is not already a "headerData.50" -- adjust names if needed):

page {
	# Adds the two helper functions to the <head> of every page
	headerData.50 = TEXT
	headerData.50.value (
	<script type="text/javascript">
		<!--
		function obscureAddMid() {
			document.write('&#64;');
		}
		function obscureAddEnd() {
			document.write('&#46;');
		}
		// -->
	</script>
	)
}
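The two functions write numeric character references rather than literal characters: 64 and 46 are the decimal character codes of "@" and ".", so the browser renders them as the real symbols. A small sketch of that mapping, using a hypothetical helper name:

```javascript
// Hypothetical helper: resolve decimal numeric character references
// such as &#64; the way a browser would when rendering them.
function decodeDecimalEntities(html) {
  return html.replace(/&#(\d+);/g, (match, code) =>
    String.fromCharCode(Number(code))
  );
}

console.log(decodeDecimalEntities('&#64;'));                       // @
console.log(decodeDecimalEntities('&#46;'));                       // .
console.log(decodeDecimalEntities('myname&#64;mydomain&#46;com')); // myname@mydomain.com
```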

Step Two

Use this code for your spam protection. Each value must stay on the same line as its property name, because TypoScript does not continue a plain assignment onto the next line:

config {
	spamProtectEmailAddresses = -2
	spamProtectEmailAddresses_atSubst = <script type="text/javascript"> obscureAddMid() </script>
	spamProtectEmailAddresses_lastDotSubst = <script type="text/javascript"> obscureAddEnd() </script>
}

After clearing the cache our example e-mail will look like this to the front end user:

myname@mydomain.com

And to the spam bots the source code looks like:

<a href="javascript:linkTo_UnCryptMailto('jxfiql7jvkxjbXjvaljxfk+zlj');"
class="mail">myname<script type="text/javascript">
obscureAddMid()
</script>mydomain<script type="text/javascript">
obscureAddEnd() </script>com</a>

What have we accomplished now?

For the backend user: They still enter the address and link it as they always have.
For the front end user: The address appears normal to them, it is still clickable and can be cut and pasted.
For the spam bots: We have made it much more difficult for them to find and read e-mail addresses.
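As a rough illustration of the last point, the sketch below shows what a simple tag-stripping harvester would be left with when the link text uses the Step Two substitutions. The helper names and the e-mail pattern are hypothetical, and the sketch does not cover a bot that executes JavaScript or decodes the scrambled href.

```javascript
// Link text as it appears in the page source with the Step Two substitutions.
const linkText =
  'myname<script type="text/javascript"> obscureAddMid() </script>' +
  'mydomain<script type="text/javascript"> obscureAddEnd() </script>com';

// Remove tags but keep whatever text sits between them.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Remove script elements together with their contents.
function stripScriptElements(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

// A deliberately naive e-mail pattern, as a harvester might use.
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/;

console.log(stripTags(linkText));                      // myname obscureAddMid() mydomain obscureAddEnd() com
console.log(stripTags(stripScriptElements(linkText))); // mynamemydomaincom
console.log(EMAIL_PATTERN.test(stripTags(linkText)));  // false
```

Either way, neither "@" nor "." survives in the text, so a pattern match on the stripped source finds nothing.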

You can see this approach in operation on the e-mail address on this page.

Anyway, this very long blog post all came from asking myself, "I wonder if...."


comments

comment #1

Michael Stucki January 15, 2007 07:32

Hi Ron,

this solution looks nice, but there are much easier ways to do that. If you agree that a spam bot will most likely not parse CSS definitions, you can simply do this:

config {
spamProtectEmailAddresses = 1
spamProtectEmailAddresses_atSubst = ping@
spamProtectEmailAddresses_lastDotSubst = pong.
}

The bot will either find the address my.nameping@myhostpong.com (if he just strips all HTML tags) or mynamemyhostcom (if he strips HTML tags including the contents).

Pros and cons:
+ Human accessibility (in this case you should use better placeholders, e.g. "REMOVE_THIS" instead of "ping" and "pong")
+ no additional JavaScript tricks required
- not 100% future safe: Bots might become able to parse CSS, but they also might be able to parse JavaScript as well as image placeholders

--
- michael

comment #2

Michael Stucki January 15, 2007 07:37

Too bad, the posting form does not htmlspecialchar the contents!

Here's the snippet again:

config {
spamProtectEmailAddresses = 1
spamProtectEmailAddresses_atSubst = ping@
spamProtectEmailAddresses_lastDotSubst = pong.
}

--
- michael

comment #3

Ron Hall January 15, 2007 15:43

Thanks Michael,

That is a nice solution. I agree that bots are less likely to be programmed to accommodate for CSS than JavaScript at this point in time.

I think your suggestion of using "REMOVE_THIS" or something with similar meaning is helpful in case the user has styles turned off or is using a screen reader that does not properly handle CSS. Of course, if they need a screen reader they would be foolish to use one not perfect in CSS support. Also, with this method, when a legitimate user uses copy and paste on the address it will include the placeholder (at least on my Mac it does).

By itself, this CSS method provides good protection, although I do wonder whether, in practice, the CSS approach adds much more protection than the JavaScript. After all, I would think that if the bot searching your site is able to handle JavaScript but not CSS, then it will probably still harvest your e-mail from the "mailto" portion of the code. Of course, that is just the thinking out loud of someone who has no idea what he is talking about. :)

One encouragement I should have included in my post was this: "Try to make it difficult for bots to harvest your addresses, but give up on making it impossible." Be satisfied with stopping most of them. Trying to stop them all will drive you crazy.

One thing that is encouraging is that our comrades in IT/systems are doing an increasingly effective job of stopping junk e-mail. I know much less gets to my inbox than three years ago. Thank you to all of you that work in this area.
