
Protect Your Contact Information From Crawlers

Bastian Heinlein on February 24, 2020

In Germany, we are required to publish contact information on every commercial website including email address and phone number. I never put much e...
 

I usually encode the email part (and other sensitive data) as hex character codes in HTML, like this:


mailto:info@example.com

which will render as

mailto:info@example.com

I have embedded my email like this and haven't received any spam so far. The advantage is that users see the correct values, because the browser renders the entities as normal characters (a normal user won't even notice that the characters are written as hex HTML codes). Clicking on a mail address works too, and no JS is required.

@edit: fun fact: this form also interpreted the hex encoded values :)
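
Since the form swallowed the encoded version above, here is a rough sketch of how such a string could be generated. The helper name `to_hex_entities` is made up for illustration, and it assumes every character (not just a few) is turned into a hex character reference of the form `&#x..;`. `html.unescape` stands in for what a browser does when it renders the page.

```python
import html


def to_hex_entities(text: str) -> str:
    """Encode every character as an HTML hex character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in text)


# Encode the link target and the visible address separately.
href = to_hex_entities("mailto:info@example.com")
label = to_hex_entities("info@example.com")
anchor = f'<a href="{href}">{label}</a>'

# A browser (simulated here with html.unescape) decodes the references again.
print(anchor)
print(html.unescape(href))
```

The markup in the page source no longer contains the plain address, while anything that decodes character references sees the original text again.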

 

Nice, it seems like the easiest solutions work best, sometimes :-)

 

yes, this is a nice approach, but the problem is that it only works against "text" crawlers. nowadays, there are many "headless browsers", which actually render the page's DOM in memory, even run JavaScript code, and then crawl the output.

this of course applies to Mr. Heinlein's approach as well.

in both cases, just add recaptcha and you'll be good... for now...

 

I know that this is possible, but luckily, I guess, there aren't many crawlers out there yet that harvest emails by executing JavaScript. Maybe that "business" is not so interesting any more? The solution is simple and easy to implement, but it doesn't keep all bots out.

And if I can ever avoid captchas, I will. Example: dirty-co.de/user-experience/wenn-d... - I wanted to get information about a package that was supposed to be delivered by DHL, but their captcha didn't load properly, so I was stuck there ... bad UX!

lol, i had the exact same captcha-not-shown problem with the huawei website some time ago.

but google's recaptcha is pretty neat lately. you don't even have to copy/write anything if recaptcha evaluates you as "human" (which it does in most cases).

Ah well, I didn't read carefully enough ;). If you're complying with the GDPR (as we are required to here in Germany), you may run into a new issue when trying to use Google Recaptcha ... :S

This! But it's not only regulations like the GDPR; I personally wouldn't want to give Google more information about the people using my websites and, even more importantly, about the websites' usage.

oh, i didn't think about that in the GDPR context. why is there a problem with recaptcha? (i'm more into tech than law stuff, so i don't know).

This is my imperfect understanding: While it is still possible to use Google Recaptcha, it causes a lot of privacy headaches, because Google processes personal data and places cookies. The latter means that you'll need at least some kind of cookie banner and must make sure that cookies are only placed after the user has explicitly allowed it. But more important is the former: Google not only processes personal data, but possibly does so in the US or somewhere else. This means, as I understand it, that you'll need some kind of contract with them to protect your and your users' interests. That is usually a standard contract; however, it is still a legally binding contract.

And in some related cases of which I'm aware, courts ruled that you could theoretically be partially responsible if your contract partner violates privacy regulations.

While this is to the best of my knowledge, there is of course no guarantee that my knowledge, which is probably several months out of date, is still correct.

 

I usually let CloudFlare handle that for me. They mangle mailto: links and inject a script that unmangles them only in JavaScript.

For websites where CloudFlare is not desirable, I run a similar mangling/demangling algorithm that uses base64 encoding of the address, with some characters replaced and the final string reversed to avoid easy detection of bWFpbHRvOg (base64 for mailto:) at the start.

I found the technique here:
code.luasoftware.com/tutorials/jav...

Example:
github.com/franky47/penelopebuckle...
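
To make the idea concrete, here is a minimal sketch of the mangle/demangle round trip, not the exact code from the links above. Which characters get replaced is an assumption on my part: this version only swaps the `=` padding for `-`, and the function names `mangle` and `demangle` are made up. On a real page the demangling step would run in the browser in JavaScript; Python is used here only to show the algorithm.

```python
import base64


def mangle(address: str) -> str:
    """Base64-encode 'mailto:<address>', replace padding, reverse the string."""
    encoded = base64.b64encode(f"mailto:{address}".encode("utf-8")).decode("ascii")
    # Assumption: '-' is not part of the standard base64 alphabet,
    # so swapping '=' for it can be undone without ambiguity.
    swapped = encoded.replace("=", "-")
    return swapped[::-1]


def demangle(token: str) -> str:
    """Undo mangle(): reverse, restore the padding, base64-decode."""
    restored = token[::-1].replace("-", "=")
    return base64.b64decode(restored).decode("utf-8")


token = mangle("info@example.com")
print(token)
print(demangle(token))
```

Because the string is reversed, the recognizable base64 prefix for `mailto` ends up at the end of the token, written backwards, instead of at the start.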

 

Thanks for sharing this solution, I really like it!

 

Most of the time it's not your website that leaks your address but all the other places: any registry or 3rd party service that suffers a data leak.
If I were you, I would present my regular gmail address. Perhaps it doesn't look "professional", but it has the best spam filter available on the market.
For real clients to whom you give presentations, you can use expiring one-time links dedicated to each client. You put your professional e-mail there, and on your business cards.
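
One possible way to build such per-client links is a signed token with an expiry timestamp. The sketch below is only an illustration under my own assumptions: `SECRET_KEY`, `make_token` and `is_valid` are hypothetical names, and it only covers the "expiring" part. Making a link truly one-time would additionally need server-side state that remembers which tokens were already used.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-random-secret"  # hypothetical, keep it server-side


def _sign(payload: str) -> str:
    """Return the hex HMAC-SHA256 signature of the payload."""
    return hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def make_token(client_id: str, expires_at: int) -> str:
    """Build '<client_id>.<expiry as Unix time>.<signature>'."""
    payload = f"{client_id}.{expires_at}"
    return f"{payload}.{_sign(payload)}"


def is_valid(token: str, now: int) -> bool:
    """Accept the token only if the signature matches and it has not expired."""
    try:
        client_id, expires_text, signature = token.rsplit(".", 2)
        expires_at = int(expires_text)
    except ValueError:
        return False
    expected = _sign(f"{client_id}.{expires_text}")
    same = hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))
    return same and now < expires_at


token = make_token("acme", 1_700_000_000)
print(is_valid(token, now=1_699_999_999))
print(is_valid(token, now=1_700_000_001))
```

The token would go into the link you hand to a client, and the page behind it would show the professional address only while `is_valid` accepts it.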

 

My personal, statistically insignificant experience from some (accidental) A/B testing is that crawlers are a bigger problem than database breaches, especially considering that most of the spam messages were explicitly targeted at businesses.

 

Or use G Suite for business and get the best of both worlds?

 
 

Yes, but by doing that you also keep out the Google "crawlers" that look for your contact details in order to index them. That could be bad for some websites.

 

Well, of course you also make life harder for "good" crawlers, but as far as I am aware, you can manually enter and edit business data on Google's website.
