DEV Community

Cover image for Generate Sanitized Email Hash (as Integer)
James Moberg
James Moberg

Posted on

Generate Sanitized Email Hash (as Integer)

While reviewing the logs of failed contact form submissions, I identified a couple email address variations that were exploiting some Gmail features in an attempt to bypass our filters. (Gmail has a "plus" feature and ignores periods in addresses.) A SQL query using REPLACE to remove all periods revealed that this comment form spammer had performed 279 attempts using 162 variations of their 15 character gmail username in an effort to circumvent our filters. We log the full email address that was posted and, when matching via SQL solely using the email addresses, it appeared as each email address was only used 2-4 times... versus the 279 obfuscated attempts.

To better identify & highlight abusers via SQL queries, an EmailHash (INT) column has been added to the database table. When searching or logging the email address, the value is sanitized (remove + string and . from the username) and then a java hashCode is generated. Using integers to join database records is much faster than using varchar and has lower storage requirements.

While there's a very slight risk of collision, I'm not aware of any other numeric hashing solution. I've tested this on multiple ColdFusion/Lucee installations and Java's hashCode has returned consistent results regardless of platform and/or JDK. I've used hashCode in some other ColdFusion applications and even have some other blog posts regarding it:

Here's the source code. Enjoy!

Top comments (2)

bennadel profile image
Ben Nadel

That's a clever idea! Though, I have to admit that I use the + style addressing to generate a lot of test accounts when I make sure things work properly. I suppose you could make the database column nullable, and then store a NULL for internal accounts (since NULL is "unique" in most database indexes). That said, I usually hate nullable columns :D Cool stuff though.

gamesover profile image
James Moberg

I actually use NULL. In a recent email marketing project, the SQL Server database table is shared with a legacy Access desktop application and it doesn't perform any validation or deduplication. (YIKES!) Using the hashed integer enables the query to ignore to cosmetic "plus" and dot notation to retrieve unique email addresses so that we aren't sending a message to the same email address that just happened to be entered differently by their staff.

I personally use the "plus" and "dot" feature for test accounts too. I think I would apply other patterns to allow internal test accounts, but it would still be beneficial to quickly identify + group all test accounts based on a single sanitized email hash number rather than leaving it NULL.

The Plus email address can be problematic when sending marketing emails to a the same inbox with address variations and it's rejected or bounced for each address.

Regarding contact form spam abuse, I can't reiterate how beneficial this has been to remove the uniqueness, identify and block.