Timo Schinkel

Posted on Jun 14, 2024 • Edited on Jun 28, 2024

Friday Thoughts on email validation

#webdev

While working on a new authentication system I was getting alerts that account creation was failing. After diving into the logs I learned that the accounts were rejected because the email addresses were not deemed valid. But I made sure standardized email validation was in place. What's going on here?

So many systems and so many specifications

The authentication system is web-based and thus uses HTML¹. There is a backend written in JavaScript (actually TypeScript), which in turn, for some operations, talks to a service written in .NET that stores data in AWS Cognito.

Because the front-end is web based we use <input type="email">. This has a number of benefits from the perspective of usability; we get out-of-the-box validation on the format of the input, for some devices a customized keyboard is shown, and password managers are inclined to prefill the field. The validation rules of this input type are well-defined in the HTML specification. There's even a handy regular expression for you to use, which is nice as JavaScript does not have out-of-the-box email validation.
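As a minimal sketch of reusing that regular expression in the TypeScript backend: the pattern below is the one given in the HTML specification for valid email addresses, while the helper name isValidHtmlEmail is my own and purely illustrative.

```typescript
// Pattern from the HTML specification's definition of a valid email address
// for <input type="email">. It is built from a string so that the backtick
// inside the character class does not need any escaping.
const htmlEmailPattern = new RegExp(
  "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+" +
    "@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?" +
    "(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
);

// Hypothetical helper name; not part of any library.
function isValidHtmlEmail(address: string): boolean {
  return htmlEmailPattern.test(address);
}

console.log(isValidHtmlEmail("name@example.com"));
console.log(isValidHtmlEmail(".name@example.com"));
console.log(isValidHtmlEmail("name@-example.com"));
```

Note that this pattern treats . as just another allowed character in the local part, so it does not care where the dots are placed.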

The .NET service uses a feature called Data Annotations for Model Validation to validate incoming models, and this includes a validation for email addresses. Microsoft is nice enough to share the code for this validation with us: https://github.com/microsoft/referencesource/blob/master/System.ComponentModel.DataAnnotations/DataAnnotations/EmailAddressAttribute.cs#L48

The storage for our system is AWS Cognito, and this was the actual source of our errors. The Cognito documentation tells us the following:

Generally email , Value must be a valid email address string following the standard email format with @ symbol and domain, up to 2048 characters in length.

Luckily we give AWS enough money that they are willing to answer our questions. AWS told me that they use RFC 3696 to validate email addresses.

An interesting fact is that all the specifications for email addresses also need to be compliant with RFC 1035. That RFC describes how a domain name should be constructed.

To validate or not to validate

According to David Gilbertson the only proper way to validate an email address is by sending an email to that address containing a link. When that link is clicked then we know for sure that the email address is valid. But we are in the e-commerce business, and we want to remove as many blockers from our customer journey as possible. As such we don't want to interrupt the checkout process with a verification email.

So yes, ideally we would not validate the email syntax, and yes, ideally we would send a verification email, but in the real world we are sometimes faced with non-ideal scenarios. In our situation it did not help that the error messages coming from Cognito do not contain distinctive error codes. And even if they did, our .NET service would have to respond with a Bad Request status code, and we would still have to interpret the response. Validating the syntax of an email address is not ideal, but for our scenario it allows us to quickly give feedback to our user and prevent "invalid" email addresses from making their way into the rest of our system.

Comparison

The email addresses that were marked as invalid can be grouped into the following scenarios:

invalid . usage; the local part of an email address cannot start or end with . and two or more consecutive . are not allowed. So .name@example.com, name.@example.com, and na..me@example.com are not allowed
missing TLD extensions; according to RFC 5322 the domain part can be any internet address, which makes name@localhost a valid email address, however this is in violation of RFC 3696
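The scenarios above can be sketched as a small check. This is an illustration under my own assumptions and not the code we run in production: the function name and the reason labels are hypothetical, quoted local parts such as "name..name" are deliberately not given special treatment, and the extra "missing-at" case only exists to keep the helper safe on arbitrary input.

```typescript
type RejectionReason = "missing-at" | "invalid-dot-usage" | "missing-tld";

// Hypothetical helper: returns which scenario an address falls into,
// or null when none of them applies.
function findRejectionReason(address: string): RejectionReason | null {
  const at = address.lastIndexOf("@");
  if (at <= 0 || at === address.length - 1) {
    return "missing-at";
  }

  const local = address.slice(0, at);
  const domain = address.slice(at + 1);

  // Local part starts with ".", ends with ".", or contains "..".
  if (/^\.|\.$|\.\./.test(local)) {
    return "invalid-dot-usage";
  }

  // A domain without any "." has no TLD extension (e.g. localhost).
  if (domain.indexOf(".") === -1) {
    return "missing-tld";
  }

  return null;
}

console.log(findRejectionReason("na..me@example.com"));
console.log(findRejectionReason("name@localhost"));
```

Returning a reason instead of a boolean makes it possible to tell the customer what is wrong with the address, rather than only that it was rejected.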

Let's make a comparison of how the different layers handle these addresses:

                                    HTML    .NET    Cognito  
name@example.com                     ✅      ✅       ✅
name.name@example.com                ✅      ✅       ✅
"name..name"@example.com             ❌      ✅       ❌
name@localhost                       ✅      ✅       ✅
nåme@example.com                     ❌      ✅       ✅
aA0!#$%&'*+-/=?^_`{|}~@example.com   ✅      ✅       ✅

name.example.com                     ❌      ❌       ❌
.name@example.com                    ✅      ✅       ❌
name.@example.com                    ✅      ✅       ❌
name..name@example.com               ✅      ✅       ❌
<name>@example.com                   ❌      ✅       ❌

name@-example.com                    ❌      ✅       ❌
name@example-.com                    ❌      ✅       ❌
name@example.com-                    ❌      ✅       ❌

# extra examples from the 2024-06-28 update
name@x.com                           ✅      ✅       ✅
name@e.mail.com                      ✅      ✅       ✅
name@12mail.com                      ✅      ✅       ✅
name@1.1                             ✅      ✅       ❌
name@1.com                           ✅      ✅       ✅
name@1.2.com                         ✅      ✅       ✅

The code used to test this using .NET and Cognito can be found in this Gist: https://gist.github.com/timoschinkel/fe409ce4e019138778d4f0d9d1879e1e

I was surprised by this outcome; AWS had told me that Cognito required an RFC 3696 compliant email address, but it still rejected "name..name"@example.com, which is a valid address. At least, that is how I interpret the specification.

tl;dr

Although we would love to, we don't actually write code under perfect circumstances. Sometimes we are bound by limitations outside our influence. It was our choice to strive for a frictionless customer journey, without verification via an actual email. The consequence of this choice is that we depend on Cognito accepting our data, and when Cognito rejects it we introduce friction in our customer journey. By matching the validation rules in all layers with the layer that has the strictest rules we can at least tell our customers that their email address has been rejected and why.

At the end of the day we created a regular expression that only allows addresses that are accepted by every layer of our application. It is not fully compliant with any of the specifications mentioned, but it will allow us to explain to our customers that their email address was rejected based on its structure. If a customer has an address that is blocked by our system we will find out where it is blocked and we'll try to find a way around it.

Addendum

I did not manage to write a regular expression that meets all criteria (the maximum length is still an issue), but I did manage to create one that works against the test set from this article:

^[\p{L}\d!#$%&'*+\-/=?^_`{|}~]+(?:\.[\p{L}\d!#$%&'*+\-/=?^_`{|}~]+)*@(?:(?!-[a-z0-9]+\.)(?![a-z0-9]+-\.)(?![a-z0-9]+--[a-z0-9]+\.)[a-z0-9-]+\.)+[a-z][a-z0-9]+$

or for your HTML input²:

<input type="email" pattern="[\p{L}\d!#$%&\x27*+\-\/=?^_`\{\|\}~]+(?:\.[\p{L}\d!#$%&\x27*+\-\/=?^_`\{\|\}~]+)*@(?:(?!-[a-z0-9]+\.)(?![a-z0-9]+-\.)(?![a-z0-9]+--[a-z0-9]+\.)[a-z0-9\-]+\.)+[a-z][a-z0-9]+" required name="email">

Concessions have been made in creating this pattern. It does not completely match any of the specifications mentioned in this article, but it does filter out all email addresses that would have been blocked by any of the layers in our application. And because the pattern attribute offers no way to set flags such as i, the pattern is by definition case-sensitive.
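To get a feel for how the pattern behaves, here is a minimal sketch that runs it in TypeScript against a handful of addresses from the comparison. The pattern is copied from this article; the names emailPattern and matchesPattern are illustrative only. In JavaScript the u flag is required for \p{L} to work.

```typescript
// Character class for the local part, copied from the pattern in the article.
const localChars = "[\\p{L}\\d!#$%&'*+\\-/=?^_`{|}~]+";

// The full pattern, assembled from strings; "u" enables \p{L}.
const emailPattern = new RegExp(
  "^" + localChars + "(?:\\." + localChars + ")*" +
    "@(?:(?!-[a-z0-9]+\\.)(?![a-z0-9]+-\\.)(?![a-z0-9]+--[a-z0-9]+\\.)[a-z0-9-]+\\.)+" +
    "[a-z][a-z0-9]+$",
  "u"
);

// Hypothetical helper name; not part of any library.
function matchesPattern(address: string): boolean {
  return emailPattern.test(address);
}

// Expected outcome per address for this pattern.
const samples: Array<[string, boolean]> = [
  ["name@example.com", true],
  ["n\u00e5me@example.com", true], // nåme: accented letters match \p{L}
  ["name@x.com", true],
  ["name@12mail.com", true],
  [".name@example.com", false],
  ["name..name@example.com", false],
  ["name@-example.com", false],
  ["name@example-.com", false],
  ["name@1.1", false],
  ["name@localhost", false], // the pattern requires at least one dot in the domain
  ["name@Example.com", false], // the domain part only allows lowercase
];

for (const [address, expected] of samples) {
  console.log(address, matchesPattern(address) === expected ? "as expected" : "UNEXPECTED");
}
```

Two of these outcomes show the concessions in practice: name@localhost is rejected by the pattern even though the layers in the comparison accept it, and an uppercase character in the domain is enough to fail the match.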

Update 2024-06-28

After deploying this validation rule our observability platform detected a rise in errors; my regular expression missed a number of scenarios. As we don't log privacy-sensitive data to our observability platform we decided to run all our existing users against our pattern. In hindsight, we should have done this earlier in the process.

What we found is that we had email addresses with accented characters like è that were now blocked. We had also missed single-character domain names, like x.com, and domain names that start with a digit.

This shows that email validation using a pattern is very difficult and that David Gilbertson was right all along. But because we are still bound by the requirements from Cognito, and because we feel that notifying a visitor that their email address is likely invalid is still a better customer journey than risking the email not reaching the customer, we still use pattern validation.

I have updated my test suite with email addresses that follow the same pattern as the email addresses that were falsely rejected, as well as the gists and the regular expressions. The regular expression now uses \p{L}. This matches all characters that belong to the "letter" category, and this includes special characters like é. But because this matches a much larger set of characters, it is only used for the local part of the email address. See https://www.regular-expressions.info/unicode.html for more explanation.
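A small sketch of what \p{L} does and does not match in a JavaScript regular expression; the helper name isLettersOnly is hypothetical. One detail to be aware of: an é typed as a plain e followed by a combining accent (U+0301) is not a single letter, because the combining accent belongs to the "mark" category rather than the "letter" category.

```typescript
// \p{L} only works when the "u" flag is set.
const lettersOnly = new RegExp("^\\p{L}+$", "u");

// Hypothetical helper name; not part of any library.
function isLettersOnly(text: string): boolean {
  return lettersOnly.test(text);
}

console.log(isLettersOnly("\u00e9")); // precomposed é
console.log(isLettersOnly("name1")); // contains a digit
console.log(isLettersOnly("e\u0301")); // e + combining acute accent
```

If decomposed input like that is a concern, normalizing the address to NFC before validating would be one way to handle it.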

¹ This Friday Thought is also applicable if your frontend is built using React or Vue.
² The regular expression for the pattern attribute requires some changes; it is case-sensitive, ' and " need to be encoded as \x27 and \x22 respectively, and more characters need to be escaped.
