Properly validating e-mail addresses

tux0r ・2 min read

If you are a developer of "web applications", you have probably written (or copy-pasted) some code which tries to validate an entered e-mail address. If you have indeed done that and your solution contains a long, hard-to-read regular expression (like most of those seen on Stack Overflow), I have bad news for you: your code is wrong.

Did you know that parentheses, spaces and, according to RFC 6531, emojis can be part of a valid e-mail address, and that it is merely an implementation detail whether Unicode is supported at all?

Did you know that both IPv6 addresses and resources in your intranet are valid after the "@", so requiring a TLD (xxxx.yy) is entirely wrong?

Neither do any of the existing validators I've come across (those more complex than "is there an @ character?", that is). The RFCs are much more flexible than any regular expression can be, including special cases like certain characters which are only allowed when they are inside a quoted string, but the quoted string must not stand between two periods. Ah, the joy of standards.
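To make this concrete, here is a small Python sketch. The pattern is an assumption for illustration (a "typical" copy-pasted expression, not taken from any particular library), and the sample addresses are ones that RFC 5321/5322 accept syntactically: a quoted local part containing a space, an IPv6 address literal, and a host without a TLD.

```python
import re

# Assumed for illustration: a "typical" pattern of the copy-pasted kind.
# It requires a dot-separated TLD and a very limited character set.
TYPICAL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Addresses that are syntactically valid according to RFC 5321/5322.
VALID_ADDRESSES = [
    "user@example.com",
    '"john doe"@example.com',   # space inside a quoted local part
    "user@[IPv6:2001:db8::1]",  # IPv6 address literal after the "@"
    "user@localhost",           # intranet host without a TLD
]


def rejected_by_pattern(addresses):
    """Return the addresses which the typical pattern does not accept."""
    return [a for a in addresses if TYPICAL_PATTERN.fullmatch(a) is None]


if __name__ == "__main__":
    for address in rejected_by_pattern(VALID_ADDRESSES):
        print("wrongly rejected:", address)
```

Only the first sample passes; the other three are valid addresses which the pattern turns away.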

Of course you can still catch the majority of invalid e-mail addresses with a sufficiently complex regular expression, but I seriously recommend that you aim for more than just "almost bug-free" code. So here are a couple of suggestions:

  1. Send validation e-mails to all e-mail addresses you want to validate.

    If the user confirms having received it, the address is perfectly valid. But this is not really reliable, because nobody is forced to click a link in an e-mail, and if the user does not, that does not automatically mean that the address is invalid. This suggestion only makes sense for things like web forums, where a confirmation e-mail can properly be announced and clicking the link can be required.

  2. Read the RFCs.

    You probably do not want to do that. :-) But don't worry, I did it for you:

  3. Use libvldmail.

    libvldmail is a small, no-frills library written in C (with no special dependencies, not even regex.h) which I have developed to solve this very problem once and for all. I provide language bindings (via SWIG) as well, so if your web application runs on Python or Ruby or whatever, you should be able to use it safely. The SWIG website has a tutorial on how to generate bindings from the supplied template file if you need that.

You are invited to read the README and the example implementation for details on how to use it. Don't worry, I made it easy: There is only one function which takes one parameter.

I just thought this might come in handy for some of you.
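For the first suggestion (confirmation e-mails), the bookkeeping could look roughly like the following Python sketch. Everything in it is an assumption for illustration: the class name, the in-memory storage and the confirmation URL are hypothetical, and actually sending the message is left out.

```python
import secrets

# Hypothetical URL of the page which handles the confirmation link.
CONFIRM_URL = "https://example.invalid/confirm?token={token}"


class PendingConfirmations:
    """Tracks addresses that are waiting for a click on the confirmation link."""

    def __init__(self):
        self._pending = {}      # token -> address still waiting for confirmation
        self.confirmed = set()  # addresses whose owner clicked the link

    def request(self, address):
        """Create a one-time token and return the link to put into the e-mail."""
        token = secrets.token_urlsafe(32)
        self._pending[token] = address
        return CONFIRM_URL.format(token=token)

    def confirm(self, token):
        """Mark the address behind the token as confirmed; None if unknown."""
        address = self._pending.pop(token, None)
        if address is not None:
            self.confirmed.add(address)
        return address
```

A token that was never issued, or was already used, confirms nothing. An address that never gets confirmed simply stays pending, which matches the caveat above: no click does not prove the address is invalid.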

Discussion

Isaac Lyman

^.+@.+\..+$ is the regex I use. It needs something before an @, something between that @ and a following ., and something after that ..

Does it accept some invalid email addresses? Probably. I ask users to verify their email address anyway; worst case scenario, they enter an invalid address and the email bounces. That one is on them.

Does it deny some valid email addresses? Maybe, but in my opinion if your email address is so weird it doesn't have an @ and a . in it, then you already know you're a signup error waiting to happen.

For a little while, I was hearing complaints from users who accidentally typed a space while entering their email address. So I added a validator to make sure the email address has no spaces. I know this denies even more valid email addresses, but my goal isn't to match the RFC perfectly, it's to allow the maximum number of users to sign up with the minimum number of problems. This prioritizes a large number of users who make mistakes over a tiny number of users who do not, which may not be fair but makes sense as a business decision.
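A minimal sketch of that check in Python, assuming the regex above plus the no-spaces rule; the function name is made up for illustration.

```python
import re

# The loose pattern from above: something, "@", something, ".", something.
LOOSE_PATTERN = re.compile(r"^.+@.+\..+$")


def looks_like_email(address):
    """Deliberately loose check: pattern match and no spaces anywhere."""
    return " " not in address and LOOSE_PATTERN.match(address) is not None
```

It knowingly rejects some valid addresses, such as ones without a dot after the @ or with a quoted space, which is the trade-off described above.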

Drew Knab

+1

I'm not really here to impress cranks who have garbage-fire emails because the spec allows it.

Diane

More than that, in most languages and frameworks there are already some perfectly working validation mechanisms, so there's pretty much no need to validate an e-mail with "our" own logic.

Adding this small C library is awesome anyways, thanks for the ref.

tux0r Author

I have tried some of the "perfectly working" mechanisms. Even if your language has one, it will most likely not cover corner cases. (I admit I have not tried every single one.)

Diane

In my case, PHP (the filter_var function) is known to be technically "perfectly" working (and I also admit I didn't try any extreme cases, even though I did try some fairly serious ones).

tux0r Author

I have just tested a local address with an emoji. PHP does not accept that.

Pert Soomann

Seems like the PHP implementation is perfect, but it checks against an older RFC.

Diane

Aren't emojis in domain names a hazardous spec anyway? (Talking in terms of support and implementations.)

tux0r Author

So the PHP implementation becomes increasingly less usable as more and more Unicode domains are registered.

tux0r Author

Hazardous, but rule-compliant. libvldmail has a compiler flag for that, so you could make it reject them if needed.

Pert Soomann

Yeah, beyond just the pure technical ability to validate an email address against rules, it's a weird one on so many levels:

theguardian.com/technology/2017/ap...

Can see pros and cons whichever way you go, though having a dedicated library does help with faster updates over built-in functions that might take years to release the next version.

Matteo Rigon

I once read a similar article. Yes, the RFC is not implemented correctly anywhere (besides your library now, I guess 😉), but a user with a special email will have more serious trouble than registering on your site, since basically nothing on the web will allow that email to be used (or created in the first place).

tux0r Author

That problem will fade as more web developers integrate my library! ;-)

Mihail Malo

Wait, wait, wait... your library?
I'm calling the 🚓

Mihail Malo

Because you have been so successful in framing your advertising post as an informational post that I took the bait.
Fortunately, police don't care about truth or justice, they just want to inflict some damage. This is the rare case when they're just what the doctor ordered.

tux0r Author

I'm not advertising, I'm explaining. There is a lot of advertising on DEV. I'm not a company. I don't care how many people know my software. I don't sell anything to anyone.

Please complain to actual advertisers instead.

Martin G. Brown

Whether Unicode is allowed in the name part rather depends on whether the SMTP servers involved support the SMTPUTF8 extension. In my experience they mostly don't. AWS's SES and Sendgrid don't for example.

tux0r Author

That's not the problem of the validator though...

Martin G. Brown

Well it is if you are letting through addresses your own SMTP server doesn't support.
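A rough sketch of how an application might check this before accepting an address, in Python with the standard smtplib module. The function names are made up for illustration, and the sketch assumes that only a non-ASCII local part requires SMTPUTF8 (a non-ASCII domain could be IDNA-encoded instead).

```python
import smtplib


def needs_smtputf8(address):
    """True if the part before the last "@" contains non-ASCII characters."""
    local_part, _, _ = address.rpartition("@")
    return not local_part.isascii()


def server_supports_smtputf8(host, port=25, timeout=10.0):
    """Ask the SMTP server (via EHLO) whether it advertises SMTPUTF8."""
    with smtplib.SMTP(host, port, timeout=timeout) as server:
        server.ehlo()
        return server.has_extn("smtputf8")


def acceptable_for_server(address, supports_smtputf8):
    """Accept an address only if our own server could actually send to it."""
    return supports_smtputf8 or not needs_smtputf8(address)
```

The result of server_supports_smtputf8 would typically be looked up once for your own outgoing server and then passed to acceptable_for_server for each address.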

Mihail Malo

user registers with an IP address for a domain

Oh, such fun! I definitely want this happening to me and my apps!

And no, we shouldn't be internationalizing domains, we should be de-nationalizing people.

/en/ shouldn't just be the default aliased to /, it should be the only language of the web.

tux0r Author

Or German, which is the most-spoken language in Europe (and soon, when the only English-speaking countries leave the EU, even more relevant here)...

Mihail Malo

Then you should have said Chinese.
I'm talking about the language (and charset) of legacy systems, and the language of programming language keywords.

Sam P

Interesting. RFC 5322 gives us this regex as the standard for an email address, and that seems to be baked into most implementations these days.

tux0r Author

That 99.99% Works

Not good enough.

Rob Janssen

How does it hold up to, say, these tests?

code.iamcal.com/php/rfc822/tests/

You can find the tests in easy format here: github.com/dominicsayers/isemail/b...

tux0r Author

I am not sure. I have not tried any of those. If you find something is missing, please submit a proper bug report (or even a fix).

From looking through that page, it seems to follow RFC 822, which has been declared obsolete.

Adrian B.G.

which I have developed to solve this very problem once and for all.

You cannot do that as long as the internet evolves.

tux0r Author

I will update the library as soon as a new address standard is established.

Jason Spradlin

For the rest of your life?
