Till Sanders

Let's stop using [a-zA-Z]+

If you, like me, regularly (see what I did here?) validate alphanumeric fields using Regex, you probably learned to do it like this:

'Till'.match(/[a-zA-Z0-9]+/gu)

This is technically correct, of course. And it's what most validation libraries will do when you tell them a field is alpha / alphanumeric / etc.

However, I have a problem with this approach and a lot (!) of other people do, too. Because I'm from Germany. More specifically, from a town called Lüdenscheid. And Lüdenscheid won't pass the regular expression above as a whole, because the ü (an umlaut) is not part of a-z. The same applies to words in languages like French, Spanish, and Czech, just to name a few.
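
To make the problem concrete, here is a minimal sketch. Note that the `^` and `$` anchors are my own addition for illustration (the snippet above has none); they force the whole string to match, which is what validation usually means. The ü is written as `\u00FC` so the encoding is unambiguous.

```javascript
// ASCII-only validation, anchored so the whole string must match.
// The anchors are an assumption added for this sketch.
const asciiOnly = /^[a-zA-Z0-9]+$/u
const town = 'L\u00FCdenscheid' // "Lüdenscheid" with ü as a single code point

console.log(asciiOnly.test('Till')) // true
console.log(asciiOnly.test(town)) // false

// Without anchors, match() skips the ü and returns the pieces around it
console.log(town.match(/[a-zA-Z0-9]+/gu)) // [ 'L', 'denscheid' ]
```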

So how can we as developers be more inclusive towards languages other than English? Do we have to include all possible variations of the Latin alphabet? That's a common suggestion, but of course, it doesn't scale well.

Luckily, Unicode has us covered:

'Lüdenscheid'.match(/[\p{Letter}\p{Mark}]+/gu)

The \p{…} escape (available when the u flag is set) allows us to pick a so-called Unicode general category. In Unicode, every character is assigned a category that we can use in our regular expressions. The Letter category includes letters from all kinds of languages, not just A-Z. But it does not include, e.g., <, >, + or $, which is important for security. The Mark category, as lionelrowe pointed out in the comments (thanks), contains combining marks. In Unicode, a letter like ü can be encoded either as one code point or as two combined code points. So depending on how the character is encoded, we need the Mark category as well.
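
As a rough sketch of how this could look in a validation routine: `isAlpha` is a hypothetical helper name, and the `^`/`$` anchors are again an assumption so that the entire input has to consist of letters and marks.

```javascript
// Hypothetical helper: true if the whole string consists only of
// letters (any script) and combining marks.
const isAlpha = (value) => /^[\p{Letter}\p{Mark}]+$/u.test(value)

console.log(isAlpha('L\u00FCdenscheid')) // true
console.log(isAlpha('Pražští')) // true
console.log(isAlpha('<script>')) // false, < and > are not letters
console.log(isAlpha('a+b')) // false, + is a symbol
console.log(isAlpha('')) // false, the + quantifier needs at least one character
```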

More details on the Mark category

If we omit the Mark category and run the regex 'Lüdenscheid'.match(/[\p{Letter}]+/gu), it will match Lüdenscheid as long as the ü is encoded as a single code point. If, on the other hand, the ü is encoded as a letter-mark combination (u + ̈), the first match will only be Lu, because the match stops at the ̈ mark.
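
Here is a small sketch of both encodings side by side. The escapes `\u00FC` and `\u0308` are used so the two variants cannot be confused. The final line shows `normalize('NFC')` as a possible alternative, which is my assumption rather than something from this post, and it would not help for scripts whose marks have no precomposed form.

```javascript
const composed = 'L\u00FCdenscheid' // ü as one code point (U+00FC)
const decomposed = 'Lu\u0308denscheid' // u followed by a combining diaeresis (U+0308)

// Letter only: the combining mark interrupts the match
console.log(composed.match(/\p{Letter}+/gu)) // [ 'Lüdenscheid' ]
console.log(decomposed.match(/\p{Letter}+/gu)) // [ 'Lu', 'denscheid' ]

// Letter and Mark: a single match covers the whole string
console.log(decomposed.match(/[\p{Letter}\p{Mark}]+/gu).length) // 1

// Possible alternative: recompose the string before matching
console.log(decomposed.normalize('NFC') === composed) // true
```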

Browser support

Browser support for this feature is good, IE (not Edge) being the only exception.
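
If you still have to care about engines without support, one possible approach is feature detection. The sketch below is only an illustration: `supportsUnicodePropertyEscapes` is a hypothetical helper, and the ASCII fallback is an assumption about what you might want. The pattern is built with the RegExp constructor because a regex literal containing \p{…} would be a syntax error for the whole script in an unsupported engine.

```javascript
// Returns true if this engine understands Unicode property escapes.
// Engines without support throw when compiling the pattern.
function supportsUnicodePropertyEscapes () {
  try {
    new RegExp('\\p{Letter}', 'u')
    return true
  } catch (error) {
    return false
  }
}

// Hypothetical fallback: ASCII letters only
const letters = supportsUnicodePropertyEscapes()
  ? new RegExp('^[\\p{Letter}\\p{Mark}]+$', 'u')
  : /^[a-zA-Z]+$/

console.log(letters.test('Till')) // true
```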

Bonus

// Match only letters
'Lüdenscheid'.match(/[\p{Letter}\p{Mark}]+/gu)

// Match letters and spaces
'Pražští filharmonici'.match(/[\p{Letter}\p{Mark}\s]+/gu)

// Match letters and hyphens
'Île-de-France'.match(/[\p{Letter}\p{Mark}-]+/gu)

// Match letters, hyphens and spaces
'Île-de-France'.match(/[\p{Letter}\p{Mark}\s-]+/gu)

Latest comments (6)

Shervin Lepinat

Great work!

Pacharapol Withayasakpunt • Edited

A while ago, I also learnt that JavaScript regex has Unicode support (via /.../u).

I always had to use XRegExp before.

Also, the syntax is a little different. XRegExp('\\p{Han}') vs /\p{sc=Han}/u.

For this, I also recommend this VSCode extension, so that you don't write an invalid regex. (Like, valid in some compilers, but not native JavaScript.)

For /\p{Letter}/u, I would normally write the shorthand.

/^[\p{L}\p{N}\p{M}-]+$/u
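
A quick sketch to illustrate the shorthand and the script syntax mentioned above. The variable names and sample strings are made up for this example.

```javascript
// Long and short property names describe the same categories
const longForm = /^[\p{Letter}\p{Number}\p{Mark}-]+$/u
const shortForm = /^[\p{L}\p{N}\p{M}-]+$/u

for (const sample of ['Île-de-France', 'Platz-42', 'a b']) {
  console.log(sample, longForm.test(sample), shortForm.test(sample))
}

// Script properties use the sc= (or Script=) prefix
const han = /^\p{sc=Han}+$/u
console.log(han.test('漢字')) // true
console.log(han.test('kanji')) // false
```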
 
Mehmet Can Ertan

You are my savior

Steve Kilonzo

This one really solved my problem

lionel-rowe • Edited

Great post! But unfortunately, this still fails for some languages. For example, Burmese:

const burmese = 'မြန်မာဘာသာ'

;/^\p{Letter}+$/u.test(burmese) // false

Or even certain normalized representations of "Lüdenscheid":

// visually the same, but splits the "u" and its umlaut into two code points
const town = 'Lüdenscheid'.normalize('NFD')

;/^\p{Letter}+$/u.test(town) // false

You can get around this by also allowing the Mark category:

const regex = /^[\p{Letter}\p{Mark}]+$/u

regex.test(burmese) // true
regex.test(town) // true
 
Till Sanders • Edited

Good remark! I will update the post to include this 👍