DEV Community

Alexandru Bucur

Quick and easy way of counting UTF-8 characters in Javascript

Reading the following tutorial regarding a VueJS component that displays the character count for a textarea got me thinking.

You see, the problem is that when JavaScript was first created it didn't have proper UTF-8 support. JavaScript's internal encoding is UCS-2 or UTF-16, depending on which article you find on the internet (there's actually an awesome article from 2012 that explains this in detail).

What does that mean, you ask? Well, it's rather straightforward: if you read the length property of a string containing characters that take 4 bytes in UTF-8 (the ones that become surrogate pairs in UTF-16), length will count 2 for each of those characters.
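
A small sketch of what that looks like for a single character, assuming a runtime that provides `TextEncoder` (modern browsers and recent Node.js versions do). The variable names are made up for illustration.

```javascript
// One emoji outside the Basic Multilingual Plane: U+1F639.
const cat = "\u{1F639}"; // 😹

// .length counts UTF-16 code units, so the surrogate pair counts twice.
console.log(cat.length); // 2

// The two code units are the high and the low surrogate.
console.log(cat.charCodeAt(0).toString(16)); // d83d
console.log(cat.charCodeAt(1).toString(16)); // de39

// codePointAt(0) reads the whole pair as one code point.
console.log(cat.codePointAt(0).toString(16)); // 1f639

// Encoded as UTF-8, the same character takes 4 bytes.
console.log(new TextEncoder().encode(cat).length); // 4

// A 3-byte UTF-8 character such as the euro sign (U+20AC) still fits in
// a single UTF-16 code unit, so its length is 1.
const euro = "\u20AC"; // €
console.log(euro.length); // 1
console.log(new TextEncoder().encode(euro).length); // 3
```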

This usually isn't a problem, but it becomes a big one if you have a password policy requiring at least 8 characters that can be satisfied by just 4: "😹🐶😹🐶" (ok, not the best example, but everybody likes cats and dogs).

let lengthTest = "😹🐶😹🐶";
console.log(lengthTest.length);
// will display 8

Now the fix with modern JavaScript is rather easy, because strings are iterated by code point rather than by UTF-16 code unit, so surrogate pairs stay together. Spreading the string into an array makes it a quick and easy one-liner.

let lengthTest = "😹🐶😹🐶";
console.log([...lengthTest].length);
// will display 4
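
Applied to the password example above, a minimal sketch might look like this. `countCodePoints` and `isLongEnough` are hypothetical helper names, and the minimum of 8 is just the figure from that example.

```javascript
// Count Unicode code points instead of UTF-16 code units.
function countCodePoints(text) {
  let count = 0;
  // for...of walks a string by code point, like the spread syntax does,
  // but without building an intermediate array.
  for (const _codePoint of text) {
    count += 1;
  }
  return count;
}

// Hypothetical policy check: at least `minLength` code points.
function isLongEnough(password, minLength = 8) {
  return countCodePoints(password) >= minLength;
}

const emojiPassword = "\u{1F639}\u{1F436}\u{1F639}\u{1F436}"; // 😹🐶😹🐶
console.log(emojiPassword.length); // 8
console.log(countCodePoints(emojiPassword)); // 4
console.log(isLongEnough(emojiPassword)); // false
console.log(isLongEnough("hunter2!")); // true
```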

I'm interested in knowing if you've had any weird/interesting experiences with UTF-8.

PS: Use this link for a nice simple-ish explanation of Unicode encodings

Oldest comments (6)

Gal Dolber • Edited

Hi Alexandru,

Nice post, I recently had to deal with this.
There are some cases where destructuring won't work, for example with punctuation.

[..."וְאֵ֗לֶּה"].length
▶ 9

My original code is in clojurescript:
gist.github.com/galdolber/1568e767...

In javascript:
"וְאֵ֗לֶּה".split(/(\P{Mark}\p{Mark}*)/u).filter((val) => val)
▶ ["וְ", "אֵ֗", "לֶּ", "ה"]
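
The same idea can be wrapped in a small counting helper. This is only a sketch: `countBaseCharacters` is a made-up name, the pattern adds a `\p{Mark}+` alternative so that a string starting with a stray mark is still counted, and it assumes a runtime that supports Unicode property escapes in regular expressions.

```javascript
// Treat one non-mark character plus any marks that follow it as one unit.
function countBaseCharacters(text) {
  const units = text.match(/\P{Mark}\p{Mark}*|\p{Mark}+/gu);
  // match() returns null when nothing matches, e.g. for an empty string.
  return units === null ? 0 : units.length;
}

const dalet = "\u05D3\u05B8"; // dalet followed by the qamats vowel mark
console.log([...dalet].length); // 2
console.log(countBaseCharacters(dalet)); // 1
console.log(countBaseCharacters("e\u0301x")); // 2
console.log(countBaseCharacters("")); // 0
```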

Alexandru Bucur

Hi Gal,

That's really interesting, any idea why that might be the case?

Gal Dolber

I think it's because punctuation symbols are separate Unicode characters (combining marks) that are collapsed into the first preceding non-Mark character.

Example: ד ָ דָ

So if you want to count the visible characters, you need to account for the marks.

Thomas Broyer

That's a good reminder, which holds true for other languages / runtimes as well (I know at least Java/JVM that works the same, maybe also .NET, can't remember).

I don't get why you're talking about UTF-8 here though. It's Unicode vs UCS-2.

Alexandru Bucur

Hi Thomas,

The title is misleading, I agree (hence the link to the UCS-2 explanation).
Unfortunately, when people run into Unicode problems, nobody really searches for "UCS-2 JavaScript Unicode handling".

Massimo Artizzu

That method unfortunately fails for more complex cases, like grapheme clusters, common in Eastern languages or in Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞ memes:

[...'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'].length // 75!

But also emojis can be a pain, thanks to the wonderful U+200D ZERO WIDTH JOINER:

[...'👨‍👩‍👧‍👦'].length // 7

If you know Mathias Bynens' blog (and it looks like you do!), you've probably come across this majestic article about JavaScript and Unicode. There's a solution for these cases that uses Punycode (provided in Node but deprecated - here's a valid substitute for the browser too).
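
One more option, not mentioned in the thread, is the built-in `Intl.Segmenter`, which splits text into grapheme clusters. A rough sketch, assuming a runtime that ships it (recent browsers and Node.js 16 or later); `countGraphemes` is a made-up helper name and the default locale is an arbitrary choice:

```javascript
// Count user-perceived characters (grapheme clusters).
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

// Family emoji: four people joined by three U+200D ZERO WIDTH JOINERs.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log([...family].length); // 7
console.log(countGraphemes(family)); // 1
console.log(countGraphemes("e\u0301")); // 1
```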