
Quick and easy way of counting UTF-8 characters in JavaScript

Alexandru Bucur ・1 min read

Reading the following tutorial regarding a VueJS component that displays the character count for a textarea got me thinking.

You see, the problem is that when JavaScript was first created it didn't have proper Unicode support. JavaScript's internal string encoding is UCS-2 or UTF-16, depending on which article you find on the internet (there's actually an awesome article from 2012 that explains this in detail).

What does that mean, you ask? Well, it's rather straightforward: if you read the length property of a string containing characters that take 4 bytes in UTF-8 (code points above U+FFFF, which become surrogate pairs in UTF-16), length will count 2 for each of those characters. Characters that take 3 bytes in UTF-8 still count as 1.
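To make the difference concrete, here is a small sketch that measures the same string three ways. The helper name `measure` is made up for illustration, and it assumes a runtime with a global `TextEncoder` (modern browsers, recent Node.js).

```javascript
// Hypothetical helper: measure a string in three different units.
function measure(str) {
  return {
    utf16Units: str.length,                          // what .length reports
    codePoints: [...str].length,                     // string iteration goes by code point
    utf8Bytes: new TextEncoder().encode(str).length, // size once encoded as UTF-8
  };
}

console.log(measure("a"));  // { utf16Units: 1, codePoints: 1, utf8Bytes: 1 }
console.log(measure("€"));  // { utf16Units: 1, codePoints: 1, utf8Bytes: 3 }
console.log(measure("😹")); // { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```

Only the last case, a code point above U+FFFF, makes length and the code point count disagree.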

This might not usually be an issue, but it's a big one if you have a password policy of 8 characters that can be satisfied by just 4: "😹🐶😹🐶" (OK, not the best example, but everybody likes cats and dogs).

let lengthTest = "😹🐶😹🐶";
console.log(lengthTest.length);
// will display 8

Now the fix with modern JavaScript is rather easy: iterating a string walks it by code point rather than by UTF-16 code unit, so a surrogate pair comes out as a single element, and spreading the string into an array makes it a quick and easy one-liner.

let lengthTest = "😹🐶😹🐶";
console.log([...lengthTest].length);
// will display 4
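The same idea works without building a throwaway array, and it plugs straight into the password example from earlier. This is a minimal sketch; `countCodePoints` and `meetsMinLength` are hypothetical names, not part of any library.

```javascript
// Count code points with for...of, which uses the same string iterator as spread.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) count++;
  return count;
}

// Hypothetical policy check based on code points instead of UTF-16 code units.
function meetsMinLength(password, minLength = 8) {
  return countCodePoints(password) >= minLength;
}

const emojiPassword = "😹🐶😹🐶";
console.log(emojiPassword.length >= 8);       // true  (the naive check passes)
console.log(meetsMinLength(emojiPassword));   // false (only 4 code points)
console.log(meetsMinLength("correct horse")); // true
```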

I'm interested in knowing if you've had any weird/interesting experiences with UTF-8.

PS: Use this link for a nice simple-ish explanation of Unicode encodings

Discussion

Gal Dolber

Hi Alexandru,

Nice post, I recently had to deal with this.
There are some cases where spreading the string won't work, for example with diacritical marks.

[..."וְאֵ֗לֶּה"].length
▶ 9

My original code is in clojurescript:
gist.github.com/galdolber/1568e767...

In javascript:
"וְאֵ֗לֶּה".split(/(\P{Mark}\p{Mark}*)/u).filter((val) => val)
▶ ["וְ", "אֵ֗", "לֶּ", "ה"]

Alexandru Bucur Author

Hi Gal,

That's really interesting. Any idea why that might be the case?

Gal Dolber

I think it's because the marks are separate Unicode characters that are rendered attached to the first preceding non-Mark character.

Example: ד ָ דָ

So if you want to count the visible characters, you need to account for the marks.
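A rough sketch of counting this way, assuming an engine that supports Unicode property escapes (`\p{...}` with the `u` flag). The function name `countBaseWithMarks` is hypothetical, and the example uses a Latin letter with a combining accent, written with escapes so the individual code points are visible.

```javascript
// Hypothetical helper: count base characters together with their trailing marks.
function countBaseWithMarks(str) {
  // One non-Mark character followed by any marks, or a run of marks with no base.
  const clusters = str.match(/\P{Mark}\p{Mark}*|\p{Mark}+/gu);
  return clusters === null ? 0 : clusters.length;
}

const accented = "e\u0301a"; // "e" + COMBINING ACUTE ACCENT + "a"
console.log([...accented].length);         // 3
console.log(countBaseWithMarks(accented)); // 2
console.log(countBaseWithMarks(""));       // 0
```

The second alternative in the pattern is there so that marks with no base character in front of them are still counted once instead of being dropped.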

Massimo Artizzu

That method unfortunately fails for more complex cases, like grapheme clusters, common in Eastern languages or in Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞ memes:

[...'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'].length // 75!

But also emojis can be a pain, thanks to the wonderful U+200D ZERO WIDTH JOINER:

[...'👨‍👩‍👧‍👦'].length // 7

If you know Mathias Bynens' blog (and it looks like you do!), you've probably come across this majestic article about JavaScript and Unicode. There's a solution for these cases that uses Punycode (provided in Node but deprecated - here's a valid substitute for the browser too).
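This is not the Punycode-based approach mentioned above but a different technique: newer engines ship `Intl.Segmenter`, which can split text into grapheme clusters. A minimal sketch, assuming `Intl.Segmenter` is available (recent browsers and Node.js); `countGraphemes` is a hypothetical name.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _ of segmenter.segment(str)) count++;
  return count;
}

// MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY, written with escapes
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log([...family].length);        // 7 code points
console.log(countGraphemes(family));    // 1
console.log(countGraphemes("e\u0301")); // 1
```

Segmentation follows the Unicode data bundled with the engine, so very new emoji sequences may be split differently on older runtimes.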

Thomas Broyer

That's a good reminder, which holds true for other languages / runtimes as well (I know at least Java/JVM that works the same, maybe also .NET, can't remember).

I don't get why you're talking about UTF-8 here though. It's Unicode vs UCS-2.

Alexandru Bucur Author

Hi Thomas,

The title is misleading, I agree (hence linking to the UCS-2 explanation).
Unfortunately, when people run into Unicode problems, nobody really searches for "UCS-2 JavaScript Unicode handling".