DEV Community

Quick and easy way of counting UTF-8 characters in Javascript

Alexandru Bucur on June 08, 2018

Reading the following tutorial regarding a VueJS component that displays the character count for a textarea got me thinking. You see, the problem ...
 
Gal Dolber • Edited

Hi Alexandru,

Nice post, I recently had to deal with this.
There are some cases where spreading the string into an array won't work, for example with punctuation marks such as Hebrew vowel points.

[..."וְאֵ֗לֶּה"].length
▶ 9

My original code is in ClojureScript:
gist.github.com/galdolber/1568e767...

In JavaScript:
"וְאֵ֗לֶּה".split(/(\P{Mark}\p{Mark}*)/u).filter((val) => val)
▶ ["וְ", "אֵ֗", "לֶּ", "ה"]
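The same pattern can be turned into a small counting helper. This is only a minimal sketch, assuming a runtime that supports Unicode property escapes (ES2018). The name countMarkClusters is hypothetical and not taken from the gist, and the sample is a pointed Hebrew word like the one above, written with escapes so that each mark is visible in the source.

```javascript
// Hypothetical helper: counts one base character plus any combining
// marks that follow it as a single unit.
function countMarkClusters(text) {
  // \P{Mark}\p{Mark}* = one non-mark code point and the marks attached to it.
  // \p{Mark}+         = stray marks at the start of the string, with no base.
  const clusters = text.match(/\P{Mark}\p{Mark}*|\p{Mark}+/gu);
  return clusters === null ? 0 : clusters.length;
}

// vav + sheva, alef + tsere + revia, lamed + segol + dagesh, he
const pointedWord = "\u05D5\u05B0\u05D0\u05B5\u0597\u05DC\u05B6\u05BC\u05D4";

console.log([...pointedWord].length);        // 9 code points
console.log(countMarkClusters(pointedWord)); // 4 base-plus-marks units
```

Using match instead of split avoids the empty strings that the filter call has to remove.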

 
Alexandru Bucur

Hi Gal,

That's really interesting. Any idea why that might be the case?

 
Gal Dolber

I think it's because punctuation marks are separate Unicode code points that are rendered on top of the nearest preceding non-Mark character.

Example: ד ָ דָ

So if you want to count the visible characters, you need to account for the marks.
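To make the example concrete, here is a rough sketch that lists the code points of a pointed letter. It assumes the mark in the example is qamats (U+05B8), which is my reading of it; any other vowel point behaves the same way. The helper name listCodePoints is hypothetical.

```javascript
// Hypothetical helper: lists the code points of a string as "U+XXXX" labels.
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Assumption: dalet (U+05D3) followed by the vowel point qamats (U+05B8).
const pointedDalet = "\u05D3\u05B8";

console.log(listCodePoints(pointedDalet));  // [ 'U+05D3', 'U+05B8' ]
console.log(/^\p{Mark}$/u.test("\u05B8"));  // true: the point is a Mark
console.log(/^\p{Mark}$/u.test("\u05D3"));  // false: the letter is not
```

One visible character, two code points, and only the second one is in the Mark category.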

 
Massimo Artizzu

That method unfortunately fails for more complex cases, like grapheme clusters, common in Eastern languages or in Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞ memes:

[...'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'].length // 75!

Emojis can also be a pain, thanks to the wonderful U+200D ZERO WIDTH JOINER:

[...'👨‍👩‍👧‍👦'].length // 7

If you know Mathias Bynens' blog (and it looks like you do!), you've probably come across this majestic article about JavaScript and Unicode. There's a solution for these cases that uses Punycode (provided in Node but deprecated - here's a valid substitute for the browser too).
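A different route from the Punycode one, for runtimes that ship Intl.Segmenter, is to let the engine split the string into grapheme clusters. This is a minimal sketch and not the approach from the linked article. It assumes Intl.Segmenter is available (recent browsers and Node.js 16 or later), the name countGraphemes is hypothetical, and the exact results depend on the Unicode data bundled with the runtime.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
// Assumes the runtime provides Intl.Segmenter.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // segment() returns an iterable with one entry per grapheme cluster.
  return [...segmenter.segment(text)].length;
}

// Man, woman, girl, boy joined by three U+200D ZERO WIDTH JOINERs.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log([...family].length);      // 7 code points
console.log(countGraphemes(family));  // 1 on runtimes with current Unicode data

// A base letter with stacked combining marks, Zalgo style.
console.log(countGraphemes("Z\u0351\u036B")); // 1
```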

 
Thomas Broyer

That's a good reminder, which holds true for other languages / runtimes as well (I know that at least Java/JVM works the same way, maybe also .NET, can't remember).

I don't get why you're talking about UTF-8 here though. It's Unicode vs UCS-2.
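The distinction is easy to see by measuring one string three ways. A small sketch, assuming a runtime with a global TextEncoder (modern browsers and Node.js 11 or later); the function name measure is hypothetical.

```javascript
// Hypothetical helper: measures a string in three different units.
function measure(text) {
  return {
    utf16CodeUnits: text.length,                    // what String.prototype.length reports
    codePoints: [...text].length,                   // string iteration goes by code point
    utf8Bytes: new TextEncoder().encode(text).length // size once encoded as UTF-8
  };
}

// "a" (U+0061), "é" (U+00E9) and a grinning face emoji (U+1F600)
console.log(measure("a\u00E9\u{1F600}"));
// { utf16CodeUnits: 4, codePoints: 3, utf8Bytes: 7 }
```

Only the last number has anything to do with UTF-8. The mismatch between the first two comes from surrogate pairs in the string's 16-bit code units.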

 
Alexandru Bucur

Hi Thomas,

The title is misleading, I agree (hence the link to the UCS-2 explanation).
Unfortunately, when people run into Unicode problems, nobody really searches for "UCS-2 JavaScript Unicode handling".