Ayc0

Posted on Jul 25, 2023 • Edited on Jan 6, 2025 • Originally published at ayc0.github.io

Intl.Segmenter(): Don't use string.split() nor string.length

#javascript #intl #encoding #unicode

TL;DR
Explanation
Intl.Segmenter
1. Browser compatibility

The other day I was playing with JS and I saw this:

'é'.length;
// 1
'é'.length;
// 2, not the same output as the line before
'é'.split('').join('|');
// 'e|́'

(Yes, all of those are valid, you can copy paste them 😅)

TL;DR

As an image is worth 1000 words:

You can use Intl.Segmenter

const seg = new Intl.Segmenter('en', { granularity: "grapheme" });

[...seg.segment('🙌🏾')].length
// 1
[...seg.segment('é')].length
// 1

Explanation

This article will talk about character vs code unit vs code point vs grapheme vs glyph.

Definitions

Character: generic term that can mean any of the other 4 terms.
Code Unit: A code unit is the smallest unit of data in UTF-16 encoding. In UTF-16, each code unit is 16 bits (2 bytes) in size. It can represent a part of a character or a complete character, depending on the character's Unicode value.
Code Point: A code point is a numerical value assigned to a specific character in the Unicode standard. It's a unique identifier for each character and is typically represented in hexadecimal. For example, the code point for the letter "A" is U+0041. In UTF-16, every code point is composed by either 1 or 2 code unit.
Grapheme: A grapheme is the smallest unit of a writing system that carries meaning and represents a single "user-perceived" character. In UTF-16, every grapheme is composed by at least 1 code point. Not all code points are part of graphemes, like the zero-width non-joiner.
Glyph: A glyph is a visual representation or image of a character. It is the actual shape or form of a character as it appears on a screen or in print. A single character can have multiple glyphs associated with it, representing different typographic variations or font styles.

You can check https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme for more details.

UTF-16

JavaScript uses UTF-16 (and not UTF-8 as opposed as many other languages. To note: UTF-8 would also have all of those issues).

In UTF-16, characters are encoded in 16-bit chunks (code unit). For instance $ is encoded in hexadecimal into 0024 (thus its notation U+0024 or '\u0024'); and € is encoded as 20AC.

Problem: Using a 16-bit code unit can only result in 65536 possible characters, so how do we represent the other characters? UTF-16 has a system where it can use 2 code units to encode some code points. For instance 𐐷 is the code point U+10437 will be encoded as D801 DC37 (a high surrogate D801 and a low surrogate DC37).

String.prototype.length

According MDN, the length is based on code units:

The length data property of a String value contains the length of the string in UTF-16 code units.

This explains why for 🙌 (U+1F64C) or 𐐷 (U+10437), using .length doesn’t return 1 as those are encoded in 2 code units:

'𐐷'.length; // U+10437
// 2
'🙌'.length; // U+1F64C
// 2

One possible fix for this case is to use iterators. According to MDN again, iterators work on code points (they say characters, but they mean code points):

Since length counts code units instead of characters, if you want to get the number of characters, you can first split the string with its iterator, which iterates by characters

And it does work indeed…

[...'𐐷'].length // U+10437
// 1
[...'🙌'].length // U+1F64C
// 1
[...'é'].length
// 2
[...'🙌🏾']
// 2

… but not for all characters. Why?

Unicode composition

Another specificity of Unicode is that it can combines multiple code points to form a grapheme. This is called canonical equivalence
(see https://unicode.org/reports/tr15/#Canon_Compat_Equivalence).

For instance the letter "Ç" can either be the code point for this character, or the code point for "C" followed by the diacritic mark "◌̧"

We can also use normalization NFD and NFC to switch between the precomposed and decomposed forms (see https://unicode.org/reports/tr15/#Norm_Forms):

Many characters are known as canonical composites, or precomposed characters. In the D forms, they are decomposed; in the C forms, they are usually precomposed.

This explains why é’s length was either 1 or 2 in the initial example:

decomposed form → 2 code points
precomposed form → 1 code point

In JavaScript, you can use String.prototype.normalize (MDN):

'é'.length;
// 1
'é'.normalize('NFD').length;
// 2
'é'.normalize('NFD').normalize('NFC').length;
// 1

Emoji Sequence

Similarly to character compositions, emojis can be combined together with special characters (this is not an exhaustive list):

Skin tone modifiers can be used to customize the color skin of emojis
For instance "🙌🏾" is composed of "🙌" + "🏾" (Medium-Dark Skin Tone modifier)
```
[...'🙌🏾'];
// ['🙌', '🏾']
```
Zero-Width Joiner (ZWJ) can be used to merge some emojis together
For instance "😮‍💨" is composed of "😮" + "‍" (ZWJ) + "💨"
And "👩‍👩‍👧‍👦" is composed of each individual family members plus ZWJs:
```
[...'👩‍👩‍👧‍👦'];
// ['👩', '‍', '👩', '‍', '👧', '‍', '👦']
```
Variation Selectors can be used to choose a different glyph variant for a code point
For instance "ℹ️" is composed of "ℹ" + "️" (Variation Selector-16 to force the display as an emoji)

Intl.Segmenter

In 2021, the TC39 committee added to ECMAScript Intl.Segmenter:

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string.

Once a locale is picked, you can use .segment to generate an iterator with each grapheme of a string:

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });

for (const grapheme of seg.segment('Hélô 👩‍👩‍👧‍👦 🙌🏾')) {
    console.log(grapheme.segment);
}
// "H"
// "é"
// "l"
// "ô"
// " "
// "👩‍👩‍👧‍👦"
// " "
// "🙌🏾"

And if you want to get the number of grapheme (like .length), you can transform it to an array first:

[...seg.segment('🙌🏾')].length;
// 1

[...seg.segment('é')].length;
// 1

Browser compatibility

Sadly, at the date of the writing (July 2023) is not supported on Firefox yet – check on caniuse.com. You can track this issue if you want to follow its development.

Top comments (1)

Ayc0 • Oct 10 '23

For more informations, I invite you to read tonsky.me/blog/unicode/