Sujeet Jaiswal

Posted on • Originally published at sujeet.pro

Length of a string

TL;DR

'👩‍👩‍👦‍👦🌦️🧘🏻‍♂️'.length is 21 instead of 3 because JS reports the length in UTF-16 code units, and each icon is a combination of several such code units. Use Intl.Segmenter to get the number of rendered graphemes.

console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".length); // 21
console.log(getVisibleLength("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️")); // 3 - How can we get this?

What is the .length?

The length data property of a string contains the length of the string in UTF-16 code units. - MDN

I always thought we used UTF-8 encoding, mostly because we usually set <meta charset="UTF-8"> in our HTML files.

💡 Did you know that JS strings are sequences of UTF-16 code units, not UTF-8 bytes?

const logItemsWithlength = (...items) =>
  console.log(items.map((item) => `${item}:${item.length}`));
logItemsWithlength("A", "a", "À", "⇐", "⇟");
// ['A:1', 'a:1', 'À:1', '⇐:1', '⇟:1']

In the above example, A and a each take one byte in UTF-8, while À (U+00C0) takes two bytes. In UTF-16, each of them fits in a single code unit, so .length is 1.

⇐ and ⇟ take three bytes each in UTF-8, but each still fits in a single 16-bit UTF-16 code unit.

Since every character in this example can be represented with a single UTF-16 code unit, the length of each one is 1.
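To make the difference concrete, here is a minimal sketch that compares UTF-8 byte counts with UTF-16 code unit counts for the same characters. It relies on the standard TextEncoder API, which always encodes to UTF-8; the helper name describe is made up for this illustration.

```javascript
// TextEncoder always produces UTF-8 bytes.
const encoder = new TextEncoder();

// Hypothetical helper: report both sizes for a single character.
const describe = (ch) => ({
  char: ch,
  utf8Bytes: encoder.encode(ch).length, // size in UTF-8 bytes
  utf16Units: ch.length, // size in UTF-16 code units (what .length reports)
});

// Escapes are used so the exact code points are unambiguous.
const rows = ["A", "a", "\u00C0" /* À */, "\u21D0" /* ⇐ */, "\u21DF" /* ⇟ */].map(describe);

console.table(rows);
// utf8Bytes:  1, 1, 2, 3, 3
// utf16Units: 1, 1, 1, 1, 1
```

The byte count grows with the code point, while .length stays at 1 as long as the character fits in one 16-bit code unit.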

Length of Icons

logItemsWithlength("🧘", "🌦", "😂", "😃", "🥖", "🚗");
// ['🧘:2', '🌦:2', '😂:2', '😃:2', '🥖:2', '🚗:2']

Each of the above icons is a single code point that does not fit in one 16-bit code unit, so UTF-16 represents it with two code units (a surrogate pair). Hence the length of each icon is 2.

Encoding values for the icon - 🧘

  • UTF-8 Encoding: 0xF0 0x9F 0xA7 0x98
  • UTF-16 Encoding: 0xD83E 0xDDD8
  • UTF-32 Encoding: 0x0001F9D8
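The UTF-16 values listed above can be derived from the code point by hand. The following is a rough sketch of the standard surrogate-pair arithmetic; toSurrogatePair and hex are hypothetical helper names introduced only for this example.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: format a number as upper-case hex.
const hex = (n) => "0x" + n.toString(16).toUpperCase();

const icon = "\u{1F9D8}"; // 🧘
const codePoint = icon.codePointAt(0);

console.log(hex(codePoint)); // 0x1F9D8
console.log(toSurrogatePair(codePoint).map(hex)); // ['0xD83E', '0xDDD8']

// The engine stores exactly these two code units:
console.log([icon.charCodeAt(0), icon.charCodeAt(1)].map(hex)); // ['0xD83E', '0xDDD8']
```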

Icons With different colors

When using reactions in various apps, we have all seen the same icon in different colors. Are these different icons, or the same icon with some CSS magic?

Irrespective of the approach, the length should still be 2, right? After all, two UTF-16 code units (32 bits, the same width as UTF-32) seem to leave plenty of room to accommodate different colors.

logItemsWithlength("🧘", "🧘🏻‍♂️");
//  ['🧘:2', '🧘🏻‍♂️:7']

Why does the second icon have a length of 7?

Icons are like words!

console.log("👩‍👩‍👦‍👦".length); // 11
console.log([..."👩‍👩‍👦‍👦"]);
// ['👩', '‍', '👩', '‍', '👦', '‍', '👦']

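Icons, like words in English, can be composed of multiple smaller pieces: other icons, modifiers, and invisible joiner characters (the '‍' entries above are zero-width joiners, U+200D). This is why icons have variable length.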
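The same breakdown answers the earlier question about the icon of length 7. Here is a small sketch that lists the code points of 🧘🏻‍♂️; toCodePoints is a hypothetical helper, and the string is written with escapes so each piece is visible.

```javascript
// Hypothetical helper: list the code points of a string in U+XXXX notation.
const toCodePoints = (str) =>
  [...str].map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

const yogi =
  "\u{1F9D8}" + // 🧘 person in lotus position (2 code units)
  "\u{1F3FB}" + // 🏻 light skin tone modifier  (2 code units)
  "\u200D" + //    zero-width joiner          (1 code unit)
  "\u2642" + //    ♂ male sign                (1 code unit)
  "\uFE0F"; //     variation selector-16      (1 code unit)

console.log(yogi.length); // 7
console.log(toCodePoints(yogi));
// ['U+1F9D8', 'U+1F3FB', 'U+200D', 'U+2642', 'U+FE0F']
```

Five code points, two of which need a surrogate pair, add up to the 7 code units that .length reports.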

How do you split these?

console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".length); // 21
console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".split(""));
// ['\uD83D', '\uDC69', '‍', '\uD83D', '\uDC69', '‍', '\uD83D', '\uDC66', '‍', '\uD83D', '\uDC66', '\uD83C', '\uDF26', '️', '\uD83E', '\uDDD8', '\uD83C', '\uDFFB', '‍', '♂', '️']

Since JS strings are UTF-16, splitting on the empty string gives you individual code units (including lone surrogate halves), which is not useful.
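Iterating by code point instead of by code unit does not solve the problem either. As a quick sketch, the snippet below rebuilds the same three-icon string from escapes (the variable names are only for this illustration) and counts it both ways.

```javascript
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}"; // 👩‍👩‍👦‍👦
const weather = "\u{1F326}\uFE0F"; // 🌦️
const yogi = "\u{1F9D8}\u{1F3FB}\u200D\u2642\uFE0F"; // 🧘🏻‍♂️
const text = family + weather + yogi;

// split("") works on UTF-16 code units.
const codeUnits = text.split("");
// Array.from (like the spread operator) iterates by code point.
const codePoints = Array.from(text);

console.log(codeUnits.length); // 21
console.log(codePoints.length); // 14
```

Neither number matches the 3 icons a reader sees, because one visible icon can span several code points.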

Introducing Intl.Segmenter

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string. - MDN

const segmenterEn = new Intl.Segmenter("en");
[...segmenterEn.segment("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️")].forEach((seg) => {
  console.log(`'${seg.segment}' starting at index ${seg.index}`);
});
// '👩‍👩‍👦‍👦' starting at index 0
// '🌦️' starting at index 11
// '🧘🏻‍♂️' starting at index 14

Getting the visible length of a string

Using the segmenter API, we can split the text into graphemes (the default granularity of Intl.Segmenter) and get the visible length of the string.

Since .segment() returns an iterable, we spread it into an array and return the array's length.

function getVisibleLength(str, locale = "en") {
  return [...new Intl.Segmenter(locale).segment(str)].length;
}
console.log("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️".length); // 21
console.log(getVisibleLength("👩‍👩‍👦‍👦🌦️🧘🏻‍♂️")); // 3
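If this runs often or on long strings, one possible variation is to reuse a single segmenter and count while iterating, rather than building an array first. This is only a sketch of that idea: graphemeSegmenter and countGraphemes are hypothetical names, and the granularity is passed explicitly even though "grapheme" is already the default.

```javascript
// Created once and reused across calls.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: count graphemes without collecting them into an array.
function countGraphemes(str) {
  let count = 0;
  for (const _segment of graphemeSegmenter.segment(str)) {
    count += 1;
  }
  return count;
}

const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}"; // 👩‍👩‍👦‍👦

console.log(countGraphemes("hello")); // 5
console.log(countGraphemes(family)); // 1
console.log(family.length); // 11
```

The result is the same as getVisibleLength for the "en" locale; the trade-off is a fixed locale in exchange for not constructing a new segmenter on every call.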
