JavaScript String Encoding Gotchas

#javascript #programming

What do these three lines of code have in common?

const len = str.length;

const chars = str.split('');

const firstChar = str[0];

Answer: none of them do what you want when emoji or certain other classes of characters are involved!

Well, what do they do then?

Let's have a look. Try running this code, or just look at the comments I added:

// this evaluates to 2!
'😋'.length;

// this evaluates to [ "\ud83d", "\ude0b" ]!
'😋'.split('');

// this evaluates to "\ud83d"!
'😋'[0];

Weird, right? This can also happen with other types of characters, such as relatively rare Chinese characters, certain mathematical characters, musical symbols, and more.

So what's going on here?

It all has to do with how text is encoded internally in JavaScript. In a computer's memory, everything is just a series of bits. Characters are no exception. The letters a, b, c, Chinese characters, musical characters, mathematical characters, emoji, all of them are translated into bits and represented as bits in memory. Only when they are output onto your screen (or printer, etc) are they translated into a visual representation that you, as a human, are able to read.

So if a computer wants to store the character a, it has to translate it into bits first. But which bits? Should it be 0, 1, 0101010, 111, 00001? None of the above? Well, someone has to decide, and whoever that person is could decide however they want.

Fortunately, as JavaScript developers, we do not have to make that decision ourselves. The designers of the JavaScript language made it for us.

And they (fortunately and unsurprisingly) decided to kick the can even further down the road. They decided to use a pre-existing text encoding standard called UTF-16. You can think of UTF-16 as just being the definition of an algorithm that can convert a character (technically a Unicode code point) into a byte sequence, and vice-versa. For example, in UTF-16 the character a is encoded in these two bytes (binary): 01100001 00000000.

But what's special about emoji and rare Chinese characters? Well, in UTF-16, every character is encoded into either two bytes or four bytes. Most characters, including all of the characters that are regularly used in English and other major languages, are encoded as two bytes. But Unicode contains more than 100,000 characters. That's more than can fit in just two bytes.

What happens with the characters that cannot fit into two bytes? They get encoded into four bytes! More technically, they are encoded into a surrogate pair. Each half of the surrogate pair is two bytes long. When the surrogate pair gets read by a computer, the computer looks at the first two bytes and knows that it's one half of a surrogate pair, and it needs to read the next two bytes in order to determine which character that 4-byte sequence is representing.

In UTF-16 a two-byte long sequence is also referred to as a "code unit". So instead of saying that characters are either two or four bytes long in UTF-16, we can say that they are either one or two code units long.

Do you see where this is going? Emoji are encoded as two code units! And as it turns out, JavaScript string functions tend to treat strings not as a sequence of characters, but as a sequence of code units! The .length property, for example, does NOT return the number of characters that are in a string, it actually returns the number of UTF-16 code units. And since emoji consist of two UTF-16 code units, one emoji character has a .length of 2. Worse, doing .split('') on a string does not split it at character boundaries, but actually at code unit boundaries. That's almost never what you really want to do.

Okay so how do I fix it?

JavaScript strings are iterable, and if you iterate over a string, it returns one character at a time. This gives us a way to work around these issues, by iterating over the string and getting all the characters. There are two main convenient ways to do this: using Array.from(), or using the spread operator. Let's try it:

Array.from('😋').length; // this evaluates to 1! Yay!
[...'😋'].length; // this evaluates to 1! Yay!

Array.from('😋'); // this evaluates to [ "😋" ]! Yay!
[...'😋']; // this evaluates to [ "😋" ]! Yay!

Array.from('😋')[0]; // this evaluates to "😋"! Yay!
[...'😋'][0]; // this evaluates to "😋"! Yay!

Yay!

But doesn't JS use UTF-8?

There is a common misconception that JavaScript uses UTF-8 encoding internally for strings. This is understandable, but incorrect. I think people have this misconception because they see that libraries like fs in Node will write files as UTF-8 if you do not specify an encoding. But for fs to do that, it does a conversion from UTF-16 to UTF-8 before writing to the file. Basically, there can be a difference between the encoding used to store strings in memory in JavaScript and the encoding that libraries like fs choose to use by default for I/O.

.charCodeAt() vs .codePointAt()

One last thing. I often see .charCodeAt() used on strings to get a character's numeric character code. For example, 'a'.charCodeAt(0) returns the number 91.

As you might expect, this doesn't work on 4-byte characters. Look what happens if we try to convert an emoji to a character code, and then back again:

// It evaluates to "\ud83d". Gross!
String.fromCharCode('😋'.charCodeAt(0));

Instead, use the codePointAt() function:

// It evaluates to "😋". Yay!
String.fromCodePoint('😋'.codePointAt(0));

I cannot think of any good reason to use charCodeAt() instead of codePointAt(). They both return the same number except for 4-byte characters, in which case charCodeAt() is basically wrong and codePointAt() is correct. So I would suggest always using codePointAt() unless you have a really good reason not to.

I would even argue that charCodeAt() is misnamed, or at least misleadingly named. What it really does is return the code unit at the given position. And that's something we rarely have reason to do.

Conclusion

I think we are all pretty used to using .length and friends on strings, but they have some serious issues with characters that encode into 4 bytes in UTF-16. Unless you are certain that your code is not going to have to handle 4-byte characters, I would recommend using the spread operator or Array.from() techniques instead. They can save you from some really weird bugs. When performance is critical though, just be aware that, in the case of calculating the length of a string, iterating it first is significantly slower than accessing .length on it.