How are strings in Javascript encoded? It might surprise some people that Javascript uses UTF-16 to encode strings. But if you are like me, you might be wondering: how can I convert this to UTF-8? Well, today we are going to find out. 🧐
The main focus of this post is how to convert a UTF-16 string to UTF-8 in Javascript. It is not about how to deal with strings in Javascript in general, because that is a topic all on its own. We will briefly go over Unicode, and then discuss the UTF-16 and UTF-8 encoding schemes insofar as we need them for our purpose.
Unicode
Unicode, not to be confused with unicorn — although they are both pretty magical 🦄 — is the de-facto encoding standard for symbols. You can even adopt a character 🤘🏻. Unicode is everywhere. Unicode assigns each character a unique 21 bit scalar value, for a total of 1,114,112 possible values. To represent a Unicode character, an encoding format called a Unicode Transformation Format, UTF for short, is used — not to be confused with WTF, although when working with Unicode you will probably use that one from time to time.
To encode Unicode, one of three encoding forms is used: 32-bit (UTF-32), 16-bit (UTF-16) and 8-bit (UTF-8). You have probably seen different spellings. For example you will typically see UTF-8 written as utf8, utf-8 or UTF8. But the official spelling is UTF followed by a hyphen followed by the encoding form.
The most widely used and probably best known is UTF-8. It stores every Unicode scalar value as a sequence of one to four 8 bit unsigned integers. A single character can still take more than four bytes, since it can consist of more than one scalar value. For example the Australian flag 🇦🇺 consists of the two values U+1F1E6 and U+1F1FA and is stored in 8 bytes in UTF-8.
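You can see this in Javascript by iterating the string by code point; the built-in TextEncoder (which produces UTF-8) confirms the byte count:

[...'🇦🇺'].map((symbol) => symbol.codePointAt(0).toString(16)) /* [ '1f1e6', '1f1fa' ], two scalar values */
new TextEncoder().encode('🇦🇺').byteLength                     /* 8, four bytes per scalar value */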
Unicode in Javascript
In Javascript strings are encoded in UTF-16. The string “Hello world 💩” will be encoded as a sequence of 16 bit unsigned integers. If I asked you what the length of this string is, what would you answer? If you answered 14 you are correct. For the ones who didn’t know, or pasted ‘Hello world 💩’.length in the console, you might have been surprised it was not 13. So what’s up with that? To answer that question we have to look at what the UTF-16 encoding scheme looks like.
UTF-16
In UTF-16 Unicode scalar values are encoded as either one or two unsigned 16 bit integers.
The encoding schema is as follows.
Scalar value                | UTF-16 code units
xxxxxxxxxxxxxxxx            | xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx    | 110110wwwwxxxxxx 110111xxxxxxxxxx (where wwww = uuuuu - 1)
Scalar values in the range of U+0000 — U+D7FF and U+E000 — U+FFFF are encoded as is in a single uint16.
Scalar values in the range of U+10000 — U+10FFFF are encoded as two uint16s.
Surrogates
The code points U+10000 — U+10FFFF are encoded as a pair of surrogates: a high (leading) surrogate in the range U+D800 — U+DBFF and a low (trailing) surrogate in the range U+DC00 — U+DFFF. These surrogate values should always come in pairs. A single surrogate is not a Unicode scalar value and thus makes the encoding ill-formed. So how are the surrogates constructed from the scalar value?
First subtract 0x10000 from the code point to account for the offset at which this range of scalar values starts.
For the high surrogate shift the remainder unsigned right by 10 and add 0xD800.
For the low surrogate take the low 10 bits and add 0xDC00.
Let’s take 💩 as an example which has the value: U+1F4A9 and convert it in Javascript to a surrogate pair.
const remainder = 0x1f4a9 - 0x10000 /* 62633 0xf4a9 */
const high = (remainder >>> 10) + 0xd800 /* 55357 0xd83d */
const low = (remainder & 0x3ff) + 0xdc00 /* 56489 0xdca9 */
So 💩 is converted into the surrogate pair 0xD83D and 0xDCA9.
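You can verify this in the console:

String.fromCharCode(0xd83d, 0xdca9) /* '💩' */
'\ud83d\udca9' === '💩'             /* true */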
That’s why our initial example “Hello world 💩” has a length of 14 and not 13: a surrogate pair was used for our poop emoji.
And since 🇦🇺 consists of two scalar values which are both encoded as a surrogate pair, we get a length of 4.
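A few quick length checks illustrate this:

'💩'.length       /* 2, one surrogate pair */
'🇦🇺'.length       /* 4, two surrogate pairs */
[...'🇦🇺'].length  /* 2, the spread operator iterates by code point instead of by code unit */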
This is one of the reasons why, if you don’t know which symbols are used in your string, you can get weird results when accessing an index in your string.
What further complicates things is that you can have composed characters. These are characters which are encoded with a base character and one or more combining marks. For example the letter é — Latin Small Letter E with Acute — can be represented as one character code 0xe9 or as two, 0x65 and 0x301.
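Both forms display as é, but they compare and count differently until you normalize them:

const composed = '\u00e9'    /* é as a single code point */
const decomposed = 'e\u0301' /* e followed by a combining acute accent */
composed.length                     /* 1 */
decomposed.length                   /* 2 */
composed === decomposed             /* false */
composed === decomposed.normalize() /* true, normalize() composes to NFC by default */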
UTF-8
Now let’s look at UTF-8. It encodes each Unicode scalar value as a sequence of one to four unsigned 8 bit integers. This means that, unlike UTF-16 which uses surrogate pairs for U+10000 — U+10FFFF, all scalar values are encoded directly as a 1–4 byte sequence.
Scalar value        | Byte 1   | Byte 2   | Byte 3   | Byte 4   | Code points
U+0000 - U+007F     | 0xxxxxxx |          |          |          | 128
U+0080 - U+07FF     | 110xxxxx | 10xxxxxx |          |          | 1920
U+0800 - U+FFFF     | 1110xxxx | 10xxxxxx | 10xxxxxx |          | 61440
U+10000 - U+10FFFF  | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 1048576
The encoding scheme looks like this.
Scalar value                | Byte 1   | Byte 2   | Byte 3   | Byte 4
00000000 0xxxxxxx           | 0xxxxxxx |          |          |
00000yyy yyxxxxxx           | 110yyyyy | 10xxxxxx |          |
zzzzyyyy yyxxxxxx           | 1110zzzz | 10yyyyyy | 10xxxxxx |
000uuuuu zzzzyyyy yyxxxxxx  | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx
The symbols in the first range are encoded as-is in 8 bits. For the next three ranges, the most significant bits of the first byte encode the length of the byte sequence: as many 1 bits as there are bytes, followed by a 0, i.e. 110, 1110 and 11110. The following (continuation) bytes all have their two most significant bits set to 10.
So how would we store 💩? Well we know the value is 0x1F4A9. This means that if we look at our encoding scheme we have to store it in four bytes.
If we overlay our schema, 0x1F4A9 looks like this: 00000001 11110100 10101001, which lines up with 000uuuuu zzzzyyyy yyxxxxxx.
We start with 1111 0000 and add the first 3 bits of the 21 bit value, which are 000, so we end up with 1111 0000. Then we start with 1000 0000 and add the next 6 bits, which are 01 1111, and we end up with 1001 1111. We start again with 1000 0000 and add the next 6 bits, 01 0010, which results in 1001 0010. For the fourth byte we repeat the process with the last 6 bits, 10 1001, and end up with 1010 1001.
The encoding for 💩 results in: F0, 9F, 92, A9.
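We can double check this against the platform's built-in TextEncoder:

new TextEncoder().encode('💩') /* Uint8Array(4) [ 240, 159, 146, 169 ], i.e. F0, 9F, 92, A9 */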
In Javascript we could implement this as follows.
const codePoint = 0x1f4a9
const firstByte = 0xf0 | (codePoint >>> 18) /* 1111 0000 + first 3 bits */
const secondByte = 0x80 | (codePoint >>> 12 & 0x3f) /* 1000 0000 + next 6 bits */
const thirdByte = 0x80 | (codePoint >>> 6 & 0x3f) /* 1000 0000 + next 6 bits */
const fourthByte = 0x80 | (codePoint & 0x3f) /* 1000 0000 + last 6 bits */
Because the encoding is a variable length byte sequence, it is impossible to access a character by index without walking forward or backward through the stream. So if you have an array of n 8 bit unsigned integers, you can’t simply say give me the character at index 8, because you don’t know whether that byte is the start of a character or a continuation byte in the middle of a multi-byte sequence.
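To find a character boundary you have to skip over continuation bytes, which are recognizable by their 10 prefix. A minimal sketch of what that could look like (the helper names here are just for illustration, they are not part of our encoder):

function isContinuationByte(byte: number): boolean {
  /* Continuation bytes have their two most significant bits set to 10 */
  return (byte & 0xc0) === 0x80
}

function startOfCharacter(bytes: Uint8Array, index: number): number {
  /* Walk backwards until we hit a byte that starts a sequence */
  while (index > 0 && isContinuationByte(bytes[index])) index--
  return index
}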
UTF-8 in Javascript
So now that we know more or less how UTF-16 and UTF-8 work, how would we implement UTF-16 to UTF-8 in Javascript? We are going to build two things: an encoder and a decoder. The encoder takes a UTF-16 string and converts it into a Uint8Array. The decoder takes in a Uint8Array and outputs a UTF-16 string. You can type along, but the full source code is also available on GitHub: https://github.com/vincentcorbee/string-encoding.
Let’s get typing
We don’t need any dependencies for our actual implementation. But since we are going to be using Typescript we are going to install typescript, ts-node and @types/node as dev dependencies. The following commands will set up the basic structure that we need for our project.
mkdir string-encoding && cd string-encoding
yarn init -y
yarn add -D ts-node typescript @types/node
npx tsc --init
mkdir src && mkdir src/lib && mkdir src/modules
touch src/lib/index.ts && touch src/index.ts && touch src/modules/index.ts
In src/index.ts add.
export * from './modules/string-encoder'
export * from './modules/string-decoder'
In src/lib/index.ts add.
export * from './is-leading-surrogate'
export * from './is-trailing-surrogate'
And finally in src/modules/index.ts add.
export * from './string-reader'
export * from './string-encoder'
export * from './string-decoder'
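The lib barrel file references two small helpers, isLeadingSurrogate and isTrailingSurrogate, whose files are not shown in this post. A minimal sketch of src/lib/is-leading-surrogate.ts and src/lib/is-trailing-surrogate.ts, based on the surrogate ranges discussed above, could look like this:

/* src/lib/is-leading-surrogate.ts */
export function isLeadingSurrogate(charCode: number): boolean {
  /* High (leading) surrogates occupy the range U+D800 - U+DBFF */
  return charCode >= 0xd800 && charCode <= 0xdbff
}

/* src/lib/is-trailing-surrogate.ts */
export function isTrailingSurrogate(charCode: number): boolean {
  /* Low (trailing) surrogates occupy the range U+DC00 - U+DFFF */
  return charCode >= 0xdc00 && charCode <= 0xdfff
}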
Encoder
We are going to start with our encoder. Create modules/string-encoder.ts.
import { isLeadingSurrogate, isTrailingSurrogate } from "../lib"
import { StringReader } from "./string-reader"
export class StringEncoder {
readonly encoding = 'UTF-8'
constructor() {}
encode(source: string): Uint8Array {
return StringEncoder.stringToUtf8(source)
}
static UTF16SurrogatePairToCodePoint(leading: number, trailing: number): number {
return ((leading - 0xd800) * 0x400) + (trailing - 0xDC00) + 0x10000
}
static stringToUtf8(source: string): Uint8Array {
const utf8codes: number[] = []
const stringReader = new StringReader(source)
let charCode: number | null
while((charCode = stringReader.next()) !== null) {
/* Character takes one byte in UTF-8 */
if (charCode >= 0x0000 && charCode <= 0x007F) utf8codes.push(charCode)
/* Character takes two bytes in UTF-8 */
else if (charCode >= 0x0080 && charCode <= 0x07FF) {
const firstByte = 0xc0 | (charCode >>> 6)
const secondByte = 0x80 | (charCode & 0x3f)
utf8codes.push(firstByte, secondByte)
}
/* High surrogate */
else if (isLeadingSurrogate(charCode)) {
const leading = charCode
/* Low surrogate */
const trailing = stringReader.peak()
if (trailing && isTrailingSurrogate(trailing)) {
/* Surrogate pairs takes four bytes in UTF-8 */
const codePoint = StringEncoder.UTF16SurrogatePairToCodePoint(leading, trailing)
const firstByte = 0xf0 | (codePoint >>> 18)
const secondByte = 0x80 | (codePoint >>> 12 & 0x3f)
const thirdByte = 0x80 | (codePoint >>> 6 & 0x3f)
const fourthByte = 0x80 | (codePoint & 0x3f)
utf8codes.push(firstByte, secondByte, thirdByte, fourthByte)
stringReader.advance()
}
else { /* Isolated high surrogate */ }
}
/* Low surrogate */
else if (isTrailingSurrogate(charCode)) { /* Isolated low surrogate */ }
/* Character takes three bytes in UTF-8 */
else if (charCode >= 0x0800 && charCode <= 0xFFFF) {
const firstByte = 0xe0 | (charCode >>> 12)
const secondByte = 0x80 | (charCode >>> 6 & 0x3f)
const thirdByte = 0x80 | (charCode & 0x3f)
utf8codes.push(firstByte, secondByte, thirdByte)
}
}
return new Uint8Array(utf8codes)
}
static typedArraytoString(typedArray: Uint8Array, radix = 16) {
return `<${typedArray.reduce((acc: string, number: number, index: number) => {
if (index > 0) acc += ', '
acc += number.toString(radix)
return acc
}, '')}>`
}
}
Our StringEncoder is instantiated without any arguments and has its encoding set to UTF-8. You could let the constructor take an optional encoding parameter and implement different encodings. Our class exposes a single public instance method called encode that takes a string as its input.
Since we only encode to UTF-8 we call the function stringToUtf8 with the string. The main loop in this function runs until the entire string is processed. We have created a helper class called StringReader to read the string. With this we can call next, peak and advance to walk through our string. It uses charCodeAt to read the character at a particular index in the string and gives back the character code, which is the UTF-16 code unit. So let’s create that first in modules/string-reader.ts.
export class StringReader {
private position: number
constructor(public source: string) {
this.position = 0
}
private isInRange(position: number) {
if (position > -1 && position < this.source.length) return true
return false
}
next() {
if (!this.isInRange(this.position)) return null
return this.source[this.position++].charCodeAt(0)
}
previous() {
if (!this.isInRange(this.position)) return null
return this.source[this.position--].charCodeAt(0)
}
peak() {
const oldPosition = this.position
const character = this.next()
this.position = oldPosition
return character
}
seek(position: number) {
if (!this.isInRange(position)) throw RangeError(`Offset: ${position} is out of range`)
this.position = position
return this
}
advance() {
const position = this.position + 1
if (!this.isInRange(position)) this.position = -1
else this.position++
return this
}
getPosition() {
return this.position
}
}
We are then going to check if we need to store the character as one, two, three or four bytes. In every case we will be outputting 8 bit unsigned integers.
If the character code can be represented as one, two or three bytes, the UTF-16 character code maps directly to the Unicode scalar value, so we can just store these codes according to our schema. If the character code needs to be stored in four bytes we are dealing with surrogate pairs. So we peek at the next code and check if it is in fact a low surrogate. If it is, we construct the scalar value from the high and low surrogate by simply reversing the encoding process. Once we have that we can encode the code point as we already discussed above and advance our string reader.
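For example, feeding the surrogate pair we calculated earlier for 💩 back into UTF16SurrogatePairToCodePoint gives us the original code point:

StringEncoder.UTF16SurrogatePairToCodePoint(0xd83d, 0xdca9) /* 128169, which is 0x1f4a9 */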
Once we have processed the entire string we return a new Uint8Array with our new and shiny UTF-8 encoding.
Decoder
Now that we have our encoder, we would also like to be able to decode a UTF-8 encoding back into a UTF-16 string.
Let’s create modules/string-decoder.ts with the following.
export class StringDecoder {
readonly encoding = 'UTF-8'
constructor() {}
static stringFromUTF16CharCode(charCodes: number[]): string {
return String.fromCharCode(...charCodes)
}
static UTF8ToString(uint8Array: Uint8Array): string {
const stringFromUTF16CharCode = StringDecoder.stringFromUTF16CharCode
const charCodes: number[] = []
for (let index = 0, length = uint8Array.byteLength; index < length; index++) {
const charCode = uint8Array[index]
/* Character takes one byte */
if (charCode < 0xc0) charCodes.push(charCode)
/* Character takes two bytes */
else if (charCode < 0xe0) charCodes.push((charCode & 0x1f) << 6 | (uint8Array[++index] & 0x3f))
/* Character takes three bytes */
else if (charCode < 0xf0) charCodes.push((charCode & 0xf) << 12 | ((uint8Array[++index] & 0x3f) << 6) | (uint8Array[++index] & 0x3f))
/* Character takes four bytes */
else {
/* Character consists of high and low surrogate pair */
const codePoint = ((charCode & 0x7) << 18 | ((uint8Array[++index] & 0x3f) << 12) | ((uint8Array[++index] & 0x3f) << 6) | (uint8Array[++index] & 0x3f)) - 0x10000
charCodes.push((codePoint >>> 10) + 0xD800, (codePoint & 0x3ff) + 0xDC00)
}
}
return stringFromUTF16CharCode(charCodes)
}
decode(typedArray: Uint8Array): string {
return StringDecoder.UTF8ToString(typedArray)
}
}
The decoder also does not take any arguments because we are going to fix the encoding to UTF-8. And unsurprisingly we expose a public method called decode which takes in a typed array, which in our case will be a Uint8Array because that is the only one we will be supporting. This function calls UTF8ToString with the typed array.
The main loop runs over the length of the array. In this loop we check whether the encoding consists of one, two, three or four bytes and process them accordingly. If you recall the encoding scheme, the first byte of a sequence tells us how many bytes we need to process to retrieve the scalar value. When we only have one byte we can store the value as-is. If we have two or three bytes, the reconstructed scalar value fits in a single 16 bit code unit, so once we piece the code point back together from the byte sequence as shown in the schema we can use it directly. If we have a four byte sequence, we are dealing with surrogate pairs. In this case, besides retrieving the scalar value, we also transform that scalar value back into a surrogate pair.
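As a sanity check, we can work the four byte sequence for 💩 back by hand:

const scalarValue = ((0xf0 & 0x7) << 18) | ((0x9f & 0x3f) << 12) | ((0x92 & 0x3f) << 6) | (0xa9 & 0x3f) /* 0x1f4a9 */
const offset = scalarValue - 0x10000  /* 0xf4a9 */
const high = (offset >>> 10) + 0xd800 /* 0xd83d */
const low = (offset & 0x3ff) + 0xdc00 /* 0xdca9 */
String.fromCharCode(high, low)        /* '💩' */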
When every byte is processed, we call stringFromUTF16CharCode, which calls String.fromCharCode with all of our UTF-16 character codes and returns a string.
If everything goes well we should output a valid string.
Now let’s test our new encoder and decoder. Create src/test.ts.
import { StringDecoder, StringEncoder } from "."
const stringA = '💩'
const stringB = '🇬🇧'
const stringC = 'abce'
const stringD = 'e\u0301'
const stringE = 'é'
const stringF = '𐐷€'
const stringG = '🫱🏿🫲🏻'
const stringH = 'Привет, мир!'
const input = stringA
const stringEncoder = new StringEncoder()
const stringDecoder = new StringDecoder()
const uint8Array = stringEncoder.encode(input)
console.log(StringEncoder.typedArraytoString(uint8Array))
const decoded = stringDecoder.decode(uint8Array)
console.log(decoded)
Now if we run
npx ts-node src/test.ts
from our root folder, we should see the UTF-8 encoded data in the console followed by the decoding which should equal our original string.
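For the default input stringA, the 💩 emoji, the output should look something like this.

<f0, 9f, 92, a9>
💩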
I did notice something though. If I encode and decode a symbol made up of multiple scalar values, for example the flag of Great Britain, I get back GB in my terminal in VS Code. Which is not wrong, since the flag uses REGIONAL INDICATOR SYMBOL LETTER G and REGIONAL INDICATOR SYMBOL LETTER B. But if I take the UTF-16 character codes that we produced and turn them into a symbol in the console, I do get to see the flag symbol. Not really sure what is going on there. 🤔
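Whether you see a flag or the two letters comes down to how the environment renders the pair of regional indicator symbols; the underlying code units are the same. For example:

String.fromCharCode(0xd83c, 0xddec, 0xd83c, 0xdde7) /* '🇬🇧' where the font joins the pair into a flag, otherwise shown as G and B */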
Conclusion
I hope if you made it this far you have found it mildly interesting. At least I did. And I always learn something myself when I try to teach others something I learned.