DEV Community

Paweł bbkr Pabian


UTF-8 internal design

We already know from previous posts in this series that UTF-8 is a variable-length, multi-byte encoding. But how exactly does this work? How does a program know where each character starts and how many bytes it has?

0xxxxxxx - This is a 1-byte character. You may notice that it uses the same bits as 7-bit ASCII, and that is correct: UTF-8 is compatible with ASCII. The leading 0 is important, though, because it was repurposed to terminate the length prefix: no 1s in front of it means a single-byte character.
110xxxxx 10xxxxxx - This is a 2-byte character.
1110xxxx 10xxxxxx 10xxxxxx - This is a 3-byte character.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - This is a 4-byte character.

So:

  • The number of 1s before the first 0 in the first byte tells how many bytes a multi-byte character has.
  • The following bytes must start with 10, which marks them as continuations of a multi-byte character.
  • If there are no leading 1s, it is ASCII.
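
The three rules above can be sketched as a few bit shifts. This is a minimal illustration in Python rather than Raku, and the function name `sequence_length` is made up for this example:

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes the character starting with lead_byte occupies.

    Raises ValueError for a continuation byte (10xxxxxx) or for any prefix
    outside the four patterns listed in this post.
    """
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: no leading 1s, plain ASCII
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError(f"not a valid lead byte: {lead_byte:08b}")


print(sequence_length("a".encode("utf-8")[0]))   # 1
print(sequence_length("ź".encode("utf-8")[0]))   # 2
print(sequence_length("😊".encode("utf-8")[0]))  # 4
```

Only the first byte is inspected here; whether the following bytes really start with 10 is a separate check.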

Here are some real characters analyzed:

$ raku -e '"a".encode>>.fmt( "%08b" ).say'
(01100001)
$ raku -e '"ź".encode>>.fmt( "%08b" ).say'
(11000101 10111010)
$ raku -e '"😊".encode>>.fmt( "%08b" ).say'
(11110000 10011111 10011000 10001010)

Note on Raku: I will illustrate many UTF examples using the Raku language. It has excellent built-in UTF support and compact syntax with no boilerplate. I will also explain the syntax briefly, which may be outside the scope of this series but will help you understand what is going on in these one-liners.
In this case the character is encoded into a byte buffer. Each byte is passed to a formatting function (>> is just a lazy way to avoid for or map), which renders it as eight zero-padded bits.
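
For readers who do not have Raku at hand, the same steps (encode to bytes, format each byte as eight bits) can be sketched in Python. This is only a rough equivalent of the one-liners above, and the helper name `utf8_bits` is hypothetical:

```python
def utf8_bits(text: str) -> list:
    """Encode text as UTF-8 and render every byte as eight zero-padded bits."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


print(utf8_bits("a"))   # ['01100001']
print(utf8_bits("ź"))   # ['11000101', '10111010']
print(utf8_bits("😊"))  # ['11110000', '10011111', '10011000', '10001010']
```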

Let's stop for a moment to admire the genius of the UTF-8 design:

  • It is compatible with 7-bit ASCII. That also makes it space efficient, because the most commonly used letters take a single byte.

  • It has natural protection against the ASCII extensions described previously, which used the 1xxxxxxx space. If a parser encounters a byte starting with 1 that does not fit the pattern "a first byte with N leading 1s and then a 0 must be followed by N-1 bytes starting with 10", it can detect that some other encoding is being loaded as UTF-8.

$ raku -e 'Buf.new( 0b10000000 ).decode'
Malformed UTF-8 at line 1 col 1
  • Software does not have to know the list of Unicode characters to find their boundaries. It is very easy to add basic UTF-8 support and the multi-byte character concept to very old programs by doing simple byte math.

(Image: Cobol)

  • Programs do not have to support the latest Unicode version. They can find the start and end of an unknown character and display a replacement glyph without messing up the rest of the text.
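
To make the "simple byte math" concrete, here is a rough Python sketch that finds character boundaries from bit patterns alone, without any table of Unicode characters. It is written in Python rather than Raku, and `split_characters` is a hypothetical name. It checks only the lead and continuation patterns described in this post; a production decoder performs further checks (for example, rejecting overlong forms) that are ignored here.

```python
def split_characters(data: bytes) -> list:
    """Split data into per-character byte sequences using only bit patterns.

    Raises ValueError when a byte does not fit the lead/continuation pattern.
    """
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >> 7 == 0b0:          # 0xxxxxxx
            length = 1
        elif lead >> 5 == 0b110:      # 110xxxxx
            length = 2
        elif lead >> 4 == 0b1110:     # 1110xxxx
            length = 3
        elif lead >> 3 == 0b11110:    # 11110xxx
            length = 4
        else:                         # stray 10xxxxxx or an unknown prefix
            raise ValueError(f"malformed UTF-8 at byte {i}: {lead:08b}")
        chunk = data[i:i + length]
        # Every byte after the lead must exist and must start with 10.
        if len(chunk) < length or any(b >> 6 != 0b10 for b in chunk[1:]):
            raise ValueError(f"malformed UTF-8 at byte {i}: bad continuation")
        chunks.append(chunk)
        i += length
    return chunks


print(split_characters("aź😊".encode("utf-8")))
# [b'a', b'\xc5\xba', b'\xf0\x9f\x98\x8a']
```

A program that does not recognize one of the returned chunks still knows exactly how many bytes to skip, so it can draw a replacement glyph for that chunk and carry on with the next one.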

Coming up next: Fun with printing sound (optional). Codepoints, what does U+0105 mean?
