DEV Community

Paweł bbkr Pabian

UTF-8 internal design

We already know from previous posts in this series that UTF-8 is a variable-length, multibyte encoding. But how exactly does this work? How does a program know where each character starts and how many bytes it has?

0xxxxxxx - This is a 1-byte character. You may notice that it uses the same bits as 7-bit ASCII, and that is correct: UTF-8 is compatible with ASCII. However, this leading 0 is important, because it was repurposed to terminate the run of 1s that encodes the byte length (here the run is empty).
110xxxxx 10xxxxxx - This is a 2-byte character.
1110xxxx 10xxxxxx 10xxxxxx - This is a 3-byte character.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - This is a 4-byte character.

So:

  • The number of leading 1s before the first 0 tells how many bytes a multibyte character has.
  • The following bytes must start with 10, which marks them as continuations of a multibyte character.
  • If there are no leading 1s, the byte is ASCII.
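
The three rules above can be sketched as plain bit masks. This is a minimal illustration in Python rather than Raku, and the function names `sequence_length` and `is_continuation` are made up for this example. It only looks at bit patterns; it does not check stricter UTF-8 rules such as rejecting overlong forms.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes the character starting with lead_byte occupies.

    Returns 0 when the byte cannot start a character: either a
    10xxxxxx continuation byte or a prefix with five or more 1s.
    """
    if (lead_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    return 0


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


print(sequence_length(0b01100001))  # lead byte of "a"
print(sequence_length(0b11000101))  # lead byte of "ź"
print(is_continuation(0b10111010))  # second byte of "ź"
```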

Here are some real characters analyzed:

$ raku -e '"a".encode>>.fmt( "%08b" ).say'
(01100001)
$ raku -e '"ź".encode>>.fmt( "%08b" ).say'
(11000101 10111010)
$ raku -e '"😊".encode>>.fmt( "%08b" ).say'
(11110000 10011111 10011000 10001010)

Note on Raku: I will illustrate many UTF examples using the Raku language. It has excellent built-in UTF support and compact syntax with no boilerplate. I will also explain the syntax briefly; that may be outside the scope of this series, but it will help you understand what is going on in these one-liners.
In this case the character is encoded into a byte buffer. Each byte is passed to a formatting function (>> is just a lazy way to avoid for or map), which renders it as eight zero-padded bits.
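
For readers without Raku at hand, roughly the same encode-then-format steps can be sketched in Python. The helper name `utf8_bits` is invented for this example and is not part of any library.

```python
def utf8_bits(text: str) -> list[str]:
    """Encode text as UTF-8 and render each byte as eight zero-padded bits."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


for sample in ("a", "ź", "😊"):
    print(sample, " ".join(utf8_bits(sample)))
```

The printed bit patterns should match the Raku output shown above.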

Let's stop for a moment to admire the genius of the UTF-8 design:

  • It is compatible with 7-bit ASCII. This also makes it space efficient for ASCII-heavy text, because the most commonly used letters take a single byte.

  • It has natural protection against the ASCII extensions described previously, which used the 1xxxxxxx space. If a parser encounters a byte starting with 1 that does not fit the pattern "a lead byte with N leading 1s followed by 0, then N-1 bytes starting with 10", it can usually detect that some other encoding is being loaded as UTF-8.

$ raku -e 'Buf.new( 0b10000000 ).decode'
Malformed UTF-8 at line 1 col 1
  • Software does not have to know the list of Unicode characters to find their boundaries. It is very easy to add basic UTF-8 support and the concept of multibyte characters to very old programs by doing simple byte math.

  • Programs do not have to support the latest Unicode version. They can find the start and end of an unknown character and display a replacement glyph without messing up the rest of the text.
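
The last three points can be illustrated with a short sketch, written in Python rather than Raku. It splits a byte buffer into characters using only byte math, with no table of Unicode characters, and swaps any byte that breaks the pattern for a replacement glyph. The names `split_characters` and `REPLACEMENT` are hypothetical, and the check covers bit patterns only, so it is looser than a full UTF-8 validator.

```python
# U+FFFD REPLACEMENT CHARACTER, encoded as UTF-8 (a choice made for this sketch).
REPLACEMENT = "\ufffd".encode("utf-8")


def split_characters(data: bytes) -> list[bytes]:
    """Split data into per-character chunks using only bit patterns.

    A byte that cannot start a character, or a lead byte that is not
    followed by enough 10xxxxxx bytes, becomes REPLACEMENT and the
    scan resumes at the very next byte.
    """
    chunks = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0b10000000:        # 0xxxxxxx
            length = 1
        elif byte >> 5 == 0b110:     # 110xxxxx
            length = 2
        elif byte >> 4 == 0b1110:    # 1110xxxx
            length = 3
        elif byte >> 3 == 0b11110:   # 11110xxx
            length = 4
        else:                        # 10xxxxxx or 11111xxx
            length = 0

        tail = data[i + 1 : i + length]
        well_formed = (
            length > 0
            and len(tail) == length - 1
            and all(b >> 6 == 0b10 for b in tail)
        )
        if well_formed:
            chunks.append(data[i : i + length])
            i += length
        else:
            chunks.append(REPLACEMENT)
            i += 1
    return chunks


print(split_characters("aź😊".encode("utf-8")))
print(split_characters(b"a\xc5b"))  # lead byte 0xC5 without its continuation
```

Resuming at the next byte after a bad one is a design choice of this sketch: it keeps the damage local, so the characters after a broken sequence still come out intact.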

Coming up next: Fun with printing sound (optional). Codepoints, what does U+0105 mean?
