DEV Community

Paweł bbkr Pabian

UTF-8 internal design

We already know from previous posts in this series that UTF-8 is a variable-length, multi-byte encoding. But how exactly does this work? How does a program know where each character starts and how many bytes it has?

0xxxxxxx - This is a 1-byte character. You may notice that it uses the same bits as 7-bit ASCII, and that is correct: UTF-8 is compatible with ASCII. The leading 0 is important, though, because it was repurposed as the terminator of the length prefix: with no 1s in front of it, the character is 1 byte long.
110xxxxx 10xxxxxx - This is a 2-byte character.
1110xxxx 10xxxxxx 10xxxxxx - This is a 3-byte character.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - This is a 4-byte character.

So:

  • The number of 1s before the first 0 in the first byte tells how many bytes a multi-byte character has.
  • The following bytes must start with 10, which marks them as continuations of a multi-byte character.
  • If there are no leading 1s, it is an ASCII character.
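
The rules above can be sketched as a small Raku function. This is only an illustration: `sequence-length` is a made-up name, not a built-in, and `+&` is Raku's bitwise AND, used here to mask out the leading bits of a byte.

```raku
# Hypothetical helper (not part of Raku): classify a byte by its leading bits.
# Returns the length of the character for a first byte, or 0 for a
# continuation byte (10xxxxxx) or a byte that matches none of the patterns.
sub sequence-length(Int $byte --> Int) {
    return 1 if ($byte +& 0b1000_0000) == 0b0000_0000;   # 0xxxxxxx
    return 2 if ($byte +& 0b1110_0000) == 0b1100_0000;   # 110xxxxx
    return 3 if ($byte +& 0b1111_0000) == 0b1110_0000;   # 1110xxxx
    return 4 if ($byte +& 0b1111_1000) == 0b1111_0000;   # 11110xxx
    return 0;
}

say sequence-length( 0b01100001 );   # 1
say sequence-length( 0b11000101 );   # 2
say sequence-length( 0b11110000 );   # 4
say sequence-length( 0b10111010 );   # 0 (continuation byte)
```

Only the first byte of a character is inspected here, which is enough to know how many bytes to read next.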

Here are some real characters analyzed:

$ raku -e '"a".encode>>.fmt( "%08b" ).say'
(01100001)
$ raku -e '"ź".encode>>.fmt( "%08b" ).say'
(11000101 10111010)
$ raku -e '"😊".encode>>.fmt( "%08b" ).say'
(11110000 10011111 10011000 10001010)

Note on Raku: I will illustrate many UTF examples using the Raku language. It has excellent built-in UTF support and compact syntax with no boilerplate. I will also briefly explain the syntax, which may be outside the scope of this series, but will help you understand what is going on in these one-liners.
In this case the character is encoded into a byte buffer. Each byte is passed to a formatting function (>> is just a lazy way to avoid for or map), which prints it as eight zero-padded bits.

Let's stop for a moment to admire the genius of UTF-8's design:

  • It is compatible with 7-bit ASCII. This also makes it space efficient, because the basic Latin letters, digits and punctuation take a single byte each.

  • It has natural protection against the ASCII extensions described previously, which used the 1xxxxxxx space. If a parser encounters a byte starting with 1 that does not match the pattern "a run of 1s followed by 0 must be followed by the matching number of bytes starting with 10", it can detect that some other encoding is being loaded as UTF-8.

$ raku -e 'Buf.new( 0b10000000 ).decode'
Malformed UTF-8 at line 1 col 1
  • Software does not have to know the list of Unicode characters to find their boundaries. It is very easy to add basic UTF-8 support and the concept of multi-byte characters to very old programs by doing simple byte math.

  • Programs do not have to support the latest Unicode version. They can find the start and end of an unknown character and display a replacement glyph without messing up the rest of the text.
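
Here is a minimal sketch of that "simple byte math": it splits a byte buffer into characters using only the leading-bit rules, with no Unicode character list involved. The name `split-sequences` is hypothetical, and the check is deliberately basic. It only verifies the leading-bit patterns, so it is not a full UTF-8 validator.

```raku
# Hypothetical helper: group bytes into characters by looking at leading bits only.
# A byte that breaks the rules becomes its own group marked as malformed,
# so the groups that follow it stay aligned.
sub split-sequences(Blob $bytes) {
    my @groups;
    my $i = 0;
    while $i < $bytes.elems {
        my $lead = $bytes[$i];
        my $len = ($lead +& 0b1000_0000) == 0b0000_0000 ?? 1
               !! ($lead +& 0b1110_0000) == 0b1100_0000 ?? 2
               !! ($lead +& 0b1111_0000) == 0b1110_0000 ?? 3
               !! ($lead +& 0b1111_1000) == 0b1111_0000 ?? 4
               !! 0;
        # The first byte must be followed by $len - 1 bytes shaped like 10xxxxxx.
        my $valid = $len > 0 && $i + $len <= $bytes.elems;
        if $valid {
            for $i + 1 ..^ $i + $len -> $j {
                $valid = False unless ($bytes[$j] +& 0b1100_0000) == 0b1000_0000;
            }
        }
        my $take = $valid ?? $len !! 1;
        my $bits = $bytes.subbuf($i, $take).list.map( *.fmt("%08b") ).join(" ");
        @groups.push: $valid ?? $bits !! "malformed: $bits";
        $i += $take;
    }
    return @groups;
}

.say for split-sequences( "aź😊".encode );
# 01100001
# 11000101 10111010
# 11110000 10011111 10011000 10001010

# ISO-8859-2 stores "ź" as the single byte 10111100.
.say for split-sequences( Buf.new( 0b01100001, 0b10111100 ) );
# 01100001
# malformed: 10111100
```

A program that cannot render a particular group can print a replacement glyph for it and simply continue with the next one.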

Coming up next: Fun with printing sound (optional). Codepoints, what does U+0105 mean?
