DEV Community

Paweł bbkr Pabian


UTF-8 internal design

We already know from previous posts in this series that UTF-8 is a variable-length, multibyte encoding. But how exactly does this work? How does a program know where each character starts and how many bytes it has?

0xxxxxxx - This is a 1-byte character. You may notice that it uses the same bits as 7-bit ASCII, and that is correct: UTF-8 is compatible with ASCII. However, this leading 0 is important, because it was repurposed to terminate the run of 1s that encodes the byte length. With no 1s before it, the character is a single byte.
110xxxxx 10xxxxxx - This is a 2-byte character.
1110xxxx 10xxxxxx 10xxxxxx - This is a 3-byte character.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - This is a 4-byte character.

So:

  • The number of leading 1s before the first 0 tells how many bytes a multibyte character has.
  • The following bytes must start with 10, which marks them as continuations of a multibyte character.
  • If there are no leading 1s, it is ASCII.
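
The three rules above can be sketched as plain bit masks. This is a minimal illustration in Python, not taken from any particular library; the function names `sequence_length` and `is_continuation` are made up for this example.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with `lead` occupies.

    Raises ValueError when the byte cannot start a sequence
    (a continuation byte, or five or more leading 1s).
    """
    if (lead & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:  # 110xxxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:  # 1110xxxx
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:  # 11110xxx
        return 4
    raise ValueError(f"{lead:08b} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True when the byte has the 10xxxxxx continuation marker."""
    return (byte & 0b1100_0000) == 0b1000_0000
```

Each check masks off the x bits and compares what is left with the expected marker, so no table of Unicode characters is involved.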

Here are some real characters analyzed:

$ raku -e '"a".encode>>.fmt( "%08b" ).say'
(01100001)

$ raku -e '"ź".encode>>.fmt( "%08b" ).say'
(11000101 10111010)

$ raku -e '"😊".encode>>.fmt( "%08b" ).say'
(11110000 10011111 10011000 10001010)

Note on Raku: I will illustrate many UTF examples using the Raku language. It has excellent built-in UTF support and a compact syntax with no boilerplate. I will also briefly explain the syntax, which may be outside the scope of this series but helps to understand what is going on in these one-liners.
In this case, the character is encoded into a byte buffer. Each byte is passed to a formatting function (>> is just a shorthand that avoids writing for or map), which renders it as eight zero-padded bits.
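
To see what the x positions carry in the characters analyzed above, the marker bits can be stripped and the remaining bits joined. The following is a rough Python sketch; `payload_bits` is a hypothetical helper name, and it assumes its argument is a single character.

```python
def payload_bits(char: str) -> str:
    """Return the x bits of one character's UTF-8 bytes, markers stripped."""
    encoded = char.encode("utf-8")
    if len(encoded) == 1:
        return f"{encoded[0]:07b}"        # 0xxxxxxx carries 7 payload bits
    # 110xxxxx carries 5 bits, 1110xxxx carries 4, 11110xxx carries 3
    lead_bits = 7 - len(encoded)
    bits = f"{encoded[0]:08b}"[-lead_bits:]
    for byte in encoded[1:]:
        bits += f"{byte:08b}"[2:]         # drop the 10 continuation marker
    return bits


for char in "a\u017a\U0001f60a":          # a, ź, 😊
    bits = payload_bits(char)
    print(char, bits, hex(int(bits, 2)))
```

For ź the two bytes 11000101 10111010 leave 00101 and 111010, which join into 00101111010, or 0x17a. Python's built-in `ord` returns the same number for that character.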

Let's stop for a moment to admire the genius of UTF-8's design:

  • It is compatible with 7-bit ASCII. This also makes it space efficient for ASCII-heavy text: ASCII letters, digits and punctuation take a single byte each.

  • It has natural protection against the ASCII extensions described previously, which used the 1xxxxxxx space. If a parser encounters a byte starting with 1 that does not fit the pattern "N leading 1s and a 0, followed by N-1 bytes starting with 10", it can detect that some other encoding is being loaded as UTF-8.

$ raku -e 'Buf.new( 0b10000000 ).decode'
Malformed UTF-8 at line 1 col 1

  • Software does not have to know the list of Unicode characters to find their boundaries. It is very easy to add basic UTF-8 support and the concept of multibyte characters to very old programs by doing simple byte math.

(Image: Cobol)

  • Programs do not have to support the latest Unicode version. They can find the start and end of an unknown character and display a replacement glyph without messing up the rest of the text.
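
The last two points can be sketched with nothing but byte math. The Python function below is an illustration only, and `split_sequences` is a made-up name. It checks the marker bits and nothing else, so it assumes that overlong encodings and other finer validity rules are handled elsewhere, as a real decoder would.

```python
def split_sequences(data: bytes) -> list:
    """Split raw bytes into UTF-8 sequences by looking only at marker bits.

    A byte that breaks the pattern is returned as its own one-byte chunk,
    so the well-formed characters around it stay intact.
    """
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >> 7 == 0b0:          # 0xxxxxxx
            length = 1
        elif lead >> 5 == 0b110:      # 110xxxxx
            length = 2
        elif lead >> 4 == 0b1110:     # 1110xxxx
            length = 3
        elif lead >> 3 == 0b11110:    # 11110xxx
            length = 4
        else:                         # stray 10xxxxxx or 11111xxx
            length = 0
        tail = data[i + 1:i + length]
        well_formed = (
            length > 0
            and len(tail) == length - 1
            and all(byte >> 6 == 0b10 for byte in tail)
        )
        if not well_formed:
            length = 1                # isolate the offending byte
        chunks.append(data[i:i + length])
        i += length
    return chunks


# a, ź, 😊, then a stray continuation byte, then "!"
data = "a\u017a\U0001f60a".encode("utf-8") + b"\x80" + b"!"
print(split_sequences(data))
print(data.decode("utf-8", errors="replace"))
```

The stray 0x80 ends up in a chunk of its own while the characters on both sides keep their boundaries. Python's built-in decoder with `errors="replace"` does the user-facing half of the job: it substitutes U+FFFD for the bad byte and leaves the rest of the text untouched.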

Coming up next: Fun with printing sound (optional). Codepoints, what does U+0105 mean?
