Explain UTF-8 character encoding like I'm five

#explainlikeimfive

Austin S. Hemmelgarn • Sep 29 '20 • Edited

Probably easiest if we build up from base concepts, so (note, all Unicode code points listed below are in decimal to specifically avoid any ambiguity involved in the discussion of encodings):

Unicode

Unicode is, quite simply, a method of representing (almost) every grapheme (a 'character' in common terminology) used in the world's known writing systems, alongside a very large number of symbols and a lot of different control codes, as unsigned integers known as 'code points'. Originally code points were 32-bit integers, but it very quickly became obvious that this was beyond excessive, and when UTF-16 finally gained traction the standard was revised so that code points are now only 21-bit unsigned integers (the highest valid code point is 1114111).

All of the code points are sorted into groups of 65536 known as 'planes'. If you write a code point as a four-byte integer, the plane it belongs to is given by the value of the top two bytes, with the other two bytes representing where in that plane it is. Seven of these planes actually have code points assigned to them (specifically planes 0-3 and 14-16), but about 99% of actual usage involves only the first two planes, known as the Basic Multilingual Plane (which holds the vast majority of the most common writing systems, except certain less frequently used Hanzi characters and a number of the Brahmic scripts) and the Supplementary Multilingual Plane (which includes the aforementioned Brahmic scripts, an assortment of historical scripts, a number of supplemental characters for scripts in the BMP, and a large number of symbols).
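To make the plane arithmetic concrete, here's a minimal sketch in Python. The helper names plane_of and offset_in_plane are my own for illustration, not anything from the Unicode standard or a library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= 1114111:
        raise ValueError("not a valid Unicode code point")
    # Each plane holds 65536 (2**16) code points, so the plane is
    # whatever is left after dropping the low 16 bits.
    return code_point >> 16


def offset_in_plane(code_point: int) -> int:
    """Return the position of a code point inside its plane."""
    return code_point & 0xFFFF


print(plane_of(65))      # 0 (Latin Capital Letter A, in the BMP)
print(plane_of(128512))  # 1 (an emoji, in the SMP)
```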

Of specific note for later discussion, the first 128 code points (numbered 0-127) have the exact same meaning as the bytes with the same value interpreted as 7-bit ASCII or any of the ISO-8859 character encodings (and a number of other encodings as well).
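If you want to check that claim yourself, here's a quick sketch using Python's built-in codecs. The function name same_as_legacy is hypothetical, and I'm only spot-checking ASCII plus two of the ISO-8859 encodings.

```python
def same_as_legacy(code_point: int) -> bool:
    """Check that a code point encodes to one byte of the same value."""
    expected = bytes([code_point])
    ch = chr(code_point)
    return all(
        ch.encode(codec) == expected
        for codec in ("ascii", "latin-1", "iso8859-15")
    )


# Every one of the first 128 code points maps to the byte of the same value.
all_match = all(same_as_legacy(cp) for cp in range(128))
print(all_match)  # True
```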

UTF-32

Of course, Unicode is kind of pointless unless you have a way for computers to represent those characters. The most naive encoding for this is to just have each character be four bytes interpreted as an integer which represents the code point. This has four major issues:

It's very inefficient, because for most writing systems you only, in fact, need at most two bytes (and many Western European languages can get away with only using one). This means that most of your bytes in any text file will be null bytes. This is compounded by the fact that most of the planes are unused, so at minimum there will always be at least one null byte between actually significant bytes in a UTF-32 stream (at least, until we start using planes numbered 256 or higher).
Correct parsing of bytes into a UTF-32 character stream is dependent on the byte order of the stream. More specifically, for a given set of four bytes, there are actually four ways you can interpret it as an unsigned integer, and you have to pick the right ordering to interpret a stream of UTF-32 text correctly. This is commonly solved by prepending code point 65279, known as a 'byte order mark', to the character stream.
It's incompatible with existing text encodings. Interpreting it as 7-bit ASCII does not yield correct results, even if it's only got code points with a value of 127 or less. This is significant because it means you cannot use UTF-32 with any existing system that expects a certain byte-based encoding: many environments assign special meaning to certain bytes (especially the null byte and the ones for /, \, and %), and bytes with those values may appear as part of unrelated UTF-32 characters.
It requires special synchronization when transmitting or receiving characters or loading characters from the middle of a text stream. More specifically, you need some way to know where the character boundaries are.
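A short sketch of the first two issues, using Python's built-in utf-32-be and utf-32-le codecs. The helper decode_as_big_endian is just something I made up for the demonstration.

```python
text = "Hi"

big_endian = text.encode("utf-32-be")
little_endian = text.encode("utf-32-le")
print(big_endian)     # b'\x00\x00\x00H\x00\x00\x00i'
print(little_endian)  # b'H\x00\x00\x00i\x00\x00\x00'

# Six of the eight bytes are null bytes.
null_count = big_endian.count(0)
print(null_count)  # 6

# Code point 65279 (the byte order mark) in each byte order.
bom_be = chr(65279).encode("utf-32-be")
bom_le = chr(65279).encode("utf-32-le")


def decode_as_big_endian(data: bytes) -> str:
    """Try to decode as big-endian UTF-32, reporting failure as a marker."""
    try:
        return data.decode("utf-32-be")
    except UnicodeDecodeError:
        return "<invalid>"


# Guessing the wrong byte order gives values that aren't code points at all.
print(decode_as_big_endian(little_endian))  # <invalid>
```

Because the byte order mark reads differently in each order, a decoder that sees it at the start of a stream can tell which order the rest of the stream is in.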

Obviously there's got to be a better solution...

UTF-16

UTF-16 evolved out of an interest in dealing with the efficiency issues of UTF-32. Instead of using a sequence of 32-bit integers, it uses a sequence of 16-bit integers. Code points in the BMP are represented simply as their exact value as a 16-bit integer. Code points beyond the BMP up through plane 16 (the highest plane likely to actually be used during our lifetime) are represented by a special pair of code points known as a surrogate pair. Anything beyond plane 16 cannot be represented (though this is no longer an issue, as the Unicode standard has been revised as mentioned above).

This sounds complicated, and it kind of is, but for most everyday text not involving a Brahmic script or Hanzi, it's more efficient than UTF-32 by a significant margin. It's also greatly simplified by the fact that the surrogate code points used for encoding things beyond the BMP are specifically reserved for that purpose, so there's no guesswork involved in figuring out whether something is a character by itself or part of a surrogate pair.
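For the curious, here's roughly how a surrogate pair gets computed. This is a sketch: the function name to_surrogate_pair is mine, and the constants (55296 and 56320 as the starts of the two reserved surrogate ranges, with 10 bits carried by each half) are details the explanation above doesn't spell out.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point beyond the BMP into a (high, low) surrogate pair."""
    if not 65536 <= code_point <= 1114111:
        raise ValueError("only code points in planes 1-16 need a surrogate pair")
    offset = code_point - 65536          # leaves a 20-bit value
    high = 55296 + (offset >> 10)        # top 10 bits go in the first unit
    low = 56320 + (offset & 0x3FF)       # bottom 10 bits go in the second
    return high, low


high, low = to_surrogate_pair(128512)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Compare against Python's own UTF-16 encoder.
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
builtin = chr(128512).encode("utf-16-be")
print(manual == builtin)  # True
```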

UTF-16 has its own issues though:

It's still got some efficiency issues for a small handful of languages, most notably English (English text in UTF-16 is mostly alternating null bytes and characters). However, this is nowhere near as bad as for UTF-32.
It's still dependent on byte order and still needs a byte order mark.
It's still incompatible with any existing encoding scheme and still has the same issues because of this that UTF-32 does.
It still requires synchronization.
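Here's a small sketch of the first two points using Python's built-in codecs. Note that with UTF-16 a wrong byte order guess doesn't even produce an error here, it just quietly gives you different characters.

```python
text = "Hi"

little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")
print(little_endian)  # b'H\x00i\x00'
print(big_endian)     # b'\x00H\x00i'

# Reading little-endian bytes as big-endian still decodes, but to the
# wrong code points (18432 and 26880 instead of 72 and 105).
misread = little_endian.decode("utf-16-be")
print([ord(ch) for ch in misread])  # [18432, 26880]
```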

Microsoft Windows uses UTF-16 internally, as do a handful of other systems.

UTF-8

UTF-8, in turn, evolved out of an interest in fixing that compatibility issue that UTF-16 and UTF-32 have. It's much more complicated than both, and works as follows:

For code points 0-127, represent them using a single byte with the same value as the code point.
For code points 128-2047, represent them using two bytes, embedding 5 bits of the code point in the first byte and 6 in the second.
For code points 2048-65535, represent them using three bytes, embedding 4 bits of the code point in the first byte and 6 each in the second and third.
For code points 65536-1114111, represent them using four bytes, embedding 3 bits of the code point in the first byte and 6 each in the second, third, and fourth.
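The four rules above can be sketched as a hand-rolled encoder. Treat this as an illustration rather than a reference implementation: the function name utf8_encode is mine, the marker bits (110, 1110, and 11110 at the top of a leading byte, 10 at the top of every following byte) aren't listed above, and I'm not bothering to reject the reserved surrogate code points the way a real encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (no surrogate check, for brevity)."""
    if not 0 <= cp <= 1114111:
        raise ValueError("not a valid Unicode code point")
    if cp <= 127:
        return bytes([cp])
    if cp <= 2047:
        return bytes([
            0b11000000 | (cp >> 6),           # 5 bits
            0b10000000 | (cp & 0b111111),     # 6 bits
        ])
    if cp <= 65535:
        return bytes([
            0b11100000 | (cp >> 12),                # 4 bits
            0b10000000 | ((cp >> 6) & 0b111111),    # 6 bits
            0b10000000 | (cp & 0b111111),           # 6 bits
        ])
    return bytes([
        0b11110000 | (cp >> 18),                # 3 bits
        0b10000000 | ((cp >> 12) & 0b111111),   # 6 bits
        0b10000000 | ((cp >> 6) & 0b111111),    # 6 bits
        0b10000000 | (cp & 0b111111),           # 6 bits
    ])


# One example from each range, checked against Python's own encoder.
for cp in (65, 197, 8491, 128512):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(cp, utf8_encode(cp).hex())
```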

This allows representation of the same set of characters as UTF-16, with the following specific benefits:

It preserves compatibility with 7-bit ASCII, which means most environments can work with it safely.
It is byte order agnostic, not requiring a byte order mark at all (in fact, the Unicode standard neither requires nor recommends one for UTF-8, but many implementations just ignore it if it's there and a number actually add one anyway).
It's inherently self synchronizing. Because of how the encoding works, if you already know it's UTF-8 you're looking at, you can pick up anywhere in the byte stream and start decoding safely without having to worry about getting the character boundaries wrong.
Because it's self-synchronizing, deletion errors that remove whole bytes from a stream are non-fatal. You just lose the affected characters, while in UTF-16 such an error occurring and causing the loss of an odd number of bytes will result in the rest of the stream being garbled.
It takes up slightly less space for many languages that use the Latin alphabet, and significantly less space for many programming and markup languages (because they heavily utilize ASCII characters for formatting).
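The self-synchronizing part is easier to see in code. A minimal sketch, assuming the input really is UTF-8: every byte in the middle of a multi-byte character has 10 as its top two bits, so you just skip those until you hit a byte that starts a character. The name next_boundary is hypothetical.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index at or after pos where a character starts."""
    while pos < len(data) and (data[pos] & 0b11000000) == 0b10000000:
        pos += 1
    return pos


data = "aÅb".encode("utf-8")
print(data)  # b'a\xc3\x85b'

# Index 2 lands in the middle of the two-byte 'Å'.
start = next_boundary(data, 2)
print(start)                        # 3
print(data[start:].decode("utf-8"))  # b
```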

There are still some downsides to UTF-8, though they're different from those of UTF-16 and UTF-32:

For writing systems on the BMP that need characters above code point 2047 (including many East-Asian writing systems), it's actually less efficient than UTF-16, because it requires 3 bytes to represent such characters instead of only 2.
Because characters are variable width, indexing characters (that is, figuring out what character is at a given position in the stream) requires parsing the stream from the start up to that position, rather than jumping straight to a byte offset.
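Both downsides are easy to demonstrate. This is only a sketch: the sample strings are my own, and byte_offset_of is a made-up helper that assumes valid UTF-8 input.

```python
samples = {"English": "hello", "Japanese": "こんにちは"}

sizes = {
    name: tuple(len(text.encode(codec))
                for codec in ("utf-8", "utf-16-le", "utf-32-le"))
    for name, text in samples.items()
}
print(sizes["English"])   # (5, 10, 20)
print(sizes["Japanese"])  # (15, 10, 20)


def byte_offset_of(data: bytes, index: int) -> int:
    """Find the byte offset of the index-th character by scanning from the start."""
    seen = -1
    for offset, byte in enumerate(data):
        # Bytes that don't start with the bits 10 begin a new character.
        if (byte & 0b11000000) != 0b10000000:
            seen += 1
            if seen == index:
                return offset
    raise IndexError("character index out of range")


data = "aÅb".encode("utf-8")
print(byte_offset_of(data, 2))  # 3
```

Character 2 sits at byte 3 rather than byte 2 because the 'Å' before it takes two bytes, and the only way to find that out is to walk over it.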

Combining characters, normalization, and special symbols

Of course, the 'complicated long form' part of UTF-8 is not really as specific to it as it seems. One of the more interesting aspects of Unicode from an implementer's perspective is the concept of 'combining characters'. These are special code points that work similarly to how old typewriters could over-type on top of existing text (used for things like underlines and diacritical marks). The most common examples are diacritical marks, but others exist.

What makes this interesting is that many of the characters you can produce by combining combining characters with other characters have their own pre-combined forms assigned to other code points. This, in turn, results in a 1:N mapping of actual characters to code points, which makes parsing any Unicode text more complicated no matter what encoding system you're using. Many systems prefer one form or the other for these characters, and will 'normalize' any text they process to be in that form.

This gets even more interesting though when you consider that Unicode actually differentiates characters used as letters versus the same characters used as symbols, which in turn can lead to an even larger number of ways to represent a character.

An easy example of this is 'Å'. It's a character used in a number of Nordic languages, and is also used as the symbol for the unit of measurement known as the ångström. In Unicode, it's got three representations:

Code point 197 'Latin Capital Letter A with Ring Above'
Code point 8491 'Angstrom Sign'
Code point 65 'Latin Capital Letter A' followed by code point 778 'Combining Ring Above'
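You can see all three representations, and what normalization does to them, with Python's standard unicodedata module. A quick sketch: NFC and NFD are two of the standard normalization forms (composed and decomposed respectively), which I haven't named above.

```python
import unicodedata

forms = [chr(197), chr(8491), chr(65) + chr(778)]

# All three look the same on screen but compare as different strings.
print(len(set(forms)))  # 3

# NFC prefers the pre-combined letter; NFD prefers base + combining mark.
nfc = {unicodedata.normalize("NFC", s) for s in forms}
nfd = {unicodedata.normalize("NFD", s) for s in forms}
print([[ord(ch) for ch in s] for s in nfc])  # [[197]]
print([[ord(ch) for ch in s] for s in nfd])  # [[65, 778]]
```

After normalizing to either form, all three collapse to a single representation, which is why systems that compare or search text usually normalize first.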