DEV Community

pickuma
Posted on • Originally published at pickuma.com

UTF-8 vs Unicode: The Difference Every Developer Should Know

People use "Unicode" and "UTF-8" as if they were two answers to the same question. They are not. They sit at different layers of the same system, and confusing them leads to mojibake, off-by-one bugs in string length, and broken emoji.

Unicode is the catalog, not the bytes

Unicode is a standard maintained by the Unicode Consortium. Its core job is simple to state: give every character humans write a unique number. That number is called a code point, written in the form U+XXXX. The letter A is U+0041. The grinning-face emoji is U+1F600. A code point is an abstract identity, not a sequence of bytes — it is an entry in a giant catalog.

Unicode covers far more than letters and emoji. It includes scripts (Latin, Cyrillic, Han, Arabic), combining marks, control characters, and symbols. The code space runs from U+0000 up to U+10FFFF, which is room for 1,114,112 code points. Only a fraction of them, on the order of 150,000 characters, is assigned so far, and that number grows with each Unicode release.
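To make the catalog idea concrete, here is a minimal Python 3 sketch that looks up code points and their official names using only the standard library. The helper `u_plus` and the variable names are made up for illustration.

```python
import sys
import unicodedata

# ord() maps a character to its code point; chr() goes the other way.
letter_cp = ord("A")           # 0x41
emoji_cp = ord("\U0001F600")   # 0x1F600, the grinning-face emoji

def u_plus(cp: int) -> str:
    """Format a code point in the conventional U+XXXX notation."""
    return f"U+{cp:04X}"

# Each code point has an official name in the Unicode character database.
print(u_plus(letter_cp), unicodedata.name(chr(letter_cp)))
print(u_plus(emoji_cp), unicodedata.name(chr(emoji_cp)))

# The code space ends at U+10FFFF, so it holds 0x110000 code points in total.
print(u_plus(sys.maxunicode), sys.maxunicode + 1)
```

Nothing here touches bytes: `ord` and `chr` only convert between a character and its catalog number.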

What Unicode does not tell you is how to store a code point in memory or send it over a wire. U+1F600 is the number 128,512. Do you write it as four bytes? Two? A variable number? That choice is a separate decision, and that decision is an encoding.

UTF-8, UTF-16, UTF-32: three ways to write the same numbers

An encoding is a concrete rule that maps each code point to a sequence of bytes (and back). Unicode defines several, and they all represent the exact same characters — they just disagree on the byte layout.

  • UTF-32 uses a fixed 4 bytes per code point. Simple to index, but wasteful: plain English text quadruples in size.
  • UTF-16 uses 2 bytes for common characters and a 4-byte "surrogate pair" for code points above U+FFFF (like most emoji). It is what Java, JavaScript, and Windows use internally.
  • UTF-8 is variable-width: 1 to 4 bytes per code point. The first 128 code points (U+0000–U+007F, i.e. ASCII) encode as a single byte identical to ASCII. Latin text stays compact; other scripts use 2–4 bytes.
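The size differences between the three encodings are easy to check. The sketch below, assuming Python 3, encodes a few sample characters with each one and counts the bytes. The sample set and the helper `byte_lengths` are illustrative choices, not part of any standard.

```python
# Encode the same code points with each of the three encodings.
# The "-be" (big-endian) codec names are used so no byte-order mark is added.
samples = {"A": "A", "e-acute": "\u00E9", "grinning face": "\U0001F600"}

def byte_lengths(ch: str) -> dict:
    """Return the encoded size of ch in bytes for each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for label, ch in samples.items():
    print(f"{label:14} U+{ord(ch):04X}", byte_lengths(ch))

# The emoji lies above U+FFFF, so UTF-16 stores it as a surrogate pair.
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```

The emoji takes 4 bytes in all three encodings, while "A" takes 1, 2, and 4 bytes respectively.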

Here is the same character at three layers — try it yourself:

$ printf '😀' | xxd
00000000: f09f 9880    # 4 bytes in UTF-8
# code point: U+1F600   (the Unicode identity)
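The four bytes above are not arbitrary: UTF-8 spreads the bits of the code point across a lead byte and continuation bytes. As a rough sketch of that rule (a hand-written helper named `utf8_encode` for illustration; in practice you would call `str.encode`), assuming Python 3:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x1F600).hex())  # f09f9880, matching the xxd output
```

The result agrees with Python's built-in encoder, `"\U0001F600".encode("utf-8")`.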

UTF-8 won the web for a few concrete reasons. It is backward compatible with ASCII, so decades of existing files and protocols just worked. It is compact for the Latin-script content that dominated the early web. And it has no byte-order problem: UTF-16 and UTF-32 come in big-endian and little-endian flavors and need a byte-order mark to disambiguate, while UTF-8's byte sequence is fully defined on its own. Today the overwhelming majority of web pages are served as UTF-8.
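Two of those reasons, ASCII compatibility and the absence of a byte-order problem, can be seen in a few lines. This is a small sketch assuming Python 3; the sample text "Hi" is arbitrary.

```python
import codecs

text = "Hi"

# ASCII compatibility: pure-ASCII text has identical bytes in ASCII and UTF-8.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)

# UTF-16 depends on byte order: the same text has two possible layouts.
big = text.encode("utf-16-be")      # 00 48 00 69
little = text.encode("utf-16-le")   # 48 00 69 00
print(big.hex(), little.hex())

# A byte-order mark (U+FEFF) at the start tells a decoder which layout it has.
with_bom = codecs.BOM_UTF16_LE + little
print(with_bom.hex())

# Python's generic "utf-16" decoder reads the BOM and picks the byte order.
print(with_bom.decode("utf-16"))
```

UTF-8 has only one possible byte sequence for "Hi", so there is nothing for a decoder to guess.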

Unicode is the map of characters; UTF-8 is one way to write them down as bytes. Asking "is this Unicode or UTF-8?" is like asking whether a phone number is a person or ink on paper. The number identifies who you call (the code point); the writing is how it is recorded (the encoding). A file is "Unicode text encoded as UTF-8" — both, at different layers.

Why "string length" is a trick question

Once you separate the layers, a classic interview trap dissolves. What is the length of a string containing one emoji? It depends which layer you measure:

  • Bytes: how much storage it takes. 😀 is 4 bytes in UTF-8.
  • Code points: how many Unicode entries it contains. 😀 is 1 code point.
  • Grapheme clusters: how many "characters" a human perceives. 😀 is 1 grapheme cluster.

These counts often diverge with emoji. A thumbs-up with a skin-tone modifier is two code points (the base symbol plus a modifier) that render as one glyph. A family emoji can be several code points joined by an invisible zero-width joiner (U+200D), yet a person sees a single picture. So "👨‍👩‍👧".length in JavaScript returns 8, not 1: three emoji at two UTF-16 code units each, plus two joiners at one unit each. JS counts UTF-16 code units, not perceived characters.
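The first two layers, plus the UTF-16 count that JavaScript reports, can be measured with the Python 3 standard library alone. The helper `counts` below is an illustrative sketch. The strings are written with escapes so the invisible joiners are explicit. Grapheme clusters are left out because the standard library has no segmentation function; that count needs a third-party Unicode segmentation library.

```python
def counts(s: str) -> dict:
    """Measure one string at three different layers."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "code_points": len(s),  # Python's len() counts code points
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # what JS .length counts
    }

grinning = "\U0001F600"                                  # one emoji
thumbs_up = "\U0001F44D\U0001F3FD"                       # base + skin-tone modifier
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"    # man ZWJ woman ZWJ girl

for name, s in [("grinning", grinning), ("thumbs_up", thumbs_up), ("family", family)]:
    print(name, counts(s))
```

Each of the three strings is a single perceived character, yet the family sequence measures 18 bytes, 5 code points, and 8 UTF-16 units.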

The practical rule: decide which count you actually need. Truncating a string for a database column whose limit is measured in bytes? Count bytes, and cut on a code point boundary. Validating a username limit a human understands? Count grapheme clusters, using a library that understands Unicode segmentation. Naive slicing can split a multi-byte sequence and corrupt the text.
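For the byte-limit case, one possible approach in Python 3 is to cut the encoded bytes and let the decoder drop any incomplete trailing sequence. The function name `truncate_utf8` is hypothetical, and this sketch assumes the limit applies to UTF-8 bytes. It never splits a code point, but it can still split a grapheme cluster such as a ZWJ emoji sequence.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end in the middle of a multi-byte sequence;
    # errors="ignore" discards that incomplete tail instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("h\u00E9llo", 2))      # h   (the 2-byte e-acute does not fit)
print(truncate_utf8("h\u00E9llo", 3))      # hé
print(truncate_utf8("a\U0001F600", 4))     # a   (the 4-byte emoji does not fit)
```

A plain `encoded[:max_bytes]` without the tolerant decode would leave a dangling lead byte that fails to decode or renders as a replacement character.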


