https://www.youtube.com/watch?v=Z_LQa_NeA8w
One emoji can have three different lengths at the same time.
In UTF-8 bytes, the family emoji 👨‍👩‍👧‍👦 is 25. In Unicode code points, it's 7. In grapheme clusters, it's 1.
All three answers are correct.
That's the bug waiting underneath almost every piece of text handling code: we keep asking "how long is this string?" as if the question only has one meaning.
Unicode exists because text actually lives in three different layers.
Layer 1: Bytes
At the bottom, computers only store bytes.
ASCII was the first successful shared mapping: a byte value for A, a byte value for z, a byte value for space. It was clean, simple, and completely insufficient. ASCII only gave you 128 slots. Enough for English. Not enough for the world.
So every region built its own encoding table. Shift-JIS in Japan. KOI8 in Russia. Latin-1 in Western Europe. Each worked locally. None agreed globally. Move text between systems and you got mojibake — garbage symbols where words should be.
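Mojibake is easy to reproduce: encode text under one table and decode it under another. A minimal Python sketch:

```python
text = "café"

# Stored as UTF-8 bytes: 'é' becomes the two-byte sequence 0xC3 0xA9.
raw = text.encode("utf-8")

# A system that assumes Latin-1 reads each of those bytes as its own
# character, so the single 'é' turns into two garbage symbols.
garbled = raw.decode("latin-1")

print(garbled)  # cafÃ©
```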
Unicode fixed the agreement problem by separating character identity from byte storage.
UTF-8 fixed the storage problem by keeping ASCII as one byte and expanding only when needed:
- 1 byte for ASCII
- 2 bytes for many European and Middle Eastern scripts
- 3 bytes for most modern writing systems
- 4 bytes for everything else, including emoji
That's why UTF-8 won. English stays compact. Old ASCII files still work. And the encoding can represent the full Unicode space.
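Those tiers can be checked directly, since encoding a string exposes its byte width. A quick Python sketch (using escape sequences so the exact code points are unambiguous):

```python
# One character from each UTF-8 width tier.
samples = [
    ("A", 1),           # ASCII: 1 byte
    ("\u00e9", 2),      # é: 2 bytes (many European/Middle Eastern scripts)
    ("\u6c34", 3),      # 水: 3 bytes (most modern writing systems)
    ("\U0001f600", 4),  # 😀: 4 bytes (emoji and other supplementary planes)
]

for ch, width in samples:
    assert len(ch.encode("utf-8")) == width
    print(ch, "->", width, "byte(s)")
```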
Layer 2: Code Points
Unicode's core idea is almost boring:
Give every character an abstract number.
That's a code point.
A is U+0041. The Arabic letter alef is U+0627. The Chinese character for water is U+6C34. A snowflake is U+2744.
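Code points are just numbers, so they can be inspected directly. In Python, for instance, ord() returns the code point of a one-character string:

```python
# ord() maps a single character to its Unicode code point.
print(hex(ord("A")))       # 0x41
print(hex(ord("\u0627")))  # 0x627   (Arabic alef)
print(hex(ord("\u6c34")))  # 0x6c34  (水)
print(hex(ord("\u2744")))  # 0x2744  (snowflake)
```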
This is the level most developers think they're working at when they say "character." But a code point is not the same thing as bytes, and it's not the same thing as what a human sees on screen.
A code point answers:
- What symbol is this?
It does not answer:
- How many bytes does it take in memory?
- How many visible characters will a user perceive?
That split is where most Unicode confusion starts.
The é Problem
Take the letter é.
It can be represented in Unicode two different ways:
U+00E9 -> é
U+0065 U+0301 -> e + combining acute accent
Visually, they're the same.
Under the hood, they are different sequences.
So now all the "simple" operations stop being simple:
- Equality checks can fail
- String lengths can differ
- Search can miss identical-looking text
- Cursor movement can behave strangely
This is not a rendering bug. It's a modeling bug. Your code assumed one visible character always equals one code point. Unicode does not make that promise.
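The two representations really are different sequences, and string operations see them that way. A short Python demonstration:

```python
precomposed = "\u00e9"   # é as one code point (U+00E9)
decomposed = "e\u0301"   # e + combining acute accent (U+0065 U+0301)

# Visually identical, but not equal as sequences of code points.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2
```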
Layer 3: Grapheme Clusters
A grapheme cluster is what a human reader experiences as one character.
Sometimes that's one code point. Sometimes it's several code points working together.
The é example already proves it. One visible unit can be either:
- one precomposed code point, or
- two code points: base letter + combining mark
Emoji make the same idea impossible to ignore.
The family emoji 👨‍👩‍👧‍👦 is not one atomic symbol in storage. It's a sequence:
man + ZWJ + woman + ZWJ + girl + ZWJ + boy
The zero-width joiner (ZWJ) is invisible glue. It tells the renderer to combine neighboring code points into one displayed unit.
So the same string now has three perfectly valid measurements:
- 25 bytes in UTF-8
- 7 code points in Unicode
- 1 grapheme cluster on screen
If your app limits usernames by bytes, that's one answer.
If your parser iterates code points, that's another answer.
If your text editor moves by user-visible characters, that's a third answer.
The number isn't wrong. The level is.
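The first two measurements can be sketched with Python's standard library alone; the grapheme-cluster count needs a third-party library, so it's left as a comment:

```python
# Family emoji: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family.encode("utf-8")))  # 25 bytes in UTF-8
print(len(family))                  # 7 code points (Python str counts code points)

# Grapheme clusters: 1. Python's stdlib has no grapheme iterator;
# the third-party `regex` module's \X pattern or the `grapheme`
# package can count user-perceived characters.
```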
Why String APIs Feel Inconsistent
Developers often think text APIs are inconsistent because Unicode is complicated. The real issue is that different APIs are answering different questions.
One API is counting bytes because it cares about storage.
Another is counting code points because it cares about encoded symbols.
Another is moving over grapheme clusters because it cares about what a user sees.
They're not disagreeing. They're working at different layers.
Once you see the stack clearly, a lot of "Unicode weirdness" stops being weird:
- UTF-8 length bugs are byte-level bugs
- é != é bugs are normalization bugs
- Broken cursor movement is a grapheme-cluster bug
- Emoji limits exploding in databases are "you counted the wrong layer" bugs
Normalization Is Not Optional
Because Unicode allows multiple valid representations of the same visible text, serious text processing usually needs normalization.
The two common forms are:
- NFC: prefer single precomposed code points where possible
- NFD: decompose into base characters plus combining marks
If two strings need to compare equal, normalize them to the same form first.
Without that step, you're trusting visually identical text to also be byte-identical. That's not safe.
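In Python, for example, the standard library's unicodedata module applies these forms:

```python
import unicodedata

a = "\u00e9"   # precomposed é
b = "e\u0301"  # decomposed: e + combining acute accent

# Raw comparison fails: same visible text, different sequences.
assert a != b

# Normalizing both sides to the same form makes them compare equal.
assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```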
The Real Mental Model
Unicode is not "a bigger ASCII."
It's a layered model:
- Bytes — how text is stored
- Code points — the abstract symbols Unicode defines
- Grapheme clusters — what a human actually perceives as one character
Most production bugs happen when code silently swaps one layer for another.
You ask for "character count."
The runtime gives you code points.
The product manager means user-visible characters.
The database limit is actually bytes.
Now everyone is technically correct, and the software is still broken.
That's Unicode in one sentence:
Text has multiple valid lengths because text has multiple layers.
And once you internalize that, string handling stops feeling arbitrary. It starts feeling precise.