https://www.youtube.com/watch?v=Z_LQa_NeA8w
One emoji can have three different lengths at the same time.
In UTF-8 bytes, the family emoji 👨‍👩‍👧‍👦 is 25. In Unicode code points, it's 7. In grapheme clusters, it's 1.
All three answers are correct.
That's the bug waiting underneath almost every piece of text handling code: we keep asking "how long is this string?" as if the question only has one meaning.
Unicode exists because text actually lives in three different layers.
Layer 1: Bytes
At the bottom, computers only store bytes.
ASCII was the first successful shared mapping: a byte value for A, a byte value for z, a byte value for space. It was clean, simple, and completely insufficient. ASCII only gave you 128 slots. Enough for English. Not enough for the world.
So every region built its own encoding table. Shift-JIS in Japan. KOI8 in Russia. Latin-1 in Western Europe. Each worked locally. None agreed globally. Move text between systems and you got mojibake — garbage symbols where words should be.
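Mojibake is easy to reproduce: encode text under one table and decode it under another. A minimal Python sketch:

```python
text = "café"

# Stored as UTF-8 bytes: 'é' becomes the two-byte sequence 0xC3 0xA9.
raw = text.encode("utf-8")

# A system that assumes Latin-1 reads each of those bytes as its own
# character, so the single 'é' turns into two garbage symbols.
garbled = raw.decode("latin-1")

print(garbled)  # cafÃ©
```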
Unicode fixed the agreement problem by separating character identity from byte storage.
UTF-8 fixed the storage problem by keeping ASCII as one byte and expanding only when needed:
- 1 byte for ASCII
- 2 bytes for many European and Middle Eastern scripts
- 3 bytes for most modern writing systems
- 4 bytes for everything else, including emoji
That's why UTF-8 won. English stays compact. Old ASCII files still work. And the encoding can represent the full Unicode space.
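Those tiers can be checked directly, since encoding a string exposes its byte width. A quick Python sketch (using escape sequences so the exact code points are unambiguous):

```python
# One character from each UTF-8 width tier.
samples = [
    ("A", 1),           # ASCII: 1 byte
    ("\u00e9", 2),      # é: 2 bytes (many European/Middle Eastern scripts)
    ("\u6c34", 3),      # 水: 3 bytes (most modern writing systems)
    ("\U0001f600", 4),  # 😀: 4 bytes (emoji and other supplementary planes)
]

for ch, width in samples:
    assert len(ch.encode("utf-8")) == width
    print(ch, "->", width, "byte(s)")
```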
Layer 2: Code Points
Unicode's core idea is almost boring:
Give every character an abstract number.
That's a code point.
A is U+0041. The Arabic letter alef is U+0627. The Chinese character for water is U+6C34. A snowflake is U+2744.
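Code points are just numbers, so they can be inspected directly. In Python, for instance, ord() returns the code point of a one-character string:

```python
# ord() maps a single character to its Unicode code point.
print(hex(ord("A")))       # 0x41
print(hex(ord("\u0627")))  # 0x627   (Arabic alef)
print(hex(ord("\u6c34")))  # 0x6c34  (水)
print(hex(ord("\u2744")))  # 0x2744  (snowflake)
```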
This is the level most developers think they're working at when they say "character." But a code point is not the same thing as bytes, and it's not the same thing as what a human sees on screen.
A code point answers:
- What symbol is this?
It does not answer:
- How many bytes does it take in memory?
- How many visible characters will a user perceive?
That split is where most Unicode confusion starts.
The é Problem
Take the letter é.
It can be represented in Unicode two different ways:
U+00E9 -> é
U+0065 U+0301 -> e + combining acute accent
Visually, they're the same.
Under the hood, they are different sequences.
So now all the "simple" operations stop being simple:
- Equality checks can fail
- String lengths can differ
- Search can miss identical-looking text
- Cursor movement can behave strangely
This is not a rendering bug. It's a modeling bug. Your code assumed one visible character always equals one code point. Unicode does not make that promise.
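The two representations really are different sequences, and string operations see them that way. A short Python demonstration:

```python
precomposed = "\u00e9"   # é as one code point (U+00E9)
decomposed = "e\u0301"   # e + combining acute accent (U+0065 U+0301)

# Visually identical, but not equal as sequences of code points.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2
```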
Layer 3: Grapheme Clusters
A grapheme cluster is what a human reader experiences as one character.
Sometimes that's one code point. Sometimes it's several code points working together.
The é example already proves it. One visible unit can be either:
- one precomposed code point, or
- two code points: base letter + combining mark
Emoji make the same idea impossible to ignore.
The family emoji 👨‍👩‍👧‍👦 is not one atomic symbol in storage. It's a sequence:
man + ZWJ + woman + ZWJ + girl + ZWJ + boy
The zero-width joiner (ZWJ) is invisible glue. It tells the renderer to combine neighboring code points into one displayed unit.
So the same string now has three perfectly valid measurements:
- 25 bytes in UTF-8
- 7 code points in Unicode
- 1 grapheme cluster on screen
If your app limits usernames by bytes, that's one answer.
If your parser iterates code points, that's another answer.
If your text editor moves by user-visible characters, that's a third answer.
The number isn't wrong. The level is.
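The first two measurements can be sketched with Python's standard library alone; the grapheme-cluster count needs a third-party library, so it's left as a comment:

```python
# Family emoji: man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family.encode("utf-8")))  # 25 bytes in UTF-8
print(len(family))                  # 7 code points (Python str counts code points)

# Grapheme clusters: 1. Python's stdlib has no grapheme iterator;
# the third-party `regex` module's \X pattern or the `grapheme`
# package can count user-perceived characters.
```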
Why String APIs Feel Inconsistent
Developers often think text APIs are inconsistent because Unicode is complicated. The real issue is that different APIs are answering different questions.
One API is counting bytes because it cares about storage.
Another is counting code points because it cares about encoded symbols.
Another is moving over grapheme clusters because it cares about what a user sees.
They're not disagreeing. They're working at different layers.
Once you see the stack clearly, a lot of "Unicode weirdness" stops being weird:
- UTF-8 length bugs are byte-level bugs
- é != é bugs are normalization bugs
- Broken cursor movement is a grapheme-cluster bug
- Emoji limits exploding in databases are "you counted the wrong layer" bugs
Normalization Is Not Optional
Because Unicode allows multiple valid representations of the same visible text, serious text processing usually needs normalization.
The two common forms are:
- NFC: prefer single precomposed code points where possible
- NFD: decompose into base characters plus combining marks
If two strings need to compare equal, normalize them to the same form first.
Without that step, you're trusting visually identical text to also be byte-identical. That's not safe.
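In Python, for example, the standard library's unicodedata module applies these forms:

```python
import unicodedata

a = "\u00e9"   # precomposed é
b = "e\u0301"  # decomposed: e + combining acute accent

# Raw comparison fails: same visible text, different sequences.
assert a != b

# Normalizing both sides to the same form makes them compare equal.
assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```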
The Real Mental Model
Unicode is not "a bigger ASCII."
It's a layered model:
- Bytes — how text is stored
- Code points — the abstract symbols Unicode defines
- Grapheme clusters — what a human actually perceives as one character
Most production bugs happen when code silently swaps one layer for another.
You ask for "character count."
The runtime gives you code points.
The product manager means user-visible characters.
The database limit is actually bytes.
Now everyone is technically correct, and the software is still broken.
That's Unicode in one sentence:
Text has multiple valid lengths because text has multiple layers.
And once you internalize that, string handling stops feeling arbitrary. It starts feeling precise.