Your String is Not What You Think It Is
A Tour Through the Encoding Wars, and Why len("café") Returns 4
Reading time: ~13 minutes
You called len("café") and Python told you 4. You passed that string to a function that encoded it to bytes. The bytes were 5 long. You stared at the screen for longer than you'd admit.
Then you got a bug report from a user in Brazil whose name broke your database. Your colleague on a Windows machine opened the CSV you exported and saw Ã© where there should have been é. You fixed it by guessing — add .encode('utf-8') here, .decode('utf-8') there — and it stopped crashing.
But if someone asked you why, the honest answer is probably: "Something about encodings." Let's fix that gap.
In Pressing a Key, I traced a keypress from the keyboard matrix to your shell's stdin. The scan code 0x04 became the letter a somewhere in the stack. But what is the letter a? It turns out the answer is deeper than you'd expect.
The Core Confusion
Here's the thing that trips everyone up first. There is no such thing as "plain text." 🤯
Every string you handle — in Python, in your database, in your terminal, in the HTTP response your server sends — is encoded. There is no neutral format. The sequence of bits that represents the letter é depends entirely on which encoding you and the other party agreed to use.
The bytes don't tell you their encoding. The encoding is an agreement, and when the two sides of that agreement disagree, you get garbage, or, worse, occasional weirdness.
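You can watch the agreement break in a Python REPL. Here's a quick sketch: the same two bytes, decoded under two different assumptions:

```python
# The same two bytes mean different things under different encodings.
data = bytes([0xC3, 0xA9])

# Decoded as UTF-8: one character, é (U+00E9).
print(data.decode('utf-8'))    # é

# Decoded as Latin-1: two characters, each byte taken literally.
print(data.decode('latin-1'))  # Ã©
```

No error either way. Both decodings are "valid" — only the agreement tells you which one is right.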
That's the whole post, really. But the history is why this mess exists, and the history is actually kind of great.
1963: Everything is Fine
ASCII — the American Standard Code for Information Interchange — was published in 1963. It was designed for English text on teleprinters, and for that job, it's perfect.
128 characters. 7 bits. Letters A–Z, digits 0–9, punctuation, and 33 control codes (things like "ring the bell," "move to the next line," and "move back one space"). Every character fits in a single byte, with one bit to spare.
For American engineers talking to American computers, ASCII was everything they needed. And for about two decades, most software was written by American engineers, for American computers, talking to each other. A closed loop. Perfectly adequate.
Then the rest of the world wanted computers. Funny how that works.
The Encoding Wars (1970s–1990s)
If ASCII uses 7 bits, that 8th bit in a byte is free. So naturally, everyone used it. Differently.
ISO 8859-1 (Latin-1) used it to add accented characters for Western European languages. Code page 437 was the IBM PC standard, with box-drawing characters that powered every DOS UI ever made. Windows-1252 was Microsoft's take on Latin-1, slightly different. JIS X 0208 covered Japanese. GB 2312 covered Simplified Chinese — though China later mandated GB 18030, a different standard that covers the full Unicode range. KOI8-R covered Russian. Big5 handled Traditional Chinese with a completely different approach. 🫠
By the 1990s, there were hundreds of encodings. A byte sequence had no intrinsic meaning without knowing which encoding produced it.
This was the encoding wars: every platform, every country, every vendor with its own system, and data that passed between them would silently corrupt.
The result was mojibake — a Japanese term for the garbled text you get when you decode bytes with the wrong encoding. é becomes Ã©. 한국어 becomes ???. Your file that looked fine on one machine is incomprehensible on another. Anyone over 40 has seen this IRL.
The fix had to be systematic. It had to cover every character in every writing system. It had to be one thing, not hundreds.
Unicode: One Ring to Rule Them All
Unicode is not an encoding. Read that twice. It's the most important sentence in this post.
Unicode is a catalog. It assigns every character in every writing system a unique number called a code point. That's all. It doesn't specify how to store those numbers in bytes. It says: the letter A is U+0041. The letter é is U+00E9. The Korean character 한 is U+D55C. The emoji¹ 🐙 is U+1F419 (yes, we have Unicode code points for an octopus glyph).
As of Unicode 16.0, there are over 154,000 assigned code points (this grows with each release — check unicode.org for the current count). They cover Latin, Cyrillic, Arabic, Hebrew, CJK ideographs, emoji, ancient scripts, musical notation, and an extraordinary number of symbols for things you didn't know needed standardizing.
>>> ord('A')
65 # 0x0041 — same in ASCII and Unicode
>>> ord('é')
233 # 0xE9
>>> ord('한')
54620 # 0xD55C
>>> ord('🐙')
128025 # 0x1F419
Unicode says: é is code point 233. But 233 still has to become bytes somehow, because bytes are what your disk stores, what your network transmits, what your terminal displays.
That conversion is encoding, and Unicode has several: UTF-32, UTF-16, and UTF-8. They all represent the same catalog of characters, but they store the numbers differently.
UTF-8: The Clever Variable-Width Trick
UTF-8 won. It's what the web runs on, what Linux runs on, what almost everything new uses. Ken Thompson (who also co-created Unix — you'll meet him again in File Descriptors) and Rob Pike designed it in 1992 — allegedly on the back of a placemat in a New Jersey diner — with one decisive constraint: ASCII compatibility. That single decision is why it beat everything else.
UTF-32 is the brute-force approach: 4 bytes per code point, always. U+0041 (A) stores as 00 00 00 41. Works fine, wastes enormous amounts of memory for any text that's mostly ASCII, and is incompatible with every ASCII-native tool ever written. Nobody wants this.
UTF-8 is variable-width. Here's the elegant part: code points below 128 — the entire ASCII range — are stored as a single byte, identical to their ASCII encoding. Code point 65 (A) is still 0x41. Your old ASCII files are already valid UTF-8. You didn't have to do anything. That backward compatibility is the reason UTF-8 won the Unix/Linux world: you could adopt it without breaking a single existing tool.
For everything above 127, the first byte's leading bits tell the decoder how many bytes to read. The remaining bits carry the actual code point — the payload:

0xxxxxxx                              → 1 byte,  7 payload bits (ASCII)
110xxxxx 10xxxxxx                     → 2 bytes, 11 payload bits
1110xxxx 10xxxxxx 10xxxxxx            → 3 bytes, 16 payload bits
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   → 4 bytes, 21 payload bits
See it? The prefix is a unary count — the number of leading 1 bits tells you the total byte count. A 0 means one byte (ASCII). 110 means two. 1110 means three. 11110 means four. Every continuation byte starts with 10, which means a decoder can jump into the middle of a stream and find the next character boundary by scanning for a byte that doesn't start with 10. That's not an accident. That's design.
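Here's a sketch of that boundary-scanning property in Python (the helper name is mine, not a standard API). Continuation bytes always match the bit pattern 10xxxxxx, so character starts are exactly the bytes that don't:

```python
def char_boundaries(data: bytes) -> list[int]:
    """Offsets where UTF-8 characters start: every byte
    that is NOT a continuation byte (10xxxxxx)."""
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

data = "café".encode('utf-8')   # b'caf\xc3\xa9'
print(char_boundaries(data))    # [0, 1, 2, 3] — é starts at offset 3
```

Drop into the middle of any UTF-8 stream and this finds the next character boundary in at most three steps. That's the design paying off.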
And here's a trick that could make this fast. Modern CPUs have a "count leading zeros" instruction (CLZ, or LZCNT on x86). Flip the bits, count the leading zeros, and you have your byte count:
// Extract the UTF-8 byte count from the first byte.
// ~byte flips the bits: 11110xxx → 00001xxx, so the leading-one
// prefix becomes a run of leading zeros that CLZ can count.
#include <stdint.h>

int utf8_byte_count(uint8_t byte) {
    if (byte < 0x80) return 1;                    // ASCII: 0xxxxxxx
    return __builtin_clz(~(uint32_t)byte << 24);  // flip, shift to top, count
}
One bitwise NOT, one shift, one hardware instruction. Elegant — but in practice, most production UTF-8 libraries don't use this. Rust's standard library uses a 256-byte lookup table. So does Git's utf8.c. Lookup tables are simpler, branchless, and the table lives in L1 cache after the first access. The real speed demons like simdutf skip scalar decoding entirely and classify 16–64 bytes at once using SIMD vector comparisons. The point isn't which approach wins — it's that Thompson and Pike designed the prefix structure so that all of these approaches work. The encoding cooperates with the hardware.
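As an illustration of the lookup-table approach — a sketch, not the actual table layout Rust or Git uses — the 256 entries fall out directly from the prefix ranges (ignoring edge cases like the overlong-encoding bytes 0xC0/0xC1, which real validators also reject):

```python
# Sequence length for each possible first byte; 0 marks bytes that
# can never start a character (continuation bytes, invalid bytes).
UTF8_LEN = ([1] * 128      # 0x00-0x7F: ASCII
          + [0] * 64       # 0x80-0xBF: continuation bytes
          + [2] * 32       # 0xC0-0xDF: 2-byte lead
          + [3] * 16       # 0xE0-0xEF: 3-byte lead
          + [4] * 8        # 0xF0-0xF7: 4-byte lead
          + [0] * 8)       # 0xF8-0xFF: invalid in UTF-8

print(UTF8_LEN[0x41])  # 1 — 'A'
print(UTF8_LEN[0xC3])  # 2 — first byte of é
print(UTF8_LEN[0xF0])  # 4 — first byte of 🐙
```

One array index per character, no branches, no bit tricks. Boring wins.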
é is U+00E9, which lands in the 2-byte range. The prefix 110 says "two bytes, five payload bits in this byte." The continuation 10 says "six more payload bits." Together: 0xC3 0xA9. That's why "café".encode('utf-8') gives you 5 bytes — three ASCII letters at 1 byte each, plus é at 2.
And our friend 🐙? U+1F419. That's above U+10000, so it hits the 4-byte row: prefix 11110, then three continuation bytes. 0xF0 0x9F 0x90 0x99. Four bytes for one tentacled glyph.
>>> s = "café"
>>> len(s)
4 # 4 Unicode code points
>>> b = s.encode('utf-8')
>>> b
b'caf\xc3\xa9'
>>> len(b)
5 # 5 bytes
len() in Python counts code points, not bytes. This is correct behavior, but it means "string length" and "byte length" are different things, and conflating them is a bug waiting to happen. That's why your database column that should hold 100 characters might reject a 60-character string containing emoji — the byte count exceeds the column's byte-width limit.
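To make the bit-packing concrete, here's a sketch that assembles the two UTF-8 bytes for a code point in the 2-byte range by hand (the function name is mine; it only handles U+0080 through U+07FF):

```python
def encode_2byte(cp: int) -> bytes:
    """Pack a code point from the 2-byte range (U+0080..U+07FF)
    into UTF-8: 110xxxxx 10xxxxxx."""
    assert 0x80 <= cp <= 0x7FF
    byte1 = 0b11000000 | (cp >> 6)          # prefix 110 + top 5 payload bits
    byte2 = 0b10000000 | (cp & 0b00111111)  # prefix 10  + low 6 payload bits
    return bytes([byte1, byte2])

print(encode_2byte(0xE9).hex())                   # c3a9
print(encode_2byte(0xE9) == 'é'.encode('utf-8'))  # True
```

Same shifts and masks, same answer as Python's built-in encoder: 0xC3 0xA9.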
The é That Isn't the Same é
Here's where it gets properly weird.
Unicode has a concept called combining characters. Instead of representing é as a single code point (U+00E9, "Latin Small Letter E with Acute"), you can represent it as two: the base letter e (U+0065) followed by a combining acute accent (U+0301).
Both are valid Unicode. Both look identical on screen. Copy-paste either one and you can't tell which is which.
They are not equal.
>>> composed = "é" # U+00E9, one code point
>>> decomposed = "e\u0301" # e + combining accent, two code points
>>> composed == decomposed
False
>>> len(composed)
1
>>> len(decomposed)
2
(The decomposed form is written as "e\u0301" here because both strings look identical on a rendered page — that's the point. Copy both forms into Python and check len() if you don't believe it.)
Text coming from one source might use composed forms. Text from another source might use decomposed forms. String comparison fails silently.
The fix is Unicode normalization — specifically, NFC (Canonical Decomposition followed by Canonical Composition) collapses decomposed sequences back into composed forms before comparison.
>>> import unicodedata
>>> unicodedata.normalize('NFC', decomposed) == composed
True
That's why your input validation, deduplication logic, and search index have a latent bug if they compare strings without normalizing first. Two strings that look identical can be unequal. No error. No exception. Just a False where you expected True, and a user who swears they typed it correctly. They did.
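A minimal comparison helper that bakes the fix in (the name nfc_equal is mine, not a standard API):

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare strings after NFC normalization, so composed and
    decomposed forms of the same text count as equal."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

composed = "caf\u00e9"     # café, é as one code point
decomposed = "cafe\u0301"  # café, é as e + combining accent
print(composed == decomposed)           # False
print(nfc_equal(composed, decomposed))  # True
```

Route every user-facing string comparison through something like this and the "identical strings that aren't equal" class of bug disappears.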
Why Emoji Are 4 Bytes
The emoji 🐙 is U+1F419. That's above U+FFFF, which means it needs 4 bytes in UTF-8.
This surprises people. "It's a modern character, surely the encoding handles it." It does — by using more bytes. The encoding was designed before emoji existed, and the 4-byte range was always there for code points above the Basic Multilingual Plane (the first 65,536 code points). Emoji just happen to live there.
>>> "🐙".encode('utf-8')
b'\xf0\x9f\x90\x99' # 4 bytes
>>> len("🐙")
1 # 1 code point (Python 3)
Python 3 handles this correctly. Python 2 would either represent emoji as two "surrogate" code points or explode, depending on how it was compiled. Just remember: friends don't let friends use Python 2.
When JavaScript Lies About String Length
JavaScript's String.length counts UTF-16 code units, not code points. For Basic Multilingual Plane characters, a UTF-16 code unit is one 16-bit value. For code points above U+FFFF — including most emoji — UTF-16 uses a surrogate pair: two 16-bit values.
"🇬🇧".length // 4
That flag emoji is actually two separate emoji combined: the regional indicator letters G (U+1F1EC) and B (U+1F1E7). Each is above U+FFFF, so each takes a surrogate pair in UTF-16. Two surrogate pairs = four UTF-16 code units = length of 4. One grapheme. One visible symbol. Length 4.
That's why str.length in JavaScript is a lie for emoji and combining characters. "Number of characters" is actually three different questions:
- How many bytes?
- How many Unicode code points?
- How many user-perceived characters (grapheme clusters)?
The answers are all different for any text involving non-ASCII. Use [...str].length in JavaScript for code-point-aware length, or a library like graphemer for full grapheme cluster support.
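The same three questions can be answered in Python. The UTF-16 code-unit count below mirrors what JavaScript's .length reports; grapheme-cluster counting needs a third-party library, so it's omitted here:

```python
s = "\U0001F1EC\U0001F1E7"  # 🇬🇧: regional indicators G + B

print(len(s.encode('utf-8')))           # 8 — bytes
print(len(s))                           # 2 — code points
print(len(s.encode('utf-16-le')) // 2)  # 4 — UTF-16 code units (JS .length)
```

Three different numbers for one visible flag. Pick the one your actual question needs.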
MySQL's utf8 Is Not UTF-8
MySQL has a character set called utf8. You'd think that means UTF-8. It does not.
MySQL's utf8 stores up to 3 bytes per character. Real UTF-8 needs up to 4. So MySQL's utf8 silently discards or errors (depending on your SQL mode) on any code point in the 4-byte range — which includes most emoji, some Chinese characters, and a handful of historic scripts.
The actual UTF-8 character set in MySQL is called utf8mb4. If you have a MySQL database with utf8 columns, and a user tries to set a display name that includes an emoji, you'll get either a silent truncation (non-strict mode) or Incorrect string value: '\xF0\x9F...' (strict mode).
The good news: MySQL finally fixed this. As of MySQL 8.0, utf8mb4 is the default character set. In MySQL 9.6, utf8 is officially deprecated as an alias for the broken 3-byte variant (utf8mb3), and future versions will make utf8 an alias for utf8mb4 instead. The 20-year naming mistake is being unwound.
The bad news: if you're maintaining anything that was created before MySQL 8.0 — and statistically, you are — the columns are still utf8mb3 under the hood. Migrating them to utf8mb4 means ALTER TABLE on every affected table, which takes a lock and rewrites the data. On a large table, that's a maintenance window. On a large table with foreign keys, that's a weekend.
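Before scheduling that window, it helps to know which existing data would actually break. A sketch (the helper name is mine): utf8mb3 can only store code points up to U+FFFF, i.e. nothing above the Basic Multilingual Plane:

```python
def survives_utf8mb3(s: str) -> bool:
    """True if every code point fits in MySQL's legacy 3-byte utf8
    (utf8mb3), i.e. nothing above the Basic Multilingual Plane."""
    return all(ord(c) <= 0xFFFF for c in s)

print(survives_utf8mb3("café 한국어"))  # True  — all within the BMP
print(survives_utf8mb3("nice 🐙"))      # False — U+1F419 needs 4 bytes
```

Run something like this over incoming data and you find the 4-byte code points before the database does.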
The BOM
BOM stands for Byte Order Mark. It's the code point U+FEFF, placed at the very beginning of a file or stream.
It was designed for UTF-16, where byte order matters. In big-endian UTF-16, the bytes FE FF mean BOM. In little-endian UTF-16, the bytes are reversed: FF FE. A decoder can read the first two bytes and know which byte order to use.
UTF-8 has no byte order — UTF-8's variable-length encoding makes byte order irrelevant — but some systems (particularly anything involving Windoze) prepend a UTF-8 BOM anyway: the three bytes EF BB BF.
This causes problems. A UTF-8 BOM is invisible in most text editors, but it's physically there. If you read that file and look at the first character, it's not what you think:
>>> with open('bom_file.txt', 'rb') as f:
... f.read(4)
b'\xef\xbb\xbfH' # BOM + "H" from "Hello"
CSV parsers that don't handle UTF-8 BOM will silently include \ufeff as the first column name. SQL scripts that begin with a BOM will fail to parse. Headers parsed with the BOM character attached will fail equality checks.
The BOM is a compatibility kludge that keeps causing fresh pain. If you're generating text files and you can avoid the BOM, avoid it. If you're reading files that might have one, use utf-8-sig in Python, which strips it transparently:
>>> open('bom_file.txt', encoding='utf-8-sig').read(1)
'H' # BOM stripped
The Encoding is Not in the Bytes
Here's the thing that ties everything together.
There is nothing in the bytes 0xC3 0xA9 that says "I am UTF-8." That's true. But there's also nothing that says "I am Latin-1." In Latin-1, 0xC3 0xA9 decodes as Ã© — two characters, not one.
The bytes are just bytes. The encoding is the agreement.
That's why Content-Type: text/html; charset=utf-8 exists. When you open a file, something outside the bytes has to declare the encoding. Maybe it's a BOM (fragile, optional). Maybe it's an HTTP Content-Type header. Maybe it's a <meta charset="utf-8"> tag. Maybe it's database column metadata. Maybe it's oral tradition passed down from the developer who wrote the export script in 2009 and has since left the company, the country, and possibly this plane of existence. Without that declaration, the bytes have no intrinsic meaning.
When that agreement breaks — when the sender wrote Latin-1 and the receiver assumed UTF-8, or vice versa — you get mojibake. Not an error. Not a crash. Silent garbage, rendered faithfully.
This is why "it works on my machine" is a whole genre of encoding bugs — and why everyone hates that guy. (He says it for everything. It's never helpful. It's especially not helpful here.) Your machine, your editor, and your database are all configured to agree. Add one party — an API, a CSV from a client, a legacy data import — that was configured differently, and the silent corruption begins.
Practical Rules
A few things that would have saved me considerable time:
Be explicit at every I/O boundary. When you read bytes (files, sockets, database results), decode them to strings immediately, explicitly naming the encoding. When you write strings, encode them explicitly. The middle of your code should be working with strings, not bytes.
# Explicit is better than implicit
with open('data.csv', encoding='utf-8') as f:
content = f.read() # str, not bytes
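The same rule applies on the write side — a minimal sketch (filename and contents are illustrative):

```python
# Encode explicitly on the way out, too — don't rely on the platform
# default encoding (which varies, especially on older Windows setups).
with open('out.csv', 'w', encoding='utf-8', newline='') as f:
    f.write("name,city\nJosé,São Paulo\n")
```

If both ends of every I/O boundary name their encoding, the agreement can't silently drift.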
Always use UTF-8. Unless you're reading a file you didn't produce, you have no reason to use anything else. Latin-1 is for legacy systems. Windows-1252 is for legacy systems. If you're creating something new, it's UTF-8.
Normalize before comparing. If your strings come from multiple sources and you're comparing them, run them through unicodedata.normalize('NFC', s) first. This is especially relevant for names, search indices, and deduplication logic.
MySQL: use utf8mb4. And utf8mb4_unicode_ci for case-insensitive comparison. Learn this before a production incident involving Korean user names.
JavaScript: use [...str].length for code-point-aware length. It handles surrogate pairs correctly but still won't count grapheme clusters (flag emoji¹, skin tone sequences). For that, use Intl.Segmenter or a library like graphemer. Either way, str.length is a lie.
Quick Recap
- ASCII: 128 characters, 7 bits, works beautifully for English, nothing else
- The encoding wars: hundreds of competing 8-bit extensions, incompatible everywhere
- Unicode: a catalog assigning every character a code point — not an encoding
- UTF-8: variable-width encoding, backward-compatible with ASCII, designed by Ken Thompson and Rob Pike in 1992 — backward compatibility is why it won
- len("café") is 4 because Python counts code points; len(b"caf\xc3\xa9") is 5 because bytes are bytes (Capt. Tautology strikes 🫡!)
- Two strings can look identical and fail == due to composed vs. decomposed forms
- Emoji need 4 UTF-8 bytes; JavaScript's .length counts UTF-16 code units; MySQL's legacy utf8 was 3-byte only (fixed in 8.0+)
- The BOM exists because of UTF-16 byte order; in UTF-8 it's a historical nuisance
- The encoding is never in the bytes — it's an agreement between sender and receiver
What's Next
You now know how bytes become characters. But when your program reads those bytes from a file, or writes them to a socket, or pipes them between processes — those operations all flow through a single abstraction: a number. An integer. A file descriptor.
Next up in File Descriptors: The Integers That Run Your System, I'm going to show you the table behind every open(), every read(), every socket you've ever created. It's simpler than you think, and it explains half the Unix mysteries you've ever encountered.
Further Reading
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets — Joel Spolsky's 2003 post. Still the best introduction. Read it.
- Python unicodedata docs — Normalization, category lookups, everything Python exposes about Unicode
- UTF-8 Everywhere Manifesto — The case for why UTF-8 should be the one encoding, and how to get there
- man charsets — The Linux man page for character sets, if you want the systems-level perspective
I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →
Naz Quadri has mass-produced more encoding bugs than any developer should by copying CSVs between Windows and Linux. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.
¹ The word "emoji" comes from Japanese 絵文字 (e = picture, moji = character). It is both singular and plural. "Emojis" is the linguistic equivalent of "sheeps." I will die on this hill. Alone, probably. ↩





