A few years ago I was building an API that accepted user input in multiple languages. Everything worked fine in English. Then a user submitted a form in Japanese and the database stored garbage characters. The classic mojibake problem. I had assumed UTF-8 everywhere but one middleware component was silently converting to Latin-1, which cannot represent Japanese characters. Fixing it took ten minutes. Finding it took two days.
Understanding how text becomes binary -- and how that binary becomes text again -- is fundamental to avoiding an entire category of bugs that are notoriously difficult to debug.
ASCII: Where It Started
ASCII (American Standard Code for Information Interchange) was published in 1963. It maps 128 characters to 7-bit binary numbers:
| Character | Decimal | Binary  |
|-----------|---------|---------|
| A         | 65      | 1000001 |
| B         | 66      | 1000010 |
| Z         | 90      | 1011010 |
| a         | 97      | 1100001 |
| 0         | 48      | 0110000 |
| space     | 32      | 0100000 |
| newline   | 10      | 0001010 |
Some useful patterns to notice:
- Uppercase letters start at 65, lowercase at 97. The difference is exactly 32, which is a single bit flip (bit 5). This is not a coincidence -- it was designed this way so case conversion is a single bitwise operation.
- Digits 0-9 start at 48. To convert an ASCII digit to its numeric value, subtract 48 (or equivalently, AND with 0x0F).
- Control characters (0-31) handle things like tab (9), newline (10), carriage return (13), and escape (27).
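These bit patterns make case and digit conversion one-liners. A quick Python sketch, using only the ASCII facts listed above:

```python
# Case conversion: upper and lower case differ only in bit 5 (value 32),
# so flipping that bit with XOR toggles the case of a letter
print(chr(ord('A') ^ 32))  # 'a'
print(chr(ord('a') ^ 32))  # 'A'

# Digit conversion: ASCII digits start at 48, so subtract 48
# (or equivalently, mask off the high nibble with AND 0x0F)
print(ord('7') - 48)       # 7
print(ord('7') & 0x0F)     # 7
```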
ASCII works perfectly for English. It fails completely for everything else.
The Encoding Explosion
In the 1980s and 1990s, hundreds of incompatible character encodings proliferated. ISO 8859-1 (Latin-1) covered Western European languages. Shift_JIS handled Japanese. Big5 handled Traditional Chinese. KOI8-R handled Russian. Windows-1252 was Microsoft's extended ASCII.
The problem was obvious: if you opened a Shift_JIS document with a Latin-1 decoder, every character would be wrong. There was no reliable way to detect which encoding a file used. Email clients guessed. Browsers guessed. Everyone guessed, and everyone was wrong some percentage of the time.
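The failure is easy to reproduce. A minimal Python sketch of classic mojibake, encoding with one codec and decoding the same bytes with another:

```python
original = "café"
data = original.encode('utf-8')  # b'caf\xc3\xa9'

# Decoding with the wrong codec misreads every non-ASCII byte
print(data.decode('latin-1'))    # 'cafÃ©'  (mojibake)
print(data.decode('utf-8'))      # 'café'   (correct)
```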
UTF-8: The Solution
UTF-8, designed by Ken Thompson and Rob Pike in 1992, solved the encoding problem with an elegant variable-width scheme:
| Unicode Range       | Bytes | Binary Pattern                      |
|---------------------|-------|-------------------------------------|
| U+0000 to U+007F    | 1     | 0xxxxxxx                            |
| U+0080 to U+07FF    | 2     | 110xxxxx 10xxxxxx                   |
| U+0800 to U+FFFF    | 3     | 1110xxxx 10xxxxxx 10xxxxxx          |
| U+10000 to U+10FFFF | 4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The genius of UTF-8:
Backward compatibility. Every ASCII character has the same single-byte representation in UTF-8. A valid ASCII file is automatically a valid UTF-8 file. This meant the entire English-language internet could adopt UTF-8 without changing a single existing file.
Self-synchronizing. If you land on a random byte in a UTF-8 stream, you can tell immediately whether it is a start byte (begins with 0, 110, 1110, or 11110) or a continuation byte (begins with 10). You never need to scan backward more than 3 bytes to find the start of a character.
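That property can be sketched as a small Python helper (a hypothetical `char_start` function, not part of any standard library) that backs up from an arbitrary byte offset to the start byte of the character containing it:

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the start byte of the character containing data[i]."""
    # Continuation bytes match 10xxxxxx, i.e. (byte & 0b11000000) == 0b10000000
    while data[i] & 0xC0 == 0x80:
        i -= 1
    return i

data = "a€".encode('utf-8')  # b'a\xe2\x82\xac' -- '€' spans indices 1..3
print(char_start(data, 3))   # 1: backs up from the last byte of '€'
print(char_start(data, 0))   # 0: 'a' is a single ASCII byte
```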
No null bytes. Except for the actual NUL character (U+0000), UTF-8 never produces a zero byte. This means UTF-8 strings are compatible with C's null-terminated string functions.
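The variable widths are easy to see in Python by encoding one character from each range:

```python
# One character from each UTF-8 width class: U+0041, U+00E9, U+20AC, U+1F600
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), encoded.hex())
# A 1 41
# é 2 c3a9
# € 3 e282ac
# 😀 4 f09f9880
```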
Let us trace the encoding of a specific character: "é", the letter e with an acute accent, used in French:

```
Character: é (U+00E9)
Unicode code point: 233 (decimal) = 11101001 (binary)
This falls in the range U+0080 to U+07FF, so we need 2 bytes.
Template: 110xxxxx 10xxxxxx
Fill in the 11 bits 00011101001:
110 00011  10 101001
= 0xC3 0xA9
In binary: 11000011 10101001
```
That is how a single accented character becomes two bytes of binary data.
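You can confirm the hand calculation in Python:

```python
encoded = "é".encode('utf-8')
print(encoded.hex())                          # 'c3a9'
print(' '.join(f'{b:08b}' for b in encoded))  # '11000011 10101001'
```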
Practical Examples in Code
```python
# Encoding text to binary/bytes
text = "Hello"
binary = text.encode('utf-8')
print(binary)        # b'Hello'
print(list(binary))  # [72, 101, 108, 108, 111]

# Each byte in binary
for byte in binary:
    print(f"{chr(byte)}: {byte:08b}")
# H: 01001000
# e: 01100101
# l: 01101100
# l: 01101100
# o: 01101111
```
```javascript
// In JavaScript, TextEncoder/TextDecoder handle this
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello");
// Uint8Array [72, 101, 108, 108, 111]

const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);
// "Hello"
```
Four Mistakes That Cause Encoding Bugs
1. Assuming one character equals one byte. In UTF-8, characters can be 1-4 bytes. "café".length in JavaScript returns 4, but "cafe\u0301" (with a combining accent) returns 5 even though it displays as 4 characters. And an emoji like "😀" takes 4 bytes in UTF-8 and counts as 2 in JavaScript's .length, which measures UTF-16 code units rather than characters.
2. Truncating byte strings at arbitrary positions. If you cut a UTF-8 string at a byte boundary that falls in the middle of a multi-byte character, you produce invalid UTF-8. Always truncate at character boundaries, not byte boundaries.
3. Using the wrong encoding declaration. If your HTML says `<meta charset="utf-8">` but your server sends `Content-Type: text/html; charset=iso-8859-1`, the browser will use the HTTP header. The mismatch corrupts any non-ASCII characters.
4. Double-encoding. Encoding an already-encoded UTF-8 string as UTF-8 again produces garbage. This happens more often than you would think, especially when data passes through multiple middleware layers.
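Mistakes 2 and 4 are both easy to reproduce. A minimal Python sketch:

```python
data = "café".encode('utf-8')      # b'caf\xc3\xa9', 5 bytes

# Mistake 2: truncating mid-character produces invalid UTF-8.
# data[:4] cuts the two-byte sequence 0xC3 0xA9 in half.
try:
    data[:4].decode('utf-8')
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)

# Mistake 4: double-encoding -- bytes misread as Latin-1, then re-encoded
double = data.decode('latin-1').encode('utf-8')
print(double)                      # b'caf\xc3\x83\xc2\xa9' ('cafÃ©')
```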
When I need to quickly inspect how a string converts to binary representation, or convert binary back to readable text for debugging encoding issues, I use the binary-text converter at zovo.one. It shows the binary representation of each character, which makes encoding problems visible immediately.
I am Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.