A few years ago I was building an API that accepted user input in multiple languages. Everything worked fine in English. Then a user submitted a form in Japanese and the database stored garbage characters. The classic mojibake problem. I had assumed UTF-8 everywhere but one middleware component was silently converting to Latin-1, which cannot represent Japanese characters. Fixing it took ten minutes. Finding it took two days.
Understanding how text becomes binary -- and how that binary becomes text again -- is fundamental to avoiding an entire category of bugs that are notoriously difficult to debug.
ASCII: Where It Started
ASCII (American Standard Code for Information Interchange) was published in 1963. It maps 128 characters to 7-bit binary numbers:
| Character | Decimal | Binary  |
|-----------|---------|---------|
| A         | 65      | 1000001 |
| B         | 66      | 1000010 |
| Z         | 90      | 1011010 |
| a         | 97      | 1100001 |
| 0         | 48      | 0110000 |
| space     | 32      | 0100000 |
| newline   | 10      | 0001010 |
Some useful patterns to notice:
- Uppercase letters start at 65, lowercase at 97. The difference is exactly 32, which is a single bit flip (bit 5). This is not a coincidence -- it was designed this way so case conversion is a single bitwise operation.
- Digits 0-9 start at 48. To convert an ASCII digit to its numeric value, subtract 48 (or equivalently, AND with 0x0F).
- Control characters (0-31) handle things like tab (9), newline (10), carriage return (13), and escape (27).
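These bit patterns make case and digit conversion one-liners. A quick Python sketch, using only the ASCII facts listed above:

```python
# Case conversion: upper and lower case differ only in bit 5 (value 32),
# so flipping that bit with XOR toggles the case of a letter
print(chr(ord('A') ^ 32))  # 'a'
print(chr(ord('a') ^ 32))  # 'A'

# Digit conversion: ASCII digits start at 48, so subtract 48
# (or equivalently, mask off the high nibble with AND 0x0F)
print(ord('7') - 48)       # 7
print(ord('7') & 0x0F)     # 7
```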
ASCII works perfectly for English. It fails completely for everything else.
The Encoding Explosion
In the 1980s and 1990s, hundreds of incompatible character encodings proliferated. ISO 8859-1 (Latin-1) covered Western European languages. Shift_JIS handled Japanese. Big5 handled Traditional Chinese. KOI8-R handled Russian. Windows-1252 was Microsoft's extended ASCII.
The problem was obvious: if you opened a Shift_JIS document with a Latin-1 decoder, every character would be wrong. There was no reliable way to detect which encoding a file used. Email clients guessed. Browsers guessed. Everyone guessed, and everyone was wrong some percentage of the time.
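The failure is easy to reproduce. A minimal Python sketch of classic mojibake, encoding with one codec and decoding the same bytes with another:

```python
original = "café"
data = original.encode('utf-8')  # b'caf\xc3\xa9'

# Decoding with the wrong codec misreads every non-ASCII byte
print(data.decode('latin-1'))    # 'cafÃ©'  (mojibake)
print(data.decode('utf-8'))      # 'café'   (correct)
```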
UTF-8: The Solution
UTF-8, designed by Ken Thompson and Rob Pike in 1992, solved the encoding problem with an elegant variable-width scheme:
| Unicode Range       | Bytes | Binary Pattern                      |
|---------------------|-------|-------------------------------------|
| U+0000 to U+007F    | 1     | 0xxxxxxx                            |
| U+0080 to U+07FF    | 2     | 110xxxxx 10xxxxxx                   |
| U+0800 to U+FFFF    | 3     | 1110xxxx 10xxxxxx 10xxxxxx          |
| U+10000 to U+10FFFF | 4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The genius of UTF-8:
Backward compatibility. Every ASCII character has the same single-byte representation in UTF-8. A valid ASCII file is automatically a valid UTF-8 file. This meant the entire English-language internet could adopt UTF-8 without changing a single existing file.
Self-synchronizing. If you land on a random byte in a UTF-8 stream, you can tell immediately whether it is a start byte (begins with 0, 110, 1110, or 11110) or a continuation byte (begins with 10). You never need to scan backward more than 3 bytes to find the start of a character.
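That property can be sketched as a small Python helper (a hypothetical `char_start` function, not part of any standard library) that backs up from an arbitrary byte offset to the start byte of the character containing it:

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the start byte of the character containing data[i]."""
    # Continuation bytes match 10xxxxxx, i.e. (byte & 0b11000000) == 0b10000000
    while data[i] & 0xC0 == 0x80:
        i -= 1
    return i

data = "a€".encode('utf-8')  # b'a\xe2\x82\xac' -- '€' spans indices 1..3
print(char_start(data, 3))   # 1: backs up from the last byte of '€'
print(char_start(data, 0))   # 0: 'a' is a single ASCII byte
```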
No null bytes. Except for the actual NUL character (U+0000), UTF-8 never produces a zero byte. This means UTF-8 strings are compatible with C's null-terminated string functions.
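The variable widths are easy to see in Python by encoding one character from each range:

```python
# One character from each UTF-8 width class: U+0041, U+00E9, U+20AC, U+1F600
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), encoded.hex())
# A 1 41
# é 2 c3a9
# € 3 e282ac
# 😀 4 f09f9880
```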
Let us trace the encoding of a specific character: "é", the letter e with an acute accent, used in French:

```
Character: é (U+00E9)
Unicode code point: 233 (decimal) = 11101001 (binary)
This falls in the range U+0080 to U+07FF, so we need 2 bytes.
Template: 110xxxxx 10xxxxxx
Fill in the 11 bits 00011101001:
110 00011  10 101001
= 0xC3 0xA9
In binary: 11000011 10101001
```
That is how a single accented character becomes two bytes of binary data.
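You can confirm the hand calculation in Python:

```python
encoded = "é".encode('utf-8')
print(encoded.hex())                          # 'c3a9'
print(' '.join(f'{b:08b}' for b in encoded))  # '11000011 10101001'
```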
Practical Examples in Code
```python
# Encoding text to binary/bytes
text = "Hello"
binary = text.encode('utf-8')
print(binary)        # b'Hello'
print(list(binary))  # [72, 101, 108, 108, 111]

# Each byte in binary
for byte in binary:
    print(f"{chr(byte)}: {byte:08b}")
# H: 01001000
# e: 01100101
# l: 01101100
# l: 01101100
# o: 01101111
```
```javascript
// In JavaScript, TextEncoder/TextDecoder handle this
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello");
// Uint8Array [72, 101, 108, 108, 111]

const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);
// "Hello"
```
Four Mistakes That Cause Encoding Bugs
1. Assuming one character equals one byte. In UTF-8, characters can be 1-4 bytes. "café".length in JavaScript returns 4, but "cafe\u0301" (with a combining accent) returns 5 even though it displays as 4 characters. And an emoji like "😀" takes 4 bytes in UTF-8 and counts as 2 in JavaScript's .length, which measures UTF-16 code units rather than characters.
2. Truncating byte strings at arbitrary positions. If you cut a UTF-8 string at a byte boundary that falls in the middle of a multi-byte character, you produce invalid UTF-8. Always truncate at character boundaries, not byte boundaries.
3. Using the wrong encoding declaration. If your HTML says `<meta charset="utf-8">` but your server sends `Content-Type: text/html; charset=iso-8859-1`, the browser will use the HTTP header. The mismatch corrupts any non-ASCII characters.
4. Double-encoding. Encoding an already-encoded UTF-8 string as UTF-8 again produces garbage. This happens more often than you would think, especially when data passes through multiple middleware layers.
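Mistakes 2 and 4 are both easy to reproduce. A minimal Python sketch:

```python
data = "café".encode('utf-8')      # b'caf\xc3\xa9', 5 bytes

# Mistake 2: truncating mid-character produces invalid UTF-8.
# data[:4] cuts the two-byte sequence 0xC3 0xA9 in half.
try:
    data[:4].decode('utf-8')
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)

# Mistake 4: double-encoding -- bytes misread as Latin-1, then re-encoded
double = data.decode('latin-1').encode('utf-8')
print(double)                      # b'caf\xc3\x83\xc2\xa9' ('cafÃ©')
```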
When I need to quickly inspect how a string converts to binary representation, or convert binary back to readable text for debugging encoding issues, I use the binary-text converter at zovo.one. It shows the binary representation of each character, which makes encoding problems visible immediately.
I am Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.