https://www.youtube.com/watch?v=zduSFxRajkE
Type the word "ChatGPT" into ChatGPT. To you, that's one word. To the model, it's three separate objects: Chat, G, PT. The model doesn't read your text. It reads tokens — chunks of characters learned from training data. And those chunks don't always line up with words.
What Is a Token?
A token is the smallest unit an LLM can see. Not a character. Not a word. Something in between.
On average, one token is about four characters — roughly three-quarters of an English word. Common words like the or is get their own token. Rarer words get broken into pieces. "Tokenization" might be one token. "SolidGoldMagikarp" becomes five.
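The four-characters-per-token rule of thumb is handy for rough cost estimates. Here is a minimal sketch of it as a heuristic; real counts require running the actual tokenizer, and the function name is mine:

```python
def estimate_tokens(text: str) -> int:
    """Back-of-the-envelope token count using the ~4 characters
    per token rule of thumb. Real counts need the real tokenizer."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("the cat sat on the mat"))  # 22 chars -> 6
```

Note this lines up with the real count for that sentence in English, but it badly underestimates for languages that tokenize less efficiently.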
Every token has a numeric ID. The word " the" (with a leading space) is token 262. Capital "The" is token 464 — a completely different object to the model.
GPT-4's vocabulary has about 100,000 tokens. Each one maps to a unique ID, and each ID maps to a learned embedding vector — a point in high-dimensional space that represents what the model "knows" about that chunk of text.
When you type a sentence, the model doesn't see English. It sees a sequence of numbers. And everything it generates comes out as numbers too — decoded back into text only at the very end.
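The round trip above can be sketched with a toy vocabulary. The " the" and "The" IDs are the ones quoted earlier; the IDs for the "ChatGPT" chunks are invented for illustration:

```python
# Toy vocabulary: token string -> numeric ID.
# 262 and 464 are the real GPT-2-era IDs quoted above;
# the ChatGPT chunk IDs are made up for this sketch.
vocab = {" the": 262, "The": 464, "Chat": 30820, "G": 38, "PT": 11571}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map a pre-split token sequence to the IDs the model sees."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map IDs back to text -- the only step the user ever sees."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(["Chat", "G", "PT"])
print(ids)          # three separate numbers for one "word"
print(decode(ids))  # ChatGPT
```

Everything between `encode` and `decode` happens in ID space; the model never touches the strings.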
How BPE Builds the Vocabulary
So how does the model decide which chunks to use? The answer is an algorithm from 1994 called Byte Pair Encoding (BPE).
You start with the simplest possible vocabulary: every individual byte. 256 entries — one for each possible byte value.
Then you scan a massive text corpus and count every adjacent pair of bytes:
| Pair | Frequency |
|---|---|
| t-h | 10,000,000 |
| e-_ | 8,000,000 |
| i-n | 7,000,000 |
You take the most frequent pair — t and h — and merge them into a single new token: th. Add it to the vocabulary. Now everywhere t-h appeared, it's one token instead of two.
Scan again. Maybe th-e is now the most frequent pair. Merge it. One token: the.
Repeat this 40,000 to 100,000 times. Each round, the most common pair becomes a new token. Common words compress into single tokens early. Rare words stay split into smaller pieces.
After 100,000 merges, you have GPT-4's vocabulary. A dictionary built not by linguists, but by raw frequency statistics. The algorithm doesn't know what a word is. It just knows what appears together.
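The whole loop fits in a few lines of Python. This is a toy sketch of the merge procedure described above: it starts from characters instead of raw bytes to keep the output readable, and the corpus is invented:

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int):
    """Toy BPE: repeatedly merge the most frequent adjacent pair.
    Real tokenizers start from all 256 byte values; this sketch
    starts from characters for readability."""
    # Each word becomes a tuple of symbols, weighted by frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing the winning pair into one symbol.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        words = merged
    return merges

corpus = "the theory of the thing the thin thread"
merges = bpe_train(corpus, 3)
print(merges)  # first merge is ('t', 'h'), then ('th', 'e')
```

Run it and you can watch the table above come to life: `t`+`h` wins the first round, and the new `th` symbol immediately starts winning rounds of its own.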
The Two-Process Gap: Glitch Tokens
Here's where things get strange. The tokenizer and the language model are trained in two completely separate steps.
First, BPE runs on a text corpus and builds the vocabulary. Then, in a totally separate process, the LLM trains on a different corpus using that vocabulary.
This creates a gap. Some tokens exist in the vocabulary because they appeared frequently in the tokenizer's training data. But they barely appeared in the LLM's training data. The model has a slot for these tokens — but it never learned what they mean.
These are called glitch tokens. And the most famous one is SolidGoldMagikarp.
A Reddit user named SolidGoldMagikarp posted over 160,000 comments in counting subreddits. Their username appeared so frequently that BPE compressed it into a single token. But the LLM's training data didn't include nearly as many of those posts.
When researchers asked GPT to simply repeat the word "SolidGoldMagikarp," the model panicked. It output random words. Insults. Religious text. Anything but the actual token. Because it had a slot for it, but no understanding of what it was.
The fix in GPT-4 was simple: the tokenizer now splits "SolidGoldMagikarp" into five normal tokens — Solid, Gold, Mag, ik, arp. No single glitch token. Problem solved.
Tokenization Explains Everything
Once you understand tokenization, every confusing AI behavior clicks into place.
Why can't GPT count the R's in "strawberry"? Because it sees str, aw, berry. The individual letters are hidden inside token boundaries. It's reading through frosted glass.
Why does GPT struggle with arithmetic? Because the number 1,234,567 gets split into arbitrary chunks that don't align with place values. It's like doing long division with randomly grouped digits.
Why is GPT worse in Japanese than English? Because English text compresses efficiently — "the cat sat on the mat" is six tokens. The same sentence in Japanese can cost three times as many tokens. More tokens means more computation, higher costs, and a smaller effective context window.
Even spelling errors trace back here. The model doesn't see individual letters — it sees subword chunks. Asking it to spell a word backward means reconstructing characters it never directly processed.
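The frosted-glass effect can be made concrete. In this sketch, the "strawberry" split is the hypothetical one quoted above and the IDs are invented; the real split and IDs depend on the tokenizer:

```python
# Hypothetical token split of "strawberry"; real splits vary by tokenizer.
tokens = ["str", "aw", "berry"]

# The model's view: one opaque ID per chunk (IDs invented for this sketch).
toy_ids = {"str": 496, "aw": 672, "berry": 15717}
model_view = [toy_ids[t] for t in tokens]
print(model_view)  # the letter "r" appears nowhere in this sequence

# Counting R's requires character-level access the model never gets.
r_count = sum(ch == "r" for ch in "".join(tokens))
print(r_count)  # 3
```

The second count is trivial for a program with character access, which is exactly the access the model lacks.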
The Invisible Translation Layer
The next time an AI does something that seems dumb, don't blame the model. Check the tokenizer first.
Between your text and the AI's mind, there's always this invisible translation layer — turning your words into chunks, your chunks into numbers, and hoping nothing important gets lost in the split.
→ Try it yourself: go to platform.openai.com/tokenizer and type "ChatGPT" — you'll see the split.