Most engineers who use language models every day cannot, on a blank piece of paper, describe what their tokenizer is doing to the prompts they send. The library returns a list of integers; the integers are passed to the API call; the model produces a response; the engineer goes back to work. Most of the time, the abstraction holds, and there is no reason to look underneath it. Then one day, there is a reason. The model breaks on a name with an unusual suffix. The cost of a Korean prompt is twice that of an English one. A jailbreak works because of the way an emoji decomposes into surrogate pairs. None of these are mysterious once you have built a tokenizer; all of them are mysterious if you have not.
This piece is the version of the chapter on Byte-Pair Encoding (BPE) that fits in a blog post. The full chapter is in Volume I of the book, but the algorithm is small enough that a thirty-line implementation is enough to change what you see when you look at a prompt. Twenty minutes, a calculator-sized corpus, and a Python REPL are all you need.
The problem BPE was invented to solve
There are two obvious ways to turn text into integers, and both of them are bad. You can split on whitespace and assign each unique word an integer; the vocabulary explodes, every typo is a new token, and the model never sees unhappy and unhappily as related. Or you can split on characters; the vocabulary stays tiny, but every word becomes a long sequence of integers and the model has to learn the structure of a word like the from scratch every time it sees those three letters in a row.
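To feel both failure modes before fixing them, here is a throwaway comparison in plain Python; the words are arbitrary examples, not anything from the book:
text = "unhappy unhappily unhappiness"

# Word-level tokenization: short sequences, but every new surface form
# (typos included) becomes a brand-new vocabulary entry.
word_tokens = text.split()
word_vocab = set(word_tokens)

# Character-level tokenization: the vocabulary stays tiny, but every word
# becomes a long sequence the model has to reassemble from scratch.
char_tokens = list(text)
char_vocab = set(char_tokens)

print(len(word_tokens), "word tokens,", len(word_vocab), "word-level vocab entries")
print(len(char_tokens), "char tokens,", len(char_vocab), "char-level vocab entries")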
What you want is the middle: a vocabulary of frequent subword pieces, where common words come out as one token, common prefixes and suffixes come out as their own tokens, and rare or invented words decompose into a handful of pieces the model has seen before. BPE is one of the cheapest ways to build such a vocabulary, and it serves as the basis for the tokenizers in GPT-2, GPT-3, GPT-4, Llama, and most open-weight models you might fine-tune.
The algorithm in plain English
Start with each word as a list of characters, plus a special end-of-word marker to indicate where one word ends and the next begins. Count every adjacent pair of symbols across the entire corpus. Find the most frequent pair. Merge that pair, everywhere in the corpus, into a single new symbol. Repeat until the vocabulary reaches the size you want.
That is it. Four steps. The vocabulary that emerges has the property that the most common letter combinations in the training data become single tokens first, and the rare combinations are left as smaller pieces. the becomes one token because t-h merges early and th-e merges soon after; both pairs are frequent in your corpus. understandable becomes maybe under and stand and able because each of those substrings is frequent across the corpus, even if the full word is not.
Three deterministic merges, on three small words
Take a corpus of three words: low lower lowest. After the initial split into characters with an end-of-word marker (we will use _), the corpus looks like this:
l o w _
l o w e r _
l o w e s t _
Count the adjacent pairs across all three words. The pair (l, o) appears three times, and so does (o, w). After that, (w, e) appears twice; every other pair appears once. We pick (l, o) as the first merge (a tie is broken by ordering; the choice does not matter for the example).
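If you would rather check those counts than trust them, a few lines with collections.Counter reproduce them; this is a REPL sanity check, not part of the trainer that comes later:
from collections import Counter

words = [["l", "o", "w", "_"],
         ["l", "o", "w", "e", "r", "_"],
         ["l", "o", "w", "e", "s", "t", "_"]]

counts = Counter()
for word in words:
    # count every adjacent pair of symbols in every word
    for a, b in zip(word, word[1:]):
        counts[(a, b)] += 1

print(counts.most_common(3))
# the two three-count pairs ('l', 'o') and ('o', 'w') come first,
# followed by ('w', 'e') with two occurrences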
After merge 1, the corpus is:
lo w _
lo w e r _
lo w e s t _
Now (lo, w) is the most frequent pair, with three occurrences. Merge it.
low _
low e r _
low e s t _
(low, e) is the most frequent now, with two occurrences. Merge it.
low _
lowe r _
lowe s t _
After three merges, the word low is the two tokens low _, the word lowest is the four tokens lowe s t _, and lower is the three tokens lowe r _. The shared prefix lowe is what the algorithm has discovered. The algorithm did not know in advance that lowe would be a useful token; it found it by counting.
The thirty-line implementation
The whole algorithm fits in a small Python file. The following script trains BPE on a corpus and prints each merge as it happens. Save it, run it, change the corpus, watch the merges change.
from collections import Counter

def get_pairs(word):
    """Return the list of adjacent pairs in a tokenized word."""
    return [(word[i], word[i + 1]) for i in range(len(word) - 1)]

def merge_pair(word, pair, replacement):
    """Replace every occurrence of pair in the tokenized word."""
    out = []
    i = 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(replacement)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    words = [tuple(w) + ("_",) for w in corpus.split()]
    merges = []
    for step in range(num_merges):
        counts = Counter()
        for word in words:
            for pair in get_pairs(word):
                counts[pair] += 1
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        replacement = best[0] + best[1]
        words = [merge_pair(word, best, replacement) for word in words]
        merges.append((best, replacement))
        print(f"Merge {step + 1}: {best} -> '{replacement}'")
    return merges, words

corpus = "low lower lowest"
merges, vocab = train_bpe(corpus, num_merges=5)
print("Final tokenization:", vocab)
Run it on the corpus from the worked example and the first three lines of output will match the merges we computed by hand. Run it on a larger corpus, say a paragraph from a book, and the merges become a small autobiography of which letter combinations are common in your text.
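The script trains merges but never applies them to new text. The repository mentioned at the end adds a proper tokenize function; the sketch below is only a minimal guess at what that looks like, reusing merge_pair and the merges list returned by train_bpe and replaying every merge in training order (the repository's version may differ):
def tokenize(text, merges):
    """Apply the learned merges, in training order, to unseen text."""
    # assumes merge_pair and merges from the script above are in scope
    tokens = []
    for w in text.split():
        word = tuple(w) + ("_",)
        # replay every merge in the order it was learned
        for pair, replacement in merges:
            word = merge_pair(word, pair, replacement)
        tokens.extend(word)
    return tokens

# unseen words fall apart into pieces the trainer has already seen
print(tokenize("lowest slower", merges))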
What this changes about your prompts tomorrow
Once you have built the tokenizer, four things stop being mysterious. The first is why an English prompt and a Korean prompt of the same character count cost different amounts; English text sits squarely in the GPT tokenizer's training distribution and decomposes into roughly one token per common word, while Korean falls back to many smaller subword pieces per word. The second is why typos cost more tokens than the corrected version of the word; the typo is rare and decomposes into pieces. The third is why models sometimes treat compound words as if they were two unrelated concepts; the merge that would have unified them never reached the cutoff. The fourth is why putting variable content at the end of a prompt makes prefix caching cheap; the tokenizer is deterministic, and an unchanged prefix produces an unchanged sequence of tokens, which is what the cache keys on.
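You can watch the first two of these directly with a production tokenizer. The snippet below assumes the tiktoken package is installed and uses the cl100k_base encoding; the exact token counts depend on the encoding and are not quoted here, but the direction of the gap is the point:
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in [
    "The weather is unseasonably warm today.",   # common English words
    "The weathr is unseasonbly warm todya.",     # same sentence with typos
    "오늘은 날씨가 계절에 맞지 않게 따뜻하다.",   # roughly the same sentence in Korean
]:
    tokens = enc.encode(text)
    print(len(text), "characters ->", len(tokens), "tokens:", text)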
None of these is mysterious once the four-step algorithm has gone through your fingers on a tiny corpus. All of them are mysterious if it has not.
The book this excerpt is drawn from is two volumes long, and the BPE chapter is one of eighteen.
The companion repository is at github.com/ritesh-modi/inside-llm; the script above is in the chapter-three folder, with a longer corpus and a few extensions (vocabulary serialization, a tokenize function for new text, and the byte-level fallback that real tokenizers use). Clone it, run it on your own paragraph, and the next prompt you write will look different.