Understanding Byte Pair Encoding (BPE) by Building It From Scratch
Introduction: Why Tokenization Exists
When working with Large Language Models (LLMs) like GPT, LLaMA, or Mistral, one of the first hidden steps is tokenization.
LLMs do not read raw text. Instead, text must first be converted into tokens — numerical representations that the model can process.
One of the most important tokenization techniques used in modern NLP systems is Byte Pair Encoding (BPE).
In this article, I’ll break it down from intuition to implementation, and show how I built BPE from scratch in Python.
We’ll cover:
- Why tokenization is needed
- Why word-level and character-level tokenization fail
- How BPE solves this trade-off
- How BPE is implemented step-by-step
- Training vs inference behavior
- Strengths and limitations
The Problem With Word-Level Tokenization
A naive approach is to split text by words:
I love machine learning
becomes:
["I", "love", "machine", "learning"]
This seems reasonable — until you realize a major issue.
What happens when the model sees:
machinelearning
or:
hyperlearning
If these words were never in the training data, they become Out-Of-Vocabulary (OOV) tokens.
This is a serious limitation: language is constantly evolving, and we cannot store every possible word.
The Problem With Character-Level Tokenization
To solve OOV, we could split everything into characters:
learning
becomes:
["l", "e", "a", "r", "n", "i", "n", "g"]
This eliminates OOV completely.
However, it introduces new problems:
- Sequences become very long
- Training becomes inefficient
- The model loses meaningful word structure
So we have a trade-off:
| Approach | Problem |
|---|---|
| Word-level | OOV issues |
| Character-level | Too long sequences |
We need something in between.
The Idea Behind Byte Pair Encoding (BPE)
BPE solves this by learning common patterns in text.
Instead of choosing words or characters, it builds subword units.
The idea is simple:
- Start with characters
- Find the most frequent adjacent pair
- Merge it into a new token
- Repeat
Over time, frequent patterns become single tokens.
Example Intuition
Given:
l o w
l o w e r
l o w e s t
BPE might learn:
lo
low
lowe
depending on frequency patterns.
Step 1: Build a Vocabulary
We start with word frequencies:
{
"low": 1,
"lower": 1,
"lowest": 1
}
Then we convert each word into characters and add an end-of-word marker:
{
('l', 'o', 'w', '</w>'): 1,
('l', 'o', 'w', 'e', 'r', '</w>'): 1,
('l', 'o', 'w', 'e', 's', 't', '</w>'): 1
}
The </w> marker helps distinguish word boundaries.
Step 2: Count Adjacent Pairs
We scan all words and count adjacent symbol pairs.
For example:
l o w </w>
produces:
(l, o)
(o, w)
(w, </w>)
Each pair is counted weighted by word frequency.
This gives us a global frequency table of all pairs in the corpus.
Step 3: Select the Most Frequent Pair
We select the most frequent adjacent pair:
('l', 'o')
This becomes a merge rule:
(l, o) → lo
Step 4: Merge Across the Entire Vocabulary
We replace every occurrence of:
l o
with:
lo
So:
('l','o','w','</w>')
becomes:
('lo','w','</w>')
Step 5: Repeat the Process
We recompute pair frequencies and continue merging:
(l,o) → lo
(lo,w) → low
(low,e) → lowe
Each iteration builds more complex tokens from simpler ones.
Training vs Tokenization (CRITICAL CONCEPT)
This is where many people get confused.
During Training
BPE:
- Counts pair frequencies
- Learns merge rules
- Builds a vocabulary
Example:
(l,o) → lo
(lo,w) → low
During Tokenization (Inference)
The tokenizer:
- Does NOT learn anything
- Does NOT count frequencies
- ONLY applies learned merge rules
Example:
Input:
lowly
Start:
l o w l y
Apply merges:
(l,o) → lo
(lo,w) → low
Result:
low l y
No learning happens here — only rule application.
Why BPE Works So Well
1. Solves OOV Problem
Even unseen words can be represented:
playfulness → play + ful + ness
2. Smaller Vocabulary
Instead of storing full words:
play
playing
player
played
we reuse:
play
ing
er
3. Better Generalization
Words with shared roots reuse tokens:
play
playing
player
This allows models to share meaning across related words.
Limitations of BPE
Despite its power, BPE has limitations:
- Sensitive to training corpus
- Greedy merging strategy
- Weak for multilingual balance
- Does not “understand” language — only frequency patterns
Because of this, modern LLMs often use:
- Byte-level BPE (GPT-style tokenizers)
- WordPiece (BERT)
- SentencePiece Unigram (LLaMA, T5)
My Python Implementation (From Scratch)
Below is my simplified BPE implementation:
import sys
def main(av: list[str]) -> int:
ac = len(av)
# Step 1: Build word frequencies
vocab = {}
for i in range(1, ac):
vocab[av[i]] = vocab.get(av[i], 0) + 1
# Step 2: Split into characters
word_dict = {
tuple(word) + ('</w>',): freq
for word, freq in vocab.items()
}
# BPE training loop
num_merges = 200
for merge_step in range(num_merges):
# Step 3: Count pairs
pair_count = {}
for word, freq in word_dict.items():
symbols = list(word)
for i in range(len(symbols) - 1):
pair = (symbols[i], symbols[i + 1])
pair_count[pair] = pair_count.get(pair, 0) + freq
if not pair_count:
break
# Step 4: Best pair
best_pair = max(pair_count, key=pair_count.get)
if pair_count[best_pair] < 2:
break
merged_symbol = ''.join(best_pair)
print(f"Merge {merge_step + 1}: {best_pair} -> {merged_symbol}")
# Step 5: Merge vocabulary
new_word_dict = {}
for word, freq in word_dict.items():
symbols = list(word)
new_symbols = []
i = 0
while i < len(symbols):
if (
i < len(symbols) - 1 and
(symbols[i], symbols[i + 1]) == best_pair
):
new_symbols.append(merged_symbol)
i += 2
else:
new_symbols.append(symbols[i])
i += 1
new_word_dict[tuple(new_symbols)] = freq
word_dict = new_word_dict
print("\nFinal vocabulary:")
print(word_dict)
return 0
if __name__ == "__main__":
sys.exit(main(sys.argv))
This implementation:
- Builds vocabulary
- Computes pair frequencies
- Learns merge rules
- Iteratively updates tokens
Key Insight
BPE is surprisingly simple:
It does not understand language — it only learns which character pairs appear together most frequently.
Yet this simple idea is powerful enough to be used in all modern LLMs.
Conclusion
Byte Pair Encoding sits at the core of modern NLP systems.
It provides a balance between:
- Word-level tokenization (too rigid)
- Character-level tokenization (too long)
By learning frequent subword patterns, BPE allows models to efficiently represent both common and unseen words.
Understanding BPE is one of the best entry points into understanding how LLMs actually process language.
Top comments (0)