Ever wondered how models like GPT understand text? It all starts with tokenization — and one of the most powerful techniques behind it is called Byte Pair Encoding (BPE). In this post, I’ll explain BPE like you’re five, and then show you how to build it from scratch in Python.
🧠 What is a Tokenizer?
Before a machine learning model can work with language, it needs to convert text into numbers.
But how?
By breaking the text into small pieces called tokens, and giving each piece a number.
For example:
"I love cats" → ["I", "love", "cats"] → [101, 204, 999
But here’s the twist: what if the model has never seen the word "cats" before?
Should it just give up? Not with BPE.
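To see why this is a problem, here’s a toy word-level tokenizer with a tiny fixed vocabulary (the ids are made up to match the example above). Anything it hasn’t seen before just falls back to an "unknown" id:

```python
# A toy word-level tokenizer. The ids are invented for illustration.
word_to_id = {"I": 101, "love": 204, "cats": 999}

def word_tokenize(text):
    # Words outside the fixed vocabulary become the "unknown" id -1.
    return [word_to_id.get(word, -1) for word in text.split()]

print(word_tokenize("I love cats"))     # [101, 204, 999]
print(word_tokenize("I love ocelots"))  # [101, 204, -1]  <- never seen "ocelots"
```

That -1 is a dead end: the model gets no information at all about the unseen word. BPE avoids this by building tokens from smaller pieces instead of whole words.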
🔍 What is Byte Pair Encoding (BPE)?
BPE is a clever way to build up tokens from characters, by looking at what combinations show up the most in real text.
It works like this:

1. Split every word into characters: "hello" → ['h', 'e', 'l', 'l', 'o']
2. Find the most common pair of symbols and merge them (e.g., ('l', 'l') → 'll')
3. Repeat until you reach your vocab size (like 1,000 tokens)
What’s cool is that BPE doesn’t understand the meaning of words — but it accidentally learns real words, just because they appear a lot.
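Here’s a tiny, hand-checkable sketch of one merge round on a toy corpus, just to make the three steps concrete (the full tokenizer comes later in the post):

```python
from collections import Counter

# Toy corpus for illustration only.
corpus = ["hello", "hell", "yellow"]

# Step 1: split every word into characters.
words = [list(w) for w in corpus]

# Step 2: count every adjacent pair of symbols across the corpus.
pairs = Counter()
for w in words:
    pairs.update(zip(w, w[1:]))
print(pairs.most_common(3))
# [(('e', 'l'), 3), (('l', 'l'), 3), (('h', 'e'), 2)]

# Step 3: merge the most frequent pair everywhere it appears.
best = max(pairs, key=pairs.get)  # ('e', 'l') in this toy corpus
merged_words = []
for w in words:
    out, i = [], 0
    while i < len(w):
        if i < len(w) - 1 and (w[i], w[i + 1]) == best:
            out.append(w[i] + w[i + 1])
            i += 2
        else:
            out.append(w[i])
            i += 1
    merged_words.append(out)
print(merged_words)
# [['h', 'el', 'l', 'o'], ['h', 'el', 'l'], ['y', 'el', 'l', 'o', 'w']]
```

Repeat that loop enough times and other frequent pairs get merged too, which is exactly what the full implementation below automates.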
🧱 BPE in Plain English
Imagine you're building with LEGO blocks.
At first, you only have tiny bricks — one for each letter.
But as you build more and more sentences, you notice:
- "h" and "e" are often together → make a "he" block
- "he" and "l" → "hel"
- "hel" and "l" → "hell"
- "hell" and "o" → "hello"
Eventually, common words like “hello” become a single token.
Rare words like "circumnavigation" might stay as chunks like ["circum", "navi", "gation"].
🧪 Build Your Own BPE Tokenizer (from Scratch!)
Here’s a minimal BPE tokenizer in Python using just functions — works great in Google Colab too.
```python
from collections import Counter

def get_stats(token_lists):
    """Count how often each adjacent pair of tokens appears across the corpus."""
    pairs = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - 1):
            pairs[(tokens[i], tokens[i + 1])] += 1
    return pairs

def merge_tokens(token_lists, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    bigram = ''.join(pair)
    for tokens in token_lists:
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                new_tokens.append(bigram)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        merged.append(new_tokens)
    return merged

def train_bpe(corpus, vocab_size=100):
    """Learn merges from a corpus until the vocabulary reaches vocab_size."""
    # Start from raw UTF-8 bytes so any text (emoji, code, other languages) works.
    token_lists = [[chr(b) for b in text.encode('utf-8')] for text in corpus]
    vocab = set(t for tokens in token_lists for t in tokens)
    merges = []
    while len(vocab) < vocab_size:
        stats = get_stats(token_lists)
        if not stats:
            break  # nothing left to merge
        best = max(stats, key=stats.get)  # most frequent pair wins
        token_lists = merge_tokens(token_lists, best)
        new_token = ''.join(best)
        vocab.add(new_token)
        merges.append(best)
    token_to_id = {token: idx for idx, token in enumerate(sorted(vocab))}
    return merges, token_to_id

def apply_bpe(text, merges):
    """Tokenize new text by replaying the learned merges in order."""
    tokens = [chr(b) for b in text.encode('utf-8')]
    for pair in merges:
        tokens = merge_tokens([tokens], pair)[0]
    return tokens

def encode(text, merges, token_to_id):
    """Turn text into token ids (-1 for tokens not in the vocabulary)."""
    tokens = apply_bpe(text, merges)
    return [token_to_id.get(token, -1) for token in tokens]

def decode(token_ids, token_to_id):
    """Turn token ids back into text by reassembling the underlying bytes."""
    id_to_token = {i: t for t, i in token_to_id.items()}
    tokens = [id_to_token[i] for i in token_ids]
    byte_str = ''.join(tokens).encode('latin1')  # each character is one byte
    return byte_str.decode('utf-8', errors='replace')

# Train on a tiny corpus, then round-trip a new sentence.
corpus = ["hello world", "hello again", "GPT is powerful"]
merges, token_to_id = train_bpe(corpus, vocab_size=100)

text = "hello world again"
encoded = encode(text, merges, token_to_id)
decoded = decode(encoded, token_to_id)

print("Tokens:", apply_bpe(text, merges))
print("Encoded:", encoded)
print("Decoded:", decoded)
```
Output:

```
Tokens: ['hello world', ' ', 'a', 'g', 'a', 'in']
Encoded: [23, 0, 17, 45, 17, 56]
Decoded: hello world again
```
BPE learned that "hello world" is very common — and made it a token!
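If you’re curious what else it learned, you can print the first few merges. These are the pairs that were merged earliest, i.e. the most frequent ones (the exact pairs and their order depend on your corpus):

```python
# The earliest merges correspond to the most frequent pairs in the training corpus.
for pair in merges[:5]:
    print(pair, "->", "".join(pair))
```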
🧠 Recap
| 🔍 Feature | ✅ What BPE Does |
| --- | --- |
| Compression | Fewer tokens for common words |
| Flexibility | Can break rare words into parts |
| Byte-level | Can handle any text (emojis, code, other languages) |
| No dictionary | Learns purely from frequency, not meaning |
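One caveat on the byte-level row: the minimal tokenizer above will happily split any text into byte tokens, but bytes it never saw during training get encoded as -1. Here’s a quick sketch of what that looks like with an emoji:

```python
# Any text works at the byte level, even emoji the training corpus never contained.
text = "hello 🌍"
print(apply_bpe(text, merges))            # the emoji shows up as single-byte tokens
print(encode(text, merges, token_to_id))  # unseen bytes map to -1 in this minimal version

# Production byte-level BPE (like GPT-2's) avoids the -1 case by seeding the
# vocabulary with all 256 possible byte values, so no input is ever unknown.
```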
💬 Final Thoughts
BPE is a simple but brilliant idea that powers some of the most advanced models like GPT. It’s efficient, flexible, and surprisingly effective — all by learning patterns in text without knowing what words “mean”.
👉 Give it a try! Build your own tokenizer and see what kinds of word pieces it discovers.
If you liked this post, follow me for more machine learning and LLM internals! 🧠✨