Yaya

Posted on Jun 7

The Tokenizer Behind GPT: Understanding BPE Step by Step

#bpe #ai #llm #machinelearning

Understanding Byte Pair Encoding (BPE) by Building It From Scratch

Introduction: Why Tokenization Exists

When working with Large Language Models (LLMs) like GPT, LLaMA, or Mistral, one of the first hidden steps is tokenization.

LLMs do not read raw text. Instead, text must first be converted into tokens — numerical representations that the model can process.

One of the most important tokenization techniques used in modern NLP systems is Byte Pair Encoding (BPE).

In this article, I’ll break it down from intuition to implementation, and show how I built BPE from scratch in Python.

We’ll cover:

Why tokenization is needed
Why word-level and character-level tokenization fail
How BPE solves this trade-off
How BPE is implemented step-by-step
Training vs inference behavior
Strengths and limitations

The Problem With Word-Level Tokenization

A naive approach is to split text by words:

I love machine learning

becomes:

["I", "love", "machine", "learning"]

This seems reasonable — until you realize a major issue.

What happens when the model sees:

machinelearning

or:

hyperlearning

If these words were never in the training data, they become Out-Of-Vocabulary (OOV) tokens.

This is a serious limitation: language is constantly evolving, and we cannot store every possible word.

The Problem With Character-Level Tokenization

To solve OOV, we could split everything into characters:

learning

becomes:

["l", "e", "a", "r", "n", "i", "n", "g"]

This eliminates OOV completely.

However, it introduces new problems:

Sequences become very long
Training becomes inefficient
The model loses meaningful word structure

So we have a trade-off:

Approach	Problem
Word-level	OOV issues
Character-level	Too long sequences

We need something in between.

The Idea Behind Byte Pair Encoding (BPE)

BPE solves this by learning common patterns in text.

Instead of choosing words or characters, it builds subword units.

The idea is simple:

Start with characters
Find the most frequent adjacent pair
Merge it into a new token
Repeat

Over time, frequent patterns become single tokens.

Example Intuition

Given:

l o w
l o w e r
l o w e s t

BPE might learn:

lo
low
lowe

depending on frequency patterns.

Step 1: Build a Vocabulary

We start with word frequencies:

{
    "low": 1,
    "lower": 1,
    "lowest": 1
}

Then we convert each word into characters and add an end-of-word marker:

{
    ('l', 'o', 'w', '</w>'): 1,
    ('l', 'o', 'w', 'e', 'r', '</w>'): 1,
    ('l', 'o', 'w', 'e', 's', 't', '</w>'): 1
}

The </w> marker helps distinguish word boundaries.

Step 2: Count Adjacent Pairs

We scan all words and count adjacent symbol pairs.

For example:

l o w </w>

produces:

(l, o)
(o, w)
(w, </w>)

Each pair is counted weighted by word frequency.

This gives us a global frequency table of all pairs in the corpus.

Step 3: Select the Most Frequent Pair

We select the most frequent adjacent pair:

('l', 'o')

This becomes a merge rule:

(l, o) → lo

Step 4: Merge Across the Entire Vocabulary

We replace every occurrence of:

l o

with:

lo

So:

('l','o','w','</w>')

becomes:

('lo','w','</w>')

Step 5: Repeat the Process

We recompute pair frequencies and continue merging:

(l,o) → lo
(lo,w) → low
(low,e) → lowe

Each iteration builds more complex tokens from simpler ones.

Training vs Tokenization (CRITICAL CONCEPT)

This is where many people get confused.

During Training

BPE:

Counts pair frequencies
Learns merge rules
Builds a vocabulary

Example:

(l,o) → lo
(lo,w) → low

During Tokenization (Inference)

The tokenizer:

Does NOT learn anything
Does NOT count frequencies
ONLY applies learned merge rules

Example:

Input:

lowly

Start:

l o w l y

Apply merges:

(l,o) → lo
(lo,w) → low

Result:

low l y

No learning happens here — only rule application.

Why BPE Works So Well

1. Solves OOV Problem

Even unseen words can be represented:

playfulness → play + ful + ness

2. Smaller Vocabulary

Instead of storing full words:

play
playing
player
played

we reuse:

play
ing
er

3. Better Generalization

Words with shared roots reuse tokens:

play
playing
player

This allows models to share meaning across related words.

Limitations of BPE

Despite its power, BPE has limitations:

Sensitive to training corpus
Greedy merging strategy
Weak for multilingual balance
Does not “understand” language — only frequency patterns

Because of this, modern LLMs often use:

Byte-level BPE (GPT-style tokenizers)
WordPiece (BERT)
SentencePiece Unigram (LLaMA, T5)

My Python Implementation (From Scratch)

Below is my simplified BPE implementation:

import sys

def main(av: list[str]) -> int:
    ac = len(av)

    # Step 1: Build word frequencies
    vocab = {}

    for i in range(1, ac):
        vocab[av[i]] = vocab.get(av[i], 0) + 1

    # Step 2: Split into characters
    word_dict = {
        tuple(word) + ('</w>',): freq
        for word, freq in vocab.items()
    }

    # BPE training loop
    num_merges = 200

    for merge_step in range(num_merges):

        # Step 3: Count pairs
        pair_count = {}

        for word, freq in word_dict.items():
            symbols = list(word)

            for i in range(len(symbols) - 1):
                pair = (symbols[i], symbols[i + 1])
                pair_count[pair] = pair_count.get(pair, 0) + freq

        if not pair_count:
            break

        # Step 4: Best pair
        best_pair = max(pair_count, key=pair_count.get)

        if pair_count[best_pair] < 2:
            break

        merged_symbol = ''.join(best_pair)

        print(f"Merge {merge_step + 1}: {best_pair} -> {merged_symbol}")

        # Step 5: Merge vocabulary
        new_word_dict = {}

        for word, freq in word_dict.items():
            symbols = list(word)
            new_symbols = []

            i = 0
            while i < len(symbols):
                if (
                    i < len(symbols) - 1 and
                    (symbols[i], symbols[i + 1]) == best_pair
                ):
                    new_symbols.append(merged_symbol)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1

            new_word_dict[tuple(new_symbols)] = freq

        word_dict = new_word_dict

    print("\nFinal vocabulary:")
    print(word_dict)

    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))

This implementation:

Builds vocabulary
Computes pair frequencies
Learns merge rules
Iteratively updates tokens

Key Insight

BPE is surprisingly simple:

It does not understand language — it only learns which character pairs appear together most frequently.

Yet this simple idea is powerful enough to be used in all modern LLMs.

Conclusion

Byte Pair Encoding sits at the core of modern NLP systems.

It provides a balance between:

Word-level tokenization (too rigid)
Character-level tokenization (too long)

By learning frequent subword patterns, BPE allows models to efficiently represent both common and unseen words.

Understanding BPE is one of the best entry points into understanding how LLMs actually process language.

DEV Community

The Tokenizer Behind GPT: Understanding BPE Step by Step

Understanding Byte Pair Encoding (BPE) by Building It From Scratch

Introduction: Why Tokenization Exists

The Problem With Word-Level Tokenization

The Problem With Character-Level Tokenization

The Idea Behind Byte Pair Encoding (BPE)

Example Intuition

Step 1: Build a Vocabulary

Step 2: Count Adjacent Pairs

Step 3: Select the Most Frequent Pair

Step 4: Merge Across the Entire Vocabulary

Step 5: Repeat the Process

Training vs Tokenization (CRITICAL CONCEPT)

During Training

During Tokenization (Inference)

Why BPE Works So Well

1. Solves OOV Problem

2. Smaller Vocabulary

3. Better Generalization

Limitations of BPE

My Python Implementation (From Scratch)

Key Insight

Conclusion

Top comments (0)