DEV Community

Yaya
Yaya

Posted on

BYTE PAIR ENCODING

Understanding Byte Pair Encoding (BPE) by Building It From Scratch

Introduction: Why Tokenization Exists

When working with Large Language Models (LLMs) like GPT, LLaMA, or Mistral, one of the first hidden steps is tokenization.

LLMs do not read raw text. Instead, text must first be converted into tokens — numerical representations that the model can process.

One of the most important tokenization techniques used in modern NLP systems is Byte Pair Encoding (BPE).

In this article, I’ll break it down from intuition to implementation, and show how I built BPE from scratch in Python.

We’ll cover:

  • Why tokenization is needed
  • Why word-level and character-level tokenization fail
  • How BPE solves this trade-off
  • How BPE is implemented step-by-step
  • Training vs inference behavior
  • Strengths and limitations

The Problem With Word-Level Tokenization

A naive approach is to split text by words:

I love machine learning
Enter fullscreen mode Exit fullscreen mode

becomes:

["I", "love", "machine", "learning"]
Enter fullscreen mode Exit fullscreen mode

This seems reasonable — until you realize a major issue.

What happens when the model sees:

machinelearning
Enter fullscreen mode Exit fullscreen mode

or:

hyperlearning
Enter fullscreen mode Exit fullscreen mode

If these words were never in the training data, they become Out-Of-Vocabulary (OOV) tokens.

This is a serious limitation: language is constantly evolving, and we cannot store every possible word.


The Problem With Character-Level Tokenization

To solve OOV, we could split everything into characters:

learning
Enter fullscreen mode Exit fullscreen mode

becomes:

["l", "e", "a", "r", "n", "i", "n", "g"]
Enter fullscreen mode Exit fullscreen mode

This eliminates OOV completely.

However, it introduces new problems:

  • Sequences become very long
  • Training becomes inefficient
  • The model loses meaningful word structure

So we have a trade-off:

Approach Problem
Word-level OOV issues
Character-level Too long sequences

We need something in between.


The Idea Behind Byte Pair Encoding (BPE)

BPE solves this by learning common patterns in text.

Instead of choosing words or characters, it builds subword units.

The idea is simple:

  1. Start with characters
  2. Find the most frequent adjacent pair
  3. Merge it into a new token
  4. Repeat

Over time, frequent patterns become single tokens.


Example Intuition

Given:

l o w
l o w e r
l o w e s t
Enter fullscreen mode Exit fullscreen mode

BPE might learn:

lo
low
lowe
Enter fullscreen mode Exit fullscreen mode

depending on frequency patterns.


Step 1: Build a Vocabulary

We start with word frequencies:

{
    "low": 1,
    "lower": 1,
    "lowest": 1
}
Enter fullscreen mode Exit fullscreen mode

Then we convert each word into characters and add an end-of-word marker:

{
    ('l', 'o', 'w', '</w>'): 1,
    ('l', 'o', 'w', 'e', 'r', '</w>'): 1,
    ('l', 'o', 'w', 'e', 's', 't', '</w>'): 1
}
Enter fullscreen mode Exit fullscreen mode

The </w> marker helps distinguish word boundaries.


Step 2: Count Adjacent Pairs

We scan all words and count adjacent symbol pairs.

For example:

l o w </w>
Enter fullscreen mode Exit fullscreen mode

produces:

(l, o)
(o, w)
(w, </w>)
Enter fullscreen mode Exit fullscreen mode

Each pair is counted weighted by word frequency.

This gives us a global frequency table of all pairs in the corpus.


Step 3: Select the Most Frequent Pair

We select the most frequent adjacent pair:

('l', 'o')
Enter fullscreen mode Exit fullscreen mode

This becomes a merge rule:

(l, o) → lo
Enter fullscreen mode Exit fullscreen mode

Step 4: Merge Across the Entire Vocabulary

We replace every occurrence of:

l o
Enter fullscreen mode Exit fullscreen mode

with:

lo
Enter fullscreen mode Exit fullscreen mode

So:

('l','o','w','</w>')
Enter fullscreen mode Exit fullscreen mode

becomes:

('lo','w','</w>')
Enter fullscreen mode Exit fullscreen mode

Step 5: Repeat the Process

We recompute pair frequencies and continue merging:

(l,o) → lo
(lo,w) → low
(low,e) → lowe
Enter fullscreen mode Exit fullscreen mode

Each iteration builds more complex tokens from simpler ones.


Training vs Tokenization (CRITICAL CONCEPT)

This is where many people get confused.

During Training

BPE:

  • Counts pair frequencies
  • Learns merge rules
  • Builds a vocabulary

Example:

(l,o) → lo
(lo,w) → low
Enter fullscreen mode Exit fullscreen mode

During Tokenization (Inference)

The tokenizer:

  • Does NOT learn anything
  • Does NOT count frequencies
  • ONLY applies learned merge rules

Example:

Input:

lowly
Enter fullscreen mode Exit fullscreen mode

Start:

l o w l y
Enter fullscreen mode Exit fullscreen mode

Apply merges:

(l,o) → lo
(lo,w) → low
Enter fullscreen mode Exit fullscreen mode

Result:

low l y
Enter fullscreen mode Exit fullscreen mode

No learning happens here — only rule application.


Why BPE Works So Well

1. Solves OOV Problem

Even unseen words can be represented:

playfulness → play + ful + ness
Enter fullscreen mode Exit fullscreen mode

2. Smaller Vocabulary

Instead of storing full words:

play
playing
player
played
Enter fullscreen mode Exit fullscreen mode

we reuse:

play
ing
er
Enter fullscreen mode Exit fullscreen mode

3. Better Generalization

Words with shared roots reuse tokens:

play
playing
player
Enter fullscreen mode Exit fullscreen mode

This allows models to share meaning across related words.


Limitations of BPE

Despite its power, BPE has limitations:

  • Sensitive to training corpus
  • Greedy merging strategy
  • Weak for multilingual balance
  • Does not “understand” language — only frequency patterns

Because of this, modern LLMs often use:

  • Byte-level BPE (GPT-style tokenizers)
  • WordPiece (BERT)
  • SentencePiece Unigram (LLaMA, T5)

My Python Implementation (From Scratch)

Below is my simplified BPE implementation:

import sys

def main(av: list[str]) -> int:
    ac = len(av)

    # Step 1: Build word frequencies
    vocab = {}

    for i in range(1, ac):
        vocab[av[i]] = vocab.get(av[i], 0) + 1

    # Step 2: Split into characters
    word_dict = {
        tuple(word) + ('</w>',): freq
        for word, freq in vocab.items()
    }

    # BPE training loop
    num_merges = 200

    for merge_step in range(num_merges):

        # Step 3: Count pairs
        pair_count = {}

        for word, freq in word_dict.items():
            symbols = list(word)

            for i in range(len(symbols) - 1):
                pair = (symbols[i], symbols[i + 1])
                pair_count[pair] = pair_count.get(pair, 0) + freq

        if not pair_count:
            break

        # Step 4: Best pair
        best_pair = max(pair_count, key=pair_count.get)

        if pair_count[best_pair] < 2:
            break

        merged_symbol = ''.join(best_pair)

        print(f"Merge {merge_step + 1}: {best_pair} -> {merged_symbol}")

        # Step 5: Merge vocabulary
        new_word_dict = {}

        for word, freq in word_dict.items():
            symbols = list(word)
            new_symbols = []

            i = 0
            while i < len(symbols):
                if (
                    i < len(symbols) - 1 and
                    (symbols[i], symbols[i + 1]) == best_pair
                ):
                    new_symbols.append(merged_symbol)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1

            new_word_dict[tuple(new_symbols)] = freq

        word_dict = new_word_dict

    print("\nFinal vocabulary:")
    print(word_dict)

    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))

Enter fullscreen mode Exit fullscreen mode

This implementation:

  • Builds vocabulary
  • Computes pair frequencies
  • Learns merge rules
  • Iteratively updates tokens

Key Insight

BPE is surprisingly simple:

It does not understand language — it only learns which character pairs appear together most frequently.

Yet this simple idea is powerful enough to be used in all modern LLMs.


Conclusion

Byte Pair Encoding sits at the core of modern NLP systems.

It provides a balance between:

  • Word-level tokenization (too rigid)
  • Character-level tokenization (too long)

By learning frequent subword patterns, BPE allows models to efficiently represent both common and unseen words.

Understanding BPE is one of the best entry points into understanding how LLMs actually process language.

Top comments (0)