Luke Hinds for Stacklok

LLM Basics: The weird and wonderful world of Tokenization

Welcome to part two of LLM Basics, where we continue our journey into understanding Large Language Models (LLMs). If you haven’t already, check out Part One: The Transformer Model.

Today, we're covering Tokenization, a key concept in natural language processing (NLP). Tokenization is the process of breaking text into smaller units called tokens—essentially preparing language for machines to process. LLMs don't work with raw text; they need input in numerical form, and tokenization is how we bridge that gap.

It's a fascinating world, full of NLP semantics, so fasten your seatbelt and let's get stuck in! As always, there will be plenty of good old learning by example!

Understanding Tokenization

At its core, tokenization is the process of converting raw text into smaller units called tokens. Think of it as breaking down a sentence into meaningful pieces that a machine learning model can process. While this might sound simple, there's actually quite a bit of complexity involved in doing this effectively.
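
To make that text-to-numbers bridge concrete, here is a minimal sketch (the vocabulary and IDs below are made up purely for illustration) of how tokens end up as integers:

# A tiny, made-up vocabulary mapping tokens to integer IDs
vocab = {"The": 0, "quick": 1, "brown": 2, "fox": 3}

text = "The quick brown fox"
tokens = text.split()                          # ['The', 'quick', 'brown', 'fox']
token_ids = [vocab[token] for token in tokens]
print(token_ids)                               # [0, 1, 2, 3]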

Consider this old classic:

The quick brown fox jumped over the lazy dog.

Depending on the tokenization approach, this could be broken down in several different ways. Let's explore the main types of tokenizers and see how each would handle this text.

Character-Level Tokenization

The simplest approach is character-level tokenization, where each character becomes a token. Using our example:

text = "The quick brown fox"
char_tokens = list(text)
# Result: ['T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x']

While simple to implement, this approach creates very long sequences and loses some of the semantic meaning that comes from whole words. However, it has the advantage of having a small vocabulary size and no out-of-vocabulary words.
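
To make those trade-offs concrete, here is a small sketch (the corpus and vocabulary are invented for illustration): the character vocabulary stays tiny, and even a word never seen during training can still be encoded.

# Character-level vocabulary built from a small corpus
corpus = "the quick brown fox jumped over the lazy dog"
char_vocab = {ch: i for i, ch in enumerate(sorted(set(corpus)))}
print(len(char_vocab))  # only as many entries as there are distinct characters

# A word that never appeared in the corpus can still be encoded,
# as long as all of its characters are in the vocabulary
ids = [char_vocab[ch] for ch in "boxer"]
print(ids)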

Word-Level Tokenization

Word-level tokenization splits text at word boundaries, typically using spaces and punctuation as delimiters:

text = "The quick brown fox"
word_tokens = text.split()
# Result: ['The', 'quick', 'brown', 'fox']

This approach is intuitive and preserves word meaning, but it has several drawbacks. First, it creates a huge vocabulary (consider all possible words in English). Second, it struggles with compound words, misspellings, and morphological variants (running/runs/ran would be three separate tokens).
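
Here is a quick sketch of those drawbacks (the vocabulary is a toy one, invented for illustration): every surface form needs its own entry, and anything outside the vocabulary falls back to an unknown token.

# A toy word-level vocabulary: morphological variants are separate entries
word_vocab = {"run", "runs", "running", "fox", "quick"}

sentence = "the fox ran quickly"
tokens = [w if w in word_vocab else "[UNK]" for w in sentence.split()]
print(tokens)
# ['[UNK]', 'fox', '[UNK]', '[UNK]'] -- "the", "ran" and "quickly" are all out of vocabulary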

Subword Tokenization

This is where things get interesting. Modern LLMs typically use subword tokenization, which finds a sweet spot between character-level and word-level approaches. Let's look at some popular subword tokenization algorithms:

BPE (Byte-Pair Encoding)

BPE starts with characters and iteratively merges the most frequent pairs. Here's a simplified example of how it works:

# Initial vocabulary: ['l', 'o', 'w', 'e', 'r', ' ']
# Text: "lower lower lower"
# Step 1: Most common pair 'l' 'o' → 'lo'
# Step 2: Most common pair 'lo' 'w' → 'low'
# And so on...
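
To see that selection step in code, here is a minimal sketch (just the pair-counting part, not a full tokenizer) that finds the most frequent adjacent pair in the toy text above:

from collections import Counter

text = "lower lower lower"
symbols = list(text)

# Count every pair of adjacent symbols
pair_counts = Counter(zip(symbols, symbols[1:]))
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair, pair_counts[best_pair])  # ('l', 'o') 3 -- so 'l' and 'o' are merged first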

The real power of BPE becomes apparent when dealing with morphologically rich languages or compound words:

"unhappiness" → ["un", "happiness"]
"playing" → ["play", "ing"]

WordPiece

Used by good ole BERT, WordPiece is similar to BPE but uses a different criterion for merging tokens. It calculates the likelihood that two tokens should be merged based on the frequency of their combined occurrence divided by the product of their individual frequencies.

# Example WordPiece tokenization:
"embedding" → ["em", "##bed", "##ding"]
"transformer" → ["trans", "##form", "##er"]

The ## prefix indicates that this piece was part of a larger word.
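
As a rough sketch of that scoring rule (the token and pair counts below are invented purely to illustrate the formula), the merge score can be computed like this:

# WordPiece-style merge score: freq(pair) / (freq(first) * freq(second))
token_counts = {"em": 120, "##bed": 80, "##ding": 200}
pair_counts = {("em", "##bed"): 60, ("##bed", "##ding"): 75}

def merge_score(pair):
    first, second = pair
    return pair_counts[pair] / (token_counts[first] * token_counts[second])

for pair in pair_counts:
    print(pair, merge_score(pair))

# The pair with the highest score gets merged, which favours pairs whose
# parts rarely appear apart from each other.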

SentencePiece and Unigram

SentencePiece, used by many modern models, treats the input as a sequence of unicode characters and learns to tokenize based on a unigram language model. It's particularly useful for languages without clear word boundaries (like Thai, Japanese or Chinese):

# English: "Hello world!"
# SentencePiece: ["▁He", "llo", "▁world", "!"]
# (▁ represents space)

# Japanese: "こんにちは世界"
# SentencePiece: ["こん", "にち", "は", "世界"]
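
If you want to try this yourself, here is a rough sketch using the sentencepiece Python package (assuming it is installed, and assuming a plain-text corpus file, here hypothetically called corpus.txt):

import sentencepiece as spm

# Train a small unigram model on a plain-text corpus (one sentence per line).
# The file name, model prefix and vocab size are placeholders for illustration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_sp",
    vocab_size=1000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("Hello world!", out_type=str))  # e.g. ['▁He', 'llo', '▁world', '!']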

Handling Padding and Special Tokens

When processing batches of text, we need to handle sequences of different lengths. This is where padding and special tokens come in. Anyone from the cryptography world will know padding well; I got used to the sometimes painful debugging experience of padding in crypto schemes such as RSA and ECDSA.

# Special tokens
[PAD]  # Padding token
[CLS]  # Classification token (start of sequence)
[SEP]  # Separator token
[MASK] # Masked token for MLM tasks
[UNK]  # Unknown token

# Example sequence padding
sequences = [
    [101, 2054, 2003, 2146, 102],   # "This is great"
    [101, 2054, 2003, 102],         # "This is"
]

# After padding (pad_token_id = 0)
padded = [
    [101, 2054, 2003, 2146, 102],   # Original
    [101, 2054, 2003, 102,  0  ],   # Padded
]
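
Here is a minimal sketch of that padding step in plain Python (the token IDs are the illustrative ones from above, and the attention mask is the usual companion list that tells the model which positions are real tokens rather than padding):

def pad_sequences(sequences, pad_token_id=0):
    """Pad every sequence to the length of the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        pad_amount = max_len - len(seq)
        padded.append(seq + [pad_token_id] * pad_amount)
        masks.append([1] * len(seq) + [0] * pad_amount)
    return padded, masks

sequences = [
    [101, 2054, 2003, 2146, 102],
    [101, 2054, 2003, 102],
]
padded, masks = pad_sequences(sequences)
print(padded)  # [[101, 2054, 2003, 2146, 102], [101, 2054, 2003, 102, 0]]
print(masks)   # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]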

Implementation Considerations

When implementing tokenization in practice, there are several important considerations and possible gotchas to be aware of.

Vocabulary Size

The vocabulary size affects both model performance and computational efficiency. Too small, and you'll have too many tokens per sequence. Too large, and you'll have sparse embeddings:

# Example vocabulary sizes
character_vocab_size = 256  # all 256 possible byte values
word_vocab_size = 50000    # Common in word-level models
subword_vocab_size = 32000 # Common in modern LLMs
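
One way to feel that trade-off (just a toy comparison) is to tokenize the same sentence at character level and word level and compare the sequence lengths; subword vocabularies in the tens of thousands are the usual compromise between the two extremes.

text = "The quick brown fox jumped over the lazy dog."

char_tokens = list(text)    # tiny vocabulary, long sequences
word_tokens = text.split()  # huge vocabulary, short sequences

print(len(char_tokens))     # 45 tokens
print(len(word_tokens))     # 9 tokens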

Handling Unknown Tokens

Always, and let me say that again to save you a lot of pain: always have a strategy for handling unknown tokens. Modern subword tokenizers rarely encounter truly unknown tokens, but when they do, they don't like it!

# Example handling of unknown tokens
def tokenize_with_unknown(text, vocab):
    tokens = []
    for word in text.split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.append("[UNK]")
    return tokens
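
For example, with a toy vocabulary (invented for illustration), any word the tokenizer has never seen simply maps to [UNK]:

vocab = {"the", "quick", "brown", "fox"}
print(tokenize_with_unknown("the quick red fox", vocab))
# ['the', 'quick', '[UNK]', 'fox']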

Practical Example: Building a Simple BPE Tokenizer

Let's conclude with a practical example of implementing a simple BPE tokenizer:

from collections import Counter

def get_stats(vocab):
    """Get frequencies of pairs of adjacent symbols across the vocabulary."""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def update_vocab(vocab, pair):
    """Merge the given pair of symbols everywhere it appears in the vocabulary."""
    new_vocab = {}
    bigram = ' '.join(pair)   # e.g. 'l o'
    merged = ''.join(pair)    # e.g. 'lo'
    for word, freq in vocab.items():
        # Merge all occurrences of the pair in the word
        new_vocab[word.replace(bigram, merged)] = freq
    return new_vocab

def train_bpe(text, num_merges):
    """Train Byte-Pair Encoding on the given text."""
    # Start with each word split into characters, keyed by word frequency
    word_counts = Counter(text.split())
    vocab = {' '.join(word): freq for word, freq in word_counts.items()}

    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break

        # Find the most frequent pair of adjacent symbols and merge it everywhere
        best = max(pairs, key=pairs.get)
        vocab = update_vocab(vocab, best)

    return vocab

# Usage example:
text = "This is a sample text for training BPE"
vocab = train_bpe(text, num_merges=10)
print(vocab)

{'This': 1, 'is': 1, 'a': 1, 'sample': 1, 'te x t': 1, 'f o r': 1, 't r a in in g': 1, 'B P E': 1}

Let's break down what each function does, how they interact, and the underlying logic.

The get_stats function analyzes the vocabulary to identify pairs of adjacent symbols and calculates their frequencies. It accepts a dictionary (vocab) where each key is a word written as space-separated symbols (initially single characters) and each value is the frequency of that word in the text. Within each word, the function iterates over every adjacent pair of symbols and sums their occurrences, weighted by the word's frequency, in a dictionary named pairs. This way, it effectively captures which pairs of symbols are most commonly found together.

Next, the update_vocab function is responsible for merging a selected pair of symbols across the entire vocabulary. It takes in the current vocabulary and the target pair to merge. The function constructs a new vocabulary (new_vocab) in which every space-separated occurrence of the pair is replaced by the concatenation of the two symbols. This merged version of the vocabulary is returned at the end, reflecting the latest update.

The train_bpe function serves as the main entry point, tying everything together. It initializes the vocabulary by splitting each word of the input text into characters and recording that word's frequency. It then proceeds with a loop that executes a specified number of merging steps (num_merges). During each step, the function identifies the most frequent pair of adjacent symbols using get_stats and merges this pair everywhere using update_vocab. Because the pair statistics are recalculated from the updated vocabulary on every iteration, later merges can build on earlier ones (for example, 'i' and 's' become 'is'). This iterative process continues until the desired number of merges is reached or no pairs remain.

The output shown in the code snippet above is the vocabulary after ten merges: each key is a word from the input text "This is a sample text for training BPE", written as its current sequence of symbols, and each value is that word's frequency. Frequent character pairs such as 'is' and 'in' have been merged into single symbols, and "sample" has been rebuilt into a single token, while rarer sequences are still split into individual characters.
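
In practice you would rarely hand-roll BPE like this. As a rough sketch (assuming the Hugging Face tokenizers package is installed, and using a throwaway in-memory corpus), training a production-grade BPE tokenizer looks roughly like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build and train a BPE tokenizer on a tiny in-memory corpus
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
corpus = ["This is a sample text for training BPE", "Tokenization is fun"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("training sample").tokens)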

That is all for this week, see you next time!
