Sisyphus Had a Boulder, We Have a Tokenizer
The Seven Deadly Sins of an LLM
Do you know why your LLM gets it wrong when you tell it to reverse the word googling?
Or why ChatGPT is always unreliable when it comes to math homework, even simple arithmetic?
Why typing SolidGoldMagikarp while asking for steps to make a bomb could actually get the LLM to hand over the instructions and forget its safety training?
Why do egg and Egg have different meanings, and why is egg at the start of a sentence completely different from egg after a space in the middle of one?
Why were earlier GPT models so terrible at coding, and why is GPT-4 so much better?
Why does suffering never truly end, and why is oblivion its only conclusion?
The answer to all of these questions is one word. Tokenization.
In the Beginning, There Was the Word
LLMs can't comprehend raw text like, "hello, i love eating cake." They rely on tokenization, a process that first breaks the sentence into pieces, or tokens:
["hello", ",", "i", "love", "eating", "cake"]
These tokens are then converted into a sequence of numbers (Token IDs) from the model's vast vocabulary:
[15496, 11, 40, 1563, 7585, 14249]
The LLM never sees the original words - only this list of numbers. This is how LLMs read.
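If you want to see this for yourself, here's a minimal sketch using the tiktoken library (assuming it's installed); the exact IDs you get depend on which encoding you load, so don't expect them to match the illustrative numbers above.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one of tiktoken's built-in encodings

text = "hello, i love eating cake"
ids = enc.encode(text)                        # text -> token IDs
pieces = [enc.decode([i]) for i in ids]       # decode each ID back into its text chunk

print(pieces)   # e.g. ['hello', ',', ' i', ' love', ' eating', ' cake']
print(ids)      # the list of integers the model actually "reads"
```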
But before today's complex methods, there was a simpler, more naive time when the word itself was the most sacred unit. So let's see how that evolution began.
1. Bag-of-Words (The Alphabet Soup of Meaning)
The Bag-of-Words (BoW) model treats text as a metaphorical bag of words, completely ignoring grammar and order. It works by scanning a collection of documents to build a vocabulary, then represents each document by simply counting how many times each word appears.
(Image credit: Vamshi Prakash)
For example, the sentence "The cat sat on the mat" would become a vector of counts like [the: 2, cat: 1, sat: 1, on: 1, mat: 1] (after lowercasing, "The" and "the" count as the same word).
Its Achilles Heel: BoW has no concept of order. In its view, "The dog chased the cat" and "The cat chased the dog" are nearly identical. This critical lack of context meant a smarter approach was needed.
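Here's a toy sketch of BoW in plain Python - nothing more than a word counter - which also makes the order problem painfully obvious.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    # Lowercase, split on spaces, and count occurrences. Order is thrown away.
    return Counter(sentence.lower().split())

print(bag_of_words("The cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})

# The Achilles heel in action: both sentences produce the exact same bag.
print(bag_of_words("The dog chased the cat") == bag_of_words("The cat chased the dog"))  # True
```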
2. TF-IDF (Sorry 'The', You're Too Common)
An upgrade to BoW, TF-IDF determines a word's importance by balancing its Term Frequency (TF) - how often it appears in a document - against its Inverse Document Frequency (IDF), which down-weights common words (like "the") that appear everywhere. This helped filter out noise, but it still ignored context and could overvalue frequently repeated terms.
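Here's a small sketch using scikit-learn's TfidfVectorizer (assuming it's installed); the exact weights depend on its smoothing and normalization defaults, but the pattern is what matters: words that appear everywhere get pushed down.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "python tutorials for beginners",
]

vec = TfidfVectorizer()
tfidf_matrix = vec.fit_transform(docs)   # rows = documents, columns = vocabulary terms

# 'the' shows up in most documents, so its IDF (and therefore its weight) drops,
# while rarer terms like 'python' keep a high weight in the documents they appear in.
for term, idf in zip(vec.get_feature_names_out(), vec.idf_):
    print(f"{term:10s} idf = {idf:.2f}")
```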
3. N-grams (Two's Company, Three's a Vocabulary Explosion)
An n-gram is a sequence of 'n' consecutive words: bigrams look back at the previous word for context, trigrams at the previous two. This re-introduced context. "New York" is now a single unit, distinct from "New" and "York" on their own. The catch is that the vocabulary size explodes, and as n grows we still can't capture long-range context.
Image Credit: Funnel.io
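A tiny helper makes the idea concrete: slide a window of n words across the sentence and every window becomes one unit.

```python
def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    # Slide a window of n consecutive words across the sentence.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I love New York in the winter", 2))
# [('I', 'love'), ('love', 'New'), ('New', 'York'), ('York', 'in'),
#  ('in', 'the'), ('the', 'winter')]
```

Notice ('New', 'York') survives as a single unit - and also how a seven-word sentence already produced six bigrams, which hints at how quickly the vocabulary explodes.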
4. BM25 (The Smartest Way to Still Be Wrong)
Imagine searching for "Python tutorials." While TF-IDF might rank a bloated article highest just because it repeats "Python" 100 times, BM25 is smarter. It understands diminishing returns - recognizing the 100th mention isn't much more valuable than the 15th - and uses intelligent length normalization to favor a concise tutorial over a rambling one.
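Below is a condensed sketch of the BM25 scoring formula, using the commonly quoted defaults k1 = 1.5 and b = 0.75 (the tiny corpus and numbers are purely illustrative; real implementations like Lucene's add more bookkeeping).

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the classic BM25 formula."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)    # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # documents containing the term
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        # Saturating term frequency: the 100th mention adds far less than the 15th,
        # and documents are normalized by their length relative to the average.
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "python python python python python tutorial".split(),   # keyword-stuffed
    "a concise python tutorial for beginners".split(),        # short and on-topic
]
for doc in corpus:
    print(" ".join(doc), "->", round(bm25_score(["python", "tutorial"], doc, corpus), 3))
# Note how five repetitions of 'python' score nowhere near five times as high as one.
```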
Despite its cleverness, its fatal flaw remains: it's still fundamentally a bag-of-words model. It matches keywords smartly but has no true understanding of semantic intent. The inherent limitations of treating each word as a sacred token paved the way for a more flexible approach: Subword tokenization.
Between a Rock and a Hard Place (Word vs. Character Tokenization)
Word-Level Tokenization (The Agony of the Unknown Word):
This is the most intuitive approach: simply split text by spaces, treating each word as a token. However, this method is impractical. It creates a massive vocabulary that is computationally expensive and has no way to handle new slang ("rizz"), typos, or even variations like "run" and "running," which it sees as completely unrelated. These "out-of-vocabulary" words leave a gaping hole in the model's understanding.
Character-Level Tokenization (Death by a Thousand Letters):
This is the opposite extreme, breaking text into its most basic components: individual characters. While this creates a tiny vocabulary and completely eliminates the "out-of-vocabulary" problem, it creates a new nightmare. Sequences become absurdly long and computationally expensive. More importantly, the inherent meaning of a word is destroyed, forcing the model to waste enormous effort just to learn that the characters a-p-p-l-e form the concept of an apple.
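A quick toy comparison makes the trade-off visible: word-level gives short sequences but punches unknown-word holes, while character-level never fails but balloons the sequence.

```python
sentence = "the quick brown fox jumps over the lazy dog and says rizz"

# Word-level: short sequence, but anything outside the vocabulary becomes an <unk> hole.
vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "and", "says"}
word_tokens = [w if w in vocab else "<unk>" for w in sentence.split()]
print(len(word_tokens), word_tokens)   # 12 tokens, and 'rizz' collapses into '<unk>'

# Character-level: nothing is ever unknown, but the sequence balloons.
char_tokens = list(sentence)
print(len(char_tokens))                # 57 tokens for the same sentence
```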
This is where the breakthrough came: what if we merged character and word tokenization to create a Goldilocks in-between, a purgatory state of sorts?
Have Your Cake and Tokenize It Too (Subword Tokenization)
Byte-Pair Encoding (BPE): Survival of the Most Frequent
BPE is an iterative algorithm that builds its vocabulary by finding the most frequently occurring pair of adjacent symbols in the text and merging them into a single, new token. This process repeats for a set number of merges, allowing it to learn the most common word parts from the ground up.
Let's walk through a clear example with a small corpus:
(cake, 10), (cakes, 5), (caked, 4), (cakey, 3).
Step 1: Initialization
First, the algorithm breaks every word into its individual characters and adds a special end-of-word symbol, </w>, to mark word boundaries. Our initial vocabulary is simply the set of all unique characters:
['c', 'a', 'k', 'e', 's', 'd', 'y', '</w>'].
The corpus starts as:
'c a k e </w>' : 10
'c a k e s </w>' : 5
'c a k e d </w>' : 4
'c a k e y </w>' : 3
Step 2: Iterative Merging
Next, BPE scans the corpus and finds the most frequent adjacent pair of symbols. In our case, pairs like (c, a), (a, k), and (k, e) are all equally common. The algorithm merges one (e.g., a + k → ak), updates the corpus, and repeats the process.
This chain reaction is where the magic happens. After a few merges, the algorithm will have automatically discovered the most common root word by combining c + ak → cak, and then cak + e → cake.
Our corpus is now much simpler:
'cake </w>' : 10
'cake s </w>' : 5
'cake d </w>' : 4
'cake y </w>' : 3
The process continues, now learning common suffixes. It would see (cake, </w>) is frequent and merge it into cake</w>, then see (cake, s) is frequent and merge it into cakes.
The Result
The final vocabulary becomes a powerful mix of individual characters, common subwords (ing), and frequent whole words (the).
This is the power of BPE. When it sees an unseen word like cakewalk, it breaks it down into the parts it knows, resulting in
['cake', 'w', 'a', 'l', 'k', '</w>']. By intelligently combining learned roots and subwords, BPE can represent any word, creating the perfect balance between the extremes of word-level and character-level tokenization.
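Here's a compact sketch of that training loop (in the spirit of the classic Sennrich-style BPE pseudocode), run on the cake corpus from the walkthrough. Ties between pairs are broken arbitrarily, so the exact merge order may differ slightly from the steps above, and a production tokenizer adds byte-level fallbacks, pre-tokenization, and special tokens on top.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# The toy corpus from the walkthrough: each word split into characters plus '</w>'.
corpus = {
    ('c', 'a', 'k', 'e', '</w>'): 10,
    ('c', 'a', 'k', 'e', 's', '</w>'): 5,
    ('c', 'a', 'k', 'e', 'd', '</w>'): 4,
    ('c', 'a', 'k', 'e', 'y', '</w>'): 3,
}

merges = []
for _ in range(5):   # the number of merges is the key hyperparameter
    best_pair = get_pair_counts(corpus).most_common(1)[0][0]
    merges.append(best_pair)
    corpus = merge_pair(corpus, best_pair)

print(merges)   # learns the root first, then the suffixes, ending with pairs like ('cake', '</w>') and ('cake', 's')
print(corpus)   # each word is now represented by a handful of learned subwords
```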
Exposing the Cracks
Now that we understand how modern tokenization works, let's return to the mysteries from the beginning. We can now see exactly how this "unseen foundation" causes the cracks in an LLM's logic.
1. All-Knowing, but Can't Spell
Remember the googling example? An LLM fails to reverse it because it never sees the individual letters. Its tokenizer splits the word into common subwords it has learned, like ['goo', 'gling']. To the model, it's just two chunks, not eight characters. You can't reverse a word from a blurry photo of its halves.
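You can see the problem in two lines of Python (the ['goo', 'gling'] split is illustrative): reversing the chunks is not the same as reversing the characters.

```python
subword_view = ["goo", "gling"]           # what the model "sees" (illustrative split)

print("".join(reversed(subword_view)))    # 'glinggoo' -- reversing the chunks
print("googling"[::-1])                   # 'gnilgoog' -- reversing the characters
```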
2. The Uncountable Numbers
Tokenization is disastrous for numbers. A common year like 2025 might be a single token, but an arbitrary number like 29999 gets shattered into pieces like ['29', '999']. The model sees a jumble of numerical fragments, not a coherent number line, making consistent arithmetic impossible.
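You can poke at this yourself with tiktoken (assuming it's installed); the exact pieces depend on the encoding, so treat the splits as illustrative rather than gospel.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for number in ["2025", "29999", "123456789"]:
    pieces = [enc.decode([i]) for i in enc.encode(number)]
    print(number, "->", pieces)   # e.g. '29999' -> a couple of digit chunks, not one coherent number
```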
3. The SolidGoldMagikarp Jailbreak
The string "SolidGoldMagikarp" became a single, unique token due to its frequency on sites like Reddit. To the LLM, this isn't a word but a single numerical ID. By sheer chance, this token's numerical representation (embedding) pushes the model into a state that bypasses its safety filters. It's not a magic word; it's an accidental numerical key.
4. The Triple Life of 'egg'
This is a direct result of how tokens are created. The tokenizer is case-sensitive and space-aware.
egg (at the start of a sentence) might be Token ID #5000.
Egg (capitalized) is a different string, so it gets its own ID, #8000.
egg (with a leading space) is also a different string, so it gets another ID, #9000.
To the LLM, these are three completely distinct and unrelated numerical inputs, just as different as the tokens for "car," "boat," and "plane."
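The IDs above (#5000, #8000, #9000) are made up for illustration, but the three-way split is real. With tiktoken you can watch it happen; the actual IDs just depend on which encoding you load.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for variant in ["egg", "Egg", " egg"]:
    print(repr(variant), "->", enc.encode(variant))
# Three different strings, three different ID sequences -- the model has to learn
# from data that they all refer to the same breakfast ingredient.
```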
5. The Whitespace Revolution
Early models were terrible at coding because of a critical, invisible detail: whitespace. In languages like Python, indentation is semantically crucial. Early tokenizers failed at this, treating a four-space indent as four separate, meaningless tokens [' ', ' ', ' ', ' ']. This wasted precious context space and destroyed the code's logical structure.
GPT-4's tokenizer, trained on vast amounts of code, is smarter. It recognizes common indentation patterns as a single, meaningful token ['    ']. This simple change preserves the code's structure and is vastly more efficient, directly contributing to the dramatic leap in coding and logical reasoning abilities we see today.
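You can measure the difference yourself with tiktoken by comparing an older encoding against a newer one; the exact counts depend on the encodings, so the comparison is the point, not the specific numbers.

```python
import tiktoken

snippet = "def f():\n    if True:\n        return 1\n"

for name in ["gpt2", "cl100k_base"]:     # older-style vs newer encoding
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(snippet)
    print(f"{name:12s} {len(ids):3d} tokens  {[enc.decode([i]) for i in ids]}")
# The newer encoding tends to fold runs of indentation into far fewer tokens.
```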
The Quest to End the Suffering
The quest for a token-free future involves models that read the raw bytes of text directly. This gives a universal vocabulary of just 256 byte values and eliminates the "out-of-vocabulary" problem entirely.
To handle the incredibly long sequences this produces, models like Megabyte use patching: they break the long stream of bytes into small chunks. A "local" model processes each patch, and a "global" model then reads these patch summaries to understand the big picture.
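Here's a rough sketch of just the patching idea (not the Megabyte architecture itself): turn the text into raw bytes, then slice the stream into fixed-size patches for a local model to consume.

```python
def byte_patches(text: str, patch_size: int = 8) -> list[bytes]:
    """Encode text as raw UTF-8 bytes and slice the stream into fixed-size patches."""
    data = text.encode("utf-8")            # the universal 256-value 'vocabulary'
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

print(byte_patches("hello, i love eating cake"))
# A 'local' model would process each patch, and a 'global' model would then attend
# over the per-patch summaries to stitch the whole sequence back together.
```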
However, this isn't a silver bullet. The two-step process is slower and more computationally expensive. It also creates an information bottleneck, as the global model can lose crucial details - like reading chapter summaries instead of the book itself.
Imagine this: a neural tokenizer that understands semantics.
Instead of a fixed, greedy algorithm like BPE, what if a small neural network learned the optimal way to segment text on the fly? This "soft" or "probabilistic" tokenizer could be more semantically aware. For example, it could learn that un- is a prefix meaning "not" and correctly segment unhappiness into ['un', 'happiness'] based on meaning, not just frequency.
But it all comes crashing down when you consider the problems. The biggest is that such tokenizers are often non-deterministic - they have a touch of randomness. Give the model the same sentence twice, and it might split it differently each time.
A Dream of a Token-Free World
Until that day comes, we are stuck with the question: To Byte or not to Byte? Tokenization remains a necessary evil, a compromise so fundamental it follows you into your dreams.
Last night, I dreamed a dream where life would be different from this hell we're living. In it, I had created the perfect tokenizer: one that understood semantics, had no out-of-vocabulary issues, was perfectly deterministic, and computationally cheap. Better yet, in that dream, I found a method that didn't require tokenization at all.
A world free from tokenization is a world with no pain. We aren't there yet, but maybe we'll get there soon.