Why LLMs Can't Count Letters (And How They Fake It)

#ai #llm #machinelearning #beginners

You've probably noticed this:

Ask a language model to count the letters in "strawberry" and it'll often confidently give you the wrong answer.
Ask it to rhyme "entrepreneur" and it fumbles.
Ask it to fix a typo in "teh quick brown fox" and it handles it without blinking.

Same model. Wildly different results. Why?

The answer isn't "the model is bad at spelling." It's in how these models actually read text, and it's not what most people assume. Once you see the mechanism, you can predict which of these tasks a model will nail and which it'll botch, before you even run it.

The model never sees your text

When you type a message into an LLM, the model doesn't receive a string. It receives a sequence of integers.

Before your input ever touches the neural network, it passes through a tokenizer: a separate component that converts raw text into token IDs. The string "strawberry" might become [496, 19772]. That's what the model works with.
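To make that concrete, here is a minimal sketch of the text-to-IDs step. The vocabulary and the `toy_encode` helper are invented for illustration, and the IDs simply reuse the example numbers above. Real tokenizers apply learned merge rules rather than this greedy longest-match lookup.

```python
# Toy vocabulary: the pieces and IDs are made up for illustration,
# not taken from any real tokenizer.
TOY_VOCAB = {
    "straw": 496, "berry": 19772,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def toy_encode(text: str) -> list[int]:
    """Greedy longest-match lookup, a stand-in for a real tokenizer."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(toy_encode("strawberry"))  # [496, 19772]
```

Everything downstream of this call works on the list of integers, not on the string.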

The tokenizer is trained separately, before the LLM itself, and then frozen. The model never updates it. It just inherits whatever mapping the tokenizer learned. So the very first thing that happens to your prompt is a translation the model has no say in, and one that hides the individual characters from it.

How that mapping gets decided: BPE

The algorithm most modern LLMs use is called Byte Pair Encoding (BPE).

Here's the mechanism in one sentence: BPE starts with individual characters (or, in byte-level variants, individual bytes) as the vocabulary, then repeatedly merges the most frequent adjacent pair of tokens in the training corpus, until it hits a target vocabulary size.

So it starts with the characters: s, t, r, a, w, b, e, r, r, y

Then it notices which adjacent pairs show up together constantly and merges them, one pair at a time: over several merges, s·t·r·a·w collapses into straw and b·e·r·r·y collapses into berry. And so on, until the vocabulary reaches its target size (GPT-4's tokenizer has roughly 100,000 tokens).

The result: "strawberry" might tokenize as ["straw", "berry"], not because those are linguistically meaningful units, but because those byte sequences were frequent enough in training data to survive the merges.
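The merge loop described above can be sketched in a few lines. This is a simplified character-level version under stated assumptions: the `train_bpe` name and the word-frequency table are hypothetical, and production tokenizers add byte-level handling and pre-tokenization rules that are left out here.

```python
from collections import Counter

def train_bpe(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules from a {word: frequency} table."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

# Hypothetical frequencies: "berry" is more common than "straw",
# so its pairs win the first four merges.
merges = train_bpe({"berry": 5, "straw": 3}, num_merges=4)
print("".join(merges[-1]))  # berry
```

Nothing in the loop knows about words or grammar. The only signal is how often two neighbouring symbols appear together.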

This is the key intuition: the splits track frequency, not grammar. Common words like "the" become a single token. Rare or technical words get shredded into pieces. A word like "thransformer" (a typo) might become 4 or 5 tokens the model has never seen grouped that way.
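A small sketch shows the effect. The merge list below is hypothetical, written as if it had been learned from a corpus where "the" and "transformer" are common and "thransformer" never appears. The `bpe_encode` helper is a simplified encoder that replays merges in the order they were learned.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply merge rules to one word, in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                # Fuse the pair in place and re-check from the same index.
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# Hypothetical merge rules (not from any real tokenizer).
MERGES = [
    ("t", "h"), ("th", "e"), ("e", "r"), ("t", "r"), ("a", "n"),
    ("tr", "an"), ("o", "r"), ("f", "or"), ("m", "er"),
    ("for", "mer"), ("tran", "s"), ("trans", "former"),
]

print(bpe_encode("the", MERGES))           # ['the']
print(bpe_encode("transformer", MERGES))   # ['transformer']
print(bpe_encode("thransformer", MERGES))  # ['th', 'r', 'an', 's', 'former']
```

One stray "h" blocks the merges that would have built "tran", so the typo falls apart into five fragments while the correct spelling stays whole.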

The information that gets thrown away

Here's the thing that matters: once a sequence of characters becomes a token ID, the characters are gone.

The model sees 19772 (or whatever integer maps to "berry"). It does not see b, e, r, r, y. There's no "look inside the token" operation. No character-level access. The spelling still lives in the tokenizer's lookup table, but the sub-token structure is dropped at tokenization and never passed forward to the network.

This is why counting letters is hard. When you ask "how many R's in 'strawberry'?", the model has to reconstruct character-level information from statistical patterns in training data, even though strawberry is just two opaque integers to it. It might have seen this exact question answered correctly often enough to get it right. It might not. There's no reliable mechanism, just pattern matching over a destroyed signal.
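The gap is easy to see in code. With the string in hand, counting is a one-liner. With only the IDs, the count depends on a lookup table that belongs to the tokenizer, not to the model's input. The `ID_TO_TEXT` table and both function names are hypothetical and reuse the illustrative IDs from earlier in this post.

```python
# Hypothetical slice of a tokenizer's ID-to-text table.
ID_TO_TEXT = {496: "straw", 19772: "berry"}

def count_letter_in_text(text: str, letter: str) -> int:
    """Character-level access: trivial when you have the string."""
    return text.lower().count(letter.lower())

def count_letter_in_ids(token_ids: list[int], letter: str) -> int:
    """Only works because we consult the lookup table the model never receives."""
    return sum(count_letter_in_text(ID_TO_TEXT[t], letter) for t in token_ids)

print(count_letter_in_text("strawberry", "r"))   # 3
print(count_letter_in_ids([496, 19772], "r"))    # 3
```

Nothing about the integers 496 and 19772 themselves says "three R's". The model has to have learned that association indirectly.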

Same reason rhyming is unreliable. Rhymes are about sound patterns, and the cues to how a word sounds sit in its spelling, below the token level. The model has no direct access to those. It's working from memory of what it's seen rhymed before, not from actual phoneme analysis.

Quick gut-check: before reading on, predict it yourself. Spelling a word backwards, easy or hard for an LLM? (Hard. Same reason as counting letters: it requires character-level access the token ID doesn't carry.)

So how does it handle typos?

This is the interesting part, and the apparent contradiction.

If the model has no character-level access, how does "teh quick brown fox" get correctly interpreted as "the quick brown fox"?

The answer is attention.

Attention lets the model relate each token to the other tokens in the sequence. "teh", even though it has a different integer ID than "the" (and may even be split into more than one token), appears surrounded by the same context tokens: "quick", "brown", "fox". The model has seen "the" in contexts like this an enormous number of times during training. The attention mechanism builds a contextual representation of "teh" that ends up in the same neighborhood as "the" in the model's internal space.

It's not correcting the typo. It's representing the token based on what surrounds it, and that representation ends up semantically equivalent.
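Here is a toy sketch of that context-mixing effect. It makes strong simplifying assumptions: hypothetical one-hot embeddings, a single attention head with identity query/key/value projections, and no causal mask. A trained model relies on learned weights instead, so this only illustrates the mechanism, not the real numbers.

```python
import math

# Hypothetical one-hot embeddings: every token starts orthogonal to
# every other, so "teh" and "the" have raw cosine similarity 0.
TOKENS = ["the", "teh", "quick", "brown", "fox"]
EMBED = {
    tok: [1.0 if i == j else 0.0 for j in range(len(TOKENS))]
    for i, tok in enumerate(TOKENS)
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def attend(sequence, position):
    """Scaled dot-product self-attention output for one position."""
    vectors = [EMBED[tok] for tok in sequence]
    query = vectors[position]
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in vectors]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over the sequence
    # Weighted sum of the value vectors (identity projection).
    return [
        sum(w * v[d] for w, v in zip(weights, vectors))
        for d in range(len(query))
    ]

raw = cosine(EMBED["teh"], EMBED["the"])
contextual = cosine(
    attend(["teh", "quick", "brown", "fox"], 0),
    attend(["the", "quick", "brown", "fox"], 0),
)
print(f"raw: {raw:.2f}, in context: {contextual:.2f}")  # raw: 0.00, in context: 0.55
```

The two tokens start with nothing in common, yet after one round of mixing with identical neighbours their representations overlap substantially.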

This works because "teh" is a common typo. It appears frequently enough in training data, in the same contexts as "the", that the model has learned to bridge them.

Where this breaks down

Apply the same logic and you can predict where it'll fail. Three places it gets shaky:

A misspelled rare technical word, like "thransformer" instead of "transformer", is much harder to recover. The correct word appears less frequently in training data to begin with, and the broken form has even less signal. Attention can try, but the weights aren't there to make a confident bridge.
A misspelling in an ambiguous context, where multiple words could fit, is also risky. The context constraint that makes "teh" obvious doesn't apply when the surrounding tokens don't narrow things down.
Anything that needs precise character-level reasoning, like counting letters, detecting palindromes, generating anagram lists, or strict rhyming, is fundamentally unreliable. Not because the model is dumb, but because the information was structurally discarded before the model ever ran.

The mental model worth keeping

BPE and attention end up complementary in practice:

BPE trades character structure for compression efficiency. Fewer tokens means shorter sequences, which means cheaper and faster inference.
Attention partially compensates by using context to recover meaning that the token IDs alone don't carry.
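A back-of-the-envelope sketch of the compression trade-off: self-attention compares every position with every other position, so its cost grows with the square of the sequence length. The four-token split below is a hypothetical stand-in for real BPE output, and `attention_pairs` is an illustrative helper.

```python
def attention_pairs(sequence_length: int) -> int:
    """Number of query-key comparisons in one full self-attention pass."""
    return sequence_length * sequence_length

text = "the quick brown fox"
char_len = len(text)            # 19 characters, spaces included
token_len = len(text.split())   # 4 tokens, assuming one token per word

print(attention_pairs(char_len))   # 361
print(attention_pairs(token_len))  # 16
```

Under these assumptions, working on tokens instead of characters cuts the comparison count by more than 20x for this one short phrase.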

The system works well for what language is mostly made of: common words, common patterns, common errors. It degrades gracefully, not catastrophically, on the edges.

But it's worth knowing where those edges are. Next time a model struggles with something that seems trivially easy (count the letters, spell this word backwards, does this rhyme with that), it's usually not a reasoning failure. It's a tokenization constraint, and more compute alone doesn't reliably fix it.

The information was thrown away before the model ever woke up.

Go deeper

Tokenization is step one of the transformer pipeline. If you want the whole stack, meaning embeddings, positional encoding, attention internals (Q/K/V), the feed-forward network, the residual stream, and the next-token loop, 0xkato wrote an excellent from-the-ground-up walkthrough: How LLMs Actually Work. It picks up right where this post leaves off.
