How Many R's in Strawberry? Your AI Has No Idea Why That's Hard

Abhilash Rao Mesala

How many R's are in the word "strawberry"?

If you ask ChatGPT, there's a decent chance you'll get back a confident "two." Push back, and the model apologizes, says three, and sometimes flips back to two if you keep prodding.

It's funny in a very specific way. Here's a system that can write working code in a dozen languages, summarize a legal brief, and walk a six-year-old through general relativity. But it can't reliably count the R's in a fruit my niece can spell.

The reason this happens isn't what most people assume. The model isn't "bad at counting." It isn't really looking at the word the way you and I are. And once you understand why, a whole bunch of other weird LLM behavior starts making sense.

The thing nobody tells you about how LLMs read

When you read the word "strawberry," your brain processes it as a sequence of letters. S, t, r, a, w, b, e, r, r, y. Ten characters, three R's, easy enough that a child can do it.

The model does not see it that way at all. The model doesn't see letters. It sees tokens.

A token is a chunk of text that the model treats as a single unit. Sometimes a token is a whole word. Sometimes it's part of a word. Sometimes it's a couple of characters or just one. Every model has a fixed vocabulary of these tokens, usually somewhere between 30,000 and 200,000 of them, and that vocabulary is the entire universe of building blocks it knows about.

When you type "strawberry" to GPT-4, the tokenizer actually breaks it into this:

"str" + "aw" + "berry"

Three tokens. Not ten letters. Three chunks, with token IDs 496, 675, and 15717.

That's the whole ballgame right there. The model never gets to look at individual R's because individual R's aren't what it sees. It sees "str," "aw," and "berry," asks itself "how many R's are in those three chunks?" and has to guess based on patterns it picked up during training. Sometimes it guesses right. Often it doesn't.

You can verify this yourself in about thirty seconds:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print(ids)                              # [496, 675, 15717]
print([enc.decode([i]) for i in ids])   # ['str', 'aw', 'berry']

Or just paste the word into OpenAI's tokenizer page and watch it split.

How does the tokenizer decide what's a token?

The most common method is something called Byte Pair Encoding, or BPE. The algorithm is honestly pretty elegant once you see it.

You start with a massive pile of text, like billions of words of internet content. Every character starts as its own token. Then you look for the pair of tokens that appears next to each other most often, and merge them into a single new token. Then you repeat. Thousands of times.

In the early rounds, BPE notices that "t" and "h" show up next to each other constantly because "the" is in pretty much every sentence ever written. So it merges them into "th." A few rounds later it merges "th" and "e" into "the." Eventually common words become single tokens. Less common words stay as fragments. Really rare words might be three or four tokens.
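If you want to see the mechanics, here's a toy version of that merge loop in Python. Treat it as a minimal sketch under simplified assumptions: real tokenizers like the one behind tiktoken operate on bytes, weight words by how often they occur in the corpus, and add a lot of machinery this leaves out.

import tiktoken  # only needed if you want to compare against a real tokenizer
from collections import Counter

def train_bpe(words, num_merges):
    # Every word starts out as a list of single-character tokens.
    splits = {w: list(w) for w in words}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens appears.
        pairs = Counter()
        for tokens in splits.values():
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Merge the most frequent pair into a single new token everywhere.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w, tokens in splits.items():
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            splits[w] = out
    return merges, splits

merges, splits = train_bpe(["the", "there", "straw", "strawberry", "string"], 8)
print(merges)   # the merge rules, in the order they were learned
print(splits)   # how each word ends up split after those merges

Run it on a bigger word list and you'll watch frequent fragments snowball into whole-word tokens while rare words stay in pieces.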

"Berry" shows up everywhere in training data. Strawberry, blueberry, raspberry, blackberry, cranberry. So "berry" earns its place as its own token early on. The rest of "strawberry" is where things get weird. You might expect "straw" to also be a single token, but it isn't. The training process landed on "str" and "aw" instead, probably because "str" shows up in tons of other common words like street, strange, story, string, and strong, and got promoted to its own token before "straw" did. So "strawberry" ends up cut into three pieces in a way that wouldn't be your first instinct.

It's a really clever compression scheme. It lets the model handle any word, even ones it has never seen before, by breaking them into known pieces. The cost is that the model loses direct access to the underlying characters.

Why this breaks way more than letter counting

Once you understand tokenization, a bunch of other LLM quirks start clicking into place.

Math. Numbers get tokenized in surprising ways. "1234" might be a single token. "12345" might be three tokens. The model is doing arithmetic on chunks that have no clean mathematical meaning, which is part of why it can crush word problems but fumble basic multi-digit multiplication.
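You can poke at this with the same tiktoken setup from earlier. I'm not promising any particular split here, since it depends entirely on the tokenizer, which is sort of the point:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for n in ["7", "1234", "12345", "3.14159"]:
    ids = enc.encode(n)
    # Print the number, how many tokens it became, and the chunks themselves.
    print(n, len(ids), [enc.decode([i]) for i in ids])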

Languages costing more. OpenAI charges by the token. English text averages about one token per word. Languages like Japanese, Thai, or Burmese can take three or four times as many tokens for the same amount of content, because those scripts were underrepresented when the tokenizer vocabulary was being built. Your API bill in Tokyo looks very different from your API bill in Texas.
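You can measure this directly by tokenizing sentences with roughly the same meaning in different languages and comparing the counts. The Japanese line below is my own rough translation, so take the exact ratio with a grain of salt:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "Thank you very much for your help.",
    "Japanese": "ご協力いただきありがとうございます。",
}
for lang, text in samples.items():
    # Roughly the same message, but the token counts (and the bill) differ.
    print(lang, len(enc.encode(text)))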

Typo behavior that doesn't make sense. Sometimes you misspell a word and the model rolls right through it. Other times one extra letter throws it completely off the rails. That's because some misspellings still tokenize into recognizable chunks, while others shatter into a pile of unfamiliar fragments that the model has weaker associations for.

The glitch token problem. A few years back, researchers found that certain rare tokens, like usernames that appeared in scraped Reddit data but almost never showed up in normal text, would cause GPT-3 to behave bizarrely. The token existed in the vocabulary, but the model had basically no training signal for it. So prompts containing the token "SolidGoldMagikarp" would produce gibberish, refusals, or weirdly off topic tangents. People still find new glitch tokens occasionally.

Can you actually fix the strawberry thing?

Kind of, but not really.

The easiest workaround is to ask the model to spell the word out first. Something like "spell strawberry letter by letter, then count the R's." This works because it forces the model to produce one token per letter on its way to the answer, and once the letters are out there in the output, the model can count them. It's a little like asking someone to do long division on paper instead of in their head. The trick works, but you had to know to ask for it.
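You can see why this helps by tokenizing the spelled-out version. Same tiktoken setup as before; the exact splits depend on the tokenizer, but spacing the letters out gets you close to one token per character, which is a sequence the model can actually count over:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
compact = enc.encode("strawberry")
spelled = enc.encode("s t r a w b e r r y")
print([enc.decode([i]) for i in compact])   # a few multi-letter chunks
print([enc.decode([i]) for i in spelled])   # roughly one token per letter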

Some newer models are trained with character level awareness baked in, or use hybrid approaches that mix token level and character level processing. But pure character level models are dramatically slower and more expensive to run, which is why nobody is shipping them at scale.

There's also been progress from a different direction. OpenAI's o1 family of reasoning models, which they internally nicknamed "Strawberry," can usually get the answer right. Not because the tokenization changed, but because the model is trained to think through problems step by step before answering, which is essentially the spelling-it-out workaround happening automatically inside the model. The nickname is the team's inside joke about the whole saga.

The honest answer is that this is a fundamental quirk of how modern LLMs are built. It's not a bug they're going to patch next week. It's closer to a feature of the architecture. The model trades fine-grained character awareness for the ability to process language efficiently at scale, and most of the time that trade is a great deal. The strawberry thing is just where the trade leaks.

The bigger lesson here

I think the most useful takeaway isn't "LLMs are dumb." It's that the model is operating on a representation of language that's fundamentally different from the one you're using when you read.

When you read a sentence, you see characters that make words that make meaning. The model sees tokens that have statistical relationships with other tokens. The fact that the output comes back as fluent English is a kind of magic trick. Underneath, the system is shuffling chunks of text it has never really decomposed into characters at all.

That gap is where most of the strange behavior lives. The hallucinations, the confident wrong answers about letter counts, the inability to reverse a string cleanly, the math mistakes that look like a child made them. None of it is the model being stupid. It's the model being asked to do something its architecture was never set up to handle well.

So next time someone laughs at ChatGPT for missing the R count, you can be the person at the party who explains why. Just maybe wait until they ask.

If you found this interesting, let me know in the comments. I'm planning to write more about the weird and surprising things going on under the hood of LLMs, and I'd love to hear what would be useful to dig into next.
