
Savannah Norem

How many 'r's are in strawberry? And do LLMs know how to spell?

Well, the short answers are three and kind of… but not really.

Any which way you cut it, there are three ‘r’s in strawberry. But different large language models (LLMs) are evidently struggling with this question. So let’s take a look at what they’re “saying”, how they’re justifying it, what some of the flaws are, and what the broader implications are for LLMs.

All of these screenshots come from me simply asking different LLMs “how many ‘r’s are in the word strawberry?”, and since LLMs are not deterministic, you may or may not get the same answers. But it’s definitely not just a me problem, and given that OpenAI’s new model coming this fall is reportedly being called Strawberry, it seems that people have taken note.

Can large language models spell?

LLMs from GPT to Jamba are saying that “strawberry” has two “r”s in it.

Side note: if you haven’t checked out lmarena.ai, you should. You’re able to pit models against each other anonymously and vote on which is better, which can also be super helpful for showing how bad they all are.

lmarena - two models both responding that strawberry has two 'r's

pi.ai also responding with two

To be fair, they're only sometimes wrong. But when they are, they sometimes double down, sometimes correct themselves, but never seem to know where the “r”s actually are.

gpt-4o responds with two, then changes and goes with three

pi.ai tells us there's a second 'e' in strawberry

Claude responds with two, simply skipping an 'r'

Pi actually adds a “second e” into strawberry...somewhere...? And Claude simply skips over one of the ‘r’s.

Claude had a particularly interesting response and said that there is no third “r” because it’s part of a “double r”. To be fair, when prompted about why that didn’t count, Claude backed up and said that for both spelling and counting purposes, a double “r” is in fact two “r”s, and therefore strawberry has three.

Claude responds that there is no third 'r' because it's part of a "double r"

So why is this hard?

It’s so intuitive to humans to count how many “r”s are in strawberry, or how many “e”s are in timekeeper or “o”s in bookshop, but for LLMs there’s clearly a different story. There are a few different potential reasons for this, and they center around how LLMs decide what to say.

If you’re not familiar with how LLMs work, the briefest explanation is that they’ve looked at a lot of data and have an idea of what words go together and how sentences are supposed to look, but hallucinations and spelling mistakes occur because they’re basically playing a probability game of what word will come next. So for a counting task, like determining the number of "r"s in "strawberry," the model isn't directly calculating this count. Instead, it's predicting what the most likely correct answer would be based on similar patterns it has seen during training.

Probability
The most basic explanation is simply that the model determined “two” is the most likely word to come next. This effect could be especially pronounced with words like strawberry and others with double consonants: if I were googling how to spell strawberry, I’m probably wondering whether “berry” has one “r” or two, which could skew the training data toward “two”.

If you’re not deep in tech, this is probably the only answer you really need. Since LLMs are trained on data, and asking “how many ‘r’s are in the word strawberry?” is not a particularly common question, there probably aren’t a lot of webpages out there that explicitly state “strawberry has three ‘r’s”, so it’s probably not something LLMs know a lot about.

See how much probability can be involved?
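To make the “probability game” a bit more concrete, here’s a toy sketch in Python. The vocabulary and the numbers are completely made up for illustration; real models choose from tens of thousands of tokens with learned probabilities, but the mechanics of “pick a likely next word” are the same.

```python
import random

# Made-up next-token distribution for the context:
# "How many 'r's are in the word strawberry? There are ..."
next_token_probs = {
    "two": 0.55,    # skewed by all the "one 'r' or two?" spelling questions online
    "three": 0.40,  # the actually correct answer
    "ten": 0.05,
}

tokens, weights = zip(*next_token_probs.items())

# Sampling means the answer can vary from run to run, and the most
# probable token isn't necessarily the correct one.
print(random.choices(tokens, weights=weights, k=1)[0])
```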

Tokenization
Digging a bit deeper is the issue of tokenization. LLMs work by creating “tokens” to represent words since they can’t actually read the same way you and I do. When it comes to "strawberry," the model might tokenize it in a way that splits the word into chunks like "straw" and "berry." Now, here's where the fun begins (or the confusion, depending on how you look at it). The model might focus on the "berry" part and count the 'r's in that token, potentially ignoring "straw" altogether. Or, it could tokenize "strawberry" into even smaller chunks, like "str," "aw," and "berry," which might lead to even more confusion about how many 'r's are actually in there.

And here's another twist: different LLMs might tokenize the word differently depending on how they were trained and what algorithms they use. This means that one model might handle "strawberry" fairly well, while another could completely fumble the task.
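If you want to see the chunking for yourself, tokenizer libraries let you inspect it directly. Here’s a sketch using the tiktoken package; the encoding name is just one common choice, and the exact splits will vary between models and encodings.

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("strawberry")
chunks = [enc.decode([t]) for t in token_ids]

# The model "sees" these chunks rather than individual letters,
# which is why a letter-level question is harder than it looks.
print(token_ids)
print(chunks)  # e.g. something like ['str', 'aw', 'berry'], depending on the encoding
```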

So how do you fix it?
There’s a whole field right now around prompt engineering—basically, the art and science of figuring out how to ask LLMs questions in a way that gets you the best possible answers. When it comes to getting LLMs to count the 'r's in "strawberry" correctly, a few tricks can help.

One approach is to be super specific in your prompt. Instead of just asking, “How many 'r's are in the word strawberry?” you might say something like, “Can you spell out the word 'strawberry' and count each 'r' as you go?” This way, you’re guiding the model to break down the word step by step, reducing the chances of it glossing over those pesky 'r's.
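As a rough sketch of what that looks like in practice, here’s how you might send that more specific prompt through the openai Python package (v1.x). The model name and the exact wording are just examples, not a recipe that guarantees a correct count.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Instead of asking for the count directly, guide the model to spell
# the word out and keep a running tally as it goes.
prompt = (
    "Spell out the word 'strawberry' one letter at a time, "
    "keep a running count of every 'r' you encounter, "
    "and state the final count at the end."
)

response = client.chat.completions.create(
    model="gpt-4o",  # example model name; use whatever you have access to
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```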

Pressing the models you’re interacting with to verify their answers, asking them to prove it with code, or telling them they’re wrong to see how they respond are all skills worth building if you’re trying to get the most out of the LLMs you use.

But here’s the thing: even with these strategies, LLMs might still trip up. Not all LLMs can run code in every environment, so even when a model generates code that would objectively give a different answer, it doesn’t necessarily know that.
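For reference, the code these models tend to generate here is usually a one-liner along these lines, and running it yourself outside the chat settles the question immediately:

```python
word = "strawberry"

# The direct, deterministic answer...
print(word.count("r"))  # 3

# ...or spelled out letter by letter, the way the step-by-step prompt asks for.
for position, letter in enumerate(word, start=1):
    marker = " <-- 'r'" if letter == "r" else ""
    print(f"{position}: {letter}{marker}")
```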

gpt-4o initially responds with two, but corrects to three after providing a code solution

lmarena shows two models, one answering three and one answering two, that both say their code correctly produces their answer, even though their answers are different and the code is essentially the same

That’s because the root of the problem lies in how these models are built and trained. They’re not perfect, and they weren’t designed to be perfect at tasks like counting letters. So the real fix isn’t just about tweaking your prompts—it’s about understanding what LLMs are good at (and what they’re not so good at) and knowing when to step in and double-check their work.

Final Thoughts

While it might seem like just a fun quirk, these errors underscore some significant challenges in relying on LLMs for more critical tasks. If an LLM struggles with something as simple as counting letters in a word, what does that say about its reliability in more complex, nuanced situations? These quirks highlight the importance of human oversight and the need for users to be aware of the limitations of AI, especially when accuracy is crucial.

The “strawberry” question is a fun, if slightly concerning, example of how even the most advanced AI can trip up on simple tasks. As developers, users, and enthusiasts, it’s essential to approach LLMs with both excitement and caution. Understanding their strengths and weaknesses allows us to leverage these tools effectively while avoiding potential pitfalls.

Try it yourself: Experiment with other words or phrases and see how LLMs handle them. Ask them to count letters in "bookkeeper," "committee," or any other word with a tricky spelling. Share your findings in the comments—I’d love to see what results you get!
Think critically: As you use LLMs in your work or daily life, keep in mind that they’re not infallible. Use these tools wisely, and always be prepared to double-check their output.
Join the conversation: What do you think about the broader implications of these errors? Have you encountered similar quirks in other AI models? Share your thoughts and experiences in the comments below.
