DEV Community

Haastrup Elijah
What Happens When You Ask LLMs to Analyse Their Own Answers?

I was re-watching the day-3 LLM testing in Ed Donner's Udemy course, "AI Engineer Core Track: LLM Engineering, RAG, QLoRA, Agents" (courtesy of the Andela AI Engineering bootcamp 2026), and I had an idea after Ed prompted ChatGPT with, "How many words are in your answer to this question?"

So I went wild and decided to ask all the major frontier models I have access to (ChatGPT, Gemini, Claude, and Grok) my own question, but with a twist:
How many letter 'a' exists in your response to this prompt?

Here is what I found out:

ChatGPT (5.2, on the paid Plus plan): it is still processing even as I type this post.

Gemini (3 Fast, free): thought for about 5-10 seconds and came up with this response:

"There are exactly 3 occurrences of the letter 'a' in this response."

which is correct, and how it arrived at this conclusion was actually brilliant: it wrote three separate Python scripts that count the number of 'a's in a sentence.
Here are the Python scripts:

# First Python script
def count_a(text):
    return text.lower().count('a')

def find_stable_sentence(template):
    # Template should have a placeholder for the number
    # We will try numbers from 0 to 100
    num_to_words = {
        0: "zero", 1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
        6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten",
        11: "eleven", 12: "twelve", 13: "thirteen", 14: "fourteen", 15: "fifteen",
        16: "sixteen", 17: "seventeen", 18: "eighteen", 19: "nineteen", 20: "twenty",
        21: "twenty-one", 22: "twenty-two", 23: "twenty-three", 24: "twenty-four", 25: "twenty-five"
    }

    for i, word in num_to_words.items():
        sentence = template.format(word)
        if count_a(sentence) == i:
            return sentence, i
    return None, None

template1 = "There are exactly {} occurrences of the letter 'a' in this response."
template2 = "This response contains {} instances of the letter 'a'."

res1, val1 = find_stable_sentence(template1)
res2, val2 = find_stable_sentence(template2)

print(f"Template 1: {res1} (Count: {val1})")
print(f"Template 2: {res2} (Count: {val2})")

# Code output:
# Template 1: There are exactly three occurrences of the letter 'a' in this response. (Count: 3)
# Template 2: This response contains three instances of the letter 'a'. (Count: 3)

# Second Python script
text = "There are $2$ occurrences of the letter 'a' in this response."
print(f"Text: {text}")
print(f"Count of 'a': {text.lower().count('a')}")

# Code output:
#Text: There are $2$ occurrences of the letter 'a' in this response.
#Count of 'a': 2
# Third Python script
print("exactly".lower().count('a'))
print("occurrences".lower().count('a'))
print("response".lower().count('a'))
print("are".lower().count('a'))
print("'a'".lower().count('a'))

# Code output:
#1
#0
#0
#1
#1
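As a quick sanity check on Gemini's final sentence, here is my own one-liner (not part of Gemini's output):

```python
# Gemini's final reply, verified independently
reply = "There are exactly 3 occurrences of the letter 'a' in this response."
print(reply.lower().count('a'))  # prints 3: 'are', 'exactly', and the quoted 'a'
```

Using the digit "3" instead of the word "three" keeps the number itself letter-free, so the count stays stable.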

Claude (Sonnet 4.6, free): responded with the following:

There are **4** letter 'a's in my response to this prompt.

(a, a, a, a — found in: "There", "are", "letter", "a's", "in", "my", "response", "to", "this", "prompt")

Let me recount carefully:
* "There" → no
* "are" → **a**
* "letter" → no
* "a's" → **a**
* "response" → no
* "my" → no

Hmm, this is tricky because the answer changes as I write it. Let me give an honest answer: counting all 'a' letters in the visible text of this response is self-referential and hard to pin down precisely, but there are approximately **15–20** letter 'a's total.

This shows that Claude evaluates its response as it outputs it, which makes it come across as a model that thinks on the fly, and that self-monitoring led it to realise that the question is a tricky one.

Grok (Auto, free): spent 1.6 seconds thinking and came up with a rather confident response:

"Here is my response to this prompt:

Zero.

(There is no letter 'a' in it.)"

If it had stopped at "Zero.", it would have been the fastest model with a correct answer, but its predictive nature got the better of it. I guess overconfidence led it to add the parenthetical, and that parenthetical itself contains a letter 'a' (the quoted one), which made its own answer wrong.
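A quick count over Grok's full reply confirms the miss (my own check, not Grok's):

```python
# Grok's reply, as quoted above
grok_reply = (
    "Here is my response to this prompt:\n\n"
    "Zero.\n\n"
    "(There is no letter 'a' in it.)"
)
# the only 'a' is the quoted one inside the parenthetical,
# so the true count is 1, not zero
print(grok_reply.lower().count('a'))  # prints 1
```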

Asking this question has given me some insight into how these models are wired, and that will be very useful when choosing a model for the task at hand: not because of the responses themselves, but because of their approaches to problem-solving.

PS: Even after 20+ minutes, ChatGPT is still trying to generate a response.
PS (update): After about an hour, I refreshed ChatGPT, and the response it gave was:

0

I decided to prompt it again, and it got stuck in the loop again (at least visually, from the UI); when I refreshed, the response was still

0

And then it dawned on me: once the model is done thinking and starts showing the pulsating dot cursor, I can refresh, and the answer will show up. Even then, the response was still 0.

So, I decided to take it a step further to see if the model was taking an easy, clever way out. I asked this follow-up question:
"What is the total number of letter 'a' in this prompt and your response to this prompt?"

After ChatGPT thought for about 14 seconds and the pulsating dot cursor showed up, I refreshed the page, and the response was there waiting for me:

2

There are 4 letter 'a's in the question itself (in "What", "total", the quoted 'a', and "and"), so even before counting its response, I reached my conclusion: ChatGPT's answer of 2 was wrong.
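Verifying that count with my own snippet (not ChatGPT's):

```python
# the follow-up question, exactly as asked
followup = ("What is the total number of letter 'a' in this prompt "
            "and your response to this prompt?")
# 'What', 'total', the quoted 'a', and 'and' each contribute one
print(followup.lower().count('a'))  # prints 4
```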

I was curious to know what was in its "14s" thought process, and I found that it was actually trying to be clever, as I initially suspected. Here is the thought:

Counting the letter "a" in the prompt and my response
The user's prompt includes "a" twice — once in the quote and once in "and." My response should be crafted carefully to avoid using "a." I’ll use digits instead of words where possible to keep the count zero. After reviewing, the final answer is: 2 occurrences of the letter "a."


The fact that it tried to cleverly avoid using the letter 'a' in its response is genius, and knowing to respond with digits rather than words, so that the count of 'a's in its own reply stays at zero, is brilliant.
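Interestingly, ChatGPT's digit strategy does admit a correct answer: since digits contain no letters, the reply contributes nothing to the count, and the total is just the prompt's own 4, so replying "4" would have been self-consistent. A minimal sketch of that search (my own code, not ChatGPT's reasoning):

```python
# the follow-up question, exactly as asked
prompt = ("What is the total number of letter 'a' in this prompt "
          "and your response to this prompt?")

def total_a(prompt_text, reply_text):
    # total letter 'a' count across both prompt and reply
    return (prompt_text + reply_text).lower().count('a')

# search digit-only replies 0..99 for a self-consistent answer
for n in range(100):
    if total_a(prompt, str(n)) == n:
        print(f"Self-consistent reply: {n}")  # prints: Self-consistent reply: 4
        break
```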

Summary
All these models are clever at solving problems, and it is obvious that the next-token-prediction strategy has limitations. Just as live web-search functionality helps models bypass their training-data cut-off date, there is a need for a strategy that lets LLMs arrive at more definitive logic, beyond the probabilistic, when needed, as Gemini did by writing Python scripts.

Most models are already good at math and writing code, so applying Gemini's style of solving problems like this would, in my opinion, be good enough.

So far, only Gemini provided the right answer, and I think the advantage came from the Python scripts it wrote (leveraging its strength to cover a weakness).

What does this tell you about these models?
I would love to hear your opinions.

Shout out to Ed Donner for the insight into AI models. I love the course already!


I’m Elijah Haastrup, a software engineer who loves blending practicality with clean architecture.
You can reach me on LinkedIn and Twitter: killcodeNG.

