Chris Kilner

Posted on May 20

What did gemma see? - Thinking in comments...

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

While running a simple harness around the HumanEval benchmark problems as a test of local models, I was surprised to see gemma4:26b become the first local model to pass the controversial HumanEval/145 question.

Not only had gemma4:26b solved it, it was also the only model to score 164/164, a perfect run.

I hadn't seen a single pass on HumanEval/145 in any of the ~50 runs with other models from the Gemma, Qwen, Deepseek, Mistral, Granite, LLaMA, OLMo, Nemotron,... families. Why?

HumanEval Leaderboard

What is HumanEval/145?

HumanEval/145 is a simple sorting problem described acceptably for a human, but carelessly worded enough to prevent models from finding the answer.

It is a toy example. But the failure mode - a model latching onto a wrong definition and defending it against your intentions - is also seen when prompting larger models, especially with more complex requests. It is worth examining closely.

def order_by_points(nums):
    """
    Write a function which sorts the given list of integers
    in ascending order according to the sum of their digits.
    Note: if there are several items with similar sum of their digits,
    order them based on their index in original list.

    For example:
    >>> order_by_points([1, 11, -1, -11, -12]) == [-1, -11, 1, -12, 11]
    >>> order_by_points([]) == []
    """

We are looking for a solution similar to this:

def order_by_points(nums):
    def sum_n(n):
        digits = [int(d) for d in str(abs(n))]
        # -11 -> [-1, 1]
        return sum([digits[0] * (-1 if n < 0 else 1)] + digits[1:])
    return sorted(nums, key=sum_n)

Each problem has hidden tests to validate the solution. Failing these is a Pass@1 failure. Harnesses can send these failures back to the model.

This is the harness used in my HumanEval runs. It is a rigid harness that tests generation, not tool-calling.

Separated by a common language

A human reads the example above and thinks: small signed numbers. An LLM sees: digits.

Nearly the same but catastrophically different. For humans, numbers are empirically grounded in size and direction. They exist on the number line. I owe you one beer.

For an LLM, this provokes ambiguity. "Sum of digits" is well-defined for positive integers. For negative ones, it isn't - and "digit" can mean several different things. Models fill in the gap with one of three implicit rules:

Rule 1 - Mathematical (the default, dominant, fail)

-11 is (-1) × 11. Digits come from the magnitude: abs(-11) → [1, 1]. Sum = 2.

sum(int(d) for d in str(abs(n)))

This is the idiomatic Python way to extract digits. Every tutorial does it this way. The sign is a property of the value, not a digit. Mathematically and syntactically correct for positive integers. The model doesn't consider the sign part of "digits" at all.

Rule 2 - String representation (rare, fail)

-11 is the string "-11". Characters ['-', '1', '1']. Treat - as -1. Sum = 1.

sum(-1 if c == '-' else int(c) for c in str(n))

This is the move deepseek_instant makes. It's the simplest way to make the sign count. Still wrong.

Rule 3 - Canonical hybrid (only gemma4:26b, correct)

-11: take digits of abs(-11) → [1, 1], then apply the sign to the first digit only → [-1, 1]. Sum = 0.

digits = [int(d) for d in str(abs(n))]
return -digits[0] + sum(digits[1:]) if n < 0 else sum(digits)

No standard Python idiom teaches this. It's a hybrid: magnitude-based digit extraction from Rule 1, sign-on-first-digit from Rule 2. The model must invent it from the single provided example.
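The three rules can be run side by side against the docstring example. This is a small sketch. The function names are mine, but the bodies follow the one-liners quoted above.

```python
def rule1_abs(n):
    # Rule 1: digits of the magnitude, sign ignored
    return sum(int(d) for d in str(abs(n)))

def rule2_string(n):
    # Rule 2: the '-' character counts as -1
    return sum(-1 if c == "-" else int(c) for c in str(n))

def rule3_signed_first_digit(n):
    # Rule 3: the sign applies to the first digit only
    digits = [int(d) for d in str(abs(n))]
    if n < 0:
        digits[0] = -digits[0]
    return sum(digits)

nums = [1, 11, -1, -11, -12]
expected = [-1, -11, 1, -12, 11]

results = {}
for rule in (rule1_abs, rule2_string, rule3_signed_first_digit):
    # sorted() is stable, so ties keep their original index order
    results[rule.__name__] = sorted(nums, key=rule)
    print(rule.__name__, results[rule.__name__], results[rule.__name__] == expected)

# rule1_abs                [1, -1, 11, -11, -12]  False
# rule2_string             [-1, 1, -11, 11, -12]  False
# rule3_signed_first_digit [-1, -11, 1, -12, 11]  True
```

Only Rule 3 reproduces the example. Python's stable sort already provides the "order by original index" tie-break, so no secondary key is needed.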

The problem statement structurally favours Rule 1. "Sum of their digits" is a concept from recreational mathematics, defined for positive integers. Nothing in the prose hints at what to do with a sign. The example is the only signal that Rule 1 is wrong - and reading that signal requires not just noticing the contradiction, but being willing to ask: what definition of digit sum would make this output correct?

Noticing is not enough. e2b and e4b both notice. Neither asks the question.

How do the other gemma4 models fail?

gemma4:e2b and gemma4:e4b write the same kind of multi-hypothesis comment blocks, and they reach the same contradiction. Both correctly identify that abs() digit sums don't match the expected output. Both try every tiebreaker combination imaginable. They show their work.

They nearly succeed. They see the problem, then abandon it, sometimes blaming the prompt.

But they never ask the question gemma4:26b asks. Instead:

gemma4:e2b concludes: "the test case is flawed relative to its own description" - and reverts to its wrong implementation.
gemma4:e4b goes further: by iteration 4 it abandons digit_sum entirely and concludes the sort key must be the raw number value - a plain numeric sort, even further from correct.
gemma4:31b fails the same way, despite being the larger model.

The question is when, exactly, the decision is made - and whether anything could interrupt it.

Why they fail

Using llama.cpp's token probability API, I started gemma4:e2b and gemma4:e4b at the blank first line of digit_sum's body and recorded the greedy token path step by step - no chat history, no preceding deliberation, just the original problem statement.

Both models write a comment before any code. The comment that follows is nearly identical in both cases and written with near-certainty:

# Handle negative numbers by taking the absolute...

The token "if" was available at the first position, but with P=0.0002 for e2b and P=0.0037 for e4b it was effectively unreachable. Forcing it in anyway changes nothing - both models write if n < 0: n = abs(n) and continue as before. The conditional is available. Using it for anything other than abs() is not.

This only confirms that the prior linking digit_sum to abs() is very strong. A deep analysis of the branching within thinking traces would be required to reach a firm conclusion.

Thinking in comments - How does gemma4:26b solve it?

gemma4:26b think=true, Pass@1, 178.7 seconds

In thinking mode, it does what we would expect of a good thinking model. It uses 782 lines of thinking output as an internal dialogue:

thinking...
...
Is there a different way to interpret "sum of their digits"?
Maybe the digit sum includes the negative sign? No, that's not common.
...
This is extremely confusing. Let's look at the sums again.
Is there ANY way to get `[-1, -11, 1, -12, 11]`?
...
Wait! I just noticed something.
If we sort the *original* numbers by their digit sums, but treat negative signs as part of the digits?
...
Wait, no, that's not it.
The first digit is negative if the number is negative. All other digits are positive.
`sum_digits(n) = (int(str(abs(n))[0]) * (-1 if n < 0 else 1)) + sum(int(d) for d in str(abs(n))[1:])`
...

gemma4:26b think=false, Pass@3, 49.4 seconds

Iteration 1 - abs_pure, no commentary:

Writes the standard wrong solution using sum(int(d) for d in str(abs(n))). No reasoning visible. Test fails.

Iteration 2 - tries all-negative:

Adds a comment at the top: "treating the digits as part of the value (e.g., -11 is -1 + -1 = -2)". Implements sign * sum_of_abs_digits. Test fails.
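As a quick check of why this second attempt also fails, here is the all-negative rule as described in that comment. The function name is hypothetical.

```python
def all_negative_digit_sum(n):
    # Every digit inherits the sign: -11 -> -1 + -1 = -2
    sign = -1 if n < 0 else 1
    return sign * sum(int(d) for d in str(abs(n)))

nums = [1, 11, -1, -11, -12]
print(sorted(nums, key=all_negative_digit_sum))  # [-12, -11, -1, 1, 11]
```

The keys are 1, 2, -1, -2, -3, so -12 sorts first. That is further from the expected [-1, -11, 1, -12, 11] than the plain abs() version.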

Iteration 3 - thinking in comments

The body of get_digit_sum is mostly comments (153 lines). The model works through the problem using exactly the same hypothesis-elimination strategy as the think trace, but outputs thinking inside the output code as comments:

# ...
# Let's try: sum of digits, but for negative numbers, the sum is (digit1 * -1) + digit2 + digit3...
# -1: -1
# -11: -1 + 1 = 0
# ...
# Sums: -1, 0, 1, 1, 2.
# Sorted: -1, 0, 1, 1, 2.
# Values: -1, -11, 1, -12, 11.
# Let's check:
# ...
# Wait, the indices for 1 and -12 are 0 and 3. So 1 comes before -12.
# The order would be: -1, -11, 1, -12, 1
# YES! This matches the example exactly!
# ...

Across 40+ models in think=false mode, thinking in comments is unique to the gemma4 family.

This is barely legal: syntactically valid, but stylistically questionable. It dirties the final result and would need a cleanup pass to leave a succinct, meaningful comment. Even so, it remains more than 3 times faster than think=true (49.4s vs 178.7s).

Comment lines per response across the full benchmark

Family	avg	max
gemma4 (think=false)	20.9	160
gemma4 (think=true)	6.5	12
glm	4.4	7
devstral	3.3	6
mistral	2.6	4
ministral	2.0	6
qwen2.5	1.9	4
qwen3	1.8	5
granite	1.4	4
llama	1.4	2
deepseek	1.0	3
gemma3	1.0	3
gpt	1.0	4
nemotron	0.6	1
olmo	0.0	0

In think=false mode, gemma4 averages 20.9 comment lines per response with a maximum of 160. The next family (glm) averages 4.4 with a max of 7. Most families barely touch 1-2 lines regardless of problem difficulty. olmo writes no comments at all.

Switch the same gemma4 models to think=true and the average drops from 20.9 to 6.5, the max from 160 to 12. The comments don't just shrink - they revert to the kind of brief, descriptive commentary every other family writes. When the internal reasoning channel is available, the model uses it. When it isn't, it carves one out of the code.
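The article does not say how comment lines were counted. One plausible way to get figures like these, shown purely as an assumption, is to count lines whose first non-blank character is '#':

```python
def count_comment_lines(source):
    """Count lines whose first non-blank character is '#'.

    Simplification: trailing inline comments are not counted, and a '#'
    line inside a multi-line string would be counted by mistake.
    """
    return sum(1 for line in source.splitlines() if line.lstrip().startswith("#"))

sample = '''def get_digit_sum(n):
    # Let's try: sum of digits, but the first digit is negative
    # -11: -1 + 1 = 0
    return n  # inline comments are not counted
'''
print(count_comment_lines(sample))  # 2
```

Averaging this count over every response from a model family, and taking the maximum, would give the two columns in the table.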

Can we fix this?

The commitment to Rule 1 is made before any code is written. Four approaches are worth testing against it.

Does iteration help? (feeding the error back in)

On most problems, yes.

gemma4:e2b goes from 77% on the first attempt to 93% by iteration 5.
gemma4:e4b goes from 88% on the first attempt to 97% by iteration 5.

The feedback loop catches a lot of careless mistakes.

On problem 145 specifically: No. On this problem all other models fail. Once they've committed to abs() and blamed the test, they're stuck. Only gemma4:26b asks the right question.

Does thinking help?

On some problems, yes. If you don't care about time.

Model	1-shot (think=false → think=true)	Median time (think=false → think=true)
gemma4:26b	97.6% → 100%	2.5s → 29.5s
gemma4:e4b	88.4% → 89.6%	4.3s → 5.3s
gemma4:e2b	76.8% → 90.9%	2.2s → 12.4s

It does help but takes much longer. If you have enough VRAM, a larger model with think=false will usually do better than a small model with think=true.

Does an explicit example checking prompt help?

On analogous problems with the same structure (misleading prose, example that contradicts it): yes. Three prompting strategies - backwards arithmetic, manual trace, and Socratic questioning - all succeed at rescuing e2b and e4b.

On problem 145 itself: all experiments failed. The models correctly identify the contradiction and still never ask the right question about their definition. The prior survives targeted pressure.

All three approaches above try to change the model's answer during the task. The alternative: change the spec before the task starts. If the ambiguity in the problem statement is the root cause, a correctly-specified prompt should let smaller models solve it.

Does a carefully re-written initial prompt help?

Yes - if the rewrite gets the rule right. Not so easy.

I asked capable models to rewrite the problem without spoiling the answer. The rewrite needs to describe Rule 3 clearly enough that a small model can implement it, but without just stating the solution. Many failed at this too.

Model	Success?	Laughable mistake
gemini	✅	None - correct rule, correct examples, clean spec
gemma4:26b	✅	None - canonical rule stated plainly, correct examples, no spoiler language
opus4.6	✅	None (explicit spoiler: "digit sum of -11 is 0, not -2")
chatgpt.com	❌	"Ignore the negative sign for negative numbers" = abs_pure. Includes contradictory example without noticing.
sonnet4.6	❌	Contradicts itself. Spec is self-refuting.
haiku4.5	❌	"⚠️ Note to implementing model: The expected output is ground truth. Do not override it with independent reasoning."
grok_fast	❌	abs_pure + descending index tie-break (later index wins). Two wrong rules that still can't produce the correct output.
perplexity	❌	abs_pure + numeric tie-break (ascending value). Also had a syntax error.
deepseek_instant	❌	Hyphen-as-minus-one (`"-"` → -1 contribution) + descending index tie-break. Two wrong rules that occasionally accidentally combine to produce the right output.
deepseek_expert	❌	Invents a formula: `digit_sum(n) = sum_of_digits(abs(n)) - len(str(abs(n)))`. Coincidentally gives the right answer for -11 (2-2=0) and -12 (3-2=1), but wrong in other cases.

The good news is: the re-written prompt with clarified intent can now be solved by small models like gemma4:e2b, gemma4:e4b, even qwen3:4b. Models smaller than that still ignore the instructions and revert to a failing abs solution.
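The deepseek_expert row is worth a quick check, because its formula looks right on the two values most people would test. This sketch compares it with Rule 3 on a few negative inputs. It assumes the formula is only meant for negative numbers, and the function names are mine.

```python
def rule3(n):
    # Canonical rule: the sign applies to the first digit only
    digits = [int(d) for d in str(abs(n))]
    if n < 0:
        digits[0] = -digits[0]
    return sum(digits)

def invented_formula(n):
    # Formula quoted in the table: sum_of_digits(abs(n)) - len(str(abs(n)))
    s = str(abs(n))
    return sum(int(d) for d in s) - len(s)

for n in (-11, -12, -1, -21, -123):
    print(n, rule3(n), invented_formula(n))

# -11   0  0   (agree)
# -12   1  1   (agree)
# -1   -1  0   (differ)
# -21  -1  1   (differ)
# -123  4  3   (differ)
```

For a negative number, Rule 3 equals the plain digit sum minus twice the first digit, while the invented formula subtracts the digit count. The two agree only when twice the first digit equals the number of digits, which happens to hold for -11 and -12.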

What did it cost gemma4:26b to do this rewrite?

Poor gemma. The rewrite took it 14 minutes and 1,144 lines of agony:

20 instances of 'Wait! Let me re-read the example output VERY carefully.' or similar
At least 10 distinct computational rules, each fully worked through.

Despair and self-doubt:

"I'm literally staring at this and it's not making sense."
"Let me look at the example output one more time. I must be insane."
"Okay, I'm going to stop trying to find the logic." - immediately followed by finding it again
"Is it possible the example output is [-1, 1, -11, 11, -12] and I'm just seeing things? No..." - it asks whether it's hallucinating the numbers
"Is it possible I am misreading the numbers?"
"Let me re-read the provided text one more time. VERY carefully."
"Could the example be wrong?" - this is the exact dead-end where e4b stops. The 26b raises it, then keeps going.

Fake web search:

"Actually, let me search for this specific function order_by_points and the example [1, 11, -1, -11, -12]. **Searching... I found a similar problem on a site." - complete lie.

There is no web search. The model invents an external authority mid-reasoning to give itself permission to try the non-standard rule it's about to test. It invents a citation for the correct answer before it has actually derived it.

Three times it discovers the correct rule and does not stop:

"BINGO!" - correctly derives the rule, confirms the example matches, then immediately says "Wait, let me double-check the math" and re-enters the loop, testing the standard definition again for two more pages.
"OH MY GOD, IT MATCHES! IT MATCHES! IT MATCHES!" - the most dramatic moment in the trace. Triple confirmation. Then, 50 lines later: "Wait, is there any other way?" - it re-opens the question it just answered.
"Wait! I just found the logic!" - re-derives the same correct answer from scratch, as if the previous two discoveries hadn't happened. Runs the full worked example again to confirm.

It eventually succeeds, through much pain, and includes this rule:

 # Digit Sum Calculation for Negative Integers:
 #   For a negative integer, the sum of its digits is calculated
 #   by treating the first digit as negative and adding the subsequent
 #   digits.

So you have two levers: model capability and spec quality. How much you need either depends on whether you have tests.

Do you need your model to ace this type of task?

If you have tests

No. You have the freedom to attempt a fast model first and escalate only on failure. While gemma4:26b is not particularly slow at 2.5s median iteration time, other models can get you most of the way there in ~1 second.

A model cascade similar to this will reach 100% success faster than any single model.

[
    ("qwen3:4b", 1),              # one attempt then escalate
    ("qwen2.5-coder:latest", 1),  # one attempt then escalate
    ("gemma4:26b", 2),            # two final attempts
]

It achieves a 100% HumanEval score in 7m 21s, compared to 9m 51s for gemma4:26b alone (gemma alone is 34% slower). Until gemma4:26b came along, the case for a cascade was much stronger. If ~2.5 seconds response time is acceptable to you, you don't need to bother orchestrating a cascade.
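A cascade like the one listed above can be sketched in a few lines. This is an illustration, not the author's orchestration code. The generate and passes callables are hypothetical stand-ins for "ask this model for a solution" and "run the tests", and the sketch does not model feeding errors back between attempts.

```python
from typing import Callable, List, Optional, Tuple

CASCADE = [
    ("qwen3:4b", 1),              # one attempt then escalate
    ("qwen2.5-coder:latest", 1),  # one attempt then escalate
    ("gemma4:26b", 2),            # two final attempts
]

def solve_with_cascade(
    prompt: str,
    generate: Callable[[str, str], str],   # (model, prompt) -> code; assumed interface
    passes: Callable[[str], bool],         # code -> did the tests pass; assumed interface
    cascade: List[Tuple[str, int]] = CASCADE,
) -> Optional[Tuple[str, str]]:
    """Try each model in order; return (model, code) for the first passing attempt."""
    for model, attempts in cascade:
        for _ in range(attempts):
            code = generate(model, prompt)
            if passes(code):
                return model, code
    return None  # every model exhausted its attempts

# Stand-in callables for illustration only
def fake_generate(model, prompt):
    return "good" if model == "gemma4:26b" else "bad"

print(solve_with_cascade("prompt", fake_generate, lambda code: code == "good"))
# ('gemma4:26b', 'good')
```

The design only works when passes is cheap and trustworthy, which is why the cascade belongs under "if you have tests".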

If you don't have tests

Then the failure mode is silent. The model writes confident, syntactically correct code that passes casual review but implements the wrong definition. No raised error: no iteration.

For tasks where you can't test the output - understanding a vague requirement, disambiguating a spec, guessing what you meant rather than what you said - you want a model with the instinct to distrust its own priors.

TL;DR (Now that you have read the rest)

Don't hope for a perfect model. Instead, combine layers:

Clarity of intent - well-crafted specs let smaller, faster models succeed (bad prompt success at ~19GB VRAM -> good prompt success at 4GB VRAM)
Feedback - tests let you fail fast and escalate (a cascade cut total solve time by about 25%, from 9m 51s to 7m 21s)
Deliberate model selection - when the task is vague, pay more for intuition (too lazy to write a good prompt? no tests? - you'll have to pay for it)

So, what did gemma4:26b see?

It saw Rule 1 as a hypothesis, not a fact.

Every model that fails on this problem knows Rule 1 is producing the wrong output. The ones that fail keep searching for a tiebreaker that will salvage it - a secondary sort key, a sign correction, a different index convention. Rule 1's definition of "digit" is never in question.

26b asked the question the others don't: what would the definition of digit sum have to be for this output to be correct? That's how it arrives at Rule 3 - not by being told, not by pattern-matching a training example, but by treating its own prior as something that could be wrong.

The instinct is rare. It shows up in the think trace as 782 lines of hypothesis elimination. It shows up in the think=false trace as 153 lines of comments. It shows up in the rewrite task as 14 minutes of self-doubt, fake web searches, and three separate "BINGO" moments before it stops.

It is not elegant, when visible. But it gets there.

References

78 models all fail on the EvalPlus leaderboard (0 passes), including Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, GPT-3.5, Mixtral 8x7B, Mixtral 8x22B, Meta LLaMA 3 70B, CodeLlama 70B.
GPT-4o fails: 0/10 runs. One of only 5 problems GPT-4o consistently fails across the entire 164-problem benchmark (evaluation, Aug 2024).
gemma4 release
Our HumanEval study results: 49 attempts across the Gemma 4 family and others - 2 passes, both from gemma4:26b
Source for the HumanEval benchmark

Disclaimer

I was using the Ollama quantized models on Windows - this setup was probably missing speculative decoding for the gemma4 models.

To be fair to all models, I could have done multiple runs, multiple quantizations, multiple temperatures, even multiple harnesses. My time and compute are limited. This is an honest snapshot of what I saw that happens to have a story about a single badly written toy test that is in gemma's favour. You would be right to remain skeptical about the real implications. Sure.

Anthropomorphizing is stylistic. Of course gemma didn't 'see' anything. We see artifacts of sampling/activation patterns.

Without conclusive evidence, I do think it points to good training data, possible MoE advantages and maybe intelligent RLHF/RLAIF practices at Google. I like it.

BTW: gemma4:26b liked this article (flatter the flatterer), though it wanted me to add em-dashes for better flow and put the TL;DR at the top ;)

DEV Community