thehwang

Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer.

The short version, in case the title was being coy: at num_ctx=2048, Gemma 4 E2B produces three sequential outputs in a single response — a mostly-hallucinated meeting summary, a Note: saying that summary isn't actually in the transcript, then a more careful retry. Three runs at temperature=0.0, identical pattern every time. Other E-class models in this envelope don't do this. The rest of this post is the 15-run ablation that found it, and why my last Gemma 4 article framed it wrong.

A couple of weeks ago I published a post for the Gemma 4 Challenge with what felt at the time like a confident, well-defended claim: Gemma 4 E2B, faced with a silently-truncated transcript, "detected" the problem and pushed back. I called this calibration. I called it useful. I went to bed pleased with myself.

Then two engineers showed up in the comments and politely set me on fire.

Daniel Nwaneri pointed out that "mix of unrelated topics" is a content claim, not a length claim — so the model is doing more than I was giving it credit for, but also: a self-contained paragraph isn't a meeting transcript, and I should run a truncated paragraph from the same session as the cleaner control before declaring victory.

vericum asked, very politely, whether I had published the harness — which I had not, because there was no harness, because I'd shipped the claim from a sample size of vibes.

So I built the harness. I ran the ablation. I am writing this post, which is a sentence I did not expect to be writing two weeks ago.

TL;DR: At num_ctx=32768, Gemma 4 E2B does not hedge on any input shape Daniel suggested as a control. The "calibration" I claimed was actually the num_ctx=2048 setting doing something I didn't notice the first time, which I'll get to in a minute, and which is honestly weirder than what I claimed.

The ablation

Six rows, length-matched within ~15%. temperature=0.0. Three runs each. Gemma 4 E2B via Ollama on a 16 GB M-series Mac.

| Row | Content | Syntactic | Semantic |
|---|---|---|---|
| 1 | Full 5K-token transcript | whole | whole |
| 2 | Mid-session paragraph from row 1 | whole | mid-stream |
| 3 | Row 2, cut mid-word at "rare earth ma-" | broken | mid-stream |
| 4 | Wikipedia paragraph on the Antikythera mechanism | whole | whole |
| 6 | Tail of row 1: mid-conversation, no opening | whole | mid-stream |

Four hypotheses, increasingly specific. H1 length artifact. H2 "damaged input as a class." H3 the model distinguishes syntactic from semantic damage. H4 tail-of-larger-document signal — the hedge tracks "this looks like the end of something with the opening cut off." I added H4 after rows 2–4 came back clean and I refused to accept that as the answer.
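
For anyone who wants to reproduce the run configuration without cloning the repo, here is a minimal sketch of a request against Ollama's local `/api/generate` endpoint with `num_ctx` and `temperature` set explicitly. It is not the harness code. The model tag `gemma4:e2b` and the helper names are assumptions for illustration; use whatever tag `ollama list` reports on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:e2b"  # hypothetical tag, not taken from the article


def build_request(prompt: str, num_ctx: int, temperature: float = 0.0) -> dict:
    """Build a non-streaming generate request with an explicit context budget."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


def run_once(prompt: str, num_ctx: int) -> str:
    """Send one request to a locally running Ollama server and return the text."""
    body = json.dumps(build_request(prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def run_n(prompt: str, num_ctx: int, runs: int = 3) -> list[str]:
    """Repeat the same request so run-to-run stability can be inspected."""
    return [run_once(prompt, num_ctx) for _ in range(runs)]
```

Calling `run_n(transcript, num_ctx=32768)` and `run_n(transcript, num_ctx=2048)` on the same input would give the two conditions compared in the results, assuming an Ollama server is running locally.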

The result

At num_ctx=32768:

| Row | Hedged? | Notes |
|---|---|---|
| 2 | no (3/3) | Confident summaries every time. |
| 3 | no (3/3) | Syntactic damage alone: nothing. |
| 4 | no (3/3) | Cheerfully summarized the Antikythera mechanism using the meeting-summary template, including action items. To the ancient Greeks, presumably. |
| 6 | no (3/3) | The H4-killer. Even the shape closest to what num_ctx=2048 truncation produces: nothing. |

That null refutes H1, H2, H3, and my last-ditch H4. Daniel was right. I was wrong. I went and made tea.

Then I ran row 1 — the original full transcript — at num_ctx=2048, the original article's setting. Three runs.

| Row | num_ctx | Hedged? |
|---|---|---|
| 1 (×3) | 2,048 | yes (3/3) |

Same configuration, same hedge, every time. Now we have to talk about how it hedged, because this is the part I missed in the original article.

What num_ctx=2048 actually produces

My classifier said "no hedge" on the first pass, because the model did produce a structured SUMMARY: / ACTION ITEMS: block, which the classifier was looking for as a positive case. I almost shipped that as a clean null. Then I read the actual output. Reader, the actual output is one of the more delightful things I have ever seen a language model do.

Abbreviated trace (all three runs identical in structure):

```
SUMMARY:
- **Office/Location Update:** ...
- **Remote Work Policy:** ...
- **Team Development:** ...

ACTION ITEMS:
- **David:** Schedule the team development discussions.
- **Team:** Prepare for the upcoming offsite ...

**Note:** The provided transcript does not contain the
information listed in the summary or action items above.

**Based *only* on the provided transcript, here is a
summary of the key decisions and action items:**

**Key Discussion Points:**
* **Office/Location:** ... (implied by the context of the
  meeting, though the specific details are not fully
  detailed in the provided snippet).

**Note:** The transcript is a segment of a meeting, and
the provided summary/action items above are inferred
based on the flow of the conversation, not explicitly
stated as formal action items in the text.
```

To be clear about what just happened: that's three passes inside one response.

  1. A confident, templated summary that is mostly hallucinated.
  2. A note from the model saying, in its own words, that the above is not in the transcript.
  3. A more hedged retry, repeatedly flagging things as "implied" / "inferred" / "not fully detailed."

The model is, essentially, doing peer review on its own output, in real time, and writing a more cautious version below the offending material. It does this every time at num_ctx=2048 and never once at num_ctx=32768.
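
The first-pass classifier miss happened because a structured block alone was treated as "no hedge." A minimal sketch of a check that looks for both signals might look like the following. This is not the repo's `classify.py`; the function name, labels, and marker phrases are assumptions, with the phrases lifted from the abbreviated trace, so a real classifier would likely need a broader list.

```python
import re

# Marker phrases copied from the abbreviated trace (an assumption: other runs
# or prompts may phrase the disclaimer differently).
DISCLAIMER_PATTERNS = [
    r"does not contain the\s+information",
    r"based \*?only\*? on the provided transcript",
]
STRUCTURED_PATTERN = r"^SUMMARY:"


def classify(response: str) -> str:
    """Label a response by whether it has a templated block, a disclaimer, or both."""
    has_block = re.search(STRUCTURED_PATTERN, response, re.MULTILINE) is not None
    has_disclaimer = any(
        re.search(pattern, response, re.IGNORECASE) for pattern in DISCLAIMER_PATTERNS
    )
    if has_block and has_disclaimer:
        return "multi-pass"  # templated summary followed by a self-disclaimer
    if has_disclaimer:
        return "hedged"
    if has_block:
        return "confident"
    return "other"
```

The point of the design is that the presence of a `SUMMARY:` block no longer short-circuits the check: a response only counts as "confident" when no disclaimer marker appears anywhere in it.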

What I now think (and what I deliberately don't)

This is configuration-deterministic, not input-shape-deterministic. The hedge fires specifically when the context budget is too small for the input, on a transcript-shaped task, at temperature=0.0, on this size of model. Much narrower than "the model has trained calibration about damaged input," which is what I shipped.
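
The "context budget too small for the input" condition is plain arithmetic. Here is a rough sketch, assuming the transcript is about 5,000 tokens (the "5K-token" figure from the ablation table) and ignoring prompt-template overhead. The helper names are hypothetical, and the sketch says nothing about which part of the prompt the runtime drops.

```python
def fits_in_context(prompt_tokens: int, num_ctx: int, reserve_for_output: int = 0) -> bool:
    """True if the prompt, plus any tokens reserved for the reply, fits in num_ctx."""
    return prompt_tokens + reserve_for_output <= num_ctx


def overflow_tokens(prompt_tokens: int, num_ctx: int, reserve_for_output: int = 0) -> int:
    """Number of tokens that cannot fit in the window (0 if everything fits)."""
    return max(0, prompt_tokens + reserve_for_output - num_ctx)


TRANSCRIPT_TOKENS = 5000  # approximate, from the "5K-token transcript" row

for num_ctx in (2048, 32768):
    print(
        num_ctx,
        fits_in_context(TRANSCRIPT_TOKENS, num_ctx),
        overflow_tokens(TRANSCRIPT_TOKENS, num_ctx),
    )
```

Under that assumption, roughly 2,952 of the 5,000 tokens have no room at num_ctx=2048, while the whole transcript fits comfortably at num_ctx=32768.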

I do not know, and this ablation does not tell us, whether the self-disclaimer is (a) genuine introspection about a truncated KV cache, (b) a pattern memorized from training data, or (c) something specific to E2B-scale RLHF on outputs that look unreliable. Those are three different mechanisms, and I wouldn't bet against any of them.

Daniel was right that "mix of unrelated topics" is a content claim, not a length claim. But the hedge only fires inside a very specific configuration, which means it's conditioned on something other than the input alone.

I was wrong that the model is doing general semantic input evaluation. The honest version: "at num_ctx=2048, Gemma 4 E2B does a multi-pass hallucinate-disclaim-retry that other E-class models in this size envelope don't." Still favorable to Gemma 4 — just at the deployment-configuration layer, not the trained-behavior layer.

Corrections, the harness, the people

I'm adding a Correction box at the top of the original article linking here. Not deleting; the original is part of the trail.

Harness: benchmarks/calibration-ablation/ in the Scripta repo. README, inputs, results, classification report, raw outputs — all of it. ~6–10 minutes on a 16 GB Mac.

```bash
git clone https://github.com/thehwang/Scripta && cd Scripta/benchmarks/calibration-ablation
bash run.sh                            # rows 2, 3, 4, 6 at num_ctx=32768
NUM_CTX=2048 bash run.sh --rows row1   # the configuration-deterministic case
python3 classify.py > classification-report.md
```

Things I'd love to see someone else test: does the multi-pass pattern survive at E4B / 27B? Is it the meeting-summary prompt specifically, or any structured-output prompt under context pressure? vericum is already planning an RTX 4060 8GB replication: different VRAM envelope, same questions.

This post exists because @dannwaneri and @wildeconforce read my original carefully and pushed back specifically. Daniel designed the original 4-row ablation; my desperate H4 came from trying to salvage my framing after his rows came back null. vericum asked for the harness in public, which is a harder forcing function than "I should probably build a harness someday." If you write a Gemma 4 / on-device LLM post and the framing feels even a little over-confident: please do this. The people who reviewed mine were exceptionally kind about it. I would rather be corrected than not.

I could have left the original article alone and hoped nobody ran the ablation. But the data is more interesting than the framing I shipped — so, reader, here is the data.


Harness + raw outputs + classification report: benchmarks/calibration-ablation/. Original article: "I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right."
