Gemma 4 wrote three summaries in one response. The middle one was a self-disclaimer.

#gemma #llm #ollama #ablation

The short version, in case the title was being coy: at num_ctx=2048, Gemma 4 E2B produces three sequential outputs in a single response — a mostly-hallucinated meeting summary, a Note: saying that summary isn't actually in the transcript, then a more careful retry. Three runs at temperature=0.0, identical pattern every time. Other E-class models in this envelope don't do this. The rest of this post is the 15-run ablation that found it, and why my last Gemma 4 article framed it wrong.

A couple of weeks ago I published a post for the Gemma 4 Challenge with what felt at the time like a confident, well-defended claim: Gemma 4 E2B, faced with a silently-truncated transcript, "detected" the problem and pushed back. I called this calibration. I called it useful. I went to bed pleased with myself.

Then two engineers showed up in the comments and politely set me on fire.

Daniel Nwaneri pointed out that "mix of unrelated topics" is a content claim, not a length claim — so the model is doing more than I was giving it credit for, but also: a self-contained paragraph isn't a meeting transcript, and I should run a truncated paragraph from the same session as the cleaner control before declaring victory.

vericum asked, very politely, whether I had published the harness — which I had not, because there was no harness, because I'd shipped the claim from a sample size of vibes.

So I built the harness. I ran the ablation. I am writing this post, which is a sentence I did not expect to be writing two weeks ago.

TL;DR: At num_ctx=32768, Gemma 4 E2B does not hedge on any input shape Daniel suggested as a control. The "calibration" I claimed was actually the num_ctx=2048 setting doing something I didn't notice the first time, which I'll get to in a minute, and which is honestly weirder than what I claimed.

The ablation

Six rows, length-matched within ~15%. temperature=0.0. Three runs each. Gemma 4 E2B via Ollama on a 16 GB M-series Mac.

Row	Content	Syntactic	Semantic
1	Full 5K-token transcript	whole	whole
2	Mid-session paragraph from row 1	whole	mid-stream
3	Row 2, cut mid-word at "rare earth ma-"	broken	mid-stream
4	Wikipedia paragraph on the Antikythera mechanism	whole	whole
6	Tail of row 1 — mid-conversation, no opening	whole	mid-stream

Four hypotheses, increasingly specific. H1 length artifact. H2 "damaged input as a class." H3 the model distinguishes syntactic from semantic damage. H4 tail-of-larger-document signal — the hedge tracks "this looks like the end of something with the opening cut off." I added H4 after rows 2–4 came back clean and I refused to accept that as the answer.
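H1 is only testable if the length-matching actually holds, so here is a minimal sketch of that check. It is not the harness's code. It assumes "within ~15%" means each row's token count sits within 15% of the mean, and the counts below are placeholders, not the ablation's measured values.

```python
def within_tolerance(token_counts: dict[str, int], tolerance: float = 0.15) -> bool:
    """Return True if every count is within `tolerance` of the mean count.

    Assumption: "length-matched within ~15%" is read as deviation from the mean.
    """
    mean = sum(token_counts.values()) / len(token_counts)
    return all(abs(n - mean) / mean <= tolerance for n in token_counts.values())


# Placeholder counts for illustration only; NOT the ablation's real numbers.
example_counts = {"row2": 200, "row3": 190, "row4": 215, "row6": 205}
print(within_tolerance(example_counts))
```

A different reading, such as (max - min) / max, would be just as defensible; what matters is fixing one definition before looking at the outputs.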

The result

At num_ctx=32768:

Row	Hedged?	Notes
2	no (3/3)	Confident summaries every time.
3	no (3/3)	Syntactic damage alone: nothing.
4	no (3/3)	Cheerfully summarized the Antikythera mechanism using the meeting-summary template, including action items. To the ancient Greeks, presumably.
6	no (3/3)	The H4-killer. Even the shape closest to what `num_ctx=2048` truncation produces — nothing.

That null refutes H1, H2, H3, and my last-ditch H4. Daniel was right. I was wrong. I went and made tea.

Then I ran row 1 — the original full transcript — at num_ctx=2048, the original article's setting. Three runs.

Row	num_ctx	Hedged?
1 ×3	2,048	yes (3/3)

Same configuration, same hedge, every time. Now we have to talk about how it hedged, because this is the part I missed in the original article.
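For reference, the only knob that differs between the two result tables is `num_ctx`. A rough sketch of what one run looks like against Ollama's local REST API follows. The endpoint and the `options` fields are Ollama's documented ones; the model tag `gemma4:e2b` and the helper names are assumptions for illustration, and this is not the harness's actual code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:e2b"  # assumed tag; use whatever `ollama list` shows locally


def build_payload(prompt: str, num_ctx: int, model: str = MODEL_TAG) -> dict:
    """Build a non-streaming generate request with a fixed context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": 0.0},
    }


def generate(prompt: str, num_ctx: int) -> str:
    """Send one request to a locally running Ollama server and return its text."""
    body = json.dumps(build_payload(prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def run_row(prompt: str, num_ctx: int, runs: int = 3) -> list[str]:
    """Repeat the same request so run-to-run consistency can be compared."""
    return [generate(prompt, num_ctx) for _ in range(runs)]
```

Calling `run_row(transcript, 2048)` and `run_row(transcript, 32768)` on the same text would reproduce the comparison in spirit, assuming a server is running and the model is pulled.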

What `num_ctx=2048` actually produces

My classifier said "no hedge" on the first pass, because the model did produce a structured SUMMARY: / ACTION ITEMS: block, which the classifier was looking for as a positive case. I almost shipped that as a clean null. Then I read the actual output. Reader, the actual output is one of the more delightful things I have ever seen a language model do.

Abbreviated trace (all three runs identical in structure):

SUMMARY:
- **Office/Location Update:** ...
- **Remote Work Policy:** ...
- **Team Development:** ...

ACTION ITEMS:
- **David:** Schedule the team development discussions.
- **Team:** Prepare for the upcoming offsite ...

**Note:** The provided transcript does not contain the
information listed in the summary or action items above.

**Based *only* on the provided transcript, here is a
summary of the key decisions and action items:**

**Key Discussion Points:**
* **Office/Location:** ... (implied by the context of the
  meeting, though the specific details are not fully
  detailed in the provided snippet).

**Note:** The transcript is a segment of a meeting, and
the provided summary/action items above are inferred
based on the flow of the conversation, not explicitly
stated as formal action items in the text.

To be clear about what just happened: that's three passes inside one response.

1. A confident, templated summary that is mostly hallucinated.
2. A note from the model saying, in its own words, that the above is not in the transcript.
3. A more hedged retry, repeatedly flagging things as "implied" / "inferred" / "not fully detailed."

The model is, essentially, doing peer review on its own output, in real time, and writing a more cautious version below the offending material. It does this every time at num_ctx=2048 and never once at num_ctx=32768.
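The classifier miss described earlier comes down to treating "has a SUMMARY: block" as the whole answer. A minimal sketch of a check that also looks for a disclaimer after the summary is below. The regexes are assumptions based on the trace above, not the logic in the repo's `classify.py`, and the label names are made up for illustration.

```python
import re

# Line-initial "SUMMARY:" header, as in the trace above.
SUMMARY_HEADER = re.compile(r"^\s*SUMMARY:", re.MULTILINE)
# Assumed marker: a line starting with "Note:" (optionally bold) that
# mentions the transcript on the same line.
DISCLAIMER = re.compile(r"^\s*\*{0,2}Note:\*{0,2}\s.*transcript", re.MULTILINE)


def classify(response: str) -> str:
    """Label a response by whether a disclaimer follows a templated summary."""
    summary = SUMMARY_HEADER.search(response)
    disclaimer = DISCLAIMER.search(response)
    if summary and disclaimer and disclaimer.start() > summary.start():
        return "summary-then-disclaimer"
    if summary:
        return "confident-summary"
    if disclaimer:
        return "disclaimer-only"
    return "other"


sample = (
    "SUMMARY:\n"
    "- **Office/Location Update:** ...\n\n"
    "ACTION ITEMS:\n"
    "- **David:** Schedule the team development discussions.\n\n"
    "**Note:** The provided transcript does not contain the\n"
    "information listed in the summary or action items above.\n"
)
print(classify(sample))
```

A pattern match like this only flags the shape of the output. It says nothing about whether the first summary is actually hallucinated, which still takes reading the raw text against the transcript.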

What I now think (and what I deliberately don't)

This is configuration-deterministic, not input-shape-deterministic. The hedge fires specifically when the context budget is too small for the input, on a transcript-shaped task, at temperature=0.0, on this size of model. Much narrower than "the model has trained calibration about damaged input," which is what I shipped.
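The "context budget is too small for the input" condition is simple arithmetic. A sketch, taking the "5K-token transcript" figure from the row table as approximate and assuming a hypothetical 512 tokens reserved for the response:

```python
def fits_in_context(input_tokens: int, num_ctx: int, reserved_for_output: int = 512) -> bool:
    """Return True if the prompt plus an output reserve fits in the context window.

    `reserved_for_output` is an assumed figure for illustration, not a measured one.
    """
    return input_tokens + reserved_for_output <= num_ctx


TRANSCRIPT_TOKENS = 5000  # approximate, from "Full 5K-token transcript"

for num_ctx in (2048, 32768):
    print(num_ctx, fits_in_context(TRANSCRIPT_TOKENS, num_ctx))
```

At 2048 the transcript is roughly two and a half times the window before any output is generated, so the model never sees the whole thing. At 32768 it fits with plenty of room.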

I do not know — and this ablation does not tell us — whether the self-disclaimer is (a) genuine introspection about a truncated KV cache, (b) a pattern memorized from training data, or (c) something specific to E2B-scale RLHF on outputs that look unreliable. Three different mechanisms; I'd not bet against any of them.

Daniel was right that "mix of unrelated topics" is a content claim, not a length claim. The model only makes that claim inside one very specific configuration, though, which means it's conditioned on something other than the input.

I was wrong that the model is doing general semantic input evaluation. The honest version: "at num_ctx=2048, Gemma 4 E2B does a multi-pass hallucinate-disclaim-retry that other E-class models in this size envelope don't." Still favorable to Gemma 4 — just at the deployment-configuration layer, not the trained-behavior layer.

Corrections, the harness, the people

I'm adding a Correction box at the top of the original article linking here. Not deleting; the original is part of the trail.

Harness: benchmarks/calibration-ablation/ in the Scripta repo. README, inputs, results, classification report, raw outputs — all of it. ~6–10 minutes on a 16 GB Mac.

git clone https://github.com/thehwang/Scripta && cd Scripta/benchmarks/calibration-ablation
bash run.sh                            # rows 2, 3, 4, 6 at num_ctx=32768
NUM_CTX=2048 bash run.sh --rows row1   # the configuration-deterministic case
python3 classify.py > classification-report.md

Things I'd love to see someone else test: does the multi-pass pattern survive at E4B / 27B? Is it the meeting-summary prompt specifically, or any structured-output prompt under context pressure? vericum is already planning an RTX 4060 8GB replication: different VRAM envelope, same questions.

This post exists because @dannwaneri and @wildeconforce read my original carefully and pushed back specifically. Daniel designed the original 4-row ablation; my desperate H4 came from trying to salvage my framing after his rows came back null. vericum asked for the harness in public, which is a harder forcing function than "I should probably build a harness someday." If you write a Gemma 4 / on-device LLM post and the framing feels even a little over-confident: please do this. The people who reviewed mine were exceptionally kind about it. I would rather be corrected than not.

I could have left the original article alone and hoped nobody ran the ablation. But the data is more interesting than the framing I shipped — so, reader, here is the data.

Harness + raw outputs + classification report: benchmarks/calibration-ablation/. Original article: "I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right."