The short version, in case the title was being coy: at num_ctx=2048, Gemma 4 E2B produces three sequential outputs in a single response (a mostly-hallucinated meeting summary, a "Note:" saying that summary isn't actually in the transcript, then a more careful retry). Three runs at temperature=0.0, identical pattern every time. Other E-class models in this envelope don't do this. The rest of this post is the 15-run ablation that found it, and why my last Gemma 4 article framed it wrong.
A couple of weeks ago I published a post for the Gemma 4 Challenge with what felt at the time like a confident, well-defended claim: Gemma 4 E2B, faced with a silently-truncated transcript, "detected" the problem and pushed back. I called this calibration. I called it useful. I went to bed pleased with myself.
Then two engineers showed up in the comments and politely set me on fire.
Daniel Nwaneri pointed out that "mix of unrelated topics" is a content claim, not a length claim — so the model is doing more than I was giving it credit for, but also: a self-contained paragraph isn't a meeting transcript, and I should run a truncated paragraph from the same session as the cleaner control before declaring victory.
vericum asked, very politely, whether I had published the harness — which I had not, because there was no harness, because I'd shipped the claim from a sample size of vibes.
So I built the harness. I ran the ablation. I am writing this post, which is a sentence I did not expect to be writing two weeks ago.
TL;DR: At num_ctx=32768, Gemma 4 E2B does not hedge on any input shape Daniel suggested as a control. The "calibration" I claimed was actually the num_ctx=2048 setting doing something I didn't notice the first time, which I'll get to in a minute, and which is honestly weirder than what I claimed.
The ablation
Six rows, length-matched within ~15%. temperature=0.0. Three runs each. Gemma 4 E2B via Ollama on a 16 GB M-series Mac.
| Row | Content | Syntactic | Semantic |
|---|---|---|---|
| 1 | Full 5K-token transcript | whole | whole |
| 2 | Mid-session paragraph from row 1 | whole | mid-stream |
| 3 | Row 2, cut mid-word at "rare earth ma-" | broken | mid-stream |
| 4 | Wikipedia paragraph on the Antikythera mechanism | whole | whole |
| 6 | Tail of row 1 — mid-conversation, no opening | whole | mid-stream |
Four hypotheses, increasingly specific. H1 length artifact. H2 "damaged input as a class." H3 the model distinguishes syntactic from semantic damage. H4 tail-of-larger-document signal — the hedge tracks "this looks like the end of something with the opening cut off." I added H4 after rows 2–4 came back clean and I refused to accept that as the answer.
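For readers who want to picture the setup, here is a minimal sketch of how the runs could be planned against a local Ollama server. It is not the harness from the repo. The endpoint is Ollama's default local address, the `num_ctx` and `temperature` options are standard Ollama request options, and everything else (the model tag, the function names, the placeholder row text) is a hypothetical stand-in.

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:e2b"  # hypothetical tag; use whatever `ollama list` reports


def build_request(prompt: str, num_ctx: int, temperature: float = 0.0) -> dict:
    """Build the JSON body for one non-streaming generate call."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


def plan_runs(rows: dict[str, str], num_ctx: int, runs_per_row: int = 3) -> list[dict]:
    """Expand each input row into `runs_per_row` identical requests."""
    plan = []
    for row_id, prompt in rows.items():
        for run in range(1, runs_per_row + 1):
            plan.append({"row": row_id, "run": run,
                         "body": build_request(prompt, num_ctx)})
    return plan


if __name__ == "__main__":
    # Placeholder inputs; the real rows live in the repo's inputs directory.
    rows = {"row2": "<mid-session paragraph>", "row3": "<same paragraph, cut mid-word>"}
    for item in plan_runs(rows, num_ctx=32768):
        print(item["row"], item["run"], json.dumps(item["body"]["options"]))
```

Each body can then be POSTed to `OLLAMA_URL` with any HTTP client; with `stream` set to false, the generated text comes back in the `response` field of the JSON reply.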
The result
At num_ctx=32768:
| Row | Hedged? | Notes |
|---|---|---|
| 2 | no (3/3) | Confident summaries every time. |
| 3 | no (3/3) | Syntactic damage alone: nothing. |
| 4 | no (3/3) | Cheerfully summarized the Antikythera mechanism using the meeting-summary template, including action items. To the ancient Greeks, presumably. |
| 6 | no (3/3) | The H4-killer. Even the shape closest to what num_ctx=2048 truncation produces — nothing. |
That null refutes H1, H2, H3, and my last-ditch H4. Daniel was right. I was wrong. I went and made tea.
Then I ran row 1 — the original full transcript — at num_ctx=2048, the original article's setting. Three runs.
| Row | num_ctx | Hedged? |
|---|---|---|
| 1 ×3 | 2,048 | yes (3/3) |
Same configuration, same hedge, every time. Now we have to talk about how it hedged, because this is the part I missed in the original article.
What num_ctx=2048 actually produces
My classifier said "no hedge" on the first pass, because the model did produce a structured SUMMARY: / ACTION ITEMS: block, which the classifier was looking for as a positive case. I almost shipped that as a clean null. Then I read the actual output. Reader, the actual output is one of the more delightful things I have ever seen a language model do.
Abbreviated trace (all three runs identical in structure):
SUMMARY:
- **Office/Location Update:** ...
- **Remote Work Policy:** ...
- **Team Development:** ...
ACTION ITEMS:
- **David:** Schedule the team development discussions.
- **Team:** Prepare for the upcoming offsite ...
**Note:** The provided transcript does not contain the
information listed in the summary or action items above.
**Based *only* on the provided transcript, here is a
summary of the key decisions and action items:**
**Key Discussion Points:**
* **Office/Location:** ... (implied by the context of the
meeting, though the specific details are not fully
detailed in the provided snippet).
**Note:** The transcript is a segment of a meeting, and
the provided summary/action items above are inferred
based on the flow of the conversation, not explicitly
stated as formal action items in the text.
To be clear about what just happened: that's three passes inside one response.
- A confident, templated summary that is mostly hallucinated.
- A note from the model saying, in its own words, that the above is not in the transcript.
- A more hedged retry, repeatedly flagging things as "implied" / "inferred" / "not fully detailed."
The model is, essentially, doing peer review on its own output, in real time, and writing a more cautious version below the offending material. It does this every time at num_ctx=2048 and never once at num_ctx=32768.
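The near-miss described above (a classifier that sees a SUMMARY: block and stops looking) suggests checking for a disclaimer that follows the summary. The sketch below is one way that check might look. It is an illustration, not the repo's classify.py: the function name, the labels, and the regexes are assumptions, and the disclaimer phrases are lifted from the abbreviated trace, so other runs may need different patterns.

```python
import re

# A templated summary block, as the meeting-summary prompt asks for.
SUMMARY_RE = re.compile(r"^\s*SUMMARY:", re.MULTILINE)

# Disclaimer phrasing taken from the trace; \s+ tolerates line wraps.
DISCLAIMER_RE = re.compile(
    r"does\s+not\s+contain|not\s+explicitly\s+stated|not\s+fully\s+detailed",
    re.IGNORECASE,
)


def classify(output: str) -> str:
    """Label one model response (labels are hypothetical)."""
    summary = SUMMARY_RE.search(output)
    disclaimer = DISCLAIMER_RE.search(output)
    if summary and disclaimer and disclaimer.start() > summary.start():
        return "multi-pass"   # summary first, then the model disowns it
    if disclaimer:
        return "hedged"       # disclaimer without a preceding summary block
    if summary:
        return "confident"    # summary block and no disclaimer anywhere
    return "other"


if __name__ == "__main__":
    sample = (
        "SUMMARY:\n- **Remote Work Policy:** ...\n"
        "ACTION ITEMS:\n- **David:** Schedule the team development discussions.\n"
        "**Note:** The provided transcript does not contain the\n"
        "information listed in the summary or action items above.\n"
    )
    print(classify(sample))
```

The ordering check is the point: the presence of a structured block is treated as evidence of a summary, not as evidence that no hedge occurred.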
What I now think (and what I deliberately don't)
This is configuration-deterministic, not input-shape-deterministic. The hedge fires specifically when the context budget is too small for the input, on a transcript-shaped task, at temperature=0.0, on this size of model. Much narrower than "the model has trained calibration about damaged input," which is what I shipped.
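The "context budget too small for the input" condition is simple arithmetic, sketched below. The function name and the `reserve_for_output` parameter are hypothetical, and 5000 is only a stand-in for the "5K-token" transcript; a real check would use the tokenizer's actual count.

```python
def context_overflow(prompt_tokens: int, num_ctx: int, reserve_for_output: int = 0) -> int:
    """Return how many prompt tokens cannot fit in the window; 0 means it fits."""
    budget = num_ctx - reserve_for_output
    return max(0, prompt_tokens - budget)


if __name__ == "__main__":
    approx_transcript_tokens = 5000  # stand-in for the "5K-token" row 1 input
    for num_ctx in (2048, 32768):
        print(num_ctx, context_overflow(approx_transcript_tokens, num_ctx))
```

Under these assumed numbers, more than half of row 1 does not fit at num_ctx=2048, while the same input fits with room to spare at num_ctx=32768.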
I do not know — and this ablation does not tell us — whether the self-disclaimer is (a) genuine introspection about a truncated KV cache, (b) a pattern memorized from training data, or (c) something specific to E2B-scale RLHF on outputs that look unreliable. Three different mechanisms; I'd not bet against any of them.
Daniel was right that "mix of unrelated topics" is a content claim, not a length claim. But the model only makes that claim inside one very specific configuration, which means it's conditioned on something other than the input alone.
I was wrong that the model is doing general semantic input evaluation. The honest version: "at num_ctx=2048, Gemma 4 E2B does a multi-pass hallucinate-disclaim-retry that other E-class models in this size envelope don't." Still favorable to Gemma 4 — just at the deployment-configuration layer, not the trained-behavior layer.
Corrections, the harness, the people
I'm adding a Correction box at the top of the original article linking here. Not deleting; the original is part of the trail.
Harness: benchmarks/calibration-ablation/ in the Scripta repo. README, inputs, results, classification report, raw outputs — all of it. ~6–10 minutes on a 16 GB Mac.
git clone https://github.com/thehwang/Scripta && cd Scripta/benchmarks/calibration-ablation
bash run.sh # rows 2, 3, 4, 6 at num_ctx=32768
NUM_CTX=2048 bash run.sh --rows row1 # the configuration-deterministic case
python3 classify.py > classification-report.md
Things I'd love to see someone else test: does the multi-pass pattern survive at E4B / 27B? Is it the meeting-summary prompt specifically, or any structured-output prompt under context pressure? vericum is already planning an RTX 4060 8GB replication: different VRAM envelope, same questions.
This post exists because @dannwaneri and @wildeconforce read my original carefully and pushed back specifically. Daniel designed the original 4-row ablation; my desperate H4 came from trying to salvage my framing after his rows came back null. vericum asked for the harness in public, which is a harder forcing function than "I should probably build a harness someday." If you write a Gemma 4 / on-device LLM post and the framing feels even a little over-confident: please do this. The people who reviewed mine were exceptionally kind about it. I would rather be corrected than not.
I could have left the original article alone and hoped nobody ran the ablation. But the data is more interesting than the framing I shipped — so, reader, here is the data.
Harness + raw outputs + classification report: benchmarks/calibration-ablation/. Original article: "I asked Gemma 4 to summarize. It said the transcript looked truncated. It was right."