"How many R's are in 'strawberry'?"
By 2024 every developer had seen the screenshot. GPT-4 confidently insisting strawberry has two R's. The word has three. The fix eventually landed — but for a moment it captured something cleaner than any benchmark: a thing a human does in half a second, that the model gets confidently wrong.
That's the picture most people have when they hear "hallucination." sonmat v0.8.0 (April 11, 2026) dealt with hallucinations. Just not that kind.
## What the 7% actually was
The trigger was a 2,700-question wiki QA evaluation on a 24B model. Hallucination rate: 7%. Looking at the number you'd shrug — "yeah, LLMs hallucinate, that's life." But once I went through the actual flagged responses one by one, the picture was different.
Strawberry-style cases — the model fabricating something that wasn't in its training distribution — were a minority. What showed up more often was this:
- User: "Facility management is in table A."
- Reality: it's in table B.
- The model dutifully searched table A.
- Found nothing, got confused, ended up extrapolating something plausible.
This response landed in the 7% bucket.
Is this hallucination? From the user's seat, yes. The answer was wrong; that's all that matters. But put a human in the same situation and the result is the same: an intern handed a wrong manual, sent off to find the facilities lead, comes back with a confused report. The model isn't broken. The input was.
## Two sources got tangled together
Here's where I had to draw a line. The user experiences hallucination as one event, but its source splits in two.
| Source | Where it starts | Treatment |
|---|---|---|
| Model-side | Plausible combinations get assembled inside the weights (the strawberry case) | Model researcher territory. Has to be fixed at the weights level |
| Context-side | The input was wrong; the model dutifully followed | Doubt the input. System designer territory |
The literature isn't unanimous either. Under faithfulness (does the output stay loyal to the input?), the context-side case is "loyal, so not a hallucination." Under factuality (does the output match reality?), it's "wrong, so yes, a hallucination." Ji et al.'s NLG hallucination survey (2023) splits intrinsic vs. extrinsic — and the wrong-manual case fits neither cleanly. Input-faithful and reality-unfaithful at the same time.
The reason researchers can't agree is simple: from where the user sits, both look like the same event. "The AI was wrong." The split only matters if you're building tools — because different sources need different treatments. Model-side, we can't touch. Context-side, we can.
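For concreteness, here's what that triage could look like in code. A minimal sketch, not sonmat's implementation: every name here is hypothetical, and the premise check is a naive stand-in for what really takes a human reviewer or a judge model.

```python
from dataclasses import dataclass

@dataclass
class FlaggedResponse:
    """One response from the 7% bucket (hypothetical structure)."""
    question: str           # what the user asked
    user_premise: str       # e.g. "facility management is in table A"
    retrieved_context: str  # what the model was actually handed
    answer: str             # the wrong answer that got flagged

def premise_holds(premise: str, context: str) -> bool:
    # Stand-in check. Real triage needs a human or a judge model;
    # naive containment is only here to make the buckets concrete.
    return premise.lower() in context.lower()

def triage(r: FlaggedResponse) -> str:
    # Context-side: the premise the model was handed doesn't hold,
    # and the model dutifully followed it anyway.
    # Model-side: the input was fine; the fabrication came from the weights.
    if premise_holds(r.user_premise, r.retrieved_context):
        return "model-side"
    return "context-side"
```

Different buckets route to different owners: model-side goes to whoever owns the weights, context-side to whoever designed the system.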
## Strawberry isn't a one-off
The same model-side pattern shows up wherever an LLM lands inside a rule-bound environment. Ask one to play chess and watch it confidently slide a rook diagonally, or move through another piece. The rule violation is obvious to any player. The model has no world model — just a learned distribution of plausible-looking continuations.
Every code agent inherits this risk. It'll eventually do the equivalent of sliding a rook diagonally, with full confidence, and you won't catch it unless you're looking. Strawberry was a single screenshot. The pattern is structural.
Model-side hallucination isn't sonmat's territory. v0.8 only dealt with the side we can touch.
## I was stuck in this frame for a while
The split looks self-evident written down. It wasn't, for me. I spent a long stretch nodding along with "hallucination = model problem" — figuring sonmat could add all the doubt tools it wanted and none of them would touch the statistical combinations assembled inside the weights. So I'd parked hallucination as out of scope for sonmat.
The 7% breakdown was what cracked that. The frame wasn't wrong, the scope was just way too tight. I was building a tool that says doubt the context you're given, while standing on a piece of context I'd never doubted. Embarrassing place to be — but that's where v0.8 actually started.
Both of the changes that followed were that one realization, pushed into discipline (reasoning rules) and into a skill (an action tool) at the same time.
## Six places in core
I touched the discipline file first. discipline/core.md is sonmat's short prescription for how Claude should think. Before v0.8, the doubt was almost entirely turned inward — "are my assumptions actually solid? am I jumping to a conclusion?" That kind of question.
v0.8 widened the doubt by one notch. Not just your reasoning — the context you received is suspect too. Same line, planted in six places.
The received context can be broken in three flavors:
- incomplete — left unsaid
- imprecise — said loosely
- incorrect — said wrong
All three coexist. Fixate on one and the others slip past. The One-beat pause rule, for example, picked this up in v0.8:
```diff
 ### One-beat pause
 Before agreeing with anything — is there something worth doubting here?
 If the question even crosses your mind, that's the signal. Check before you nod.
+This includes the context itself — it may be incomplete (left unsaid),
+imprecise (said loosely), or incorrect (said wrong).
+All three coexist; don't fixate on one.
```
Same pattern landed in Strip to essentials, Predict before acting, Ground it, Pace it, and Weight it. Weight it got an extra line on top — split the source of your confidence: verified fact / user statement / inference / guess. Not "I'm 80% sure" but "I'm 80% sure based on a user statement, which is not the same as a verified fact."
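A sketch of what the Weight-it split could look like as data. The names are mine, not sonmat's; the point is only that the source travels with the number instead of being flattened into it.

```python
from dataclasses import dataclass
from enum import Enum

class Source(Enum):
    VERIFIED_FACT = "verified fact"    # checked against ground truth
    USER_STATEMENT = "user statement"  # taken from the user, unverified
    INFERENCE = "inference"            # derived by the model itself
    GUESS = "guess"                    # filled in from the distribution

@dataclass
class Claim:
    text: str
    confidence: float
    source: Source

    def __str__(self) -> str:
        return f"{self.confidence:.0%} sure, based on a {self.source.value}: {self.text}"

print(Claim("facility management is in table A", 0.8, Source.USER_STATEMENT))
# -> 80% sure, based on a user statement: facility management is in table A
```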
A bunch of one-line additions that look tiny. The actual move was widening sonmat's territory of doubt from "inside the model's own reasoning" to "the inputs the model was handed." A tool that only doubts its own reasoning gets dragged the moment a user says "facility management is in table A" and is wrong.
## Same realization, other face — /punch
If the core changes were one face of v0.8, the other face was the new /punch skill in the same release.
Background: a quantitative pattern from communication-error research. Aviation CRM (Helmreich), surgical teams (Lingard 2004), software engineering (Boehm/Firesmith). Different domains, suspiciously similar splits:
| Error type | Share |
|---|---|
| Omission | 40–55% |
| Imprecision | 20–25% |
| Incorrect | 10–15% |
| Context/timing | 10–20% |
Caveat up front. This is the human-to-human distribution. There's no direct evidence LLM hallucinations follow the same ratios. Borrowed assumption, not measured result. But the qualitative pattern — omissions vastly outnumber outright wrongs — does seem to track on the LLM side. Models hallucinate by filling in what you didn't say far more often than by contradicting what you did say.
So the highest-ROI move is to find what's missing. Existing sonmat skills weren't doing that:
- `/guard` — "is this safe?"
- `/inspect` — "what could break?"
- `/devil` — "is this reasoning sound?"
All three inspect what's there. "What's not there but should be?" wasn't being asked by anyone. That's the slot /punch fills.
```
guard   asks "is this safe?"
inspect asks "what could break?"
devil   asks "is this reasoning sound?"
punch   asks "is anything missing?"
```
The name is from a construction punch list. You walk a finished building with the contractor and note every outlet that was on the plan but not in the wall, every door that won't close, every fixture missing entirely. That walk.
## Why punch stands on two legs
Method is short: reconstruct + domain checklist. Two legs.
### 1. Reconstruct
Code alone doesn't reveal intent. There's always something the user had in their head that never made it into the file, and that's where omission leaks the hardest. So /punch doesn't analyze unilaterally. It opens a dialogue:
```
[punch] Inferred intent from the implementation:

User stories: [...]
Contracts: [...]
Constraints: [...]
Uncertain: [things I couldn't infer — input needed]

Anything missing, off, or wrong here?
```
Output at this point isn't a verdict. It's a checkpoint. The valuable round happens when the user replies "oh, forgot that," "that's not what I meant." Aviation challenge-and-response, surgical Time Out, military brief-back — verification traditions across very different fields converge on the same shape. The maker and someone else, immediately after the work, run a quick alignment.
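As a sketch, assuming nothing about /punch's real internals: the checkpoint is a render step that deliberately stops short of a verdict, with the uncertain bucket forcing the question back to the user.

```python
def checkpoint(user_stories: list[str], contracts: list[str],
               constraints: list[str], uncertain: list[str]) -> str:
    # Renders the reconstruction for the user to confirm or correct.
    # The valuable round is the reply ("oh, forgot that", "that's not
    # what I meant"), not the text below.
    lines = ["[punch] Inferred intent from the implementation:"]
    lines.append("User stories: " + "; ".join(user_stories))
    lines.append("Contracts: " + "; ".join(contracts))
    lines.append("Constraints: " + "; ".join(constraints))
    if uncertain:
        lines.append("Uncertain: " + "; ".join(uncertain) + " (input needed)")
    lines.append("Anything missing, off, or wrong here?")
    return "\n".join(lines)
```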
### 2. Domain checklist
Reconstruction alone isn't enough. The bits the user themselves forgot don't surface in reconstruction. (The "missing bathroom" case.) So the second leg is a domain checklist:
| Domain | Core items |
|---|---|
| Web app | Auth/session, input validation, error pages, loading states, responsive, a11y, CORS, rate limiting |
| API | Versioning, error format, auth, pagination, timeout, idempotency, docs |
| Data pipeline | Schema validation, null/empty, dedup, retry, monitoring, backfill |
| CLI | Help, exit codes, stdin/stdout, error messages, config, --dry-run |
| ML/AI | Baseline, eval, data leakage, latency, fallback on failure |
The checklist won't catch everything. Project-specific requirements aren't on it. But the territory the checklist covers and the territory reconstruction covers are orthogonal. One leg asks "what was specifically intended for this project," the other asks "what does any project in this domain usually need." Run only one and the other half walks out the door.
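A minimal sketch of the second leg, with checklist contents abbreviated from the table above. Detecting what's actually covered is the real work; here it's simply passed in.

```python
CHECKLISTS: dict[str, set[str]] = {
    "web app": {"auth/session", "input validation", "error pages",
                "loading states", "responsive", "a11y", "CORS", "rate limiting"},
    "api": {"versioning", "error format", "auth", "pagination",
            "timeout", "idempotency", "docs"},
    # ... data pipeline, CLI, ML/AI as in the table above
}

def missing_items(domain: str, covered: set[str]) -> list[str]:
    # What any project in this domain usually needs, minus what
    # this project has. `covered` would come from reconstruction
    # plus code analysis.
    return sorted(CHECKLISTS[domain] - covered)

print(missing_items("api", {"versioning", "auth", "docs"}))
# -> ['error format', 'idempotency', 'pagination', 'timeout']
```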
## The limit, plainly stated
Where this frame is solid and where it leans on hope, separated honestly:
The model-side hallucinations stay. Strawberry, chess rooks, the lot. v0.8 doesn't dent them. Model-side comes out of the weights, weights belong to model researchers. sonmat doesn't touch it.
The 7% number is one person's single test: a 24B model, 2,700 wiki QA questions. No guarantee the same distribution holds on a different model, a different domain, a different evaluation prompt.
The error-rate table is from human-to-human research. Aviation CRM, surgical teams, software engineering retrospectives. No direct evidence LLM hallucinations split into the same ratios. They look qualitatively similar — that's the most I can honestly say.
Sources don't always split cleanly. A user mumbles half a requirement, the model fills in the rest from its learned distribution, and now context-side and model-side are tangled inside one response. This frame catches half of those at best.
With all that conceded — what did v0.8 actually do? One sentence.
## The one-line lesson
It pulled apart two events that had been bundled under the single word "hallucination," and started treating each one according to where it actually started. One source (model-side) we can't fix. The other (context-side) we can. The fix split in two — six lines in discipline/core.md extending doubt outward to the input context, and a new tool, /punch, that goes looking for what's missing.
The same realization landed in discipline (rules of reasoning) and skill (an action tool) at once. Not coincidence — two faces of one finding. v0.8 didn't solve hallucination. It picked the events that had been miscategorized as hallucination apart from the rest, and started treating them on their own terms.
Move the direction of doubt one notch outward — from your own reasoning to the context you were handed. That was sonmat's step.
Release notes: v0.8.0
Repo: https://github.com/jun0-ds/sonmat