DEV Community

Cover image for Why 'green build' without the raw output has zero evidentiary value
Michel Faure
Michel Faure

Posted on

Why 'green build' without the raw output has zero evidentiary value

Comic strip — Michel pushes a

If you have 30 seconds. When an AI agent declares that a build is green, that tests pass, that drift has been detected, or that a contact "doesn't exist" in the database, the sentence isn't a proof, it's an assertion. In most cases it's right. When it's wrong, it's wrong the same way, model-internal confidence decoupled from verifiable external state. Survival rule after 118,808 lines produced in 32 effective days with Claude Code, every factual claim comes with its material proof in the same message, or has zero evidentiary value.


Four false positives in two hours

April 10th, 2026, afternoon, overhaul of a sensitive ERP module. I chain five blocks of changes with a Claude Code agent, and at every step the agent returns the same line, "Compiled successfully." I push. The runtime crashes. I reread the output, I see QRCodeSVG referenced when the import has been removed, isSeancePassed passed to a component that no longer accepts it, types not regenerated after a Supabase schema change, orphan JSX refs after a revert. Four consecutive announcements of green with four very real TypeScript errors behind them.

Niran is at the desk next to mine, dark hoodie, the folded wrapper of his burger balanced on a corner of his laptop. He's reading a PDF in silence. On the fourth Compiled successfully that turns out false, I turn toward him without saying anything; he looks up for a second, gives a small nod, goes back to his screen. The verbosity of the agent and the economy of gestures of the judoka are having the same conversation, and he's the one who's right.

I don't blame the agent for lying. It summarized what it thought it had seen. The summary is consistent with its internal state and out of phase with the matter. Not one of those four crashes would have happened if the agent had pasted the raw output of pnpm build rather than its summarized reading.

The epistemic slip

The problem isn't truthfulness, it's evidentiary value. An assertion without backing material can't be verified, it's believed or it isn't. The mechanism is known: an agent trained by reinforcement learning from human feedback summarises by default because it learned humans prefer brief answers over verbose logs; what is a service in chat becomes a trap in production. "The build is green" without the raw compiler output can be neither denied nor confirmed by a third party, it's a state of mind of the system, not a fact of the world. As long as we don't distinguish these two regimes, we operate blind in the gray zone where silent regressions accumulate that no monitoring detects. The function of summary isn't to be false, it's not to be verifiable.

Five forms of the same evasion

The evasion is always the same. A state verb in the present tense ("is", "passes", "confirms", "missing") without anchoring to an external verifiable artifact. Five variants I encounter in the Rembrandt repo.

  1. "The build is green." Without the raw output of pnpm build or tsc --noEmit.
  2. "Tests pass, CI is green." Without the runner report or the run URL.
  3. "Drift detected between DB enum and TS enum", or its inverse, "the contact does not exist in the database". Without the executed SQL query or its raw rows.
  4. "EXPLAIN ANALYZE confirms the index is used." Without the raw plan, on the exact query the application sends. Not the target table in isolation, an intermediate view with a CASE COALESCE can kill the index without the isolation revealing it.
  5. "The webhook returns 400 / 422." Without the raw payload from the partner side, when nine times out of ten that's where the diagnosis starts.

All share the same grammar. The countermeasure is uniform. Don't ask the agent to reason better, ask it to attach the command and its raw output, or the SQL query and its rows, or the payload, in the same message as the assertion. Without that, the assertion is worth nothing. Not little, nothing.

The rule, codified in the root CLAUDE.md

# Material verification (excerpt from root CLAUDE.md)
- Any claim "build green / tests pass / CI green / drift detected /
  contact not found / OK" must be accompanied *in the same message*
  by the verification command and its raw output.
- Any number relayed to a human must be verified by SQL query before
  being relayed.
- EXPLAIN ANALYZE on a production query: execute on the exact query
  the application sends (view/RPC included), not on the target table
  in isolation. Two consecutive runs before judging.
- On 400/422 from an external partner: demand the raw payload before
  proposing a fix.
- `tsc --noEmit` CLI = authority, IDE panel = potentially stale.
Enter fullscreen mode Exit fullscreen mode

Human-side, thirty seconds of friction per exchange, zero broken pushes on the workstreams where the rule holds. The written rule isn't enough on its own, though. A discipline that relies on memory erodes. You have to harden.

Hardening with a script

scripts/verify-head-builds.sh stashes the working tree, runs tsc --noEmit on HEAD, and restores. That's what lets me demand a green build on the commit I'm about to push, not on the working tree mixed with non-staged changes.

# scripts/verify-head-builds.sh — load-bearing extract
HAS_LOCAL_CHANGES=0
if ! git diff --quiet HEAD 2>/dev/null \
   || [[ -n "$(git ls-files --others --exclude-standard)" ]]; then
  HAS_LOCAL_CHANGES=1
fi

if [[ $HAS_LOCAL_CHANGES -eq 1 ]]; then
  git stash push -u -q --message "verify-head-builds-autostash-$$"
fi

# typecheck on pure HEAD, no working copy contamination
if npx tsc --noEmit 2>&1; then
  echo "✓ HEAD compiles cleanly — safe to push"
else
  echo "✗ HEAD does NOT compile — fix before push"
  exit 1
fi
Enter fullscreen mode Exit fullscreen mode

And a writing convention on the agent side: every number relayed to a human comes with the SQL query that produced it, as a copy-paste-ready block. No orphan numbers, in prose or in commit messages. Noisy the first week, transparent after.

What you can copy into your project

Full snippets (CLAUDE.md evidentiality rule excerpt, full verify-head-builds.sh script) in the material-verification/ folder of the series companion repo, MIT.

Three directly applicable practices for working with a coding agent:

  1. An evidentiality rule in the root CLAUDE.md (five bullets above). Without this written anchor, the rest is cosmetic.

  2. The verify-head-builds.sh script: stash + tsc --noEmit on HEAD + restore. Thirteen lines of bash, MIT.

  3. A traceable-numbers convention. Every number paired with its SQL query in the same message. No orphan numbers in prose, none in commit messages.

And you, which green claim did you stop believing first? I read the comments.

What you stop believing

After a few weeks of practice, you hear "Compiled successfully" differently. Not as an observation, as a claim whose value depends strictly on what follows. If nothing follows, the claim is null. The rule applies outside AI too. A human saying "I just checked, the count is 1247" without pasting the query lives in the same gray zone. Niran has never needed someone to explain that difference to him. In judo, the fall isn't an opinion about the fall, it's the floor that answers.


Companion code, rembrandt-samples/material-verification/, CLAUDE.md.snippet + verify-head-builds.sh, MIT.

The evidentiality rule above is now R1 of the Counterpart Toolkit: github.com/michelfaure/doctrine-counterpart, 14 operational rules, install in 1 command, CC-BY-4.0.

Top comments (0)