Confront, Don't Assert

#ai #codereview #testing #tooling

Why your AI code auditor should be able to tell you how it's wrong

A docstring told me the write was safe. So did the architecture doc. Both had been reviewed; both had been trusted. They described a liveness update — the kind that flips a record to "in progress" — and they were specific about the guard: the write, they said, was conditional on the record's status, so a finished record could never be re-opened by a late, racing writer. Compare-and-swap, in the filter. Textbook.

The code matched the record by its id. Only its id. The status predicate the docs promised wasn't there.

Nothing was on fire. In practice there was effectively one writer, so the race the missing guard would have allowed had never been provoked. But the guard the documentation described — the one a future maintainer would read, trust, and then decline to add because "it's already handled" — did not exist. The doc wasn't stale. It had never been true. And it was confident enough that no reviewer had thought to check it against the single line of code it claimed to describe.

I didn't find that by reading carefully. I found it because I'd built a tool whose entire job is to assume a confident claim is guilty until reconciled against reality — and to tell me, in the same breath, how it could be wrong about that.

That second clause is the whole point of this essay.

The thing we built backwards

Here is the shape of AI code tooling in 2026. The field already found the axis that matters — it just enshrined the wrong end of it as the default.

On one side, the precision-first reviewers, tuned to stay quiet: catch less, almost never cry wolf. On the other, the recall-first camp, willing to eat noise to catch more. The split is real and measurable — run them against the same bugs and the spread is enormous. On Greptile's own 50-PR benchmark — real production bugs, reconstructed from the commits that fixed them — Greptile catches around 82% to CodeRabbit's 44%: the recall-tuned tool against the precision-tuned one that trades catch rate for a quieter signal. Two philosophies, both shipping, both defensible.

The quiet one has a real case, and I'll concede it its strongest version. If you're interrupting a developer mid-flow at pull-request time, quiet is correct. Noise has a cost and a human pays it. OpenAI saw it building CriticGPT: trainers preferred its critiques to the baseline's 63% of the time, partly because it produced fewer unhelpful nitpicks. Fewer false positives is not a dumb goal. In the right room it's the only goal.

So I'm not here to call the low-noise instinct stupid. Two things are still wrong, and both are worse than a mistuned threshold.

The first is the variable. Confidence is the wrong thing to threshold on, whichever end you set it to. A model's confidence is not a measurement of correctness — it's a property of the text it trained on, overwhelmingly written by people who sound sure. A 2026 study called TRACE, out of UT Dallas, measured this on the exact task in my opening story — does the code honor the documentation? — across seven models, and found their confidence poorly calibrated for six of the seven. The number you'd filter on doesn't track whether the finding is true. So filtering by confidence doesn't filter for truth. It filters for fluency — the same failure mode that produced the doc that lied.

And look at what that filter lets through. The measured weakness of LLM judges isn't volume; it's recall. In summarization-consistency tests from the GPT-3.5 era, judges caught only 30 to 60 percent of the inconsistent summaries — they miss real defects far more readily than they invent fake ones. The dominant defect of an AI auditor is the false negative, not the false positive. "Surface only high-confidence findings" trims the loud false positives and leaves the silent false negatives exactly where they were. It optimizes the one number that was already fine.

It gets sharper. The same study put numbers on where these auditors go blind, and they land on my story with uncomfortable precision. When the documentation is plainly wrong, the models catch it well — 67 to 94 percent of the time; a false claim is a loud, local contradiction. The failure case is the inverse: documentation that stays plausible while the code quietly fails to honor it. Detection of that case falls by more than forty points in the worst case — and because the confidence is miscalibrated, the model can't even tell you when it's standing in the blind spot. That is the doc that lied. It didn't recommend a bad guard. It described a good guard as done — plausible on its face, false only against the code — and a claim that's right on the merits and wrong only about reality is the hardest thing there is for an auditor, or a human reviewer, to see. Which is why it sat untouched while everyone who read it nodded.

The second is the room. Every one of these tools, loud or quiet, fires at pull-request time — into a developer's flow, where noise genuinely costs something. Point the same posture at code that's been sitting still for a year and the economics invert. No flow to interrupt, no reviewer mid-thought, nothing to protect from noise. The reason to be quiet evaporates — and what's left is a tool still trained to hush about the one place, standing and trusted and unexamined code, where silence costs the most.

The doc that lied is what that quiet looks like from the inside. Not an exotic bug. The ordinary one: a claim that sounds true, that everyone trusts, that nothing was built to falsify.

Confront, don't assert

So here is the principle the tool is built on, and the one I'd defend against the whole low-noise consensus:

A claim is worth nothing until something could falsify it. Not the code's claims — the auditor's. If my tool tells you a write is unguarded, it has to tell you, in the same finding, the condition under which it's wrong: "This is incorrect if an upstream caller already restricts this path to in-flight records, making the filter predicate redundant." That sentence is not a hedge. It is the most useful line in the report, because it tells you exactly where to look to refute the finding — and if you can't refute it, you've learned something real. It's the physics habit, ported: you don't trust a result because the computation produced it — you trust it because you tried to break it and couldn't.

Three commitments fall out of that, and they're the opposite of how the field is trending:

Every finding states how it could be wrong. A falsifiability clause per finding. This sounds like a small UI nicety. It is actually a different epistemics: it forces the auditor to hold a model of the world specific enough to be attacked, instead of emitting a vibe with a confidence score stapled to it.

Silence is not coverage. If an auditor finds nothing, that is not the same as "there is nothing." It might mean it didn't look. So the report has a first-class section for what was checked and found clean — the guarded sibling of the unguarded write, named, with the reason it's fine ("the predicate is in the filter; this one is correct"). An empty findings list is a claim too, and it should carry its evidence. The audit standards humans have used for decades already work this way: scope, what was examined, what was found. AI review quietly dropped that and kept only the alarms.

A finding you can't refute is decoration. Before a finding ships, an adversary runs against it whose only job is to kill it: does the racing path actually co-execute? Is the missing guard present one layer up? Is the cited path even reachable? Only the findings that survive the attack are reported. This is the same move that makes a courtroom adversarial rather than a press release — and it's astonishing how much "AI found a problem" content would not survive it.

The mechanism here isn't unheard of, and I won't pretend it is. Greptile's newer versions run an agent that chases a flagged issue down its call chain to decide whether it's real before it ever reaches you — the same instinct, kill the finding before the human sees it. The difference I'm drawing isn't that an adversary exists. It's that the adversary's verdict belongs in the report. The surviving finding ships with the refutation it survived — the condition someone tried to meet and couldn't — so the reader inherits the attack, not just its verdict. A verification you run and discard protects the tool's precision. A verification you hand the reader protects the reader's judgment. Only one of those survives you walking away from the keyboard.

None of these are about making the auditor smarter. They're about making it honest — and honest in a way you can check.

The auditor has to be falsifiable too

There's a trap one level up, and it's the one that makes most of this rhetoric worthless in practice. An auditor that prizes falsifiability can still be vacuous: it can "pass" by finding nothing, or by emitting confident-sounding findings that don't actually depend on the code. How would you know? The auditor sounds great either way.

So the discipline has to apply to the tool itself. Mine ships with planted-bug fixtures: tiny pieces of code with one known defect each. The rule is red-green — fix the planted bug, and the finding must vanish. If it doesn't, the auditor wasn't detecting the defect; it was pattern-matching something incidental. And a second corpus strips the explanatory comments out, because if your "analysis" is really just paraphrasing a comment that hands it the answer, removing the comment should break it — and if it doesn't break, the finding was never analysis.

This isn't novel as an idea; it's how OpenAI validated CriticGPT, by paying people to insert subtle bugs and measuring whether the critic caught them. It's the same instinct as mutation testing: a test suite that can't be made to fail by a deliberately broken program isn't protecting anything. What's strange is how rarely it's applied to the AI auditors we're shipping. We test the code. We don't test the thing that judges the code. A test that can't go red is inert — and an auditor whose findings can't be made to vanish is decoration with a progress bar.

I'll say the uncomfortable version plainly: most of the "AI reviewed your code" you can buy today cannot show you a single planted defect it reliably catches and reliably stops catching when fixed. Ask for that demonstration. It's a fair question, and it's the whole game.

If it only works when you're in the room

There's a test for whether a discipline is real or just a preference: hand it to something that can't ask you anything, and see if it holds. I write prompts for remote agents — no memory of my standards, no shared context, no way to come back with a question — and the rule I put above all others is the same blade this essay is about, aimed at a machine: reading the bug-report text counts as zero evidence; a verified reproduction counts as one. Don't act on zero. The prompt has to teach the agent to verify its own work, because nobody else will — I've watched one process authoritative-looking data straight into a confident, wrong conclusion for want of exactly that check. A conviction you can compress into an instruction a stranger obeys — confront the claim before you act on it — is a conviction, not a taste. If it only works when you're in the room, it was never a discipline.

Why retrospective, and why per-pillar

One more design choice, because it's where the doc that lied actually lived.

The crowded, well-funded part of this market reviews diffs — it gates the change at pull-request time. That's valuable and I'm not competing with it. But the doc that lied wasn't in a diff. It had been sitting in the repository, unexamined, the whole time. There was no change to gate. It would have sailed through every pull-request reviewer on earth, because none of them were looking at code that wasn't moving.

The rest of the market splits two ways, and the doc that lied slips both. The static analyzers — Sonar and its lineage — read the whole codebase, but with rules: deterministic checks for complexity, duplication, known-bad shapes. "The docstring claims a guard the code doesn't implement" is not a rule you can write in advance. Neither is "this lock ordering can deadlock," or "this abstraction has one caller and is pre-paying for a future that never arrived," or "this retry has no backoff and will become a stampede under load." Those are judgments, not patterns.

The newer codebase-aware reviewers — Greptile, Qodo — can make judgments like that. Full-repository context, prose reasoning, not a rulebook. So that ground isn't empty, and I won't pretend it is. But their whole machinery is pointed at a change. The standing code only gets read insofar as a diff reaches into it; leave the file alone and the file goes unexamined.

So the empty seam isn't "whole-codebase, prose-reasoned" — that's occupied. It's the conjunction the silent failures actually require: retrospective, triggered by nothing, run against code that isn't moving — one quality dimension at a time — every finding carrying the condition that would refute it — and the auditor itself held to the red-green discipline above. Not any single one of those. The stack. That specific stack is where almost nobody is, and it's exactly where the expensive, silent failures hide.

What's free, and what isn't

The framework is meant to be free — and that's a stance, not a footnote. The laws, the phase structure, the eval harness, the falsifiability discipline: that's method, and method is meant to travel. What isn't free, and shouldn't be, is the calibration — the invariant catalogs and failure-mode libraries tuned against real systems, the part that turns a general method into something that knows what tends to go wrong in your kind of codebase. That line — method open, calibration earned — is the whole shape of the thing, and I'm clear about it up front, because what poisons "open" tools is moving that line later. It was never the other way around. The method is the gift; the calibration is the craft. Reading how I'd audit a system is not the same as being able to do it on yours — which is, not coincidentally, the entire reason a published methodology has ever been worth publishing.

The invitation

Point an auditor at your own code and watch what it does with uncertainty.

If it finds nothing, make it tell you what it checked. If it finds something, make it tell you how it could be wrong — and then try to make that finding vanish by fixing the thing it pointed at. If you can't make it go quiet by fixing the bug, it was never really watching the bug.

That's the whole idea, and it's smaller than it sounds. Confront, don't assert. Silence is not coverage. A finding you can't refute is decoration. The field has plenty of tools that sound sure. I'd rather ship one that can tell you, precisely and falsifiably, where it might be wrong — because the doc that lied was confident too, and confidence is exactly what fooled everyone who read it.

Sources — every figure above is checkable, which is the only honest way to end an essay called this.

Catch rates (≈82% / ≈44%) and the precision-vs-recall split: Greptile's own 50-PR benchmark. Read it as a vendor benchmark — it crowns itself — and note the bugs are real production bugs traced from their fix commits, not synthetic.
CriticGPT, 63% preference (partly for fewer nitpicks): OpenAI · paper.
LLM judges catching 30–60% of inconsistent summaries: scoped to GPT-3.5-era summarization-consistency evaluations — survey of the literature. The direction generalizes; that exact figure does not.
Documentation auditing vs. code-drift detection — the six-of-seven miscalibration and the detection collapse when code drifts under a plausible doc: TRACE, arXiv:2604.03447. The paper reports that collapse twice and disagrees with itself — 7–42 points in the abstract, 21–43 in the results — so the line above states only what holds under both. Which is, fittingly, the whole point.