DEV Community

The Never‑Ending AI Code Review: Why One Pass Isn’t Enough

Victoria on May 15, 2026

The Hook I ran an AI code review. It found 12 issues. I fixed them. Ran it again — it found 8 more. Fixed those. Ran it again — 5 more. ...
 
Mykola Kondratiuk

honestly the problem isn't runs per se - it's that AI reviewers have no stake in shipping. no done signal means it'll keep finding things indefinitely. define acceptable criteria before running, not after.

Victoria

fair point, I didn't — just kept running until it "felt done" which is exactly the trap you're describing. stealing the criteria-first idea for next time :)

Mykola Kondratiuk

the "felt done" loop is the whole trap. criteria-first shifts who owns the exit condition — you instead of the agent. let me know how it lands after a few runs.
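A minimal sketch of what a criteria-first exit condition could look like, assuming the reviewer and the fix step are plain callables. The names `Finding`, `ExitCriteria`, and `review_until_done`, the severity labels, and the default budget of 3 runs are all hypothetical, not something from the article:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str  # assumed labels: "CRITICAL", "HIGH", "MEDIUM", "LOW"
    message: str


@dataclass(frozen=True)
class ExitCriteria:
    # Decided by the human BEFORE the first run, not by the agent.
    blocking: frozenset = frozenset({"CRITICAL", "HIGH"})
    max_runs: int = 3


def review_until_done(run_review, apply_fixes, criteria=ExitCriteria()):
    """Repeat review passes until no blocking findings remain or the budget is spent.

    Returns (runs_used, blocking_findings_from_the_last_run).
    """
    blocking = []
    for run in range(1, criteria.max_runs + 1):
        findings = run_review()
        blocking = [f for f in findings if f.severity in criteria.blocking]
        if not blocking:
            return run, []  # done signal: nothing above the agreed bar
        if run < criteria.max_runs:
            apply_fixes(blocking)  # non-blocking findings are deliberately ignored
    return criteria.max_runs, blocking
```

The point of the sketch is that the loop stops on a rule written down in advance, so LOW-severity findings can keep appearing forever without extending the review.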

Victoria

will do! — "you instead of the agent" is a clean way to put it

Mykola Kondratiuk

glad it landed — the first time the agent tries to argue its own completion criteria is always a moment. curious what you find.

NEXADiag Nexa

One pass from one model isn't enough, agreed. But two passes from the same model aren't enough either.

The real shift happens when you run passes through models from different training families. A single model reviewing its own output is pattern-matching against the same biases that produced the code. When two models from different lineages disagree on the same file, that's where the real bugs hide.

I've been experimenting with running 4 models in parallel on the same codebase and treating an issue as validated only when at least 2 agree. Agreement confirms what is already visible; disagreement is where the signal is.

Curious what your multi-pass setup looks like — same model rerun, or cross-model comparison?

Victoria

Great framing — the disagreement-as-signal idea is genuinely underrated. I've been mostly doing same-model reruns with different prompting angles (reviewer persona, security hat, etc.), which helps but yeah, still the same underlying weights.

The cross-family approach makes sense architecturally — different training lineages = different blind spots. The tricky part is calibrating what "disagreement" means across models that have different verbosity and confidence thresholds.

What's your aggregation layer — custom script, or something off-the-shelf?

NEXADiag Nexa

The aggregation is custom — I built it as a local desktop app (Python/Tkinter → CustomTkinter). Four providers scan in parallel via httpx + ThreadPoolExecutor, raw findings land in a Consensus Engine that does fuzzy deduplication (SequenceMatcher on descriptions), cross-provider voting, and severity-weighted scoring.
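A rough sketch of the parallel fan-out described above, using only `ThreadPoolExecutor` from the standard library. The providers here are stand-in callables; in a real setup each one would presumably wrap an HTTP call (for example via httpx) with its own timeout. `scan_in_parallel` and the return shape are assumptions, not the actual app's code:

```python
from concurrent.futures import ThreadPoolExecutor


def scan_in_parallel(source, providers):
    """Run every provider on the same source concurrently.

    providers: dict mapping a provider name to a callable(source) -> list of findings.
    Returns (results, errors) so that a failing provider is reported, not silently lost.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        # Submit everything first so the calls overlap, then collect.
        futures = {name: pool.submit(fn, source) for name, fn in providers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # keep the scan alive if one provider fails
                errors[name] = repr(exc)
    return results, errors
```

Returning the errors separately matters for the voting step: a provider that crashed should not be counted as a provider that saw nothing.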

Disagreement detection works like this: if a finding appears from 2+ providers with high confidence, it's "validated." If 1 provider flags something none of the others saw, it's flagged as a "conflict" and shown separately — not hidden. The report surfaces both agreements AND disagreements so the human can decide.
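To make the validated/conflict split concrete, here is a simplified sketch using `difflib.SequenceMatcher` for the fuzzy match. The 0.8 similarity threshold, the function names, and the data shapes are assumptions, and the confidence and severity-weighted scoring mentioned above are left out:

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Fuzzy match on finding descriptions; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def build_consensus(findings_by_provider, min_votes=2, threshold=0.8):
    """Cluster near-duplicate findings, then split them by how many providers agree.

    Returns (validated, conflicts). Conflicts are kept for the human to review,
    not discarded.
    """
    clusters = []  # each cluster: {"description": str, "providers": set}
    for provider, descriptions in findings_by_provider.items():
        for text in descriptions:
            for cluster in clusters:
                if similar(text, cluster["description"], threshold):
                    cluster["providers"].add(provider)
                    break
            else:
                # No existing cluster matched, so this finding starts a new one.
                clusters.append({"description": text, "providers": {provider}})
    validated = [c for c in clusters if len(c["providers"]) >= min_votes]
    conflicts = [c for c in clusters if len(c["providers"]) < min_votes]
    return validated, conflicts
```

Each new finding is compared only against the first description in a cluster, which is cheap but order-dependent; a production version would likely need something sturdier.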

The part I didn't expect: when I run it on its own source code, the consensus pattern finds real bugs that single-model reviews missed, including a silent crash that had been in production for six weeks. Agree with your take that human judgment is the final filter — the tool's job is to make the human's list shorter and better prioritized, not to replace the decision.

Three questions for you:

  1. In your 6-run experiment, did you ever try running the same module through two different models and comparing their disagreement patterns?
  2. For the integration pass agent, how do you define "boundaries between modules" concretely — is it interface files, API contracts, or something else?
  3. Have you tested whether the anchoring effect weakens or strengthens when switching from same-model reruns to cross-model passes?
 
Victoria

Thanks for the breakdown! The fuzzy deduplication with SequenceMatcher is a clever move — handling different verbosity levels is always a pain.

1 & 3) Actually, for that specific experiment, I stuck with the same model for all runs. I wanted to see if I could break the "anchoring effect" just by switching personas. But honestly, you’re right — the anchoring is still there. Even with different hats, the model tends to gravitate toward its initial "hunch." That’s why your cross-model approach is definitely the next logical step.

2) For the integration pass, I don't just give it the code. I use a prompt that explicitly defines the "Map of Boundaries" (showing the flow from main.go to api.Router and then to store, gemini, etc.).
The core instruction is: "You are an Integration Reviewer. Do NOT look at the internals — focus only on the points where modules meet." I give it a checklist to hunt for things like:

  • Contract mismatches: E.g., the Store returns an error, but does the API layer actually handle it or just panic?
  • Hidden dependencies: Like checking if all vars in .env.example actually exist in the code.
  • API Asymmetry: Searching for cases where we have Create/Update but forgot Delete.
  • Assumptions: Does Module A assume Module B will never return a nil or an empty string when that’s not guaranteed?
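Some items on a checklist like this can also be checked mechanically before the model ever sees the code. As one illustration, a small sketch of the "hidden dependencies" check: it lists variables declared in `.env.example` that no source file mentions. The function names and the plain text search are assumptions; this is not the prompt-based check described above:

```python
import re

# Matches lines such as "NAME=value" or "export NAME=value"; comment lines are skipped.
ENV_LINE = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=")


def env_names(env_example_text):
    """Return the variable names declared in a .env.example file, in order."""
    names = []
    for line in env_example_text.splitlines():
        match = ENV_LINE.match(line)
        if match:
            names.append(match.group(1))
    return names


def unused_env_vars(env_example_text, source_texts):
    """Return declared variables that appear in none of the given source texts."""
    sources = list(source_texts)
    return [
        name
        for name in env_names(env_example_text)
        if not any(re.search(rf"\b{re.escape(name)}\b", text) for text in sources)
    ]
```

A name-only search will miss variables that are built dynamically, so an empty result is a hint rather than proof.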

It’s basically a "Top-Down" view. It actually found a silent issue where the API assumed the Store would handle authentication, while the Store assumed the API had already done it.

The human filter is still the final step, but this "boundary-first" approach makes the signal-to-noise ratio much better.

NEXADiag Nexa

That completely validates why cross-family evaluation is necessary — breaking that "semantic gravity" requires entirely different architecture baselines, not just a change of hat.

Your checklist approach for the integration pass is excellent. Restricting the context to structural boundaries and handshakes rather than deep internal logic is definitely the most efficient way to keep the signal high. For my setup, I actually feed the engine the interface definitions and routing contracts instead of full files to enforce this exact constraint.

Have you tried running these modular prompts asynchronously in parallel, or are you executing them sequentially to build the context step-by-step?

Victoria

Actually parallel — I run the modular prompts concurrently, not sequentially.

The downside is you have to do the synthesis yourself at the end. But honestly that's where the interesting judgment calls are anyway.

Does your consensus engine handle the case where parallel findings contradict each other on severity — like one provider says critical, another says low for the same issue?

Siyu

When merging results from 5 to 10 passes, how do you handle the noise floor? Do you use majority voting or a separate deduplication and filtering pass to keep false positives from overwhelming developers?

Victoria

Honestly, in my experiment the deduplication was simple: each agent reviewed a separate module, so there wasn't much overlap to merge. The noise problem is bigger when you run the same pass multiple times.

My take: severity threshold beats majority voting here. If something appears once at CRITICAL, it's worth investigating regardless. If it appears once at LOW, probably skip it. Majority voting works better for flagging borderline cases, but it can also bury real issues that only one agent caught because it had a fresh angle. The real filter is still a human reading the output; structured review just makes that list shorter and better prioritized.
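One way this rule could be written down, as a sketch under assumptions: the severity labels, the bucket names, and the `(run_id, issue_key, severity)` tuple shape are all made up for illustration. A single CRITICAL sighting is always kept, a single LOW sighting is skipped, and everything in between falls back to majority voting, with minority findings set aside rather than dropped:

```python
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def triage(findings, total_runs):
    """Sort findings into buckets for the human reviewer.

    findings: iterable of (run_id, issue_key, severity) tuples across all passes.
    """
    runs_seen, worst = {}, {}
    for run_id, key, severity in findings:
        runs_seen.setdefault(key, set()).add(run_id)
        # Track the highest severity any pass assigned to this issue.
        if key not in worst or SEVERITY_RANK[severity] > SEVERITY_RANK[worst[key]]:
            worst[key] = severity

    buckets = {"investigate": [], "borderline": [], "skip": []}
    for key, severity in worst.items():
        votes = len(runs_seen[key])
        if severity == "CRITICAL":
            buckets["investigate"].append(key)  # one sighting is enough
        elif severity == "LOW" and votes == 1:
            buckets["skip"].append(key)
        elif votes * 2 > total_runs:
            buckets["investigate"].append(key)  # majority of passes agree
        else:
            buckets["borderline"].append(key)  # kept visible, not buried
    return buckets
```

The "borderline" bucket is there so that a finding only one agent caught still reaches a human instead of being voted out.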

Mininglamp

The anchoring effect in LLM code review is real — once the model latches onto one category of issues (say null safety), it under-reports everything else. Cross-model validation is one fix, but another practical approach is to constrain each review pass to a specific concern: security in pass 1, performance in pass 2, readability in pass 3. Narrows the attention window and produces more consistent results across runs.
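A small sketch of how concern-scoped passes might be set up: one prompt per concern, each meant to be sent as a fresh request with no shared history. The prompt wording, the focus lists, and `build_concern_passes` are illustrative assumptions:

```python
# One entry per pass; insertion order is the pass order.
CONCERNS = {
    "security": "injection, authentication and authorization gaps, secrets in code",
    "performance": "needless allocations, N+1 queries, blocking calls on hot paths",
    "readability": "naming, dead code, functions that do too much",
}


def build_concern_passes(code, concerns=CONCERNS):
    """Return a list of (concern, prompt) pairs, one narrow prompt per concern."""
    passes = []
    for concern, focus in concerns.items():
        prompt = (
            f"You are reviewing ONLY for {concern}.\n"
            f"Focus: {focus}.\n"
            "Ignore every other category of issue.\n\n"
            f"Code:\n{code}"
        )
        passes.append((concern, prompt))
    return passes
```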

Victoria

not quite the same: you're doing multiple full passes with a different focus each, I'm splitting by module so each agent gets a small context. your approach still hits the attention dilution problem on large codebases, no?

shogun 444

Single-pass AI review feels useful for “surface hygiene,” but deep bugs usually appear only after forcing the model into narrower scopes with fresh context.

The weird part is that the more code you feed at once, the more authoritative the output sounds even while coverage gets worse.

Victoria

Yeah, noticed the same — wider context, more confident output, worse actual coverage. Splitting by module with fresh context per pass was exactly what fixed it for me.

Yunetzi

If AI reviews code, who owns the judgment—the human or the machine?

Victoria

The human. AI surfaces patterns, humans judge context. A false positive in a security-critical path looks identical to a real bug in the output — only the human who knows the system can tell the difference.

The structured approach just makes the human's job more focused, not obsolete.

HARD IN SOFT OUT

the machine. we don't deal with garbage code, which is why you need to build the judgement protocol first. if the code doesn't meet the protocol's criteria, it isn't ready for review; ask for better code. loop it. a real coder knows how to write real code.