The Hook
I ran an AI code review. It found 12 issues. I fixed them. Ran it again — it found 8 more. Fixed those. Ran it again — 5 more. ...
For further actions, you may consider blocking this person and/or reporting abuse
honestly the problem isn't runs per se - it's that AI reviewers have no stake in shipping. no done signal means it'll keep finding things indefinitely. define acceptable criteria before running, not after.
fair point, I didn't — just kept running until it "felt done" which is exactly the trap you're describing. stealing the criteria-first idea for next time :)
the "felt done" loop is the whole trap. criteria-first shifts who owns the exit condition — you instead of the agent. let me know how it lands after a few runs.
will do! — "you instead of the agent" is a clean way to put it
glad it landed — the first time the agent tries to argue its own completion criteria is always a moment. curious what you find.
One pass from one model isn't enough — agreed. But two passes from the same model isn't enough either.
The real shift happens when you run passes through models from different training families. A single model reviewing its own output is pattern-matching against the same biases that produced the code. When two models from different lineages disagree on the same file, that's where the real bugs hide.
I've been experimenting with running 4 models in parallel on the same codebase and only surfacing issues where at least 2 agree. The agreement is noise. The disagreement is signal.
Curious what your multi-pass setup looks like — same model rerun, or cross-model comparison?
Great framing — the disagreement-as-signal idea is genuinely underrated. I've been mostly doing same-model reruns with different prompting angles (reviewer persona, security hat, etc.), which helps but yeah, still the same underlying weights.
The cross-family approach makes sense architecturally — different training lineages = different blind spots. The tricky part is calibrating what "disagreement" means across models that have different verbosity and confidence thresholds.
What's your aggregation layer — custom script, or something off-the-shelf?
The aggregation is custom — I built it as a local desktop app (Python/Tkinter → CustomTkinter). Four providers scan in parallel via httpx + ThreadPoolExecutor, raw findings land in a Consensus Engine that does fuzzy deduplication (SequenceMatcher on descriptions), cross-provider voting, and severity-weighted scoring.
Disagreement detection works like this: if a finding appears from 2+ providers with high confidence, it's "validated." If 1 provider flags something none of the others saw, it's flagged as a "conflict" and shown separately — not hidden. The report surfaces both agreements AND disagreements so the human can decide.
The part I didn't expect: when I run it on its own source code, the consensus pattern finds real bugs that single-model reviews missed, including a silent crash that had been in production for six weeks. Agree with your take that human judgment is the final filter — the tool's job is to make the human's list shorter and better prioritized, not to replace the decision.
Three questions for you:
Thanks for the breakdown! The fuzzy deduplication with SequenceMatcher is a clever move — handling different verbosity levels is always a pain.
1 & 3) Actually, for that specific experiment, I stuck with the same model for all runs. I wanted to see if I could break the "anchoring effect" just by switching personas. But honestly, you’re right — the anchoring is still there. Even with different hats, the model tends to gravitate toward its initial "hunch." That’s why your cross-model approach is definitely the next logical step.
2) For the integration pass, I don't just give it the code. I use a prompt that explicitly defines the "Map of Boundaries" (showing the flow from main.go to api.Router and then to store, gemini, etc.).
The core instruction is: "You are an Integration Reviewer. Do NOT look at the internals — focus only on the points where modules meet." I give it a checklist to hunt for things like:
It’s basically a "Top-Down" view. It actually found a silent issue where the API assumed the Store would handle authentication, while the Store assumed the API had already done it.
The human filter is still the final step, but this "boundary-first" approach makes the signal-to-noise ratio much better.
That completely validates why cross-family evaluation is necessary — breaking that "semantic gravity" requires entirely different architecture baselines, not just a change of hat.
Your checklist approach for the integration pass is excellent. Restricting the context to structural boundaries and handshakes rather than deep internal logic is definitely the most efficient way to keep the signal high. For my setup, I actually feed the engine the interface definitions and routing contracts instead of full files to enforce this exact constraint.
Have you tried running these modular prompts asynchronously in parallel, or are you executing them sequentially to build the context step-by-step?
Actually parallel — I run the modular prompts concurrently, not sequentially.
The downside is you have to do the synthesis yourself at the end. But honestly that's where the interesting judgment calls are anyway.
Does your consensus engine handle the case where parallel findings contradict each other on severity — like one provider says critical, another says low for the same issue?
When merging results from 5 to 10 passes, how do you handle the noise floor? Do you use majority voting or a separate deduplication and filtering pass to keep false positives from overwhelming developers?
Honestly, in my experiment the deduplication was simple: each agent reviewed a separate module, so there wasn't much overlap to merge. The noise problem is bigger when you run the same pass multiple times.
My take: severity threshold beats majority voting here. If something appears once at CRITICAL — it's worth investigating regardless. If it appears once at LOW — probably skip it. Majority voting works better for flagging borderline cases, but it can also bury real issues that only one agent caught because it had a fresh angle. The real filter is still a human reading the output — structured review just makes that list shorter and better prioritized.
The anchoring effect in LLM code review is real — once the model latches onto one category of issues (say null safety), it under-reports everything else. Cross-model validation is one fix, but another practical approach is to constrain each review pass to a specific concern: security in pass 1, performance in pass 2, readability in pass 3. Narrows the attention window and produces more consistent results across runs.
not quite the same — you're doing multiple full passes with different focus, I'm splitting by module so each agent gets a small context. your approach still hits the attention dilution problem on large codebases no?
Single-pass AI review feels useful for “surface hygiene,” but deep bugs usually appear only after forcing the model into narrower scopes with fresh context.
The weird part is that the more code you feed at once, the more authoritative the output sounds even while coverage gets worse.
Yeah, noticed the same — wider context, more confident output, worse actual coverage. Splitting by module with fresh context per pass was exactly what fixed it for me.
If AI reviews code, who owns the judgment—the human or the machine?
The human. AI surfaces patterns, humans judge context. A false positive in a security-critical path looks identical to a real bug in the output — only the human who knows the system can tell the difference.
The structured approach just makes the human's job more focused, not obsolete.
the machine. because we don't deal with garbage code, that's why you need to build the judgement protocol at first. if the code doesn't match the judgement protocol criteria it doesn't need for review, ask for better code. loop it. real coder knows how to do the real code.