The Hook
I ran an AI code review. It found 12 issues. I fixed them. Ran it again — it found 8 more. Fixed those. Ran it again — 5 more. After the sixth run, I started to suspect that something was wrong.
If you are a tech lead or a developer using AI to review large projects, you probably know this feeling. Every new run finds something the previous one missed. Your tokens are draining, but the confidence that "now it's finally clean" never comes.
This isn't a bug in your prompt. It’s a structural feature of how LLMs work. And there is something we can do about it.
Why LLMs Always Find Something New
An LLM doesn’t read code deterministically like a compiler or a linter, which checks every single line in sequence. Instead, the model generates its response probabilistically: each token is sampled from a probability distribution, and a setting called temperature controls how much randomness goes into that choice. With any non-zero temperature, the result can vary even with the exact same code and prompt.
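To make the sampling idea concrete, here is a minimal sketch of temperature-scaled sampling. The token names and scores are invented for illustration, and real models work over a far larger vocabulary, so treat this as a toy model of the mechanism rather than how any particular provider implements it.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one token from a {token: score} dict using temperature scaling."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: always take the top-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical scores for the theme a reviewer might pursue next.
logits = {"null-safety": 2.0, "race-condition": 1.6, "sql-injection": 1.2}

rng = random.Random(0)
greedy = {sample_token(logits, 0, rng) for _ in range(20)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}
print(greedy)   # only the single highest-scoring token
print(sampled)  # typically several distinct tokens
```

At temperature 0 the same theme wins every time; at temperature 1.0 lower-scoring themes get picked regularly, which is one way two identical review runs can head in different directions.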
On top of this, there is the anchoring effect. If the model "hooks" onto null-safety issues during its first pass, it continues to look for similar problems—and might completely overlook, for example, race conditions. The next run might anchor onto something else entirely.
This doesn't mean AI is a bad reviewer. It means that a single run does not provide full coverage—much like a single QA engineer won't find every bug in a massive product in one day.
What the Science Says
It turns out this isn't just a hunch—it’s a well-documented phenomenon. Here is what recent research (2025) has found:
The Consistency Gap (Semgrep, 2025)
Researchers at Semgrep directly addressed the issue of non-determinism. In their experiments, running the exact same security prompt on the same codebase multiple times yielded wildly different results. In one case, three identical runs produced 3, 6, and then 11 distinct findings. They attributed this to "context compaction": as the model tries to process large amounts of code, it uses lossy compression and inevitably loses track of specific details. (link)
The 118% Recall Boost (SWR-Bench, 2025)
The SWR-Bench study quantified how much we miss in a single pass. They found that by using a "Self-Aggregation" strategy (running the review 10 times and merging the results), Recall (the percentage of real bugs found) increased by 118%. Put differently, the aggregated review found about 2.18 times as many real issues, so a single pass surfaces less than half (roughly 46%) of what ten merged passes catch. (link)
Specialized Lenses vs. General Prompts (Ericsson, 2025)
Software engineers at Ericsson found that a "naive" approach (one big prompt) fails in practice. Instead, they moved toward a strategy of specialized prompts (Security, Logic, Design, and I/O) and restricted the model's focus to the "enclosing method" of the changes. This targeted approach significantly reduced hallucinations and improved precision. (link)
The conclusion is clear: one giant request for the entire project is the worst possible strategy.
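The "Self-Aggregation" idea from the SWR-Bench summary above can be sketched in a few lines. This is an illustration under assumptions, not the study's method: the `run_review` callable, the shape of a finding, and the file names in the demo are all hypothetical.

```python
from typing import Callable, Iterable

Finding = tuple[str, str]  # (file path, short issue description); assumed shape

def self_aggregate(run_review: Callable[[], Iterable[Finding]], runs: int) -> dict[Finding, int]:
    """Run the same review several times and count how many runs reported each finding."""
    counts: dict[Finding, int] = {}
    for _ in range(runs):
        # Normalize descriptions, then count each finding at most once per run.
        seen = {(path, issue.strip().lower()) for path, issue in run_review()}
        for key in seen:
            counts[key] = counts.get(key, 0) + 1
    return counts

# Canned results standing in for three real review runs (made-up data).
canned = iter([
    [("hub.go", "Clients stored in-process")],
    [("hub.go", "clients stored in-process"), ("store.go", "Missing TTL")],
    [("main.go", "Token logged via query param")],
])
merged = self_aggregate(lambda: next(canned), runs=3)
print(len(merged))  # 3 distinct findings across the three runs
```

Exact-match merging like this only works when descriptions are worded identically; real model output usually needs fuzzier matching.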
My Experiment: 6 Runs, 6 Different Realities
I tested this on a real-world project, running a review on the exact same repository six times in a row. Every single time, it found something new, purely because it "anchored" on different themes:
- Run 1 (Architecture): It flagged that the WebSocket Hub was storing clients in-process (killing scalability) and missed a readPump for CloseMessage events.
- Run 2 (Performance): It pivoted to N+1 Redis queries and missing TTLs, noting that Redis would grow infinitely.
- Run 3 (Security): It finally noticed auth tokens leaking into logs via query params and a forgotten dump.rdb file committed to git.
This is the anchoring effect in action. The model "latches onto" a theme and hunts for similar patterns while ignoring the rest. Six runs weren't just repetitions; they were six different, incomplete lenses on the same code.
The Showdown: 6 Traditional Runs vs. 1 Structured Pass
To see if "just running it more" was a viable strategy, I compared the two approaches.
First, I aggregated the results of six traditional reviews (feeding the entire project at once). Despite the volume, the combined reports still missed two critical vulnerabilities.
Then I switched to the Structured Approach: targeted, module-based passes. On the very first pass, the agent caught both critical bugs. These were not typos—they were deep, cross-file logic failures:
- The Invisible Bypass: A docker-compose file injected a default API key via ${API_KEY:-default-value}, silently overriding all app-level security checks.
- The Open Door: Redis was exposed externally without a password, a "day-zero" disaster waiting to happen.
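The ${API_KEY:-default-value} form is shell-style interpolation: when API_KEY is unset or empty, the text after ":-" is used instead. A cheap mechanical check can flag this pattern before any AI review runs. The sketch below is one possible heuristic; the list of "secret-looking" name fragments and the sample file are assumptions made for illustration.

```python
import re

# Matches ${NAME:-default} and ${NAME-default} interpolations.
DEFAULT_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):?-([^}]*)\}")
# Hypothetical heuristic for "this variable probably holds a secret".
SECRET_HINTS = ("KEY", "TOKEN", "SECRET", "PASSWORD")

def find_default_secrets(compose_text: str) -> list[tuple[int, str, str]]:
    """Return (line number, variable, default) for secret-looking variables with a non-empty fallback."""
    hits = []
    for lineno, line in enumerate(compose_text.splitlines(), start=1):
        for name, default in DEFAULT_PATTERN.findall(line):
            if default and any(hint in name.upper() for hint in SECRET_HINTS):
                hits.append((lineno, name, default))
    return hits

sample = """\
services:
  api:
    environment:
      - API_KEY=${API_KEY:-default-value}
      - LOG_LEVEL=${LOG_LEVEL:-info}
"""
print(find_default_secrets(sample))  # [(4, 'API_KEY', 'default-value')]
```

A regex scan will not understand YAML structure or env files, so it complements a review rather than replacing one.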
Does this mean traditional reviews are useless?
Not at all. After fixing the critical bugs, a final traditional run found several "tactical" issues, like a service using the wrong Gemini key. The structured agent had seen these too, but because they were flagged as LOW priority, they were buried in the specialized reports.
Solution: Divide and Conquer
From research and experiments, a clear strategy emerges. Instead of one big request, use two levels:
- Tier 1 — Parallel Module Review: Divide the project into independent modules (handlers, store, infra) and run a separate agent on each in a fresh session. This forces the model's attention to stay focused on one chunk of code, preventing "context rot."
- Tier 2 — The Integration Pass: A dedicated agent looks only at the boundaries between modules: interfaces, contracts, and shared assumptions.
Bonus: Run each module through specific categories separately (Security, Performance, Logic). As Ericsson showed, this adds extra layers of accuracy.
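The two tiers can be wired together roughly as follows. This is a sketch under assumptions: `review` stands in for whatever LLM client you use and is assumed to open a fresh session per call, the `boundaries` text (for example, interface definitions) is something you prepare yourself, and the module contents in the demo are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical signature: review(instructions, payload) -> list of findings.
# Each call is assumed to start a fresh session with no memory of other calls.
Reviewer = Callable[[str, str], list[str]]

def review_in_tiers(modules: dict[str, str], boundaries: str, review: Reviewer) -> dict[str, list[str]]:
    """Tier 1: one isolated review per module, run in parallel. Tier 2: one boundary-only pass."""
    with ThreadPoolExecutor(max_workers=max(1, len(modules))) as pool:
        futures = {
            name: pool.submit(review, f"Review only the '{name}' module.", code)
            for name, code in modules.items()
        }
        report = {name: future.result() for name, future in futures.items()}
    report["integration"] = review(
        "Review only the boundaries between modules: interfaces, contracts, shared assumptions.",
        boundaries,
    )
    return report

# Stand-in reviewer so the sketch runs without any API; replace with a real client.
def fake_review(instructions: str, payload: str) -> list[str]:
    return [f"{instructions} ({len(payload)} chars read)"]

report = review_in_tiers(
    {"handlers": "func Handle() {}", "store": "func Get() {}", "infra": "services: {}"},
    boundaries="handlers -> store: Get(key)",
    review=fake_review,
)
print(sorted(report))  # ['handlers', 'infra', 'integration', 'store']
```

Threads are used here because the work is assumed to be network-bound API calls; the integration pass runs last only so that its output lands in the same report.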
Verdict: Precision over Volume
Six shallow passes are not equivalent to one deep, structured pass. The traditional "big picture" review is like a generalist doing reconnaissance—good for surface-level tactical problems. But for high-risk, critical vulnerabilities, you need a Structured Approach that forces the AI to stop skimming and start investigating.
Practical Recommendations
If you want to apply this to your project, here are the concrete steps:
- Break the code into modules by functional boundaries. Not by files, but by areas of responsibility: handlers, store, AI clients, WebSocket, infrastructure.
- Run a separate agent in a fresh session for each module. A new session is a "reset" for the anchor. The agent doesn't know what previous runs found, so it looks with fresh eyes.
- Give each agent a specific checklist by category. Security, performance, reliability, code quality — in order. But ask the agent to do one more sweep after the checklist to find what wasn't covered.
- Add an integration pass. A separate agent that looks strictly at the boundaries between modules. The deadliest bugs live there.
- Don't feed the entire project into one prompt. This doesn't just kill accuracy; it burns tokens on a massive context that the model cannot effectively "digest."
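The checklist step from the list above might look like this in practice. The prompt wording is an assumption, not a tested template; the part taken from the recommendations is the structure: one module, categories in a fixed order, then one open-ended sweep.

```python
CATEGORIES = ("security", "performance", "reliability", "code quality")

def build_module_prompt(module: str, categories: tuple[str, ...] = CATEGORIES) -> str:
    """Build an ordered checklist prompt for one module, ending with an open-ended sweep."""
    lines = [f"You are reviewing only the '{module}' module. Work through these steps in order:"]
    for step, category in enumerate(categories, start=1):
        lines.append(f"{step}. Report {category} issues.")
    lines.append(f"{len(categories) + 1}. Do one more sweep for anything the checklist did not cover.")
    return "\n".join(lines)

print(build_module_prompt("store"))
```

Generating the prompt from data keeps every module on the same checklist, which makes the resulting reports easier to compare.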
Going by the SWR-Bench numbers, one run finds less than half of what ten aggregated runs surface. But the point isn't just to run it more; the point is to make sure each run looks at less code, with a sharp focus, in a fresh session.
Top comments (24)
honestly the problem isn't runs per se - it's that AI reviewers have no stake in shipping. no done signal means it'll keep finding things indefinitely. define acceptable criteria before running, not after.
fair point, I didn't — just kept running until it "felt done" which is exactly the trap you're describing. stealing the criteria-first idea for next time :)
the "felt done" loop is the whole trap. criteria-first shifts who owns the exit condition — you instead of the agent. let me know how it lands after a few runs.
will do! — "you instead of the agent" is a clean way to put it
glad it landed — the first time the agent tries to argue its own completion criteria is always a moment. curious what you find.
One pass from one model isn't enough — agreed. But two passes from the same model isn't enough either.
The real shift happens when you run passes through models from different training families. A single model reviewing its own output is pattern-matching against the same biases that produced the code. When two models from different lineages disagree on the same file, that's where the real bugs hide.
I've been experimenting with running 4 models in parallel on the same codebase and only surfacing issues where at least 2 agree. The agreement is noise. The disagreement is signal.
Curious what your multi-pass setup looks like — same model rerun, or cross-model comparison?
Great framing — the disagreement-as-signal idea is genuinely underrated. I've been mostly doing same-model reruns with different prompting angles (reviewer persona, security hat, etc.), which helps but yeah, still the same underlying weights.
The cross-family approach makes sense architecturally — different training lineages = different blind spots. The tricky part is calibrating what "disagreement" means across models that have different verbosity and confidence thresholds.
What's your aggregation layer — custom script, or something off-the-shelf?
The aggregation is custom — I built it as a local desktop app (Python/Tkinter → CustomTkinter). Four providers scan in parallel via httpx + ThreadPoolExecutor, raw findings land in a Consensus Engine that does fuzzy deduplication (SequenceMatcher on descriptions), cross-provider voting, and severity-weighted scoring.
Disagreement detection works like this: if a finding appears from 2+ providers with high confidence, it's "validated." If 1 provider flags something none of the others saw, it's flagged as a "conflict" and shown separately — not hidden. The report surfaces both agreements AND disagreements so the human can decide.
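For readers who want to picture the consensus step, here is a minimal sketch of fuzzy deduplication plus cross-provider voting. It is not the commenter's code: the 0.75 similarity threshold, the provider names, and the sample findings are assumptions, and the sketch ignores confidence and severity weighting.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """Fuzzy match on descriptions; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def consensus(findings: list[tuple[str, str]], min_votes: int = 2):
    """findings: (provider, description) pairs. Returns (validated, conflicts)."""
    clusters: list[dict] = []
    for provider, description in findings:
        for cluster in clusters:
            if similar(description, cluster["description"]):
                cluster["providers"].add(provider)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append({"description": description, "providers": {provider}})
    validated = [c for c in clusters if len(c["providers"]) >= min_votes]
    conflicts = [c for c in clusters if len(c["providers"]) < min_votes]
    return validated, conflicts

validated, conflicts = consensus([
    ("model_a", "Redis exposed without password"),
    ("model_b", "Redis is exposed without a password"),
    ("model_c", "Auth token logged via query params"),
])
print(len(validated), len(conflicts))  # 1 1
```

The single-provider finding is kept in its own list rather than dropped, matching the idea that lone findings should be shown to the human, not hidden.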
The part I didn't expect: when I run it on its own source code, the consensus pattern finds real bugs that single-model reviews missed, including a silent crash that had been in production for six weeks. Agree with your take that human judgment is the final filter — the tool's job is to make the human's list shorter and better prioritized, not to replace the decision.
Three questions for you:
Thanks for the breakdown! The fuzzy deduplication with SequenceMatcher is a clever move — handling different verbosity levels is always a pain.
1 & 3) Actually, for that specific experiment, I stuck with the same model for all runs. I wanted to see if I could break the "anchoring effect" just by switching personas. But honestly, you’re right — the anchoring is still there. Even with different hats, the model tends to gravitate toward its initial "hunch." That’s why your cross-model approach is definitely the next logical step.
2) For the integration pass, I don't just give it the code. I use a prompt that explicitly defines the "Map of Boundaries" (showing the flow from main.go to api.Router and then to store, gemini, etc.).
The core instruction is: "You are an Integration Reviewer. Do NOT look at the internals — focus only on the points where modules meet." I give it a checklist to hunt for things like:
It’s basically a "Top-Down" view. It actually found a silent issue where the API assumed the Store would handle authentication, while the Store assumed the API had already done it.
The human filter is still the final step, but this "boundary-first" approach makes the signal-to-noise ratio much better.
That completely validates why cross-family evaluation is necessary — breaking that "semantic gravity" requires entirely different architecture baselines, not just a change of hat.
Your checklist approach for the integration pass is excellent. Restricting the context to structural boundaries and handshakes rather than deep internal logic is definitely the most efficient way to keep the signal high. For my setup, I actually feed the engine the interface definitions and routing contracts instead of full files to enforce this exact constraint.
Have you tried running these modular prompts asynchronously in parallel, or are you executing them sequentially to build the context step-by-step?
Actually parallel — I run the modular prompts concurrently, not sequentially.
The downside is you have to do the synthesis yourself at the end. But honestly that's where the interesting judgment calls are anyway.
Does your consensus engine handle the case where parallel findings contradict each other on severity — like one provider says critical, another says low for the same issue?
Yes — that's actually the most valuable signal in the whole pipeline.
The engine uses a severity spread metric: when two providers see the same finding but disagree on severity by more than one level (critical vs low), that finding gets flagged as a "severity conflict" and pushed to a separate section of the report with both verdicts visible. It's not averaged or merged.
The rationale: severity disagreement usually means one model sees context the other missed. It's not noise — it's the exact moment where a human should look.
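A severity spread check of this kind can be very small. The sketch below assumes a four-level scale and hypothetical provider names; the actual engine's levels and thresholds are not described in the thread.

```python
LEVELS = {"low": 0, "medium": 1, "high": 2, "critical": 3}  # assumed four-level scale

def severity_conflict(verdicts: dict[str, str], max_spread: int = 1) -> bool:
    """True when providers disagree on the same finding by more than max_spread levels."""
    ranks = [LEVELS[verdict.lower()] for verdict in verdicts.values()]
    return max(ranks) - min(ranks) > max_spread

print(severity_conflict({"provider_a": "critical", "provider_b": "low"}))   # True
print(severity_conflict({"provider_a": "critical", "provider_b": "high"}))  # False
```

Returning a flag instead of an averaged severity keeps both verdicts visible for the person reading the report.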
In practice, I've found that security findings produce the most severity spread across models, while logic errors tend to be more consistently rated. Have you noticed the same pattern, or does it vary by module in your setup?
Honestly, I didn't notice that pattern — severity ratings were fairly consistent across passes in my setup, mostly landing in the critical range regardless of the finding type. Security, logic errors — both came back with similar weight.
Not sure if that's a function of the codebase I was testing on, or the models I used. What providers are you running in your four-model setup?
I'm currently running Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and DeepSeek-V3 (via local APIs/wrappers)—and I’m actually adding support for local LLMs in future versions to make the whole consensus engine run 100% on-device. If you want to see how it handles this, I have the project linked on my dev.to profile—I’m constantly refining its arbitration layer.
The consistency you're seeing is exactly the "semantic gravity" of a single-model setup. When you use the same weights, the model's baseline evaluation of what is "critical" doesn't change, even if you switch the persona.
Once you mix training lineages, the spread widens. For instance, Gemini might flag an exposed route as an immediate high-severity threat, while Sonnet focuses deeply on the internal logic handling that route and rates it lower if the data input is sanitized. That's where the real signal starts.
That local-first approach is genuinely exciting — running a full consensus engine on-device without any cloud dependency changes the privacy story completely. Will definitely check out your project.
When merging results from 5 to 10 passes, how do you handle the noise floor? Do you use majority voting or a separate deduplication and filtering pass to keep false positives from overwhelming developers?
Honestly, in my experiment the deduplication was simple: each agent reviewed a separate module, so there wasn't much overlap to merge. The noise problem is bigger when you run the same pass multiple times.
My take: severity threshold beats majority voting here. If something appears once at CRITICAL — it's worth investigating regardless. If it appears once at LOW — probably skip it. Majority voting works better for flagging borderline cases, but it can also bury real issues that only one agent caught because it had a fresh angle. The real filter is still a human reading the output — structured review just makes that list shorter and better prioritized.
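One way to combine the two filters described above, as a sketch: CRITICAL findings always pass, and everything else falls back to a strict majority vote. The cut-off and the fallback rule are assumptions chosen to mirror the comment, not a tuned policy.

```python
def keep_finding(severity: str, votes: int, total_runs: int) -> bool:
    """Severity threshold first, majority voting only for everything below it."""
    if severity.upper() == "CRITICAL":
        return True  # a single CRITICAL report is worth a human look
    return votes * 2 > total_runs  # otherwise require a strict majority of runs

print(keep_finding("CRITICAL", votes=1, total_runs=6))  # True
print(keep_finding("LOW", votes=1, total_runs=6))       # False
print(keep_finding("MEDIUM", votes=4, total_runs=6))    # True
```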
The anchoring effect in LLM code review is real — once the model latches onto one category of issues (say null safety), it under-reports everything else. Cross-model validation is one fix, but another practical approach is to constrain each review pass to a specific concern: security in pass 1, performance in pass 2, readability in pass 3. Narrows the attention window and produces more consistent results across runs.
not quite the same — you're doing multiple full passes with different focus, I'm splitting by module so each agent gets a small context. your approach still hits the attention dilution problem on large codebases no?
Single-pass AI review feels useful for “surface hygiene,” but deep bugs usually appear only after forcing the model into narrower scopes with fresh context.
The weird part is that the more code you feed at once, the more authoritative the output sounds even while coverage gets worse.
Yeah, noticed the same — wider context, more confident output, worse actual coverage. Splitting by module with fresh context per pass was exactly what fixed it for me.
If AI reviews code, who owns the judgment—the human or the machine?
The human. AI surfaces patterns, humans judge context. A false positive in a security-critical path looks identical to a real bug in the output — only the human who knows the system can tell the difference.
The structured approach just makes the human's job more focused, not obsolete.
the machine. because we don't deal with garbage code, you need to build the judgement protocol first. if the code doesn't match the protocol's criteria, it isn't worth reviewing: ask for better code. loop it. a real coder knows how to write real code.