The Never‑Ending AI Code Review: Why One Pass Isn’t Enough

#ai #softwaredevelopment #codereview #agents

The Hook

I ran an AI code review. It found 12 issues. I fixed them. Ran it again — it found 8 more. Fixed those. Ran it again — 5 more. After the sixth run, I started to suspect that something was wrong.

If you are a tech lead or a developer using AI to review large projects, you probably know this feeling. Every new run finds something the previous one missed. Your tokens are draining, but the confidence that "now it's finally clean" never comes.

This isn't a bug in your prompt. It’s a structural feature of how LLMs work. And there is something we can do about it.

Why LLMs Always Find Something New

An LLM doesn’t read code deterministically like a compiler or a linter, which checks every single line in sequence. Instead, the model generates its response probabilistically: each token is sampled from a probability distribution, and a setting called temperature controls how much randomness goes into that choice. At typical nonzero temperatures, the exact same code and prompt can produce a different result on every run.
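
To make the role of temperature concrete, here is a minimal sketch of temperature-scaled sampling. The topic names and scores are made up for illustration, and a real model samples over tokens rather than whole review topics, so treat this as a toy model of the effect and not a description of any vendor's decoder.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Pick an index from raw scores after temperature scaling."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always take the top score.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical scores for which kind of issue the review mentions first.
topics = ["null-safety", "race condition", "missing TTL"]
logits = [2.0, 1.5, 1.0]

rng = random.Random(0)
greedy = {topics[sample_with_temperature(logits, 0.0, rng)] for _ in range(200)}
sampled = {topics[sample_with_temperature(logits, 1.0, rng)] for _ in range(200)}

print(greedy)   # only the top-scoring topic ever appears
print(sampled)  # more than one topic appears across repeated draws
```

With temperature at zero the same topic wins every time; at 1.0 the lower-scored topics get picked on a meaningful share of draws, which is the run-to-run variation described above.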

On top of this, there is the anchoring effect. If the model "hooks" onto null-safety issues during its first pass, it continues to look for similar problems—and might completely overlook, for example, race conditions. The next run might anchor onto something else entirely.

This doesn't mean AI is a bad reviewer. It means that a single run does not provide full coverage—much like a single QA engineer won't find every bug in a massive product in one day.

What the Science Says

It turns out this isn't just a hunch—it’s a well-documented phenomenon. Here is what recent research (2025) has found:

The Consistency Gap (Semgrep, 2025)

Researchers at Semgrep directly addressed the issue of non-determinism. In their experiments, running the exact same security prompt on the same codebase multiple times yielded wildly different results. In one case, three identical runs produced 3, 6, and then 11 distinct findings. They attributed this to "context compaction"—as the model tries to process large amounts of code, it uses lossy compression, inevitably losing track of specific details. (link)

The 118% Recall Boost (SWR-Bench, 2025)

The SWR-Bench study quantified how much a single pass misses. Using a "Self-Aggregation" strategy (running the review 10 times and merging the results), recall, meaning the share of real issues found, increased by 118%. Since recall cannot exceed 100%, a 2.18x improvement implies that the single pass was finding at most about 46% of the real issues, which is less than half. (link)
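
The run-and-merge idea can be sketched in a few lines. This is a simplified union-and-count under my own assumptions: the finding shape, the file names, and the `fake_review` stand-in are all hypothetical, and the aggregation used in the study itself may work differently.

```python
from typing import Callable, Iterable

Finding = tuple[str, int, str]  # (file, line, category): a hypothetical shape

def self_aggregate(run_review: Callable[[int], Iterable[Finding]],
                   runs: int) -> dict[Finding, int]:
    """Run the review `runs` times and count how often each finding appears."""
    counts: dict[Finding, int] = {}
    for i in range(runs):
        for finding in set(run_review(i)):  # dedupe within a single run
            counts[finding] = counts.get(finding, 0) + 1
    return counts

# Stand-in for a real LLM call: each "run" returns a different partial view.
FAKE_RUNS = [
    [("a.py", 41, "scalability"), ("b.py", 12, "security")],
    [("c.py", 88, "performance"), ("b.py", 12, "security")],
    [("c.py", 88, "performance"), ("a.py", 97, "reliability")],
]

def fake_review(i: int) -> list[Finding]:
    return FAKE_RUNS[i % len(FAKE_RUNS)]

merged = self_aggregate(fake_review, runs=3)
single = set(FAKE_RUNS[0])
print(len(single), "findings in one run,", len(merged), "after merging")
```

Keeping the count per finding is a design choice: issues reported by several runs are more likely to be real, while one-off findings deserve a closer look before anyone acts on them.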

Specialized Lenses vs. General Prompts (Ericsson, 2025)

Software engineers at Ericsson found that a "naive" approach (one big prompt) fails in practice. Instead, they moved toward a strategy of specialized prompts (Security, Logic, Design, and I/O) and restricted the model's focus to the "enclosing method" of the changes. This targeted approach significantly reduced hallucinations and improved precision. (link)

The conclusion is clear: one giant request for the entire project is the least reliable way to run an AI review.

My Experiment: 6 Runs, 6 Different Realities

I tested this on a real-world project, running a review on the exact same repository six times in a row. Every run found something new because it "anchored" on a different theme. The first three runs show the pattern:

Run 1 (Architecture): It flagged that the WebSocket Hub was storing clients in-process (killing scalability) and that readPump was missing handling for CloseMessage events.
Run 2 (Performance): It pivoted to N+1 Redis queries and missing TTLs, noting that Redis memory usage would grow without bound.
Run 3 (Security): It finally noticed auth tokens leaking into logs via query params and a forgotten dump.rdb file committed to git.

This is the anchoring effect in action. The model "latches onto" a theme and hunts for similar patterns while ignoring the rest. Six runs weren't just repetitions; they were six different, incomplete lenses on the same code.

The Showdown: 6 Traditional Runs vs. 1 Structured Pass

To see if "just running it more" was a viable strategy, I compared the two approaches.

First, I aggregated the results of six traditional reviews (feeding the entire project at once). Despite the volume, the combined reports still missed two critical vulnerabilities.

Then I switched to the Structured Approach: targeted, module-based passes. On the very first pass, the agent caught both critical bugs. These were not typos—they were deep, cross-file logic failures:

The Invisible Bypass: A docker-compose file injected a default API key via ${API_KEY:-default-value}, silently overriding all app-level security checks.
The Open Door: Redis was exposed externally without a password—a "day-zero" disaster waiting to happen.
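
To show why the first bug is so easy to miss, here is a small simulation of the `${API_KEY:-default-value}` substitution. It assumes one plausible mechanism, namely that the app only checks whether a key is present; the `app_requires_key` guard is hypothetical and is not the project's actual code.

```python
def expand_default(env: dict[str, str], name: str, default: str) -> str:
    """Mimic ${NAME:-default}: fall back when NAME is unset or empty."""
    value = env.get(name, "")
    return value if value else default

def app_requires_key(container_env: dict[str, str]) -> bool:
    """Hypothetical app-level guard: refuse to run without an API key."""
    return bool(container_env.get("API_KEY"))

# The operator never set API_KEY on the host.
host_env: dict[str, str] = {}

# The compose file fills the gap before the app ever starts.
container_env = {
    "API_KEY": expand_default(host_env, "API_KEY", "default-value"),
}

print(container_env["API_KEY"])         # default-value
print(app_requires_key(container_env))  # True: the guard never fires
```

Neither file looks wrong on its own. The compose file has a convenient fallback and the app has a sensible guard; the failure only exists when both are read together, which is why a boundary-focused pass can catch it.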

Does this mean traditional reviews are useless?

Not at all. After fixing the critical bugs, a final traditional run found several "tactical" issues, like a service using the wrong Gemini key. The structured agent had seen these too, but because they were flagged as LOW priority, they were buried in the specialized reports.

Solution: Divide and Conquer

From research and experiments, a clear strategy emerges. Instead of one big request, use two levels:

Tier 1 — Parallel Module Review: Divide the project into independent modules (handlers, store, infra) and run a separate agent on each in a fresh session. This forces the model's attention to stay focused on one chunk of code, preventing "context rot."
Tier 2 — The Integration Pass: A dedicated agent looks only at the boundaries between modules: interfaces, contracts, and shared assumptions.

Bonus: Run each module through specific categories separately (Security, Performance, Logic). As Ericsson showed, this adds extra layers of accuracy.
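
The two tiers plus the category lenses can be wired together roughly like this. The `Agent` interface and `stub_agent` are assumptions made for the sketch; in practice each call would go to whatever LLM client you use, opened as a fresh session.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], list[str]]  # prompt in, findings out (hypothetical)

CATEGORIES = ["Security", "Performance", "Logic"]

def review_module(agent: Agent, module: str, code: str) -> list[str]:
    """Tier 1: one module, one category per call, each call a fresh session."""
    findings: list[str] = []
    for category in CATEGORIES:
        prompt = f"Review module '{module}' for {category} issues only.\n\n{code}"
        findings += [f"[{module}/{category}] {f}" for f in agent(prompt)]
    return findings

def structured_review(agent: Agent, modules: dict[str, str],
                      boundaries: str) -> list[str]:
    # Tier 1: modules are independent, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        per_module = list(
            pool.map(lambda kv: review_module(agent, *kv), modules.items())
        )
    findings = [f for group in per_module for f in group]
    # Tier 2: a separate pass that sees only interfaces and shared assumptions.
    prompt = f"Review only the boundaries between modules.\n\n{boundaries}"
    findings += [f"[integration] {f}" for f in agent(prompt)]
    return findings

def stub_agent(prompt: str) -> list[str]:
    """Stand-in for a real LLM call; returns one finding per prompt."""
    return [f"checked {len(prompt)} chars"]

modules = {"handlers": "...", "store": "...", "infra": "..."}
report = structured_review(stub_agent, modules, boundaries="...")
print(len(report))  # 3 modules x 3 categories + 1 integration pass = 10
```

Tagging each finding with its module and category keeps the merged report traceable, so low-priority items stay easy to locate in the combined output.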

Verdict: Precision over Volume

Six shallow passes are not equivalent to one deep, structured pass. The traditional "big picture" review is like a generalist doing reconnaissance—good for surface-level tactical problems. But for high-risk, critical vulnerabilities, you need a Structured Approach that forces the AI to stop skimming and start investigating.

Practical Recommendations

If you want to apply this to your project, here are the concrete steps:

Break the code into modules by functional boundaries. Not by files, but by areas of responsibility: handlers, store, AI clients, WebSocket, infrastructure.
Run a separate agent in a fresh session for each module. A new session is a "reset" for the anchor. The agent doesn't know what previous runs found, so it looks with fresh eyes.
Give each agent a specific checklist by category, worked through in order: security, performance, reliability, code quality. Then ask the agent to do one more sweep after the checklist to find what it didn't cover.
Add an integration pass. A separate agent that looks strictly at the boundaries between modules. The deadliest bugs live there.
Don't feed the entire project into one prompt. This doesn't just kill accuracy; it burns tokens on a massive context that the model cannot effectively "digest."
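
The first three steps can be sketched as a small helper that groups files by area of responsibility and builds one checklist prompt per module. The path prefixes and file names below are hypothetical; map them to whatever boundaries your project actually has.

```python
from collections import defaultdict

# Hypothetical mapping from path prefix to area of responsibility.
MODULE_PREFIXES = {
    "internal/handlers/": "handlers",
    "internal/store/": "store",
    "internal/ws/": "websocket",
    "deploy/": "infrastructure",
}

CHECKLIST = ["security", "performance", "reliability", "code quality"]

def group_by_module(paths):
    """Assign each file to a module by prefix; leftovers go to 'unassigned'."""
    groups = defaultdict(list)
    for path in paths:
        module = next(
            (m for prefix, m in MODULE_PREFIXES.items() if path.startswith(prefix)),
            "unassigned",
        )
        groups[module].append(path)
    return dict(groups)

def build_prompt(module, paths):
    """One prompt per module: ordered checklist, then a final open sweep."""
    steps = "\n".join(f"{i}. Check {item}." for i, item in enumerate(CHECKLIST, 1))
    return (
        f"You are reviewing only the '{module}' module.\n"
        f"Files: {', '.join(paths)}\n"
        f"Work through this checklist in order:\n{steps}\n"
        f"{len(CHECKLIST) + 1}. Do one more sweep for anything the checklist missed."
    )

paths = [
    "internal/handlers/chat.go",
    "internal/store/redis.go",
    "deploy/docker-compose.yml",
    "README.md",
]
for module, files in group_by_module(paths).items():
    print(build_prompt(module, files), end="\n\n")
```

Each prompt would then be sent in its own fresh session. Files that land in "unassigned" are a useful signal that the module map has a gap.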

One run finds less than half of your real problems. But the point isn't just to run it more — the point is to make sure each run looks at less code, with a sharp focus, in a fresh session.