One AI code review pass isn't enough. Here's the loop that actually catches bugs.

You ran the AI reviewer. It said "LGTM." You shipped. Then production caught fire.

This is happening more and more. Teams adopt Claude, Copilot, or Cursor for code review, get a clean response on the first pass, and merge with confidence they haven't earned.

Here's the part nobody tells you: a single AI review pass is often no better than a tired human's first skim. Not because the model is dumb, but because of how reviewing works.

The good news is the fix is small. It just isn't "use a better model."

Why one pass fails

When an AI reviews a diff, it does roughly what a human does on the first read: scan for obvious smells. Wrong indentation. Unused vars. A missing await. The cheap stuff.

The expensive stuff — the bugs that cost you real money — lives somewhere else:

  • Cross-file invariants. A change in auth.ts quietly breaks an assumption in billing.ts (sketched in code right after this list).
  • Race conditions. Two requests can now hit the same row at the same time.
  • Silent regressions. A refactor preserves behavior in 99% of cases and corrupts data in the 1%.
  • Security holes that look like features. An ID is now passed in the URL because "the frontend needed it."
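
To make the first bullet concrete, here's that auth/billing pattern as a minimal Python sketch. The module names and the suspended flag are invented for illustration; real versions are rarely this obvious:

# auth.py: after a "harmless" refactor, suspended users now get a
# token too, just with a flag set, instead of being rejected outright.
def issue_token(user: dict) -> dict:
    return {"user_id": user["id"], "suspended": user.get("suspended", False)}

# billing.py: written against the OLD invariant, "issue_token only
# returns tokens for active users." It never checks the new flag,
# so suspended accounts keep getting charged.
def charge(token: dict, amount_cents: int) -> None:
    print(f"charging user {token['user_id']}: {amount_cents} cents")

Each file is fine on its own. The bug only exists in the relationship between them, which is exactly what a diff-only pass never sees.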

A single review pass treats the diff like a closed system. It cannot see what it cannot see. And the model, like a junior dev, gets one shot — then says "LGTM" because that is the polite default when nothing obvious is wrong.

That is the trap.

What a real review loop looks like

Think of it the way a senior engineer reviews: not one read, but five passes with different glasses on.

The AI version of that is just five prompts in a loop, each looking at the same diff with a different question:

Pass 1: "What does this PR actually change? Summarize behavior."
Pass 2: "What invariants in the rest of the codebase could this break?"
Pass 3: "What inputs would make this crash, hang, or corrupt data?"
Pass 4: "What does this leak? Auth, PII, secrets, internal IDs, error stacks."
Pass 5: "If this ships and is wrong, how do we find out? Are the logs/tests enough?"

Each pass is a fresh context window. No memory of "LGTM" from the last one. Each one is forced to find something or explicitly state "nothing applies."

Here's a minimal harness you can run today:

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-1"  # any recent Claude model ID works; swap in your own

# Five independent questions, each applied to the same diff.
PASSES = [
    ("behavior",  "Summarize what this diff changes in plain English."),
    ("impact",    "List specific files or functions OUTSIDE the diff that may break."),
    ("failure",   "Give 5 concrete inputs that would crash or corrupt data."),
    ("security",  "Find any new leak: auth, PII, secrets, internal IDs, stack traces."),
    ("observability", "If this is wrong in prod, how would we detect it? Are tests/logs enough?"),
]

def review(diff: str) -> dict[str, str]:
    findings = {}
    for name, question in PASSES:
        # A fresh API call per pass: no memory of earlier answers.
        msg = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="You are a senior engineer. Be concrete. No 'LGTM' allowed.",
            messages=[{
                "role": "user",
                "content": f"{question}\n\nDIFF:\n{diff}"
            }],
        )
        findings[name] = msg.content[0].text
    return findings

That's it. Five API calls. Costs a few cents. Catches the bugs a one-shot reviewer waves through.
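
To run it against a real branch, feed it the output of git diff. A minimal sketch; adjust the base branch to whatever your repo compares against:

import subprocess

# Diff the current branch against main; change the base to taste.
diff = subprocess.run(
    ["git", "diff", "main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

for name, finding in review(diff).items():
    print(f"--- {name} ---\n{finding}\n")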

The non-obvious part: forbid "LGTM"

The single most important part of that system prompt is "No 'LGTM' allowed."

LLMs default to agreement when nothing screams at them. You have to actively forbid the polite-out. Better prompts:

  • "You must list at least two concerns, even if they are minor. If the change is genuinely safe, explain why — don't just assert it."
  • "Rate severity 1-5. If everything is 1, justify it against the file's history."
  • "Imagine this PR ships and breaks. What is the post-mortem headline?"

These are not tricks. They are how you make the model do the work instead of pattern-matching to "approve."
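
If you'd rather enforce this in code than trust the prompt alone, one blunt option is to retry any pass that comes back with a bare approval. A sketch, reusing client, MODEL, and PASSES from the harness above; the banned phrases and retry budget are arbitrary choices:

BANNED = ("lgtm", "looks good to me", "no concerns")

def review_strict(diff: str, retries: int = 2) -> dict[str, str]:
    findings = {}
    for name, question in PASSES:
        prompt = (
            f"{question}\n\n"
            "You must list at least two concerns, even minor ones. "
            "If the change is genuinely safe, explain why.\n\n"
            f"DIFF:\n{diff}"
        )
        text = ""
        for _ in range(retries + 1):
            msg = client.messages.create(
                model=MODEL,
                max_tokens=1024,
                system="You are a senior engineer. Be concrete.",
                messages=[{"role": "user", "content": prompt}],
            )
            text = msg.content[0].text
            # Accept the answer unless the model took the polite way out.
            if not any(phrase in text.lower() for phrase in BANNED):
                break
        findings[name] = text
    return findings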

What this fixes in your workflow

If you're a solo dev or small team shipping AI-assisted code at speed, the loop above does three things:

  1. Forces the model to imagine failure. Most one-pass reviews implicitly assume success.
  2. Spreads attention across the codebase. Cross-file bugs are where money dies.
  3. Leaves an audit trail. Five named passes give you something to point to when something goes wrong — way better than one "LGTM" in your Git history.

The cost of running this in CI is real but small. A 200-line PR through 5 passes on Claude is roughly $0.10 today. The cost of not running it is one bad migration, one leaked admin endpoint, one corrupted invoice batch.

Do the math.
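
In practice, the CI wiring can be a single script that runs the loop and fails the build when a pass flags something alarming. A sketch, reusing review from the harness above; the keyword heuristic for what counts as blocking is deliberately dumb and invented here, so tune it for your codebase:

import sys

# Findings containing these words block the merge until a human looks.
RED_FLAGS = ("corrupt", "race", "leak", "secret", "unauthenticated")

def main() -> int:
    # Usage: git diff origin/main...HEAD | python ci_review.py
    diff = sys.stdin.read()
    blocking = []
    for name, finding in review(diff).items():
        print(f"--- {name} ---\n{finding}\n")
        if any(flag in finding.lower() for flag in RED_FLAGS):
            blocking.append(name)
    if blocking:
        print(f"Review loop flagged: {', '.join(blocking)}")
        return 1  # nonzero exit fails the CI job
    return 0

if __name__ == "__main__":
    sys.exit(main())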

The deeper lesson

AI code review isn't broken. The way most teams use it is broken. They treat the model like an oracle that knows the answer and ask it once. The model is not an oracle. It's a junior engineer with infinite stamina and zero ego.

The right mental model is: use the AI like you'd run a code review checklist — multiple structured passes, different focus each time, never satisfied on the first "looks fine."

One pass is a sanity check. A loop is a review. Most of the bugs you care about live in the gap between those two things.


If this is the kind of practical AI-engineering content you want more of, follow LayerZero. We break down what actually changes in your workflow when you take AI tools seriously — not the hype, the parts that ship code or break it. Next post: why your CI should run the AI reviewer on its own PRs.
