Nova Elvaris

Why Your AI Code Review Misses Real Bugs (and the 3-Prompt Fix)

I used to have one prompt for code review. Something like: "Review this diff, find bugs, suggest improvements." It gave me back a confident list of nitpicks — naming suggestions, "consider extracting this function", occasional style comments — while happily missing the actual null-pointer waiting to crash in production.

Here's why that happens, and the 3-prompt pipeline I use now instead.

Why one-shot code review fails

A single "review this code" prompt forces the model to do three completely different jobs at once:

  1. Understand what the code is supposed to do (intent)
  2. Find where it doesn't do that (bugs)
  3. Suggest how to fix it (improvements)

Each job uses different reasoning. Mashing them together means the model takes the path of least resistance: surface-level style nits, because those are easy to spot and sound authoritative. The hard bugs — the ones that require actually tracing execution paths — get skipped because the model already has enough to say.

This isn't a prompting skill issue. It's a task-structure issue.

The 3-prompt pipeline

Split the review into three isolated prompts. Each one gets focused context and a narrow job.

Prompt 1: Intent extraction

You are reading a git diff. Do NOT review it yet.

For each changed function or block, answer:
1. What is this code trying to do? (one sentence)
2. What preconditions does it assume? (bulleted list)
3. What postconditions should hold after it runs?

Diff:
<paste diff>

Output as JSON: {"blocks": [{"name": ..., "intent": ..., "preconditions": [...], "postconditions": [...]}]}

This prompt does nothing but extract understanding. No bug hunting. No suggestions. Just: "what is this code supposed to be doing?"

Save the output. You'll need it in prompts 2 and 3.
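If you script this step, prompt 1's output needs to come back as parseable JSON so prompts 2 and 3 can consume it. A minimal sketch in Python — `call_llm` is a hypothetical stand-in for whatever client you actually use (OpenAI SDK, Anthropic SDK, a local model, anything that takes a prompt string and returns text):

```python
import json

INTENT_PROMPT = """You are reading a git diff. Do NOT review it yet.

For each changed function or block, answer:
1. What is this code trying to do? (one sentence)
2. What preconditions does it assume? (bulleted list)
3. What postconditions should hold after it runs?

Diff:
{diff}

Output as JSON: {{"blocks": [{{"name": ..., "intent": ..., "preconditions": [...], "postconditions": [...]}}]}}"""


def extract_intent(diff: str, call_llm) -> dict:
    """Run prompt 1 and parse the model's JSON answer.

    `call_llm` is whatever function sends a prompt string to your
    model and returns its text response (hypothetical here).
    """
    raw = call_llm(INTENT_PROMPT.format(diff=diff))
    # Fail loudly if the model didn't return valid JSON --
    # prompts 2 and 3 depend on this structure.
    return json.loads(raw)
```

Failing loudly on bad JSON is deliberate: a silently-garbled intent extraction poisons both downstream prompts.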

Prompt 2: Bug hunting (guided by intent)

You are hunting for bugs in a git diff. Here is the AUTHOR'S INTENT for each block (extracted separately):

<paste output from prompt 1>

Here is the diff:
<paste diff>

For each block, answer:
1. Does the code actually satisfy the stated postconditions? If not, where does it fail?
2. Are the preconditions checked? If not, what input breaks this?
3. What edge cases (null, empty, off-by-one, concurrent access, unicode, timezone) are unhandled?
4. Are there any silent failures (swallowed exceptions, default fallbacks, retries without limit)?

For each bug, output: {severity: high|medium|low, location, bug, concrete failing input}.

If a block has no bugs, say "no bugs found" explicitly. Do NOT suggest style changes.

The magic is that the model now has something to compare the code against. "Does the code satisfy the postconditions?" is a concrete, checkable question. Much better than "find bugs," which is an open invitation to hallucinate.
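Chaining step 2 onto step 1 is just string templating. A sketch, again with a hypothetical `call_llm`; note I've tightened the per-bug output spec into strict JSON here (the prose version above uses shorthand) so the result is machine-parseable:

```python
import json

BUG_HUNT_PROMPT = """You are hunting for bugs in a git diff. Here is the AUTHOR'S INTENT for each block (extracted separately):

{intent_json}

Here is the diff:
{diff}

For each block, answer:
1. Does the code actually satisfy the stated postconditions? If not, where does it fail?
2. Are the preconditions checked? If not, what input breaks this?
3. What edge cases (null, empty, off-by-one, concurrent access, unicode, timezone) are unhandled?
4. Are there any silent failures (swallowed exceptions, default fallbacks, retries without limit)?

For each bug, output: {{"severity": "high|medium|low", "location": ..., "bug": ..., "concrete_failing_input": ...}}.

If a block has no bugs, say "no bugs found" explicitly. Do NOT suggest style changes."""


def hunt_bugs(diff: str, intent: dict, call_llm) -> str:
    # Inject prompt 1's parsed intent back in as JSON so the model
    # has explicit claims to check the code against.
    prompt = BUG_HUNT_PROMPT.format(
        intent_json=json.dumps(intent, indent=2),
        diff=diff,
    )
    return call_llm(prompt)
```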

Prompt 3: Fix suggestions (only for real bugs)

Here are confirmed bugs in a diff:
<paste output from prompt 2>

For each bug marked high or medium, propose:
1. The minimal fix (diff-style if possible)
2. A test case that would have caught this bug
3. Whether this needs a broader refactor or can be patched in place

Do NOT propose fixes for low-severity bugs unless the fix is one line.

By isolating the fix step, the model stops bundling "here's a bug" with a 40-line rewrite you'd never actually accept.
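You can enforce the severity rule in code rather than trusting the prompt: filter prompt 2's bugs before prompt 3 ever sees them, so low-severity noise can't leak into the fix step. A sketch, assuming you've parsed prompt 2's output into a list of dicts and `call_llm` is the same hypothetical client as before:

```python
import json

FIX_PROMPT = """Here are confirmed bugs in a diff:
{bugs_json}

For each bug marked high or medium, propose:
1. The minimal fix (diff-style if possible)
2. A test case that would have caught this bug
3. Whether this needs a broader refactor or can be patched in place

Do NOT propose fixes for low-severity bugs unless the fix is one line."""


def filter_for_fixing(bugs: list) -> list:
    # Only high/medium bugs go to the fix prompt; low-severity
    # noise stays out of the model's context entirely.
    return [b for b in bugs if b.get("severity") in ("high", "medium")]


def suggest_fixes(bugs: list, call_llm) -> str:
    worth_fixing = filter_for_fixing(bugs)
    if not worth_fixing:
        return "no fixes needed"
    return call_llm(FIX_PROMPT.format(bugs_json=json.dumps(worth_fixing, indent=2)))
```

Filtering in code instead of in the prompt means the rule can't be ignored: the model simply never sees the bugs you don't want fixed.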

What this catches that one-shot misses

Real examples from the last two weeks:

  • A cache lookup that silently fell back to null when the key was missing, without logging. Prompt 1 surfaced that the postcondition was "returns user data." Prompt 2 noticed the code returned null on cache miss. One-shot review had called out variable naming.
  • An async function that awaited inside a loop but didn't handle partial failures. Prompt 1 extracted "all items must be processed." Prompt 2 noticed a thrown exception mid-loop would leave the remainder unprocessed.
  • A SQL query built with string interpolation inside a helper that looked safe. The one-shot review missed it because "this is a helper, nothing weird here" is easy to infer. Prompt 2 caught it because it explicitly asked "what input breaks this?"

The cost

Yes, this is 3x the tokens. For any non-trivial diff, it's worth it:

  • One-shot finds maybe 30% of real bugs and ~5 nitpicks
  • 3-prompt finds ~70% of real bugs and zero nitpicks (because you never asked for them)

For a hot code path or a release candidate, I run all three. For a trivial refactor PR, I skip to prompt 2 alone with a short intent note.
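That routing decision can live in code too. A sketch — the 20-changed-line threshold is an arbitrary assumption I'm making for illustration; tune it for your own repos:

```python
def review_plan(diff: str, is_hot_path: bool = False) -> list:
    """Decide which prompts to run for a given diff.

    The 20-changed-line threshold is arbitrary; adjust to taste.
    """
    changed = [
        line for line in diff.splitlines()
        # Count added/removed lines, skipping the +++/--- file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    if is_hot_path or len(changed) > 20:
        return ["intent", "bug_hunt", "fix"]
    # Trivial diff: skip straight to bug hunting with a short
    # hand-written intent note instead of prompt 1.
    return ["bug_hunt"]
```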

Why splitting helps

The deeper principle: LLMs are better at checking claims than generating them. When you give the model a specific claim to verify ("does this code meet postcondition X?"), it becomes a much sharper critic. When you ask it to generate a freeform review, it defaults to pattern-matching on what reviews usually look like — which is mostly nits.

Every good multi-step LLM workflow I've built has this shape: generate claims cheaply, then verify them carefully in a separate step.


Question for you: What's the worst bug an AI code reviewer ever missed for you? I'm collecting examples — the more embarrassing, the better.
