Most AI code reviews are useless. "Looks good, consider adding error handling." Thanks, I could have gotten that from a linter.
Here's a 3-step prompt chain that makes AI code review actually catch real bugs — the kind humans miss at 4 PM on a Friday.
Why Single-Prompt Reviews Fail
When you paste code and say "review this," the AI does what any overworked reviewer does: it skims, finds surface-level issues, and approves.
The problem isn't the AI. It's the process. A single pass can't do deep analysis, check edge cases, and verify logic — all at once.
The fix: break review into three focused passes.
Step 1: The Bug Hunter
The first prompt looks only for bugs. Not style. Not naming. Not "consider using X instead." Just bugs.
You are reviewing this code for BUGS ONLY.
Ignore style, naming, performance, and best practices.
Focus exclusively on:
- Logic errors
- Off-by-one errors
- Null/undefined access
- Race conditions
- Unhandled error paths
- Incorrect assumptions about input
Code:
[paste code]
For each bug found, output:
- Line (approximate)
- Bug description
- Why it's a bug (what breaks)
- Minimal fix
Why it works: By excluding everything except bugs, you eliminate the "consider adding..." noise. The AI goes deep on logic instead of wide on suggestions.
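To make the target concrete, here's a hypothetical snippet (invented for illustration, not from a real PR) containing two of the bug classes the Bug Hunter prompt is tuned for, plus the minimal fixes Step 1 would propose:

```python
# Hypothetical example: two bug classes from the Bug Hunter checklist.

def last_n_orders(orders, n):
    # Off-by-one: the author meant "the last n orders", but the start
    # index is one too low, so this returns n + 1 orders.
    return orders[len(orders) - n - 1:]

def total_price(order):
    # Null access: order.get("items") is None when the key is absent,
    # and iterating over None raises TypeError.
    return sum(item["price"] for item in order.get("items"))

# Minimal fixes, in the shape the prompt's output format asks for:
def last_n_orders_fixed(orders, n):
    # Negative slice handles the count directly; guard n == 0, since
    # orders[-0:] would return the whole list.
    return orders[-n:] if n > 0 else []

def total_price_fixed(order):
    # "or []" turns a missing/None items list into an empty sum.
    return sum(item["price"] for item in order.get("items") or [])
```

Neither bug is a style issue, which is exactly why a general "review this" pass tends to skip past them.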
Step 2: The Edge Case Finder
The second prompt generates test scenarios the code might fail on:
Given this function, generate 10 edge case inputs that could cause unexpected behavior:
[paste function signature + code]
For each edge case:
- Input value(s)
- Expected behavior
- What actually happens (trace through the code mentally)
- Verdict: PASS or FAIL
Focus on boundaries, empty inputs, large inputs, type mismatches, and concurrent access.
Why it works: Humans are terrible at thinking of edge cases for their own code. We wrote it with the happy path in mind. The AI has no such bias.
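The edge-case pass translates naturally into executable checks. A hypothetical sketch (the function and its cases are invented; the assertions mirror the PASS/FAIL verdicts Step 2 produces):

```python
# Hypothetical example: Step 2's edge cases written as assertions.

def parse_percentage(value):
    """Parse a '42%'-style string into a float clamped to [0, 1]."""
    number = float(value.rstrip("%"))
    return max(0.0, min(number / 100.0, 1.0))

# PASS cases: boundaries and clamping behave as expected.
assert parse_percentage("50%") == 0.5     # happy path
assert parse_percentage("0%") == 0.0      # lower boundary
assert parse_percentage("100%") == 1.0    # upper boundary
assert parse_percentage("150%") == 1.0    # over-limit input is clamped
assert parse_percentage("-10%") == 0.0    # negative input is clamped
assert parse_percentage("50") == 0.5      # missing '%' still parses

# FAIL cases the prompt would flag: empty and non-numeric input raise
# ValueError instead of returning a defined fallback.
for bad in ("", "abc%"):
    try:
        parse_percentage(bad)
        raised = False
    except ValueError:
        raised = True
    assert raised
```

The PASS lines are the happy path we all test anyway; the FAIL loop at the bottom is the part this step exists to surface.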
Step 3: The Contract Checker
The third prompt verifies the code matches its specification:
Here is a function and its specification:
SPEC:
[paste the requirement / ticket / doc]
CODE:
[paste the implementation]
Check whether the code fulfills every requirement in the spec.
For each requirement:
- Requirement (quoted from spec)
- Status: FULFILLED / MISSING / PARTIAL / CONTRADICTED
- Evidence: which line(s) address this requirement
- Gap: what's missing (if any)
Why it works: The most dangerous bugs aren't in the code — they're in the gap between what was asked and what was built. This step catches "it works but it's wrong."
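To see what a contract gap looks like at the code level, here is a small invented example (both the spec and the function are hypothetical):

```python
# Spec (invented for illustration):
#   1. Reject passwords shorter than 12 characters.  -> FULFILLED (line A)
#   2. Require at least one digit.                   -> FULFILLED (line B)
#   3. Reject passwords found in a breach list.      -> MISSING

BREACHED = {"password1234", "letmein12345"}  # stand-in breach list

def validate_password(pw):
    if len(pw) < 12:                          # line A: requirement 1
        return False
    if not any(c.isdigit() for c in pw):      # line B: requirement 2
        return False
    # Requirement 3 appears nowhere in this function. Every line here
    # is correct, yet the Contract Checker reports MISSING, because a
    # breached password like "password1234" passes validation.
    return True
```

No bug hunter would flag this function, because nothing in it is wrong; only a requirement-by-requirement comparison catches what isn't there.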
Putting It Together
Run all three steps on every PR that matters:
PR #347: Add discount calculation to checkout
Step 1 (Bug Hunter):
→ Found: negative discount not clamped → total can go below zero
→ Found: floating-point comparison using === instead of tolerance
Step 2 (Edge Cases):
→ FAIL: discount = 100% → total becomes 0.00 → payment gateway rejects
→ FAIL: multiple discounts applied → order matters, not commutative
Step 3 (Contract Check):
→ MISSING: spec says "max one discount per order" — code allows stacking
→ PARTIAL: spec says "show discount on receipt" — only shows in cart, not receipt
Three prompts. Three minutes. Four real bugs that would have shipped.
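For illustration, a repair addressing those findings might look like this (the function and its signature are hypothetical, not the actual PR #347 code):

```python
# Hypothetical fix for the findings above: clamp the discount,
# enforce "max one discount per order", and avoid exact floating-point
# comparisons by working in integer cents.

def apply_discount(total_cents, discount_percents):
    """Apply at most one percentage discount to a total given in cents."""
    if len(discount_percents) > 1:
        # Spec: "max one discount per order" -> reject stacking outright,
        # which also sidesteps the non-commutative ordering bug.
        raise ValueError("only one discount per order")
    if not discount_percents:
        return total_cents
    # Clamp so a negative or >100% discount can't push the total
    # below zero or above the original price.
    pct = max(0, min(discount_percents[0], 100))
    return total_cents * (100 - pct) // 100
```

Note that a 100% discount still yields a total of 0 here; the checkout layer would separately need to skip the payment gateway for zero-value orders, per the Step 2 finding.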
When to Use This
Use the full 3-step chain for:
- Payment/billing code
- Authentication/authorization
- Data migration scripts
- Anything user-facing that's hard to roll back
Use just Step 1 (Bug Hunter) for:
- Regular feature PRs
- Internal tooling
- Anything you'll test manually anyway
Skip AI review entirely for:
- Config changes
- Dependency bumps
- One-line fixes you fully understand
The Prompt Chain in Practice
I keep these three prompts in a file called review-chain.md in my project root. When I review a PR:
1. Copy the diff
2. Run each prompt in sequence
3. Compile findings into a review comment
4. Add my own human judgment on top
The AI finds the mechanical issues. I focus on design, architecture, and "does this even make sense?"
Together, we're a better reviewer than either of us alone.
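The workflow itself is easy to script. A minimal sketch, assuming the three prompts live in review-chain.md separated by `---` lines and each uses a single [paste code] placeholder (that file layout is my convention here, not a standard, and it simplifies the per-step placeholders shown above):

```python
# Minimal sketch: split review-chain.md into its prompts and substitute
# the diff into each. Sending the results to a model is left to whatever
# client you use; this only prepares the prompt text.
from pathlib import Path

def build_review_prompts(chain_file, diff_text):
    chain = Path(chain_file).read_text()
    # Assumption: prompts are separated by a line containing only '---'.
    prompts = [p.strip() for p in chain.split("\n---\n")]
    # Assumption: each prompt has a [paste code] placeholder for the diff.
    return [p.replace("[paste code]", diff_text) for p in prompts]
```

From there, step 3's output can be pasted straight into the PR as the mechanical half of the review comment.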
Results
Since adopting this chain:
- Caught 3 production bugs in the first week that passed human review
- Review time didn't increase (the prompts run in parallel while I read the diff)
- Junior devs on the team started using it too — review quality went up across the board
Try the Bug Hunter prompt on your last PR. I bet it finds something. Share what it catches — I'm tracking the most common bug categories for a follow-up post.