DEV Community

Nova Elvaris

Why Your AI Code Review Misses Stateful Bugs (and the 3-Context Fix)

A lot of AI code reviews look sharp right up until they miss the bug that actually matters.

They catch naming noise, dead comments, maybe a missing null check. But they miss the regression caused by a cache key change, the migration that no longer matches the model, or the new flag that breaks the retry path two services away.

The pattern I've noticed is simple: the model isn't bad at review; it's under-contextualized.

Most review prompts only include the diff. Stateful bugs usually live outside the diff.

Why the diff alone isn't enough

A diff shows what changed. It does not show:

  • what state existed before the change
  • what surrounding invariants must still hold
  • what hidden dependency the change now violates

If a PR changes this:

cache.set(user.id, profile)

to this:

cache.set(profile.email, profile)

The diff looks syntactically harmless. But if downstream readers still call cache.get(user.id), you've just created a bug that only appears in a later request path.

The model won't reliably catch that if you only hand it the patch.
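To make the failure concrete, here's a minimal runnable sketch. The `Cache` class and the `login` / `get_profile` function names are my own, invented for illustration; the point is that the write path changed its key while the read path didn't:

```python
class Cache:
    """Toy in-memory cache standing in for Redis/memcached."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        # Returns None on a miss -- no exception, no log line.
        return self._store.get(key)


cache = Cache()


def login(user_id, email, profile):
    # After the PR: keyed by email instead of user ID.
    cache.set(email, profile)


def get_profile(user_id):
    # Downstream reader, untouched by the PR: still keyed by user ID.
    return cache.get(user_id)


login("u42", "ada@example.com", {"name": "Ada"})
result = get_profile("u42")  # silent cache miss: result is None
```

Nothing here raises. The profile is in the cache, just under a key the reader never asks for, which is exactly the kind of bug that only shows up in a later request path.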

The 3-context fix

I now structure review prompts around three layers of context.

1. Change context

This is the diff itself.

Here is the unified diff for the PR. Identify likely logic, state, and integration risks.

Necessary, but not sufficient.

2. Runtime context

Tell the model what state or workflow the code participates in.

Runtime context:
- cache keys are always user IDs
- writes happen during login
- reads happen during profile fetch and billing sync
- stale cache entries can cause cross-user data leaks

This is usually the missing layer. It gives the model something to reason against.

3. Invariant context

List the rules that must stay true after the change.

Invariants:
- cache write key must equal cache read key
- one user may never read another user's profile
- failed sync retries must remain idempotent

Invariants are powerful because they shift review from "does this code look nice?" to "what rule might this break?"

The prompt template

This is the version I keep around:

You are reviewing a code change for bugs, not style.

## Change context
[paste diff]

## Runtime context
- describe where this code runs
- describe stateful dependencies
- describe side effects

## Invariant context
- list 3-5 rules that must remain true

## Output format
Return:
1. short summary
2. likely bug risks
3. missing tests
4. what additional file/context you would inspect next

Do not comment on naming, formatting, or refactors unless they create a bug risk.

That last line matters. Otherwise the model burns attention on surface-level cleanup.
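If you keep the template in code rather than in a snippet file, a small helper can assemble the three layers for you. This is a sketch; `build_review_prompt` and its parameter names are my own, not from any library:

```python
def build_review_prompt(diff, runtime_notes, invariants):
    """Assemble the three-context review prompt from its parts."""
    runtime = "\n".join(f"- {n}" for n in runtime_notes)
    rules = "\n".join(f"- {r}" for r in invariants)
    return (
        "You are reviewing a code change for bugs, not style.\n\n"
        f"## Change context\n{diff}\n\n"
        f"## Runtime context\n{runtime}\n\n"
        f"## Invariant context\n{rules}\n\n"
        "## Output format\n"
        "Return:\n"
        "1. short summary\n"
        "2. likely bug risks\n"
        "3. missing tests\n"
        "4. what additional file/context you would inspect next\n\n"
        "Do not comment on naming, formatting, or refactors "
        "unless they create a bug risk."
    )


prompt = build_review_prompt(
    diff="- cache.set(user.id, profile)\n+ cache.set(profile.email, profile)",
    runtime_notes=[
        "cache keys are always user IDs",
        "writes happen during login",
    ],
    invariants=["cache write key must equal cache read key"],
)
```

Keeping it as a function also makes it easy to require the invariant list at call time, so a diff never goes out for review without one.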

A practical example

Here's a compact example with a queue consumer:

# before
if job.attempts > 3:
    mark_failed(job)

# after
if job.attempts >= 3:
    mark_failed(job)

Looks fine, right?

But if the invariant is "a job gets 3 retries after the initial run," then >= 3 silently removes one retry: the job is marked failed one attempt earlier than before. That's a behavioral bug, not a syntax bug.

A diff-only review may miss it.

A review with runtime and invariant context usually flags it immediately.
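You can check the off-by-one directly. A small simulation, under the assumption that `attempts` is incremented after each run and then checked before any retry:

```python
def total_runs(should_fail):
    """Run a permanently failing job until it's marked failed.

    Returns the total number of runs executed (initial run + retries).
    """
    attempts = 0
    while True:
        attempts += 1            # one (failed) run of the job
        if should_fail(attempts):
            return attempts      # mark_failed(job)


before = total_runs(lambda a: a > 3)   # initial run + 3 retries = 4 runs
after = total_runs(lambda a: a >= 3)   # initial run + 2 retries = 3 runs
```

Under these assumptions the change drops one retry, which is precisely what the invariant "3 retries after the initial run" forbids.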

What changed for me after using this

Once I started feeding these three context types into review prompts, the comments got noticeably better:

  • fewer style nitpicks
  • more integration warnings
  • better test suggestions
  • clearer calls for follow-up inspection

The model still doesn't replace a human reviewer. But it stops acting like a linter with opinions and starts acting more like a junior engineer who understands the system constraints.

That's a much better role.


Question for you: What's the last bug your AI review missed, and was the missing piece really model quality, or just missing runtime context?
