AI Prompting for Code Review

#ai #programming #productivity #chatgpt

Stop telling AI to "review this code." Do this instead.

You already use Claude or ChatGPT for code review, debugging, and tests. And if you are honest, the output is usually mediocre: an unranked wall of style nits mixed with real bugs, a confident fix for the line that threw while the actual bug sits upstream, or three happy-path tests that prove nothing.

It is tempting to blame the model. The real problem is the prompt. "Review this code" is a vague instruction, and vague instructions get vague answers. Here is a concrete example and the one change that fixes it.

A real bug

def calculate_order_total(items: list[dict]) -> int:
    total = 0
    for item in items:
        total += item["price"] * item["quantity"]   # TypeError fires here
    return total

def build_order(cart_id: str) -> dict:
    cart = catalog_client.get_cart(cart_id)          # migrated to Catalog API v2
    items = [
        {"sku": line["sku"], "price": line["unit_price"], "quantity": line["qty"]}
        for line in cart["lines"]
    ]
    return {"cart_id": cart_id, "total": calculate_order_total(items), "items": items}

It throws TypeError: unsupported operand type(s) for *: 'NoneType' and 'int', but only on about 2% of checkouts, and it started right after an API migration.

A naive prompt, "Why is this throwing a TypeError? Fix it.", gets you this:

item["price"] is None, so add a guard:
price = item["price"] or 0

It "works." The crash stops. And you have just shipped a worse bug: every unsellable item is now silently priced at $0. The error moved. It was never understood.

The prompt that finds the actual cause

You are a debugging specialist. Your job is to find the ROOT CAUSE of a bug,
not to apply a random fix and hope. You reason from evidence and you are honest
about uncertainty.

SYMPTOM
{{ERROR_MESSAGE_STACK_TRACE_OR_DESCRIPTION}}

RELEVANT CODE
{{PASTE_CODE_HERE}}

WHAT CHANGED RECENTLY (if known)
{{RECENT_DEPLOY_DEP_BUMP_CONFIG_OR_"unknown"}}

ENVIRONMENT
{{LANGUAGE_VERSION_AND_WHERE_IT_HAPPENS}}

Work the problem in this order and output exactly these sections:

## What the symptom tells us
Read the trace/symptom literally. State what is definitely true before theorizing.

## Ranked hypotheses
A numbered list, MOST LIKELY FIRST. For each: the hypothesis, a confidence
(HIGH/MEDIUM/LOW), the evidence for it, and the evidence against it.

## Most likely root cause
Commit to the top hypothesis. Name the exact line(s) where the bug ORIGINATES
(not just where it throws) and explain the chain of events.

## The fix
The concrete change, with corrected code. State any assumption it depends on.

## How to verify
A specific way to confirm the fix works AND that you found the real cause.

## How to prevent recurrence
One or two concrete guards: a test, a type, an assertion, input validation.

RULES
- Do NOT propose a fix before identifying a cause. Cause first, then fix.
- Rank hypotheses by likelihood; don't list everything equally.
- Distinguish where the error SURFACES from where it ORIGINATES.
- If you need information you don't have, say what you'd check and why.

Run that on the same bug and the output is different in kind, not degree:

The bug originates in build_order, which copies the price from the API response with no validation. It only surfaces later in the pricing math, where the None is first used. The crash site is an innocent victim of unvalidated data admitted upstream. This explains every clue: only ~2% of carts (those with an unpriced item), started at the migration cutover, and not reproducible locally. Reproduce it by pulling a failing cart's raw response and confirming a null price, then add the regression test before trusting the fix.

It found the upstream cause, refused to guess, and told you how to confirm you actually fixed it.

Why it works

Three rules do the heavy lifting:

Cause before fix. The model is forbidden from proposing a fix until it has named a cause. Stack traces point at the crash site, which is usually downstream of the real bug.
Ranked hypotheses with confidence and counter-evidence. "Here are six things it could be" becomes a prioritized investigation, and the counter-evidence keeps it honest about what it is guessing.
Verify the cause, not just the symptom. "The error went away" is not the same as "I understood it."

The same idea, everywhere

This is not specific to debugging. Every weak AI coding prompt has the same fix: tell the model what a good answer looks like, and constrain it.

Code review: demand a severity label on every finding and a ship / don't-ship verdict, so you can triage in seconds instead of reading prose.
Test generation: ask for edge and error cases first, plus invariants that hold for all inputs, instead of three happy-path asserts.
PR descriptions: derive everything from the diff, refuse to fabricate testing, and flag leftover debug code.

The model was always capable of the good answer. The structure is what pulls it out.

If you want the rest

I packaged six of these prompts, code review, debugging, refactor planning, test generation, PR descriptions, and blameless postmortems, each with a real, tested worked example: Code Review Copilot Prompts. The bug-triage prompt above is yours regardless. Copy it and use it on your next bug.