Stop Getting Useless AI Code Reviews (Here's How to Actually Catch Bugs)

You paste your code into Claude or ChatGPT. You type "review this code." You get back a polite essay about how your variable names could be more descriptive and maybe you should add some comments.

Congratulations, you just wasted 45 seconds getting feedback your linter already gives you for free.

The dirty secret of AI-assisted code review is that most developers are asking the wrong questions. They treat the AI like a junior dev doing a drive-by review instead of what it actually is: an infinitely patient analyzer that can check your code against hundreds of failure modes simultaneously -- if you tell it to.

Let me show you how to get reviews that actually find bugs.


The Problem: Generic Prompts, Generic Reviews

Here is a prompt I see developers use constantly:

```
Review this code and let me know if you see any issues.
```

And here is what you get back: a surface-level scan that mentions code style, suggests renaming a variable from x to something_more_descriptive, and tells you to add error handling "where appropriate." None of that prevents a 3am production incident.

The AI is not being lazy. It is responding to the scope you gave it, which is "everything and nothing." When you say "any issues," the model optimizes for breadth. It gives you a little bit about style, a little about performance, a little about readability. It covers the easy stuff because you did not tell it to dig deeper.


Technique 1: The Targeted Review

Instead of asking for a general review, pick a specific failure category and go deep.

Before:

```
Review this Python function for issues.
```

After:

```
Review this Python function exclusively for error handling gaps.

For each gap you find:
1. What input or condition triggers the failure
2. What happens when it fails (crash, silent corruption, wrong result)
3. A concrete fix with code

Ignore style, naming, and performance. I only care about
ways this function can break at runtime.

[paste function]
```

The difference is night and day. The first prompt gets you "consider adding a try-except block." The second gets you "if user_data is None, line 14 throws an AttributeError because you call .get() on a NoneType; here is the guard clause you need."
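To make that concrete, the fix the AI hands back usually looks something like this. A minimal sketch; the function and field names here are hypothetical, not from any real codebase:

```python
from typing import Optional

def build_profile(user_data: Optional[dict]) -> dict:
    # Guard clause for the hypothetical finding above:
    # calling .get() on None raises AttributeError.
    if user_data is None:
        raise ValueError("user_data must not be None")
    return {"name": user_data.get("name", "unknown"), "email": user_data.get("email")}
```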

Run separate passes for separate concerns. One pass for error handling. One for security. One for performance. Three focused reviews beat one unfocused review every time.
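If you do this often, the separate passes are easy to script. Here is a minimal sketch using the Anthropic Python SDK; the model name and the exact prompt wording are placeholders of mine, not part of any prompt pack:

```python
import anthropic

# One focused prompt per failure category; each pass reviews the same code.
PASSES = {
    "error handling": "Review this code exclusively for error handling gaps. Ignore style and naming.",
    "security": "Review this code exclusively for security vulnerabilities. Ignore style and naming.",
    "performance": "Review this code exclusively for performance problems. Ignore style and naming.",
}

def run_focused_reviews(code: str) -> dict[str, str]:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    results = {}
    for category, instructions in PASSES.items():
        message = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder; use whatever model you have access to
            max_tokens=2000,
            messages=[{"role": "user", "content": f"{instructions}\n\n{code}"}],
        )
        results[category] = message.content[0].text
    return results
```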


Technique 2: Give the AI a Threat Model

Context changes everything. Watch what happens when you tell the AI who is using your code.

Before:

```
Check this API endpoint for security issues.
```

After:

```
This is a public-facing REST endpoint that accepts user input.
It is deployed behind an API gateway but has no rate limiting yet.
Authentication is handled via JWT tokens passed in the Authorization header.
The endpoint writes to a PostgreSQL database using parameterized queries.

Assume the attacker is an authenticated user trying to:
- Access other users' data
- Escalate their permissions
- Cause denial of service
- Exfiltrate data through the response

Review this endpoint code. For each vulnerability found,
rate it Critical/High/Medium/Low and show the exploit scenario.

[paste code]
```

Now instead of getting "make sure you sanitize inputs" (thanks, very helpful), you get findings like: "An authenticated user can modify the user_id parameter to access other users' records because line 23 uses the request parameter instead of extracting the ID from the JWT token. This is a Critical IDOR vulnerability."

That is a finding you can actually act on.
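The fix for that kind of finding is usually small once you see it. A minimal sketch, where jwt_claims stands in for an already-verified token payload and the data access is simplified to a plain dict; none of these names come from a real framework:

```python
def get_user_record(requested_user_id: str, jwt_claims: dict, records: dict) -> dict:
    # The identity you trust comes from the verified JWT, never from the request parameter.
    token_user_id = jwt_claims["sub"]
    if requested_user_id != token_user_id:
        # Reject the IDOR attempt instead of returning another user's data.
        raise PermissionError("cannot access another user's records")
    return records[token_user_id]
```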


Technique 3: The Before/After Diff Review

This one is underused. Instead of reviewing a whole file, give the AI the change and let it focus on what is new.

```
I'm modifying this function to add caching. Review ONLY my changes
for correctness. Here is the original:

def get_user(user_id: str) -> User:
    return db.query(User).filter(User.id == user_id).first()

Here is my updated version:

_cache = {}

def get_user(user_id: str) -> User:
    if user_id in _cache:
        return _cache[user_id]
    user = db.query(User).filter(User.id == user_id).first()
    _cache[user_id] = user
    return user

Specifically check for:
1. Cache invalidation problems
2. Thread safety issues
3. Memory leak potential
4. Cases where stale data causes bugs
```

This is where AI code review genuinely shines. It will immediately flag that the cache never invalidates, that a module-level mutable dict is not thread-safe in a multi-worker setup, and that deleted or updated users will serve stale data forever. Those are three real bugs you might not catch in a self-review because you are focused on whether the caching logic itself works.
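For reference, here is one direction a fix could take. This is a sketch, not the only answer: db and User are the same objects as in the snippet above, cachetools is an assumed third-party dependency, and both numbers are arbitrary starting points.

```python
import threading

from cachetools import TTLCache  # third-party: pip install cachetools

# A bounded size caps memory growth; the TTL bounds how long stale data can be served.
_cache: TTLCache = TTLCache(maxsize=1024, ttl=60)
_cache_lock = threading.Lock()  # TTLCache itself is not thread-safe

def get_user(user_id: str) -> "User":
    with _cache_lock:
        if user_id in _cache:
            return _cache[user_id]
    user = db.query(User).filter(User.id == user_id).first()
    with _cache_lock:
        _cache[user_id] = user
    return user
```

Each worker process still holds its own copy of the cache, so the TTL is what bounds how long a deleted or updated user can be served stale.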


Technique 4: The Adversarial Edge Case Finder

This is my favorite technique for functions that handle user input or complex business logic.

```
Here is a function that calculates shipping costs.
Your job is to break it.

Find inputs that produce:
- Wrong results
- Crashes or exceptions
- Infinite loops or hangs
- Negative or nonsensical values

For each breaking input, show:
1. The exact input values
2. What the function returns or throws
3. What it SHOULD do instead

[paste function]
```

Telling the AI to "break" your code activates a completely different analytical mode than asking it to "review" your code. Review mode is polite and surface-level. Break mode is adversarial and thorough. It will try zero, negative numbers, None values, absurdly large inputs, unicode strings where you expected ASCII, empty collections, and every other edge case you forgot about.
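The breaking inputs are most valuable if you pin them down as tests so they stay fixed. A sketch, assuming a hypothetical calculate_shipping(weight_kg, destination) function in a shipping module and pytest:

```python
import pytest

from shipping import calculate_shipping  # hypothetical module and function

# Each tuple is a breaking input reported by the AI, frozen as a regression test.
@pytest.mark.parametrize(
    "weight_kg, destination",
    [
        (0, "US"),     # zero weight: free, a minimum charge, or an error?
        (-5, "US"),    # negative weight must be rejected, not billed as a negative cost
        (1e9, "US"),   # absurdly large input should fail loudly
        (2.5, ""),     # empty destination string
    ],
)
def test_shipping_handles_edge_cases(weight_kg, destination):
    # The exact expectation depends on your business rules; here we assert the
    # function either raises ValueError or returns a non-negative cost.
    try:
        cost = calculate_shipping(weight_kg, destination)
    except ValueError:
        return
    assert cost >= 0
```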


Technique 5: The Production Failure Premortem

This one saves you from the bugs that only show up at scale.

```
This code works in my tests. Imagine it is running in production
with 10,000 requests per minute. What breaks?

Consider:
- Race conditions
- Memory pressure
- Database connection exhaustion
- Cascading failures if a dependency goes down
- What happens during deployment (old and new code running simultaneously)

[paste code]
```

Most developers test the happy path at low volume. This prompt forces the AI to simulate production conditions mentally and identify the failure modes that only emerge under load, during deployments, or when external services degrade. The deployment scenario alone -- old code and new code running side by side -- catches schema migration issues that are invisible in single-instance testing.
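The fixes for these findings are usually small and boring, which is exactly why they get skipped. A common one is an external call with no timeout. Here is a sketch using the requests library; the URL and the numbers are placeholders:

```python
import requests

def fetch_exchange_rate(currency: str) -> float:
    # Without a timeout, a hung upstream service holds this worker forever,
    # which is how one slow dependency cascades into a full outage under load.
    response = requests.get(
        "https://rates.example.com/v1/latest",  # placeholder URL
        params={"currency": currency},
        timeout=(3.05, 10),  # (connect, read) seconds; pick limits your SLO can tolerate
    )
    response.raise_for_status()
    return response.json()["rate"]
```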


The Meta-Technique: Iterate, Don't Restart

When the AI gives you a review, do not just read it and move on. Push back.

```
Good findings. Now assume those are all fixed.
What is the NEXT most subtle bug you can find?
Dig deeper -- look for logic errors, off-by-one mistakes,
and assumptions that hold in tests but not in production.
```

The first pass catches the obvious stuff. The second pass, when you explicitly tell it to go deeper, catches the subtle stuff. I have found legitimate production bugs on the third iteration of this loop that I would never have caught manually.
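If you script your reviews, the iteration is just the same conversation continued. A sketch with the Anthropic SDK again; the model name and file path are placeholders:

```python
import anthropic

client = anthropic.Anthropic()
model = "claude-sonnet-4-20250514"  # placeholder
code = open("my_function.py").read()  # the code under review (hypothetical path)

history = [{"role": "user", "content": f"Review this Python function exclusively for error handling gaps.\n\n{code}"}]
first_pass = client.messages.create(model=model, max_tokens=2000, messages=history)

# Keep the first review in context and explicitly push for the next layer of bugs.
history += [
    {"role": "assistant", "content": first_pass.content[0].text},
    {"role": "user", "content": "Good findings. Now assume those are all fixed. "
                                "What is the NEXT most subtle bug you can find?"},
]
second_pass = client.messages.create(model=model, max_tokens=2000, messages=history)
print(second_pass.content[0].text)
```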


Why This Works

All of these techniques share the same underlying principle: specificity creates depth.

A vague prompt gives the AI permission to be vague back. A specific prompt with a defined scope, a threat model, a failure category, and a required output format forces the AI to do actual analysis instead of pattern-matching against common review feedback.

The difference between "review this code" and a well-structured review prompt is the difference between a rubber stamp and a genuine code audit. Same AI, same code, completely different results.


Skip the Trial and Error

It took me months of daily iteration to figure out which prompt structures consistently surface real bugs versus generic noise. If you want to shortcut that process, I put together a Claude Code Prompt Pack with 50+ battle-tested prompts covering code review, security audits, debugging, architecture analysis, performance optimization, and more. Each prompt includes the template, usage notes, and tips on how to iterate for deeper results. It is the reference I wish I had when I started using AI for serious code work.
