I spent three months letting AI review my code.
Not as a novelty. Not as an experiment I could tweet about. As my actual process. Every pull request. Every refactor. Every bug fix.
Then I shipped something that worked perfectly in testing and exploded in production. The AI missed it. Not because the AI was bad—because I had stopped thinking.
I – The problem you don't know you have
Here's what most developers are doing right now:
They write code. They paste it into ChatGPT or Claude. They ask "Is this good?" The AI says yes (or suggests minor changes). They merge it.
They think they're being efficient. They're actually building technical debt at scale.
The issue isn't the AI's output. It's what happens to your brain when you outsource judgment.
When you let AI handle code review, you stop asking why. You stop questioning architecture. You stop seeing patterns across your codebase because you're no longer forced to hold multiple contexts in your head simultaneously.
Code review isn't just about catching bugs. It's about maintaining a mental model of how your entire system works.
The printing press rendered scribes obsolete. Before Gutenberg, book producers employed dozens of trained artisans to hand-copy manuscripts. A skill that took years to master. Before they knew it, that skillset was worthless.
But a new role emerged: the editor. Someone whose job was deciding what was worth printing in the first place.
The pattern is that skills abstract upward.
You don't need to be the person who writes every line. But you absolutely need to be the person who knows which lines matter.
So why is this time different?
Because most developers are using AI like a spell checker when they should be using it like a research team.
II – Single model review is a guess dressed up as certainty
I was using Claude Opus 4.1 for everything.
Great model. Excellent at analysis. Strong with TypeScript. But it has blind spots.
One day I asked it to review a React component that was re-rendering unnecessarily. Claude suggested memoization. Reasonable. I shipped it.
Then I ran the same code through GPT-5 out of curiosity. GPT pointed out that the real issue was prop drilling—the memoization was treating a symptom, not the cause.
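Here's a stripped-down sketch of that shape of problem (invented names, not the actual component):

```tsx
// A hypothetical sketch, not the real code.
import React, { memo, useState } from "react";

type User = { name: string };

// The suggested fix: memoize the leaf so it stops re-rendering.
const Avatar = memo(function Avatar({ user }: { user: User }) {
  return <span>{user.name}</span>;
});

// The structural problem: `user` is drilled through Toolbar, which doesn't use
// it itself, and the search state lives a level too high. Every keystroke
// re-renders Page, Toolbar, and (without memo) Avatar.
function Toolbar(props: {
  user: User;
  query: string;
  onSearch: (q: string) => void;
}) {
  return (
    <div>
      <input value={props.query} onChange={(e) => props.onSearch(e.target.value)} />
      <Avatar user={props.user} />
    </div>
  );
}

export function Page({ user }: { user: User }) {
  const [query, setQuery] = useState("");
  // memo(Avatar) hides the symptom. Moving the query state down into Toolbar
  // (or providing `user` via context instead of drilling it) removes the
  // cause: Page stops re-rendering on every keystroke and nothing needs memo.
  return <Toolbar user={user} query={query} onSearch={setQuery} />;
}
```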
That's when it clicked.
Every AI model is trained differently. Every model has different strengths. Claude excels at nuanced analysis. GPT is better at architectural patterns. Gemini catches edge cases Claude misses.
When you only use one model, you're not getting a code review. You're getting that model's perspective. And that perspective has gaps you can't see because you've stopped looking.
The gap between mediocre and great is taste. When anyone can generate code, the ability to know which code to trust becomes the skill.
This is where running the same review across multiple models stops being a feature and starts being a different way of thinking.
III – What actually broke (and what it taught me)
The production bug was subtle.
A caching layer that worked fine in our staging environment but failed under load. The AI had reviewed the logic. The logic was correct. What the AI didn't catch was the assumption baked into the implementation—that cache invalidation would happen synchronously.
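Here's a boiled-down, hypothetical version of that bug. The names are invented; the race is the point:

```ts
// A simplified reconstruction of the failure, not the real code.

const cache = new Map<string, string>();
const db = new Map<string, string>();

// Invalidation looked synchronous at the call site, but in production it was
// a fire-and-forget message to other instances: it resolves "eventually".
function invalidate(key: string): void {
  setTimeout(() => cache.delete(key), 50); // stands in for async pub/sub
}

async function updateRecord(key: string, value: string): Promise<void> {
  db.set(key, value);
  invalidate(key); // assumption baked in: the cache is clean once this returns
}

async function readRecord(key: string): Promise<string> {
  const cached = cache.get(key);
  if (cached !== undefined) return cached; // under load, stale reads win this race
  const fresh = db.get(key) ?? "";
  cache.set(key, fresh);
  return fresh;
}
```

In staging, with one instance and no traffic, that 50ms window never mattered. Under production load, readers hit it constantly.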
Why didn't the AI catch it?
Because I didn't give it enough context. I pasted the function. I didn't show it the entire data flow. I didn't explain the deployment architecture.
The AI can only review what you show it. And when you're moving fast, you show it the minimum.
Here's what I learned: AI code review fails in three predictable ways.
Failure Mode 1: Context Collapse
You paste 50 lines. The AI reviews 50 lines. But the bug is in how those 50 lines interact with 200 other lines you didn't include.
The fix is to review code with its surroundings in view. Integration bugs only show up when the review can see how those 50 lines are actually used across the rest of the codebase.
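A toy example, with two invented files. Neither one contains the bug on its own:

```ts
// pricing.ts -- the 50 lines that get pasted. Reviewed alone, it looks fine;
// it silently assumes `items` has already been filtered to active products.
function orderTotal(items: { price: number; quantity: number }[]): number {
  return items.reduce((sum, item) => sum + item.price * item.quantity, 0);
}

// checkout.ts -- the 200 lines that don't get pasted. It passes cancelled
// line items straight through, so customers get billed for them. The bug is
// the mismatch between the two files, not a line in either one.
type LineItem = { price: number; quantity: number; status: "active" | "cancelled" };

function checkout(lineItems: LineItem[]): number {
  return orderTotal(lineItems); // missing: .filter(i => i.status === "active")
}
```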
Failure Mode 2: Architectural Blindness
AI is excellent at local optimization. It's terrible at system design. It will suggest a clever solution that makes one function faster while making your entire architecture more fragile.
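Here's a hypothetical version of that pattern: a change that's obviously faster in isolation and obviously worse for the system it lives in.

```ts
// Before: every call hits the feature-flag service. Slow, but always current
// and safe to run on any number of instances.
async function isEnabledSlow(flag: string): Promise<boolean> {
  return fetchFlagFromService(flag);
}

// After the suggested optimization: cache flags in module scope forever.
// Locally a clear win (one network call per flag). Architecturally it means
// flags never refresh until the process restarts, long-lived workers drift
// out of agreement with each other, and tests start leaking state.
const flagCache = new Map<string, boolean>();

async function isEnabledFast(flag: string): Promise<boolean> {
  const cached = flagCache.get(flag);
  if (cached !== undefined) return cached;
  const value = await fetchFlagFromService(flag);
  flagCache.set(flag, value);
  return value;
}

// Stub so the sketch stands alone.
async function fetchFlagFromService(_flag: string): Promise<boolean> {
  return true;
}
```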
Failure Mode 3: The Confidence Problem
AI never says "I don't know." It gives you an answer. And because it's articulate, you trust it. Even when it's wrong.
The people who figure this out don't abandon AI. They stop treating it like an oracle and start treating it like a team of junior developers who need direction.
IV – How to actually use AI for code review (without destroying your judgment)
Level 1: The Paster
You copy code. You paste it into a chatbot. You accept whatever it says. You've outsourced thinking.
Level 2: The Prompt Engineer
You write better prompts. You include more context. You get better answers. But you're still at the mercy of one model's perspective.
Level 3: The Orchestrator
You run the same code through multiple models. You compare. You synthesize. You notice when Claude catches something GPT missed and vice versa.
Level 4: The Architect
You use AI to handle specific review tasks while you control the system thinking. You know which model to use for what. You've built a workflow.
Most developers never leave Level 1.
Here's how to move up:
Step 1: Stop reviewing in isolation
Don't paste functions. Paste the entire context. Include the caller. Include the data flow. Include the deployment constraints.
Pull the relevant constraints out of your docs before you review. That's how you surface the assumptions the AI can't see from the code alone.
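Here's the rough shape of the prompt I send now. Everything below is placeholder content, not my real system:

```ts
// A sketch of the structure. The values are stand-ins for real code and docs.
const functionSource = "/* the changed function, pasted verbatim */";
const callerSource = "/* every call site, pasted verbatim */";

const reviewPrompt = `
Review this change in context, not in isolation.

## The changed function
${functionSource}

## Its callers
${callerSource}

## Data flow
HTTP handler -> validation -> this function -> orders table -> cache

## Deployment constraints
- 3 instances behind a load balancer
- cache invalidation is eventually consistent
- p95 latency budget: 200 ms

What assumptions does this change make that the context above would break?
`;
```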
Step 2: Use multiple models, always
Run your code through at least three different models. Look for disagreement. Disagreement is signal. It means there's something worth investigating.
The bottleneck isn't getting reviews—it's knowing which perspective to trust before you ship.
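You don't need tooling for this, but a small script helps. Here's a minimal sketch; `askModel` is a stand-in for whichever SDKs or HTTP calls you actually use per provider:

```ts
type ModelName = "claude" | "gpt" | "gemini";

async function askModel(model: ModelName, prompt: string): Promise<string> {
  // Placeholder: wire this to your Anthropic / OpenAI / Google client of choice.
  return `[${model}] review of: ${prompt.slice(0, 40)}...`;
}

async function multiModelReview(prompt: string): Promise<void> {
  const models: ModelName[] = ["claude", "gpt", "gemini"];
  const reviews = await Promise.all(
    models.map(async (m) => ({ model: m, review: await askModel(m, prompt) }))
  );

  // Read them side by side. Agreement is cheap confidence; disagreement is
  // exactly the thing worth investigating before you merge.
  for (const { model, review } of reviews) {
    console.log(`=== ${model} ===\n${review}\n`);
  }
}

// Usage: multiModelReview(reviewPrompt) with the context-rich prompt from Step 1.
```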
Step 3: Make the AI show its work
Don't accept "looks good" as a review. Make the AI explain its reasoning. Make it identify assumptions. Make it suggest edge cases.
If it can't defend its assessment, neither can you.
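One trick: ask for structured output so "looks good" isn't even a valid answer. The schema below is my own invention, not any tool's API:

```ts
// Demand structured output and treat anything incomplete as a failed review.
interface ReviewFinding {
  claim: string;       // e.g. "this handler can double-charge on retry"
  reasoning: string;   // the step-by-step argument for the claim
  assumptions: string; // what the model assumed about code it can't see
  edgeCase: string;    // a concrete input or condition that would expose it
  confidence: "low" | "medium" | "high";
}

interface ReviewResponse {
  findings: ReviewFinding[];
  notReviewed: string[]; // forces the model to admit what it couldn't assess
}

// An empty findings array plus an empty notReviewed array isn't a clean bill
// of health. It's "looks good" wearing a JSON costume.
```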
Step 4: Review the review
After you ship, come back to the AI's suggestions. Were they correct? What did it miss? Train yourself to see the gaps.
This is how you develop taste. Not by trusting AI blindly, but by learning where it fails.
V – The protocol
Here's exactly what I do now:
Morning: Architecture Review (20 minutes)
Before writing any code, I ask three models the same architectural question: "Given [system constraints], what are the tradeoffs of implementing [feature] this way?"
I look for disagreement. Where they agree, I move fast. Where they disagree, I slow down and think.
During Development: Context-Rich Prompts
I don't review individual functions anymore. I review:
- The function
- The caller
- The data flow
- The error handling
- The deployment context
I paste all of it. Every time.
Before Merging: Multi-Model Analysis
I run the final code through:
- Claude Sonnet 4.5 for logical analysis
- GPT-5 for architectural patterns
- Gemini 2.5 Pro for edge cases
Where they converge, I trust. Where they diverge, I investigate.
Weekly: Review What Broke
Every Friday I look at bugs from the week. I go back to the AI reviews. I identify what the AI missed and why. I update my prompts.
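I keep the record simple. Something like this works; the shape is mine, not a standard:

```ts
// The Friday ritual as data. A spreadsheet works just as well.
interface ReviewPostmortem {
  bug: string;                // what actually broke
  flaggedByAnyModel: boolean; // did any of the three reviews catch it?
  missedBecause: "missing context" | "architectural" | "overconfidence" | "other";
  promptChange: string;       // what the prompt now includes so it can't recur
}

const thisWeek: ReviewPostmortem[] = [
  {
    bug: "stale cache reads under load after async invalidation",
    flaggedByAnyModel: false,
    missedBecause: "missing context",
    promptChange: "always state whether cache invalidation is sync or eventually consistent",
  },
];
```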
The goal isn't perfection. It's iteration. It's building a system that gets smarter every week.
VI – What this actually costs you
You're thinking: "This sounds like more work, not less."
You're right.
Using AI well takes effort. More effort than using it poorly. More effort than not using it at all.
But here's what you get:
Before (single model, quick paste):
- 5 minutes per review
- 70% catch rate on bugs
- Zero improvement in your judgment
- Declining code quality over time
After (multi-model, context-rich):
- 15 minutes per review
- 95% catch rate on bugs
- Rapidly improving judgment
- Compounding code quality
The 10 extra minutes isn't overhead. It's investment. You're not just reviewing code. You're training yourself to see what matters.
The developers who skip this step will be fast for six months. Then they'll spend six months debugging the mess they created.
The ones who slow down now will be faster in six months because they'll have better judgment.
VII – The questions you should be asking
Here's what you need to ask yourself:
Are you using AI to think faster or to avoid thinking?
When AI suggests a change, can you explain why that change is better?
If you couldn't use AI tomorrow, would your code quality drop?
If you answered honestly, you're probably uncomfortable.
That discomfort is the entire point. It means you've been outsourcing judgment and calling it efficiency.
The gap is opening right now. Between developers who use AI as a crutch and developers who use it as leverage. Between those who let models think for them and those who orchestrate multiple perspectives into better decisions.
The shift
Code review abstracted upward. You don't need to catch every semicolon. But you absolutely need to understand the system.
AI won't replace you. But developers who know how to orchestrate AI will replace developers who don't.
The ones who figure this out in the next six months will be building at a completely different level. The ones who keep pasting code into ChatGPT and accepting whatever comes back will keep wondering why their production environment is on fire.
Intelligence should be fluid, not fragmented. You don't pick one model and defend it; you orchestrate all of them.
-Leena:)