I asked AI to review its own code last week.
The code had a bug. An unhandled edge case. A variable name that made no sense.
The AI's review?
"This code is clean, efficient, and well-structured. 10/10."
I asked again: "Are you sure? What about the edge case?"
It paused. Then fixed the bug. Then gave itself 11/10.
That's when I realized: AI code review isn't one thing. It's five different things. And most of us are stuck at Level 1 without even knowing it.
Here's the full ladder, from "trust me bro" to "actually production ready."
Level 1: It Works on My Machine
The workflow: Generate code → skim it → ship it → hope for the best.
The review: None. Just vibes.
You don't know what you don't know. The code works today. But edge cases? Security holes? Performance bottlenecks? You're betting your production environment on luck and the AI's confidence.
The tricky part is that this feels fine. The code looks clean. The AI sounded sure. It passed your quick sanity check. So you ship it.
And then three weeks later, a user hits the exact edge case you didn't think about. The one the AI didn't catch. The one you didn't check for. Because you were trusting vibes instead of verifying code.
The fix: Read the code you ship. Not skim — read. Line by line. If you can't explain what a line does, you don't ship it. That's the whole rule.
Your level if: You've ever copy-pasted AI code without fully understanding it.
(Be honest — we've all done it.)
Level 2: AI Self-Review
The workflow: Generate code → ask the same AI to review it → trust its confidence.
The review: The fox guarding the henhouse.
This feels smarter than Level 1. You're doing a review! You're being responsible! Except you're asking the same model, with the same blind spots, in the same conversation, to evaluate its own output.
AI doesn't know when it's wrong. Not because it's stupid — because it's not designed to know that. It pattern-matches. Its own code matches its own patterns perfectly. So it gives itself 10/10. Every time. And then 11/10 when you push back.
I tested this multiple times. I gave AI code with deliberate bugs. Asked it to self-review. It caught maybe 30% of them: the obvious ones it had been trained to spot. The subtle ones? Invisible. Because they matched its own patterns.
The signal that you're here: The AI never says "this needs serious work." It only ever says "looks good, minor suggestions below."
The fix: Never trust self-review. The AI will always find itself innocent.
Your level if: You've ever asked ChatGPT to review code that ChatGPT wrote and shipped based on that answer.
Level 3: Cross-Model Review
The workflow: GPT generates → Claude reviews → Gemini tie-breaks.
The review: Different training data. Different error models. Different blind spots.
This is where it actually gets interesting. Different model families were trained differently, fine-tuned differently, and make different types of mistakes. Where they disagree — that's where the signal lives.
I started doing this consistently a few months ago. The pattern I noticed: when all three models agree the code is fine, it's usually fine. When two models disagree with the third, dig deeper. The disagreement is your to-do list.
The problem is you're now juggling multiple tools, multiple API keys, and a workflow that adds friction. It's better — meaningfully better — but it's not free.
The fix: Run your code through at least two different model families. Don't average the feedback — contrast it. The interesting part isn't where they agree. It's where they don't.
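To make that concrete, here's a minimal sketch of a two-model review pass. It assumes the official `openai` and `anthropic` Python SDKs with API keys set in the environment; the model names are placeholders, so swap in whatever you have access to.

```python
# Minimal sketch: get two independent reviews of the same code,
# then read them side by side instead of averaging them.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set.
import anthropic
import openai

PROMPT = (
    "Review this code for bugs, edge cases, and security issues. "
    "Be specific and critical:\n\n{code}"
)

def review_with_gpt(code: str) -> str:
    client = openai.OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(code=code)}],
    )
    return resp.choices[0].message.content

def review_with_claude(code: str) -> str:
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT.format(code=code)}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    with open("snippet.py") as f:  # the code under review
        code = f.read()
    print("=== GPT review ===\n" + review_with_gpt(code))
    print("=== Claude review ===\n" + review_with_claude(code))
    # Your to-do list is every point where the two reviews disagree.
```

The script is trivial on purpose. The habit is the point: two independent reviews, read against each other, not merged into one.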
Your level if: You've ever had Claude catch something GPT missed — or vice versa — and it saved you from a production bug.
Level 4: Human + AI Hybrid
The workflow: AI scans for obvious issues. Human reviews for everything else.
The review: Speed plus judgment. The best of both.
Here's the thing nobody says out loud: AI is great at catching what it has seen before. Known patterns, common bugs, obvious mistakes. Humans are great at catching what doesn't belong — the thing that's technically correct but semantically wrong. The logic that works but violates an invariant nobody wrote down. The function that does what it says but not what was intended.
That gap between technically correct and actually right is where human review lives. And no amount of cross-model consensus closes it.
The workflow that works: AI does the first pass for syntax, edge cases, and known patterns. You do the second pass for context, business logic, and the stuff that doesn't fit. You don't let AI be the final word on anything that matters.
The signal that you're here: You find yourself saying "this code works, but it doesn't feel right." That instinct is the human signal. Trust it.
The fix: Use AI for the first pass. Use yourself for the second. Never skip the second.
Your level if: You always do a final human pass before shipping, no matter how confident the AI review sounds.
Level 5: Production Ready
The workflow: Automated tests + observability + human judgment + continuous feedback loop.
The review: Not a moment. A system.
This is where the mindset shift happens. Level 1 through 4 treat code review as a gate — something that happens before merge. Level 5 treats it as a continuous process — something that starts before merge and never really stops.
| Before Level 5 | At Level 5 |
|---|---|
| Review once before merge | Review before and after merge |
| Catch bugs manually | Automated tests catch regressions |
| Hope nothing breaks | Observability tells you when it breaks |
| Incidents are surprises | Every incident improves the process |
| Confidence = luck | Confidence = systems |
The best code review doesn't happen in a PR. It happens when real users hit real edge cases in production. When your monitoring catches what no reviewer could. When your on-call rotation turns incidents into process improvements.
At Level 5, you're not afraid to ship. Not because you got lucky. Because you built the systems that catch what slips through.
The fix: Add automated tests. Add monitoring. Build the feedback loop. Make incidents a source of learning, not just a source of stress.
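For the automated-tests piece, here's a minimal sketch with pytest. `parse_amount` is a hypothetical stand-in for whatever AI-generated code you shipped; the idea is that every production incident becomes a permanent regression guard.

```python
# Minimal sketch of edge-case regression tests with pytest.
# `parse_amount` is a hypothetical function standing in for the
# AI-generated code you actually shipped.
import pytest

def parse_amount(raw: str) -> float:
    """Parse a user-supplied amount like '1,234.56' into a float."""
    cleaned = raw.strip().replace(",", "")
    if not cleaned:
        raise ValueError("empty amount")
    return float(cleaned)

@pytest.mark.parametrize("raw, expected", [
    ("1,234.56", 1234.56),   # the happy path
    ("  42 ", 42.0),         # stray whitespace
    ("0", 0.0),              # zero is valid, not rejected as falsy
])
def test_parse_amount_valid(raw, expected):
    assert parse_amount(raw) == expected

@pytest.mark.parametrize("raw", ["", "   ", "abc"])
def test_parse_amount_rejects_garbage(raw):
    # Each of these cases models a real incident: the test is the
    # feedback loop turning an outage into a permanent check.
    with pytest.raises(ValueError):
        parse_amount(raw)
```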
Your level if: You have automated tests, monitoring, and an on-call process — and you actually use them, not just check the boxes.
The Honest Truth About Where Most Teams Are
Most teams are somewhere between Level 1 and Level 3.
Level 1 is dangerous and way more common than anyone admits. Level 2 feels like progress but is mostly an illusion. Level 3 is genuinely better but costs time and money most teams don't budget for.
The jump from Level 3 to Level 4 is the hardest one. It requires humans who actually review code and protected time to do it. In most teams, that time gets cut first when things get busy.
The jump to Level 5 is the most expensive. It requires tooling, monitoring, organizational discipline, and a culture that treats incidents as learning opportunities instead of blame assignments.
But here's what I've learned the hard way: you can't skip levels. Level 2 won't get you to Level 4. Level 3 won't get you to Level 5. You have to build the foundation at each step before the next one holds.
Your Next Step — Based on Where You Are
If you're at Level 1:
Start reading every line of code you ship. Not skimming. Reading. That's it. That's the whole step.
If you're at Level 2:
Stop trusting self-review. Run the same code through a second model family and compare the feedback.
If you're at Level 3:
Add a human pass. Even 10 focused minutes of human review catches things that three models in consensus miss.
If you're at Level 4:
Add automated tests for the edge cases you've seen break in production. Then add monitoring. Then build the feedback loop.
If you're at Level 5:
Tell the rest of us how you got there. Seriously. Write the post. We need it.
One Question Before You Go
What level are you actually at right now?
Not what level your team's process says you're at. Not what level you aspire to be at. What level do your last three PRs honestly reflect?
I'll go first in the comments.
Your turn. 👇
Disclosure: I used AI to help structure and organize my thoughts — but every experience, example, and opinion in this article is my own.
Top comments (4)
Really like the framing.
I kept running into a slightly different problem: the hardest part wasn't reviewing the code - it was understanding what the model changed in the first place. In my case, a "small change" ended up rewriting half the repo.
What helped was defining boundaries before generation (in a spec).
It turned review into "compare against spec" instead of "reverse-engineer the diff".
Feels like this could sit as a "level 0" before the rest.
Kirill, this is a brilliant addition. Thank you. 🙏
"Review turned into 'compare against spec' instead of 'reverse-engineer the diff.'"
That's the key line. Most of us don't write specs. We prompt vaguely, get vague output, then spend hours trying to figure out what the model actually did. The reverse-engineering tax is real and you've named it perfectly.
Defining boundaries before generation: this is the missing step. Not review after the fact. Constraint before the fact. A spec isn't just documentation. It's a contract between you and the AI.
And you're right, this sits before Level 1. Level 0: Spec-First.
Thank you for this. It genuinely made the framework stronger. 🙌
Thank you for sharing, Harsh. Beautifully written.
Thank you Urmila. Glad you liked it.