Ken Imoto

Posted on • Originally published at zenn.dev

I Tried 3 Layers of AI Code Review So Your Diff Doesn't Have To

I shipped 3 bugs last quarter, every one of them waved through by a "looks good to me" AI code review

I had one of those quarters where every PR went through an AI reviewer, every PR got a friendly "LGTM with minor suggestions", and three of those PRs still managed to wedge production. One was an N+1 query that only appeared when a customer hit a specific endpoint with more than 50 items. One was a missing await that the AI cheerfully ignored because the code "looked async-ish". One was a permission check we removed that nobody, human or model, flagged.

After the third one I stopped blaming the model and started blaming my setup. A single AI reviewer running once on a diff is not a code review. It is a vibe check.

What actually fixed it was splitting review into three layers, with three different jobs, and never letting any one of them pretend to be the others. This post is that setup.

Why a single AI reviewer falls over

I once spent a Sunday tagging every comment on our PRs for a month. The split was uncomfortable.

About 70% of human review comments were things like "indentation off", "leftover console.log", "this should be camelCase", "no test for this branch", "type is any". The kind of thing a linter, a formatter, or a half-asleep AI can spot in milliseconds.

Only about 30% touched the things humans are actually good at: is this the right architecture, does this match the business rule, what blast radius does this change have.

If you point one AI at the whole PR and ask it to review, it will mostly do the easy 70% (badly, sometimes), and gesture vaguely at the hard 30%. You end up with a reviewer that is simultaneously too noisy on style and too quiet on the parts that matter.

The fix is not "a smarter model". The fix is splitting the work.

The 3 layers I run now

Here is the shape of it.

| Layer | Owner | What it catches |
| --- | --- | --- |
| Layer 1 | Hooks + CI | Mechanical issues: format, lint, types, missing tests |
| Layer 2 | AI reviewer(s) | Pattern issues: N+1, dead code, naming, small refactors |
| Layer 3 | Human reviewer | Design, business logic, security weight, blast radius |

Each layer assumes the previous one passed. Layer 1 failures never reach Layer 2. Layer 2 blocking comments pause Layer 3 until they are resolved.

The order matters. Do not let humans waste eye-time on what a hook would have caught for free.

Layer 1: hooks and CI

Layer 1 has one job: kill mechanical problems before a human or an AI ever sees them.

I run two stages.

On the laptop, via Lefthook (or husky / pre-commit):

  • Formatter (Biome / Prettier)
  • Linter (Biome / ESLint)
  • Type check (tsc --noEmit)
  • Affected tests only

Heavy stuff at pre-commit makes people hate you, so I keep pre-commit narrow (changed files only) and push the slower checks to pre-push.
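Here is a minimal lefthook.yml sketch of that split. It assumes Biome, tsc, and Vitest; swap in whatever your repo actually runs:

```yaml
# lefthook.yml: narrow pre-commit, heavier pre-push
pre-commit:
  parallel: true
  commands:
    format-and-lint:
      glob: "*.{ts,tsx}"
      # Biome >= 1.8; older versions use `--apply` instead of `--write`
      run: npx biome check --write {staged_files}
      stage_fixed: true

pre-push:
  commands:
    typecheck:
      run: npx tsc --noEmit
    affected-tests:
      glob: "*.{ts,tsx}"
      # runs only the tests related to the files being pushed
      run: npx vitest related --run {push_files}
```

Pre-commit touches staged files only and re-stages whatever the formatter fixed; the slow checks wait for pre-push.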

In CI, the exact same checks again.

Why both? Because local hooks can be skipped. Someone is always one --no-verify away from shipping a 200-line diff with any everywhere. CI is the part you cannot bargain with.
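The copy you cannot bargain with, sketched as a GitHub Actions job (job name and action versions are illustrative):

```yaml
# .github/workflows/layer-1.yml: the same checks, minus the escape hatch
name: layer-1
on: pull_request
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 22 }
      - run: npm ci
      - run: npx biome ci .     # format + lint, read-only
      - run: npx tsc --noEmit   # type check
      - run: npx vitest run     # full suite; CI can afford it
```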

A small example of the kind of bug Layer 1 actually catches: a teammate once renamed a config key but missed one call site. TypeScript caught it in pre-push. The PR never opened. No reviewer time spent. No AI tokens spent. That is the win.
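A toy reconstruction of that class of bug (file and key names are made up):

```typescript
// config.ts: key renamed from `apiUrl` to `baseUrl`
export const config = { baseUrl: "https://api.example.com" };

// client.ts: the one call site the rename missed
import { config } from "./config";

fetch(config.apiUrl);
// pre-push: error TS2339: Property 'apiUrl' does not exist on
// type '{ baseUrl: string; }'
```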

Layer 2: AI review, with role separation

This is where most teams over-spend or under-spend. The trick is treating different AI reviewers as having different jobs, not running three of them on the same diff and hoping for diversity.

I think about it as three roles. You do not need all three.

Pattern sweeper (e.g. CodeRabbit). Good at scanning the whole diff and surfacing N+1 queries, dead code, mis-shaped error handling, things that match a known pattern. Configurable via a YAML file, so you can weight security-related paths more heavily than test files.
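As an illustration, a .coderabbit.yaml sketch of that weighting. The globs and wording are mine; check CodeRabbit's current config schema before copying:

```yaml
# .coderabbit.yaml: weight review attention by path
reviews:
  path_instructions:
    - path: "src/auth/**"
      instructions: >-
        Treat any change to permission or session checks as a
        potential blocking issue. Flag removed checks explicitly.
    - path: "**/*.test.ts"
      instructions: >-
        Low priority. Do not block on style; only flag tests that
        assert nothing.
```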

Local refactor advisor (e.g. GitHub Copilot Code Review). Good at sitting on a specific function and saying "this loop becomes one map, this nested if becomes a guard clause". Concrete suggestions on a small surface area.

Project-rules enforcer (e.g. Claude with AGENTS.md / CLAUDE.md). Good at reading your repo's actual conventions and flagging "we do not use this util in this layer" or "this module is supposed to be pure". The other two cannot do this; they do not know your house rules.

For solo projects I just use the third one. For real team repos I run the pattern sweeper on every PR and let developers opt into the refactor advisor when they want a second opinion. That is roughly how much AI review a normal PR can absorb before the noise starts costing more than it saves.

A bug Layer 2 caught for me recently: a new endpoint was iterating over user sessions and calling the database inside the loop. CodeRabbit flagged the N+1 pattern with a one-line suggestion. A human reviewer might have caught it too, but the human reviewer would then not have had time to look at the actual auth flow change in the same PR. Which is the whole point.
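The shape of that fix, sketched in Prisma-flavored TypeScript. The real code differed, and db, sessions, and attach here are placeholders:

```typescript
// Before: one query per session, the N+1 shape CodeRabbit flagged
for (const session of sessions) {
  const user = await db.user.findUnique({ where: { id: session.userId } });
  attach(session, user);
}

// After: one query for the whole batch, joined in memory
const users = await db.user.findMany({
  where: { id: { in: sessions.map((s) => s.userId) } },
});
const byId = new Map(users.map((u) => [u.id, u]));
for (const session of sessions) {
  attach(session, byId.get(session.userId) ?? null);
}
```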

Layer 3: humans, and only on four things

By the time a PR reaches Layer 3, the formatter, the linter, the types, the obvious patterns, and the project-rule violations are all gone. What is left is everything machines are bad at.

I keep human reviewers focused on exactly four questions:

1. Direction. Does this change fit the architecture we agreed on? Module boundaries, layering, who owns what. A model can recite your architecture; it cannot tell whether this PR is quietly drifting away from it.

2. Business logic. Does this match the actual rule the business wants? Edge cases, weird customer states, that one regulator who wants invoices rounded a specific way. This is where domain knowledge lives, and where AI is most confident and most wrong.

3. Security weight. Is this a "small refactor" or is this "we just changed who can see what"? AI can flag a permission check; a human decides whether the change is one that needs a second pair of eyes from the security-minded engineer.

4. Blast radius. What else does this touch that is not in the diff? Which untested area might regress? Long-tenure engineers know which parts of the codebase have a history of revenge.

Four questions. Not "did you forget a semicolon". When I drew the line here, average human review time roughly halved on our team and reviewers stopped resenting the queue.
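If you want the rubric in front of reviewers on every PR, it fits in a pull request template. A sketch, assuming GitHub:

```markdown
<!-- .github/PULL_REQUEST_TEMPLATE.md -->
## Layer 3 review (for the human reviewer)
- [ ] Direction: fits the architecture we agreed on
- [ ] Business logic: matches the actual rule, edge cases considered
- [ ] Security weight: does this change who can see or do what?
- [ ] Blast radius: what outside this diff could regress?
```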

What each layer misses (and why running all three matters)

Concretely, here is what I have watched leak past each layer.

  • Layer 1 alone: ships a PR that lints clean, types clean, tests pass, and quietly contains an N+1 query that takes down staging for hours under realistic load. Linters do not know about your database.
  • Layer 2 alone: rubber-stamps a refactor that is technically beautiful and strategically wrong. The AI does not know that this module is on the deprecation list and you are not supposed to add features to it.
  • Layer 3 alone: is what most teams had five years ago. Humans drowning in style nits, missing the security change buried on line 184 of the diff because their attention budget was already spent on tab vs spaces.

The 3-layer setup is not "more review". It is the same review, sorted so each reviewer is doing the thing they are actually good at.

How I wire this up in practice

The glue is a single AGENTS.md (or CLAUDE.md, same idea) at the root of the repo. It explicitly writes down which layer owns what.

```
## Review policy

### Layer 1 — hooks / CI
- format (Biome)
- lint (Biome)
- type check (tsc --noEmit)
- affected tests

### Layer 2 — AI review
- CodeRabbit: auto on every PR. Focus: N+1, dead code, error handling.
- Copilot Code Review: opt-in by author. Focus: local refactors.
- Claude (/review-pr): focus on AGENTS.md rule violations only.
  Do NOT comment on architecture or business logic.

### Layer 3 — human review
Reviewers focus on exactly four things:
- direction (architecture fit)
- business logic correctness
- security weight
- blast radius

### Comment style
We use Conventional Comments:
praise / nit / suggestion / issue / question
```
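In that vocabulary, a blocking comment reads like this (made-up example):

```
issue (blocking): this loop calls the database once per item, the same
N+1 shape Layer 2 exists to catch. Batch it before merge.

nit (non-blocking): rename usrCfg to userConfig to match this module.
```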

Two things I learned to be explicit about:

  • Tell the AI what NOT to comment on. Otherwise the pattern sweeper starts opining on architecture, the human reviewer reads it, defers, and now nobody is checking architecture. Bound the AI's job in writing.
  • Tell humans what IS their job. Otherwise they keep nitpicking format because format-nits are fast dopamine. Make Layer 3 boring on purpose.

Three traps I walked into so you do not have to

Trap 1: Layer 2 starts doing Layer 3's work. The AI writes a confident paragraph about your architecture. Humans read it and think "well, the AI has it covered". Now nobody is doing architecture review. Fix: write down in AGENTS.md that AI reviewers do not comment on design direction. They literally are not allowed to, and you tell them so.

Trap 2: Layer 1 gets bypassed. Local hooks get disabled because someone is "in a hurry". Fix: assume good-faith hooks will fail you about once a month and put the same checks in CI as a hard gate. I learned this the hard way after merging a PR with a leftover console.log that printed customer emails into our logs for a weekend.

Trap 3: Treating "LGTM" from AI as approval. It is not. The AI's "LGTM" means "I did not find a pattern I recognize". That is a useful signal, not a verdict. The human at Layer 3 is still the one who merges.

What I would do differently next time

If I were starting a new repo tomorrow, I would do Layer 1 on day one, Layer 3 (the four-question rubric) on day two, and add Layer 2 only after I had seen what kinds of bugs were actually leaking past humans. I jumped straight to "let us add three AI reviewers" once and ended up with a PR comments page that read like a group chat at 2am.

The 3-layer model is not about removing humans from review. It is about putting humans in the spot where they are obviously, embarrassingly better than any model: judgment about your specific system, your specific business, and your specific tolerance for risk. That is the part the model cannot fake.

Everything else, let the machines do the dishes.

Read more

If you want the full 21-chapter playbook with config examples for AGENTS.md, hook setups, and team rollout patterns, I wrote a Zenn Book on it:

Harness Engineering Practice (Zenn Book)
