Your team's code review process was designed for a world where a developer wrote every line.
In that world, a PR with 200 lines meant 200 lines of human judgment. The reviewer's job was to check that judgment: catch bugs, enforce patterns, transfer knowledge. Review velocity matched author velocity. It worked.
Now a developer generates 800 lines in an afternoon. The PR is 800 lines of AI output, lightly edited. The reviewer's job hasn't changed. The workload has.
Something breaks at scale. This is that something — and the fix isn't "review more carefully."
## Why Code Review Worked
To understand what's breaking, start with why code review worked.
The implicit model was straightforward: a developer who wrote the code understands the code. The reviewer's job isn't to re-derive it from scratch — it's to check what the author might have missed, enforce conventions they might have drifted from, and transfer knowledge bidirectionally.
Author accountability made this work. When a reviewer left a comment — "we don't use useEffect for data fetching here" — the author learned something. The next PR, they applied it. Over months, the team's conventions transferred from senior to junior through accumulated review feedback.
Review velocity matched author velocity. One developer submitting 3 PRs a week produced review work that 1–2 reviewers could handle with appropriate depth. The ratio held.
None of these assumptions hold when AI is in the loop.
## How AI Changes the Math
### Volume
A developer with an AI assistant generates significantly more code per day. Self-reported data and published productivity studies vary widely — 2× for complex tasks, up to 10× for focused generation work — but the directional finding is consistent: output volume per developer increases meaningfully.
If a developer submitted 3 PRs a week before, they might submit 15 now. The review backlog grows faster than review capacity. Reviewers face a choice: spend more time reviewing (at the cost of their own development work), or skim.
Most teams end up with a mix: some PRs get deep review, most get shallow review. Which PRs get the shallow treatment is unpredictable, so convention violations slip through.
### Accountability Diffuses
"The AI wrote it" is a new and genuinely complicated accountability situation.
The developer is still responsible — they submitted the PR. But they may not have deeply understood every line. They didn't write it from scratch; they prompted and accepted. Their mental model of why the code does what it does is less thorough than if they'd typed it line by line.
Reviewers feel this. The usual question — "what was the author thinking here?" — becomes harder to answer. "They asked AI to implement X" is not an answer that helps you evaluate whether the approach is sound.
### Pattern Inconsistency Increases
Human developers drift from conventions gradually and inconsistently. Developer A might occasionally forget to validate input; Developer B might sometimes skip field selection on Prisma queries. These are distributed individual errors.
AI drift is different. When AI generates code without your team's conventions in its context, it defaults to patterns from its training data. Those patterns may be systematically different from yours. Every API route generated in the same session might have the same missing auth pattern. Every Server Action might use the same incorrect error handling.
This is worse than individual human drift because it's coherent and consistent — which makes it easier to miss. A codebase with uniform but incorrect patterns looks intentional.
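To make that concrete, here's a hypothetical sketch (the route, the Prisma import path, and the requireSession convention are illustrative, not from a real codebase). If the convention isn't in the AI's context, every route from the session tends to come out like this:

```ts
// app/api/projects/route.ts (hypothetical AI output)
// Team convention, absent from the AI's context: call `await requireSession()`
// before any database access.
import { NextResponse } from "next/server";
import { prisma } from "@/lib/prisma"; // assumed project-local Prisma client

export async function GET() {
  // Missing: const session = await requireSession();
  const projects = await prisma.project.findMany(); // also no field selection
  return NextResponse.json(projects);
}
```

Every other route generated in the same session repeats the same omission in the same shape, which is exactly why it reads as intentional.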
### The Knowledge Transfer Function Breaks
Junior developers lost a growth mechanism.
When AI generated the code and the reviewer fixed a convention violation in review, who learned? The reviewer wrote the comment. The developer accepted the suggestion. The AI generated the next file without knowing the convention existed.
The feedback loop that built senior developers over time — submit code, get specific feedback, internalize the pattern, apply it next time — is disrupted when AI is generating first drafts. The junior developer is learning to prompt and accept, not to derive patterns from first principles.
This isn't an argument against using AI. It's a recognition that knowledge transfer now requires explicit investment, not passive osmosis.
## The Traditional Responses and Why They Fail
Teams hit the review breakdown and try a few things:
"Just slow down review." This creates a bottleneck. Developers wait days for review. The velocity gain from AI evaporates. You've traded speed for thoroughness — but even "slow, careful" review isn't systematic. Two reviewers looking at the same AI output often flag different convention issues and miss different ones. Slowdown adds time; it doesn't add consistency.
"Be more strict about AI code." Stricter in what sense? Without a written rubric, "be more strict" means each reviewer applies their own mental model more rigorously. What you get is reviewer opinion variance at higher intensity — more reviewer burnout, and still no guarantee the same issue is caught across PRs. Convention consistency requires an external reference, not more individual rigor.
"Require developers to review AI output before submitting." A reasonable baseline, but it doesn't solve the systematic gap. A developer who hasn't internalized a convention can't catch its violation when they review their own output. The AI writes what it doesn't know. The developer approves what they don't know. The gap passes through both filters undetected.
"Pair program with AI and with a reviewer in real time." Works for critical paths. Doesn't scale to the steady flow of feature work, where AI generation volume is highest and pairing capacity is lowest.
All of these responses share a flaw: they're trying to fix a pre-implementation problem at the post-implementation stage. By the time the PR exists, 800 lines of convention-inconsistent code already exist. The fix is preventing them before the first line is generated.
## Pre-Implementation Governance: The Shift
The core insight: most convention violations, security issues, and pattern drift are predictable and preventable — if your AI has your conventions in context.
The current workflow for most teams:
```
Developer prompts AI
        ↓
AI generates 800 lines without team conventions
        ↓
Developer submits PR
        ↓
Reviewer finds convention violations, security issues, pattern drift
        ↓
Fix cycle (sometimes multiple rounds)
        ↓
Merge
```
The alternative:
```
Team conventions encoded in AI context (CLAUDE.md, .mdc files, Steering Rules)
        ↓
Developer prompts AI
        ↓
AI generates code following team conventions
        ↓
Developer submits PR
        ↓
Automated checks verify convention adherence (linting, type checking)
        ↓
Reviewer focuses on logical correctness and design decisions
        ↓
Merge
```
The shift: move convention enforcement from review (post-implementation) to context (pre-implementation) and automation (pre-merge).
This has three components:
1. Explicit rule encoding. CLAUDE.md is an index file — 3 directive lines pointing to your guideline files (a sketch follows this list). The actual conventions (auth patterns, error handling, naming conventions, security invariants) live in the referenced files, not inline. This separation keeps the rules in the AI's context, written explicitly enough that the AI can follow them without interpretation, without diluting its attention with a wall of text.
2. Automated static verification. Linting, type checking, and security scanning run as pre-commit hooks or CI gates. Rules that can be checked mechanically are checked mechanically — not by human reviewers.
3. Review scope redefinition. Reviewers stop checking convention compliance (the AI and linter handle that) and focus on what only humans can evaluate: logical correctness, architectural fit, edge case coverage, performance implications.
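To make the first component concrete, here's a minimal sketch of an index-style CLAUDE.md. The paths and file names are illustrative; the @-style references assume an assistant that can follow file imports from its context file.

```markdown
- Follow @docs/guidelines/conventions.md for naming, module layout, and ActionResult<T> error handling.
- Follow @docs/guidelines/security.md for auth checks, input validation, and data-exposure rules.
- Follow @docs/guidelines/architecture.md for layering and where new code belongs.
```

The same three pointers can be mirrored into .mdc files or Steering Rules for other assistants; the index stays short, and the referenced files carry the detail.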
## What Changes for Reviewers
This is the part that makes the shift concrete.
Stop reviewing (delegate to AI context + automation):
- Naming convention adherence — your AI guidelines handle this; your linter enforces it
- Missing error handling — your AI context requires `ActionResult<T>`; your types enforce it (a sketch follows this list)
- Auth patterns — your AI security rules specify them; your linter checks for auth calls
- Import ordering, formatting — automated entirely
- Basic type errors — TypeScript strict mode catches these before review
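As one illustration of "your types enforce it": a minimal sketch of a discriminated-union result type. The exact shape is an assumption here; your team's version may differ.

```ts
// A minimal ActionResult<T> sketch: callers must check `ok` before they can
// touch `data`, so "forgot to handle the error case" fails to compile.
export type ActionResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

// Illustrative Server Action using the convention (names are hypothetical)
export async function createProject(
  name: string
): Promise<ActionResult<{ id: string }>> {
  if (!name.trim()) {
    return { ok: false, error: "Project name is required" };
  }
  // ...persist the project, then return its id
  return { ok: true, data: { id: "generated-id" } };
}
```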
Start reviewing (focus here):
- Does the logic correctly implement the intended behavior?
- Does the architectural approach fit the existing system?
- Are the edge cases covered? (not just happy path)
- Are there performance implications at scale?
- Does this design decision create problems we'll regret in 6 months?
These are judgment calls. They require understanding the system, the business context, and the tradeoffs. No linter and no AI rule set can substitute for this.
The result: reviews become shorter in clock time and higher in cognitive quality. Reviewers spend their time on the questions only they can answer.
## Implementing the Shift
### Step 1: Audit Your Last 20–30 PRs
Go through the review comments. Categorize each:
- Convention — naming, patterns, formatting
- Security — auth, validation, data exposure
- Logic — wrong behavior, missing edge case
- Design — architectural mismatch, scalability concern
Count the percentage in each category. In PRs I've reviewed on AI-assisted teams, convention and security comments are consistently the largest share — often more than half. That's your automation target.
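If you want the tally to be mechanical rather than eyeballed, a few lines of script are enough. The labels and data shape here are assumptions; use whatever your review tooling exports.

```ts
// Count review comments per category and print each category's share.
type Category = "convention" | "security" | "logic" | "design";

// Hypothetical input: one label per review comment from your last 20-30 PRs
const comments: Category[] = [
  "convention", "security", "convention", "logic", "convention", "design", "security",
];

const counts = new Map<Category, number>();
for (const c of comments) counts.set(c, (counts.get(c) ?? 0) + 1);

for (const [category, count] of counts) {
  const pct = Math.round((count / comments.length) * 100);
  console.log(`${category}: ${count} comment(s), ${pct}%`);
}
```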
### Step 2: Encode Conventions
Every convention-category comment from step 1 should become either:
- A rule in your `CLAUDE.md` / AI guidelines
- An ESLint rule (if checkable statically; see the example at the end of this step)
- A TypeScript type constraint (if enforceable via types)
Repeated security-category comments become security rules in AI context + security ESLint plugins.
This is one-time work. Once a rule is encoded, you never leave that review comment again.
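For the statically checkable branch, the recurring "we don't use useEffect for data fetching here" comment from earlier can become a lint rule. Here's a sketch using ESLint's flat config and the built-in no-restricted-syntax rule; the selector, paths, and message are illustrative:

```js
// eslint.config.mjs (sketch)
export default [
  {
    files: ["app/**/*.tsx", "src/**/*.tsx"],
    rules: {
      "no-restricted-syntax": [
        "error",
        {
          // Any fetch() call nested inside a useEffect() callback
          selector:
            "CallExpression[callee.name='useEffect'] CallExpression[callee.name='fetch']",
          message:
            "Don't fetch data inside useEffect; use the shared data-access layer (see docs/guidelines/conventions.md).",
        },
      ],
    },
  },
];
```

Run it as a pre-commit hook or CI gate and that review comment never needs to be written again.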
### Step 3: Redefine Review Scope
Write a short "what we review" document for your team. One page. It explicitly states:
- What's handled by AI context (convention compliance)
- What's handled by automated checks (linting, types, security scanning)
- What reviewers are responsible for (logic, design, edge cases)
This isn't telling reviewers to be less rigorous. It's telling them where to direct their rigor.
### Step 4: Measure
Track:
- PR cycle time — should decrease as convention-fixing rounds drop out
- Reviewer time per PR — should decrease once automation handles convention checking
- Convention violations post-merge — should approach zero with rules in AI context
If violations post-merge aren't dropping, the encoded rules are incomplete or imprecise. Iterate on the rules, not the review process.
## What Doesn't Change
Code review still matters. Logic errors, design mismatches, and business rule violations require human judgment. AI-generated code needs review for correctness just as human-generated code does — arguably more, because the author's understanding of the code is shallower.
What changes is what review is for.
Pre-implementation governance doesn't eliminate review. It elevates it. When reviewers aren't spending their attention on convention checklists, they can spend it on the architectural questions that actually determine whether a feature ages well.
## The Knowledge Transfer Problem
One thing pre-implementation governance doesn't solve: junior developer growth.
If conventions are enforced by AI and automation before a junior developer ever touches the code, they may never encounter the correction that would have built their intuition. The feedback loop that built senior developers is still disrupted.
This is a real cost. The mitigation is intentional: pair programming sessions focused on design decisions (not convention compliance), explicit architectural discussions, and making the encoded rules legible — so junior developers read the guidelines and understand why the conventions exist, not just that they exist.
The rules your team codified for AI should be readable as an engineering culture document. They're not just instructions for the AI.
## The Broader Pattern
Quality assurance in software has been moving in one direction for decades: earlier.
Testing shifted from manual QA after shipping to automated tests before shipping. Security shifted from penetration testing in production to SAST in CI. Code quality shifted from "hope the reviewers catch it" to linting in the editor.
In each case: catch problems earlier, make them cheaper to fix, free up human attention for judgment calls.
Pre-implementation governance is the same shift applied to AI-generated code. Conventions that used to be caught in review get enforced before generation. Human review bandwidth goes to the problems that require it.
The teams that will scale AI-assisted development well are the ones investing in rule infrastructure now — not the ones reviewing faster.
What's your team's current review-to-generation ratio? If you're generating 10× more code but haven't changed your review process, the math is working against you.
This is the final article in the AI Dev OS series. The full framework — including rule templates for Next.js and TypeScript — is at github.com/yunbow/ai-dev-os.