Bala Paranj

Posted on Jun 2 • Edited on Jun 25

Fallacies of GenAI Development #4: Dropping Human Review Removes the Bottleneck

#ai #softwaredevelopment #architecture #engineering

✓ Human-authored analysis; AI used for formatting and proofreading.

This is the fourth in a series of eight posts on the false assumptions teams make when building with generative AI. Fallacy #1 covered why faster generation doesn't mean faster engineering. Fallacy #2 covered why plausible isn't correct. Fallacy #3 covered why AI can't reliably verify AI. This post covers what happens when the review gate is removed entirely.

The Fallacy

"Human code review is the bottleneck. If we drop it, the pipeline moves at AI speed."

Why it's tempting

The math is simple and the frustration is real. AI generates a PR in 3 minutes. Human review takes 3 hours. The human is 60x slower than the machine. If you have five developers each generating AI-assisted PRs, one tech lead reviewing them becomes the constraint on the entire team's output. The pipeline stalls at the review stage.

The trending solution: drop the review. Let AI write 100% of the code. Trust the tests. Ship faster. Prominent voices in the industry advocate this explicitly — if the review is the bottleneck, remove it. The reviewer's time is better spent on higher-level work.

The logic is compelling until you ask one question: what replaced the review?

Research backs the concern. Perry et al.'s 2023 Stanford study ("Do Users Write More Insecure Code with AI Assistants?") found that developers using AI assistants wrote less secure code yet were more likely to believe it was secure. So the risk at the review gate isn't only speed; it's a false sense of security. If fast, fluent output makes the developer more confident, they scrutinize less.

Why it's wrong

A gate exists for a reason. It catches things. When you remove a gate, the things it was catching start reaching production. The question isn't "is the review slow?" The question is "what was the review catching that nothing else catches?"

Human code review catches at least three categories of problems that no other part of the pipeline addresses:

1. Properties nobody wrote tests for. Tests verify what someone THOUGHT to verify. Review catches what nobody thought to test — an architectural violation, a security assumption, a performance implication, a business logic error that doesn't have a test case because nobody anticipated it.

2. Compositional correctness. Tests verify individual components. Review is often the only place someone looks at how components interact — does this change break the implicit contract between module A and module B? Does this new endpoint introduce a dependency cycle? Does this database migration interact badly with the concurrent migration from the other team?

3. Design coherence. Tests verify behavior. Review verifies intent — is this the right approach? Does this change align with the architecture? Are we building the right thing, or just building a thing that passes tests? This is judgment, not verification. But it's judgment that prevents the codebase from drifting into incoherence over hundreds of changes.

Drop the review and all three categories reach production unchecked. Not immediately — the tests still catch the easy bugs. But the hard problems — the architectural drift, the composition errors, the untested security assumptions — accumulate invisibly.

Manny Lehman described this dynamic in 1980. His Second Law of Software Evolution states: "As an evolving program is continually changed, its complexity, reflecting deteriorating structure, increases unless work is done to maintain or reduce it." Code review is a primary mechanism for that active work. Removing it doesn't pause the entropy; it lets the entropy accumulate until the system becomes unmanageable. Lehman's laws are empirical observations rather than theorems, but the direction is consistent: without deliberate counter-work, structural decay is the default outcome, not a remote risk.

Rigby and Bird's 2013 empirical study of thousands of code reviews at Microsoft and in open-source projects ("Convergent Contemporary Software Peer Review Practices") found that the primary value of review isn't bug-finding. It is knowledge transfer, design improvement, and maintaining team-wide standards. Drop the review and you don't just miss bugs. You also erode the Theory Building (Naur, 1985) and the Conceptual Integrity (Brooks, 1975) of the system.

The boom

Month 1-3: The velocity spike. PRs merge without waiting for reviewers. Feature delivery accelerates visibly. The team ships more in three months than the previous six. Leadership celebrates. Metrics look great.

Month 4-6: The silent accumulation. Each AI-generated PR that skipped review carried a small number of decisions nobody examined. A variable naming convention drifted. An error handling strategy diverged between services. A retry policy was implemented differently in three modules. None triggers a test failure. Each is a micro-crack in the architecture.

Month 7-9: The first incident nobody can diagnose. A production failure in a code path nobody understands. The developer on call opens the file. It was AI-generated. It was never reviewed. Nobody built a mental model of how it works. The developer reads the code — it looks correct. The bug is in the interaction between this function and another function in a different service, also AI-generated, also never reviewed. Debugging takes three days. Writing the code took three minutes.

Month 10-12: The architecture change that can't be made. The team needs to refactor a core module. The module has been modified by AI agents dozens of times since anyone reviewed it. The tests pass. The code reads well. But nobody knows WHY it's structured the way it is. Nobody knows which behaviors are intentional and which are accidental artifacts of AI generation. The team is afraid to change it. Code that accumulated over months now can't be safely modified in months.

The deeper damage: ownership loss. When no human reviews the code, no human owns the code. The team discovers they aren't software engineers anymore — they're prompt operators, powerless to fix a system they didn't build and don't understand. The prompts generated the code. The code runs the business. Nobody in between can explain how.

Sarkar et al.'s 2024 study on developer experience with AI coding assistants ("What is it like to use a generative AI coding assistant?") found exactly this pattern: AI-assisted coding leads to "shallow understanding." Developers focus on the immediate fix and fail to build the deep mental model of how the change affects the rest of the system. If no human reviewed the code, the cognitive debt is at its maximum when the incident occurs.

This is the cognitive debt trajectory from Fallacy #1, now at the code level. The review was the only mechanism that built human understanding of AI-generated changes. Without it, the understanding was never formed. The debt compounds silently until it's called in — always during an incident, always at the worst possible time.

Three models of review

The industry is debating between two models. There are actually three.

Model 1 — Human reviews everything:
    AI generates code → Human reads every line → Ship

    ✓ Accurate (human judgment catches subtle issues)
    ✗ Slow (human is the bottleneck at 60x slower than AI)

    This is what teams had. It doesn't scale.

Model 2 — Nobody reviews:
    AI generates code → Tests pass → Ship

    ✓ Fast (pipeline moves at AI speed)
    ✗ Unsafe (properties nobody tested reach production unchecked)

    This is what teams are moving toward. It breaks at Month 7.

    Note: "AI code review" feels like Model 1 but is actually Model 2
    with a false sense of security. As Fallacy #3 showed, the AI
    reviewer has the same failure modes as the AI generator. The
    safety profile is Model 2. The confidence level is Model 1.
    That mismatch is where the damage compounds.

Model 3 — Specification gate reviews:
    AI generates code → Specification gate verifies → Ship
    Human reviews SPECIFICATIONS, not code

    ✓ Fast (gate operates at machine speed)
    ✓ Accurate (properties verified mechanically, exhaustively)
    ✓ Sustainable (human effort scales with specifications, not code volume)

    This is what every safety-critical domain converged on.

Model 2 looks like an upgrade from Model 1. It's actually a downgrade — it removed the safety mechanism without replacing it. The team traded accuracy for speed, calling it optimization.

Model 3 is the actual resolution. It doesn't compromise. It separates the two requirements:

ACCURACY operates on specifications (small, stable, human-authored, reviewed deliberately)
SPEED operates on code verification (fast, exhaustive, mechanical, every change)

Specification doesn't mean a 100-page requirements document. It means any machine-verifiable artifact of intent that already exists in your codebase: a TypeScript interface, an OpenAPI definition, a Protocol Buffer schema, a SQL migration, a Semgrep rule, a database constraint. Small. Stable. Human-authored. Machine-enforced.

The human reviews the rules of the game. The machine reviews every move to ensure the rules weren't broken.
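A minimal sketch of what "the machine reviews every move" can look like, assuming a hypothetical layering rule that code in an `api` package must not import a `db` package directly. The rule table `FORBIDDEN_IMPORTS` and the function `find_violations` are illustrative names, not part of any real tool; the only dependency is Python's standard `ast` module.

```python
import ast

# Human-reviewed specification (hypothetical rule): which top-level packages
# a given package is NOT allowed to import directly.
FORBIDDEN_IMPORTS = {
    "api": {"db"},
}


def find_violations(package: str, source: str) -> list[str]:
    """Return the forbidden modules imported by `source`, which lives in `package`."""
    forbidden = FORBIDDEN_IMPORTS.get(package, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Compare on the top-level package, so "db.models" counts as "db".
            if name.split(".")[0] in forbidden:
                violations.append(name)
    return violations


if __name__ == "__main__":
    generated = "import json\nfrom db.models import User\n"
    print(find_violations("api", generated))  # ['db.models']
```

The human reviews the few lines in `FORBIDDEN_IMPORTS`. The check runs on every change, no matter how many files were generated.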

How every safety-critical domain got here

No safety-critical domain uses Model 1 or Model 2. Every one uses Model 3. They arrived there independently when their version of "AI-speed generation" exceeded human review capacity:

Aviation: Jet aircraft became fast enough that pilots couldn't react to every condition. Model 1 (pilot monitors everything) was too slow. Model 2 (remove the pilot) was too dangerous. Model 3: fly-by-wire envelope protection. The pilot reviews the FLIGHT PLAN (specification). The computer enforces the FLIGHT ENVELOPE (mechanical gate). With envelope protection active, the pilot can't stall the aircraft even if they try; the specification gate overrides the input. This isn't ad hoc. DO-178C (the standard for flight software certification) requires that requirements (specifications) be reviewed by humans for intent, and that code be verified against those requirements with systematic, tool-supported analysis: structural coverage analysis, data and control coupling analysis, and, under its supplements, formal methods. Assurance does not rest on a human reading each line and judging it in isolation. It rests on the reviewed specification of what the code must do, and on verification of the code against it.

Nuclear operations: Reactor dynamics happen faster than operators can track every parameter. Model 1 (operator monitors all parameters) was too slow. Model 2 (remove operator oversight) was too dangerous. Model 3: automated protection systems. The operator reviews the PROCEDURES (specification). The interlocks enforce PARAMETER LIMITS (mechanical gate). The reactor scrams automatically if parameters exceed the envelope — regardless of operator input.

Financial trading: Algorithmic execution happens in microseconds. Model 1 (human reviews every trade) was too slow. Model 2 (no review) caused flash crashes. Model 3: pre-trade risk checks. The risk manager reviews the RISK LIMITS (specification). The system enforces them on EVERY TRADE (mechanical gate). No order that violates the limits reaches the exchange — regardless of what the algorithm generated.
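A pre-trade check of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular exchange or broker implements it: the `RiskLimits` fields, the `Order` shape, and the `check_order` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskLimits:
    """Human-reviewed specification (field names and values are illustrative)."""
    max_order_quantity: int
    max_notional: float  # quantity * price, in account currency


@dataclass(frozen=True)
class Order:
    symbol: str
    quantity: int
    price: float


def check_order(order: Order, limits: RiskLimits) -> list[str]:
    """Return the reasons an order must be blocked; an empty list means it may pass."""
    reasons = []
    if order.quantity <= 0:
        reasons.append("quantity must be positive")
    if order.quantity > limits.max_order_quantity:
        reasons.append("quantity exceeds max_order_quantity")
    if order.quantity * order.price > limits.max_notional:
        reasons.append("notional exceeds max_notional")
    return reasons
```

The gate has no opinion about why the algorithm produced the order. It only checks the order against limits a human signed off on.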

Google monorepo: 2 billion lines of code. Model 1 (human reviews every change to every dependency) was too slow. Model 2 (merge without review) would break the monorepo. Model 3: automated testing + API contract enforcement. The team reviews INTERFACE CONTRACTS (specification). CI enforces them on EVERY CHANGE (mechanical gate). A large-scale change touching millions of lines can merge because every affected test passes mechanically. The code review chapter of Software Engineering at Google (Winters, Manshreck, and Wright, 2020) describes the same split: mechanical checks (linters, tests) are automated, while review of design and intent remains the gate that keeps the system coherent. Even one of the world's largest codebases doesn't drop review; it moves review to the highest level of abstraction.

The pattern repeats: when generation speed exceeds human review capacity, the review doesn't disappear. It splits into two activities at two speeds. The human does the slow, high-judgment work (reviewing specifications). The machine does the fast, exhaustive work (verifying code against specifications).

The TRIZ contradiction that forces Model 3

This isn't a preference. It's a resolution of a physical contradiction.

The review must be ACCURATE — it requires human domain expertise to judge whether the code is correct, whether the architecture is sound, whether the security properties hold.

The review must be FAST — human review speed is the bottleneck that prevents the team from capturing AI's productivity gains.

Same system. Opposite requirements. TRIZ's separation principle: if a system must be simultaneously A and not-A, separate the contradiction across different artifacts.

ACCURATE (human speed):
    Human authors specification    → small, slow, requires expertise
    Human reviews specification    → infrequent, high-judgment

FAST (machine speed):
    Machine verifies code          → instant, exhaustive, every change
    Machine blocks violations      → deterministic, no judgment needed

Both requirements satisfied. Neither compromised. The accuracy lives in the specification review. The speed lives in the mechanical verification. Different artifacts, different speeds, no contradiction.

What you can do this week

If you've already dropped human review (Model 2):

1. Identify what the review was catching. Look at your last 20 review comments before you dropped the process. Categorize: how many were about properties (should always be true), how many about composition (interaction between components), how many about design (is this the right approach)? The specification gate replaces the property-category comments.

2. Convert one review comment into a specification. "This endpoint must always require authentication" — that's a specification. Add it as a CI check. Mechanical. Deterministic. Every PR. You just restored one piece of what the review was providing, at machine speed.

3. Keep a "review debt" log. Every time an incident occurs in code that was never reviewed, log it. Track the category. After a quarter, the log tells you exactly which specifications you need. The incidents write the specification backlog for you.
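Step 2 can be sketched as follows, assuming the application's route table can be exported as plain data. How that export works depends on the web framework and is not shown here. `Route`, `PUBLIC_PATHS`, and `unauthenticated_routes` are hypothetical names for illustration.

```python
from dataclasses import dataclass

# Human-reviewed specification (hypothetical): the only paths allowed to skip auth.
PUBLIC_PATHS = frozenset({"/health", "/login"})


@dataclass(frozen=True)
class Route:
    method: str
    path: str
    requires_auth: bool


def unauthenticated_routes(routes: list[Route]) -> list[Route]:
    """Return routes that skip authentication without being on the public allowlist."""
    return [r for r in routes if not r.requires_auth and r.path not in PUBLIC_PATHS]


def main(routes: list[Route]) -> int:
    """Exit code for CI: 0 when the specification holds, 1 otherwise."""
    offenders = unauthenticated_routes(routes)
    for route in offenders:
        print(f"missing auth: {route.method} {route.path}")
    return 1 if offenders else 0
```

Adding a new public endpoint now requires editing `PUBLIC_PATHS`, which is exactly the small, deliberate change a human should review.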

If you still have human review (Model 1):

1. Identify the bottleneck reviews. Which reviews take the longest? Which ones block the most PRs? These are the candidates for Model 3 — convert the reviewer's judgment into specifications and enforce them mechanically.

2. Convert one slow review into a fast gate. The reviewer who checks "does this PR conform to our API contract" — replace that with a contract test. The reviewer who checks "does this migration have a rollback" — replace that with a CI check for rollback scripts. Each conversion speeds up the pipeline without removing the safety mechanism.

3. Measure reviewer time on specifications vs. code. If the reviewer spends 80% of their time checking properties (mechanical work) and 20% on design judgment (human work), moving the mechanical work to CI leaves 20% of the original review time: up to a 5x speedup (1 / 0.2) in review throughput. The reviewer focuses on the 20% that requires human judgment. The machine handles the 80% that doesn't.
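The rollback check from step 2 might look like this. It is a sketch that assumes migrations are stored as paired files named `<name>.up.sql` and `<name>.down.sql`; that naming convention and the function name `migrations_missing_rollback` are assumptions, so adapt them to whatever your migration tool uses.

```python
def migrations_missing_rollback(filenames: list[str]) -> list[str]:
    """Return the `.up.sql` files that have no matching `.down.sql` file."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        if name.endswith(".up.sql"):
            # Swap the suffix to get the expected rollback file name.
            rollback = name[: -len(".up.sql")] + ".down.sql"
            if rollback not in names:
                missing.append(name)
    return missing
```

In CI, the job would list the migration directory, pass the file names to this function, and fail the build if the result is non-empty.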

The bottleneck is real. The review is slow. But removing the review is removing the brakes. The resolution is to move the review to the right level — specifications for humans, code verification for machines — and let each operate at its natural speed.

Next in the series: Fallacy #5, "Better Context Prevents Hallucination." Why improving input quality doesn't guarantee output correctness, and why verification must check the output, not just improve the input.

The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.

References

Lehman, M.M. (1980). "Programs, Life Cycles, and Laws of Software Evolution." Proceedings of the IEEE, 68(9), 1060–1076. Second Law: complexity increases unless active work is done to reduce it.
Rigby, P.C. & Bird, C. (2013). "Convergent Contemporary Software Peer Review Practices." ESEC/FSE 2013. Empirical study: review's primary value is knowledge transfer and design improvement, not bug-finding.
Perry, N. et al. (2023). "Do Users Write More Insecure Code with AI Assistants?" ACM CCS 2023. Stanford study: AI-assisted developers write less secure code while being more likely to believe it is secure.
Sarkar, A. et al. (2024). "What is it like to use a generative AI coding assistant? An interpretative phenomenological analysis." Microsoft Research. AI coding leads to shallow understanding and missed system-level effects.
DO-178C (2011). "Software Considerations in Airborne Systems and Equipment Certification." RTCA/EUROCAE. Aviation software standard: human-reviewed requirements, machine-verified code.
Winters, T., Manshreck, T. & Wright, H. (2020). Software Engineering at Google. O'Reilly. Chapter on code review: mechanical checks automated, design and intent review is the coherence gate.
Naur, P. (1985). "Programming as Theory Building." Microprocessing and Microprogramming, 15(5), 253–261.
Brooks, F.P. (1975). The Mythical Man-Month. Addison-Wesley. Conceptual integrity as the most important consideration in system design.
