Something breaks when AI starts writing most of your code.
Not the code itself -- that's often fine. What breaks is how you know it's fine.
The old trust model
For decades, the trust model for software has been code review. A human reads the diff, reasons about what it does, checks it against what was intended, and either approves or pushes back. The reviewer's judgment is the quality gate.
This works when a human writes code at human speed. The reviewer can match the pace of code production. They can hold the intent and implementation in their head simultaneously. They have time to think about edge cases, architectural fit, and whether the change actually solves the stated problem.
What changes with AI
AI generates code faster than humans can meaningfully inspect it. This isn't a theoretical concern -- it's the daily experience of anyone using Cursor, Copilot, or Claude Code on non-trivial projects.
The practical result:
- Volume overwhelms review. A 200-line AI-generated diff takes the same review effort as a 200-line human diff, but AI produces them in seconds. The bottleneck shifts entirely to the reviewer.
- Plausibility replaces correctness. AI output looks reasonable. It passes a quick scan. It often compiles and even passes existing tests. But "looks reasonable" is not the same as "does what was intended."
- Intent becomes invisible. When a human writes code, the reviewer can usually reconstruct intent from the diff -- naming choices, structure, comments reveal what the author was thinking. AI-generated code has no intent. It has statistical patterns.
- Cumulative drift goes unnoticed. Each AI-generated change might be individually acceptable. Over weeks, the codebase drifts in ways nobody explicitly chose. Architecture erodes. Conventions fragment. Nobody can explain why the system looks the way it does.
The honest version: most developers using AI assistants have already stopped doing thorough code review of AI output. They scan it, run the tests, and merge. The old trust model is already broken in practice. We just haven't replaced it with anything explicit.
The wrong response
One common response is to try harder at code review. Slow down. Read every line. Understand every choice.
This doesn't work for the same reason manual testing didn't scale -- it requires effort proportional to output volume, and the volume is growing. You can't outrun AI generation speed with human review effort.
Another response is to trust the AI. It's getting better. It mostly works. Ship it.
This works until it doesn't. Until the AI confidently implements something you didn't ask for, or subtly changes behavior in a way that passes tests but violates a business constraint that was never written down. The cost of these failures grows with system complexity.
What replaces code review?
The question isn't whether AI can write good code. It often can. The question is: how do you know the code does what it should?
Code review answered this by having a human verify intent against implementation. If that doesn't scale, something else needs to carry the trust.
The answer is contracts and proof.
- A contract defines what must be true -- in concrete, testable terms, written before implementation.
- Proof demonstrates that the contract holds -- through automated tests, quality checks, and verification that can run without human judgment.
This isn't new. It's the spec-and-test model that's been around for decades. What's new is that with AI doing the implementation, this model shifts from "nice to have" to "necessary." When you can't rely on reviewing implementation, you must be able to rely on verifying outcomes.
Spec first. Proof second. Code third.
The operating model is simple:
- Spec first. Define what the change must accomplish. Write acceptance criteria in concrete, testable terms. Be explicit about what's in scope and what's not.
- Proof second. Map each criterion to automated verification. Write or update tests before implementation. Define the quality gates that must pass.
- Code third. Implementation comes last and is free to change as long as the proof still holds. AI can generate it, refactor it, rewrite it entirely -- the contract and proof are what matter.
The human role shifts. You're no longer primarily a code reviewer. You're a contract designer and proof architect. You decide what must be true. You design how to verify it. AI handles the implementation.
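A minimal sketch of that ordering in Rust. The function `parse_duration` and its acceptance criteria are hypothetical, invented for illustration; the point is that the assertions are the contract, and they existed before the function body did:

```rust
// Code third: this body is free to be AI-generated, refactored, or
// rewritten entirely -- only the assertions below have to keep holding.
fn parse_duration(s: &str) -> Option<u64> {
    let secs = s.strip_suffix('s')?;
    secs.parse().ok()
}

fn main() {
    // Spec first, proof second: these assertions were written before
    // the implementation above existed, and initially failed.
    assert_eq!(parse_duration("90s"), Some(90)); // AC: "90s" parses to 90 seconds
    assert_eq!(parse_duration("abc"), None);     // AC: malformed input is rejected
    println!("contract holds");
}
```

The implementation can change however it likes; acceptance is defined entirely by whether the assertions still pass.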
Two verification lanes
One lesson from working this way: there are two kinds of trust you need, and they require different verification.
Feature verification proves the system does what the spec says. This is the obvious one -- acceptance tests, integration tests, behavioral scenarios. Did the feature work? Does it handle the edge cases? Does it respect the boundaries?
System verification proves the codebase remains healthy. This is the one that erodes silently -- linting, type checking, security scanning, architectural constraints, dependency hygiene. The system can pass every feature test and still be rotting structurally.
Both are mandatory. Passing feature tests is not enough if the code violates architecture or introduces security vulnerabilities. Passing quality gates is not enough if the required behavior is missing.
When AI writes code, it optimizes for making tests pass. It does not inherently care about architectural consistency, security posture, or maintainability. Both lanes need automated enforcement, not just one.
What this looks like in practice
A concrete example. You're adding a health endpoint to a Rust web service. Under the old model, you'd write the code, write a test or two, open a PR, and a colleague reviews it.
Under spec-driven development:
- Spec: "The service exposes GET /health returning 200 with {"status": "ok"}. The endpoint requires no authentication. Response time must be under 50ms."
- Verification map: AC-1 maps to an integration test. AC-2 maps to an auth-bypass test. AC-3 maps to a performance assertion.
- Tests: Written or updated to match the spec. They may initially fail.
- Implementation: AI generates the handler. The tests pass or they don't. If they pass, you don't need to read the handler line by line. The contract is satisfied.
- Quality gates: rustfmt, clippy, cargo check, security scan -- all green. The system is still healthy.
The AI's implementation is accepted based on passing proof, not on how convincing the code looks.
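A sketch of what that proof can look like, using only the standard library. The handler signature is deliberately simplified -- a real service would mount it through a web framework -- and the names are illustrative:

```rust
use std::time::Instant;

// Hypothetical handler the framework would mount at GET /health;
// it returns (status code, body) rather than a framework response type.
fn health() -> (u16, &'static str) {
    (200, r#"{"status": "ok"}"#)
}

fn main() {
    // AC-1: returns 200 with {"status": "ok"}.
    let (status, body) = health();
    assert_eq!(status, 200);
    assert_eq!(body, r#"{"status": "ok"}"#);

    // AC-3: responds in under 50ms. (AC-2, the auth-bypass test, needs
    // the real routing layer and is omitted from this sketch.)
    let start = Instant::now();
    let _ = health();
    assert!(start.elapsed().as_millis() < 50);
    println!("contract satisfied");
}
```

If these assertions pass, the contract is satisfied whether the handler body was written by a human or generated by AI.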
What good looks like
- Specs talk about user outcomes and system behavior, not implementation.
- Tests are traceable to requirements. You can point at any test and say which acceptance criterion it proves.
- CI can show which requirements are proven by which checks.
- Refactoring is safe because verification protects intent, not implementation.
- AI can move fast without turning the codebase into an unreviewable black box.
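One way to make that traceability enforceable is a small CI step that fails when a criterion has no test. A minimal sketch, with the criterion IDs and test names invented for illustration; a real setup would read both from files or test metadata rather than hardcoding them:

```rust
fn main() {
    // The spec's acceptance criteria and the suite's test names.
    let criteria = ["AC-1", "AC-2", "AC-3"];
    let test_names = [
        "ac_1_health_returns_ok",
        "ac_2_health_requires_no_auth",
        "ac_3_health_responds_under_50ms",
    ];

    // Fail the build if any criterion has no test whose name references it.
    for ac in criteria {
        let slug = ac.to_lowercase().replace('-', "_"); // "AC-1" -> "ac_1"
        assert!(
            test_names.iter().any(|t| t.contains(slug.as_str())),
            "no test proves {ac}"
        );
    }
    println!("every criterion is proven by at least one test");
}
```

The convention is crude, but it turns "tests are traceable to requirements" from a habit into a gate the pipeline can enforce.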
Anti-patterns
Some things look like spec-driven development but aren't:
- Acceptance criteria that describe implementation. "Use a HashMap for caching" is not a contract. "Repeated queries for the same key return cached results within 5ms" is.
- Tests with no requirement linkage. A test suite that passes tells you something. A test suite where each test maps to a specific acceptance criterion tells you much more.
- Code coverage as a proxy for contract coverage. 90% line coverage means nothing if the untested 10% contains the critical business logic. Requirement traceability matters more than coverage percentages.
- Quality checks treated as optional. If linting, type checking, and security scanning aren't mandatory gates, they'll erode when AI accelerates output.
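To make the first anti-pattern concrete, here is the caching requirement expressed as a behavioral contract, with all names invented for illustration. The test in main never mentions HashMap; the implementation happens to use one, but is free not to:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical cache. That it is backed by a HashMap is an
// implementation detail the contract never mentions.
struct Cache {
    inner: HashMap<String, String>,
}

impl Cache {
    fn new() -> Self {
        Cache { inner: HashMap::new() }
    }

    fn get_or_compute(&mut self, key: &str) -> String {
        self.inner
            .entry(key.to_string())
            .or_insert_with(|| expensive_lookup(key))
            .clone()
    }
}

// Stand-in for a slow query; the delay is what makes the contract meaningful.
fn expensive_lookup(key: &str) -> String {
    std::thread::sleep(Duration::from_millis(20));
    format!("value-for-{key}")
}

fn main() {
    let mut cache = Cache::new();
    let first = cache.get_or_compute("user:42"); // slow path, populates the cache

    // Contract: repeated queries for the same key return cached
    // results within 5ms.
    let start = Instant::now();
    let second = cache.get_or_compute("user:42");
    assert!(start.elapsed() < Duration::from_millis(5));
    assert_eq!(first, second);
    println!("caching contract holds");
}
```

Swap the HashMap for an LRU or an external store and the test still proves the requirement; a test asserting "uses a HashMap" would break on every such refactor while proving nothing about behavior.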
The honest version
This approach has real costs. Writing specs takes time. Maintaining verification maps is overhead. When you're moving fast on a solo project, it can feel like ceremony.
The tradeoff is explicit: you invest in contracts and proof upfront so you can trust AI output without reading every line. Whether that tradeoff pays for itself depends on your context -- system complexity, team size, how much AI is generating, how costly a behavioral mistake would be.
For a weekend project, it's probably overkill. For a system you're building to last, where AI is writing most of the code, the question isn't whether you need a trust mechanism beyond code review. The question is what that mechanism should be.
This series explores one answer.