Na'aman Hirschfeld (Goldziher)

Posted on Jul 5

Code review can't keep up with AI. Build a verification layer instead.

#ai #testing #devtools #softwarequality

There is a popular argument going around that code review no longer matters. The reasoning is that a model wrote the code, another model can check it, and a human reading the diff is a slow, expensive step you can now skip. I think that gets the problem exactly backwards.

Code review did not stop mattering. It stopped scaling. You can now generate code faster than any human can read it, let alone reason about it. The old model, where an engineer reading the diff is the last line of defense, quietly broke the moment a single afternoon started producing more change than a person can hold in their head. Skipping review does not fix that. It just removes the one check that was already failing and puts nothing in its place.

The answer is to move verification onto things that keep up. I have been building xberg, a document-intelligence engine with a Rust core and a dozen language SDKs in one repo, and this is the setup that lets me ship at generation speed without shipping garbage.

Static analysis and types do the cheap work

The first layer is everything a machine can check without running your program: linting, static analysis, type checks. This is the cheapest verification you will ever buy, and it catches whole classes of errors before a test ever runs. A type error, an unused binding, a dangerous cast, a violation of a lint rule you agreed on as a team: none of these needs a human to notice.

For a polyglot repo this is its own project. I ended up building polylint, two self-contained Rust binaries driven by one config, because wiring a separate linter and formatter per language into pre-commit was itself a maintenance tax. The point is not the tool. The point is that mechanical checks should run on every change, locally and in CI, and they should be fast enough that nobody is tempted to skip them.
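As a rough illustration of "one entry point, every change, fast enough not to skip", here is a minimal sketch of a check runner. It is not how polylint works; `run_checks` and the shape of the check list are hypothetical names made up for this example. It runs each command, records whether it passed and how long it took, and reports an overall verdict.

```python
import subprocess
import time


def run_checks(checks):
    """Run each (name, argv) pair in order.

    Returns (all_passed, results), where results is a list of
    (name, passed, seconds) tuples. A check passes when its
    command exits with status 0.
    """
    results = []
    for name, argv in checks:
        start = time.perf_counter()
        proc = subprocess.run(argv, capture_output=True, text=True)
        elapsed = time.perf_counter() - start
        results.append((name, proc.returncode == 0, elapsed))
    all_passed = all(passed for _, passed, _ in results)
    return all_passed, results
```

The same list of commands can be handed to this function from a pre-commit hook and from CI, so both places enforce identical checks, and the recorded timings show which check is getting slow enough to tempt people into skipping it.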

Comprehensive tests prove behavior

Static analysis tells you the code is well formed. It does not tell you the code is correct. That is what tests are for, and at generation speed the test suite is doing far more of the reviewing than any human is.

The tests that earn their keep here are end-to-end and integration tests. Unit tests are useful, but a wall of them can be green while the system as a whole does the wrong thing. What you want is tests that exercise real paths through the software the way a user would: real inputs, real boundaries, real failure modes. When I add a format to xberg's extraction pipeline, the test that matters is the one that runs a real file of that format end to end and checks the output, not the one that mocks the parser and asserts it was called.
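To make the difference concrete, here is a small sketch in Python. The `extract_rows` function is a hypothetical stand-in for one format handler, not xberg's API. The first test pushes real bytes through the real parser and checks the real output; the second mocks the parser and only asserts that it was called.

```python
import csv
import io
from unittest import mock


def extract_rows(data: bytes) -> list[dict[str, str]]:
    """Hypothetical extractor: decode CSV bytes into one dict per row."""
    text = data.decode("utf-8-sig")  # tolerate a UTF-8 byte-order mark
    return list(csv.DictReader(io.StringIO(text)))


def test_end_to_end():
    # Real input with a BOM, CRLF line endings, and a quoted comma.
    sample = b'\xef\xbb\xbfname,city\r\n"Doe, Jane",Berlin\r\n'
    assert extract_rows(sample) == [{"name": "Doe, Jane", "city": "Berlin"}]


def test_mocked():
    # Only proves a call happened. It stays green even if the
    # real parser mangles every file it is given.
    with mock.patch("csv.DictReader", return_value=iter([])) as reader:
        extract_rows(b"anything")
    reader.assert_called_once()
```

Both tests pass today, but only the first one would go red if decoding or quoting broke, which is the property you need when the change under review is too large to read.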

Comprehensive coverage of behavior is what lets you accept a large generated change with confidence. The suite is the reviewer that never gets tired and never skims.

Agents review against written guidelines

Here is where the "another model can check it" idea is actually right, as long as you give it something to check against. An agent doing a first-pass review is fast, tireless, and consistent, but only if the standard it reviews against is explicit and written down. Vague instructions produce vague reviews.

So write the guidelines: the conventions, the security rules, the patterns you want and the ones you have banned, the things a reviewer on your team would flag. I keep those rules in one place and generate the per-tool configs from that single source with ai-rulez, so Cursor, Claude, and Copilot all review against the same standard instead of three slightly different ones. An agent with clear guidelines catches a meaningful share of what a human reviewer would, at a speed a human never will, and it frees the human to look at the things that actually need judgment.
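The single-source idea can be sketched in a few lines. This is not ai-rulez's format or API; the rule entries, the tool names used as dictionary keys, and the `render` and `render_all` helpers are all hypothetical, and real tools each expect their own file name and layout.

```python
# One list of rules is the only place guidelines are edited.
# The rule texts here are invented examples.
RULES = [
    ("security", "Never log secrets or access tokens."),
    ("errors", "Return errors instead of panicking in library code."),
]


def render(tool: str, rules) -> str:
    """Render the shared rules as a plain-text guideline file for one tool."""
    lines = [f"# Review guidelines for {tool}", ""]
    for category, text in rules:
        lines.append(f"- [{category}] {text}")
    return "\n".join(lines) + "\n"


def render_all(tools, rules) -> dict[str, str]:
    """Produce one rendered config per tool from the same rule list."""
    return {tool: render(tool, rules) for tool in tools}
```

Because every output is derived from the same list, a rule added once shows up for every agent, and the configs cannot drift apart the way hand-maintained copies do.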

Then you actually test it, by hand

None of the above removes the need to run the software yourself. This is the step people are most tempted to skip now, and it is the one that catches what everything else misses.

A green pipeline is a signal. It is not proof that your software does what you think it does. Automated tests only check the things you thought to check. Running the thing yourself, exercising the real paths, poking at the edges, is how you find the assumption the whole test suite quietly shares with the bug. QA is not a box you tick after the checks pass. At generation speed it becomes one of the highest-leverage things an engineer does, because it is the only layer that brings human judgment to bear on software no human wrote line by line.

Teams that treat QA as a core engineering function, rather than a phase they graduated out of, are the ones this speed will not break.

Open source makes the code better, for free

There is one more layer, and it is the strongest one I know of. Open source your code and it gets tested at a scale you could never reproduce in-house. Thousands of people run it in environments you never imagined, on inputs you never thought to generate, and every bug they hit and report comes back and makes the code more robust. No internal test suite competes with real usage at that volume.

Auditability and good citizenship are real, but the benefit that changes the code itself is simpler. Public code is massively tested code, and massively tested code is better code. Every one of xberg's libraries is open source partly for this reason.

The shape of it

Put together, the verification layer looks like this: static analysis and types catch the cheap errors, comprehensive end-to-end and integration tests prove the behavior, agents do first-pass review against written guidelines, you test it by hand to catch what the machines missed, and open source turns your users into the largest test suite you will ever have.

Code review did not become optional. It became a system instead of a person. Build that system, and you can move as fast as the models let you without lying to yourself about whether it works.

Top comments (3)

Viktor • Jul 5

The "move verification onto things that keep up" framing is the right instinct - the diff-reading human was always the bottleneck, and pretending review still scales just papers over the gap. Where I'd push: at generation speed the tests are getting generated too, so "the suite is the reviewer that never gets tired" can quietly become a suite that's confidently, consistently green on the wrong behavior. A tireless reviewer with a blind spot just applies the blind spot faster.

The layer I'd add under yours: something that verifies the verifier has teeth. Mutation testing is the cheap version - flip a > to >=, delete a line, see if any test goes red. If nothing fails, that test is decoration, and a green wall of decoration is exactly what you don't want backing a large generated change. Same logic for the agent-review layer: the guidelines only hold if you occasionally feed it a known-bad diff and confirm it catches it, otherwise you're trusting a reviewer you've never calibrated.

How are you keeping the e2e suite honest when a chunk of it is itself AI-written - anything beyond coverage numbers?

Edu Peralta • Jul 5

Agreed that reviewing every line by hand does not scale, but I would push back on the word 'instead'. A verification layer tells you whether the code works, not whether it did the thing you actually asked for. I have watched an agent turn a red test green by quietly rewriting the test, and watched it solve the wrong problem flawlessly, both of which sail through CI. What still catches those for me is reading the diff for intent rather than correctness, so I run the verification layer AND skim the change, just at a higher altitude than line by line.

Alex Shev • Jul 6

Verification layers are becoming the real bottleneck for AI-assisted coding. Generation got cheap; proving that the generated change matches intent is still where teams spend trust.