Code Reviews: The Part of the Loop Almost Nobody Tracks

Pull requests aren't what they used to be.

A PR used to be a story. Someone planned a thing, broke it into pieces, made a few commits as they figured out the shape of the work, opened a PR, got feedback, iterated. You could read it. You could reason about it. The diff had a narrative.

Now a lot of PRs land as one commit, a couple hundred lines touching a bunch of files, generated in an afternoon. No plan, no decomposition, just "here's the thing, please review." If you're lucky, the description was written by a human. If you're unlucky, that was generated too.

Generation got a hundred times cheaper, review didn't.


The Metric Trap

Some of what I'm about to say isn't new. Addy Osmani wrote a great piece earlier this year covering a lot of the same territory, with a focus on the practical side: the PR contract, how solo and team workflows diverge, concrete principles for keeping AI review tools useful instead of noisy. If you want the field manual, read his post. I'm trying to do something different here. I want to step back and look at why we ended up in this situation in the first place, and what the metrics we've chosen to measure are doing to the people who carry the cost.

A lot of companies are now measuring AI adoption by tokens spent, which should make you tilt your head.

Tokens. Spent. As if the goal of buying a tool is to use as much of it as possible.

You bought a drill. The value of the drill isn't the number of holes you put in the wall. It's that the holes you actually needed got drilled faster and in the right places. A team that measures success by hole count ends up with a wall full of holes and nothing hanging on it. Drill enough of them in the wrong places and you start weakening the wall itself.

Tokens measure activity, not outcomes. They tell you a developer typed something into an agent. They don't tell you whether the agent built the right thing, whether the developer understood what came back, or whether the resulting code is going to be readable in six weeks. A developer who writes a tight prompt, gets a focused diff, and ships a clean PR will burn fewer tokens than someone who spirals through fifteen "no, do it differently" cycles before pushing the wreckage into a branch. Guess which one looks more productive on the dashboard.

Tokens aren't useless. They're a fine conversation starter. If a developer is spending nothing, that's worth a chat: do they not need it, or do they not know how to use it? If a developer is spending a lot, that's also worth a chat: are they actually doing more work, or is their workflow a mess that needs cleaning up? Either direction, the number is the prompt, not the answer.

If you want this argument made with a much bigger dataset behind it, the 2025 DORA report (now renamed State of AI-assisted Software Development) goes there explicitly. It calls out lines-of-code-generated-by-AI and AI-suggestion-acceptance-rate as vanity metrics, and the follow-up telemetry from partners across more than twenty thousand developers shows what happens when nobody is looking at the other half of the loop: PR size up around 50%, review times up several times over, and incidents per PR climbing in step. The shape of the problem is the same one I'm describing. The only thing my version adds is a name for the failure mode and a less polite tone.

The moment you put tokens on a leaderboard, you've stopped measuring AI adoption and started measuring AI noise. The noise has to go somewhere. It goes into the review queue.


Where the Cost Lands

I wrote a few weeks ago about how AI broke the junior pipeline. The entry-level work that used to turn juniors into mediors got automated away, and companies eliminated those roles for short-term cost savings while quietly building toward a cliff. The seniors who remain are stretched thin, mentoring nobody because there's nobody to mentor.

Code review is where that bill comes due.

The old apprenticeship had a shape. Juniors wrote docs and small implementations, mediors reviewed them and corrected the obvious stuff, seniors got involved when something was actually hard. Knowledge moved up the chain because work moved up the chain. Reviews were how you learned what "good" looked like.

That ladder is gone. It didn't fall, it got squeezed. Mediors are doing the work juniors used to do, with AI in the loop. Seniors are reviewing the work mediors used to handle. There's nobody underneath catching the obvious stuff anymore, because there's nobody underneath. Every AI-generated PR from every level of the org eventually lands on the same small group of senior reviewers who were already the most expensive, most overcommitted people in the building.

Those PRs aren't the focused, narratively coherent diffs that used to come up the chain. They're a few hundred lines dumped in one commit, generated faster than anyone wants to read them, with commit messages like "implement feature."

Generation went up, review capacity didn't, and the asymmetry has a name. The name is "the senior who used to enjoy their job."


The Escape Hatch That Isn't

Faced with a queue that won't stop growing, reviewers do the obvious thing: they reach for AI to help review.

I'm sympathetic. I've done it. AI is genuinely useful for the mechanical parts of review: style, dead code, missing null checks, the test that doesn't actually test the thing it claims to test, the obvious off-by-one. If you treat it like a fast, tireless intern that only catches the easy stuff, it earns its keep.
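
To make that "test that doesn't actually test the thing" concrete, here's a contrived Python example, entirely my own illustration rather than anything from a real codebase. It's exactly the tier of problem a tireless mechanical reviewer is good at flagging:

```python
def apply_discount(price: float, pct: float) -> float:
    # Hypothetical production code under review.
    return price * (1 - pct / 100)


def test_apply_discount():
    price, pct = 100.0, 20.0
    # Looks like coverage, proves nothing: the expected value is
    # computed with the same formula as the implementation, so the
    # assertion holds even if the formula itself is wrong.
    assert apply_discount(price, pct) == price * (1 - pct / 100)
```

A bug in the formula sails straight through, because both sides of the assertion share it.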

The problem is the part it can't do.

I've been thinking about this HN thread on AI porting code from one language to another, in the context of the Bun rewrite from Zig to Rust. The mechanical translation is the easy part: syntax, structure, type mappings. The hard part is semantic equivalence. Does this new version actually behave the same way under load, at edge cases, in the weird corners nobody documented? You only know if your port is correct by running both forever and diffing the output.
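
To make "diff the output" concrete, here's a minimal differential-testing harness in Python. The binary names and the one-case-per-stdin protocol are my assumptions for illustration, not anything from the Bun discussion:

```python
import subprocess
import sys

OLD_BIN = "./impl_old"  # hypothetical: the original build (e.g. Zig)
NEW_BIN = "./impl_new"  # hypothetical: the ported build (e.g. Rust)


def run(binary: str, case: bytes) -> tuple[int, bytes]:
    # Feed one test case on stdin, capture exit code and stdout.
    proc = subprocess.run([binary], input=case,
                          capture_output=True, timeout=10)
    return proc.returncode, proc.stdout


def diverging_cases(corpus: list[bytes]) -> list[int]:
    # Indices of cases where the two builds disagree. "Equivalence"
    # here is only same exit code and same bytes on stdout; timing,
    # memory, and behavior under load are invisible to this harness.
    return [i for i, case in enumerate(corpus)
            if run(OLD_BIN, case) != run(NEW_BIN, case)]


if __name__ == "__main__":
    # A real corpus would be fuzzer-generated or replayed from prod;
    # here, one test case per line of stdin.
    corpus = [line.encode() for line in sys.stdin]
    bad = diverging_cases(corpus)
    print(f"{len(bad)}/{len(corpus)} cases diverged: {bad[:20]}")
```

Even this only covers the inputs you thought to feed it. The weird corners nobody documented stay invisible until something in production finds them.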

Code review has the same shape. The mechanics are the easy part. The hard part is the stuff that isn't written down anywhere: business rules, product intent, the invariant in module X that everybody on the team knows about but nobody put a comment on, the reason we did it this weird way three sprints ago. AI doesn't know any of that. It can't. It hasn't sat in the planning meeting where you argued for an hour about whether this customer segment counts.

When an AI review comes back clean, the senior reviewer is actually in a worse position than before. The PR has been vetted, sort of. The mechanical stuff is fine. But the actual question, should this code exist, in this shape, doing this thing, is still entirely on them. They still have to go through every file the way they always did, because the bot can't tell them which changes matter. The job didn't get easier. It just got harder to justify how long it takes, because on paper they had help.

It's AI all the way down, except at the one layer where judgment is non-negotiable. That layer is exactly where seniors live.


The Boring Fix

I'm an advocate for boring solutions that actually work, so here's a boring one: stop making the PR the unit of AI work.

These dumps exist because people skipped the planning stage, or did a half-hearted plan and then let the agent write everything in one go. That's a workflow problem, not an AI problem. AI is perfectly capable of doing focused, decomposed work. It just doesn't do that by default, and most teams aren't asking it to.

The workflow that actually works, in my experience, looks like this:

  • Plan first, with the agent, in writing.
  • Decompose the plan into small, independently-shippable tasks.
  • Run those tasks separately: different agent runs, different commits, different PRs if the work is big enough.

Each commit does one thing and is small enough that a human can actually read it. The reviewer gets a sequence of focused changes with a coherent story, not a pile of unrelated changes under one commit message.

This is unsexy. It's slower per-feature than letting the agent freestyle. It will not impress anyone who is counting tokens. But it produces PRs that a senior can review meaningfully in one sitting, and it produces commit history that someone can actually use to debug a regression six months from now.

It also, not coincidentally, looks a lot like how good engineers worked before AI. Plan, decompose, small commits, clear messages. The AI didn't change what good practice looks like. It just made the cost of skipping good practice invisible to anyone who isn't doing the reviewing.


What To Actually Measure

If you're in a position to set metrics, here are some that aren't tokens:

  • Review load per senior. PRs assigned, lines reviewed, hours spent. The hours part is the catch: it usually relies on self-reporting, which is just more work piled on the people you're trying to protect. PRs and lines you can pull from the git host (a sketch of doing exactly that follows this list). Hours you have to ask for, and asking honestly is its own can of worms.
  • Average PR size. Trending up is a signal that decomposition is breaking down.
  • Time-to-first-review and time-to-merge. Going up usually means the queue is winning, but read these carefully. In a multi-team setup, a PR can sit in "approved but not merged" for days because another team's piece isn't ready yet. That's a coordination signal, not a review-capacity one. Time-to-first-review is the cleaner of the two.
  • Revert rate and post-merge bug rate. These are the honest tax on speed. If they climb while the productivity dashboard keeps going up and to the right, the productivity dashboard is measuring the wrong thing.
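
The first three are scriptable against the git host. Here's a rough sketch against the GitHub REST API, assuming `requests` is installed, a token in the `GITHUB_TOKEN` environment variable, and a hypothetical `acme/backend` repo. It deliberately skips "hours spent" (self-reported, as noted above) and revert rate (that lives in your incident tracker, not the PR list):

```python
import os
from collections import Counter
from datetime import datetime

import requests

OWNER, REPO = "acme", "backend"  # hypothetical; point at your own repo
API = f"https://api.github.com/repos/{OWNER}/{REPO}"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}


def ts(stamp: str) -> datetime:
    # GitHub timestamps are ISO 8601 with a trailing Z.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


# Most recently updated closed PRs that actually merged.
resp = requests.get(f"{API}/pulls",
                    params={"state": "closed", "per_page": 50},
                    headers=HEADERS)
resp.raise_for_status()
merged = [pr for pr in resp.json() if pr["merged_at"]]

sizes, first_review_hrs, merge_hrs = [], [], []
review_load = Counter()  # reviews submitted, per reviewer

for pr in merged:
    n = pr["number"]
    # The list endpoint omits diff stats; fetch the single PR for them.
    detail = requests.get(f"{API}/pulls/{n}", headers=HEADERS).json()
    sizes.append(detail["additions"] + detail["deletions"])

    reviews = requests.get(f"{API}/pulls/{n}/reviews",
                           headers=HEADERS).json()
    submitted = [r for r in reviews if r.get("submitted_at")]
    for review in submitted:
        review_load[review["user"]["login"]] += 1
    if submitted:  # reviews come back oldest-first
        gap = ts(submitted[0]["submitted_at"]) - ts(pr["created_at"])
        first_review_hrs.append(gap.total_seconds() / 3600)

    total = ts(pr["merged_at"]) - ts(pr["created_at"])
    merge_hrs.append(total.total_seconds() / 3600)


def avg(xs):
    return sum(xs) / len(xs) if xs else 0.0


print(f"avg PR size:              {avg(sizes):.0f} lines changed")
print(f"avg time-to-first-review: {avg(first_review_hrs):.1f} h")
print(f"avg time-to-merge:        {avg(merge_hrs):.1f} h")
print("review load:", review_load.most_common(5))
```

Fifty PRs is a snapshot, not a trend; the numbers only get useful once you collect them weekly and watch the direction.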

None of these are perfect. All of them are better than tokens. If you want a more rigorous framework than four bullet points from a blog post, the DORA capabilities model is roughly where I'd point you: classic delivery metrics (lead time, deployment frequency, change failure rate, time to recovery), plus working in small batches as a first-class capability rather than a nice-to-have. It's not a perfect map for the AI era either, but it was built with care, and "build on top of DORA" is a much better starting point than "count tokens and hope."

If you're a developer reading this, the most useful thing you can do for your senior reviewers is to stop opening PRs you wouldn't want to review yourself. Plan the work. Break it up. Make the commits tell a story. Read your own diff before you ask anyone else to.

I should be honest about something. Nobody really knows how to measure this correctly yet. We're still in uncharted territory, and anyone who tells you they've got it figured out is selling something. The folks at the frontier labs occasionally claim they do, but their own products still ship with bugs, they have functionally unlimited tokens to throw at the problem, and every confident proclamation that software development has been "solved" has fallen flat on its face within a few months of being made. The metrics I've listed are guesses, informed by working through this myself. They're better than tokens because almost anything is better than tokens, but they're not the answer. The answer doesn't exist yet.


The companies that are going to handle this transition well are the ones that figure out the obvious thing. Generation and review are two-thirds of the same loop. The other third is feature definition, which is also getting drafted, expanded, and quietly bloated with AI help, but that's a completely different post. For now, the immediate problem is the two-thirds we already have. If you only measure one of them, you optimize for the wrong thing, and the unmeasured one eats your seniors.

The companies that don't figure it out will spend the next few years wondering why their best people are quietly checking out, why review queues never go down, why the codebase keeps getting weirder, and why nobody seems to know how anything works anymore.

I've got a feeling I'll be busy either way.
