The Hidden Cleanup Cost Behind AI Coding Velocity Promises

#meta #blogging #webdev

Every marketing page for an AI coding tool leads with the same number. GitHub Copilot cites up to 55% productivity gains. Individual task studies show developers completing isolated exercises twice as fast. These numbers are real — in the specific conditions they were measured. The problem is that the conditions rarely match how software is built at scale, over time, by teams. The speed at which code is generated is only one variable in a much longer equation, and it happens to be the easiest to measure.

The harder variables are what get ignored: how long it takes to review that code, how much of it gets written twice because the first version was wrong-shaped for the codebase, and how much slow-burn technical debt it deposits. When you total those up, the velocity story gets more complicated.

Where the cleanup actually lands

The most direct evidence of the hidden cost comes from code review. A LogRocket analysis found that senior engineers spend an average of 4.3 minutes reviewing AI-generated suggestions, versus 1.2 minutes for human-written code. That is roughly 3.6 times the per-suggestion review time, applied to a dramatically higher volume of suggestions. Teams with heavy AI adoption merged 98% more pull requests but saw review time increase 91%.

More PRs, each taking longer to review, with no corresponding improvement in throughput. That is not a productivity gain. That is a shifted bottleneck.

Part of the problem is structural. AI tools generate code optimized for a single metric: whether the developer accepts the suggestion. That tuning rewards plausibility at the diff level, not fit at the codebase level. A LogRocket breakdown of real pull requests found a REST endpoint that a human would write in 29 lines ballooning to 186 lines when AI-assisted — more than 6x the volume — and an error-handling refactor that added 10 lines by hand versus 272 lines via AI. All of it syntactically correct. All of it requiring a reviewer to ask: is this justified?

The reviewer role shifts from "what is missing?" to "is any of this necessary?" That is a genuinely harder question to answer quickly, especially in a codebase you do not own.

CodeRabbit's December 2025 analysis of 470 open-source pull requests found that AI-generated code produced roughly 1.7 times as many issues as human-written code (about 10.83 issues per PR versus 6.45 for human-authored PRs). For open-source maintainers, this asymmetry is brutal: a contributor can generate a low-quality patch in five minutes; a maintainer may spend hours tracing, explaining, and rejecting it.

Review overhead is not distributed evenly. It lands disproportionately on senior engineers, who are the people whose time is most expensive and whose attention is most constrained. If your team measures AI productivity by how fast junior developers ship first drafts, you are measuring the wrong end of the pipeline.

The debt that compounds quietly

Beyond immediate review cost, there is a longer-accumulating problem: AI-assisted code tends to produce patterns that make codebases harder to maintain over time.

GitClear's 2025 analysis — drawing on 211 million changed lines from repositories at Google, Microsoft, Meta, and enterprise companies over five years — found that the share of copy-pasted code blocks increased 8x during 2024, and that code churn (new code revised within two weeks) nearly doubled from 3.1% to 5.7%. That second metric matters because churn is a reliable proxy for premature or wrong-shaped code: something gets committed, then immediately revisited because it did not actually fit.

More striking: the share of "moved" lines (code that was refactored and reorganized) dropped from 24.1% in 2020 to just 9.5% in 2024. Refactoring is how codebases stay coherent over time. Its decline suggests AI tools are biased toward addition rather than reorganization. In GitClear's dataset, 2024 is the first year in which code duplication outpaced refactoring activity.

Google's 2024 DORA report, covering tens of thousands of developers, found that a 25% increase in AI adoption correlated with a 7.2% decrease in delivery stability. Delivery stability, which falls as more changes cause production incidents, is exactly the metric that does not show up in a demo or a lines-per-minute benchmark. It shows up three months later, in pages and postmortems.

These are not arguments against using AI tools. They are arguments for measuring what you are actually buying. Faster generation plus slower review plus higher debt accumulation can easily sum to a net negative, depending on your context.
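A back-of-the-envelope version of that sum, as a sketch. Every number below is hypothetical, chosen only to show how halving generation time can still raise the total. `ChangeCost` and its fields are names invented for this example, not figures from any of the studies cited above.

```python
from dataclasses import dataclass


@dataclass
class ChangeCost:
    """Expected hours spent on one change. All values are illustrative."""

    generation: float   # hours writing (or prompting for) the code
    review: float       # hours of reviewer time
    rework_rate: float  # assumed probability the change needs rework soon after merge
    rework: float       # hours spent if that rework happens

    def expected_total(self) -> float:
        # Generation and review are always paid; rework is paid with some probability.
        return self.generation + self.review + self.rework_rate * self.rework


# Hypothetical inputs, not measurements.
by_hand = ChangeCost(generation=4.0, review=1.0, rework_rate=0.10, rework=3.0)
assisted = ChangeCost(generation=2.0, review=3.0, rework_rate=0.20, rework=3.0)

for label, cost in (("by hand", by_hand), ("assisted", assisted)):
    print(f"{label}: {cost.expected_total():.1f} expected hours")
```

With these made-up inputs the assisted change comes out at 5.6 expected hours against 5.3 by hand, even though generation took half as long. Swap in your own team's figures; the point is the shape of the sum, not these particular values.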

Why the headline numbers miss this

Productivity studies in software development face a persistent measurement problem: the things that are easy to count are not the things that matter most. Lines of code, commits per day, pull request volume, task completion time in isolated exercises — these are observable. Code review time per PR, debt accumulation rate, future incident frequency attributable to today's commits — these require longer time horizons and more complex attribution.

The METR study published in July 2025 is one of the few that tried to close this gap. Sixteen experienced open-source developers completed 246 real tasks on large codebases, averaging one million lines of code, using Cursor Pro with Claude 3.5 Sonnet. Before starting, developers predicted AI would reduce their completion time by 24%. The measured result: AI increased completion time by 19%. The gap between expectation and outcome was 43 percentage points.

The researchers noted that this does not mean AI tools are useless in every context. The study covered complex, real-world tasks in large codebases, exactly the conditions where AI context limitations and review overhead bite hardest. Simpler, greenfield, or more tightly scoped work likely tilts the other way. But complex work in large codebases is also exactly the kind of work that takes up most of a senior engineer's day.

The perception gap matters beyond individual productivity. Teams and managers who believe AI is delivering outsized velocity may cut slack time, reduce review staffing, or accept larger PRs without adjusting expectations. The cleanup then arrives — diffused, delayed, and attributed to something else.

Accounting for it honestly

None of this means you should avoid AI coding tools. The genuine gains are real: documentation, boilerplate, test scaffolding, and isolated feature work in well-understood domains all benefit materially. The goal is honest accounting, not abstinence.

A few things that actually help:

Measure review time, not just generation time. If your team instruments CI/CD, add review duration per PR to the dashboard. A jump in review time that coincides with increased AI adoption tells you something. Treating generation velocity and review velocity as one number hides the transfer.
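As a rough sketch of what that dashboard number could look like, assume you can export pull requests as records with a review-request timestamp, an approval timestamp, and a flag for AI assistance. The field names (`review_requested_at`, `approved_at`, `ai_assisted`) are hypothetical and will differ by platform, and the sample data is invented. Note that this measures elapsed wall-clock time, which is only a proxy for reviewer attention.

```python
from datetime import datetime
from statistics import median


def review_hours(pr: dict) -> float:
    """Elapsed hours between review request and approval for one PR."""
    requested = datetime.fromisoformat(pr["review_requested_at"])
    approved = datetime.fromisoformat(pr["approved_at"])
    return (approved - requested).total_seconds() / 3600


def median_review_hours(prs: list[dict]) -> dict[bool, float]:
    """Median review duration, keyed by the (hypothetical) ai_assisted flag."""
    groups: dict[bool, list[float]] = {}
    for pr in prs:
        groups.setdefault(pr["ai_assisted"], []).append(review_hours(pr))
    return {flag: median(hours) for flag, hours in groups.items()}


# Invented sample data, for illustration only.
SAMPLE_PRS = [
    {"review_requested_at": "2025-03-03T09:00:00", "approved_at": "2025-03-03T11:00:00", "ai_assisted": False},
    {"review_requested_at": "2025-03-04T09:00:00", "approved_at": "2025-03-04T13:00:00", "ai_assisted": False},
    {"review_requested_at": "2025-03-05T09:00:00", "approved_at": "2025-03-05T15:00:00", "ai_assisted": True},
    {"review_requested_at": "2025-03-06T09:00:00", "approved_at": "2025-03-06T19:00:00", "ai_assisted": True},
]

result = median_review_hours(SAMPLE_PRS)
print(f"hand-written: {result[False]:.1f} h, AI-assisted: {result[True]:.1f} h")
```

The median is used rather than the mean so that one PR left open over a long weekend does not dominate the figure.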

Track churn rate on AI-assisted commits. The GitClear finding — that code written in 2024 was nearly twice as likely to be revised within two weeks — is measurable in your own repo. If AI-generated commits churn faster than hand-written ones, you are paying for generation twice.
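One way to express that churn metric, as a sketch: treat a line as churned if it is revised within two weeks of being added, then compare the churned share between AI-assisted and hand-written code. The `LineRecord` type and the sample records are hypothetical. Extracting per-line history from your repository, and deciding which commits count as AI-assisted, are left out here.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional


class LineRecord(NamedTuple):
    """One added line and, if it was later changed, the date of the first revision."""

    added_on: date
    revised_on: Optional[date]  # None if the line has not been revised


# Assumption: "within two weeks" includes a revision exactly 14 days later.
CHURN_WINDOW = timedelta(days=14)


def churn_rate(lines: list[LineRecord]) -> float:
    """Share of lines revised within CHURN_WINDOW of being added."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.revised_on is not None
        and line.revised_on - line.added_on <= CHURN_WINDOW
    )
    return churned / len(lines)


# Invented records, for illustration only.
added = date(2025, 1, 6)
assisted_lines = [
    LineRecord(added, date(2025, 1, 10)),  # revised after 4 days: churned
    LineRecord(added, date(2025, 1, 20)),  # revised after 14 days: churned
    LineRecord(added, date(2025, 3, 1)),   # revised much later: not churned
    LineRecord(added, None),               # never revised
]
hand_lines = [
    LineRecord(added, date(2025, 1, 9)),   # revised after 3 days: churned
    LineRecord(added, date(2025, 2, 15)),  # revised after 40 days: not churned
    LineRecord(added, None),
    LineRecord(added, None),
]

print(f"assisted churn: {churn_rate(assisted_lines):.0%}")
print(f"by-hand churn:  {churn_rate(hand_lines):.0%}")
```

The absolute rates matter less than the difference between the two groups measured the same way over the same period.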

Scope the task before prompting. The 186-line REST endpoint versus the 29-line one is not a failure of the AI model. It is a failure of the prompt. "Add an endpoint for X" produces more output than "Add an endpoint for X, matching the style of the existing /users endpoint, under 40 lines." Specificity is a skill that lowers cleanup cost.

Keep the reviewer-to-generator ratio sensible. Teams that let AI multiply their PR volume without adjusting review capacity will see the bottleneck compound. This is a staffing and process question, not just a tooling question.

Be honest about what you are actually measuring in productivity trials. If your internal pilot measures how fast developers complete greenfield tasks in a new repo, it is not measuring what will happen when those same developers maintain that code six months later.

The velocity AI coding tools offer is real, and in the right conditions it is meaningful. But velocity in generation is a narrow slice of what determines software cost. Cleaning up after that generation — reviewing, refactoring, debugging, and eventually undoing — is where a substantial part of the bill arrives, and it arrives on a delay that makes it easy to attribute elsewhere. The teams that account for it explicitly will make better decisions about when to lean on AI and when to reach for something slower and more deliberate.

Originally published at pickuma.com.