Everyone's seen the stats. "55% faster." "40% more code." "3x productivity." They end up in pitch decks and team retrospectives, and nobody really questions them because the conclusion feels right -- AI tools do feel helpful when you're using them.
But when we actually tried to find those numbers in real commit histories and PR patterns, we hit a wall.
Not because AI isn't changing how people write code -- it clearly is. But because measuring how much, and whether it's actually good, turns out to be a genuinely messy problem.
The obvious metric is the wrong one
The first instinct is output. More commits, more PRs, more lines of code. And yes, those numbers do go up.
But so do some less flattering ones:
- Average commit size grows -- more lines per commit, which correlates with harder-to-review changes
- PR cycle times don't improve, and sometimes get worse -- reviewers are spending longer on more code written in a less familiar style
- Commit message quality drops -- more "update logic", less actual context
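Commit size is the easiest of these to measure yourself. Here's a minimal sketch that averages lines changed per commit from `git log --numstat --format=%H` output; the sample log at the bottom is fabricated for illustration, and in practice you'd pipe in real output and bucket by month to see the trend.

```python
from collections import defaultdict

def avg_commit_size(numstat_text: str) -> float:
    """Average added+deleted lines per commit in `git log --numstat --format=%H` output."""
    sizes = defaultdict(int)
    current = None
    for raw in numstat_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) == 3:
            # numstat row: "<added>\t<deleted>\t<path>"
            if parts[0].isdigit() and parts[1].isdigit():
                sizes[current] += int(parts[0]) + int(parts[1])
            # binary files show up as "-\t-\t<path>"; skip them
        else:
            current = line  # a commit hash line from --format=%H
    return sum(sizes.values()) / len(sizes) if sizes else 0.0

# Fabricated two-commit log for illustration:
sample = "abc123\n10\t2\tsrc/app.py\n\ndef456\n120\t40\tsrc/app.py\n30\t0\ttests/test_app.py\n"
print(avg_commit_size(sample))  # (12 + 190) / 2 = 101.0
```

Plotting this month over month is usually enough to see whether the "bigger commits" effect is real in your repo.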
None of this means AI is making things worse. It means raw output metrics don't capture what's actually happening.
The before/after problem
Comparing pre- and post-AI adoption sounds straightforward. In practice, you're also comparing different project phases, team compositions, architectural decisions, and a dozen other variables that moved at the same time. Almost every "AI made us X% faster" claim, when you look at the methodology, is comparing an enthusiastic adoption period against an uncontrolled baseline.
What you can actually observe
After looking at a lot of repositories, the measurable impacts fall into a few categories:
Code homogeneity increases. AI-assisted codebases tend to become more internally consistent -- the same patterns repeat. Good for cognitive load, but the same mistakes replicate everywhere too.
Review burden shifts. The bottleneck doesn't disappear, it moves. Code output goes up, but someone still has to review it. AI-generated code tends to look correct even when it isn't, which makes subtle bugs harder to catch.
Test coverage quietly degrades. AI-assisted PRs consistently ship proportionally less test code than production code. The feature lands fast, the tests get deferred.
The uncomfortable ones
Knowledge erosion. We've seen repositories where contributor breadth increases but contributor depth decreases. Bus-factor metrics look healthy, yet none of the contributors could confidently explain the module without re-reading it. The metric looks fine. The codebase is fragile.
Architectural drift. AI-generated code is only as good as the context available when it was generated. Over months, this creates "dialects" within the same repo -- different patterns for the same operations, because different sessions had different context.
The productivity paradox. More code, but not proportionally more capability. The codebase grows faster than the product does. AI accelerates accidental complexity because the cost of writing code drops while the cost of understanding and maintaining it stays constant.
What to actually track
Since a single metric or a simple before/after comparison won't cut it, here's what we think gives a more honest picture:
- PR cycle time (not PR volume)
- Review comment density (not approval speed)
- Revert and hotfix rates (not commit counts)
- Test-to-production code ratio over time
- Contributor depth per module, not just breadth
And watch for: commit sizes trending upward without corresponding feature complexity, declining review thoroughness as volume increases, and the growing gap between code output and test coverage.
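Contributor depth is the least obvious of these to compute, so here's one possible sketch. The measure is an assumption, not a standard: for each module, the fraction of commits made by authors who individually account for at least half of that module's commits. Pure breadth (author count) hides the case where everyone made one drive-by change. The threshold and the commit log below are illustrative.

```python
from collections import Counter, defaultdict

def contributor_depth(commits: list[tuple[str, str]], threshold: float = 0.5) -> dict[str, float]:
    """commits is a list of (module, author) pairs; returns a depth score per module.

    Depth = share of a module's commits coming from authors whose own share
    of that module meets the threshold. High breadth with zero depth means
    lots of shallow touches and no one who really owns the code.
    """
    by_module = defaultdict(Counter)
    for module, author in commits:
        by_module[module][author] += 1
    depth = {}
    for module, counts in by_module.items():
        total = sum(counts.values())
        deep = sum(n for n in counts.values() if n / total >= threshold)
        depth[module] = deep / total
    return depth

# Fabricated log: "auth" has a clear owner, "billing" only has drive-bys
log = [("auth", "ana"), ("auth", "ana"), ("auth", "ana"), ("auth", "bo"),
       ("billing", "bo"), ("billing", "cy"), ("billing", "dee"), ("billing", "ed")]
print(contributor_depth(log))  # {'auth': 0.75, 'billing': 0.0}
```

Weighting by lines changed instead of commit counts, or sourcing the pairs from `git log --name-only`, are straightforward variations.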
AI coding tools are genuinely useful -- we use them ourselves. But the rush to quantify impact has produced a lot of misleading numbers, and optimising for the wrong metrics leads you somewhere you don't want to be.
Curious whether others are tracking any of this, or whether most teams are still pointing at velocity and calling it done. What signals have you found actually tell you something meaningful?