Your team rolled out an AI coding agent three months ago, and leadership wants a number that proves the seat licenses paid off. The dashboard offers two easy ones: lines of code generated, and the share of AI-assisted pull requests that got merged. Both are trivial to pull, both look healthy, and both will steer you wrong.
Why the Easy Metrics Lie
Lines of code has been a discredited productivity measure for decades, but agents make it actively dangerous. An agent will produce 400 lines where 40 would do — boilerplate, defensive checks for inputs that cannot occur, a helper it did not notice already existed three files over. Counting that output as productivity rewards the exact behavior you want to suppress. Teams getting real value from agents often watch their net diff shrink, because the agent is also deleting dead code and collapsing duplicated abstractions.
PR acceptance rate is more seductive, because it sounds like a quality signal. It is not. One figure that circulated in this debate: the KubeStellar project reportedly merged 81% of its AI-assisted pull requests. Read that carefully. It tells you 81% of those PRs cleared review. It tells you nothing about whether they should have been opened, whether they introduced defects found weeks later, how many review rounds each one cost, or whether the merged code was still in the codebase a month on.
An 81% acceptance rate is just as consistent with reviewers rubber-stamping output they did not fully read as it is with genuine quality. AI-assisted PRs are often smaller and more numerous, which inflates acceptance rate while quietly raising the total review burden across the team. The metric measures a reviewer's willingness to click merge — not the agent's contribution to the product.
If a metric goes up when an agent produces more code regardless of whether that code was needed, it is a vanity metric. Lines of code, commit count, and raw PR throughput all fail this test. Treat them as activity logs, not scorecards.
What to Track Instead
The useful question is not how much the agent produced, but what its output cost and how long it lasted. Four measurements cover most of that, and you can derive all of them from data you already have: Git history, PR review timelines, your deployment pipeline, and your incident tracker.
| Metric | What it catches | Where it comes from |
|---|---|---|
| Code survival rate | Agent output rewritten or deleted within 3-4 weeks | `git blame` history on agent-authored lines |
| Review rounds per PR | Cost shifted from author to reviewer | PR review timeline |
| Change failure rate | Whether agent-assisted changes break production more often | Incident tracker, joined to PRs tagged as agent-assisted |
| Commit-to-deploy time | Whether the agent shortens delivery, not just authoring | Deployment pipeline |
Code survival rate is the hardest of the four to game. If 60% of an agent's lines are gone within a month, the agent generated rework, not progress — and rework is invisible to both lines of code and acceptance rate. Change failure rate is one of the four DORA metrics, and commit-to-deploy time maps onto DORA's lead time for changes, so you can compare AI-assisted changes against a baseline the industry already understands instead of inventing a scale.
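As a rough illustration of the survival calculation, here is a minimal Python sketch. It assumes you have already extracted, from `git blame` or `git log`, the date each agent-authored line was added and the date it was removed or rewritten, if it was. The `AgentLine` record, the `survival_rate` function, and the 28-day default window are all hypothetical choices for this example, and the sample dates are made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class AgentLine:
    """One agent-authored line (hypothetical record shape)."""
    added_on: date
    removed_on: Optional[date]  # None means the line is still in the codebase


def survival_rate(lines, as_of, window_days=28):
    """Share of agent-authored lines still present window_days after being added.

    Lines added less than window_days before as_of are skipped, because
    they have not yet had the chance to fail the test.
    """
    window = timedelta(days=window_days)
    eligible = [ln for ln in lines if ln.added_on + window <= as_of]
    if not eligible:
        return None  # nothing old enough to judge yet
    survived = sum(
        1
        for ln in eligible
        if ln.removed_on is None or ln.removed_on - ln.added_on > window
    )
    return survived / len(eligible)


# Made-up sample data, for illustration only.
lines = [
    AgentLine(date(2025, 1, 6), None),               # still present
    AgentLine(date(2025, 1, 6), date(2025, 1, 20)),  # removed after 14 days
    AgentLine(date(2025, 1, 6), date(2025, 3, 1)),   # removed, but after the window
    AgentLine(date(2025, 2, 20), None),              # too recent to count
]

rate = survival_rate(lines, as_of=date(2025, 3, 3))
print(f"{rate:.0%} of eligible agent lines survived")  # 2 of 3 eligible lines
```

Excluding lines that are too young to judge matters: counting them as survivors would inflate the rate right after a burst of agent activity.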
Pair the quantitative side with one qualitative measure. The SPACE framework's central argument is that developer productivity is multidimensional and cannot be collapsed into throughput alone. A recurring two-question survey (did the agent reduce friction this week, and did it add any?) catches problems Git data cannot, like an agent that produces mergeable code while making the codebase harder to reason about.
Before you roll an agent out team-wide, record two weeks of baseline numbers without it. A metric with no "before" value can be argued into meaning almost anything. The baseline is what turns your post-rollout numbers into evidence.
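One way to make the before/after comparison concrete is to summarize each period with the same function. The sketch below assumes you can export, per change, a commit timestamp, a deploy timestamp, and whether the change was later linked to an incident. The `Change` record and `summarize` function are hypothetical names, and the sample values are invented purely to show the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class Change:
    """One deployed change (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    caused_incident: bool


def summarize(changes):
    """Return change failure rate and median commit-to-deploy time in hours."""
    if not changes:
        return None
    failures = sum(1 for c in changes if c.caused_incident)
    lead_hours = [
        (c.deployed_at - c.committed_at).total_seconds() / 3600 for c in changes
    ]
    return {
        "change_failure_rate": failures / len(changes),
        "median_lead_time_hours": median(lead_hours),
    }


# Invented sample data: a baseline period and a post-rollout period.
baseline = [
    Change(datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 15), False),   # 6 h
    Change(datetime(2025, 1, 7, 9), datetime(2025, 1, 8, 9), True),     # 24 h
    Change(datetime(2025, 1, 8, 9), datetime(2025, 1, 8, 19), False),   # 10 h
    Change(datetime(2025, 1, 9, 9), datetime(2025, 1, 9, 21), False),   # 12 h
]
post_rollout = [
    Change(datetime(2025, 2, 3, 9), datetime(2025, 2, 3, 13), False),   # 4 h
    Change(datetime(2025, 2, 4, 9), datetime(2025, 2, 4, 17), True),    # 8 h
]

print("baseline:    ", summarize(baseline))
print("post-rollout:", summarize(post_rollout))
```

Running both periods through one function keeps the definitions identical, which is the point of recording a baseline in the first place.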
Running the Measurement Without Drowning in Dashboards
You do not need a metrics platform. Pick two measurements — code survival rate and review rounds per PR make a strong starting pair — and track them in a spreadsheet or shared doc for one quarter. Tag the PRs that used an agent so you can compare cohorts cleanly. Resist adding a third and fourth metric until the first two have told you something, because every metric you track is a number someone has to interpret, defend, and argue about in a review meeting.
Keep the comparison fair. The honest baseline is not the agent versus no tooling — it is the agent versus a developer with ordinary IDE autocomplete and a linter. Copilot, Cursor, and Claude Code also behave differently enough that blending them into one "AI" bucket hides the answer: an inline-completion tool, an editor-native agent, and a terminal agent each shift work to a different stage of the cycle. Measure each as its own cohort.
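If the PRs carry a cohort tag, the review-rounds comparison is a few lines of Python. This sketch assumes each PR is exported as a dict with a `cohort` label and an ordered list of review outcomes. It also assumes one particular definition of a "round" (the initial review plus one more for every changes-requested review), which you may want to define differently. The tag values, field names, and sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def review_rounds(events):
    """Assumed definition: one initial round plus one per changes-requested review."""
    return 1 + sum(1 for e in events if e == "changes_requested")


def median_rounds_by_cohort(prs):
    """Group PRs by their cohort tag and return the median review rounds for each."""
    rounds = defaultdict(list)
    for pr in prs:
        rounds[pr["cohort"]].append(review_rounds(pr["events"]))
    return {cohort: median(values) for cohort, values in rounds.items()}


# Made-up sample data; the cohort tags are illustrative labels only.
prs = [
    {"cohort": "baseline", "events": ["approved"]},
    {"cohort": "baseline", "events": ["changes_requested", "approved"]},
    {"cohort": "copilot", "events": ["approved"]},
    {"cohort": "claude-code", "events": ["changes_requested", "changes_requested", "approved"]},
    {"cohort": "claude-code", "events": ["changes_requested", "approved"]},
    {"cohort": "claude-code", "events": ["approved"]},
]

for cohort, value in median_rounds_by_cohort(prs).items():
    print(f"{cohort}: median {value} review rounds")
```

The median is used rather than the mean so that one PR with a long review thread does not dominate a small cohort.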
One trap deserves a specific warning. Do not let the agent write the tests that validate its own code without a human reading them. An agent that generates both the implementation and a passing test suite can post a flawless acceptance rate while testing nothing real. Code survival rate will expose that eventually; a reviewer who actually reads the tests exposes it on day one.
None of this is about policing the agent. It is about learning, with evidence, where the agent genuinely helps your team and where it quietly shifts cost downstream — so the next renewal decision rests on data instead of a lines-of-code chart that was never measuring the right thing.
Originally published at pickuma.com.