Paulo Victor Leite Lima Gomes

Posted on May 18

Stop Measuring AI Agents by How Much Code They Produce

#ai #agents #engineering #devtools

The CNCF published a post last week about KubeStellar reaching an 81% PR acceptance rate for contributions made by AI agents. On the same day, GitHub exposed team-level Copilot usage metrics through its API.

Two signals. Same problem. Different directions.

KubeStellar's team framed the number as a success. An 81% acceptance rate is genuinely impressive for agent-written code. But reading the post, I kept asking myself: what happens when the manager dashboard shows that Team A used Copilot for 5,000 suggestions this week and Team B used it for 500?

Gage says: more is better, obviously. Agents are producing. Velocity is up.

Paul says: wait, what are we actually measuring here?

the problem with counting output

Measuring AI agents by how much code they produce is the same mistake as measuring developers by lines of code written.

We know this. We have known this for decades. LOC is a garbage metric because it incentivizes verbosity, punishes cleanup, and says nothing about correctness, maintainability, or whether the code should exist at all.

But we are about to recreate this exact mistake at organizational scale, except this time the producer is not human and the latency between "produce" and "review" is measured in seconds instead of days.

The reason is obvious: code output from agents is easy to count. PRs merged, lines changed, suggestions accepted, prompts submitted, Copilot API calls made. These numbers come out of dashboards automatically. They look impressive. They trend upward.

But they measure activity, not value.

And activity is very cheap when the producer runs on tokens.

what kubestellar's 81% actually tells us

KubeStellar is an open-source Kubernetes multi-cluster management project. Their experiment had agents contributing to an existing codebase with established patterns, maintainer expectations, and a review process.

An 81% acceptance rate in that context means the agents learned the project's conventions. They produced code that passed CI, followed the existing patterns, and addressed real issues. That is not trivial. It suggests that with good context — clear issues, well-structured codebases, explicit contribution guidelines — agents can be genuinely helpful contributors.

What it does not mean is that 81% is a universal benchmark, or that the other 19% does not matter.

The 19% that gets rejected, together with whatever slice of the accepted 81% is later reverted, is where the interesting signal lives. Was it functionally wrong? Was it technically correct but stylistically off? Did it introduce subtle bugs that only showed up in production? Did it pass review but get reverted a week later?

These are not edge cases. They are the entire reason code review exists.

I want to know the rollback rate. I want to know the bug rate per agent-contributed PR. I want to know the maintenance burden — how many of those agent-contributed functions will need refactoring in six months because they solved today's problem without considering tomorrow's change.

Those metrics do not appear in a Copilot dashboard. They require deeper instrumentation, ownership tracking, and time.

what the github metrics API will do to your org

GitHub's new team-level Copilot usage metrics are useful. They let you see which teams are adopting AI tools, how many suggestions are being accepted, and where adoption is lagging.

They will also be weaponized within about three weeks of rollout.

Here is the scenario I have already seen play out at several companies:

Engineering leadership rolls out the dashboard. Team A has high adoption numbers. Team B has lower numbers. A well-intentioned VP asks why Team B is not using AI more. Team B starts optimizing for the metric. They accept more suggestions, merge faster, prompt more aggressively. The dashboard looks better. The code quality degrades, slowly and invisibly, because nobody is measuring maintenance cost per agent-contributed PR.

The dashboard metric becomes the goal. The goal becomes the degradation vector.

I am not anti-metrics. I am anti-metrics-that-look-good-but-measure-the-wrong-thing, because those are the most dangerous kind. They give you confidence in the wrong direction.

acceptance rate is better, but not complete

If you must measure agent productivity with one number, acceptance rate is at least better than raw suggestion count.

KubeStellar's approach — measuring what actually gets merged — accounts for the fact that not all output is valuable. It puts the emphasis on the review outcome, not the generation volume.

But acceptance rate has blind spots too.

A high acceptance rate can mean the agent is producing great code. It can also mean the reviewers are not reviewing carefully. Or that the PRs are so small they barely merit a review. Or that the codebase conventions are so loose that everything looks acceptable. Or that the team has normalized agent output and stopped treating it critically.

I have seen the "reviewer fatigue" pattern in several orgs already. When every PR is agent-generated, developers stop reading diffs carefully. The acceptance rate stays high because nobody is looking.

If you measure acceptance rate without also measuring review depth, you are measuring the review process, not the output quality.
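One way to keep the two apart is to report acceptance rate next to the share of merges that received no visible review. Here is a minimal sketch in Python, assuming you have already exported per-PR review data. The `PullRequest` record, its field names, and the definition of a "shallow" merge are illustrative assumptions, not fields from any GitHub or Copilot API:

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # Hypothetical record; populate it from whatever review data you export.
    merged: bool
    review_rounds: int      # review passes before the final decision
    review_comments: int    # comments left by reviewers
    changes_requested: int  # reviews that asked for changes


def is_shallow_merge(pr: PullRequest) -> bool:
    """Assumed heuristic: merged in one pass, no comments, no change requests."""
    return (
        pr.merged
        and pr.review_rounds <= 1
        and pr.review_comments == 0
        and pr.changes_requested == 0
    )


def acceptance_with_depth(prs: list[PullRequest]) -> dict[str, float]:
    """Return acceptance rate plus the share of merges that look unreviewed."""
    merged = [pr for pr in prs if pr.merged]
    shallow = [pr for pr in merged if is_shallow_merge(pr)]
    return {
        "acceptance_rate": len(merged) / len(prs) if prs else 0.0,
        "shallow_merge_share": len(shallow) / len(merged) if merged else 0.0,
    }


# Made-up sample: four merged PRs (two with no review activity), one rejected.
sample = [
    PullRequest(merged=True, review_rounds=1, review_comments=0, changes_requested=0),
    PullRequest(merged=True, review_rounds=1, review_comments=0, changes_requested=0),
    PullRequest(merged=True, review_rounds=2, review_comments=5, changes_requested=1),
    PullRequest(merged=True, review_rounds=3, review_comments=8, changes_requested=2),
    PullRequest(merged=False, review_rounds=1, review_comments=3, changes_requested=1),
]
print(acceptance_with_depth(sample))
```

In the made-up sample, the acceptance rate is 80%, but half of those merges went through without a single comment. The first number alone would hide that.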

what i would measure instead

If I were building a dashboard for AI-assisted development, I would track four things:

1. Acceptance rate with a review depth qualifier. Not just "was the PR merged?" but "how many rounds of review did it take? How many comments? How many changes requested?" A PR that gets accepted on the first try with zero comments might be perfect, or it might be unreviewed. Distinguish these.

2. Rollback rate within 30 days. This is the honest metric. Code that gets deployed and rolled back within a month is code that created production cost, regardless of how clean the PR looked. If agent-contributed PRs have a higher 30-day rollback rate than human-contributed ones, you have a review quality problem, not a generation problem.

3. Maintenance cost attribution. When a bug gets fixed or a feature gets refactored, who wrote the original code that had to be changed? If agent-contributed code accounts for a disproportionate share of follow-up work, that is a signal that the agent is producing surface-correct but structurally fragile code.

4. Context reuse rate. This one is speculative but I think it matters. How often does an agent reuse context from a previous PR — issue links, pattern choices, architecture decisions — versus starting fresh? Reuse suggests learning. Fresh starts on every PR suggest the agent is solving each problem in isolation, which is how you accumulate weird inconsistencies across the codebase.
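Items 2 and 3 are the easiest to prototype. The sketch below is a rough illustration, not a finished implementation. It assumes you can label each merged change as agent- or human-authored and can trace reverts and follow-up fixes back to it. The `MergedChange` record, the `"agent"`/`"human"` labels, and the sample history are all hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MergedChange:
    # Hypothetical record; assumes authorship and follow-ups are already tracked.
    author_kind: str                        # "agent" or "human"
    merged_at: datetime
    reverted_at: Optional[datetime] = None  # None means never rolled back
    followup_fixes: int = 0                 # later fix/refactor PRs traced to this change


def rollback_rate(changes: list[MergedChange], kind: str, window_days: int = 30) -> float:
    """Share of a cohort's merged changes rolled back within the window."""
    cohort = [c for c in changes if c.author_kind == kind]
    if not cohort:
        return 0.0
    window = timedelta(days=window_days)
    rolled_back = [
        c for c in cohort
        if c.reverted_at is not None and c.reverted_at - c.merged_at <= window
    ]
    return len(rolled_back) / len(cohort)


def followup_share(changes: list[MergedChange], kind: str) -> float:
    """Share of all follow-up fixes that trace back to one cohort."""
    total = sum(c.followup_fixes for c in changes)
    if total == 0:
        return 0.0
    return sum(c.followup_fixes for c in changes if c.author_kind == kind) / total


def change_share(changes: list[MergedChange], kind: str) -> float:
    """Share of merged changes written by one cohort, for comparison."""
    if not changes:
        return 0.0
    return sum(1 for c in changes if c.author_kind == kind) / len(changes)


# Made-up history: four agent changes and four human changes.
day0 = datetime(2025, 1, 1)
history = [
    MergedChange("agent", day0, reverted_at=day0 + timedelta(days=10), followup_fixes=3),
    MergedChange("agent", day0, reverted_at=day0 + timedelta(days=45)),
    MergedChange("agent", day0, followup_fixes=3),
    MergedChange("agent", day0),
    MergedChange("human", day0, followup_fixes=1),
    MergedChange("human", day0, followup_fixes=1),
    MergedChange("human", day0),
    MergedChange("human", day0),
]
print(rollback_rate(history, "agent"), rollback_rate(history, "human"))
print(followup_share(history, "agent"), change_share(history, "agent"))
```

The comparison that matters is `followup_share` against `change_share`. In the made-up history, agents wrote half of the changes but account for three quarters of the follow-up fixes, which is the "disproportionate share" signal from item 3. The revert at day 45 falls outside the 30-day window and does not count toward the rollback rate.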

the deeper problem is not measurement, it's accountability

I think the measurement conversation is really an accountability conversation.

When a human writes code and it breaks, there is a clear chain: author, reviewer, approver, deployer. The review was a human reading another human's work, and both felt the weight of that review because both names are attached to the change.

When an agent writes code and it breaks, the chain gets fuzzy. Was the reviewer supposed to catch the subtle correctness issue? Was the agent's context insufficient? Was the prompt ambiguous? Was the issue description bad? The agent has no name. There is no accountability feedback loop.

Anthropic recently published a postmortem tracing Claude Code quality complaints to overlapping product changes and context management issues. That is honest, and it is the kind of transparency we need more of. But it also illustrates the structural problem: when code quality degrades because of agent behavior, the fix is not "tell the agent to do better." The fix involves better prompts, better context, better review processes, better tooling — and all of that requires human investment.

The agent does not learn from its mistakes unless someone builds that feedback loop.

the punchline

Stop measuring AI agents by how much code they produce. Start measuring by how much value survives the commute to production.

KubeStellar's 81% acceptance rate is an interesting data point. But the number that matters more is the one nobody is tracking yet: how much of that accepted code is still in production, still clean, still maintainable, and still correct six months from now.
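The "still in production" part, at least, can be approximated. Below is a minimal sketch, assuming you can count, for each accepted PR, how many lines it added and how many of those lines still exist unchanged six months later (for instance from blame data; collecting the counts is outside this sketch). The PR identifiers and counts are made up, and survival on its own says nothing about whether the code is clean or correct:

```python
def survival_rate(lines_added: dict[str, int], lines_surviving: dict[str, int]) -> float:
    """Fraction of lines added by accepted PRs that still exist unchanged."""
    total_added = sum(lines_added.values())
    if total_added == 0:
        return 0.0
    # Cap each PR's surviving count at what it added, in case of bad input.
    surviving = sum(
        min(lines_surviving.get(pr_id, 0), added)
        for pr_id, added in lines_added.items()
    )
    return surviving / total_added


# Hypothetical counts, keyed by PR identifier.
added = {"pr-1": 100, "pr-2": 50, "pr-3": 50}
still_there = {"pr-1": 80, "pr-2": 0, "pr-3": 20}
print(survival_rate(added, still_there))
```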

That number is hard to measure. It requires authorship tracking, ownership, review depth, rollback tracking, maintenance attribution, and time.

But that is the number that will tell you whether your AI-assisted engineering investment is working, or whether you just optimized a dashboard that hides the real cost.

Acceptance rate is a starting point. It is not the finish line.