Organizations have long struggled with answering the question: "How do we measure developer productivity?"
In the past, some organizations have measured lines of code produced per engineer. Others have counted the number of tickets closed. Still others have counted the number of pull requests (PRs) merged.
And now in the age of AI, we've started measuring token usage.
Proxy metrics
The problem with all of these metrics is that they're not actually measurements of developer productivity: they're proxy metrics. And with any proxy metric, once you start measuring it, the metric itself becomes the goal (this is Goodhart's law: when a measure becomes a target, it ceases to be a good measure). This is especially true if that metric has financial incentives or performance ratings tied to it.
Measuring lines of code? Great, engineers will start writing overly verbose code in order to produce more lines. Goodbye concise functions and reusable code.
Measuring number of tickets closed? Great, engineers will start creating tickets for everything, and they'll break tasks down into the smallest tickets possible. Now we're more focused on Jira and paperwork than actual work and value provided.
Measuring number of pull requests merged? Great, engineers will focus on creating as many PRs as possible, with the smallest changes they can think of, or with the most trivial updates.
Measuring token usage? Great, engineers will use their AI assistants for everything to use more tokens, regardless of whether or not the output is useful or makes the engineer more efficient.
All of these examples are what are called perverse incentives — incentives that create unintended and oftentimes counterproductive behaviors that work contrary to the desired goal.
Cobra farms
One of the most famous examples of perverse incentives is the story of the cobra problem in India. According to the story, Delhi was overrun with venomous cobras. The government wanted to get rid of the cobras, so it offered a reward for every dead cobra that someone brought in. Seems reasonable, right? The more dead cobras there are, the fewer live ones there are.
But can you see the perverse incentive? People were rewarded for dead cobras. How do you get more dead cobras, besides finding more? You raise more cobras yourself and then kill them. So people began breeding cobras themselves so that they could turn them in for the reward.
And then, once the government realized what was happening, it ended the incentive program. Cobra breeders no longer had any reason to raise cobras, so they released the ones they had into the wild, resulting in even more cobras in the city than before.
The well-intentioned goal of rewarding people for dead cobras actually resulted in more cobras being present, not fewer.
(For the record, it's not entirely certain whether this story is true or whether parts of it are exaggerated. But it's a very fun example for a teaching moment!)
AI usage
So now let's get back to AI usage and tokens. Organizations are under more pressure than ever to prove that they're thriving in this brave new world of AI. Engineers are under that same pressure, expected to do more and accomplish more work. That's the promise we've all been sold, that AI will make us more productive.
But how do we measure this increased productivity? Well, we're back to the same problem organizations have always had, even before AI. It's hard to measure productivity, so we come up with a proxy metric, and that proxy metric is now token usage.
Surely if an engineer is using more tokens, that means they're using AI more, so they're more productive in their job and helping the business even more… right?
Usage, output, and outcomes
That's where the build trap happens. Organizations go astray when they focus more on output (e.g. pull requests merged, features shipped, projects completed) than on outcomes (e.g. value delivered to customers, user behavior change, pain points resolved).
Maybe token usage is increasing, but does that correspond to an increase in pull requests and actual work done? If token usage increases but the output of the individual stays the same, then we haven't actually accomplished anything. We're just celebrating using more tokens (and spending more money) with no actual increase in productivity.
And taking that a step further, let's say that we do see an increase in pull requests merged, which presumably is a good thing. But does that increase in output lead to a corresponding increased benefit in outcomes? Is the product better because of the increased output?
In other words, does the increased code result in a tangible increase in the value that the company provides to its users?
That question is far more difficult to answer, but far more important.
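To make the usage-versus-output check concrete, here's a minimal sketch in Python. All of the numbers, metric names, and thresholds are invented for illustration; this is a hypothetical sanity check, not a recommended dashboard.

```python
# Hypothetical quarterly metrics for one team. Every number here is
# made up for illustration -- not from any real organization.
metrics = {
    "Q1": {"tokens": 2_000_000, "prs_merged": 120},
    "Q2": {"tokens": 6_000_000, "prs_merged": 125},
}

def growth(metric: str) -> float:
    """Quarter-over-quarter fractional growth for one metric."""
    q1, q2 = metrics["Q1"][metric], metrics["Q2"][metric]
    return (q2 - q1) / q1

token_growth = growth("tokens")       # tokens tripled (growth of 2.0)
output_growth = growth("prs_merged")  # output barely moved (~0.04)

# Usage exploded while output stayed roughly flat: a usage dashboard
# alone would call this a success, but we're mostly buying tokens,
# not productivity. (Thresholds are arbitrary placeholders.)
if token_growth > 0.5 and output_growth < 0.1:
    print("Warning: token usage is growing much faster than output.")
```

Even this only compares usage to output; the harder step the article argues for, tying output to outcomes, requires product signals (user behavior, pain points resolved) that no token counter can provide.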
Mediocre organizations measure usage and output and then stop there. Great organizations measure outcomes.
Comments
I've talked to a few teams pushing AI hard and you can see the output go up, but something subtle changes in how engineers relate to the code.
It shifts from 'I built this' to 'the model suggested this.' That mismatch matters a ton. Nobody fully owns it anymore, so when something breaks it's unclear what happened. And the reviews are basically glances, the AI-generated changes are large enough that nobody can really untangle them anyway.
You can have more output and less understanding at the same time. That's a strange place to be.
100%, I've seen a lot of that as well.
We've tried to be pretty explicit that no matter who or what wrote the code (you or an AI assistant), you as the code author are still ultimately responsible for understanding and maintaining that code. You shouldn't be shipping something if you can't explain it. But that shift in mentality is real.
And agreed again, the code review process is also a bit strained. When an organization pumps out code three times faster, now there are three times more PRs to review. And it gets worse if every PR contains 1000+ lines of code changes. We have to continue to go back to fundamentals, that PRs should be small, focused, well-scoped changes, since that makes them easier to review and understand without the reviewer experiencing a bunch of cognitive overload. Otherwise you just start rubber stamping PRs without actually reviewing them.
It seems that as an industry we're having to re-learn some of the basics that we've known for decades. (Or, find new ways to solve the new problems we've created for ourselves.)
The cobra farm analogy is perfect — I've seen teams where the 'most AI-assisted' devs produced the most reverted PRs. Measuring outcomes instead of output sounds obvious until you try to define what a good outcome actually is.
Absolutely, that's the tricky (but essential) part!
The distinction between output and outcome is where most AI rollouts quietly die. Teams ship features fast, token counts go up, and it looks like progress — until someone asks whether any of it actually moved the metric.
As a PM this is the exact conversation I keep having: usage dashboards are not success metrics. The hard part is designing outcome signals upfront, before the build starts, so you’re not retrofitting accountability onto a pile of shipped features.
Good framing. Bookmarking this for the next sprint planning where someone wants to measure AI value by lines of code generated.