Organizations have long struggled with answering the question: "How do we measure developer productivity?"
In the past, some organizations have taken to measuring lines of code produced per engineer. Others have measured the number of tickets closed. Still others have measured the number of pull requests (PRs) merged.
And now in the age of AI, we've started measuring token usage.
Proxy metrics
The problem with all of these metrics is that they're not actually measurements of developer productivity — they're proxy metrics. And with any proxy metric, once you start measuring something, that metric becomes the goal (a phenomenon known as Goodhart's law). This is especially true if that metric has financial incentives or performance ratings tied to it.
Measuring lines of code? Great, engineers will start writing overly verbose code in order to produce more lines. Goodbye concise functions and reusable code.
Measuring number of tickets closed? Great, engineers will start creating tickets for everything, and they'll break tasks down into the smallest tickets possible. Now we're more focused on Jira and paperwork than actual work and value provided.
Measuring number of pull requests merged? Great, engineers will focus on creating as many PRs as possible, with the smallest changes they can think of, or with the most trivial updates.
Measuring token usage? Great, engineers will run their AI assistants on everything just to burn more tokens, regardless of whether the output is useful or makes them more efficient.
All of these examples are what are called perverse incentives — incentives that create unintended and oftentimes counterproductive behaviors that work contrary to the desired goal.
Cobra farms
One of the most famous examples of perverse incentives is the story of the cobra problem in India. According to the story, Delhi was overrun with venomous cobras. The government wanted to get rid of them, so it offered a reward for every dead cobra that someone brought in. Seems reasonable, right? The more dead cobras there are, the fewer live ones there are.
But can you see the perverse incentive? People were rewarded for dead cobras. How do you get more dead cobras, besides finding more? You raise more cobras yourself and then kill them. So people began breeding cobras themselves so that they could turn them in for the reward.
And then, once the government realized what was happening, it ended the incentive program. Cobra breeders no longer had any reason to raise more cobras, so they released the ones they had into the wild, resulting in even more cobras in the city than before.
The well-intentioned goal of rewarding people for dead cobras actually resulted in more cobras being present, not fewer.
(For the record, it's not entirely certain whether this story is true or whether parts of it are exaggerated. But it's a very fun example for a teaching moment!)
AI usage
So now let's get back to AI usage and tokens. Organizations are under more pressure than ever to prove that they're thriving in this brave new world of AI. Engineers are under that same pressure, expected to take on and accomplish more work. That's the promise we've all been sold: that AI will make us more productive.
But how do we measure this increased productivity? Well, we're back to the same problem organizations have always had, even before AI. It's hard to measure productivity, so we come up with a proxy metric, and that proxy metric is now token usage.
Surely if an engineer is using more tokens, that means they're using AI more, so they're more productive in their job and helping the business even more… right?
Usage, output, and outcomes
That's where the build trap happens. Organizations go astray when they focus more on output (e.g. pull requests merged, features shipped, projects completed) than on outcomes (e.g. value delivered to customers, user behavior change, pain points resolved).
Maybe token usage is increasing, but does that correspond to an increase in pull requests and actual work done? If token usage increases but the output of the individual stays the same, then we haven't actually accomplished anything. We're just celebrating using more tokens (spending more money) but with no actual increased productivity.
And taking that a step further, let's say that we do see an increase in pull requests merged, which presumably is a good thing. But does that increase in output lead to a corresponding increased benefit in outcomes? Is the product better because of the increased output?
In other words, does the increased code result in a tangible increase in the value that the company provides to its users?
That question is far more difficult to answer, but far more important.
Mediocre organizations measure usage and output and then stop there. Great organizations measure outcomes.