Sayed Ali Alkamel

Posted on Jun 16 • Originally published at dev.to

Is Token Usage the New Lines of Code? How to Measure Developer Productivity in the AI Age

#ai #programming #devtools #productivity

"Whatever gets measured gets gamed." — Goodhart's Law

An OpenAI engineer processed 210 billion tokens in a single week. A Claude Code user spent over $150,000 in a month. Salesforce quietly set a $175 minimum monthly token-spend target for engineers, and developers started asking AI to summarize documents they already understood, just to hit the number.

Welcome to tokenmaxxing. The newest chapter in software engineering's oldest and most embarrassing story: measuring the wrong thing, confidently, at scale.

This article will give you the unfiltered answer to a question every engineering leader and developer is quietly debating right now. Is measuring token usage the same as measuring lines of code? And if so, what should we actually be measuring?

The Ghost We Thought We Buried: Lines of Code

Fred Brooks knew it in 1975. In The Mythical Man-Month, he made it plain that adding more programmers to a late project makes it later. Productivity in software is not linear, it is not additive, and it cannot be reduced to volume.

And yet, for decades, companies fell for the simplest metric available: lines of code (LOC). Count the output, reward the output, get more output. What they actually got was bloated codebases, unnecessary complexity, and engineers who learned to pad their work to look busy.

Kent Beck, creator of Extreme Programming, said it directly: "The way you get programmer productivity is not by increasing the lines of code per programmer per day. That does not work. The way you get programmer productivity is by eliminating lines of code you have to write."

The industry eventually accepted this. LOC as a metric became a cautionary tale. We moved on to commits, pull requests, story points, velocity, and cycle time. Every single one of these was gamed the moment it became a target. Agile teams discovered that story points, originally meant as capacity-planning proxies, had turned into performance theater.

Then AI arrived. And the industry, with astonishing speed, invented a brand new broken metric.

Tokenmaxxing: Lines of Code in a Lab Coat

Token budgets are the new LOC. The logic sounds compelling at first: if a developer is using more AI compute, they must be building more things, right?

Wrong. Measuring token consumption as a proxy for productivity makes the same fundamental error as counting lines of code. It measures an input to the process, not the output, and certainly not the outcome.

TechCrunch's April 2026 investigation into tokenmaxxing found that engineers with the largest token budgets produced the most pull requests, but the productivity improvement did not scale with the spend. They achieved twice the throughput at ten times the token cost. The tools were generating volume, not value.
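To make that ratio concrete, here is a minimal sketch of the arithmetic. Only the 2x and 10x multipliers come from the finding above; the baseline figures (10 pull requests, 1,000,000 tokens) and the function name are made-up placeholders for illustration.

```python
def tokens_per_pr(tokens_spent: float, pull_requests: int) -> float:
    """Tokens consumed per pull request produced."""
    if pull_requests <= 0:
        raise ValueError("pull_requests must be positive")
    return tokens_spent / pull_requests


# Hypothetical baseline engineer (illustrative numbers, not from the report).
baseline_tokens = 1_000_000
baseline_prs = 10

# Heavy user: ten times the tokens, twice the pull requests.
heavy_tokens = baseline_tokens * 10
heavy_prs = baseline_prs * 2

baseline_unit_cost = tokens_per_pr(baseline_tokens, baseline_prs)
heavy_unit_cost = tokens_per_pr(heavy_tokens, heavy_prs)
unit_cost_multiplier = heavy_unit_cost / baseline_unit_cost

print(unit_cost_multiplier)  # 5.0
```

Whatever baseline you pick, the multiplier is the same: each pull request costs five times as many tokens, before anyone asks whether the pull request was worth merging.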

Faros AI drew on two years of customer data and found that code churn, lines of code deleted versus lines added, had increased 861% under high AI adoption. GitClear's January 2026 report found that regular AI users averaged 9.4x higher code churn than their non-AI counterparts, more than double the productivity gains the tools provided.

More code is being written. Most of it is not sticking.

Alex Circei, CEO of Waydev, which tracks developer analytics across more than 10,000 engineers, put the problem precisely. Engineering managers are seeing code acceptance rates of 80 to 90 percent, but they are missing the churn that happens when engineers have to revise that accepted code in the following weeks. Once those revisions are counted, the real-world acceptance rate drops to between 10 and 30 percent of generated code.
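One way to read that gap is as two different ratios over the same generated code. The sketch below is an assumption about how such a calculation could look, not Waydev's actual definition, and the line counts are invented so that the results land inside the ranges quoted above.

```python
def acceptance_rates(generated_lines: int, accepted_lines: int, rewritten_later: int):
    """Return (initial acceptance rate, acceptance rate after later rewrites).

    Hypothetical definition: accepted lines that are rewritten in the
    following weeks no longer count as accepted.
    """
    if generated_lines <= 0:
        raise ValueError("generated_lines must be positive")
    if not 0 <= rewritten_later <= accepted_lines <= generated_lines:
        raise ValueError("expected 0 <= rewritten_later <= accepted_lines <= generated_lines")
    initial = accepted_lines / generated_lines
    surviving = (accepted_lines - rewritten_later) / generated_lines
    return initial, surviving


# Illustrative numbers only: 1,000 generated lines, 850 accepted at review,
# 650 of those revised in the weeks that follow.
initial_rate, surviving_rate = acceptance_rates(1000, 850, 650)
print(f"{initial_rate:.0%} accepted at review, {surviving_rate:.0%} still standing")
# 85% accepted at review, 20% still standing
```

A dashboard that stops at the first number reports success. The second number is the one that describes the codebase.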

Token usage measures how aggressively a developer consumes AI compute. It says nothing about whether the output shipped, whether it held up in production, whether it was the right thing to build, or whether a senior engineer would need to rewrite it three weeks later.

What the Research Actually Says (and It Is Uncomfortable)

The METR randomized controlled trial from July 2025 tested 16 experienced open-source developers across 246 real tasks. The result shocked the industry.

When developers were allowed to use AI tools, they took 19% longer to complete tasks than when they worked without them.

The more alarming part: before the study, developers predicted AI would speed them up by 24%. After completing it, they still estimated they had been 20% faster, despite objective measurement proving the opposite. The perception gap was total.

This is not an argument against AI tools. The same period saw Google's 2025 DORA report show that AI adoption correlates with higher software delivery throughput for teams that have invested in strong engineering foundations first. The difference lies in how AI is being used and, critically, how productivity is being measured.

Martin Fowler, Chief Scientist at Thoughtworks and one of the most influential voices in software engineering, wrote in August 2025 that most LLM usage in the industry is "fancy auto-complete," and that the developers getting the most value are those who allow AI to directly read and edit source code files, not just suggest snippets. Fowler has long argued that measuring individual developer productivity is a "fool's errand," and the AI era has only deepened that conviction. Optimizing for the speed of code production, he argues, misses the point entirely, much like measuring a novelist's productivity by words per minute rather than by the quality of the narrative.

Nicole Forsgren, a co-creator of DORA and SPACE and lead author of Accelerate, said it plainly in late 2025: "AI broke our developer productivity metrics. Lines of code? Meaningless. Commits? Not the point. Velocity? Can be misleading. We need new frameworks for measuring DevEx in the age of AI."

So What Should You Actually Measure?

The answer is not a single number. It never was. Here are the frameworks and signals that hold up:

1. Business Outcomes, Not Activity Proxies

The only honest question is: did the software deliver value? Did the feature ship and work? Did it reduce support tickets, increase revenue, or improve retention? These are hard to measure in a sprint, which is exactly why organizations reach for activity proxies. Resist it.

2. Code Durability

Instead of counting how many lines were written or tokens consumed, track how much of that code survives. Code durability, the ratio of code that stays in the codebase after 30, 60, and 90 days, is a far more honest signal of whether the work was good. High churn under AI adoption is a red flag that should be on every engineering dashboard in 2026.
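As a rough sketch of what a durability number could look like, assume you already have one record per added line, with the date it was added and the date it was removed, if it was. The record shape, the function name, and the sample dates are all hypothetical. Extracting this from version-control history is a separate job and is not shown here.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class LineRecord:
    added: date
    removed: Optional[date] = None  # None means the line is still in the codebase


def durability(lines, window_days: int, as_of: date) -> Optional[float]:
    """Share of lines that stayed in place for at least `window_days`.

    Lines younger than the window are skipped: they have not had the
    chance to survive or fail yet.
    """
    window = timedelta(days=window_days)
    eligible = [ln for ln in lines if ln.added + window <= as_of]
    if not eligible:
        return None
    survived = sum(
        1 for ln in eligible
        if ln.removed is None or ln.removed - ln.added >= window
    )
    return survived / len(eligible)


# Illustrative data only.
sample = [
    LineRecord(added=date(2026, 1, 1)),                             # never removed
    LineRecord(added=date(2026, 1, 1), removed=date(2026, 1, 15)),  # gone after 14 days
    LineRecord(added=date(2026, 1, 1), removed=date(2026, 2, 15)),  # gone after 45 days
    LineRecord(added=date(2026, 1, 1), removed=date(2026, 3, 17)),  # gone after 75 days
    LineRecord(added=date(2026, 5, 25)),                            # too recent to judge
]

report = {days: durability(sample, days, as_of=date(2026, 6, 1)) for days in (30, 60, 90)}
print(report)  # {30: 0.75, 60: 0.5, 90: 0.25}
```

Skipping lines that are younger than the window matters. Without it, a burst of fresh AI-generated code would inflate the ratio simply because nobody has had time to rewrite it.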

3. The SPACE Framework

Developed by Nicole Forsgren and colleagues from GitHub and Microsoft Research, SPACE measures five dimensions simultaneously: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. The key insight is that no single dimension tells the full story. A team shipping more PRs but burning out is not a productive team.

4. DORA Metrics (With AI-Era Adjustments)

Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Restore remain the gold standard for measuring the health of your delivery pipeline. In the AI era, Change Failure Rate and Mean Time to Restore are more important than ever, because the speed gains from AI mean you can ship broken code faster too. Add AI attribution and complexity-adjusted throughput to see the full picture.
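For teams that do not yet track these, the four numbers can be derived from a plain list of deployment records. The following is a minimal sketch under stated assumptions: the `Deployment` record, its field names, and the sample week are hypothetical, lead time is reported as a median, and the AI attribution and complexity adjustments mentioned above are left out.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deployments, period_days: int) -> dict:
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(
            hours_between(d.committed_at, d.deployed_at) for d in deployments
        ),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": mean(restore_hours) if restore_hours else None,
    }


# Illustrative week of deployments.
week = [
    Deployment(datetime(2026, 6, 1, 9), datetime(2026, 6, 1, 13)),
    Deployment(datetime(2026, 6, 2, 9), datetime(2026, 6, 2, 17),
               failed=True, restored_at=datetime(2026, 6, 2, 19)),
    Deployment(datetime(2026, 6, 3, 10), datetime(2026, 6, 3, 12)),
    Deployment(datetime(2026, 6, 4, 8), datetime(2026, 6, 4, 14)),
]

summary = dora_summary(week, period_days=7)
print(summary)
```

For this sample week, the change failure rate is 0.25, the median lead time is 5 hours, and the mean time to restore is 2 hours. If AI doubles the number of deployments while the failure rate holds, the absolute number of incidents doubles too, which is why the last two metrics deserve the extra attention.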

5. Specification Quality

Here is the new skill nobody is measuring yet. In the Agentic Engineering era that Andrej Karpathy described at AI Ascent 2026, the developer's primary output is increasingly the specification, not the code. The human role shifts from writing code to owning the spec, design, and judgment calls. A developer who writes a rigorous, well-scoped specification that agents can execute correctly is more productive than one who burns tokens on vague prompts and spends days fixing hallucinated output.

The Skills That Actually Matter Now

Kent Beck said something quietly devastating: "90% of my skills are now worth $0. But the other 10% are worth 1000x."

That 10% is not typing speed. It is not syntax memorization. It is not the ability to recall a library API from memory.

The skills that compound in the AI age are:

Problem formulation. The ability to articulate what needs to be built clearly enough that an agent can execute it without ambiguity.

Architectural judgment. You can outsource implementation to agents. You cannot outsource the ability to catch a subtle logical error or a design decision that will haunt the codebase for years.

Agent orchestration. Running parallel agents, reviewing their outputs critically, understanding when to trust and when to reject. This is the new technical skill floor.

Context curation. Knowing what context to give an AI and what to leave out. A developer who extracts the 20 relevant lines from a 10,000-line codebase and frames the right question consistently outperforms one who pastes the entire repo and hopes for the best.

None of these skills show up in a token count.

The Verdict

Is measuring token usage the same as measuring lines of code?

Yes. Structurally, they are the same mistake.

Both measure an input to the process instead of the output. Both are easily gamed once they become a target. Both create perverse incentives: write more code, burn more tokens, look productive while potentially reducing the quality and maintainability of the system. And both measure the wrong end of the value chain entirely.

The engineers doing the most valuable work in 2026 might be consuming far fewer tokens than their tokenmaxxing colleagues. They might be spending hours in a whiteboard session killing a bad idea before a single prompt is written. They might be writing a specification so precise that an agent executes the entire feature on the first pass. That work is nearly invisible to any activity-based metric.

Goodhart's Law will always win. When token spend becomes a target, it will cease to be a good measure. It will just become the new leaderboard. And we will spend another decade learning the same lesson Fred Brooks already taught us in 1975.

The question is not how much AI compute a developer burns. The question is whether the software ships, holds up, and matters.

Everything else is noise.

References and Further Reading

Martin Fowler, Some thoughts on LLMs and Software Development, martinfowler.com, August 2025
Becker et al., Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, METR / arXiv:2507.09089, July 2025
Tim Fernholz, Tokenmaxxing is making developers less productive than they think, TechCrunch, April 2026
Nicole Forsgren, Frictionless: Seven Steps to Help Engineering Teams Move Faster in the Age of AI, 2025
Faros AI, AI Acceleration Whiplash, March 2026
GitClear, Developer Cohort Analysis: AI Coding Output, January 2026
Andrej Karpathy at Sequoia Capital AI Ascent 2026, karpathy.bearblog.dev
Kent Beck, TDD, AI agents and coding, The Pragmatic Engineer, July 2025
Gergely Orosz, How AI is changing Software Engineering, The Pragmatic Engineer, April 2026
Frederick P. Brooks Jr., The Mythical Man-Month: Essays on Software Engineering, 1975

Sayed Ali Alkamel is Manager of Digital Application Platforms at Oman Housing Bank, a Google Developer Expert in Dart and Flutter, and co-founder of Flutter MENA. He ships production systems at the intersection of fintech, mobile, and AI infrastructure.

Top comments (2)

Alex Shev • Jun 16

Token usage is a useful operational metric, but it is dangerous as a productivity metric by itself. The better signal is probably tokens per accepted change or tokens per resolved task, because raw volume can reward noisy loops.

Alex Shev • Jun 17

Token usage can be a diagnostic, but it becomes dangerous as a productivity metric. It tells you how much model attention you burned, not whether the work got safer, smaller, or easier to maintain. I would use it like cloud spend: useful for spotting waste, terrible as the scoreboard.