Pico

Counting Bullets: Why Token Burn Is the Wrong Metric for Agent Work

Meta and OpenAI are running internal leaderboards. Not for commits shipped, or bugs fixed, or products launched. For tokens consumed.

One OpenAI engineer reportedly burned through 210 billion tokens in a single week — the equivalent of reading 33 Wikipedias, processed and discarded. This is apparently now a performance metric worth tracking. The phenomenon has a name: tokenmaxxing.

Gizmodo likened it to telling soldiers to gauge their battlefield success by the number of bullets fired. They're right. But the real problem is subtler, and it cuts deeper into how we're thinking about agent productivity in 2026.


The Overhead Problem

Tyler Folkman ran an experiment this week. He asked a simple question — "what are the 3 largest cities in Utah?" — through a raw API call and through a modern agentic framework (LangChain's Deep Agents).

Raw API: 77 tokens.
Through the agent framework: 5,983 tokens. Seven LLM calls. A 78x multiplier.

For a more complex task — a bug fix requiring file reads and edits — the ratio was 34x. The agent consumed 151,120 tokens to complete work the raw API would have handled in 4,492.
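The multipliers follow directly from the reported token counts; a quick sanity check of the arithmetic:

```python
# Token multipliers recomputed from the article's reported numbers.
simple_raw, simple_agent = 77, 5_983          # "3 largest cities in Utah?"
complex_raw, complex_agent = 4_492, 151_120   # bug fix with file reads and edits

simple_multiplier = simple_agent / simple_raw
complex_multiplier = complex_agent / complex_raw

print(f"simple task: {simple_multiplier:.0f}x overhead")   # ~78x
print(f"complex task: {complex_multiplier:.0f}x overhead") # ~34x
```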

Where does the overhead come from? Not intelligence. Not capability. Scaffolding.

  • System prompt: ~400 tokens
  • Todo middleware: ~400 tokens
  • Tool schemas, sub-agent instructions, JSON serialization boilerplate: 3,000+ tokens

Every tool call, every sub-agent spawn, every status update — all tokens. For a frontier model with a million-token context window, this is noise. For a 14B local model with 32K context, this scaffolding consumes 19% of available working memory before the agent sees a single word of your actual task.
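The 19% figure checks out if you take the full measured overhead from the trivial-query experiment (~6,000 tokens, not just the three line items listed) against a 32K window:

```python
# Scaffolding as a share of a small local model's context window.
# Uses the ~6,000-token total overhead measured on the trivial query,
# which already includes system prompt, middleware, and tool boilerplate.
overhead_tokens = 5_983
context_window = 32_000  # 14B local model with a 32K window

print(f"{overhead_tokens / context_window:.0%} of working memory consumed "
      "before the agent sees your task")
```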

Now imagine a company whose engineers are being evaluated on token consumption. They are, structurally, being incentivized to build more scaffolded, more overhead-heavy, more wasteful agent architectures. The metric rewards the pathology.


What Efficiency Actually Looks Like

While Meta's engineers were climbing token leaderboards, a developer named Dan Woods applied Apple's "LLM in a Flash" research to Qwen3.5-397B, a 397-billion-parameter Mixture-of-Experts model. Because only a small set of experts fires for any given token, roughly 17 billion parameters need to be active at any moment; streaming the inactive expert weights from SSD to RAM on demand keeps just that active set resident. On a MacBook with 48GB RAM: 5.5 tokens per second. On higher-end Apple Silicon: reportedly ~20 tokens per second.

A 397 billion parameter model running locally, for free, at frontier quality, with zero API cost per token.

The architecture is massive in theory but sparse in practice. Not all of the model needs to be active for any given token. Efficiency isn't a constraint — it's the design.
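The sparsity is easy to quantify from the figures above: the active set is a small fraction of the full parameter count.

```python
# Sparse MoE accounting: only a fraction of parameters fire per token.
total_params = 397e9   # full model size (reported)
active_params = 17e9   # parameters active per token (reported)

active_fraction = active_params / total_params
print(f"~{active_fraction:.1%} of the model does the work for any given token")
```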

This is the right mental model for agent work too. The technology to make tokens cheap is already here. The question that matters is whether our tooling to measure what those tokens accomplish can keep up.


The Right Metric

Here's what I'd actually want to measure:

```
Agent Efficiency Ratio = Tasks Completed Successfully / (Total Token Cost × Revision Count)
```

Or more practically:

| Metric | What it measures |
| --- | --- |
| Task completion rate | Did the agent finish what it started? |
| First-attempt success rate | How often does the agent need corrections? |
| Tokens per completed task | Cost-normalized output |
| Revision ratio | Rework as a fraction of total work |
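These metrics can be sketched as a small scoring harness. The record fields and the sample numbers below are my own illustrative assumptions, not anything measured in the article:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    completed: bool  # did the agent finish the task?
    tokens: int      # total tokens the task consumed
    revisions: int   # human corrections needed after the first attempt

def report(records: list[TaskRecord]) -> dict[str, float]:
    # Assumes at least one record and at least one completed task.
    done = [r for r in records if r.completed]
    return {
        "task_completion_rate": len(done) / len(records),
        "first_attempt_success_rate":
            sum(r.completed and r.revisions == 0 for r in records) / len(records),
        "tokens_per_completed_task": sum(r.tokens for r in records) / len(done),
        "revision_ratio": sum(r.revisions for r in records) / len(records),
    }

# Hypothetical week of agent work:
example = [
    TaskRecord(completed=True, tokens=10_000, revisions=0),
    TaskRecord(completed=True, tokens=2_000, revisions=3),
    TaskRecord(completed=False, tokens=5_000, revisions=2),
]
print(report(example))
```

Note that every metric here normalizes by outcomes, not by raw token counts, which is the whole point.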

An agent that burns 10,000 tokens and ships a working PR is better than an agent that burns 2,000 tokens and produces something you have to fix three times. The second agent has the lower token count but the higher real cost once rework is included.


Why This Matters Right Now

Two things are happening simultaneously:

Enterprises are adopting agents at scale. Karpathy says he hasn't written a line of code since December — he only directs agents. OpenCode, the open-source Claude Code alternative, has over 120,000 GitHub stars and 5 million monthly users. This is not a fringe technology.

Nobody has good evaluation tooling. The measurement infrastructure hasn't kept up with the deployment reality. The only number that's easy to collect is tokens, so tokens become the proxy metric — even though tokens measure consumption, not value.

The companies that figure out outcome-based agent measurement first will have two advantages:

  1. They'll build better agent architectures (optimize for task completion, not token throughput)
  2. They'll be able to make a business case that doesn't collapse when someone asks "but what did it actually do?"

The Punchline

Companies measuring token burn as a productivity metric is a symptom, not the disease. The disease is that we don't yet have good ways to measure what agents actually accomplish.

The irony: the companies building evaluation infrastructure right now — the ones figuring out task completion rates, revision ratios, outcome-per-dollar — will be able to demonstrate exactly why the leaderboard people are burning tokens for nothing.

The battlefield analogy is right. Count the territory taken, not the bullets fired.


Håkon Åmdal builds AgentLair, email and identity infrastructure for AI agents. He runs a Claude agent that autonomously handles code, outreach, and operations — and measures it by tasks completed, not tokens burned.
