Varun Pratap Bhardwaj

Posted on Jun 23 • Originally published at qualixar.com

Microsoft Killed a Tool Its Own Engineers Loved. Here's the Token-Economics Lesson.

#ai #career #productivity #programming

Microsoft handed its engineers Claude Code. Then, sometime around the end of June 2026, it took it away — not because the tool underperformed, but because the token bills burned through an entire division's annual budget. The replacement: Copilot CLI, Microsoft's own product, which costs the company far less per token.

That one decision tells you more about where AI is headed than six months of analyst briefings.

I've spent 15 years building the systems these companies run on — enterprise architecture, reliability, the operational plumbing. I've watched many technology waves hit. This one is different, and not in the way the headlines say. The real story is not that AI is taking jobs. The real story is that AI, unmanaged, is bankrupting the budget when it works. That distinction matters enormously for anyone building or integrating these systems.

▶ I traced the whole chain in this 11-minute film — watch it first if you want the narrative: https://youtu.be/x1l7uWKsN_E

What a token actually is (and why it's the meter)

Before the economics, one paragraph on the mechanics, because the cost structure follows directly from how these models work.

A large language model is, at its core, a next-token predictor. Given a sequence of tokens — subword units, roughly 0.75 words each — the model outputs a probability distribution over what token comes next, then samples from it. Every forward pass costs compute. That compute is what you pay for when you call an API.

The billing meter is simple: tokens in plus tokens out, with output tokens typically priced higher than input tokens. Every token of your system prompt, every line of code in the context window, every word the model writes back: all metered. At scale, across thousands of engineers running multi-step agentic workflows with large context windows, this is not a rounding error. It is the dominant cost line.
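As a rough sketch of that meter, the function below prices one API call from its input and output token counts. The function name and the per-million-token prices are placeholder assumptions for illustration, not any provider's actual rates.

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Cost of one API call, given prices in USD per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE = 3.00    # USD per million input tokens (assumed)
OUTPUT_PRICE = 15.00  # USD per million output tokens (assumed)

# One agent step: 50,000 tokens of context in, 2,000 tokens written back.
step_cost = call_cost_usd(50_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"one step: ${step_cost:.2f}")
```

Under these assumed prices, most of the cost of a single step comes from the context sent in, not from the text the model writes back.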

This is why token economics is not an abstract finance problem. It is an engineering problem, and the engineers who understand it will be the ones who stay in the room.

The Microsoft case

The facts, sourced to The Verge and Windows Central: Microsoft deployed Claude Code to engineers, adoption was real, and engineers found it useful. Then the bills arrived. Token consumption at scale, with agentic coding assistants that iterate autonomously — reading files, running tests, generating diffs, re-reading files — produces token counts that look nothing like a typical chat session. The division's annual budget was gone. Microsoft moved engineers to Copilot CLI, a tool it controls and can price internally.

This is not a cautionary tale about AI failing. It is a cautionary tale about deploying AI without a token budget, usage telemetry, or a cost model. The tool worked. The governance didn't exist.

The Uber numbers

Uber's case is even sharper because we have more detail. The reported figures: 5,000 engineers with access to Claude Code, 84% adoption within months, power users burning up to $2,000 per month each, and the company's entire 2026 AI budget exhausted in four months. (Source: Bloomberg, TechCrunch, Fortune.)

Run the arithmetic yourself. 5,000 engineers, even if only 10% are "power users" at $2,000/month, is $1,000,000 per month from that cohort alone. In an agentic workflow, a single task — "refactor this service, write tests, validate against the existing test suite" — can consume tens of thousands of tokens as the model iterates. Multiply by however many tasks a productive engineer runs per day. The number becomes obvious in hindsight. Nobody modeled it in advance.
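The same arithmetic, spelled out. The 10% power-user share is the illustrative assumption from the paragraph above, not a reported figure, and $2,000 is the reported upper bound rather than an average, so read the result as a ceiling for that cohort.

```python
engineers = 5_000          # reported number of engineers with access
power_user_percent = 10    # assumption: share of engineers who are power users
power_user_spend = 2_000   # reported upper bound, USD per engineer per month

power_users = engineers * power_user_percent // 100
monthly_spend = power_users * power_user_spend
four_month_spend = monthly_spend * 4

print(f"{power_users} power users -> ${monthly_spend:,} per month")
print(f"four months of that cohort alone -> ${four_month_spend:,}")
```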

The counterintuitive part: prices fell, bills rose

Here is the number that should stop you cold: token and inference prices fell roughly 60–80% over the same period the Microsoft and Uber bills went vertical. (Source: artificialanalysis.ai and provider pricing trackers.)

The unit cost of intelligence collapsed. The total bill still exploded.

This is a textbook instance of what economists call the Jevons paradox: when a resource becomes cheaper to use per unit of work, demand can grow so much that total consumption rises instead of falling. The canonical example is coal in 19th-century Britain: Watt's more efficient steam engine made coal cheaper to use per unit of work, so industry used enormously more coal, and total coal consumption went up. The same dynamic is running in AI right now.

Cheaper tokens make more use-cases economically viable. More use-cases get built. Each use-case consumes tokens. Engineers, once unblocked by cost, run more iterations, use larger context windows, build more agentic loops. The feedback is fast and the effect is non-linear. Nobody managing a budget modeled this because it's not obvious until you're four months into a fiscal year and the money is gone.
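A small sketch of why falling prices and rising bills can coexist. The function names and the usage multipliers are hypothetical; only the 60–80% price-drop range comes from the figures cited above.

```python
def relative_bill(price_drop: float, usage_multiplier: float) -> float:
    """New bill as a multiple of the old one.

    price_drop is a fraction (0.7 means prices fell 70%);
    usage_multiplier is new token volume divided by old token volume.
    """
    return (1.0 - price_drop) * usage_multiplier


def breakeven_usage_multiplier(price_drop: float) -> float:
    """Usage growth at which the total bill stops shrinking."""
    return 1.0 / (1.0 - price_drop)


# Assumed scenario: prices fall 70%, token volume grows 10x.
print(relative_bill(0.70, 10.0))
print(breakeven_usage_multiplier(0.70))
```

With a 70% price drop, the bill starts growing once usage rises by more than roughly 3.3x. Agentic workflows can clear that bar easily.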

Satya Nadella named the implication directly in a post seen by tens of millions in mid-June 2026: every company must now build two kinds of capital, human capital and token capital, the latter being the models, data, and compute you own instead of rent. That framing is not rhetorical. It is an accurate description of a new cost structure that most finance teams do not yet have tooling for.

Why this is an AI Reliability Engineering problem

I use the term AI Reliability Engineering deliberately, because it maps precisely onto what we already know how to do in software.

In traditional systems, reliability engineering means: define SLOs, instrument everything, understand failure modes, build circuit breakers and fallbacks, operate within resource budgets. You do not deploy a service that can consume unbounded CPU without capping it. You instrument latency and error rates from day one. You test under load before you hit production.

None of this was applied to these AI deployments. The failures are identical in structure to the reliability failures I've spent 15 years watching: deploy first, instrument later, pay the bill when it arrives.

Token management is resource management. It needs the same treatment: measure before you scale, set hard budgets at the team and task level, build telemetry that surfaces cost per feature and cost per developer, and treat a budget overrun as an incident, not a line item to negotiate next quarter.

The specific failure mode at Microsoft and Uber was not that AI is expensive. It was that nobody built the harness.

The discipline you can apply Monday

This is not theoretical. Here is what the discipline looks like in practice, starting with how you structure the work before you spend a single token.

Spec first, execute second. The most expensive thing an AI agent can do is iterate toward an underspecified target. If you hand a coding agent "improve the authentication module," it will read every file that might be relevant, generate a plan, generate code, generate tests, discover the tests fail, re-read context, try again. Each loop is tokens. A precisely scoped task — "add rate limiting to the POST /login endpoint, 5 attempts per minute per IP, using the existing Redis client at src/cache/redis.go, write unit tests using the existing test harness, touch no other files" — costs a fraction of the open-ended version and fails in more predictable ways.

This is not a new discipline. It is what a good tech lead does when writing a story for a junior engineer. The AI just makes the cost of not doing it visible.

Diagnose before you run. Before an agent executes a multi-step task, spend a small token budget on diagnosis: what is the actual state of the system, what are the dependencies, what will break. This is the equivalent of reading the codebase before you start writing. An agent that skips this step will discover blocking issues late in the task, after burning tokens on work that cannot be committed.

Test cheap, fail small, then spend big. Build a graduated token budget for each class of task. Run the first 10–50 test iterations on a cheaper model or with a constrained context window. Let it fail. Understand the failure modes. Then spend the larger token budget on the full execution. The cost difference between getting this right and running everything at full scale from the start can be an order of magnitude.
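One way to encode that gate, as a minimal sketch. Here `cheap_trial` and `full_run` are hypothetical callables standing in for a run on a cheaper model and the full-budget execution, and the 80% pass threshold is an arbitrary assumption.

```python
from typing import Callable, Optional


def graduated_run(cheap_trial: Callable[[], bool],
                  full_run: Callable[[], str],
                  trials: int = 10,
                  min_pass_rate: float = 0.8) -> Optional[str]:
    """Run cheap trials first; pay for the full run only if enough pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if cheap_trial())
    if passes / trials < min_pass_rate:
        return None  # fail small: revisit the spec before spending more
    return full_run()
```

Returning `None` instead of retrying is deliberate in this sketch: a failed trial stage is a signal to fix the spec, not to spend more tokens.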

Instrument everything. You cannot manage what you do not measure. Token consumption per task, per developer, per feature, per sprint — this is operational data, not just billing data. The companies that know this number today will be the ones that can actually operate AI at scale in 18 months. The companies discovering it for the first time on a quarterly invoice will be the ones explaining to the board why the AI budget is gone.
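A minimal sketch of what that telemetry could look like, assuming each agent task emits one usage record. The `UsageRecord` fields, the `cost_by` helper, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    developer: str
    feature: str
    task_id: str
    tokens: int
    cost_usd: float


def cost_by(records: list, key: str) -> dict:
    """Total cost in USD grouped by one record field, e.g. 'developer'."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


# Made-up sample data, for illustration only.
records = [
    UsageRecord("ana", "login", "t1", 40_000, 1.50),
    UsageRecord("ana", "search", "t2", 10_000, 0.25),
    UsageRecord("raj", "login", "t3", 60_000, 2.25),
]

print(cost_by(records, "developer"))
print(cost_by(records, "feature"))
```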

Treat the token budget like a resource limit, not a line item. A circuit breaker that stops an agent when it hits 100K tokens on a task that should take 10K is the same as a timeout on a database query. It is not a restriction on AI usefulness. It is the operational discipline that keeps the system from eating the budget.
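A sketch of such a circuit breaker, assuming the agent loop can report the tokens each step consumed. `TokenBudget`, `TokenBudgetExceeded`, and `run_task` are illustrative names, not part of any real SDK.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a task spends more tokens than its budget allows."""


class TokenBudget:
    """Hard per-task token cap, analogous to a query timeout."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record one step's usage; trip the breaker once over the limit."""
        self.used += tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(
                f"used {self.used} tokens, limit is {self.limit}"
            )


def run_task(step_token_counts: list, budget: TokenBudget) -> int:
    """Charge each step in order; return the number of steps completed."""
    completed = 0
    for tokens in step_token_counts:
        budget.charge(tokens)  # once this raises, no further steps run
        completed += 1
    return completed
```

In this sketch the breaker trips after the step that crosses the limit, so the overshoot is bounded by one step's usage. A stricter version would estimate the next step's cost before running it.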

What the context window does to this

One detail that compounds all of the above: context windows have grown dramatically. A model that can hold 200K tokens in context is genuinely more powerful for complex tasks — it can reason over large codebases, long conversation histories, extensive documentation. It is also, by construction, more expensive per inference when that context is populated.

Agentic systems compound this further. In a multi-step agent loop, the growing context of what has happened so far (tool outputs, intermediate reasoning, prior code generations) accumulates across turns and is re-sent on every step. A task that takes 20 agent steps, each carrying everything the previous steps produced, does not cost 20x a single step. If the context grows by a roughly constant amount per step, total input tokens grow with the square of the step count, so the cost curve is much steeper than linear.
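To make the shape of that curve concrete, the sketch below totals the input tokens processed over an agent loop under a simple assumed growth model: every step re-sends the full context so far and adds a fixed number of new tokens. The 2,000-token figure is an arbitrary assumption, and the model ignores output tokens and any caching discounts.

```python
def total_input_tokens(steps: int, new_tokens_per_step: int) -> int:
    """Sum of context sizes across all steps of an agent loop."""
    context = 0
    total = 0
    for _ in range(steps):
        context += new_tokens_per_step  # history grows every step
        total += context                # and is re-read in full
    return total


single_step = total_input_tokens(1, 2_000)
twenty_steps = total_input_tokens(20, 2_000)
print(twenty_steps, twenty_steps // single_step)
```

Under this model, 20 steps process 210 times the input tokens of one step, not 20 times.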

This is not an argument against large context windows. It is an argument for understanding that you are paying for every token in that window on every forward pass, and for building systems that manage context efficiently — summarizing completed steps, trimming irrelevant history, structuring the agent loop to minimize unnecessary context accumulation.
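One of those techniques, trimming history, can be sketched as follows. The `summarize` callable is a hypothetical stand-in for whatever produces the summary (possibly a cheap model call); here it is stubbed with a placeholder string.

```python
from typing import Callable


def trim_history(messages: list, keep_last: int,
                 summarize: Callable[[list], str]) -> list:
    """Keep the most recent steps verbatim; collapse older ones to a summary."""
    split = len(messages) - keep_last
    if split <= 0:
        return list(messages)  # nothing old enough to summarize
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


def stub_summary(older: list) -> str:
    # Placeholder only: a real system would generate an actual summary.
    return f"[summary of {len(older)} earlier steps]"


history = [f"step {i}" for i in range(1, 11)]
trimmed = trim_history(history, keep_last=3, summarize=stub_summary)
print(trimmed)
```

The trade-off is that the summary itself costs tokens to produce and may drop detail the agent later needs, so the `keep_last` window is worth tuning per task class.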

The broader picture

MIT's Project NANDA found that 95% of enterprise generative AI pilots show no measurable P&L return. Gartner projects 40% of agentic AI projects will be cancelled by 2027. These numbers are not surprising if you understand the token economics. Projects are getting killed not because the AI failed to produce output, but because the cost of producing that output at scale was never modeled.

The companies that survive the next wave of AI integration will be the ones that treat token capital the way a finance team treats money: measured, budgeted, optimized. The engineers who build those systems — the harness, the guardrails, the human-in-the-loop checkpoints, the token telemetry — are not overhead. They are the mechanism by which AI delivers the return that 95% of projects are currently failing to produce.

I call this AI Reliability Engineering because that is what it is. It is the operational discipline that closes the gap between what a model can do in a demo and what it can sustainably do in production, at scale, without burning the budget in four months.

There is a real job here. The requisition does not exist yet at most companies. But the problem it solves is already showing up on the P&L.

The coda

The story the headlines are writing is "AI takes jobs." The story the data is telling is more precise: AI, unmanaged, takes budget. Managed well, it takes neither — it multiplies the output of every engineer who understands how to direct it.

Microsoft engineers who knew how to write a tight spec, run cheap diagnostic passes, and validate incrementally were the ones delivering real output before the token bills arrived. The engineers who ran open-ended agentic loops and hoped for the best were the ones contributing to the budget problem.

You were never the cost. You are the cure — but only if you build the skills to operate these systems reliably.

Don't trust that framing. Verify it. Start with the numbers above.

Sources

Microsoft kills Claude Code over token costs — The Verge (Tom Warren), Windows Central
Uber: 5,000 engineers, 84% adoption, up to $2,000/month power users, 2026 AI budget exhausted in 4 months — Bloomberg, TechCrunch, Fortune
Token/inference prices fell roughly 60–80% (2025–2026) — artificialanalysis.ai, provider pricing trackers
Satya Nadella: "human capital and token capital" — X (post, mid-June 2026), Yahoo Finance, Stocktwits
MIT Project NANDA: 95% of enterprise GenAI pilots show no measurable P&L return — MIT Project NANDA 2025
Gartner: 40% of agentic AI projects cancelled by 2027 — Gartner 2025
Accenture: worst single-day stock loss (~18%) in company history, June 19 2026 — Financial Times, CNBC
Accenture: outsourcing bookings down 15%; clients reallocating existing budgets — Accenture earnings call, FT, CNBC
NASSCOM: ~1 million AI professionals needed in India by 2027, fewer than 500,000 qualified today — NASSCOM

Varun Pratap Bhardwaj is the founder of Qualixar, building the AI Reliability Engineering category. The full video essay is at https://youtu.be/x1l7uWKsN_E.

DEV Community