The Token Got Cheaper. Your Bill Didn't.

#ai #enterprise #cost #inference

An enterprise client of an AI consultant SUPPOSEDLY spent half a billion dollars on Claude, by accident, in a single calendar month. (I am going to leave whether or not this is true as an exercise to the reader, because it LIKELY will happen... let's call it historical fiction?) Apparently, they had failed to set per-employee usage limits on the licenses, and the agentic workflows their employees were running compounded against each other until the bill hit the comma where it did. The consultant told Axios about it in late May. And while it is being called a "cautionary tale," the reality is that the cost structure of enterprise AI in 2026 is mismatched against the way it is being priced, sold, and budgeted for, by enough that one missing license control compounded to nine figures inside thirty days.

This is one of my BIGGEST pet peeves in the industry right now... per-token pricing for end-users.

Gartner's latest forecast says inference on a trillion-parameter LLM will cost more than 90% less in 2030 than it did in 2025. Epoch AI's tracker puts the year-over-year drop at roughly 10x for equivalent capability. Equivalent-to-GPT-4 performance, which cost more than $400 per million tokens in 2023, now sits at $0.40. That is, by every reasonable benchmark, the single largest deflation in price-per-unit performance any computing platform has ever produced.

And yet.

The companies actually deploying this technology in production are watching their monthly AI bills go up by roughly 320% year-over-year, against unit prices that fell something like 280x. Uber's CTO admitted (claimed?) in April the company had already burned through its entire 2026 Claude Code budget. We have a structural mismatch between what the industry is pricing and what the industry is actually buying. This is not going to last.
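To make the mismatch concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the bill is simply unit price times token volume, and it treats the 320% and 280x figures as if they covered the same window, which they may not. The function name is mine, not anything from a vendor or analyst.

```python
def implied_volume_multiplier(bill_increase_pct: float, price_drop_factor: float) -> float:
    """How many times more tokens must be consumed for the bill to rise by
    bill_increase_pct while the unit price falls by price_drop_factor.

    Assumes bill = unit_price * token_volume and nothing else (a simplification).
    """
    bill_multiplier = 1.0 + bill_increase_pct / 100.0  # +320% means 4.2x the old bill
    return bill_multiplier * price_drop_factor


# Two price-drop figures quoted in this article: ~10x per year and ~280x.
for drop in (10, 280):
    growth = implied_volume_multiplier(320, drop)
    print(f"price down {drop}x, bill up 320% -> token volume up ~{growth:,.0f}x")
```

Under those assumptions, a bill that grows 4.2x while the unit price falls 10x implies roughly 42x more tokens consumed, and against a 280x price drop the implied figure is over a thousand.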

The vendors know it, and the ones closest to the cost structure are repricing first. In the first week of June, the three tools that own agentic coding all stopped pretending the flat seat could survive contact with the actual cost of inference. GitHub Copilot moved to usage-based billing on June 1 (a monthly credit allotment, with metered tokens after that), and developers running long agent sessions opened their first invoice to jumps of 10x to 50x. Within forty-eight hours Cursor had carved its team plans into tiers with separate usage pools, and Cognition had relaunched Windsurf as a metered Devin. Three competitors who would happily watch each other go bankrupt made the identical unpopular move inside one week, which is as good a leading indicator as anything. The all-you-can-eat seat was a venture subsidy against a bill that has now come due, and a subsidy is the most expensive thing in the world to be a customer of right up until the moment it ends.

The arithmetic of the loop

The thing the per-token price chart does not tell you is how many tokens a single user request actually generates. In 2023, the typical "AI feature" inside an application was a single model call. The user typed a question, the model returned an answer, the bill was one round trip. The unit economics were simple enough: price per token times tokens per response times number of responses per day.
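That 2023-style formula fits in a few lines of Python. This is a minimal sketch: the workload numbers (1,000 tokens per response, 10,000 responses per day) are made up for illustration, and the price is the $0.40 per million tokens quoted above.

```python
def daily_cost_single_call(price_per_million_tokens: float,
                           tokens_per_response: int,
                           responses_per_day: int) -> float:
    """Daily spend, in dollars, for a feature that makes one model call per request."""
    price_per_token = price_per_million_tokens / 1_000_000
    return price_per_token * tokens_per_response * responses_per_day


# Hypothetical workload, not measured data.
cost = daily_cost_single_call(0.40, 1_000, 10_000)
print(f"${cost:.2f} per day")  # prints "$4.00 per day"
```

Three multiplications, one round trip per request. Everything that follows is about what happens when "one round trip" stops being true.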

In 2026, however, a modern agentic workflow, the kind every enterprise vendor is selling and every Fortune 500 is buying, calls the model somewhere between 10 and 20 times per user task. There is a planner call, a retrieval call, a verifier call, a tool-use call, a critique call, a refinement call, possibly a second retrieval informed by the critique, and a final answer-formatting call. Each of those calls is cheaper than the one call it replaced. The sum of all of them, against the same user task, is more expensive than the original was.
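Here is the same point as a small Python sketch. The stage names come from the list above; the dollar figures are invented purely to show the shape of the problem, and a real loop would repeat some stages to reach the 10 to 20 calls mentioned.

```python
# Hypothetical cost of the single 2023-style call this workflow replaced.
ORIGINAL_SINGLE_CALL_COST = 0.020

# Hypothetical per-call cost of each stage in the agent loop (dollars).
# Every stage is cheaper than the original call.
STAGE_COSTS = {
    "planner": 0.004,
    "retrieval": 0.004,
    "verifier": 0.004,
    "tool_use": 0.004,
    "critique": 0.004,
    "refinement": 0.004,
    "second_retrieval": 0.004,
    "format_answer": 0.004,
}


def task_cost(stage_costs: dict[str, float]) -> float:
    """Total cost of one user task: the sum over every model call in the loop."""
    return sum(stage_costs.values())


total = task_cost(STAGE_COSTS)
print(f"cheapest-looking call: ${max(STAGE_COSTS.values()):.3f}")
print(f"whole task:            ${total:.3f}")
print(f"original single call:  ${ORIGINAL_SINGLE_CALL_COST:.3f}")
```

With these made-up numbers, every individual call is five times cheaper than the original, and the task as a whole still costs more.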

The RAG pipelines that are now mandatory in any enterprise deployment make this worse, not better. Every retrieval-augmented call inflates the context window with retrieved documents, which means the input token count for the model balloons by a factor of three to five. The cost of an input token is lower than it has ever been, and the number of input tokens being shoved into every call is higher than it has ever been, and the two trends are not converging. They are diverging, and the divergence is the bill.
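A quick sketch of what that inflation does to a single call. The token counts and the split between input and output prices are assumptions for illustration only; the three-to-five-times factor is the one from the paragraph above.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one model call, with separate input and output pricing."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical call: 2,000 input tokens, 500 output tokens,
# $0.40 per million input tokens and $1.60 per million output tokens.
BASE_INPUT, OUTPUT = 2_000, 500
IN_PRICE, OUT_PRICE = 0.40, 1.60

base = call_cost(BASE_INPUT, OUTPUT, IN_PRICE, OUT_PRICE)
for factor in (3, 5):
    rag = call_cost(BASE_INPUT * factor, OUTPUT, IN_PRICE, OUT_PRICE)
    print(f"context x{factor}: call costs {rag / base:.1f}x the un-augmented call")
```

The answer the user sees is the same length either way. Only the context got bigger, and in this sketch that alone doubles or triples the cost of the call.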

Always-on monitoring agents, the ones every cybersecurity vendor and every observability platform is now shipping with a default-on toggle, are the third factor. A monitoring agent that runs continuously against a production data feed does not generate a single request per user. It generates a request per data point, continuously. The unit cost of that request is trivial, but the product of unit cost and request rate, over a month, is not trivial. It is the largest line item the buyer did not budget for. The unnamed half-a-billion-dollar customer is what happens when you stack all three of those factors on top of each other, give the result a default-on toggle, and then go home for a long weekend.
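The monitoring math is the simplest of the three, which is part of why it gets missed. A minimal sketch, assuming a feed of 50 data points per second and a cost of two hundredths of a cent per request (both numbers are hypothetical):

```python
SECONDS_PER_DAY = 86_400


def monthly_monitoring_cost(requests_per_second: float,
                            cost_per_request: float,
                            days: int = 30) -> float:
    """Monthly spend for an agent that issues one model request per data point."""
    requests_per_month = requests_per_second * SECONDS_PER_DAY * days
    return requests_per_month * cost_per_request


# Hypothetical feed: 50 data points per second at $0.0002 per request.
cost = monthly_monitoring_cost(50, 0.0002)
print(f"${cost:,.0f} per month")  # prints "$25,920 per month"
```

A request that costs a fraction of a cent turns into a five-figure monthly line item for one agent on one feed, before anyone has stacked the agent loop or the RAG inflation on top of it.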

Containers got cheap. The shipping business didn't.

The cleanest analogy here is the shipping container, and I am going to use it because the parallel is exact, not because it is fashionable.

Containerization, which arrived as a serious industrial standard in the late 1960s, reduced the cost of moving a ton of goods across an ocean by roughly an order of magnitude in fifteen years. The container itself became a commodity and the price of a single trans-Pacific shipment plummeted. By every measurable unit, the cost of moving cargo went down. YET the result was not that shipping got cheaper as a category. The result was that the volume of cargo being shipped exploded, because the cost reduction made entire product categories economic that previously were not. Cheap electronics. Fast fashion. Perishable food on long-haul routes. Just-in-time global manufacturing. None of it existed at meaningful scale before the container. All of it exists now.

The visible cost is the container price, which fell. The invisible cost is what the cheap container made possible: warehousing networks the size of small countries, the inventory-financing operations needed to keep them stocked, the customs and compliance infrastructure that absorbs the friction, and the consumer behaviors that assume a six-day delivery window from anywhere on Earth. The container did not save the world money. It moved the money from the moving of goods to the storing, financing, choreographing, and consuming of them. The bill went up. The container got cheaper. Both can be true.

A token is a container. The model call is the box. The thing you actually pay for in a 2026 production AI deployment is not the boxes. It is the warehouse: the data plane, the retrieval substrate, the orchestration layer, the eval harness, the safety review, the monitoring system that runs against your monitoring system. The token is what the vendor quotes you on. The warehouse is what you actually built.

The bill you have not seen yet

The dominant cost of a 2026 enterprise AI deployment is not the LLM bill. It is the data movement that feeds the LLM: every RAG retrieval pulls data from somewhere, and every agent invocation reads context from a database, a vector store, a cached document, a tool call, or an upstream system. The bytes moved per useful answer have gone up by orders of magnitude. The price of moving a byte across a public cloud has not gone down. In some regions, against some egress paths, it has gone up.

This is the place where the entire architecture conversation should be happening, and it isn't. The vendors are competing on price-per-token because that is the metric the customer is measuring. The customer is measuring price-per-token because that is the metric the vendor is publishing. Both sides agree to compete on the part of the bill that is collapsing, and quietly ignore the part of the bill that is growing. The result is a market in which the headline cost is falling 10x per year and the actual cost is going up, and nobody is willing to put both numbers on the same slide.

There is a version of enterprise AI architecture that handles this correctly, and it is the version where the compute moves to the data rather than the data moving to the compute. If the retrieval substrate sits next to the model, you stop paying egress fees. If the agent loop runs against a local cache of the relevant context, you stop paying for the redundant retrieval round-trips. If the monitoring agents run at the edge against the data they are monitoring, you stop paying to ship that data into a central inference cluster and back out again. The unit-cost-of-token chart says nothing about this, because it is not measuring it. The total bill does.
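A rough sketch of the egress side of that argument. Every number here is an assumption: the $0.09 per GB egress price is a placeholder rather than any provider's quote, the 5 MB moved per answer is invented, and `local_hit_rate` is a hypothetical knob for the share of answers served from data that already sits next to the compute.

```python
def monthly_egress_cost(answers_per_month: int,
                        gb_moved_per_answer: float,
                        egress_price_per_gb: float,
                        local_hit_rate: float = 0.0) -> float:
    """Egress spend for answers whose context must cross a paid network boundary.

    local_hit_rate is the fraction of answers served without paying egress
    (for example, from a cache or store co-located with the model).
    """
    if not 0.0 <= local_hit_rate <= 1.0:
        raise ValueError("local_hit_rate must be between 0 and 1")
    remote_answers = answers_per_month * (1.0 - local_hit_rate)
    return remote_answers * gb_moved_per_answer * egress_price_per_gb


# Hypothetical workload: 10 million answers a month, 5 MB of context each.
centralized = monthly_egress_cost(10_000_000, 0.005, 0.09)
colocated = monthly_egress_cost(10_000_000, 0.005, 0.09, local_hit_rate=0.9)
print(f"centralized: ${centralized:,.0f}   co-located: ${colocated:,.0f}")
```

Notice that no token price appears anywhere in this function. That is the point: this line item moves with where the data sits, not with what the model costs.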

Akamai and Comcast ran a benchmark on this in March: a voice small language model on four NVIDIA RTX PRO 6000 GPUs, comparing a single centralized cluster against an AI Grid distributed across four sites, under burst traffic. The distributed deployment ran 52.8% cheaper at baseline and 76.1% cheaper during bursts, with sub-500ms latency at P99 and an 80.9% throughput gain at peak. That is what the architecture conversation looks like when you measure the right thing. It is not a per-token comparison. It is a total-cost-of-delivering-the-answer comparison, and the centralized model loses.

Stop watching the wrong number

If you are signing a contract for AI infrastructure this quarter, stop optimizing for the per-token price. The price will keep falling, on a timescale that makes any contract you sign for it irrelevant inside of a year. The vendors competing on it are competing on the visibly cheap part of a cost structure that is shifting somewhere else.

Optimize for where the data sits, what it costs to move, and which calls have to round-trip through your central inference path. The bill you have not yet seen is in the egress line item, the vector store retrieval costs, and the monitoring spend that compounds while you sleep. The bill on the model is the easy one. It is also, increasingly, not the bill.

The half-a-billion-dollar customer set their license limits wrong. That was a control failure. The control failure is interesting because the thing it failed to control got large enough in a single month to make the news. Two years ago that same control failure would have produced a five-figure bill, the CFO would have noticed at the next quarterly review, and nobody would have written about it. The control failures are getting expensive faster than the controls are getting better. That gap is the part of the cost curve nobody has put on a chart.
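For what it is worth, the control itself is not exotic. Here is a minimal sketch of a per-user monthly cap in Python. The class and method names are hypothetical, this is not how any particular vendor's license controls work, and a real system would also need persistence, monthly resets, and alerting well before the hard stop.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a charge would push a user past their monthly cap."""


@dataclass
class UsageLimiter:
    monthly_cap_usd: float
    spent: dict[str, float] = field(default_factory=dict)

    def charge(self, user: str, cost_usd: float) -> float:
        """Record a charge and return the user's remaining budget.

        Rejects the charge (recording nothing) if it would exceed the cap.
        """
        new_total = self.spent.get(user, 0.0) + cost_usd
        if new_total > self.monthly_cap_usd:
            raise BudgetExceeded(f"{user} would reach ${new_total:.2f}, "
                                 f"cap is ${self.monthly_cap_usd:.2f}")
        self.spent[user] = new_total
        return self.monthly_cap_usd - new_total


limiter = UsageLimiter(monthly_cap_usd=100.0)
print(limiter.charge("alice", 60.0))  # 40.0 left
try:
    limiter.charge("alice", 50.0)     # would total 110.0, so it is refused
except BudgetExceeded as err:
    print(f"blocked: {err}")
```

The check happens before the spend is recorded, so an agent loop that hits the cap stops instead of compounding.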

The token got cheaper. Your bill didn't. Both of those things are true at the same time, and the gap between them is where the next decade of enterprise AI architecture is going to be decided.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I'd love to hear your thoughts!

Originally published at The Token Got Cheaper. Your Bill Didn't..