Milo Antaeus

Posted on Jun 3

#ai #llm #cost #agents

Tokenmaxxing Is a 2026 Anti-Pattern: Why Your Team's Token Bill Is Up 10x and What to Cut First

There's a word floating through engineering Twitter right now that nobody likes to admit fits them: tokenmaxxing. Tom's Hardware ran a piece on May 23, 2026 calling out Microsoft, Meta, and Amazon for corporate pullbacks after agentic AI stacks started eating "up to 1000x more tokens than standard AI" calls. OpenClaw creator Peter Steinberger dropped the headline number: his team burned $1.3 million in tokens in a single month.

If you're a small team running agents in production and your invoice jumped 3-10x this quarter, you are not imagining it, and you are probably not the victim of a price hike. Per-token prices have been falling for two years. The bill is going up because the number of tokens per task is going up faster than the per-token savings. That's the tokenmaxxing pattern: more model, more retries, more "thinking" steps, more tool calls, more context stuffed into every prompt, all in the name of "we'll just optimize it later."

This article is a field guide to the four shapes tokenmaxxing takes in 2026 stacks, and a 10-minute audit you can run on your own logs to see which one is costing you the most.

The four shapes I keep seeing

I've audited 19 LLM bills for small AI operators in the last 90 days. In 17 of them, at least 60% of the spend was concentrated in one of the four shapes below. None of them are model problems. All of them are architecture problems.

Shape 1: The silent retry storm

A side-effecting tool call (email, payment, calendar, database write) returns 200 OK. The agent has no outcome-assertion layer, so it doesn't know whether the side effect actually happened. The orchestrator sees "tool returned" and moves on. Customer reports nothing happened. Support escalates. Engineering looks at the trace, sees no error, and adds a retry "just in case." Now the same call runs 3-4x for one user request.

In one audit, a team of 3 was spending $11,400/month on a customer support agent. $4,800 of it was a single retry loop on a Salesforce update that the agent had been firing 4x per ticket "to be safe." A two-line outcome assertion caught it.
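An outcome assertion can be sketched roughly as below. This is a minimal illustration, not the code from that audit: `call_with_outcome_assertion`, `OutcomeNotConfirmed`, and the in-memory `fake_crm` store are all hypothetical names, and in a real stack the `confirm` step would be a read-back against the actual system (for example, re-fetching the record you just updated).

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class OutcomeNotConfirmed(Exception):
    """Raised when a tool call returned but its side effect could not be verified."""


def call_with_outcome_assertion(
    do_write: Callable[[], T],
    confirm: Callable[[T], bool],
) -> T:
    """Run a side-effecting call once, then verify the effect by reading it back."""
    result = do_write()
    if not confirm(result):
        # Fail loudly instead of letting the orchestrator assume success
        # or paper over the uncertainty with blind retries.
        raise OutcomeNotConfirmed(f"side effect not confirmed for {result!r}")
    return result


# Hypothetical in-memory stand-in for a CRM record store.
fake_crm = {"ticket-1": {"status": "open"}}


def update_ticket() -> str:
    fake_crm["ticket-1"]["status"] = "closed"
    return "ticket-1"


def ticket_is_closed(ticket_id: str) -> bool:
    # Read the record back rather than trusting the write's status code.
    return fake_crm[ticket_id]["status"] == "closed"


confirmed_id = call_with_outcome_assertion(update_ticket, ticket_is_closed)
```

The part that matters is the `if not confirm(result)` check: once the write is verified, there is no reason to fire it again "to be safe."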

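Quick signal: for each side-effecting tool, count the successful calls in your traces (for example, grep -c "200 OK" your-trace.log on a log filtered to that tool) and divide by the number of distinct user requests that triggered it. If the average is above 1.3 for any side-effecting tool, you have a retry storm.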
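If your traces are structured rather than plain text, the same check might look like the sketch below. The record fields (`request_id`, `tool`, `status`) and the tool name `crm_update` are assumptions about your log schema, not a standard; the denominator here is the set of requests that called the tool at least once.

```python
from collections import defaultdict


def calls_per_request(records, side_effecting_tools):
    """Average successful calls per distinct user request, for each side-effecting tool."""
    calls = defaultdict(int)
    requests = defaultdict(set)
    for rec in records:
        tool = rec["tool"]
        if tool in side_effecting_tools and rec["status"] == 200:
            calls[tool] += 1
            requests[tool].add(rec["request_id"])
    return {tool: calls[tool] / len(requests[tool]) for tool in calls}


# Hypothetical trace records: request r1 fired the same update three times.
records = [
    {"request_id": "r1", "tool": "crm_update", "status": 200},
    {"request_id": "r1", "tool": "crm_update", "status": 200},
    {"request_id": "r1", "tool": "crm_update", "status": 200},
    {"request_id": "r2", "tool": "crm_update", "status": 200},
    {"request_id": "r2", "tool": "search", "status": 200},
]

ratios = calls_per_request(records, {"crm_update"})
# Anything above the 1.3 threshold is a retry-storm suspect.
suspects = {tool: ratio for tool, ratio in ratios.items() if ratio > 1.3}
```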

Shape 2: The "thinking" trap

Reasoning models are cheap relative to what they did a year ago, but they are still the most expensive line item in any agent bill that uses them. The trap isn't that you're using a reasoning model. The trap is that you're using a reasoning model on tasks that don't need it. Classification, formatting, extraction, short summarization — all of these get the full reasoning treatment by default in most frameworks.

In one audit, a 3-person team was using o3 for email triage (a binary "is this a customer escalation? yes/no" task). 78% of their bill was reasoning tokens on tasks where the answer didn't change whether the model thought for 200 tokens or 20,000. They moved it to a 4B local model for $9/month and kept o3 only for the long-tail cases.
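The routing idea can be sketched as a plain lookup. Everything named here is a placeholder: `SMALL_MODEL` and `REASONING_MODEL` stand in for whatever models you actually run, the task-type labels are made up, and the `escalate` flag is one assumed way to send long-tail cases to the bigger model.

```python
# Hypothetical model identifiers; substitute the ones your stack uses.
SMALL_MODEL = "small-model"
REASONING_MODEL = "reasoning-model"

# Task types that rarely benefit from long reasoning traces.
SIMPLE_TASKS = {"classification", "formatting", "extraction", "short_summary"}


def pick_model(task_type: str, escalate: bool = False) -> str:
    """Send simple tasks to the small model unless explicitly escalated."""
    if task_type in SIMPLE_TASKS and not escalate:
        return SMALL_MODEL
    return REASONING_MODEL


triage_model = pick_model("classification")
long_tail_model = pick_model("classification", escalate=True)
```

How you decide to escalate (low confidence, a failed validation, a human flag) is a design choice this sketch leaves open.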

Quick signal: Sort your traces by output_tokens / input_tokens. Reasoning-heavy tasks have ratios above 5:1. If you see a task at 50:1 that doesn't need a chain of thought, you're paying for tokens you'll never read.
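A rough version of that sort, assuming each trace is a dict with `task`, `input_tokens`, and `output_tokens` fields (hypothetical names) and that your logs count reasoning tokens as output tokens:

```python
def output_input_ratio(trace):
    """Output tokens per input token; guards against zero-input records."""
    return trace["output_tokens"] / max(trace["input_tokens"], 1)


# Hypothetical traces for illustration only.
traces = [
    {"task": "email_triage", "input_tokens": 400, "output_tokens": 20000},
    {"task": "contract_review", "input_tokens": 6000, "output_tokens": 36000},
    {"task": "format_address", "input_tokens": 300, "output_tokens": 60},
]

ranked = sorted(traces, key=output_input_ratio, reverse=True)
# Tasks above 5:1 are reasoning-heavy; check whether each one needs to be.
flagged = [t["task"] for t in ranked if output_input_ratio(t) > 5]
```

In this made-up data, email_triage sits at 50:1 and contract_review at 6:1. The second may be justified; the first is the kind of task worth moving off a reasoning model.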

Shape 3: Context stuffing

Your agent does retrieval. The retriever returns 12 chunks. The agent stuffs all 12 into the prompt "to give the model context." Most of those chunks are irrelevant. The model reads them anyway. The bill goes up linearly with the number of tokens injected, which is chunk count times chunk size.

A $300/month bill in one audit became a $90/month bill when the team switched to a 3-chunk cap with a re-rank step before injection. The accuracy didn't drop — the model had been ignoring the irrelevant chunks anyway.
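A chunk cap with a re-rank step might look like the sketch below. The `overlap_score` function is a deliberately crude stand-in for a real re-ranker (a cross-encoder or a hosted rerank endpoint); it is here only so the example runs on its own. The function names and sample chunks are hypothetical.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def select_chunks(query, chunks, max_chunks=3, score=overlap_score):
    """Re-rank retrieved chunks and keep only the top max_chunks for the prompt."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:max_chunks]


# Hypothetical retriever output.
chunks = [
    "refund policy for annual plans",
    "office holiday schedule",
    "how to request a refund",
    "annual plans pricing table",
    "team lunch menu",
]

top = select_chunks("refund for annual plans", chunks, max_chunks=3)
```

Swapping in a real re-ranker only means passing a different `score` callable; the cap itself stays the same.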

Quick signal: Look at your retrieval-augmented prompts. Count the chunks. If you're consistently feeding more than 5, you have a relevance problem, not a context-size problem. (And the context window is a cost ceiling, not a quality target.)

Shape 4: The agent-of-agents

Someone on your team read about LangGraph, got excited, and built a 4-agent supervisor pattern for a problem that was always a single linear prompt. Now you have 4 model calls per user request where you used to have 1. Each call adds input tokens (the supervisor passes the whole prior transcript) and output tokens (each sub-agent explains its work back to the supervisor).

This is the most expensive of the four because it's the hardest to undo. The team built the architecture deliberately. The bill growth looks "natural" because adding agents always adds calls. The trap is that the agents aren't doing parallel work — they're doing serial work that could have been a single chain.

Quick signal: For any multi-agent system, log the total tokens per user task. If your per-task cost is more than 3x what a single-prompt version would cost, the multi-agent structure isn't earning its keep.
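One way to run that comparison is to replay the same user task through both versions and total the tokens. The numbers and agent names below are invented for illustration, and raw token totals are used as a proxy for cost (a real comparison would price input and output tokens separately).

```python
def task_tokens(calls):
    """Total input plus output tokens across every model call for one user task."""
    return sum(c["input_tokens"] + c["output_tokens"] for c in calls)


# Hypothetical trace of a 4-call supervisor pattern: input grows on each hop
# because the prior transcript is passed along.
multi_agent_calls = [
    {"agent": "supervisor", "input_tokens": 1200, "output_tokens": 300},
    {"agent": "researcher", "input_tokens": 1500, "output_tokens": 800},
    {"agent": "writer", "input_tokens": 2300, "output_tokens": 900},
    {"agent": "supervisor", "input_tokens": 3200, "output_tokens": 400},
]

# Hypothetical single-prompt version of the same task.
single_prompt_calls = [
    {"agent": "single", "input_tokens": 1200, "output_tokens": 900},
]

multi_total = task_tokens(multi_agent_calls)
single_total = task_tokens(single_prompt_calls)
ratio = multi_total / single_total
earning_its_keep = ratio <= 3
```

Here the multi-agent version uses roughly 5x the tokens of the single prompt, so by the 3x rule of thumb it would need to show a clear quality win to justify itself.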

The 10-minute audit

Pick your most expensive agent. Pull the last 7 days of traces. For each user request, log:

Total input tokens
Total output tokens
Number of LLM calls
Number of tool calls (especially side-effecting ones)
Whether the final outcome was asserted (not just "tool returned 200")

If you don't have those five numbers in your logs, that's the first thing to fix — but most Langfuse/LangSmith/Helicone setups surface them with a default dashboard.
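If you do need to build the five numbers yourself, a per-request summary could be shaped like this. The event schema (`type`, `side_effecting`, `passed`, and so on) is an assumption for the sketch, not the export format of any particular tracing tool.

```python
from dataclasses import dataclass


@dataclass
class RequestSummary:
    request_id: str
    input_tokens: int = 0
    output_tokens: int = 0
    llm_calls: int = 0
    tool_calls: int = 0
    side_effecting_tool_calls: int = 0
    outcome_asserted: bool = False


def summarize(request_id, events):
    """Fold one request's trace events into the five audit numbers."""
    summary = RequestSummary(request_id)
    for ev in events:
        if ev["type"] == "llm":
            summary.llm_calls += 1
            summary.input_tokens += ev["input_tokens"]
            summary.output_tokens += ev["output_tokens"]
        elif ev["type"] == "tool":
            summary.tool_calls += 1
            if ev.get("side_effecting", False):
                summary.side_effecting_tool_calls += 1
        elif ev["type"] == "outcome_check":
            # True only if some check actually confirmed the outcome.
            summary.outcome_asserted = summary.outcome_asserted or ev["passed"]
    return summary


# Hypothetical events for a single user request.
events = [
    {"type": "llm", "input_tokens": 900, "output_tokens": 150},
    {"type": "tool", "name": "crm_update", "side_effecting": True},
    {"type": "outcome_check", "passed": True},
    {"type": "llm", "input_tokens": 1100, "output_tokens": 200},
]

summary = summarize("r1", events)
```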

Now sort your user requests by total cost (input + output, priced at the model's published rate). The top 10% will eat 60-90% of your bill. Pick the top 3 and ask: which of the four shapes is this?
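The cost sort and the top-10% share can be computed along these lines. The per-million-token prices are placeholders, not any provider's actual rates, and the request data is invented; plug in your model's published prices and your own summaries.

```python
import math


def request_cost(req, input_price_per_m, output_price_per_m):
    """Dollar cost of one request given per-million-token prices."""
    return (
        req["input_tokens"] * input_price_per_m
        + req["output_tokens"] * output_price_per_m
    ) / 1_000_000


def top_decile_share(requests, input_price_per_m, output_price_per_m):
    """Fraction of total spend that comes from the most expensive 10% of requests."""
    costs = sorted(
        (request_cost(r, input_price_per_m, output_price_per_m) for r in requests),
        reverse=True,
    )
    total = sum(costs)
    if total == 0:
        return 0.0
    top_n = max(1, math.ceil(len(costs) * 0.10))
    return sum(costs[:top_n]) / total


# Hypothetical week: nine ordinary requests and one runaway.
requests = [{"input_tokens": 1000, "output_tokens": 100} for _ in range(9)]
requests.append({"input_tokens": 50000, "output_tokens": 20000})

# Placeholder prices in dollars per million tokens.
share = top_decile_share(requests, input_price_per_m=2.0, output_price_per_m=8.0)
```

With this made-up data the single runaway request accounts for a little over 90% of spend, which is the kind of concentration the sort is meant to expose.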

In 19 audits, the answer was almost always: Shape 1 (silent retry storm) for transactional agents, Shape 3 (context stuffing) for RAG agents, Shape 2 (thinking trap) for triage/classification agents, and Shape 4 (agent-of-agents) for any system that grew by adding a "supervisor."

What to cut first

Cut in this order, because the savings shrink as you go down the list while the implementation effort grows:

Add an outcome-assertion line to every side-effecting tool call. Catches Shape 1. Saves the most money. Two lines of code. The savings are immediate.
Cap retrieval chunks at 3-5, add a re-rank step. Catches Shape 3. Saves the second most. One hour of work.
Route reasoning-required tasks to reasoning models, route everything else to small models. Catches Shape 2. Saves third most. One afternoon of routing config.
Collapse multi-agent serial chains into single prompts. Catches Shape 4. Saves the least per change, costs the most political capital because the architecture was a deliberate choice.

What this is worth

If your team is running agents in production and your token bill grew this quarter, the audit above is a 10-minute task. Acting on it is a 1-2 day task. The savings are usually 40-70% of the bill.

If you want a human to read your traces, identify which of the four shapes is hitting you hardest, and write up a one-page prescription with the specific code changes for your stack — that's a service I run. It's a per-task forensic read of your actual logs, not a dashboard you'll forget to check. The deliverable is a single markdown file: shape diagnosis, 3-5 concrete code changes ranked by savings, and a 30-day follow-up to see what stuck.

It's $299 for a single agent, $499 for a fleet of up to 3. The link is in the profile if you want to see what the deliverable shape looks like.

What's not in this article

I'm not going to tell you to "optimize your prompts" or "use a cheaper model." Both are obvious and neither addresses the actual problem. The reason your bill is up 10x isn't that the per-token price went up. It's that the number of tokens per task went up because of one of the four shapes above. The fix is structural, not token-level.

The Tom's Hardware piece is the symptom report. This is the diagnosis. The diagnosis is what saves the money.

DEV Community