wartzar-bee

Posted on Jun 14

We burned 136 million tokens running an autonomous agent studio. Here's how we cut the bill ~90%.

#ai #agents #llm #devops

We run a studio where AI agents work mostly unattended — they write code, ship sites, produce content, and keep going without a human in the loop. Running agents like that, around the clock, teaches you one thing fast: the bill is the product constraint. Not the model's intelligence. The bill.

Here's the most expensive lesson we paid for, and the architecture we rebuilt to stop paying it.

The 136M-token fire

One of our agents burned ~136 million tokens in a stretch where it produced almost nothing. We assumed runaway tool calls. That wasn't it.

The cause was mundane and brutal: the agent was waking itself on a timer (a cron / scheduled self-invoke) into one ever-growing session. Two things compounded:

The whole thread is re-sent every turn. An LLM is stateless. A long agent session doesn't "remember" — the entire conversation is re-uploaded as input on every single call. A session that grows to 800k tokens of context costs ~800k input tokens per turn, even if the model writes two sentences.
The prompt cache expires. Providers cache your context so re-sends are cheap, but the cache has a short TTL (minutes). If the timer's interval is longer than the TTL, every wake-up re-reads the full context uncached, at roughly 10× the cached price.

So: a self-looping agent, on a timer longer than the cache window, re-sending an ever-larger thread, uncached, forever. That's how you turn "a few cents of output" into 136M tokens.
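To make the compounding concrete, here is a rough back-of-the-envelope sketch. The prices, the starting context size, and the per-turn growth are illustrative assumptions, not our measured numbers or any provider's actual rates. The cached case is also simplified: it treats the whole re-sent context as a cache hit.

```python
# Illustrative prices only (assumptions, not any provider's real rate card).
UNCACHED_USD_PER_MTOK = 3.00  # dollars per million uncached input tokens
CACHED_USD_PER_MTOK = 0.30    # dollars per million cached input tokens (10x cheaper)


def session_input_tokens(start_tokens: int, growth_per_turn: int, turns: int) -> int:
    """Total input tokens sent when the whole thread is re-sent on every turn."""
    total = 0
    context = start_tokens
    for _ in range(turns):
        total += context            # the full context is uploaded again
        context += growth_per_turn  # each turn's output and tool results grow the thread
    return total


def input_cost_usd(tokens: int, cache_hit: bool) -> float:
    """Input cost in dollars, assuming every token is either all cached or all uncached."""
    price = CACHED_USD_PER_MTOK if cache_hit else UNCACHED_USD_PER_MTOK
    return tokens / 1_000_000 * price


# Hypothetical loop: 200 timer wake-ups, starting at 100k tokens, growing 5k per turn.
tokens = session_input_tokens(start_tokens=100_000, growth_per_turn=5_000, turns=200)
print(tokens)                                              # 119500000
print(f"{input_cost_usd(tokens, cache_hit=False):.2f}")    # 358.50
print(f"{input_cost_usd(tokens, cache_hit=True):.2f}")     # 35.85
```

Under these assumptions, 200 quiet wake-ups already cost about 119.5M input tokens, and missing the cache on every one of them multiplies the dollar cost by ten.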

The reflex fix vs the real fix

The reflex fix is "use less / set a token limit." That caps the damage; it doesn't change the economics. You're still running every step on a frontier model, still re-sending context, still paying frontier prices for work a much cheaper model could do.

The real fix was to stop treating the expensive model as the default. We rebuilt the runtime around four cost-native principles. None of them are exotic — but no agent framework ships them as the default, because most of the ecosystem makes money when your usage goes up, not down.

1. Never self-re-invoke a frontier model on a timer

A frontier model running a recurring loop in one session is the single most expensive pattern in agent ops. We banned it. Recurring, autonomous work runs off the frontier model entirely — a cheap planner decomposes the goal, a cheap/local worker executes, and a deterministic check verifies. The frontier model is summoned only when a genuine human-level judgment call is needed, and then in a fresh, lean session.

2. Route every step to the cheapest model that can do it

This is the lever almost nobody defaults to, and it's the biggest one. Most steps in an agent loop are mechanical: read a file, run a command, reformat output, check a condition. You do not need a $15/M-token model for that. You need a $0.14/M-token model — or a local one running at ~$0 marginal cost.

We route by step:

Routine / mechanical → a cheap API model (DeepSeek, Gemini Flash) or a local model (Ollama / MLX) at zero marginal cost.
Genuine reasoning / judgment → a frontier model, deliberately, and only then.
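The routing rule above can be sketched in a few lines. This is a minimal illustration, not our runtime's actual code: `StepKind`, `ModelTier`, `route`, and the tier names are hypothetical, and the prices are the example figures from this section rather than real quotes.

```python
from dataclasses import dataclass
from enum import Enum


class StepKind(Enum):
    MECHANICAL = "mechanical"  # read a file, run a command, reformat output
    JUDGMENT = "judgment"      # genuine reasoning or a judgment call


@dataclass(frozen=True)
class ModelTier:
    name: str            # placeholder label, not a real model ID
    usd_per_mtok: float  # illustrative price per million tokens


LOCAL = ModelTier("local-model", 0.0)
CHEAP = ModelTier("cheap-api-model", 0.14)
FRONTIER = ModelTier("frontier-model", 15.0)


def route(kind: StepKind, local_available: bool = False) -> ModelTier:
    """Pick the cheapest tier allowed to handle this kind of step."""
    if kind is StepKind.JUDGMENT:
        return FRONTIER
    return LOCAL if local_available else CHEAP


def routed_cost_usd(steps: list[tuple[StepKind, int]]) -> float:
    """Dollar cost of a list of (kind, tokens) steps under the routing rule."""
    return sum(route(kind).usd_per_mtok * tokens / 1_000_000 for kind, tokens in steps)


# Hypothetical workload: nine mechanical steps and one judgment step, 1M tokens each.
workload = [(StepKind.MECHANICAL, 1_000_000)] * 9 + [(StepKind.JUDGMENT, 1_000_000)]
all_frontier = FRONTIER.usd_per_mtok * 10  # every step on the frontier model
print(f"{routed_cost_usd(workload):.2f} vs {all_frontier:.2f}")
```

For this made-up mix the routed cost is about $16 against $150 with everything on the frontier tier. The real ratio depends entirely on how much of your loop is mechanical.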

Reported savings for this pattern across the industry land at 60–86%. Our own bill dropped about an order of magnitude. The quality cost is near zero if you add the next piece.

3. Gate cheap work with a deterministic verify

The fear with cheap/local models is quality. The answer isn't "trust the cheap model"; it's "let the cheap model do the work, then verify it with something that can't lie." A test suite. A linter. A schema check. An exit code. If the cheap model's output passes a deterministic gate, it's correct as far as that gate can check, and you never paid frontier prices to find out. If it fails, you retry or escalate. The verify-gate is what makes aggressive downshifting safe.
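A minimal sketch of the retry-then-escalate loop, assuming each model is wrapped as a plain callable and the gate is a pure function. `run_with_gate` and `json_has_keys` are hypothetical names for illustration, and the schema check stands in for whatever deterministic gate fits the task (tests, a linter, an exit code).

```python
import json
from typing import Callable, Optional, Sequence

Worker = Callable[[str], str]  # takes a task description, returns candidate output
Gate = Callable[[str], bool]   # deterministic pass/fail check on that output


def json_has_keys(required: Sequence[str]) -> Gate:
    """Build a gate that passes only for a JSON object containing all required keys."""
    def gate(text: str) -> bool:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and set(required).issubset(data)
    return gate


def run_with_gate(
    task: str,
    workers: Sequence[Worker],
    gate: Gate,
    attempts_per_worker: int = 2,
) -> Optional[tuple[str, int]]:
    """Try workers cheapest-first; return (output, tier index) for the first passing output.

    Returns None if every attempt on every tier fails the gate.
    """
    for tier, worker in enumerate(workers):
        for _ in range(attempts_per_worker):
            candidate = worker(task)
            if gate(candidate):
                return candidate, tier
    return None
```

Ordering `workers` from cheapest to most expensive means the frontier model is only called after the cheaper tiers have failed the gate `attempts_per_worker` times each. A `None` result is the signal to defer or hand off to a human.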

4. Hard caps + honest per-agent attribution

Every agent runs under a spend cap. Cross it and the agent defers instead of barreling on. We also attribute spend per agent, so when something costs too much, we know exactly which one and why, instead of staring at one big number at the end of the month. (The 136M fire was invisible precisely because nothing attributed cost to the loop while it ran.)
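One way to sketch a cap plus attribution is a small in-memory ledger that every model call has to charge before it runs. `SpendLedger`, `BudgetExceeded`, and the agent names are hypothetical, and the token count passed to `charge` is assumed to be an estimate made before the call.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push an agent past its spend cap."""


class SpendLedger:
    def __init__(self, caps_usd: dict[str, float]):
        self.caps_usd = caps_usd
        self.spent_usd: dict[str, float] = defaultdict(float)

    def charge(self, agent: str, tokens: int, usd_per_mtok: float) -> float:
        """Record the cost of a call, or refuse it if the agent's cap would be crossed."""
        cost = tokens / 1_000_000 * usd_per_mtok
        if self.spent_usd[agent] + cost > self.caps_usd[agent]:
            raise BudgetExceeded(f"{agent} would exceed its {self.caps_usd[agent]:.2f} USD cap")
        self.spent_usd[agent] += cost
        return cost

    def report(self) -> list[tuple[str, float]]:
        """Per-agent spend, most expensive first."""
        return sorted(self.spent_usd.items(), key=lambda item: item[1], reverse=True)


ledger = SpendLedger({"site-builder": 1.00, "timer-loop": 5.00})
ledger.charge("site-builder", 500_000, 0.14)
ledger.charge("timer-loop", 1_000_000, 3.00)
try:
    ledger.charge("timer-loop", 1_000_000, 3.00)  # would reach 6.00 against a 5.00 cap
except BudgetExceeded:
    pass  # the agent defers here instead of making the call
```

Because the refused charge is never recorded, `report()` shows what each agent actually spent, with the most expensive agent at the top.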

The other half: don't re-derive the world every session

The context-resend problem has a second fix beyond caching: keep sessions short and lean, and put continuity in durable files, not in one infinite thread. Our agents write their state, decisions, and memory to disk. A fresh session reads a small digest of what matters instead of re-uploading a 500k-token history. Short threads are cheap threads.
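A minimal sketch of file-backed continuity, assuming state is a small JSON document with a `goal` string and a `decisions` list. The file layout, field names, and function names are illustrative, not our runtime's actual format.

```python
import json
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state via a temp file and rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def load_digest(path: Path, max_decisions: int = 5) -> str:
    """Return a short text digest for a fresh session instead of the full history."""
    if not path.exists():
        return "No prior state."
    state = json.loads(path.read_text())
    recent = state.get("decisions", [])[-max_decisions:]
    lines = [f"Goal: {state.get('goal', 'unknown')}"]
    lines += [f"- {decision}" for decision in recent]
    return "\n".join(lines)
```

A fresh session would start from `load_digest(...)` as its opening context, so its size is bounded by `max_decisions` rather than by how long the agent has been running.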

Why this isn't the default anywhere

If routing-to-cheap saves 60–90%, why doesn't every agent framework do it out of the box?

Incentives. The big agent frameworks and observability tools monetize usage, seats, and traces — they grow when your token count grows. The model providers obviously don't profit from you spending less with them. So the most valuable cost lever in agent engineering is the one nobody in the value chain is motivated to ship as a default. It gets left to you.

That's the whole reason we open-sourced our runtime.

The takeaway

You cannot run a real agent network on a frontier model alone. It will cost you your margin. The architecture that works:

frontier model only for genuine judgment, in fresh lean sessions,
everything else routed to cheap or local models,
a deterministic verify-gate so cheap stays correct,
hard per-agent caps and real attribution,
durable files for continuity instead of one infinite thread.

We learned it by setting 136 million tokens on fire. You don't have to.

We're building this in the open — a cost-native, brain-agnostic agent runtime (run any agent on a local model, a cheap API, or a frontier model; same code, same isolation). If you want the runtime or the deeper postmortems, follow along.

DEV Community