DEV Community

FuturMix
FuturMix

Posted on

Our AI agent bill swung 40% month to month. Here's how we made it predictable.

A few months into running coding agents in production — Claude Code, Aider, a couple of home-grown multi-step workflows — finance pinged me with a question I couldn't answer: "Why is this month 40% more than last month? Did we ship more?"

We hadn't. Traffic was flat. The agents were doing roughly the same work. But the bill had swung from around $1,100 to over $1,500 with no change we could point to. That's the part that stung — not that it was expensive, but that it was unpredictable. You can budget for "expensive." You can't budget for "rolls dice every month."

So we pulled the bill apart line by line. The swing came from three hidden variables, none of which we were watching. Here's each one and what we did about it.

Variable 1: A platform fee that scales with everything

We were routing most calls through an aggregator for the convenience of one endpoint across multiple models. Reasonable choice. But the aggregator's pricing applies a 5.5% platform fee on top of model costs on the pay-as-you-go tier — flat, with no volume discount.

That fee is predictable in percentage terms but it scales with everything else:

Monthly model spend 5.5% platform fee
$500 $27.50
$1,000 $55
$1,500 $82.50

So when the other variables pushed our spend up, this fee amplified the swing instead of damping it. On a $1,500 month, $82.50 evaporated before a single token was generated. It's not the biggest line, but it's pure overhead riding on top of an already-moving number.

The realization: the fee itself isn't the villain — it's that it's a multiplier on volatility we hadn't controlled yet.

Variable 2: Upstream price changes you don't see coming

Here's the one that actually surprised us. Aggregators that advertise "pass-through pricing" pass through price changes too. When an upstream provider adjusts list prices, or quietly shifts the default version a model alias points to, your per-token cost moves — and you find out on the invoice.

We had a workflow pinned to a generic model alias rather than a specific version. At some point the alias started resolving to a newer, pricier snapshot. Same code, same prompts, higher per-token cost, no deploy on our side. That alone accounted for a chunk of the swing.

The fix here is boring but effective: pin to explicit model versions, not floating aliases, anywhere cost matters. A floating alias is convenient until it's a silent price change. If your provider or aggregator lets you lock a rate or a version, do it for your high-volume paths.

Variable 3: The biggest culprit was our own code

I'll be honest: the largest single contributor wasn't the fee or upstream pricing. It was us. Our agents were burning tokens on two self-inflicted patterns, and this is the part worth stealing regardless of which provider you use.

Retry storms. A few of our agent steps had naive retry logic — on a timeout or a malformed response, resend. Fine in isolation, but some of those "failures" had actually half-completed, and a couple of steps could retry 3–4 times on a bad day. Every retry is a fresh, fully-billed call. We were paying for the same logical step multiple times and calling it resilience.

What helped, in order of impact:

  1. Cap retries per logical step (we landed on 2) and add exponential backoff so a transient blip doesn't become a billed stampede.
  2. Make retries cheaper than the original where possible — retry on a smaller/cheaper model first, escalate only if it fails. Most transient failures succeed on the second try regardless of model.
  3. Trim the context you resend. Our agents were re-sending the entire accumulated conversation on every step. Summarizing or windowing older turns cut input tokens substantially with no quality loss on the tasks we measured.

The retry cap and context trimming together cut our token volume by roughly a third — and crucially, they cut the variance, because the worst months were the ones where retry storms had piled up unnoticed.

If you take one thing from this post, take this section. It has nothing to do with which API you call. It's just the easiest 30% you're probably leaving on the table.

What actually made the bill predictable

Putting it together, predictability came from attacking all three:

  • Variable 3 (our code): retry caps + context trimming — the biggest and most portable win.
  • Variable 2 (upstream drift): pin versions / lock rates on high-volume paths so the per-token cost stops moving on its own.
  • Variable 1 (platform fee): decide whether you're actually using the routing you're paying 5.5% for.

Disclosure: FuturMix is our product, so treat this next part as one example of how we closed variables 1 and 2, not a neutral ranking.

On variable 1, we moved our mainstream-model traffic off the fee model. On variable 2, we wanted a locked, predictable rate instead of pass-through drift. FuturMix is the relay we built to do both: no platform fee on top, and a fixed discount off provider list price (Claude ~10% off Anthropic, GPT ~30% off OpenAI, Gemini ~20% off Google) so the rate doesn't float when upstream list prices wobble. It's OpenAI-compatible, so switching is a base-URL change:

from openai import OpenAI

# Before
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

# After
client = OpenAI(base_url="https://futurmix.ai/v1", api_key="sk-...")
Enter fullscreen mode Exit fullscreen mode

Honest tradeoff: it carries 25+ models vs 200+ on the big aggregators, and the community is smaller. If you need the long tail of open-source models or 200+ options, the breadth aggregator still wins — this is for teams living on mainstream models (Claude, GPT, Gemini) who care more about a stable, lower rate than catalog size.

When to use what

Your situation Best move
One model, high volume Go direct to the provider
Running agents (Claude Code, Aider, etc.) Fix variable 3 first — it's provider-agnostic
2–3 mainstream models, want a predictable rate Discount/no-fee relay, pin versions
Testing many models, prototyping Breadth aggregator (catalog > cost)
Need 200+ models incl. open-source Breadth aggregator

The point isn't "switch your provider." It's that an unpredictable AI bill is usually three separate problems wearing one trench coat. Fix your own retries, pin what you can, and only then decide whether the routing fee is buying you anything.


Disclosure: FuturMix is our product — an OpenAI-compatible relay for 25+ mainstream models with no platform fee and a fixed discount off provider rates. This post reflects our real debugging of our own bill; the variable-3 fixes work no matter whose API you're on.

How do you keep your LLM bill from swinging? What's your worst surprise-invoice story? Drop it in the comments — I'll be in here replying for the next couple of days.

Top comments (0)