TL;DR: Your LLM bill isn't one number; it's about six. Retry storms, runaway agents, and bad routing are the usual culprits. A bit of observability work up front saves you from staring at a $40k invoice wondering what happened.
Last quarter I helped a mate's team trace a sudden 3x jump in their OpenAI spend. They thought it was usage growth. It wasn't. It was a retry loop in their orchestration code that fired off three full-context requests every time a single tool call timed out. Took us about two hours to find, ten minutes to fix.
I reckon most teams running LLMs in prod have something similar lurking. You just don't see it until the invoice lands.
## The bill has more parts than you think
When you read "LLM cost" on a finance dashboard, that single number is hiding a bunch of independent failure modes. Worth pulling them apart.
| Cost driver | What it actually is | Where it hides |
|---|---|---|
| Input tokens | Prompt + context + system message | Long system prompts, fat RAG chunks |
| Output tokens | Model's response | Verbose responses, no max_tokens cap |
| Retries | Failed requests you paid for anyway | Library defaults, agent loops |
| Cached vs uncached | Prompt caching hits or misses | Cache invalidation from tiny prompt edits |
| Provider markup | Your gateway/aggregator's cut | Hidden in unit pricing |
| Wasted spend | Calls you didn't need to make | Background agents, debug code in prod |
The first three are the ones I see blow up budgets. Provider markup matters at scale but it's predictable. The other two sneak up on you.
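If you want to see how these pieces compound, a rough back-of-the-envelope cost function helps. This is a sketch with illustrative unit prices, not your provider's real rates; plug in the numbers from their pricing page.

```python
# Rough per-request cost model. The prices here are illustrative placeholders --
# use the actual rates from your provider's pricing page.
PRICE_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06, "cached_input": 0.015},
}

def request_cost(model, input_tokens, output_tokens, cached_tokens=0, attempts=1):
    """Estimate what one logical request cost you, retries included."""
    p = PRICE_PER_1K[model]
    uncached = input_tokens - cached_tokens
    one_attempt = (
        uncached / 1000 * p["input"]
        + cached_tokens / 1000 * p["cached_input"]
        + output_tokens / 1000 * p["output"]
    )
    # Every retry re-sends the full prompt and pays for the output again.
    return one_attempt * attempts

# A 4,000-token prompt with a 500-token response, retried three times,
# costs roughly triple what the single successful call would have.
print(request_cost("gpt-4", 4000, 500, attempts=3))  # ~0.45 instead of ~0.15
```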
## Retries are the silent killer
Here's the pattern I see constantly. A team uses some agent framework, the framework retries failed requests a few times with exponential backoff by default, and the prompts include the full conversation history. A timeout on token 4000 of a 4096-token response means you just paid for the whole history plus ~4000 output tokens, then immediately pay for all of it again, then maybe a third time.
```python
from openai import OpenAI

# What people write
client = OpenAI()  # the SDK retries failed requests on its own by default
response = client.chat.completions.create(
    model="gpt-4",
    messages=long_history,
)

# What they should write
client = OpenAI(max_retries=1, timeout=30)  # fail fast, retry at most once
response = client.chat.completions.create(
    model="gpt-4",
    messages=long_history,
    max_tokens=500,  # cap the output you're willing to pay for
)
```
Two changes. Cap your output. Stop retrying expensive operations more than once. If a 30-second call fails once, the second attempt usually fails too.
## Caching is worth the effort, but it's fragile
Prompt caching on most providers gives you something like 50-90% off cached input tokens. Brilliant when it works. The trap is that cache keys are exact prefix matches. Change your system prompt by one character, your timestamp injection bumps every request, your dynamic user context shifts the prefix... cache hit rate goes to zero and your bill quietly goes back up.
A useful exercise: log your actual cache hit rate per route, not just the average. I had a service where overall hit rate was 70%, looked fine, but one specific endpoint was running at 4% because someone added a timestamp to the system prompt for "debugging."
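Here's a minimal sketch of that per-route measurement, assuming you already log one usage record per request. The record shape and field names here are my own; map them onto whatever cached-token counts your provider reports.

```python
from collections import defaultdict

# Sketch: per-route cache hit rate from logged usage records.
# Each record is assumed to hold the route name, total prompt tokens,
# and how many of those the provider reported as cache hits.
def cache_hit_rates(records):
    totals = defaultdict(lambda: {"prompt": 0, "cached": 0})
    for r in records:
        t = totals[r["route"]]
        t["prompt"] += r["prompt_tokens"]
        t["cached"] += r.get("cached_tokens", 0)
    return {
        route: (t["cached"] / t["prompt"] if t["prompt"] else 0.0)
        for route, t in totals.items()
    }

records = [
    {"route": "chat", "prompt_tokens": 6000, "cached_tokens": 4200},
    {"route": "debug-summary", "prompt_tokens": 5000, "cached_tokens": 200},
]
print(cache_hit_rates(records))  # {'chat': 0.7, 'debug-summary': 0.04}
```

The average across both routes looks healthy; the per-route numbers are what tell you which endpoint is quietly paying full price.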
## Routing across providers
Once you've got more than one model in play, where requests go starts mattering a lot. Cheap models for classification, expensive ones for synthesis. Local models for bulk preprocessing if you can host them.
A few options here, depending on how much you want to manage yourself:
- Build it in your own gateway service
- Use a router library like LiteLLM in-process
- Run a proxy like Bifrost (https://github.com/maximhq/bifrost), Kong AI, or Portkey in front of your services
- Stick with one provider and use their own routing
The proxy approach is what I've ended up preferring on bigger setups because it gives you one place to log, retry, and rate-limit. The downside is one more thing to keep alive. For smaller services I just call the SDK directly and accept the duplication.
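If you do want the in-process option, a minimal sketch with LiteLLM looks something like this. The task labels and the model split are placeholders for whatever makes sense in your workloads, not a recommendation.

```python
import litellm

# Minimal in-process routing sketch (the router-library option above).
# Task labels and model choices are placeholders -- pick your own split.
MODEL_FOR_TASK = {
    "classify": "gpt-4o-mini",           # cheap and fast
    "synthesize": "gpt-4o",              # expensive, only where quality matters
    "bulk_preprocess": "ollama/llama3",  # locally hosted, near-zero marginal cost
}

def run(task, messages):
    model = MODEL_FOR_TASK.get(task, "gpt-4o-mini")  # default to the cheap one
    return litellm.completion(model=model, messages=messages)

resp = run("classify", [{"role": "user", "content": "Is this ticket a billing issue?"}])
```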
## Set hard limits per workload
The cheapest debugging tool I've found is a per-API-key spend cap. Most providers offer them. Most teams don't set them because the dashboards default to "monitor only."
Set them. Set them low. If your batch job is supposed to cost $200 a day and you cap it at $400, you'll get paged the day someone accidentally points it at production traffic instead of staging. That page is much nicer to receive than a Slack message from finance two weeks later.
```yaml
# Example budget config we use
budgets:
  - name: chat-prod
    daily_limit_usd: 1200
    hourly_limit_usd: 100
    alert_at_pct: [50, 80, 95]
    hard_stop_at_pct: 100
  - name: experiments
    daily_limit_usd: 50
    hard_stop_at_pct: 100
```
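The config only matters if something enforces it. Here's a rough sketch of the alert and hard-stop logic, assuming you accumulate per-budget spend yourself; the paging hook is a placeholder for your own alerting.

```python
# Rough enforcement sketch for the config above: track spend per budget
# and act on the alert / hard-stop thresholds. Wiring (where costs come
# from, how you actually page someone) is up to you.
class BudgetGuard:
    def __init__(self, name, daily_limit_usd, alert_at_pct, hard_stop_at_pct=100):
        self.name = name
        self.limit = daily_limit_usd
        self.alert_at = sorted(alert_at_pct)
        self.hard_stop_at = hard_stop_at_pct
        self.spent = 0.0
        self.alerted = set()

    def record(self, cost_usd):
        """Add a request's cost; return False once the hard stop is hit."""
        self.spent += cost_usd
        pct = self.spent / self.limit * 100
        for threshold in self.alert_at:
            if pct >= threshold and threshold not in self.alerted:
                self.alerted.add(threshold)
                print(f"[{self.name}] spend at {pct:.0f}% of daily limit")  # page on-call here
        return pct < self.hard_stop_at

guard = BudgetGuard("chat-prod", daily_limit_usd=1200, alert_at_pct=[50, 80, 95])
if not guard.record(cost_usd=700):
    raise RuntimeError("chat-prod hit its daily hard stop")
```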
## Trade-offs and Limitations
A few honest caveats.
Caching aggressively means your prompts get rigid. You can't iterate on system prompts as freely because every edit nukes your cache.
Hard spend caps will absolutely cause prod outages. That's the point, but it means you need a runbook for "we hit the cap, what now." Either auto-raise with approval, or fail open to a cheaper fallback model, or fail closed and alert. Pick one before it happens.
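If you go the fail-open route, the wiring can be as small as this, reusing the guard sketch above; the model names and cost estimate are placeholders.

```python
# One way to wire the "fail open to a cheaper model" option.
def complete_with_fallback(client, guard, messages, est_cost_usd):
    under_cap = guard.record(est_cost_usd)
    model = "gpt-4o" if under_cap else "gpt-4o-mini"  # degrade instead of going dark
    return client.chat.completions.create(model=model, messages=messages, max_tokens=500)
```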
Per-key budgets only work if you have enough keys. If everything runs through one shared key, you've got coarse-grained control at best.
And honestly, observability work has diminishing returns. If your bill is $500 a month, don't build a routing platform. If it's $50k a month, the engineering time pays for itself in a week.
## Further Reading
- OpenAI prompt caching docs
- Anthropic prompt caching docs
- LiteLLM proxy
- AWS Bedrock pricing breakdown
- Google's "SRE Workbook" chapter on cost
Bills are signals. If yours doubles overnight, something in your system changed. Find that thing before you negotiate a bigger contract.