TL;DR: Your LLM bill isn't one number; it's about six. Retry storms, runaway agents, and bad routing are the usual culprits. A bit of observability work up front saves you from staring at a $40k invoice wondering what happened.
Last quarter I helped a mate's team trace a sudden 3x jump in their OpenAI spend. They thought it was usage growth. It wasn't. It was a retry loop in their orchestration code that fired off three full-context requests every time a single tool call timed out. Took us about two hours to find, ten minutes to fix.
I reckon most teams running LLMs in prod have something similar lurking. You just don't see it until the invoice lands.
## The bill has more parts than you think
When you read "LLM cost" on a finance dashboard, that single number is hiding a bunch of independent failure modes. Worth pulling them apart.
| Cost driver | What it actually is | Where it hides |
|---|---|---|
| Input tokens | Prompt + context + system message | Long system prompts, fat RAG chunks |
| Output tokens | Model's response | Verbose responses, no max_tokens cap |
| Retries | Failed requests you paid for anyway | Library defaults, agent loops |
| Cached vs uncached | Prompt caching hits or misses | Cache invalidation from tiny prompt edits |
| Provider markup | Your gateway/aggregator's cut | Hidden in unit pricing |
| Wasted spend | Calls you didn't need to make | Background agents, debug code in prod |
The first three are the ones I see blow up budgets. Provider markup matters at scale but it's predictable. The other two sneak up on you.
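If you want to see how these pieces compound, a rough back-of-the-envelope cost function helps. This is a sketch with illustrative unit prices, not your provider's real rates; plug in the numbers from their pricing page.

```python
# Rough per-request cost model. The prices here are illustrative placeholders --
# use the actual rates from your provider's pricing page.
PRICE_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06, "cached_input": 0.015},
}

def request_cost(model, input_tokens, output_tokens, cached_tokens=0, attempts=1):
    """Estimate what one logical request cost you, retries included."""
    p = PRICE_PER_1K[model]
    uncached = input_tokens - cached_tokens
    one_attempt = (
        uncached / 1000 * p["input"]
        + cached_tokens / 1000 * p["cached_input"]
        + output_tokens / 1000 * p["output"]
    )
    # Every retry re-sends the full prompt and pays for the output again.
    return one_attempt * attempts

# A 4,000-token prompt with a 500-token response, retried three times,
# costs roughly triple what the single successful call would have.
print(request_cost("gpt-4", 4000, 500, attempts=3))  # ~0.45 instead of ~0.15
```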
## Retries are the silent killer
Here's the pattern I see constantly. A team uses some agent framework, the framework retries failed requests a few times with exponential backoff by default, and the prompts include the full conversation history. A timeout on token 4000 of a 4096-token response means you just paid for the whole history plus ~4000 output tokens, then immediately pay for all of it again, then maybe a third time.
```python
from openai import OpenAI

# What people write
client = OpenAI()  # the SDK retries failed requests on its own by default
response = client.chat.completions.create(
    model="gpt-4",
    messages=long_history,
)

# What they should write
client = OpenAI(max_retries=1, timeout=30)  # fail fast, retry at most once
response = client.chat.completions.create(
    model="gpt-4",
    messages=long_history,
    max_tokens=500,  # cap the output you're willing to pay for
)
```
Two changes. Cap your output. Stop retrying expensive operations more than once. If a 30-second call fails once, the second attempt usually fails too.
## Caching is worth the effort, but it's fragile
Prompt caching on most providers gives you something like 50-90% off cached input tokens. Brilliant when it works. The trap is that cache keys are exact prefix matches. Change your system prompt by one character, your timestamp injection bumps every request, your dynamic user context shifts the prefix... cache hit rate goes to zero and your bill quietly goes back up.
A useful exercise: log your actual cache hit rate per route, not just the average. I had a service where overall hit rate was 70%, looked fine, but one specific endpoint was running at 4% because someone added a timestamp to the system prompt for "debugging."
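Here's a minimal sketch of that per-route measurement, assuming you already log one usage record per request. The record shape and field names here are my own; map them onto whatever cached-token counts your provider reports.

```python
from collections import defaultdict

# Sketch: per-route cache hit rate from logged usage records.
# Each record is assumed to hold the route name, total prompt tokens,
# and how many of those the provider reported as cache hits.
def cache_hit_rates(records):
    totals = defaultdict(lambda: {"prompt": 0, "cached": 0})
    for r in records:
        t = totals[r["route"]]
        t["prompt"] += r["prompt_tokens"]
        t["cached"] += r.get("cached_tokens", 0)
    return {
        route: (t["cached"] / t["prompt"] if t["prompt"] else 0.0)
        for route, t in totals.items()
    }

records = [
    {"route": "chat", "prompt_tokens": 6000, "cached_tokens": 4200},
    {"route": "debug-summary", "prompt_tokens": 5000, "cached_tokens": 200},
]
print(cache_hit_rates(records))  # {'chat': 0.7, 'debug-summary': 0.04}
```

The average across both routes looks healthy; the per-route numbers are what tell you which endpoint is quietly paying full price.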
## Routing across providers
Once you've got more than one model in play, where requests go starts mattering a lot. Cheap models for classification, expensive ones for synthesis. Local models for bulk preprocessing if you can host them.
A few options here, depending on how much you want to manage yourself:
- Build it in your own gateway service
- Use a router library like LiteLLM in-process
- Run a proxy like Bifrost (https://github.com/maximhq/bifrost), Kong AI, or Portkey in front of your services
- Stick with one provider and use their own routing
The proxy approach is what I've ended up preferring on bigger setups because it gives you one place to log, retry, and rate-limit. The downside is one more thing to keep alive. For smaller services I just call the SDK directly and accept the duplication.
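If you do want the in-process option, a minimal sketch with LiteLLM looks something like this. The task labels and the model split are placeholders for whatever makes sense in your workloads, not a recommendation.

```python
import litellm

# Minimal in-process routing sketch (the router-library option above).
# Task labels and model choices are placeholders -- pick your own split.
MODEL_FOR_TASK = {
    "classify": "gpt-4o-mini",           # cheap and fast
    "synthesize": "gpt-4o",              # expensive, only where quality matters
    "bulk_preprocess": "ollama/llama3",  # locally hosted, near-zero marginal cost
}

def run(task, messages):
    model = MODEL_FOR_TASK.get(task, "gpt-4o-mini")  # default to the cheap one
    return litellm.completion(model=model, messages=messages)

resp = run("classify", [{"role": "user", "content": "Is this ticket a billing issue?"}])
```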
## Set hard limits per workload
The cheapest debugging tool I've found is a per-API-key spend cap. Most providers offer them. Most teams don't set them because the dashboards default to "monitor only."
Set them. Set them low. If your batch job is supposed to cost $200 a day and you cap it at $400, you'll get paged the day someone accidentally points it at production traffic instead of staging. That page is much nicer to receive than a Slack message from finance two weeks later.
```yaml
# Example budget config we use
budgets:
  - name: chat-prod
    daily_limit_usd: 1200
    hourly_limit_usd: 100
    alert_at_pct: [50, 80, 95]
    hard_stop_at_pct: 100
  - name: experiments
    daily_limit_usd: 50
    hard_stop_at_pct: 100
```
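The config only matters if something enforces it. Here's a rough sketch of the alert and hard-stop logic, assuming you accumulate per-budget spend yourself; the paging hook is a placeholder for your own alerting.

```python
# Rough enforcement sketch for the config above: track spend per budget
# and act on the alert / hard-stop thresholds. Wiring (where costs come
# from, how you actually page someone) is up to you.
class BudgetGuard:
    def __init__(self, name, daily_limit_usd, alert_at_pct, hard_stop_at_pct=100):
        self.name = name
        self.limit = daily_limit_usd
        self.alert_at = sorted(alert_at_pct)
        self.hard_stop_at = hard_stop_at_pct
        self.spent = 0.0
        self.alerted = set()

    def record(self, cost_usd):
        """Add a request's cost; return False once the hard stop is hit."""
        self.spent += cost_usd
        pct = self.spent / self.limit * 100
        for threshold in self.alert_at:
            if pct >= threshold and threshold not in self.alerted:
                self.alerted.add(threshold)
                print(f"[{self.name}] spend at {pct:.0f}% of daily limit")  # page on-call here
        return pct < self.hard_stop_at

guard = BudgetGuard("chat-prod", daily_limit_usd=1200, alert_at_pct=[50, 80, 95])
if not guard.record(cost_usd=700):
    raise RuntimeError("chat-prod hit its daily hard stop")
```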
## Trade-offs and Limitations
A few honest caveats.
Caching aggressively means your prompts get rigid. You can't iterate on system prompts as freely because every edit nukes your cache.
Hard spend caps will absolutely cause prod outages. That's the point, but it means you need a runbook for "we hit the cap, what now." Either auto-raise with approval, or fail open to a cheaper fallback model, or fail closed and alert. Pick one before it happens.
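If you go the fail-open route, the wiring can be as small as this, reusing the guard sketch above; the model names and cost estimate are placeholders.

```python
# One way to wire the "fail open to a cheaper model" option.
def complete_with_fallback(client, guard, messages, est_cost_usd):
    under_cap = guard.record(est_cost_usd)
    model = "gpt-4o" if under_cap else "gpt-4o-mini"  # degrade instead of going dark
    return client.chat.completions.create(model=model, messages=messages, max_tokens=500)
```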
Per-key budgets only work if you have enough keys. If everything runs through one shared key, you've got coarse-grained control at best.
And honestly, observability work has diminishing returns. If your bill is $500 a month, don't build a routing platform. If it's $50k a month, the engineering time pays for itself in a week.
## Further Reading
- OpenAI prompt caching docs
- Anthropic prompt caching docs
- LiteLLM proxy
- AWS Bedrock pricing breakdown
- Google's "SRE Workbook" chapter on cost
Bills are signals. If yours doubles overnight, something in your system changed. Find that thing before you negotiate a bigger contract.