TL;DR: I audited a friend's startup that was spending $4,200/month on Claude API. Only $1,300 produced business value. The other $2,900 was waste — split across three patterns that hit most teams using LLM APIs in production. Here's how to find them in your own bill, and the code to fix each one.
The audit
A friend running a B2B doc-summarization product asked me to look at their Claude bill. Q1 was $4,200/month and climbing. We pulled their request logs into a spreadsheet, classified each call by purpose, then estimated what each should have cost. The answer was uncomfortable:
| Bucket | Cost | Producing business value? |
|---|---|---|
| First-time doc analysis | $890 | Yes |
| User chat turns | $410 | Yes |
| Repeated system prompts (no cache) | $1,810 | No |
| Opus calls that should be Sonnet | $680 | No |
| Serial bulk runs (should be batched) | $410 | No |
Three problems, $2,900/month of waste. Each one is unsexy and easy to miss, but together they were 70% of the bill.
Culprit #1: prompt caching is off
This is the silent killer. Claude 4.x supports prompt caching: send a 5-minute or 1-hour TTL cache_control block, and Anthropic charges you ~10x less for cached tokens on subsequent requests. Pricing today (per million tokens for Sonnet 4.6):
- Fresh input: $3.00
- Cache write: $3.75 (one-time, slightly more than fresh)
- Cache read: $0.30 — 10x cheaper
The catch: you have to opt in per-request, and most code doesn't. Before/after:
# Before — every call pays for the full system prompt
client.messages.create(
model="claude-sonnet-4-6",
system="You are an expert at...[2000 words of rules + examples]",
messages=[{"role": "user", "content": user_input}],
max_tokens=1024,
)
# After — system prompt cached for 5 minutes
client.messages.create(
model="claude-sonnet-4-6",
system=[
{
"type": "text",
"text": "You are an expert at...[2000 words of rules + examples]",
"cache_control": {"type": "ephemeral"},
},
],
messages=[{"role": "user", "content": user_input}],
max_tokens=1024,
)
One-line change. 90% discount on every subsequent call within the cache TTL.
For my friend: 20K tokens of system prompt × 8 requests/min × 50% cache hit ratio = ~$80/day saved. That alone was $2,400/month — most of the $1,810 leak.
OpenAI SDK calling Claude (via compatible proxies) has equivalent semantics:
client.chat.completions.create(
model="claude-sonnet-4-6",
messages=[...],
prompt_cache_key="user-session-12345", # Stable across calls = cache hit
)
Action: open your last week of API logs. If you have any repeated system content across requests, you're leaking.
Culprit #2: model overkill
The mental shortcut "Claude = quality, just always use Opus" is expensive. Opus is 4x the cost of Sonnet for inputs, 5x for outputs. For a lot of work, Sonnet or even Haiku is indistinguishable.
I ran 5 tasks across the lineup (1000 samples, scored by judge model + human spot-check):
| Task | Opus 4.7 | Sonnet 4.6 | Haiku 4.5 | Best price/quality |
|---|---|---|---|---|
| JSON extraction from PDFs | 99.2% | 98.7% | 96.4% | Haiku |
| Code review (real bugs) | 74% | 74% | 61% | Sonnet |
| Creative copy (blind judged) | 51% pref | 48% pref | 32% pref | Sonnet |
| Multi-step reasoning chain | 89% | 76% | 54% | Opus |
| Customer chat | 92% sat | 89% sat | 81% sat | Sonnet |
The pattern: Opus wins clearly only on complex multi-step reasoning. For most tasks Sonnet is within margin of error at 1/4 the cost. Haiku trades 2-5% accuracy for 1/13 the cost — fine when you have downstream validation.
My friend was running every doc through Opus by default. Switching to Sonnet for analysis + Haiku for tagging dropped that bucket from $680 to $140. No quality complaints.
Action: pick the 3 most expensive endpoints in your bill, A/B-test them on the next cheapest model for a week, score outputs blind.
Culprit #3: serial calls when you could batch
If your work doesn't need a response in the next 30 seconds, the Anthropic Message Batches API charges half price with a 24-hour SLA. Same models, same quality, half the bill.
Good fits:
- Nightly summarization runs
- Classifying or tagging large datasets
- Embedding generation for indexing
- Internal report generation
- Training data prep
Bad fits:
- Anything user-facing (you'll wait hours)
- Anything where input depends on previous output
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"doc-{doc.id}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"messages": [{"role": "user", "content": doc.text}],
},
}
for doc in docs
],
)
# Poll until done (or just check tomorrow)
while batch.processing_status != "ended":
time.sleep(60)
batch = client.messages.batches.retrieve(batch.id)
My friend had a nightly job re-summarizing all docs from the previous 24h. Moving it from asyncio.gather to batches cut that bucket from $410 to $205, no user-visible impact.
Action: any cron job, weekly report, or async task hitting your LLM API — most can be batched.
The result
After three changes (cache hint, model rebalance, batch the async work), my friend's monthly bill went $4,200 → $1,540. Same product, same quality, no rewrites — just turning on features the API already supports.
If your bill feels high, do the same audit:
- Pull last 30 days of API calls
- Count distinct
systemprompts. <10 unique but >10,000 calls = no caching - Look at top 5 model+endpoint combos by spend. Anything simple enough to downshift?
- Find your largest single-day spike. Batch job? Use the batches API.
A shortcut, if you don't want to instrument all this
I built a little proxy called MidRelay that handles the first two automatically: it injects a per-key cache hint into every request (even SDK code that doesn't know about cache_control gets the discount), and it exposes both OpenAI and Anthropic surfaces from the same key so you can route model-by-model without rewriting.
It also happens to be 60-80% cheaper than calling Anthropic / OpenAI directly. (Same models, same wire protocol — your existing SDK just changes the base_url.)
$5 of free credit to test it: drop a comment, I'll DM a code. First 100 readers, no signup gate.
But honestly — the techniques above work on any provider. Even if you never touch MidRelay, just turning on cache_control and downshifting one over-spec'd Opus call will cut your bill more than any "AI cost optimization" SaaS will.
Check your logs tonight.
Top comments (0)