Shek

Why your Claude API bill is 3x what it should be (and how to fix it)

TL;DR: I audited a friend's startup that was spending $4,200/month on Claude API. Only $1,300 produced business value. The other $2,900 was waste — split across three patterns that hit most teams using LLM APIs in production. Here's how to find them in your own bill, and the code to fix each one.


The audit

A friend running a B2B doc-summarization product asked me to look at their Claude bill. Q1 was $4,200/month and climbing. We pulled their request logs into a spreadsheet, classified each call by purpose, then estimated what each should have cost. The answer was uncomfortable:

| Bucket | Cost per month | Producing business value? |
| --- | --- | --- |
| First-time doc analysis | $890 | Yes |
| User chat turns | $410 | Yes |
| Repeated system prompts (no cache) | $1,810 | No |
| Opus calls that should be Sonnet | $680 | No |
| Serial bulk runs (should be batched) | $410 | No |

Three problems, $2,900/month of waste. Each one is unsexy and easy to miss, but together they were about 69% of the bill ($2,900 of $4,200).


Culprit #1: prompt caching is off

This is the silent killer. Claude 4.x supports prompt caching: mark a prompt prefix with a cache_control block (5-minute TTL by default, 1-hour optional), and Anthropic bills the cached tokens at roughly a tenth of the fresh input price on subsequent requests. Pricing today (per million input tokens for Sonnet 4.6):

  • Fresh input: $3.00
  • Cache write: $3.75 for the 5-minute TTL (paid each time the prefix is written to the cache, 1.25x fresh; the 1-hour TTL costs more to write)
  • Cache read: $0.30, 10x cheaper than fresh

The catch: you have to opt in per-request, and most code doesn't. Before/after:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Before: every call pays for the full system prompt
client.messages.create(
    model="claude-sonnet-4-6",
    system="You are an expert at...[2000 words of rules + examples]",
    messages=[{"role": "user", "content": user_input}],
    max_tokens=1024,
)

# After: system prompt cached for 5 minutes (the default TTL)
client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {
            "type": "text",
            "text": "You are an expert at...[2000 words of rules + examples]",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": user_input}],
    max_tokens=1024,
)

One small change. Every subsequent call within the cache TTL gets a 90% discount on the cached system prompt; the user message and the output are billed as usual.

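For my friend: a 20K-token system prompt at 8 requests/min with a ~50% cache hit ratio. Each cache hit saves $2.70 per million tokens ($3.00 fresh versus $0.30 cached), about $0.054 per request on that prefix. Turning caching on closed most of the $1,810 leak.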
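To sanity-check numbers like these against your own traffic, the pricing above reduces to a few lines of arithmetic. This is a rough sketch, not a billing model: it uses the Sonnet 4.6 prices quoted earlier, assumes every cache miss pays the 5-minute cache-write price, and ignores the uncached part of each request. The function names and parameters are illustrative.

```python
# Prices per million input tokens, as quoted above for Sonnet 4.6.
FRESH = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def prefix_cost(prefix_tokens: int, requests: int, hit_ratio: float) -> dict:
    """Estimate what a repeated prompt prefix costs with and without caching.

    Assumption: a cache hit bills the prefix at the read price and a miss
    bills it at the write price (the prefix is re-cached on every miss).
    """
    millions = prefix_tokens * requests / 1_000_000
    uncached = millions * FRESH
    cached = millions * (hit_ratio * CACHE_READ + (1 - hit_ratio) * CACHE_WRITE)
    return {"uncached": uncached, "cached": cached, "saved": uncached - cached}


def break_even_hit_ratio() -> float:
    """Hit ratio above which caching is cheaper than sending the prefix fresh."""
    return (CACHE_WRITE - FRESH) / (CACHE_WRITE - CACHE_READ)


if __name__ == "__main__":
    estimate = prefix_cost(prefix_tokens=20_000, requests=1_000, hit_ratio=0.5)
    print({name: round(value, 2) for name, value in estimate.items()})
    print(round(break_even_hit_ratio(), 3))
```

Under these assumptions, a 20K-token prefix over 1,000 requests at a 50% hit ratio costs about $60.00 uncached versus $40.50 cached, and caching breaks even at a hit ratio of roughly 22%. The higher your hit ratio, the closer you get to the headline 90%.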

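If you call Claude through the OpenAI SDK via an OpenAI-compatible proxy, the proxy may accept a similar hint. Whether prompt_cache_key (an OpenAI parameter) maps onto Anthropic's cache depends on the proxy, so check its documentation: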

client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[...],
    prompt_cache_key="user-session-12345",  # Stable across calls = cache hit
)

Action: open your last week of API logs. If you have any repeated system content across requests, you're leaking.
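One way to run that check, assuming you log the usage object from each response: the Messages API reports input_tokens, cache_creation_input_tokens, and cache_read_input_tokens separately, so the share of input served from cache is a simple ratio. The log format below (a list of dicts) is hypothetical; adapt it to however you record responses.

```python
from typing import Iterable, Mapping


def cache_read_share(usages: Iterable[Mapping[str, int]]) -> float:
    """Fraction of all input tokens that were served from the prompt cache.

    Each item mirrors the `usage` block of a Messages API response.
    Missing or null cache fields are treated as 0 (no caching on that call).
    """
    fresh = written = read = 0
    for usage in usages:
        fresh += usage.get("input_tokens", 0) or 0
        written += usage.get("cache_creation_input_tokens", 0) or 0
        read += usage.get("cache_read_input_tokens", 0) or 0
    total = fresh + written + read
    return read / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical log rows: one cache write followed by two cache reads.
    logged = [
        {"input_tokens": 50, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0},
        {"input_tokens": 60, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
        {"input_tokens": 40, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    ]
    print(f"{cache_read_share(logged):.1%} of input tokens came from cache")
```

A share near zero on an endpoint with a long, repeated system prompt suggests the cache hint is missing or the prefix changes between calls.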


Culprit #2: model overkill

The mental shortcut "Claude = quality, just always use Opus" is expensive. Opus is 4x the cost of Sonnet for inputs, 5x for outputs. For a lot of work, Sonnet or even Haiku is indistinguishable.

I ran 5 tasks across the lineup (1000 samples, scored by judge model + human spot-check):

| Task | Opus 4.7 | Sonnet 4.6 | Haiku 4.5 | Best price/quality |
| --- | --- | --- | --- | --- |
| JSON extraction from PDFs | 99.2% | 98.7% | 96.4% | Haiku |
| Code review (real bugs) | 74% | 74% | 61% | Sonnet |
| Creative copy (blind judged) | 51% pref | 48% pref | 32% pref | Sonnet |
| Multi-step reasoning chain | 89% | 76% | 54% | Opus |
| Customer chat | 92% sat | 89% sat | 81% sat | Sonnet |

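The pattern: Opus wins clearly only on complex multi-step reasoning (89% vs. 76%). On the other four tasks Sonnet is within 3 points of Opus at 1/4 the cost. Haiku is 1/13 the cost of Opus and gives up about 3 points on JSON extraction, which is fine when you have downstream validation, but it trails by 11 points or more on every other task.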
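The "best price/quality" column can be reproduced with a simple rule: take the cheapest model whose score is within a fixed tolerance of the best score. The sketch below uses the scores from the table and the cost ratios stated in this section (Sonnet at 1/4 and Haiku at 1/13 of Opus). The 3-point tolerance is an assumption chosen for illustration, not a number from the audit.

```python
# Relative cost, normalised to Opus = 1 (ratios as stated in the text).
RELATIVE_COST = {"opus": 1.0, "sonnet": 1 / 4, "haiku": 1 / 13}

# Scores from the table above, in percent.
SCORES = {
    "JSON extraction from PDFs": {"opus": 99.2, "sonnet": 98.7, "haiku": 96.4},
    "Code review (real bugs)": {"opus": 74, "sonnet": 74, "haiku": 61},
    "Creative copy (blind judged)": {"opus": 51, "sonnet": 48, "haiku": 32},
    "Multi-step reasoning chain": {"opus": 89, "sonnet": 76, "haiku": 54},
    "Customer chat": {"opus": 92, "sonnet": 89, "haiku": 81},
}


def cheapest_good_enough(scores: dict, tolerance: float = 3.0) -> str:
    """Return the cheapest model scoring within `tolerance` points of the best."""
    best = max(scores.values())
    candidates = [model for model, score in scores.items() if score >= best - tolerance]
    return min(candidates, key=RELATIVE_COST.__getitem__)


if __name__ == "__main__":
    for task, scores in SCORES.items():
        print(f"{task}: {cheapest_good_enough(scores)}")
```

With a 3-point tolerance this picks Haiku for extraction, Opus for multi-step reasoning, and Sonnet for the rest, matching the table. Tightening the tolerance pushes more tasks up to the larger models, so the threshold is a product decision rather than a constant.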

My friend was running every doc through Opus by default. Switching to Sonnet for analysis + Haiku for tagging dropped that bucket from $680 to $140. No quality complaints.

Action: pick the 3 most expensive endpoints in your bill, A/B-test them on the next cheapest model for a week, score outputs blind.
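Blind scoring needs very little machinery. The sketch below randomises which model's output is shown as "A" or "B" for each sample, so the rater cannot infer the source, then tallies preferences per model. The rate callback stands in for a human rater or a judge model; it and the other names here are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple


def blind_compare(
    pairs: Sequence[Tuple[str, str]],
    rate: Callable[[str, str], str],
    seed: int = 0,
) -> Counter:
    """Tally blind preferences between a current and a candidate model.

    `pairs` holds (current_output, candidate_output) for the same input.
    `rate(a, b)` must return "A" or "B" without knowing which model wrote which.
    """
    rng = random.Random(seed)
    wins: Counter = Counter()
    for current_out, candidate_out in pairs:
        swapped = rng.random() < 0.5  # randomise presentation order per sample
        a, b = (candidate_out, current_out) if swapped else (current_out, candidate_out)
        choice = rate(a, b)
        if choice not in ("A", "B"):
            raise ValueError(f"rater returned {choice!r}, expected 'A' or 'B'")
        # "A" is the candidate only when the pair was swapped.
        picked_candidate = (choice == "A") == swapped
        wins["candidate" if picked_candidate else "current"] += 1
    return wins


if __name__ == "__main__":
    # Toy rater that always prefers the longer answer.
    demo = [("short", "a longer answer"), ("a much longer reply", "brief")]
    print(blind_compare(demo, lambda a, b: "A" if len(a) >= len(b) else "B"))
```

If the candidate wins or ties about as often as the current model over a week of real traffic, the downshift is probably safe for that endpoint.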


Culprit #3: serial calls when you could batch

If your work can wait up to 24 hours for its results, the Anthropic Message Batches API charges half price. Batches are processed asynchronously and often finish much sooner, but you should plan around the 24-hour processing window. Same models, same quality, half the bill.

Good fits:

  • Nightly summarization runs
  • Classifying or tagging large datasets
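  • Embedding generation for indexing (through your embedding provider's batch endpoint, if it has one; Claude's Messages API does not return embeddings)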
  • Internal report generation
  • Training data prep

Bad fits:

  • Anything user-facing (you'll wait hours)
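  • Anything where input depends on previous output

import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `docs` is your own iterable of objects with `.id` and `.text`
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{doc.id}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": doc.text}],
            },
        }
        for doc in docs
    ],
)

# Poll until done (or just check tomorrow)
while batch.processing_status != "ended":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

# Results are matched by custom_id and may arrive in any order
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)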

My friend had a nightly job re-summarizing all docs from the previous 24h. Moving it from asyncio.gather to batches cut that bucket from $410 to $205, no user-visible impact.
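For larger nightly jobs the request list may need to be split across several batches. Here is a sketch of building the request entries and chunking them, assuming a limit of 100,000 requests per batch (there is also a total size limit, which this does not check). Doc is a hypothetical stand-in for your own document type, and the model name is the one used in the examples above.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Doc:  # hypothetical document record
    id: int
    text: str


MAX_REQUESTS_PER_BATCH = 100_000  # assumed limit; check the current API docs


def build_requests(docs: List[Doc], model: str = "claude-sonnet-4-6") -> List[dict]:
    """Turn documents into Message Batches request entries."""
    return [
        {
            "custom_id": f"doc-{doc.id}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": doc.text}],
            },
        }
        for doc in docs
    ]


def chunked(requests: List[dict], size: int = MAX_REQUESTS_PER_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` requests."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(requests), size):
        yield requests[start : start + size]
```

Each chunk can then be passed as requests= to client.messages.batches.create, exactly as in the example above.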

Action: any cron job, weekly report, or async task hitting your LLM API — most can be batched.


The result

After three changes (cache hint, model rebalance, batch the async work), my friend's monthly bill went $4,200 → $1,540. Same product, same quality, no rewrites — just turning on features the API already supports.

If your bill feels high, do the same audit:

  1. Pull last 30 days of API calls
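  2. Count distinct system prompts. Fewer than 10 unique prompts across more than 10,000 calls means caching should pay off; if cache_read_input_tokens in the response usage is always 0, it isn't on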
  3. Look at top 5 model+endpoint combos by spend. Anything simple enough to downshift?
  4. Find your largest single-day spike. Batch job? Use the batches API.
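The counting in steps 2 to 4 can be sketched in a few lines, assuming each logged call is a dict with day, model, endpoint, system, and cost keys. That schema is hypothetical; map it onto whatever your logging or billing export provides.

```python
import hashlib
from collections import Counter, defaultdict
from typing import Iterable, Mapping


def audit(calls: Iterable[Mapping]) -> dict:
    """Summarise API call logs for the audit steps above."""
    prompt_counts: Counter = Counter()
    spend_by_combo: defaultdict = defaultdict(float)
    spend_by_day: defaultdict = defaultdict(float)
    total_calls = 0

    for call in calls:
        total_calls += 1
        # Hash the system prompt so long prompts compare cheaply.
        digest = hashlib.sha256(call["system"].encode("utf-8")).hexdigest()
        prompt_counts[digest] += 1
        spend_by_combo[(call["model"], call["endpoint"])] += call["cost"]
        spend_by_day[call["day"]] += call["cost"]

    top_combos = sorted(spend_by_combo.items(), key=lambda kv: kv[1], reverse=True)[:5]
    spike_day = max(spend_by_day, key=spend_by_day.get) if spend_by_day else None
    return {
        "total_calls": total_calls,
        "distinct_system_prompts": len(prompt_counts),
        "top_combos_by_spend": top_combos,
        "largest_spend_day": spike_day,
    }


if __name__ == "__main__":
    # Hypothetical rows standing in for a real log export.
    sample = [
        {"day": "d1", "model": "opus", "endpoint": "/summarize", "system": "rules", "cost": 0.5},
        {"day": "d1", "model": "sonnet", "endpoint": "/chat", "system": "rules", "cost": 0.25},
        {"day": "d2", "model": "opus", "endpoint": "/summarize", "system": "rules", "cost": 2.0},
    ]
    print(audit(sample))
```

A low distinct_system_prompts count next to a high total_calls count is the caching signal from step 2; the other two fields point at where to try a cheaper model or a batch job.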

A shortcut, if you don't want to instrument all this

I built a little proxy called MidRelay that handles the first two automatically: it injects a per-key cache hint into every request (even SDK code that doesn't know about cache_control gets the discount), and it exposes both OpenAI and Anthropic surfaces from the same key so you can route model-by-model without rewriting.

It also happens to be 60-80% cheaper than calling Anthropic / OpenAI directly. (Same models, same wire protocol — your existing SDK just changes the base_url.)

$5 of free credit to test it: drop a comment, I'll DM a code. First 100 readers, no signup gate.

But honestly — the techniques above work on any provider. Even if you never touch MidRelay, just turning on cache_control and downshifting one over-spec'd Opus call will cut your bill more than any "AI cost optimization" SaaS will.

Check your logs tonight.
