Tufail Khan

Posted on • Originally published at tufail.dev

Cutting our Claude API bill by 78% with prompt caching

In January 2026 our monthly Claude bill crossed $4,200, up from $600 six months earlier. We were serving a RAG-backed customer-support assistant that retrieved ~12K tokens of context per query, ran through an 800-token system prompt, and called Claude an average of 4.2 times per user session.

Rolling out Anthropic's prompt caching dropped that to $920/month — a 78% reduction — without touching any user-facing behavior.

This post is the exact playbook.

What prompt caching does

Claude's prompt caching stores prefix portions of your prompt in Anthropic's infrastructure. When a subsequent request reuses that same prefix, the cached portion costs 10% of the normal input-token price and is processed much faster.

The pricing in 2026:

  • Cache write: 1.25× input cost (on first use)
  • Cache read (hit): 0.1× input cost
  • TTL: 5 minutes by default, 1 hour available

Break-even comes almost immediately: the extra 0.25× paid on the first write is more than repaid by the 0.9× saved on the very first cache hit. In practice, a well-placed cache break point hits dozens to hundreds of times before it expires.
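The break-even arithmetic can be sketched as a quick cost model, with prices normalized so a regular input token costs 1.0 (multipliers taken from the pricing list above):

```python
def cost_with_cache(hits: int, write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Relative input cost for one cache write followed by `hits` cache reads."""
    return write_mult + hits * read_mult

def cost_without_cache(requests: int) -> float:
    """Relative input cost when every request pays full price."""
    return float(requests)

# One write + one hit (~1.35) already beats two uncached requests (2.0):
print(cost_with_cache(1), cost_without_cache(2))

# At 100 hits the cached prefix costs roughly 11% of the uncached price:
print(cost_with_cache(100) / cost_without_cache(101))
```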

Where to cache — high, medium, low ROI

High ROI (always cache):

  • System prompts (usually stable across all requests)
  • Long tool-schema definitions
  • Retrieved context chunks reused within a session (RAG)
  • Few-shot example banks

Medium ROI:

  • User conversation history early in a session (caches grow as the conversation progresses)
  • Document chunks that appear frequently across queries

Low / anti-ROI:

  • Per-request user input
  • Anything that changes every call
  • Caches smaller than 1024 tokens (minimum cache block size for Claude Opus/Sonnet)

The anatomy of a cached prompt

In the Python SDK, you add cache_control markers to the content blocks you want cached. Everything before the marker gets cached as a prefix.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # stable, reusable
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=[
        {
            "name": "search_docs",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_context,  # session-scoped RAG chunks
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"User question: {user_input}",
                    # no cache marker — this changes every request
                },
            ],
        }
    ],
)

# inspect cache metrics
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)

The key constraint: Anthropic allows up to 4 cache break points per request. We use all 4:

  1. System prompt (changes ~monthly)
  2. Tool schemas (changes ~monthly)
  3. Retrieved RAG context (changes per session)
  4. Conversation history (grows within session)
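For break point 4, conversation history, we mark the final block of the last prior turn so the whole transcript up to that point lands in the cached prefix, and the cache grows as the session does. A minimal sketch of the message-building step (the function and variable names here are my own, not from the SDK):

```python
import copy

def build_messages(history: list[dict], user_input: str) -> list[dict]:
    """Build a messages list with a cache break point on the last prior turn.

    `history` is a list of {"role": ..., "content": [{"type": "text", ...}]}
    turns from earlier in the session. Each new request then re-caches only
    the delta past the previous break point.
    """
    messages = copy.deepcopy(history)
    if messages:
        # Break point 4: mark the final content block of the last prior turn.
        messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    # The new user input is never cached -- it changes every request.
    messages.append({
        "role": "user",
        "content": [{"type": "text", "text": user_input}],
    })
    return messages

history = [
    {"role": "user", "content": [{"type": "text", "text": "How do I reset my password?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Go to Settings > Security."}]},
]
msgs = build_messages(history, "And if I lost my 2FA device?")
print(msgs[1]["content"][-1]["cache_control"])  # {'type': 'ephemeral'}
```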

Metrics from real traffic

Before caching, on a representative 1,000-request sample:

  • Input tokens billed: 14.2M (≈ $42.60 at Opus 4.7 pricing)
  • Output tokens billed: 380K (≈ $28.50)
  • Total: $71.10

After caching, same workload:

  • Cache write input tokens: 1.8M ($6.75)
  • Cache read input tokens: 12.1M ($3.63)
  • Uncached input tokens: 300K ($0.90)
  • Output tokens: 380K ($28.50)
  • Total: $39.78 (−44%)
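The after-caching total can be reproduced from the token counts above, using the per-million rates the pre-caching figures imply ($3/M input, $75/M output for this workload) and the 1.25×/0.1× cache multipliers:

```python
# Per-million-token rates implied by the figures in the post.
INPUT_RATE = 3.00    # 14.2M input tokens ~= $42.60
OUTPUT_RATE = 75.00  # 380K output tokens ~= $28.50

def cost(tokens: int, rate_per_million: float) -> float:
    return tokens / 1e6 * rate_per_million

after = (
    cost(1_800_000, INPUT_RATE * 1.25)     # cache writes:   $6.75
    + cost(12_100_000, INPUT_RATE * 0.10)  # cache reads:    $3.63
    + cost(300_000, INPUT_RATE)            # uncached input: $0.90
    + cost(380_000, OUTPUT_RATE)           # output:         $28.50
)
print(round(after, 2))  # 39.78
```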

Output tokens dominate what's left. Short of switching models, the input side is essentially solved.

Watch out for: cache invalidation footguns

Cache hits match on exact byte-level prefix equality. Any variance busts the cache. Things that silently broke ours early on:

  • Whitespace drift in system-prompt templating (a stray \n from a template engine)
  • Dict-ordering when serializing tool schemas from a Python dict — always use json.dumps(..., sort_keys=True)
  • Timestamp injection into system prompts ("Today is {date}..." rebuilds the cache every day — move it to user content)
  • User-scoped data in system prompt — blows cache per user; move it down the prompt
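The dict-ordering footgun is easy to reproduce: two dicts with the same keys in different insertion order serialize to different byte strings unless you sort. A quick check:

```python
import json

schema_a = {"name": "search_docs", "input_schema": {"type": "object"}}
schema_b = {"input_schema": {"type": "object"}, "name": "search_docs"}

# Naive serialization preserves insertion order -> different bytes, cache miss
print(json.dumps(schema_a) == json.dumps(schema_b))  # False

# Sorted keys -> identical bytes, cache hit
print(json.dumps(schema_a, sort_keys=True) == json.dumps(schema_b, sort_keys=True))  # True
```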

Instrument cache_creation_input_tokens vs cache_read_input_tokens on every response and alert if the ratio drifts. A week of silent cache misses can cost you thousands.
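A minimal sketch of the ratio check we alert on (the 60% threshold and the function names are our own convention, not anything from the SDK):

```python
def cache_hit_ratio(cache_read_tokens: int, cache_creation_tokens: int,
                    uncached_input_tokens: int) -> float:
    """Fraction of this response's input tokens that were served from cache."""
    total = cache_read_tokens + cache_creation_tokens + uncached_input_tokens
    return cache_read_tokens / total if total else 0.0

def should_alert(ratio: float, threshold: float = 0.6) -> bool:
    # Alert when cached reads fall below 60% of input tokens --
    # a likely sign of silent prefix drift busting the cache.
    return ratio < threshold

# Healthy response, using the sample above: 12.1M reads out of 14.2M input tokens
r = cache_hit_ratio(12_100_000, 1_800_000, 300_000)
print(round(r, 3), should_alert(r))  # 0.852 False
```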

The 1-hour cache tier

Anthropic added a 1-hour TTL option in mid-2025. A 1-hour cache write costs 2× the base input price (versus 1.25× for the 5-minute tier) but lives 12× longer. For workloads with predictable hot paths — e.g. a support assistant where 80% of sessions hit the same product docs — the 1-hour tier amortizes beautifully.

"cache_control": {"type": "ephemeral", "ttl": "1h"}

Use it where cache hit rate is high. Don't use it for small cache blocks or unpredictable traffic — you'll pay the write premium without the hit volume.
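Whether the 1-hour tier pays off follows the same arithmetic as the base break-even: the 2× write wins when it avoids enough 1.25× re-writes. A rough model, with prices normalized to the base input rate (the traffic shape here is illustrative, not our real numbers):

```python
def five_min_cost(rewrites: int, hits: int) -> float:
    """Relative cost over one hour: `rewrites` 1.25x writes plus 0.1x reads."""
    return 1.25 * rewrites + 0.1 * hits

def one_hour_cost(hits: int) -> float:
    """Relative cost over one hour: a single 2x write, then 0.1x reads."""
    return 2.0 + 0.1 * hits

# If traffic gaps force the 5-minute cache to be rewritten just twice in an
# hour, the 1-hour tier is already cheaper:
print(five_min_cost(rewrites=2, hits=50))  # 7.5
print(one_hour_cost(hits=50))              # 7.0
```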

The takeaway

Prompt caching is the highest-ROI single change I've made to a production Claude app in the last year. If you're running a RAG, agent, or long-context workload on Claude and not using prompt caching, the savings are almost certainly 40-80% sitting on the table.

The cost to implement: two afternoons, including the instrumentation. The cost to ignore: compounding every month you don't do it.
