Manish Ramavat

Originally published at kaggle.com

How to Save 90% on Claude API Input Costs With Prompt Caching (2026)

The Problem

If you're calling the Claude API with a large system prompt, every request reprocesses the same tokens from scratch. Production AI systems — agents, RAG pipelines, customer-facing assistants — routinely carry 10K–30K token system prompts (tool definitions, reference docs, few-shot examples). At $3/MTok across hundreds of thousands of daily requests, redundant prefix processing can easily run $500–$3,000+/day. That's pure waste for context the model has already seen.

Anthropic's prompt caching solves this. You mark a stable prefix as cacheable, pay a small one-time write surcharge (1.25×), and every subsequent request reads that prefix at 10% of the standard price.

I ran a controlled experiment to measure the real-world savings. Here are the numbers.

How Prompt Caching Works

The mechanism is straightforward:

  1. You attach cache_control: {"type": "ephemeral"} to a content block in your request
  2. The API caches everything up to and including that block (the "prefix")
  3. On the next request with a byte-for-byte identical prefix, the model reads from cache instead of reprocessing

Pricing (Claude Sonnet 4.5):

| Operation      | Price / MTok | Relative to Base |
|----------------|--------------|------------------|
| Standard input | $3.00        | 1×               |
| Cache write    | $3.75        | 1.25×            |
| Cache read     | $0.30        | 0.1×             |
| Output         | $15.00       |                  |

Constraints:

  • Minimum prefix: 1,024 tokens (model-dependent)
  • TTL: 5 minutes, refreshed on each hit
  • Max 4 explicit breakpoints per request
  • system must be passed as an array of content blocks (not a plain string)
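The last two constraints are easiest to see in code. Below is a minimal sketch of a request with two breakpoints (assuming an anthropic.Anthropic() client and placeholder TOOLS, SYSTEM_PROMPT, and question variables; marking the last tool is the usual pattern for caching a whole tool catalogue):

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# TOOLS, SYSTEM_PROMPT, and question are placeholders for your own content.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    tools=[
        *TOOLS[:-1],
        # Breakpoint 1: marking the last tool caches the whole tool catalogue.
        {**TOOLS[-1], "cache_control": {"type": "ephemeral"}},
    ],
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Breakpoint 2: caches the system prompt on top of the tools.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": question}],  # dynamic part, after the breakpoints
)
```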

Experiment Design

Three API calls. Same system prompt (~2,158 tokens). Same user question. The only variable is whether caching is enabled:

| Call | Configuration                              | Expected Behavior                         |
|------|--------------------------------------------|-------------------------------------------|
| 1    | No cache_control                           | Baseline: all tokens at the standard rate |
| 2    | Explicit cache_control on the system block | Cache WRITE (1.25× on the prefix)         |
| 3    | Same as Call 2                             | Cache READ (0.1× on the prefix)           |

Implementation

Baseline (no caching):

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# SYSTEM_PROMPT (~2,158 tokens) and question are defined earlier in the notebook
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    system=SYSTEM_PROMPT,  # plain string: prefix caching is not possible here
    messages=[{"role": "user", "content": question}],
)
```

With explicit cache breakpoint:

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # breakpoint: cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": question}],
)
```

One structural change: system becomes a list of content blocks. That's the only code difference.

Results

| Call         | input_tokens | cache_creation_input_tokens | cache_read_input_tokens | Total Input |
|--------------|--------------|-----------------------------|-------------------------|-------------|
| 1 (baseline) | 2,180        | 0                           | 0                       | 2,180       |
| 2 (write)    | 22           | 2,158                       | 0                       | 2,180       |
| 3 (read)     | 22           | 0                           | 2,158                   | 2,180       |

The API usage fields tell the full story:

  • input_tokens = non-cached tail (the user message — 22 tokens)
  • cache_creation_input_tokens = prefix written to cache
  • cache_read_input_tokens = prefix served from cache

Call 3 reads 2,158 tokens from cache at $0.30/MTok instead of $3.00/MTok.
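The anthropic Python SDK exposes these counters as attributes on response.usage. A small logging helper makes cache behavior visible per call (field names as exposed by current SDK versions; treat that as an assumption if you're on an older release):

```python
def log_cache_usage(response):
    """Print the three usage fields that reveal cache behavior for a call."""
    u = response.usage
    print(
        f"input={u.input_tokens}  "
        f"cache_write={u.cache_creation_input_tokens}  "
        f"cache_read={u.cache_read_input_tokens}"
    )

# A cache hit shows a small input count and cache_read > 0,
# e.g. "input=22  cache_write=0  cache_read=2158" for Call 3.
log_cache_usage(response)
```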

Cost Analysis

| Call         | Actual Input Cost | Baseline Cost | Delta                    |
|--------------|-------------------|---------------|--------------------------|
| 1 (baseline) | $0.006540         | $0.006540     |                          |
| 2 (write)    | $0.008159         | $0.006540     | +24.7% (write surcharge) |
| 3 (read)     | $0.000713         | $0.006540     | −89.1%                   |

The write costs 25% more than baseline. The read costs 89% less. Break-even: 2 requests.
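The dollar figures follow directly from the usage counts and the price table. A small helper reproduces them (input prices hard-coded for Claude Sonnet 4.5; this is an illustrative sketch, not code from the experiment):

```python
# Per-MTok input prices for Claude Sonnet 4.5, from the pricing table above.
PRICE_INPUT, PRICE_WRITE, PRICE_READ = 3.00, 3.75, 0.30

def input_cost(input_tokens, cache_write_tokens=0, cache_read_tokens=0):
    """Input-side cost in dollars for a single call."""
    return (
        input_tokens * PRICE_INPUT
        + cache_write_tokens * PRICE_WRITE
        + cache_read_tokens * PRICE_READ
    ) / 1_000_000

print(input_cost(2_180))                          # Call 1: ~$0.006540
print(input_cost(22, cache_write_tokens=2_158))   # Call 2: ~$0.008159
print(input_cost(22, cache_read_tokens=2_158))    # Call 3: ~$0.000713
```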

Production Projection

At 10,000 requests/day with a 5-minute TTL, assume cache writes occur 288 times/day (once per TTL window); that is conservative, since the TTL refreshes on every hit, so steady traffic would rewrite even less often. The remaining 9,712 requests pay cache-read pricing:

| Metric               | With Caching | Without Caching | Savings |
|----------------------|--------------|-----------------|---------|
| Daily                | $54.28       | $110.40         | $56.12  |
| Monthly (30d)        | $1,628       | $3,312          | $1,684  |
| Savings vs. baseline |              |                 | 50.8%   |

This is with a ~2,158 token system prompt. For agent-style workloads with 10K–30K token system prompts (tool definitions, reference docs, few-shot examples), the write surcharge becomes negligible relative to the prefix size, and total savings approach 85–89%.
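To sanity-check the projection or plug in your own prompt sizes, the arithmetic fits in a few lines. Note that the table's totals only reproduce if output cost is included, so the sketch below assumes roughly 300 output tokens per request at $15/MTok (matching the experiment's max_tokens=300); both that assumption and the helper itself are illustrative, not the article's original code:

```python
PRICE_INPUT, PRICE_WRITE, PRICE_READ, PRICE_OUTPUT = 3.00, 3.75, 0.30, 15.00

def daily_cost(requests, prefix_tokens, tail_tokens, output_tokens, writes_per_day=288):
    """Daily cost with and without caching, assuming one cache write per TTL window."""
    reads = requests - writes_per_day
    without = requests * (
        (prefix_tokens + tail_tokens) * PRICE_INPUT + output_tokens * PRICE_OUTPUT
    ) / 1e6
    with_cache = (
        writes_per_day * (tail_tokens * PRICE_INPUT + prefix_tokens * PRICE_WRITE)
        + reads * (tail_tokens * PRICE_INPUT + prefix_tokens * PRICE_READ)
        + requests * output_tokens * PRICE_OUTPUT
    ) / 1e6
    return with_cache, without

cached, uncached = daily_cost(10_000, prefix_tokens=2_158, tail_tokens=22, output_tokens=300)
print(round(cached, 2), round(uncached, 2))  # ~54.28 and ~110.4, matching the table
```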

A Pitfall Worth Documenting

My first implementation used top-level automatic caching:

```python
# ❌ Fails silently with varying user messages
response = client.messages.create(
    model="claude-sonnet-4-5",
    cache_control={"type": "ephemeral"},  # breakpoint at last block
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": question}]  # varies per request
)
```

Every call triggered a cache write — never a read. The API returned cache_creation_input_tokens > 0 on every request.

Root cause: Top-level cache_control places the breakpoint at the last cacheable block, which includes the user message. Different messages produce different prefixes, so the cache key never matches.

Fix: Use explicit cache_control on the system prompt block. The cached prefix then covers only the stable system prompt, and varying user messages sit after the breakpoint.

This is not documented prominently in Anthropic's guides, but it's the critical distinction between "caching that works" and "caching that silently charges you 25% more on every call."
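A cheap guardrail is to assert the expected behavior in staging: after one warm-up call, subsequent identical-prefix calls should report reads, not writes. A minimal sketch, where make_request is any zero-argument callable you supply that issues one Messages API call with your production prompt:

```python
def assert_cache_hits(make_request, warmup=1, probes=3):
    """Call make_request repeatedly and fail loudly if the cache is never read."""
    for _ in range(warmup):
        make_request()  # the first call is expected to be a cache write
    for _ in range(probes):
        usage = make_request().usage
        if not usage.cache_read_input_tokens:
            raise RuntimeError(
                "No cache reads observed: every request is paying the 1.25x write "
                "surcharge. Check that cache_control sits on the stable prefix only."
            )
```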

When Prompt Caching Makes Sense

| Scenario                                            | Expected Input Savings |
|-----------------------------------------------------|------------------------|
| Static system prompt (>1K tokens) across requests   | ~89%                   |
| Multi-turn conversations (growing message history)  | 70–85%                 |
| RAG with stable reference documents                 | 80–90%                 |
| Agent loops with large tool catalogues              | 60–80%                 |

Implementation Checklist

  1. Verify the system prompt exceeds the model's minimum (1,024 tokens for Sonnet 4.5); the token-counting sketch after this checklist is one way to check
  2. Restructure system from a plain string to a list of content blocks
  3. Add "cache_control": {"type": "ephemeral"} on the last stable block
  4. Place static content before dynamic content in the prompt
  5. Confirm cache reads by checking cache_read_input_tokens > 0 in responses
  6. Ensure request frequency stays within the 5-minute TTL window
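
For item 1, the SDK's token-counting endpoint gives an exact count before you restructure anything. A short pre-flight check (assuming client and SYSTEM_PROMPT from the implementation above, and that client.messages.count_tokens is available in your SDK version; it requires at least one message, so a one-word placeholder is included):

```python
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "ping"}],  # count_tokens needs a message
)
print(count.input_tokens)  # should comfortably exceed 1,024 for the prefix to be cacheable
assert count.input_tokens >= 1_024, "System prompt is below the caching minimum"
```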

Full Experiment

Reproducible notebook with all code: Kaggle →


Article #2 in the LLM Engineering Experiments series. Previous: How to Choose the Right Prompt Engineering Pattern
