Manish Ramavat

Originally published at kaggle.com

How to Save 90% on Claude API Input Costs With Prompt Caching (2026)

The Problem

If you're calling the Claude API with a large system prompt, every request reprocesses the same tokens from scratch. Production AI systems — agents, RAG pipelines, customer-facing assistants — routinely carry 10K–30K token system prompts (tool definitions, reference docs, few-shot examples). At $3/MTok across hundreds of thousands of daily requests, redundant prefix processing can easily run $500–$3,000+/day. That's pure waste for context the model has already seen.

Anthropic's prompt caching solves this. You mark a stable prefix as cacheable, pay a small one-time write surcharge (1.25×), and every subsequent request reads that prefix at 10% of the standard price.

I ran a controlled experiment to measure the real-world savings. Here are the numbers.

How Prompt Caching Works

The mechanism is straightforward:

  1. You attach cache_control: {"type": "ephemeral"} to a content block in your request
  2. The API caches everything up to and including that block (the "prefix")
  3. On the next request with a byte-for-byte identical prefix, the model reads from cache instead of reprocessing

Pricing (Claude Sonnet 4.5):

| Operation      | Price / MTok | Relative to Base |
|----------------|--------------|------------------|
| Standard input | $3.00        | 1×               |
| Cache write    | $3.75        | 1.25×            |
| Cache read     | $0.30        | 0.1×             |
| Output         | $15.00       |                  |

Constraints:

  • Minimum prefix: 1,024 tokens (model-dependent)
  • TTL: 5 minutes, refreshed on each hit
  • Max 4 explicit breakpoints per request
  • system must be passed as an array of content blocks (not a plain string)
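The last two constraints are easiest to see in code. Below is a minimal sketch of a request with two breakpoints (assuming an anthropic.Anthropic() client and placeholder TOOLS, SYSTEM_PROMPT, and question variables; marking the last tool is the usual pattern for caching a whole tool catalogue):

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# TOOLS, SYSTEM_PROMPT, and question are placeholders for your own content.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    tools=[
        *TOOLS[:-1],
        # Breakpoint 1: marking the last tool caches the whole tool catalogue.
        {**TOOLS[-1], "cache_control": {"type": "ephemeral"}},
    ],
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Breakpoint 2: caches the system prompt on top of the tools.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": question}],  # dynamic part, after the breakpoints
)
```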

Experiment Design

Three API calls. Same system prompt (~2,158 tokens). Same user question. The only variable is whether caching is enabled:

| Call | Configuration                              | Expected Behavior                         |
|------|--------------------------------------------|-------------------------------------------|
| 1    | No cache_control                           | Baseline: all tokens at the standard rate |
| 2    | Explicit cache_control on the system block | Cache WRITE (1.25× on the prefix)         |
| 3    | Same as Call 2                             | Cache READ (0.1× on the prefix)           |

Implementation

Baseline (no caching):

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# SYSTEM_PROMPT (~2,158 tokens) and question are defined earlier in the notebook
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    system=SYSTEM_PROMPT,  # plain string: prefix caching is not possible here
    messages=[{"role": "user", "content": question}],
)
```

With explicit cache breakpoint:

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # breakpoint: cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": question}],
)
```

One structural change: system becomes a list of content blocks. That's the only code difference.

Results

| Call         | input_tokens | cache_creation_input_tokens | cache_read_input_tokens | Total Input |
|--------------|--------------|-----------------------------|-------------------------|-------------|
| 1 (baseline) | 2,180        | 0                           | 0                       | 2,180       |
| 2 (write)    | 22           | 2,158                       | 0                       | 2,180       |
| 3 (read)     | 22           | 0                           | 2,158                   | 2,180       |

The API usage fields tell the full story:

  • input_tokens = non-cached tail (the user message — 22 tokens)
  • cache_creation_input_tokens = prefix written to cache
  • cache_read_input_tokens = prefix served from cache

Call 3 reads 2,158 tokens from cache at $0.30/MTok instead of $3.00/MTok.
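The anthropic Python SDK exposes these counters as attributes on response.usage. A small logging helper makes cache behavior visible per call (field names as exposed by current SDK versions; treat that as an assumption if you're on an older release):

```python
def log_cache_usage(response):
    """Print the three usage fields that reveal cache behavior for a call."""
    u = response.usage
    print(
        f"input={u.input_tokens}  "
        f"cache_write={u.cache_creation_input_tokens}  "
        f"cache_read={u.cache_read_input_tokens}"
    )

# A cache hit shows a small input count and cache_read > 0,
# e.g. "input=22  cache_write=0  cache_read=2158" for Call 3.
log_cache_usage(response)
```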

Cost Analysis

| Call         | Actual Input Cost | Baseline Cost | Delta                    |
|--------------|-------------------|---------------|--------------------------|
| 1 (baseline) | $0.006540         | $0.006540     |                          |
| 2 (write)    | $0.008159         | $0.006540     | +24.7% (write surcharge) |
| 3 (read)     | $0.000713         | $0.006540     | −89.1%                   |

The write costs 25% more than baseline. The read costs 89% less. Break-even: 2 requests.
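The dollar figures follow directly from the usage counts and the price table. A small helper reproduces them (input prices hard-coded for Claude Sonnet 4.5; this is an illustrative sketch, not code from the experiment):

```python
# Per-MTok input prices for Claude Sonnet 4.5, from the pricing table above.
PRICE_INPUT, PRICE_WRITE, PRICE_READ = 3.00, 3.75, 0.30

def input_cost(input_tokens, cache_write_tokens=0, cache_read_tokens=0):
    """Input-side cost in dollars for a single call."""
    return (
        input_tokens * PRICE_INPUT
        + cache_write_tokens * PRICE_WRITE
        + cache_read_tokens * PRICE_READ
    ) / 1_000_000

print(input_cost(2_180))                          # Call 1: ~$0.006540
print(input_cost(22, cache_write_tokens=2_158))   # Call 2: ~$0.008159
print(input_cost(22, cache_read_tokens=2_158))    # Call 3: ~$0.000713
```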

Production Projection

At 10,000 requests/day with a 5-minute TTL, assume cache writes occur 288 times/day (once per TTL window); that is conservative, since the TTL refreshes on every hit, so steady traffic would rewrite even less often. The remaining 9,712 requests pay cache-read pricing:

| Metric               | With Caching | Without Caching | Savings |
|----------------------|--------------|-----------------|---------|
| Daily                | $54.28       | $110.40         | $56.12  |
| Monthly (30d)        | $1,628       | $3,312          | $1,684  |
| Savings vs. baseline |              |                 | 50.8%   |

This is with a ~2,158 token system prompt. For agent-style workloads with 10K–30K token system prompts (tool definitions, reference docs, few-shot examples), the write surcharge becomes negligible relative to the prefix size, and total savings approach 85–89%.
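To sanity-check the projection or plug in your own prompt sizes, the arithmetic fits in a few lines. Note that the table's totals only reproduce if output cost is included, so the sketch below assumes roughly 300 output tokens per request at $15/MTok (matching the experiment's max_tokens=300); both that assumption and the helper itself are illustrative, not the article's original code:

```python
PRICE_INPUT, PRICE_WRITE, PRICE_READ, PRICE_OUTPUT = 3.00, 3.75, 0.30, 15.00

def daily_cost(requests, prefix_tokens, tail_tokens, output_tokens, writes_per_day=288):
    """Daily cost with and without caching, assuming one cache write per TTL window."""
    reads = requests - writes_per_day
    without = requests * (
        (prefix_tokens + tail_tokens) * PRICE_INPUT + output_tokens * PRICE_OUTPUT
    ) / 1e6
    with_cache = (
        writes_per_day * (tail_tokens * PRICE_INPUT + prefix_tokens * PRICE_WRITE)
        + reads * (tail_tokens * PRICE_INPUT + prefix_tokens * PRICE_READ)
        + requests * output_tokens * PRICE_OUTPUT
    ) / 1e6
    return with_cache, without

cached, uncached = daily_cost(10_000, prefix_tokens=2_158, tail_tokens=22, output_tokens=300)
print(round(cached, 2), round(uncached, 2))  # ~54.28 and ~110.4, matching the table
```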

A Pitfall Worth Documenting

My first implementation used top-level automatic caching:

```python
# ❌ Fails silently with varying user messages
response = client.messages.create(
    model="claude-sonnet-4-5",
    cache_control={"type": "ephemeral"},  # breakpoint at last block
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": question}]  # varies per request
)
```

Every call triggered a cache write — never a read. The API returned cache_creation_input_tokens > 0 on every request.

Root cause: Top-level cache_control places the breakpoint at the last cacheable block, which includes the user message. Different messages produce different prefixes, so the cache key never matches.

Fix: Use explicit cache_control on the system prompt block. The cached prefix then covers only the stable system prompt, and varying user messages sit after the breakpoint.

This is not documented prominently in Anthropic's guides, but it's the critical distinction between "caching that works" and "caching that silently charges you 25% more on every call."
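A cheap guardrail is to assert the expected behavior in staging: after one warm-up call, subsequent identical-prefix calls should report reads, not writes. A minimal sketch, where make_request is any zero-argument callable you supply that issues one Messages API call with your production prompt:

```python
def assert_cache_hits(make_request, warmup=1, probes=3):
    """Call make_request repeatedly and fail loudly if the cache is never read."""
    for _ in range(warmup):
        make_request()  # the first call is expected to be a cache write
    for _ in range(probes):
        usage = make_request().usage
        if not usage.cache_read_input_tokens:
            raise RuntimeError(
                "No cache reads observed: every request is paying the 1.25x write "
                "surcharge. Check that cache_control sits on the stable prefix only."
            )
```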

When Prompt Caching Makes Sense

| Scenario                                            | Expected Input Savings |
|-----------------------------------------------------|------------------------|
| Static system prompt (>1K tokens) across requests   | ~89%                   |
| Multi-turn conversations (growing message history)  | 70–85%                 |
| RAG with stable reference documents                 | 80–90%                 |
| Agent loops with large tool catalogues              | 60–80%                 |

Implementation Checklist

  1. Verify the system prompt exceeds the model's minimum (1,024 tokens for Sonnet 4.5); the token-counting sketch after this checklist is one way to check
  2. Restructure system from a plain string to a list of content blocks
  3. Add "cache_control": {"type": "ephemeral"} on the last stable block
  4. Place static content before dynamic content in the prompt
  5. Confirm cache reads by checking cache_read_input_tokens > 0 in responses
  6. Ensure request frequency stays within the 5-minute TTL window
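
For item 1, the SDK's token-counting endpoint gives an exact count before you restructure anything. A short pre-flight check (assuming client and SYSTEM_PROMPT from the implementation above, and that client.messages.count_tokens is available in your SDK version; it requires at least one message, so a one-word placeholder is included):

```python
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "ping"}],  # count_tokens needs a message
)
print(count.input_tokens)  # should comfortably exceed 1,024 for the prefix to be cacheable
assert count.input_tokens >= 1_024, "System prompt is below the caching minimum"
```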

Full Experiment

Reproducible notebook with all code: Kaggle →


Article #2 in the LLM Engineering Experiments series. Previous: How to Choose the Right Prompt Engineering Pattern
