## The Problem
If you're calling the Claude API with a large system prompt, every request reprocesses the same tokens from scratch. Production AI systems — agents, RAG pipelines, customer-facing assistants — routinely carry 10K–30K token system prompts (tool definitions, reference docs, few-shot examples). At $3/MTok across hundreds of thousands of daily requests, redundant prefix processing can easily run $500–$3,000+/day. That's pure waste for context the model has already seen.
Anthropic's prompt caching solves this. You mark a stable prefix as cacheable, pay a small one-time write surcharge (1.25×), and every subsequent request reads that prefix at 10% of the standard price.
I ran a controlled experiment to measure the real-world savings. Here are the numbers.
## How Prompt Caching Works

The mechanism is straightforward:

- You attach `cache_control: {"type": "ephemeral"}` to a content block in your request
- The API caches everything up to and including that block (the "prefix")
- On the next request with a byte-for-byte identical prefix, the model reads the prefix from cache instead of reprocessing it
Pricing (Claude Sonnet 4.5):
| Operation | Price / MTok | Relative to Base |
|---|---|---|
| Standard input | $3.00 | 1× |
| Cache write | $3.75 | 1.25× |
| Cache read | $0.30 | 0.1× |
| Output | $15.00 | — |
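The break-even arithmetic falls directly out of this table. A quick sketch, using only the per-MTok rates above (illustrative helper functions, not part of the experiment code):

```python
BASE, WRITE, READ = 3.00, 3.75, 0.30  # $/MTok: standard input, cache write, cache read

def cached_cost(n_requests, prefix_mtok=1.0):
    """One cache write followed by n-1 cache reads of the same prefix."""
    return prefix_mtok * (WRITE + READ * (n_requests - 1))

def uncached_cost(n_requests, prefix_mtok=1.0):
    """Every request pays the standard input rate on the full prefix."""
    return prefix_mtok * BASE * n_requests

# A single request costs 25% more with caching; the second request already wins.
print(cached_cost(1), uncached_cost(1))  # 3.75 vs 3.0
print(cached_cost(2), uncached_cost(2))
```

From the second request onward, caching is strictly cheaper, which is why the break-even point lands at 2 requests.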
Constraints:
- Minimum prefix: 1,024 tokens (model-dependent)
- TTL: 5 minutes, refreshed on each hit
- Max 4 explicit breakpoints per request
- `system` must be passed as an array of content blocks (not a plain string)
## Experiment Design
Three API calls. Same system prompt (~2,158 tokens). Same user question. The only variable is whether caching is enabled:
| Call | Configuration | Expected Behavior |
|---|---|---|
| 1 | No `cache_control` | Baseline — all tokens at standard rate |
| 2 | Explicit `cache_control` on system block | Cache WRITE (1.25× on prefix) |
| 3 | Same as Call 2 | Cache READ (0.1× on prefix) |
## Implementation
Baseline (no caching):
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    system=SYSTEM_PROMPT,  # plain string — caching not possible
    messages=[{"role": "user", "content": question}]
)
```
With explicit cache breakpoint:
```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": question}]
)
```
One structural change: `system` becomes a list of content blocks. That's the only code difference.
## Results

| Call | `input_tokens` | `cache_creation` | `cache_read` | Total Input |
|---|---|---|---|---|
| 1 (baseline) | 2,180 | 0 | 0 | 2,180 |
| 2 (write) | 22 | 2,158 | 0 | 2,180 |
| 3 (read) | 22 | 0 | 2,158 | 2,180 |
The API usage fields tell the full story:
- `input_tokens` = non-cached tail (the user message — 22 tokens)
- `cache_creation_input_tokens` = prefix written to cache
- `cache_read_input_tokens` = prefix served from cache
Call 3 reads 2,158 tokens from cache at $0.30/MTok instead of $3.00/MTok.
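The per-call costs in the next section can be recomputed from these usage fields alone. A minimal sketch with the Sonnet 4.5 rates hard-coded (the helper name is mine, not from the experiment code):

```python
# $/MTok rates from the pricing table above
RATE = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}

def input_cost(input_tokens, cache_creation=0, cache_read=0):
    """Dollar cost of one request's input side, from the API usage fields."""
    return (input_tokens     * RATE["input"]
            + cache_creation * RATE["cache_write"]
            + cache_read     * RATE["cache_read"]) / 1_000_000

print(f"{input_cost(2180):.6f}")         # call 1: baseline
print(f"{input_cost(22, 2158, 0):.6f}")  # call 2: cache write
print(f"{input_cost(22, 0, 2158):.6f}")  # call 3: cache read
```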
## Cost Analysis
| Call | Actual Input Cost | Baseline Cost | Delta |
|---|---|---|---|
| 1 (baseline) | $0.006540 | $0.006540 | — |
| 2 (write) | $0.008159 | $0.006540 | +24.7% (write surcharge) |
| 3 (read) | $0.000713 | $0.006540 | −89.1% |
The write costs 25% more than baseline. The read costs 89% less. Break-even: 2 requests.
## Production Projection
At 10,000 requests/day with a 5-minute TTL, cache writes occur 288 times/day (once per TTL window). The remaining 9,712 requests pay cache-read pricing:
| Metric | With Caching | Without Caching | Savings |
|---|---|---|---|
| Daily | $54.28 | $110.40 | $56.12 (50.8%) |
| Monthly (30d) | $1,628 | $3,312 | $1,684 (50.8%) |
This is with a ~2,158 token system prompt. For agent-style workloads with 10K–30K token system prompts (tool definitions, reference docs, few-shot examples), the write surcharge becomes negligible relative to the prefix size, and total savings approach 85–89%.
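The projection can be reproduced in a few lines. Assumptions carried over from the experiment: a 2,158-token cached prefix, a 22-token uncached tail, 300 output tokens per request, and exactly one cache write per 5-minute TTL window:

```python
REQUESTS_PER_DAY = 10_000
WRITES_PER_DAY = 24 * 60 // 5                    # 288: one write per TTL window
READS_PER_DAY = REQUESTS_PER_DAY - WRITES_PER_DAY  # 9,712 cache reads

# $/MTok: standard input $3.00, cache write $3.75, cache read $0.30, output $15.00
write_cost  = WRITES_PER_DAY * (22 * 3.00 + 2158 * 3.75) / 1e6
read_cost   = READS_PER_DAY  * (22 * 3.00 + 2158 * 0.30) / 1e6
output_cost = REQUESTS_PER_DAY * 300 * 15.00 / 1e6

with_cache    = write_cost + read_cost + output_cost
without_cache = REQUESTS_PER_DAY * 2180 * 3.00 / 1e6 + output_cost
print(f"${with_cache:.2f} vs ${without_cache:.2f}")  # matches the table above
```

Note that output tokens are never cached, which is why total savings (50.8%) sit well below the 89% input-side savings.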
## A Pitfall Worth Documenting
My first implementation used top-level automatic caching:
```python
# ❌ Fails silently with varying user messages
response = client.messages.create(
    model="claude-sonnet-4-5",
    cache_control={"type": "ephemeral"},  # breakpoint at last block
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": question}]
)
```
Every call triggered a cache write — never a read. The API returned `cache_creation_input_tokens > 0` on every request.
Root cause: Top-level `cache_control` places the breakpoint at the last cacheable block, which includes the user message. Different messages produce different prefixes, so the cache key never matches.
Fix: Use explicit `cache_control` on the system prompt block. The cached prefix then covers only the stable system prompt, and varying user messages sit after the breakpoint.
This is not documented prominently in Anthropic's guides, but it's the critical distinction between "caching that works" and "caching that silently charges you 25% more on every call."
## When Prompt Caching Makes Sense
| Scenario | Expected Input Savings |
|---|---|
| Static system prompt (>1K tokens) across requests | ~89% |
| Multi-turn conversations (growing message history) | 70–85% |
| RAG with stable reference documents | 80–90% |
| Agent loops with large tool catalogues | 60–80% |
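The multi-turn row assumes you move the breakpoint forward each turn, so the whole conversation so far becomes the cached prefix of the next call. A sketch under that assumption — `build_request` is an illustrative helper of mine, and it uses two of the four allowed breakpoints (system prompt plus newest turn):

```python
def build_request(system_prompt, history, new_question):
    """Build a Messages API request body that caches both the stable
    system prompt and the conversation history accumulated so far."""
    messages = list(history)
    messages.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": new_question,
            # Breakpoint on the newest turn: everything before it is cached.
            "cache_control": {"type": "ephemeral"},
        }],
    })
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 300,
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": messages,
    }
```

Each turn then writes only the small increment (the new question and the previous answer) while reading the rest at 0.1×, which is where the 70–85% figure comes from.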
## Implementation Checklist

- Verify the system prompt exceeds the model's minimum (1,024 tokens for Sonnet 4.5)
- Restructure `system` from a plain string to a list of content blocks
- Add `"cache_control": {"type": "ephemeral"}` on the last stable block
- Place static content before dynamic content in the prompt
- Confirm cache reads by checking `cache_read_input_tokens > 0` in responses
- Ensure request frequency stays within the 5-minute TTL window
## Full Experiment
Reproducible notebook with all code: Kaggle →
## References
Article #2 in the LLM Engineering Experiments series. Previous: How to Choose the Right Prompt Engineering Pattern