Originally published at claudeguide.io/claude-prompt-caching-guide
Prompt Caching: The 90% Discount Most Claude Developers Miss
Prompt caching reduces cache-read token costs by up to 90% compared to standard input pricing. It works by storing a prefix of your prompt on Anthropic's servers so that subsequent requests reuse the cached version instead of re-processing the full text. For any application that sends the same system prompt or context repeatedly — agents, chatbots, document Q&A systems — caching is the single highest-ROI optimisation available. For the full cost break-even analysis, see Claude API Cost and Prompt Caching Break-Even.
Most developers don't use it because it requires a one-line change to their API call. This guide shows exactly what to add.
How prompt caching works
When you include "cache_control": {"type": "ephemeral"} in a message block, Anthropic stores that block's content in a temporary cache on their edge servers. The cache entry:
- Lives for 5 minutes (TTL resets each time it's hit)
- Is scoped to your API key — no other users share your cache
- Requires a minimum size of 1,024 tokens (Haiku) or 2,048 tokens (Sonnet/Opus) to be worth caching
On the first request (cache write), you pay a premium: 1.25× the standard input price. On every subsequent request that hits the cache (cache read), you pay only 0.1× the standard input price — a 90% discount.
Pricing breakdown (April 2026)
For claude-sonnet-4-6:
| Token type | Price per 1M tokens |
|---|---|
| Standard input | $3.00 |
| Cache write | $3.75 (1.25×) |
| Cache read | $0.30 (0.1×) |
| Output | $15.00 |
Break-even analysis: You recover the cache-write premium after just 2 cache reads on the same content. Any application that processes the same document more than twice per 5-minute window is leaving money on the table. Plug your own numbers into the Prompt Caching Break-Even Calculator to see the exact monthly savings for your traffic pattern.
For longer documents like a 50,000-token codebase or legal contract, the math becomes dramatic. At $3.00/M, reading that document 100 times costs $15.00. With caching, you pay $3.75 once and $0.30 × 99 times = $33.45... wait, that's higher? No — the break-even is 2 reads within the TTL window. See the real calculation:
Without caching (100 reads): 50,000 tokens × 100 × $3.00/M = $15.00
With caching (1 write + 99 reads): (50,000 × $3.75/M) + (50,000 × 99 × $0.30/M) = $0.1875 + $1.485 = $1.67
That's an 89% reduction for high-frequency document access.
When to use prompt caching
Prompt caching is most valuable when:
| Scenario | Cache target | Expected savings |
|---|---|---|
| Chatbot with long system prompt | System message | 60–80% |
| Document Q&A (same doc, many questions) | Document content | 80–90% |
| Agent with fixed tool definitions | Tool schemas | 40–60% |
| Code review bot (same codebase) | Repository context | 85–95% |
| Multi-turn conversations | Growing history | 50–70% |
It has minimal value for:
- Single-turn, one-off completions
- Short prompts under the minimum token threshold
- Requests more than 5 minutes apart
Implementation: basic example
Add cache_control to any content block you want cached. Here's a chatbot with a large system prompt:
python
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a senior Python engineer...
[... 5,000 tokens of instructions, examples, and context ...]"""
def chat(user_message: str, conversation_history: list) -
[→ Get the Cost Optimization Masterclass — $59](https://shoutfirst.gumroad.com/l/msjkda?utm_source=claudeguide&utm_medium=article&utm_campaign=claude-prompt-caching-guide)
*30-day money-back guarantee. Instant download.*
Top comments (0)