In January 2026 our monthly Claude bill crossed $4,200, up from $600 six months earlier. We were serving a RAG-backed customer-support assistant that retrieved ~12K tokens of context per query, ran through an 800-token system prompt, and called Claude an average of 4.2 times per user session.
Rolling out Anthropic's prompt caching dropped that to $920/month — a 78% reduction — without touching any user-facing behavior.
This post is the exact playbook.
What prompt caching does
Claude's prompt caching stores prefix portions of your prompt in Anthropic's infrastructure. When a subsequent request reuses that same prefix, the cached portion costs 10% of the normal input-token price and is processed much faster.
The pricing in 2026:
- Cache write: 1.25× input cost (on first use)
- Cache read (hit): 0.1× input cost
- TTL: 5 minutes by default, 1 hour available
Break-even is a single cache hit: the 0.25× write premium is repaid by the 0.9× saved on the first read. In practice, a well-placed cache break point hits dozens to hundreds of times before it expires.
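Working that arithmetic for a single reused prefix (a standalone sketch; costs are in units of the prefix's uncached input price, using the 1.25×/0.1× multipliers above):

```python
def cumulative_cost(n_requests: int, cached: bool) -> float:
    """Input-side cost of n requests that reuse one prompt prefix,
    in units of that prefix's uncached input price."""
    if not cached:
        return float(n_requests)
    # first request writes the cache at 1.25x; the rest read at 0.1x
    return 1.25 + 0.10 * (n_requests - 1)

# After 50 requests against the same prefix:
print(round(cumulative_cost(50, cached=False), 2))  # 50.0
print(round(cumulative_cost(50, cached=True), 2))   # 6.15
```

At 50 reuses the cached prefix costs about 12% of what it would uncached, which is where the headline savings come from.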
Where to cache — high, medium, low ROI
High ROI (always cache):
- System prompts (usually stable across all requests)
- Long tool-schema definitions
- Retrieved context chunks reused within a session (RAG)
- Few-shot example banks
Medium ROI:
- User conversation history early in a session (caches grow as the conversation progresses)
- Document chunks that appear frequently across queries
Low / anti-ROI:
- Per-request user input
- Anything that changes every call
- Caches smaller than 1024 tokens (minimum cache block size for Claude Opus/Sonnet)
The anatomy of a cached prompt
In the Python SDK, you add cache_control markers to the content blocks you want cached. Everything before the marker gets cached as a prefix.
```python
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # stable, reusable
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=[
        {
            "name": "search_docs",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_context,  # session-scoped RAG chunks
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"User question: {user_input}",
                    # no cache marker — this changes every request
                },
            ],
        }
    ],
)

# inspect cache metrics
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)
```
The key constraint: the API allows up to 4 cache break points per request. We use all 4:
- System prompt (changes ~monthly)
- Tool schemas (changes ~monthly)
- Retrieved RAG context (changes per session)
- Conversation history (grows within session)
Metrics from real traffic
Before caching, on a representative 1,000-request sample:
- Input tokens billed: 14.2M (≈ $42.60 at Opus 4.7 pricing)
- Output tokens billed: 380K (≈ $28.50)
- Total: $71.10
After caching, same workload:
- Cache write input tokens: 1.8M ($6.75)
- Cache read input tokens: 12.1M ($3.63)
- Uncached input tokens: 300K ($0.90)
- Output tokens: 380K ($28.50)
- Total: $39.78 (−44%)
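Those totals can be reproduced from the raw token counts (a sketch; the $3/MTok input and $75/MTok output rates are implied by the figures above, not quoted list prices):

```python
MTOK = 1_000_000
INPUT_RATE = 3.00    # $/MTok, implied by 14.2M input tokens ~= $42.60
OUTPUT_RATE = 75.00  # $/MTok, implied by 380K output tokens ~= $28.50

def workload_cost(cache_write, cache_read, uncached, output):
    """Dollar cost of a workload given token counts and the cache multipliers."""
    return (
        cache_write / MTOK * INPUT_RATE * 1.25   # cache writes: 1.25x input
        + cache_read / MTOK * INPUT_RATE * 0.10  # cache reads: 0.1x input
        + uncached / MTOK * INPUT_RATE           # uncached input: 1x
        + output / MTOK * OUTPUT_RATE
    )

before = workload_cost(0, 0, 14_200_000, 380_000)
after = workload_cost(1_800_000, 12_100_000, 300_000, 380_000)
print(f"before=${before:.2f} after=${after:.2f}")
```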
Output tokens dominate what's left. Short of switching models, the input side is essentially solved.
Watch out for: cache invalidation footguns
Cache hits match on exact byte-level prefix equality. Any variance busts the cache. Things that silently broke ours early on:
- Whitespace drift in system-prompt templating (a stray `\n` from a template engine)
- Dict ordering when serializing tool schemas from a Python dict — always use `json.dumps(..., sort_keys=True)`
- Timestamp injection into system prompts (`"Today is {date}..."` rebuilds the cache every day — move it to user content)
- User-scoped data in the system prompt — busts the cache per user; move it further down the prompt
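For the dict-ordering footgun, one fix is to canonicalize the schema before it reaches the request (a sketch; the `search_docs` schema here is illustrative):

```python
import json

tool_schema = {
    "name": "search_docs",
    "description": "Search the product documentation.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def canonical(schema: dict) -> str:
    # sort_keys + fixed separators produce identical bytes for identical
    # content, regardless of dict insertion order
    return json.dumps(schema, sort_keys=True, separators=(",", ":"))

# Round-tripping through the canonical form yields a dict whose keys
# are in sorted order, so the serialized prefix stays byte-stable.
stable_schema = json.loads(canonical(tool_schema))
assert canonical(stable_schema) == canonical(tool_schema)
```

Pass `stable_schema` (rather than the hand-built dict) into the request so two code paths that build the same schema in different key orders can't silently bust the cache.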
Instrument cache_creation_input_tokens vs cache_read_input_tokens on every response and alert if the ratio drifts. A week of silent cache misses can cost you thousands.
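A minimal version of that instrumentation (a sketch; the 80% threshold and the `Usage` stand-in for `response.usage` are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Usage:  # stand-in for response.usage from the SDK
    cache_creation_input_tokens: int
    cache_read_input_tokens: int

ALERT_THRESHOLD = 0.80  # illustrative: expect >=80% of cacheable tokens to be reads

def cache_hit_ratio(usage) -> float:
    """Fraction of cacheable input tokens that were cache reads."""
    total = usage.cache_creation_input_tokens + usage.cache_read_input_tokens
    return usage.cache_read_input_tokens / total if total else 0.0

def healthy(usage) -> bool:
    """Wire the False branch into your alerting."""
    return cache_hit_ratio(usage) >= ALERT_THRESHOLD

good = Usage(cache_creation_input_tokens=1_800, cache_read_input_tokens=12_100)
bad = Usage(cache_creation_input_tokens=14_000, cache_read_input_tokens=0)
print(healthy(good), healthy(bad))  # True False
```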
The 1-hour cache tier
Anthropic added a 1-hour TTL option in mid-2025. Writes cost 2× the base input price (versus 1.25× for the 5-minute tier), but the cache lives 12× longer. For workloads with predictable hot paths — e.g. a support assistant where 80% of sessions hit the same product docs — the 1-hour tier amortizes beautifully.
```python
"cache_control": {"type": "ephemeral", "ttl": "1h"}
```
Use it where cache hit rate is high. Don't use it for small cache blocks or unpredictable traffic — you'll pay the write premium without the hit volume.
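The amortization math, in units of the block's uncached input price (a sketch; the traffic shape, 30 reads spread over an hour with the 5-minute cache expiring and re-writing ~6 times, is an illustrative assumption):

```python
def tier_cost(write_multiplier: float, reads: int) -> float:
    """Cost of one cache block: one write plus `reads` cache reads,
    in units of that block's uncached input price."""
    return write_multiplier + 0.10 * reads

# Illustrative hour of traffic: 30 reads total, but spread out enough
# that the 5-minute cache expires and re-writes ~6 times.
five_minute_tier = 6 * tier_cost(1.25, 5)  # 6 x (1.25 + 0.5)
one_hour_tier = tier_cost(2.00, 30)        # 2.0 + 3.0
print(five_minute_tier, one_hour_tier)
```

Under that traffic shape the 1-hour tier costs less than half as much; if all 30 reads landed inside one 5-minute window, the cheaper write multiplier would win instead.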
The takeaway
Prompt caching is the highest-ROI single change I've made to a production Claude app in the last year. If you're running a RAG, agent, or long-context workload on Claude and not using prompt caching, the savings are almost certainly 40-80% sitting on the table.
The cost to implement: two afternoons, including the instrumentation. The cost to ignore: compounding every month you don't do it.