Anthropic's prompt caching looked like a slam dunk when it launched: 90% cost reduction on repeated context. But there's a silent regression most teams missed, a five-minute TTL cliff that makes caching actively hurt you in some architectures, and a flag that disables the feature without telling you.
Here's the full picture.
What Prompt Caching Actually Does
Prompt caching stores a prefix of your prompt in Claude's KV cache. Subsequent requests that share that prefix skip the prefill computation:
import anthropic

client = anthropic.Anthropic()

# First call: cache miss (full prefill cost)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # 10k tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Question 1"}],
)
# Usage: cache_creation_input_tokens=10000, cache_read_input_tokens=0

# Second call within 5 minutes: cache hit
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # must be identical
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Question 2"}],
)
# Usage: cache_creation_input_tokens=0, cache_read_input_tokens=10000
# Cost: 10% of normal input token price
The cache stores the prefix. Every call with the same prefix pays 10% of the normal input token rate for those tokens. Writes cost 25% extra (one-time per cache creation).
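That pricing implies a very low break-even point. A quick back-of-the-envelope check, in plain arithmetic rather than any official calculator:

# Relative cost of a cached vs uncached prefix, in units of "base input price
# per prefix token": 1.25x for the one-time write, 0.10x for each later read.
def cached_vs_uncached(reuse_count: int) -> tuple[float, float]:
    cached = 1.25 + 0.10 * reuse_count   # one write, then cheap reads
    uncached = 1.0 * (1 + reuse_count)   # every call pays full price
    return cached, uncached

print(cached_vs_uncached(0))   # (1.25, 1.0)  -> caching lost you 25%
print(cached_vs_uncached(1))   # (1.35, 2.0)  -> ahead after a single reuse
print(cached_vs_uncached(10))  # (2.25, 11.0) -> roughly 80% savings

Even a single reuse inside the TTL already pays for the write. The failure mode is a prefix that gets cached and then never read again, which is exactly the scenario covered below.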
The Five-Minute Cliff
This is the detail most documentation doesn't emphasize enough: the default cache TTL is 5 minutes, and the clock runs from the last time the cached prefix was read, not from when it was created.
If your architecture makes calls more than 5 minutes apart — async workflows, batch processors, overnight agents, user sessions that pause — your cache will always miss.
# This pattern destroys cache hit rate in async workflows:
async def process_stream(item_stream):
    # assumes client = anthropic.AsyncAnthropic()
    async for item in item_stream:  # items trickle in; gaps can exceed 5 minutes
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[{"type": "text", "text": SHARED_PROMPT,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": item}],
        )
# Any gap longer than the TTL (measured from the last cache read) expires
# the cache. The next call pays the full prefill plus the 25% write
# surcharge all over again.
The fix: keep consecutive calls inside the TTL window, accept the misses and factor them into your cost projections, or keep the cache warm with a cheap periodic read, sketched below.
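A minimal keep-alive sketch, assuming the client and SHARED_PROMPT from above. The entire trick is that a cache read refreshes the TTL, and a read at 10% of the input price is far cheaper than re-creating the cache at 125%:

import threading

# Sketch: refresh the cache TTL with a tiny request that reads the prefix.
# Only worth it when the cached prefix is large; the ping itself still pays
# the 10% read rate on those tokens plus a trivial amount of output.
def keep_cache_warm(stop: threading.Event, interval_s: int = 240):
    while not stop.wait(interval_s):  # wake up comfortably inside the 5-min TTL
        client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1,  # minimal output; we only want the cache read
            system=[{"type": "text", "text": SHARED_PROMPT,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": "ping"}],
        )

Run it in a background thread while the long-lived workflow is idle, and set the event when the workflow finishes.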
The March 2026 Silent Regression
In March 2026, Anthropic changed the default cache TTL from 1 hour to 5 minutes for new API keys. If you set up caching before that date, your 1-hour TTL may have silently become a 5-minute TTL.
The 1-hour TTL still exists but requires explicit opt-in via Anthropic's enterprise tier or specific account configuration.
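If your account does have access, the opt-in is a per-block ttl field on cache_control. A minimal sketch, with the caveat that whether it takes effect (or also needs a beta header) depends on your account configuration; check the current docs:

# Request a 1-hour cache entry instead of the default 5 minutes.
# If your key isn't opted in, the request may be rejected or fall back
# to the 5-minute default.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Question 1"}],
)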
If your cache hit rate dropped suddenly in March and you didn't change anything — this is why.
The Telemetry Flag That Kills Cache
If you disabled telemetry for privacy reasons, there's a known issue: the flag that disables telemetry also kills the 1-hour TTL, dropping you to 5 minutes. The 5-minute TTL remains functional, but if you were counting on 1-hour cache for long workflows, you lost it silently.
# Don't do this if you need 1-hour cache:
client = anthropic.Anthropic()
client.telemetry.enabled = False # also kills 1hr TTL
When Caching Actively Hurts You
Scenario: Write amplification on short or rarely reused contexts
# Minimum cacheable prefix: 1024 tokens (Opus/Sonnet), 2048 tokens (Haiku)
#
# Below the minimum (e.g., an 800-token system prompt), the cache_control
# marker is ignored silently: no cache, no surcharge, no savings.
#
# Just above the minimum, caching can cost you money. Every cache creation
# pays the 25% write surcharge, and if that prefix is rarely reused within
# the TTL, you never earn the surcharge back in reads.
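A small guard like this keeps the marker honest. It is a sketch: system_block and MIN_CACHEABLE are assumptions you would set per model, and the Token Counting API call is used only to check whether the prefix can be cached at all:

MIN_CACHEABLE = 1024  # assumption: raise to 2048 for Haiku models

def system_block(prompt: str) -> list[dict]:
    """Attach cache_control only when the prefix is long enough to cache."""
    # Token Counting API: measures tokens without running a completion
    count = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=prompt,
        messages=[{"role": "user", "content": "x"}],  # dummy message required
    ).input_tokens
    block = {"type": "text", "text": prompt}
    if count >= MIN_CACHEABLE:
        block["cache_control"] = {"type": "ephemeral"}
    return [block]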
Scenario: Dynamic system prompts
# If you template anything into your system prompt, you break the cache:
from datetime import datetime

today = datetime.now().strftime("%Y-%m-%d")
system_prompt = f"Today is {today}. You are..."
# Cache miss every day (different prefix)

# Fix: move dynamic content to the user message:
system_prompt = "You are a helpful assistant."
user_message = f"Today is {today}. {actual_question}"
This is the #1 mistake. Any interpolation in the cached prefix = cache miss.
Measuring Your Cache Performance
def log_cache_stats(usage):
    created = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    regular = usage.input_tokens

    total_tokens = created + read + regular
    hit_rate = read / total_tokens if total_tokens > 0 else 0

    # Cost calculation
    #   Cache write:   125% of input price
    #   Cache read:     10% of input price
    #   Regular input: 100%
    base_price = 0.003 / 1000  # per token (Sonnet example)
    actual_cost = (
        created * base_price * 1.25
        + read * base_price * 0.10
        + regular * base_price
    )
    uncached_cost = total_tokens * base_price
    savings = 1 - (actual_cost / uncached_cost) if uncached_cost > 0 else 0

    print(f"Cache hit rate: {hit_rate:.1%}")
    print(f"Cost savings: {savings:.1%}")
    return hit_rate
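Call it on the usage object of any response you already have, for example the second call from the first example:

log_cache_stats(response2.usage)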
Run this for a week. If hit rate < 60%, your architecture isn't benefiting from caching — and the 25% write surcharge may be costing you money.
The Multi-Turn Pattern That Works
class CachedConversation:
    def __init__(self, system_prompt: str):
        self.system = system_prompt
        self.history = []

    def send(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=[
                {
                    "type": "text",
                    "text": self.system,
                    "cache_control": {"type": "ephemeral"},  # cache the system prompt
                }
            ],
            messages=self.history,  # growing history NOT cached (changes every turn)
        )
        assistant_msg = response.content[0].text
        self.history.append({"role": "assistant", "content": assistant_msg})
        return assistant_msg
Cache the static system prompt first; it never changes, so it always hits. The growing history changes every turn, but it can also be cached incrementally by putting the cache_control marker on the newest message, since each turn's prefix extends the previous one; a sketch of that variant follows.
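A minimal sketch of the incremental variant, modeled on the documented pattern of marking the final message; the class itself is hypothetical, so verify the breakpoint behavior against the current docs before relying on it:

class IncrementalCachedConversation:
    """Cache the whole growing prefix by marking the newest message."""

    def __init__(self, system_prompt: str):
        self.system = system_prompt
        self.history = []

    def send(self, user_message: str) -> str:
        # At most 4 cache breakpoints are allowed per request, so strip the
        # marker from earlier turns before marking the newest one.
        for msg in self.history:
            for block in msg["content"]:
                block.pop("cache_control", None)
        self.history.append({
            "role": "user",
            "content": [{"type": "text", "text": user_message,
                         "cache_control": {"type": "ephemeral"}}],
        })
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=[{"type": "text", "text": self.system,
                     "cache_control": {"type": "ephemeral"}}],
            messages=self.history,
        )
        reply = response.content[0].text
        self.history.append({
            "role": "assistant",
            "content": [{"type": "text", "text": reply}],
        })
        return reply

On each turn, everything up to the previous marker is read back at the 10% rate and only the new turns are written, so the savings grow as the conversation does.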
Batch API + Caching: The Hidden Win
For overnight batch jobs, use the Batch API to sidestep the TTL entirely:
# Requests in a batch can share the cached prefix
# Batch completes asynchronously; the 5-minute TTL is no longer your scheduling problem
requests = [
    {
        "custom_id": f"item-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "system": [{"type": "text", "text": SHARED_PROMPT,
                        "cache_control": {"type": "ephemeral"}}],
            "messages": [{"role": "user", "content": item}],
        },
    }
    for i, item in enumerate(items)
]
Requests in the batch that share the prefix hit the cache, so roughly one request pays the write cost and the rest pay the 10% read rate. Batch requests can be processed concurrently, though, so cache hits within a batch are best-effort rather than guaranteed; measure rather than assume.
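Submitting the batch and collecting results looks like this. The batches.create/retrieve/results calls are the SDK's Message Batches methods; the warm-up request is an assumption worth A/B testing, not a guarantee:

import time

# Optional warm-up: a single synchronous call writes the cache before the
# batch starts, improving the odds that batch requests read instead of write.
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1,
    system=[{"type": "text", "text": SHARED_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "warm up"}],
)

batch = client.messages.batches.create(requests=requests)

# Poll until processing ends, then stream per-request results
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)
for entry in client.messages.batches.results(batch.id):
    print(entry.custom_id, entry.result.type)  # "succeeded", "errored", ...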
Caching Built Into Your Agent Architecture
The AI SaaS Starter Kit ($99) ships with prompt caching pre-configured: static system prompts cached, dynamic content routed to user messages, and cache stats logged per request. The Starter Kit has saved customers $40-200/month at production scale.
Need caching across automated workflows? The Workflow Automator MCP ($15/mo) handles cache-aware multi-step agent pipelines so you're never paying full price for repeated context.
Prompt caching is one of the highest-leverage optimizations in the Claude API — but only if your architecture respects the TTL window and keeps your system prompt static. Get those two things right and 80%+ savings is real.
What's your current cache hit rate?