This is a follow-up to my earlier article on Claude prompt caching, which has become one of my most-read pieces. Since publishing it, I've been tracking a regression that Anthropic has not publicly announced.
On March 6, 2026, Anthropic changed the default prompt cache TTL from 1 hour to 5 minutes.
Source: whoff-agents on GitHub
If you're running Claude API calls with prompt caching and you haven't touched your code recently, your cache hit rate is likely near zero right now.
## What Changed
Previously, when you sent a request with `cache_control: {"type": "ephemeral"}`, the cached prompt persisted for 1 hour by default.
After March 6, the default TTL is 5 minutes.
Unless your application is sending the same exact prompt within a 5-minute window, you're getting cache misses. You're paying full input token prices on every request.
There's a second issue: disabling telemetry also kills the 1-hour TTL. If you've set `anthropic-beta: prompt-caching-2024-07-31` and opted out of Anthropic's telemetry, you're silently dropped to the 5-minute TTL regardless of your other settings.
## How to Check If You're Affected
Look at your usage dashboard at console.anthropic.com. Check the ratio of `cache_read_input_tokens` to `input_tokens` over the last 30 days.
If that ratio dropped sharply around March 6 and you haven't changed your prompts, you're affected.
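If you log the `usage` objects from your responses, you can also compute the ratio yourself. Here's a minimal sketch; the field names match what the API returns in `usage`, but the helper function and the log format are my own, not part of the SDK:

```python
def cache_hit_ratio(usage_records):
    """Fraction of billable input tokens that were served from cache.

    Each record is a dict with the usage fields the API returns:
    input_tokens, cache_read_input_tokens, cache_creation_input_tokens.
    """
    read = sum(r.get("cache_read_input_tokens", 0) for r in usage_records)
    total = sum(
        r.get("input_tokens", 0)
        + r.get("cache_read_input_tokens", 0)
        + r.get("cache_creation_input_tokens", 0)
        for r in usage_records
    )
    return read / total if total else 0.0

# Healthy caching: one write, then a 10K-token prompt is read from cache
healthy = [
    {"input_tokens": 200, "cache_creation_input_tokens": 10_000},
    {"input_tokens": 200, "cache_read_input_tokens": 10_000},
    {"input_tokens": 200, "cache_read_input_tokens": 10_000},
]
print(f"{cache_hit_ratio(healthy):.2f}")  # 0.65

# Broken caching: every call re-sends the full prompt at full price
broken = [{"input_tokens": 10_200} for _ in range(3)]
print(f"{cache_hit_ratio(broken):.2f}")  # 0.00
```

A healthy ratio depends on your traffic shape, but a sudden drop toward zero with unchanged prompts is the signature of this regression.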
You can also instrument it directly in your SDK calls:
```python
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=messages,
)

# Check cache usage in the response
print(response.usage.cache_read_input_tokens)      # should be > 0 on repeat calls
print(response.usage.cache_creation_input_tokens)  # tokens written to cache
```
If `cache_read_input_tokens` is 0 on the second and third calls with the same prompt, your cache isn't working.
## The Fix
Explicitly declare your cache TTL in the cache_control block.
The current Anthropic SDK supports two TTL options:
```python
# 5-minute TTL (the new default, applied silently)
"cache_control": {"type": "ephemeral"}

# 1-hour TTL (the old behavior, which must now be explicit)
"cache_control": {"type": "ephemeral", "ttl": "1h"}
```

Note that the TTL value is the string `"1h"`, not a number of seconds; the SDK rejects numeric values.
Update every `cache_control` block in your codebase to include `"ttl": "1h"` if you need the 1-hour window.
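If those blocks are scattered across your codebase, one option is to normalize them at the single point where you assemble requests. A sketch of that approach; the helper name is mine, not the SDK's:

```python
def force_one_hour_ttl(blocks):
    """Return a copy of content blocks with every ephemeral
    cache_control pinned to an explicit 1-hour TTL."""
    patched = []
    for block in blocks:
        block = dict(block)  # shallow copy; don't mutate the caller's data
        cc = block.get("cache_control")
        if cc and cc.get("type") == "ephemeral":
            block["cache_control"] = {"type": "ephemeral", "ttl": "1h"}
        patched.append(block)
    return patched

system = [
    {
        "type": "text",
        "text": "You are a helpful agent.",
        "cache_control": {"type": "ephemeral"},
    },
]
print(force_one_hour_ttl(system)[0]["cache_control"])
# {'type': 'ephemeral', 'ttl': '1h'}
```

Routing every request through one helper like this also means the next silent default change only needs a one-line fix.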
## Cost Impact
Prompt caching reduces input token costs by ~90% on cache hits (you pay the cache-read price instead of the full input price). Here's the math for a typical RAG pipeline or agent system prompt whose calls arrive more than 5 minutes apart, so the new default always misses:

| Scenario | Before (1hr TTL) | After (5min TTL, unpatched) |
|---|---|---|
| 10K-token system prompt, 12 calls/hour (5+ min apart) | Full price once, cache-read price thereafter | Full price on every call |
| Monthly cost at $15/M input tokens | ~$130 | ~$1,300 |

That's roughly a 10x cost regression hiding in a silent default change.
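You can reproduce this with back-of-the-envelope arithmetic. The multipliers below are assumptions for illustration (cache reads at ~10% of base input price, 1-hour cache writes at 2x base); check current pricing before relying on them:

```python
BASE = 15.00         # $/M input tokens (illustrative)
READ = 0.10 * BASE   # cache reads billed at ~10% of base (assumed)
WRITE = 2.00 * BASE  # 1-hour cache writes billed at 2x base (assumed)

PROMPT = 10_000      # tokens in the cached system prompt
CALLS_PER_HOUR = 12  # one call every 5+ minutes: every 5-min TTL misses
HOURS = 720          # a 30-day month

total_tokens = PROMPT * CALLS_PER_HOUR * HOURS  # 86.4M tokens/month

# Unpatched: every call misses the 5-minute window and pays full price
unpatched = total_tokens / 1e6 * BASE

# Patched (1h TTL): each access refreshes the cache, so roughly one
# write up front and cache-price reads thereafter
patched = PROMPT / 1e6 * WRITE + (total_tokens - PROMPT) / 1e6 * READ

print(f"unpatched:  ${unpatched:,.0f}/mo")        # $1,296/mo
print(f"patched:    ${patched:,.0f}/mo")          # $130/mo
print(f"regression: {unpatched / patched:.0f}x")  # 10x
```

Scale the prompt size or call rate up and the absolute dollar gap grows, but the ratio stays close to the ~10x implied by the ~90% cache-hit discount.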
## Why This Matters for Agent Systems
If you're running multi-agent pipelines — the kind where a coordinator sends the same system context to multiple sub-agents — the 5-minute TTL is especially punishing. Agent calls are often spread across minutes or hours, not seconds.
The 1-hour TTL was specifically designed for these workloads. The regression makes agent architectures materially more expensive without any code changes on your end.
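To make that concrete, here's a sketch of a coordinator reusing one cached system block across fanned-out sub-agent requests. The request-building helper is hypothetical (client wiring and the actual `client.messages.create(**kwargs)` calls are omitted):

```python
# One shared context block, cached once with an explicit 1-hour TTL.
# Every sub-agent request starts with this same prefix, so requests made
# within the hour read it at cache price instead of full price.
SHARED_CONTEXT = {
    "type": "text",
    "text": "<large shared project context>",
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}

def sub_agent_request(role, task):
    """Build kwargs for one sub-agent call (hypothetical helper)."""
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 1024,
        "system": [
            SHARED_CONTEXT,  # cached prefix, identical for every agent
            {"type": "text", "text": f"You are the {role} agent."},
        ],
        "messages": [{"role": "user", "content": task}],
    }

requests = [
    sub_agent_request("research", "Summarize the design doc."),
    sub_agent_request("review", "Critique the proposed API."),
]
```

Because the cache key is the prompt prefix up to the `cache_control` breakpoint, the per-agent instructions go *after* the shared block; putting them before it would give each agent a distinct prefix and defeat the cache.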
## What Anthropic Should Do
Announce breaking changes to caching behavior in the changelog. Changing a default TTL is a silent cost increase. Developers building on the Claude API have budgets and pricing models built on documented behavior. This one slipped through without notice.
If you're building with the Claude API and want to get prompt caching right, I covered the full setup in my earlier article. The patch above (`"ttl": "1h"`) is the immediate fix. Apply it everywhere you have `cache_control: {"type": "ephemeral"}`.