If you're running Claude API workloads and haven't checked your caching bill lately, you're in for a surprise.
Anthropic quietly changed the prompt cache TTL from 60 minutes down to 5 minutes in early 2026. For many production workloads, this single change increased effective API costs by 30–60%.
Here's what changed, who it hits hardest, and how to architect around it.
## What Is Prompt Caching?
Claude's prompt caching lets you cache expensive prefill tokens (system prompts, long documents, tool definitions) and reuse them across requests. Instead of re-sending 50,000 tokens on every call, you send them once, cache them, then pay ~10% of the normal input price for subsequent requests that hit the cache.
The economics look like this (Claude Sonnet 4.6):
- Normal input: $3.00 / 1M tokens
- Cache write: $3.75 / 1M tokens (25% premium for the write)
- Cache read: $0.30 / 1M tokens (90% discount)
With a 60-minute TTL, a system prompt sent once could serve hundreds of requests. The math was extremely favorable.
## The TTL Drop: Before vs. After
**Before (60-minute TTL):**

A background worker processing documents every few minutes would write the cache once, then read it ~20 times before expiry. At 10,000 tokens for the system prompt:

```
1 write  × 10k tokens × $3.75/1M = $0.0375
20 reads × 10k tokens × $0.30/1M = $0.0600
Total for 21 requests            = $0.0975

Without caching: 21 × 10k × $3.00/1M = $0.63
Savings: 84%
```
**After (5-minute TTL):**

The same worker now gets ~2 reads per cache write instead of 20:

```
1 write × 10k tokens × $3.75/1M = $0.0375
2 reads × 10k tokens × $0.30/1M = $0.0060
Total for 3 requests            = $0.0435

Without caching: 3 × 10k × $3.00/1M = $0.09
Savings: 52% (down from 84%)
```
For high-frequency workloads that were optimized for 60-minute caching, effective savings dropped from 80%+ down to 40–55%.
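The drop can be summarized as a function of reads per cache write. Here is a quick sketch using the Sonnet prices above (`cache_savings` is an illustrative helper, not an SDK function):

```python
# Effective savings vs. sending the full prompt uncached, as a function of
# how many cache reads each cache write serves (prices are per 1M tokens).
def cache_savings(reads_per_write: float,
                  input_price: float = 3.00,
                  write_price: float = 3.75,
                  read_price: float = 0.30) -> float:
    requests = 1 + reads_per_write                      # one write plus its reads
    uncached = requests * input_price                   # cost with no caching
    cached = write_price + reads_per_write * read_price # write premium + cheap reads
    return 1 - cached / uncached

print(f"{cache_savings(20):.1%}")  # 84.5% — the old 60-minute regime
print(f"{cache_savings(2):.1%}")   # 51.7% — the new 5-minute regime
```

Note that the token count cancels out entirely: the savings curve depends only on the read-to-write ratio and the price schedule.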
## Who Gets Hit Hardest
- **Batch processing pipelines** — If you process documents in bursts with gaps longer than 5 minutes, your cache expires between runs. Every burst starts cold.
- **Cron-based agents** — Agents running every 15–30 minutes were perfectly tuned for the 60-minute TTL. Now they write cache on nearly every invocation.
- **Chat applications with long sessions** — User sessions that go idle for 10+ minutes lose cache state entirely. The next message re-pays the write premium.
- **Development/testing environments** — Requests are infrequent, so the cache that was previously warm by default now expires between runs.
## Architecture Patterns That Work With 5-Minute TTL

### 1. Keep-Alive Ping Pattern
If you have a high-value cache (large system prompt, big RAG context), send a lightweight "ping" request every 4 minutes to reset the TTL clock:
```python
import anthropic
import threading
import time


class CachedClaudeClient:
    def __init__(self, system_prompt: str):
        self.client = anthropic.Anthropic()
        self.system_prompt = system_prompt
        self._start_keepalive()

    def _start_keepalive(self):
        def ping():
            while True:
                time.sleep(240)  # 4 minutes — reset before 5-min expiry
                self.client.messages.create(
                    model="claude-sonnet-4-6",
                    max_tokens=1,
                    system=[{
                        "type": "text",
                        "text": self.system_prompt,
                        "cache_control": {"type": "ephemeral"},
                    }],
                    messages=[{"role": "user", "content": "ping"}],
                )

        t = threading.Thread(target=ping, daemon=True)
        t.start()

    def chat(self, message: str) -> str:
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": self.system_prompt,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": message}],
        )
        return response.content[0].text
```
**When to use:** long-lived servers (API endpoints, chat backends) where a process is always running.

**When NOT to use:** serverless functions and cron jobs, where there is no persistent process to run the keepalive.
### 2. Request Batching
Instead of processing one item at a time, accumulate work and process in tight bursts:
```python
import asyncio
from collections import deque


class BatchProcessor:
    def __init__(self, max_batch=20, max_wait_ms=2000):
        self.queue = deque()          # pending items, flushed as a burst
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms

    async def call_claude(self, item):
        # Wrap your messages.create call here; every request in the burst
        # sends the same cached system prompt.
        ...

    async def process_batch(self, items: list) -> list:
        # All items share the cache write within this burst
        tasks = [self.call_claude(item) for item in items]
        return await asyncio.gather(*tasks)
```
Result: 20 requests in 30 seconds = 1 cache write + 19 reads. Cache-efficient.
### 3. Reduce Cache Dependency
If cache hit rates are low with the new TTL, sometimes it's cheaper to NOT cache:
```python
# Calculate breakeven: is caching worth it at current traffic?
def should_cache(prompt_tokens: int, expected_requests_per_5min: float) -> bool:
    write_premium = prompt_tokens * (3.75 - 3.00) / 1_000_000
    read_savings = (expected_requests_per_5min - 1) * prompt_tokens * (3.00 - 0.30) / 1_000_000
    return read_savings > write_premium

# Example: 10k-token system prompt
print(should_cache(10_000, 3))    # True: saves ~$0.05 per cycle
print(should_cache(10_000, 1.2))  # False: the write premium outweighs the read savings
```
Caching only pays off when you average more than ~1.28 reads per cache write. The token count cancels out of the inequality, so the breakeven point depends only on the price ratio, not on prompt size.
### 4. Structure Prompts for Maximum Reuse
Place the cacheable prefix as early as possible in the message structure, and make sure it's byte-identical across requests:
```python
from datetime import datetime

# BAD: a timestamp in the cached prefix invalidates the cache on every request
system = f"You are a helpful assistant. Current time: {datetime.now()}. [50k tokens of context]"

# GOOD: static prefix cached, dynamic content in the user message
system = "[50k tokens of static context — cache_control: ephemeral]"
user_message = f"Current time: {datetime.now()}. User query: {query}"
```
Even a single character difference in the cached prefix creates a cache miss.
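One cheap guard is to log a fingerprint of the exact prefix bytes you send and alert when it changes between requests. `prefix_fingerprint` below is a hypothetical helper, not an SDK function:

```python
import hashlib
from datetime import datetime


def prefix_fingerprint(prefix: str) -> str:
    # Hash the exact bytes sent as the cached prefix; if this value changes
    # between requests, the next call pays the write premium again.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


static = "[50k tokens of static context]"
assert prefix_fingerprint(static) == prefix_fingerprint(static)  # stable prefix, stable hash

timestamped = f"Current time: {datetime.now()}. [50k tokens of context]"
# A timestamped prefix yields a new fingerprint on (almost) every call,
# which is exactly the cache-busting pattern to alert on in logs.
```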
## Measuring Your Cache Hit Rate
The API response includes usage stats that tell you exactly what's happening:
```python
response = client.messages.create(...)
usage = response.usage

print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")

# Calculate hit rate
total_cached = usage.cache_creation_input_tokens + usage.cache_read_input_tokens
if total_cached > 0:
    hit_rate = usage.cache_read_input_tokens / total_cached
    print(f"Cache hit rate: {hit_rate:.1%}")
```
Log this across your production requests. If hit rate is below 60% and you're paying the write premium, you may be spending more than if you weren't caching at all.
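A minimal rolling aggregator for those usage fields might look like this (`CacheStats` is a hypothetical helper; the field names match the SDK's `usage` object):

```python
from types import SimpleNamespace


class CacheStats:
    """Rolling cache hit-rate tracker across many responses."""

    def __init__(self):
        self.read_tokens = 0
        self.write_tokens = 0

    def record(self, usage) -> None:
        # Each API response's usage object exposes these two fields.
        self.write_tokens += usage.cache_creation_input_tokens
        self.read_tokens += usage.cache_read_input_tokens

    @property
    def hit_rate(self) -> float:
        total = self.read_tokens + self.write_tokens
        return self.read_tokens / total if total else 0.0


# Stubbed usage objects standing in for real API responses:
stats = CacheStats()
stats.record(SimpleNamespace(cache_creation_input_tokens=10_000, cache_read_input_tokens=0))
stats.record(SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=10_000))
print(f"{stats.hit_rate:.0%}")  # 50%
```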
## The Uncomfortable Math
Here's the scenario where caching actively hurts you:
- System prompt: 20,000 tokens
- Requests per 5-minute window: 1.1 average (low traffic)
- Cache write cost: 20k × $3.75/1M = $0.075
- Cache read cost (0.1 reads on average): 0.1 × 20k × $0.30/1M = $0.0006
- Without caching (1.1 × 20k × $3.00/1M): $0.066
With caching you pay $0.0756. Without caching: $0.066. You're losing money.
This scenario is common in low-traffic production apps, staging environments, and any workload with irregular request patterns.
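The loss scenario above, spelled out in code with the same prices (`cost_per_window` is an illustrative helper):

```python
# Cost per 5-minute window for a given prompt size and average request rate.
def cost_per_window(prompt_tokens: int, requests: float, cached: bool) -> float:
    if not cached:
        return requests * prompt_tokens * 3.00 / 1_000_000
    writes, reads = 1, requests - 1  # one cache write, the rest are reads
    return (writes * prompt_tokens * 3.75 + reads * prompt_tokens * 0.30) / 1_000_000

print(f"${cost_per_window(20_000, 1.1, cached=True):.4f}")   # $0.0756
print(f"${cost_per_window(20_000, 1.1, cached=False):.4f}")  # $0.0660
```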
## Summary
| Workload | 60-min TTL | 5-min TTL | Action |
|---|---|---|---|
| High-freq API (>10 req/5min) | ✅ Great | ✅ Good | Keep caching |
| Medium-freq (2–10 req/5min) | ✅ Great | ⚠️ Marginal | Add batching |
| Low-freq (<2 req/5min) | ✅ Good | ❌ Losing money | Disable caching |
| Cron jobs (15+ min gap) | ✅ Good | ❌ Cold every time | Batch or remove |
| Chat backend (active users) | ✅ Great | ✅ Good | Keep caching |
The 5-minute TTL isn't necessarily bad — it just requires more intentional architecture. Audit your cache hit rates, batch where you can, and don't cache prompts that won't generate enough reads to break even.
Building AI agents that actually stay within budget? The AI SaaS Starter Kit includes production-ready patterns for Claude cost optimization, caching strategy, and rate limit handling — pre-configured for Next.js + TypeScript.