If you're running Claude API workloads and haven't checked your caching bill lately, you're in for a surprise.
Anthropic quietly changed the prompt cache TTL from 60 minutes down to 5 minutes in early 2026. For many production workloads, this single change increased effective API costs by 30–60%.
Here's what changed, who it hits hardest, and how to architect around it.
## What Is Prompt Caching?
Claude's prompt caching lets you cache expensive prefill tokens (system prompts, long documents, tool definitions) and reuse them across requests. Instead of re-sending 50,000 tokens on every call, you send them once, cache them, then pay ~10% of the normal input price for subsequent requests that hit the cache.
The economics look like this (Claude Sonnet 4.6):
- Normal input: $3.00 / 1M tokens
- Cache write: $3.75 / 1M tokens (25% premium for the write)
- Cache read: $0.30 / 1M tokens (90% discount)
With a 60-minute TTL, a system prompt sent once could serve hundreds of requests. The math was extremely favorable.
## The TTL Drop: Before vs. After
**Before (60-minute TTL):**

A background worker processing documents every few minutes would write the cache once, then read it ~20 times before expiry. At 10,000 tokens for the system prompt:

```
1 write  × 10k tokens × $3.75/1M = $0.0375
20 reads × 10k tokens × $0.30/1M = $0.0600
Total for 21 requests            = $0.0975

Without caching: 21 × 10k × $3.00/1M = $0.63
Savings: 84%
```
**After (5-minute TTL):**

The same worker now gets ~2 reads per cache write instead of 20:

```
1 write × 10k tokens × $3.75/1M = $0.0375
2 reads × 10k tokens × $0.30/1M = $0.0060
Total for 3 requests            = $0.0435

Without caching: 3 × 10k × $3.00/1M = $0.09
Savings: 52% (down from 84%)
```
For high-frequency workloads that were optimized for 60-minute caching, effective savings dropped from 80%+ down to 40–55%.
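The drop can be summarized as a function of reads per cache write. Here is a quick sketch using the Sonnet prices above (`cache_savings` is an illustrative helper, not an SDK function):

```python
# Effective savings vs. sending the full prompt uncached, as a function of
# how many cache reads each cache write serves (prices are per 1M tokens).
def cache_savings(reads_per_write: float,
                  input_price: float = 3.00,
                  write_price: float = 3.75,
                  read_price: float = 0.30) -> float:
    requests = 1 + reads_per_write                      # one write plus its reads
    uncached = requests * input_price                   # cost with no caching
    cached = write_price + reads_per_write * read_price # write premium + cheap reads
    return 1 - cached / uncached

print(f"{cache_savings(20):.1%}")  # 84.5% — the old 60-minute regime
print(f"{cache_savings(2):.1%}")   # 51.7% — the new 5-minute regime
```

Note that the token count cancels out entirely: the savings curve depends only on the read-to-write ratio and the price schedule.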
## Who Gets Hit Hardest
- **Batch processing pipelines** — If you process documents in bursts with gaps longer than 5 minutes, your cache expires between runs. Every burst starts cold.
- **Cron-based agents** — Agents running every 15–30 minutes were perfectly tuned for the 60-minute TTL. Now they write cache on nearly every invocation.
- **Chat applications with long sessions** — User sessions that go idle for 10+ minutes lose cache state entirely. The next message re-pays the write premium.
- **Development/testing environments** — Requests are infrequent, so the cache that was previously warm by default now expires between runs.
## Architecture Patterns That Work With 5-Minute TTL

### 1. Keep-Alive Ping Pattern
If you have a high-value cache (large system prompt, big RAG context), send a lightweight "ping" request every 4 minutes to reset the TTL clock:
```python
import anthropic
import threading
import time


class CachedClaudeClient:
    def __init__(self, system_prompt: str):
        self.client = anthropic.Anthropic()
        self.system_prompt = system_prompt
        self._start_keepalive()

    def _start_keepalive(self):
        def ping():
            while True:
                time.sleep(240)  # 4 minutes — reset before 5-min expiry
                self.client.messages.create(
                    model="claude-sonnet-4-6",
                    max_tokens=1,
                    system=[{
                        "type": "text",
                        "text": self.system_prompt,
                        "cache_control": {"type": "ephemeral"},
                    }],
                    messages=[{"role": "user", "content": "ping"}],
                )

        t = threading.Thread(target=ping, daemon=True)
        t.start()

    def chat(self, message: str) -> str:
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": self.system_prompt,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": message}],
        )
        return response.content[0].text
```
**When to use:** long-lived servers (API endpoints, chat backends) where a process is always running.

**When NOT to use:** serverless functions and cron jobs, where there is no persistent process to run the keepalive.
### 2. Request Batching
Instead of processing one item at a time, accumulate work and process in tight bursts:
```python
import asyncio
from collections import deque


class BatchProcessor:
    def __init__(self, max_batch=20, max_wait_ms=2000):
        self.queue = deque()          # pending items, flushed as a burst
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms

    async def call_claude(self, item):
        # Wrap your messages.create call here; every request in the burst
        # sends the same cached system prompt.
        ...

    async def process_batch(self, items: list) -> list:
        # All items share the cache write within this burst
        tasks = [self.call_claude(item) for item in items]
        return await asyncio.gather(*tasks)
```
Result: 20 requests in 30 seconds = 1 cache write + 19 reads. Cache-efficient.
### 3. Reduce Cache Dependency
If cache hit rates are low with the new TTL, sometimes it's cheaper to NOT cache:
```python
# Calculate breakeven: is caching worth it at current traffic?
def should_cache(prompt_tokens: int, expected_requests_per_5min: float) -> bool:
    write_premium = prompt_tokens * (3.75 - 3.00) / 1_000_000
    read_savings = (expected_requests_per_5min - 1) * prompt_tokens * (3.00 - 0.30) / 1_000_000
    return read_savings > write_premium

# Example: 10k-token system prompt
print(should_cache(10_000, 3))    # True: saves ~$0.05 per cycle
print(should_cache(10_000, 1.2))  # False: the write premium outweighs the read savings
```
Caching only pays off when you average more than ~1.28 reads per cache write. The token count cancels out of the inequality, so the breakeven point depends only on the price ratio, not on prompt size.
### 4. Structure Prompts for Maximum Reuse
Place the cacheable prefix as early as possible in the message structure, and make sure it's byte-identical across requests:
```python
from datetime import datetime

# BAD: a timestamp in the cached prefix invalidates the cache on every request
system = f"You are a helpful assistant. Current time: {datetime.now()}. [50k tokens of context]"

# GOOD: static prefix cached, dynamic content in the user message
system = "[50k tokens of static context — cache_control: ephemeral]"
user_message = f"Current time: {datetime.now()}. User query: {query}"
```
Even a single character difference in the cached prefix creates a cache miss.
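One cheap guard is to log a fingerprint of the exact prefix bytes you send and alert when it changes between requests. `prefix_fingerprint` below is a hypothetical helper, not an SDK function:

```python
import hashlib
from datetime import datetime


def prefix_fingerprint(prefix: str) -> str:
    # Hash the exact bytes sent as the cached prefix; if this value changes
    # between requests, the next call pays the write premium again.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


static = "[50k tokens of static context]"
assert prefix_fingerprint(static) == prefix_fingerprint(static)  # stable prefix, stable hash

timestamped = f"Current time: {datetime.now()}. [50k tokens of context]"
# A timestamped prefix yields a new fingerprint on (almost) every call,
# which is exactly the cache-busting pattern to alert on in logs.
```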
## Measuring Your Cache Hit Rate
The API response includes usage stats that tell you exactly what's happening:
```python
response = client.messages.create(...)
usage = response.usage

print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")

# Calculate hit rate
total_cached = usage.cache_creation_input_tokens + usage.cache_read_input_tokens
if total_cached > 0:
    hit_rate = usage.cache_read_input_tokens / total_cached
    print(f"Cache hit rate: {hit_rate:.1%}")
```
Log this across your production requests. If hit rate is below 60% and you're paying the write premium, you may be spending more than if you weren't caching at all.
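A minimal rolling aggregator for those usage fields might look like this (`CacheStats` is a hypothetical helper; the field names match the SDK's `usage` object):

```python
from types import SimpleNamespace


class CacheStats:
    """Rolling cache hit-rate tracker across many responses."""

    def __init__(self):
        self.read_tokens = 0
        self.write_tokens = 0

    def record(self, usage) -> None:
        # Each API response's usage object exposes these two fields.
        self.write_tokens += usage.cache_creation_input_tokens
        self.read_tokens += usage.cache_read_input_tokens

    @property
    def hit_rate(self) -> float:
        total = self.read_tokens + self.write_tokens
        return self.read_tokens / total if total else 0.0


# Stubbed usage objects standing in for real API responses:
stats = CacheStats()
stats.record(SimpleNamespace(cache_creation_input_tokens=10_000, cache_read_input_tokens=0))
stats.record(SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=10_000))
print(f"{stats.hit_rate:.0%}")  # 50%
```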
## The Uncomfortable Math
Here's the scenario where caching actively hurts you:
- System prompt: 20,000 tokens
- Requests per 5-minute window: 1.1 average (low traffic)
- Cache write cost: 20k × $3.75/1M = $0.075
- Cache read cost (0.1 reads on average): 0.1 × 20k × $0.30/1M = $0.0006
- Without caching (1.1 × 20k × $3.00/1M): $0.066
With caching you pay $0.0756. Without caching: $0.066. You're losing money.
This scenario is common in low-traffic production apps, staging environments, and any workload with irregular request patterns.
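The loss scenario above, spelled out in code with the same prices (`cost_per_window` is an illustrative helper):

```python
# Cost per 5-minute window for a given prompt size and average request rate.
def cost_per_window(prompt_tokens: int, requests: float, cached: bool) -> float:
    if not cached:
        return requests * prompt_tokens * 3.00 / 1_000_000
    writes, reads = 1, requests - 1  # one cache write, the rest are reads
    return (writes * prompt_tokens * 3.75 + reads * prompt_tokens * 0.30) / 1_000_000

print(f"${cost_per_window(20_000, 1.1, cached=True):.4f}")   # $0.0756
print(f"${cost_per_window(20_000, 1.1, cached=False):.4f}")  # $0.0660
```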
## Summary
| Workload | 60-min TTL | 5-min TTL | Action |
|---|---|---|---|
| High-freq API (>10 req/5min) | ✅ Great | ✅ Good | Keep caching |
| Medium-freq (2–10 req/5min) | ✅ Great | ⚠️ Marginal | Add batching |
| Low-freq (<2 req/5min) | ✅ Good | ❌ Losing money | Disable caching |
| Cron jobs (15+ min gap) | ✅ Good | ❌ Cold every time | Batch or remove |
| Chat backend (active users) | ✅ Great | ✅ Good | Keep caching |
The 5-minute TTL isn't necessarily bad — it just requires more intentional architecture. Audit your cache hit rates, batch where you can, and don't cache prompts that won't generate enough reads to break even.
Building AI agents that actually stay within budget? The AI SaaS Starter Kit includes production-ready patterns for Claude cost optimization, caching strategy, and rate limit handling — pre-configured for Next.js + TypeScript.