Anthropic's prompt caching looked like a slam dunk when it launched: 90% cost reduction on repeated context. But there's a silent regression most teams missed, a five-minute TTL cliff that makes caching actively hurt you in some architectures, and a flag that disables the feature without telling you.
Here's the full picture.
What Prompt Caching Actually Does
Prompt caching stores a prefix of your prompt in Claude's KV cache. Subsequent requests that share that prefix skip the prefill computation:
import anthropic

client = anthropic.Anthropic()

# First call: cache miss (full prefill cost)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # 10k tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Question 1"}],
)
# Usage: cache_creation_input_tokens=10000, cache_read_input_tokens=0

# Second call within 5 minutes: cache hit
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # must be identical
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Question 2"}],
)
# Usage: cache_creation_input_tokens=0, cache_read_input_tokens=10000
# Cost: 10% of normal input token price
The cache stores the prefix. Every call with the same prefix pays 10% of the normal input token rate for those tokens. Writes cost 25% extra (one-time per cache creation).
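That pricing implies a very low break-even point. A quick back-of-the-envelope check, in plain arithmetic rather than any official calculator:

# Relative cost of a cached vs uncached prefix, in units of "base input price
# per prefix token": 1.25x for the one-time write, 0.10x for each later read.
def cached_vs_uncached(reuse_count: int) -> tuple[float, float]:
    cached = 1.25 + 0.10 * reuse_count   # one write, then cheap reads
    uncached = 1.0 * (1 + reuse_count)   # every call pays full price
    return cached, uncached

print(cached_vs_uncached(0))   # (1.25, 1.0)  -> caching lost you 25%
print(cached_vs_uncached(1))   # (1.35, 2.0)  -> ahead after a single reuse
print(cached_vs_uncached(10))  # (2.25, 11.0) -> roughly 80% savings

Even a single reuse inside the TTL already pays for the write. The failure mode is a prefix that gets cached and then never read again, which is exactly the scenario covered below.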
The Five-Minute Cliff
This is the detail most documentation doesn't emphasize enough: the default cache TTL is 5 minutes, and the clock runs from the last time the cached prefix was read, not from when it was created.
If your architecture makes calls more than 5 minutes apart — async workflows, batch processors, overnight agents, user sessions that pause — your cache will always miss.
# This pattern destroys cache hit rate in async workflows:
async def process_stream(item_stream):
    # assumes client = anthropic.AsyncAnthropic()
    async for item in item_stream:  # items trickle in; gaps can exceed 5 minutes
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[{"type": "text", "text": SHARED_PROMPT,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": item}],
        )
# Any gap longer than the TTL (measured from the last cache read) expires
# the cache. The next call pays the full prefill plus the 25% write
# surcharge all over again.
The fix: keep consecutive calls inside the TTL window, accept the misses and factor them into your cost projections, or keep the cache warm with a cheap periodic read, sketched below.
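A minimal keep-alive sketch, assuming the client and SHARED_PROMPT from above. The entire trick is that a cache read refreshes the TTL, and a read at 10% of the input price is far cheaper than re-creating the cache at 125%:

import threading

# Sketch: refresh the cache TTL with a tiny request that reads the prefix.
# Only worth it when the cached prefix is large; the ping itself still pays
# the 10% read rate on those tokens plus a trivial amount of output.
def keep_cache_warm(stop: threading.Event, interval_s: int = 240):
    while not stop.wait(interval_s):  # wake up comfortably inside the 5-min TTL
        client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1,  # minimal output; we only want the cache read
            system=[{"type": "text", "text": SHARED_PROMPT,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": "ping"}],
        )

Run it in a background thread while the long-lived workflow is idle, and set the event when the workflow finishes.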
The March 2026 Silent Regression
In March 2026, Anthropic changed the default cache TTL from 1 hour to 5 minutes for new API keys. If you set up caching before that date, your 1-hour TTL may have silently become a 5-minute TTL.
The 1-hour TTL still exists but requires explicit opt-in via Anthropic's enterprise tier or specific account configuration.
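If your account does have access, the opt-in is a per-block ttl field on cache_control. A minimal sketch, with the caveat that whether it takes effect (or also needs a beta header) depends on your account configuration; check the current docs:

# Request a 1-hour cache entry instead of the default 5 minutes.
# If your key isn't opted in, the request may be rejected or fall back
# to the 5-minute default.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Question 1"}],
)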
If your cache hit rate dropped suddenly in March and you didn't change anything — this is why.
The Telemetry Flag That Kills Cache
If you disabled telemetry for privacy reasons, there's a known issue: the flag that disables telemetry also kills the 1-hour TTL, dropping you to 5 minutes. The 5-minute TTL remains functional, but if you were counting on 1-hour cache for long workflows, you lost it silently.
# Don't do this if you need 1-hour cache:
client = anthropic.Anthropic()
client.telemetry.enabled = False # also kills 1hr TTL
When Caching Actively Hurts You
Scenario: Write amplification on short or rarely reused contexts
# Minimum cacheable prefix: 1024 tokens (Opus/Sonnet), 2048 tokens (Haiku)
#
# Below the minimum (e.g., an 800-token system prompt), the cache_control
# marker is ignored silently: no cache, no surcharge, no savings.
#
# Just above the minimum, caching can cost you money. Every cache creation
# pays the 25% write surcharge, and if that prefix is rarely reused within
# the TTL, you never earn the surcharge back in reads.
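A small guard like this keeps the marker honest. It is a sketch: system_block and MIN_CACHEABLE are assumptions you would set per model, and the Token Counting API call is used only to check whether the prefix can be cached at all:

MIN_CACHEABLE = 1024  # assumption: raise to 2048 for Haiku models

def system_block(prompt: str) -> list[dict]:
    """Attach cache_control only when the prefix is long enough to cache."""
    # Token Counting API: measures tokens without running a completion
    count = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        system=prompt,
        messages=[{"role": "user", "content": "x"}],  # dummy message required
    ).input_tokens
    block = {"type": "text", "text": prompt}
    if count >= MIN_CACHEABLE:
        block["cache_control"] = {"type": "ephemeral"}
    return [block]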
Scenario: Dynamic system prompts
# If you template anything into your system prompt, you break the cache:
from datetime import datetime

today = datetime.now().strftime("%Y-%m-%d")
system_prompt = f"Today is {today}. You are..."
# Cache miss every day (different prefix)

# Fix: move dynamic content to the user message:
system_prompt = "You are a helpful assistant."
user_message = f"Today is {today}. {actual_question}"
This is the #1 mistake. Any interpolation in the cached prefix = cache miss.
Measuring Your Cache Performance
def log_cache_stats(usage):
    created = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    regular = usage.input_tokens

    total_tokens = created + read + regular
    hit_rate = read / total_tokens if total_tokens > 0 else 0

    # Cost calculation
    #   Cache write:   125% of input price
    #   Cache read:     10% of input price
    #   Regular input: 100%
    base_price = 0.003 / 1000  # per token (Sonnet example)
    actual_cost = (
        created * base_price * 1.25
        + read * base_price * 0.10
        + regular * base_price
    )
    uncached_cost = total_tokens * base_price
    savings = 1 - (actual_cost / uncached_cost) if uncached_cost > 0 else 0

    print(f"Cache hit rate: {hit_rate:.1%}")
    print(f"Cost savings: {savings:.1%}")
    return hit_rate
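Call it on the usage object of any response you already have, for example the second call from the first example:

log_cache_stats(response2.usage)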
Run this for a week. If hit rate < 60%, your architecture isn't benefiting from caching — and the 25% write surcharge may be costing you money.
The Multi-Turn Pattern That Works
class CachedConversation:
    def __init__(self, system_prompt: str):
        self.system = system_prompt
        self.history = []

    def send(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=[
                {
                    "type": "text",
                    "text": self.system,
                    "cache_control": {"type": "ephemeral"},  # cache the system prompt
                }
            ],
            messages=self.history,  # growing history NOT cached (changes every turn)
        )
        assistant_msg = response.content[0].text
        self.history.append({"role": "assistant", "content": assistant_msg})
        return assistant_msg
Cache the static system prompt first; it never changes, so it always hits. The growing history changes every turn, but it can also be cached incrementally by putting the cache_control marker on the newest message, since each turn's prefix extends the previous one; a sketch of that variant follows.
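A minimal sketch of the incremental variant, modeled on the documented pattern of marking the final message; the class itself is hypothetical, so verify the breakpoint behavior against the current docs before relying on it:

class IncrementalCachedConversation:
    """Cache the whole growing prefix by marking the newest message."""

    def __init__(self, system_prompt: str):
        self.system = system_prompt
        self.history = []

    def send(self, user_message: str) -> str:
        # At most 4 cache breakpoints are allowed per request, so strip the
        # marker from earlier turns before marking the newest one.
        for msg in self.history:
            for block in msg["content"]:
                block.pop("cache_control", None)
        self.history.append({
            "role": "user",
            "content": [{"type": "text", "text": user_message,
                         "cache_control": {"type": "ephemeral"}}],
        })
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=[{"type": "text", "text": self.system,
                     "cache_control": {"type": "ephemeral"}}],
            messages=self.history,
        )
        reply = response.content[0].text
        self.history.append({
            "role": "assistant",
            "content": [{"type": "text", "text": reply}],
        })
        return reply

On each turn, everything up to the previous marker is read back at the 10% rate and only the new turns are written, so the savings grow as the conversation does.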
Batch API + Caching: The Hidden Win
For overnight batch jobs, use the Batch API to sidestep the TTL entirely:
# Requests in a batch can share the cached prefix
# Batch completes asynchronously; the 5-minute TTL is no longer your scheduling problem
requests = [
    {
        "custom_id": f"item-{i}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "system": [{"type": "text", "text": SHARED_PROMPT,
                        "cache_control": {"type": "ephemeral"}}],
            "messages": [{"role": "user", "content": item}],
        },
    }
    for i, item in enumerate(items)
]
Requests in the batch that share the prefix hit the cache, so roughly one request pays the write cost and the rest pay the 10% read rate. Batch requests can be processed concurrently, though, so cache hits within a batch are best-effort rather than guaranteed; measure rather than assume.
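Submitting the batch and collecting results looks like this. The batches.create/retrieve/results calls are the SDK's Message Batches methods; the warm-up request is an assumption worth A/B testing, not a guarantee:

import time

# Optional warm-up: a single synchronous call writes the cache before the
# batch starts, improving the odds that batch requests read instead of write.
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1,
    system=[{"type": "text", "text": SHARED_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "warm up"}],
)

batch = client.messages.batches.create(requests=requests)

# Poll until processing ends, then stream per-request results
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)
for entry in client.messages.batches.results(batch.id):
    print(entry.custom_id, entry.result.type)  # "succeeded", "errored", ...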
Caching Built Into Your Agent Architecture
The AI SaaS Starter Kit ($99) ships with prompt caching pre-configured: static system prompts cached, dynamic content routed to user messages, and cache stats logged per request. The Starter Kit has saved customers $40-200/month at production scale.
Need caching across automated workflows? The Workflow Automator MCP ($15/mo) handles cache-aware multi-step agent pipelines so you're never paying full price for repeated context.
Prompt caching is one of the highest-leverage optimizations in the Claude API — but only if your architecture respects the TTL window and keeps your system prompt static. Get those two things right and 80%+ savings is real.
What's your current cache hit rate?