
Jangwook Kim

Posted on • Originally published at effloow.com

LLM Prompt Caching in Production: Cut API Costs 78% With Claude

Prompt caching is the single highest-leverage cost optimization available for Claude API workloads in 2026 — yet most teams either skip it or implement it wrong. When it works, cache read tokens cost 10% of standard input tokens. When it fails, you pay a 25% write premium instead of a discount.

Effloow Lab ran a cost model using the anthropic Python SDK (0.97.0), verified the CacheControlEphemeralParam type, confirmed current pricing multipliers, and documented the March 2026 TTL change that caught many teams off guard. The numbers below are computed, not sourced from marketing copy.

How Prompt Caching Works

When you mark a portion of your prompt with cache_control, Claude stores a server-side snapshot of the key-value (KV) attention states up to that breakpoint. On subsequent requests with the same prefix, the model reads from the cached snapshot instead of reprocessing those input tokens.

The economics:

  • Cache write: 1.25× the normal input token price (you pay a premium to write)
  • Cache read: 0.10× the normal input token price (90% discount on reads)
  • Net benefit: positive once your cache hit rate is above ~22% (see the break-even math below)
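
The break-even hit rate follows from the multipliers alone. With hit rate h, the expected per-request cost of the cached prefix relative to paying full input price is:

  (1 − h) × 1.25 + h × 0.10 < 1.00
  1.25 − 1.15h < 1.00
  h > 0.25 / 1.15 ≈ 21.7%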

To add a cache breakpoint in the Python SDK:

from anthropic.types import CacheControlEphemeralParam

# SDK 0.97.0 — CacheControlEphemeralParam is importable; the inline dict
# literal used below is its untyped equivalent
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": large_system_context,           # must be 1024+ tokens
                "cache_control": {"type": "ephemeral"}  # mark as cacheable
            },
            {
                "type": "text",
                "text": user_question                    # not cached — changes per request
            }
        ]
    }
]
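
To send the request, pass the list to messages.create as usual — the prefix is written to the cache on the first call and read back on later calls with an identical prefix. A minimal sketch; the model ID is an assumption, substitute your own:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumption: substitute your model ID
    max_tokens=1024,
    messages=messages,          # the list defined above, breakpoint included
)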

The 2026 TTL Change You Probably Missed

On March 6, 2026, Anthropic silently changed the default prompt cache TTL from 1 hour to 5 minutes (documented in GitHub issue anthropics/claude-code#46829 and confirmed in official docs).

This single change increased effective API costs by 30–60% for teams running moderate-traffic apps because:

  • If two requests with the same cached prefix arrive more than 5 minutes apart, the second request pays the cache write premium (1.25×), not the cache read discount (0.10×)
  • A 1-hour TTL with 10 requests/hour meant a near-perfect hit rate; a 5-minute TTL with the same traffic means most requests arrive after the cache has already expired

Current TTL options:

Cache type             Write price   Read price   Use when
5-min TTL (default)    1.25× input   0.10× input  >1 request per 5 min with the same prefix
1-hour TTL (explicit)  2.00× input   0.10× input  Low-traffic, long sessions

To use the 1-hour TTL:

"cache_control": {"type": "ephemeral", "ttl": "1h"}

For low-traffic apps (say, 5 requests per hour), the 1-hour TTL at 2× write price is still cheaper than paying full input price on every request if your cached prefix is large.
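
Worked out for that 5-requests-per-hour case with a 2,000-token prefix at $3.00/M input:

Per hour:
  no cache:    5 × $0.006          = $0.0300
  1-hour TTL:  1 × $0.006 × 2.00   = $0.0120  (one write)
             + 4 × $0.006 × 0.10   = $0.0024  (four reads)
             = $0.0144/hour → ~52% cheaper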

Cost Model: What You Actually Save

Effloow Lab computed these numbers using Claude Sonnet 4.6 pricing ($3.00/M input tokens).

Low-traffic scenario (100 req/day, 60% hit rate)

System prompt: 2,000 tokens

Without cache:
  (2000/1M) × $3.00 × 100 = $0.60/day → $18.00/month

With 5-min cache (60% hit rate):
  writes: (2000/1M) × $3.00 × 1.25 × 40  = $0.300
  reads:  (2000/1M) × $3.00 × 0.10 × 60  = $0.036
  total:  $0.336/day → $10.08/month

Savings: 44%

High-traffic scenario (1,000 req/day, 90% hit rate)

Without cache: $6.00/day → $180.00/month

With 5-min cache (90% hit rate):
  writes: (2000/1M) × $3.00 × 1.25 × 100 = $0.750
  reads:  (2000/1M) × $3.00 × 0.10 × 900 = $0.540
  total:  $1.29/day → $38.70/month

Savings: 78.5%

The key takeaway: caching is most powerful for large system prompts at high request volume. For 500-token prompts (below the caching minimum covered next) or sub-1-request-per-hour traffic, the gains are marginal to nonexistent.

Minimum Token Requirements

The cached prefix must meet a minimum token count or the cache is simply ignored (no error, no discount):

Model family                                           Minimum cacheable prefix
Claude Opus 4.1, Sonnet 4.5, Opus 4, Sonnet 4, Opus 3  1,024 tokens
Claude Haiku 3.5                                       2,048 tokens

If your system prompt is 800 tokens, caching silently does nothing. The threshold is worth knowing before spending time on implementation.
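
You can verify the count up front with the token-counting endpoint before wiring in cache_control. A sketch — the model ID and prompt variable are placeholders:

import anthropic

client = anthropic.Anthropic()

count = client.messages.count_tokens(
    model="claude-sonnet-4-5",    # assumption: substitute your model ID
    system=YOUR_SYSTEM_PROMPT,    # placeholder: your stable system prompt
    messages=[{"role": "user", "content": "ping"}],
)
if count.input_tokens < 1024:
    print(f"Only {count.input_tokens} tokens — cache_control will be silently ignored")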

Four Cache Breakpoint Patterns

You can define up to 4 cache breakpoints per request. Each cache_control marker caches everything up to and including that block. The automatic cache (enabled by default for certain workloads) consumes one of the 4 slots.

Pattern 1: Single system prompt breakpoint

The most common case. Cache the entire system prompt, which is stable across requests:

system = [
    {
        "type": "text",
        "text": YOUR_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}
    }
]
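
The system list is then passed via the system parameter of messages.create (reusing the client from earlier); the model ID is an assumption:

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumption: substitute your model ID
    max_tokens=1024,
    system=system,              # cached system prompt defined above
    messages=[{"role": "user", "content": user_question}],
)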

Pattern 2: Document + instructions breakpoint

Cache both a large reference document and the task instructions separately:

content = [
    {
        "type": "text",
        "text": large_document,         # ~5,000 tokens, cache here
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": task_instructions,      # ~1,200 tokens, cache here too
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": user_query             # per-request, not cached
    }
]

Pattern 3: Conversation history breakpoint

In multi-turn conversations, cache the stable history prefix up to the last N turns and leave recent turns uncached:

messages = [
    *stable_history[:-1],                  # older turns, identical across requests
    {
        "role": stable_history[-1]["role"],
        "content": [{
            "type": "text",
            "text": stable_history[-1]["content"],
            "cache_control": {"type": "ephemeral"}  # breakpoint: caches everything up to here
        }]
    },
    {"role": "user", "content": new_user_message}  # new turn, not cached
]

Pattern 4: Tool definitions breakpoint

For large tool lists (10+ tools with detailed schemas), cache the tool definitions:

# tool_schemas: list of 10+ tool dicts, 2000+ tokens total
# Place the breakpoint on the final tool; it caches the entire tools array
tools = [
    *tool_schemas[:-1],
    {**tool_schemas[-1], "cache_control": {"type": "ephemeral"}}
]

Note: cache_control is accepted on tool definitions themselves — put the marker on the last tool, and the breakpoint covers the whole tools array, which sits at the front of the cacheable prefix (tools → system → messages). Alternatively, tool documentation can be included as a text block in the system/user content with a breakpoint if you want message-level control.

Measuring Cache Hit Rate in Production

The usage object in the API response contains cache telemetry:

response = client.messages.create(...)

usage = response.usage
print(f"Input tokens:         {usage.input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:     {usage.cache_read_input_tokens}")
print(f"Output tokens:         {usage.output_tokens}")

hit_rate = usage.cache_read_input_tokens / (
    usage.cache_creation_input_tokens + usage.cache_read_input_tokens + 0.001
)
print(f"Cache hit rate:        {hit_rate:.1%}")

Track cache_read_input_tokens vs cache_creation_input_tokens across requests to identify when TTL misses are eating your budget.
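
A sketch of cross-request tracking — CacheStats is a hypothetical helper, not part of the SDK:

from dataclasses import dataclass

@dataclass
class CacheStats:
    writes: int = 0  # cumulative cache_creation_input_tokens
    reads: int = 0   # cumulative cache_read_input_tokens

    def record(self, usage) -> None:
        self.writes += usage.cache_creation_input_tokens or 0
        self.reads += usage.cache_read_input_tokens or 0

    @property
    def hit_rate(self) -> float:
        total = self.writes + self.reads
        return self.reads / total if total else 0.0

stats = CacheStats()
# after each call: stats.record(response.usage)
# alert when hit_rate trends down — that is TTL misses eating your budget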

Six Production Gotchas That Kill Cache Hit Rates

1. Timestamp in the cached prefix
Including "Current time: 2026-04-30T21:07:00Z" in your system prompt invalidates the cache on every request. Move dynamic values (timestamps, request IDs) to the uncached part of the prompt.

2. User-specific content in the cached block
"You are helping {user.name} at {user.company}" makes every user a separate cache miss. Move user context to a separate uncached content block after the breakpoint.

3. Tools list changes invalidate all downstream breakpoints
If you conditionally include tools based on user permissions and the tools list changes between requests, every breakpoint after the first tool-use block will miss. Normalize your tools list so equivalent permission sets produce identical JSON.

4. Workspace isolation (Feb 5, 2026)
Caches are now isolated per workspace, not per organization. If you have multiple workspaces sharing the same system prompt, each workspace maintains a separate cache. There is no cross-workspace cache sharing.

5. Cached prefix must be byte-identical
Even a single character change in any content block before a breakpoint invalidates the cache. This includes whitespace, encoding differences, and locale-specific formatting. Canonicalize your templates.

6. The 5-min TTL with batch workloads
If you run nightly batch jobs that process documents sequentially, requests typically arrive 30-120 seconds apart, and because each cache read refreshes the TTL, hit rates can stay above 90%. But if any gap between requests stretches past 5 minutes, the cache expires and the next request pays the write premium again. Consider the 1-hour TTL ("ttl": "1h") for batch jobs with uneven pacing.

Prompt Caching on OpenAI

For comparison: OpenAI's automatic prompt caching provides a 50% discount on input tokens for cached prefixes. It is automatic — no cache_control markup needed. OpenAI caches in 128-token increments automatically. The tradeoff is less control: you cannot force a cache breakpoint at a specific location.

Comparison Table

Feature              Claude (Anthropic)   OpenAI
Cache write cost     1.25× input          1.0× (automatic)
Cache read cost      0.10× input          0.50× input
Default TTL          5 minutes            5-10 min (up to ~1 hour off-peak)
Extended TTL         1 hour (2× write)    Not configurable
Min prefix size      1,024 tokens         1,024 tokens
Max breakpoints      4                    N/A (automatic)
Control              Explicit markers     Automatic
Workspace isolation  Yes (Feb 2026)       Not documented

FAQ

Q: Does prompt caching work with streaming?
Yes. Cache hit/miss is determined before streaming begins. In a streaming response, the usage block carrying input-token and cache telemetry arrives in the message_start SSE event, with final output token counts in message_delta.
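
A sketch using the SDK's streaming helper (reusing client and system from earlier; the model ID is an assumption) — accumulated usage, cache fields included, is available on the final message:

with client.messages.stream(
    model="claude-sonnet-4-5",  # assumption: substitute your model ID
    max_tokens=1024,
    system=system,
    messages=[{"role": "user", "content": user_question}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

usage = stream.get_final_message().usage  # includes cache_read_input_tokens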

Q: Is caching available on Amazon Bedrock?
Yes. Amazon Bedrock supports Claude prompt caching with similar semantics. TTL and pricing may differ — check the Bedrock documentation.

Q: Do tool results get cached?
No. Tool results are part of the user turn and change per execution. Cache breakpoints on tool result content blocks will write but rarely hit, wasting write costs.

Q: Can I cache images or files?
Yes. Image content blocks support cache_control. Caching large base64-encoded images can be particularly effective since they consume many tokens. The same 1,024-token minimum applies.
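
A sketch of a cached image block — the media type and base64 data are placeholders:

content = [
    {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",  # placeholder
            "data": image_b64           # placeholder: base64-encoded bytes
        },
        "cache_control": {"type": "ephemeral"}  # caches the prefix up to and including the image
    },
    {"type": "text", "text": "Describe this diagram."}
]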

Q: Does the cache persist across API keys?
No. Caches are isolated per API key and workspace. Rotating keys resets cache state.

Verdict: Cache Everything Stable, Review After the TTL Change

If you built on Claude with caching before March 2026, audit your hit rates now — the silent TTL reduction may have eroded your savings. The fix is either to ensure requests arrive within 5-minute windows, or to set "ttl": "1h" explicitly for low-traffic workloads that justify the 2× write price.

For high-traffic apps (>100 req/day on a stable system prompt), caching should be your first cost-optimization step before any model downgrade or architecture change. Related: Claude Streaming + Tool Use guide and LiteLLM AI Gateway guide for proxy-level caching.
