Anthropic's prompt caching is one of the highest-ROI LLM cost-reduction techniques shipped in the last two years, but the mechanics aren't immediately obvious from the docs. The pricing is non-uniform (a premium on cache writes balanced against a 90% discount on reads), and caching is an explicit opt-in via a marker rather than firing automatically the way OpenAI's does. The summary: tag the stable portion of your prompt with cache_control: { type: "ephemeral" }, pay 1.25x normal input price on the first request (5-minute TTL) or 2x (1-hour TTL), then 0.10x on every subsequent request within the cache TTL. The 5-minute TTL breaks even on the second request, which is the first cache hit; the 1-hour TTL needs a third request to pay back but survives much longer between requests. For most production workloads with a system prompt of roughly a thousand tokens or more, the discount kicks in by the second customer interaction. This post walks through the mechanics, the math, the gotchas, and the production patterns that turn the marker into actual savings.
The parent guide AI API caching covers the broader caching strategy; this article goes one level into Anthropic's specific implementation.
What it caches and why
Prompt caching is provider-side prefix-attention caching. When you send a request to Anthropic with cache_control: { type: "ephemeral" } on part of the prompt, Anthropic hashes the leading content up to that marker, checks an internal cache, and serves the cached attention state if a match exists. The actual model run still happens — Claude still generates the response token-by-token — but the expensive prefix-attention computation is skipped.
The "cache" here is not the response. It's the work the model does to encode the static context into the model's internal representation. Most production LLM workloads carry a long stable prefix (system prompt + retrieved context + tool definitions) followed by a short variable suffix (the user message). Re-encoding the stable prefix on every request is wasted compute. Anthropic charges less for the cached portion because it's doing less work.
The pricing math
The numbers that matter:
| Token category | Price multiplier (vs base input price) | Notes |
|---|---|---|
| Normal input (uncached) | 1.0x | Standard input pricing |
| Cache write — 5-minute TTL (default) | 1.25x | 25% premium for the short-window cache |
| Cache write — 1-hour TTL (extended) | 2.0x | 100% premium for the long-window cache |
| Cache read (subsequent requests within TTL) | 0.10x | The 90% discount — the wedge, same for either TTL |
| Output | normal output pricing | Unchanged |
The break-even threshold is the point at which cumulative savings from cache reads exceed the one-time write premium. On the 5-minute TTL, two requests (one write, one read) average out to (1.25 + 0.10) / 2 = 0.675x, already a 32.5% saving on the cached portion. Three requests drop the average to 0.483x (a 52% saving). As the cache stays warm, the average approaches the 0.10x read price.
5-minute TTL: average cost per request on the cached portion after N requests (1 write, N-1 reads):
N=1: 1.25x (write only; 25% worse than uncached)
N=2: 0.675x (32.5% saving)
N=3: 0.483x (52% saving)
N=5: 0.330x (67% saving)
N=10: 0.215x (78.5% saving)
N→∞: 0.10x (90% saving, the steady state)
1-hour TTL: average cost per request on the cached portion after N requests (1 write, N-1 reads):
N=1: 2.00x (write only; 100% worse than uncached)
N=2: 1.05x (still worse than uncached)
N=3: 0.733x (27% saving, the first net win)
N=5: 0.480x (52% saving)
N=10: 0.290x (71% saving)
N→∞: 0.10x (90% saving, the same steady state)
The 1-hour TTL pays back later (it needs 3 requests to come out ahead, vs 2 on the 5-minute TTL), but the cache survives 12x longer between requests, which is the entire point.
For workloads with stable prefixes that hit the cache many times per 5-minute window, the effective discount approaches 90% on the cached portion. Output tokens stay at full price; only the input-side computation gets the discount.
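The averages above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming N counts requests (one write followed by N-1 reads) and that every read lands before the TTL expires; the function names are illustrative, not part of any SDK.

```python
def average_multiplier(n_requests: int, write: float = 1.25, read: float = 0.10) -> float:
    """Average price multiplier on the cached portion over n_requests.

    Request 1 pays the write premium; requests 2..n are cache reads.
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    return (write + (n_requests - 1) * read) / n_requests


def break_even_requests(write: float, read: float = 0.10) -> int:
    """Smallest request count at which caching beats paying 1.0x every time."""
    if read >= 1.0:
        raise ValueError("reads must be cheaper than uncached input to break even")
    n = 1
    while average_multiplier(n, write, read) >= 1.0:
        n += 1
    return n


for label, write in (("5-minute", 1.25), ("1-hour", 2.00)):
    row = ", ".join(
        f"N={n}: {average_multiplier(n, write):.3f}x" for n in (1, 2, 3, 5, 10)
    )
    print(f"{label} TTL: {row}; break-even at request {break_even_requests(write)}")
```

Swapping in your own multipliers is a quick way to re-check the tables if Anthropic's pricing changes.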
The cache_control marker
The syntax. You attach cache_control to a content block at the end of the portion you want cached:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp. Follow these guidelines: [...long stable instructions...]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "How do I reset my password?"}
    ]
)
The marker tells Anthropic: "everything up to and including this content block should be cached as a prefix." The user message after the cached prefix isn't cached; it's processed normally and becomes the variable suffix.
The cache key is the byte-exact content of everything before and including the marker. Any change (a one-character difference in the system prompt, a different model, a different tool definition) invalidates the cache.
You can place markers on multiple content blocks to cache nested levels of prefix. For example:
system=[
    {
        "type": "text",
        "text": "You are a helpful assistant. [system instructions]",
        "cache_control": {"type": "ephemeral"}  # Block 1: innermost cache
    },
    {
        "type": "text",
        "text": "Retrieved context: [long RAG passage from this user's query]",
        "cache_control": {"type": "ephemeral"}  # Block 2: outer cache
    }
]
This creates two cache entries: one for the system prompt alone (high reuse across all users), one for system+context (lower reuse, specific to this retrieval). The model checks the longest matching cached prefix first. If the retrieval changes per request but the system prompt is stable, the inner cache (block 1) still hits.
There's a documented cap on how many markers can appear per request (4 in current implementations); placement of the markers is its own discipline.
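Because the cap is easy to exceed once markers are scattered across tools, system blocks, and messages, a pre-flight check can help. The sketch below is a hypothetical helper, not an SDK function; it assumes the cap of 4 quoted above and the request shapes used in this article.

```python
MAX_CACHE_MARKERS = 4  # the cap quoted above; confirm against current Anthropic docs


def count_cache_markers(tools=None, system=None, messages=None) -> int:
    """Count the blocks in a request that carry a cache_control field."""
    blocks = list(tools or [])
    if isinstance(system, list):  # a plain-string system prompt cannot carry a marker
        blocks.extend(system)
    for message in messages or []:
        content = message.get("content")
        if isinstance(content, list):  # string content cannot carry a marker either
            blocks.extend(content)
    return sum(1 for block in blocks if isinstance(block, dict) and "cache_control" in block)


def check_marker_budget(tools=None, system=None, messages=None) -> int:
    """Raise before sending if the request uses more markers than the cap."""
    used = count_cache_markers(tools, system, messages)
    if used > MAX_CACHE_MARKERS:
        raise ValueError(f"{used} cache markers in request; cap is {MAX_CACHE_MARKERS}")
    return used
```

Calling check_marker_budget(tools=..., system=..., messages=...) just before client.messages.create turns a silent misconfiguration into an early, local error.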
The TTL options
Two TTL choices:
Default ephemeral (5 minutes) — the standard option. Specified as:
"cache_control": {"type": "ephemeral"}
Extended TTL (1 hour) — opt-in by setting the ttl field. Specified as:
"cache_control": {"type": "ephemeral", "ttl": "1h"}
The 1-hour option carries a 2x write premium (vs 1.25x on the 5-minute TTL) but lets cache entries survive 12x longer between hits — the right call when traffic to a specific prefix is too sparse to keep a 5-minute cache warm.
The right choice depends on traffic density:
| Traffic to a specific stable prefix | TTL choice |
|---|---|
| Multiple hits per minute (active production chatbot) | Default 5-minute. The cache stays warm naturally. |
| Hits every few minutes (moderate-traffic chatbot or support tool) | Default 5-minute. Edge case where hits cluster around the TTL boundary; sometimes worth testing. |
| Hits every 10-30 minutes (low-volume backend integration) | 1-hour extended. The write premium is offset by the longer warm-cache window. |
| Hits once an hour or less often | Probably not worth caching. Both TTLs tend to expire before the second hit, so the write premium is never recovered. |
The 1-hour option is the right call for workloads with predictable but spaced-out traffic: a scheduled job that runs against the same prompt every 15 to 30 minutes, for instance.
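The table can be encoded as a small selection function. This is a rough sketch: the thresholds are assumptions that simply reuse the two TTL lengths as cut-offs, and choose_cache_control is a hypothetical name. Traffic that clusters near a boundary deserves measurement rather than a rule.

```python
from typing import Optional


def choose_cache_control(seconds_between_hits: float) -> Optional[dict]:
    """Pick a cache_control value from the typical gap between hits on one prefix.

    Returns None when caching is unlikely to pay back.
    """
    if seconds_between_hits < 5 * 60:
        # Hits arrive inside the default window, so the cache stays warm.
        return {"type": "ephemeral"}
    if seconds_between_hits < 60 * 60:
        # Too sparse for 5 minutes, dense enough for the 1-hour TTL.
        return {"type": "ephemeral", "ttl": "1h"}
    # The entry would usually expire before the next hit.
    return None


print(choose_cache_control(20))       # several hits per minute
print(choose_cache_control(20 * 60))  # one hit every 20 minutes
print(choose_cache_control(2 * 3600)) # one hit every 2 hours
```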
What you need to know about cache hits in the response
The usage block in the response tells you what hit and what wrote:
response.usage
# Usage(
# input_tokens=1234,
# output_tokens=456,
# cache_creation_input_tokens=0, # Tokens written to cache (paid 1.25x at 5-min, 2x at 1-hour)
# cache_read_input_tokens=1200 # Tokens read from cache (paid 0.10x)
# )
cache_read_input_tokens is the count of input tokens served from the cache. cache_creation_input_tokens is the count written on a fresh-write request (the first request that populates a cache entry pays this; subsequent reads have this at 0).
The actual cost calculation:
def calculate_cost(usage, input_price, output_price, cache_write_multiplier=1.25, cache_read_multiplier=0.10):
    """Return the cost of one request; prices are per token."""
    # Anthropic reports the three input counts separately. input_tokens covers
    # only the uncached input, so the cache fields are not subtracted from it.
    uncached_input = usage.input_tokens
    cost = (
        uncached_input * input_price
        + usage.cache_creation_input_tokens * input_price * cache_write_multiplier
        + usage.cache_read_input_tokens * input_price * cache_read_multiplier
        + usage.output_tokens * output_price
    )
    return cost
The input_tokens field counts only the input tokens that were neither read from nor written to the cache; the cache fields are reported alongside it, not as subsets of it. Total input is input_tokens + cache_creation_input_tokens + cache_read_input_tokens, so your accounting should price each of the three counts at its own rate rather than subtracting the cached portions from input_tokens.
What invalidates the cache
The cache hit requires byte-exact match of everything before and including the marker. Things that invalidate:
- Any change to the system prompt content (even whitespace). The fingerprint differs; cache misses.
- Different model parameter. Cache entries are per-model; a request to claude-opus doesn't hit a claude-sonnet cache entry.
- Different tool definitions before the marker. If tools are in the cached prefix, changing tools invalidates.
- Different placement of the marker. Moving a marker from block N to block N+1 creates a different cache key.
- 5-minute (or 1-hour) TTL elapsed without a hit. Cache entries age out.
Things that don't invalidate:
- Variable content after the marker. The user message is variable per request and doesn't affect the cached prefix.
- Different sampling parameters (temperature, top_p, max_tokens). These affect generation but not the prefix attention.
- Different request IDs, metadata, headers. Not part of the cache key.
The cache discipline matches the broader cache-fingerprinting discipline in prompt cache fingerprinting pitfalls — get the boundaries right or the cache hits stop landing.
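One way to catch accidental invalidation before it shows up on the bill is to fingerprint the prefix locally and log when it changes. The sketch below is an approximation under stated assumptions: Anthropic's real cache key is internal, and prefix_fingerprint is a hypothetical helper that hashes the inputs this section lists as key-relevant.

```python
import hashlib
import json


def prefix_fingerprint(model: str, tools: list, system_blocks: list) -> str:
    """Hash the parts of a request that sit at or before the cache marker.

    Keys are deliberately NOT sorted: if your serializer reorders keys between
    requests, that difference should show up here too.
    """
    payload = json.dumps(
        {"model": model, "tools": tools, "system": system_blocks},
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


system = [{"type": "text", "text": "You are a support agent.", "cache_control": {"type": "ephemeral"}}]
drifted = [{"type": "text", "text": "You are a support agent. ", "cache_control": {"type": "ephemeral"}}]

baseline = prefix_fingerprint("claude-sonnet-4-7", [], system)
print(baseline == prefix_fingerprint("claude-sonnet-4-7", [], system))   # same prefix
print(baseline == prefix_fingerprint("claude-sonnet-4-7", [], drifted))  # trailing space added
```

Logging the fingerprint next to cache_read_input_tokens makes it easy to see whether a run of misses lines up with a prefix change or with TTL expiry.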
Production patterns
The shapes that hold up in production:
Stable system prompt + dynamic context + user message. The most common pattern. System prompt and tool definitions go in cached blocks; retrieved context and user message stay uncached. Almost every production LLM workload looks like this.
Two-level caching (system alone + system+context). When retrieved context changes per request but reuses a stable system prompt, mark both blocks for caching. The inner system-only cache still hits even when the outer system+context cache misses. Recovers a meaningful chunk of the saving.
Cache-warming on cold start. If your workload has predictable traffic patterns (e.g. business-hours support chatbot), fire a single warm-up request at the start of the active window to populate the cache. The first real user request hits the warmed cache instead of paying the write premium.
Per-user/per-session caching for personalised prompts. Each user gets their own cached prefix (with personalised system instructions). The cache hits within a single user's session but misses across users. The write premium is real but pays back across the second + third message of any conversation.
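The cache-warming pattern above could look roughly like the following. It is a sketch under assumptions: warm_cache is a hypothetical helper, the "ping" message is a placeholder, and max_tokens=1 is only there to keep output cost minimal. The warm-up request itself still pays the write premium; what it buys is that the first real user gets a cache read.

```python
def warm_cache(client, model: str, system_blocks: list, tools: list = None):
    """Send one minimal request so the marked prefix is written to the cache.

    `client` is expected to be an anthropic.Anthropic() instance. The prefix
    (model, tools, system blocks) must match real traffic byte for byte.
    """
    kwargs = {
        "model": model,
        "max_tokens": 1,  # the response is discarded; keep output cost minimal
        "system": system_blocks,
        "messages": [{"role": "user", "content": "ping"}],  # placeholder suffix
    }
    if tools:
        kwargs["tools"] = tools  # only pass tools when real traffic does too
    response = client.messages.create(**kwargs)
    return response.usage
```

On the returned usage object, a non-zero cache_creation_input_tokens means the prefix was written; a non-zero cache_read_input_tokens means it was already warm.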
The anti-patterns
Three patterns that look like they should work but undermine the cache:
Injecting timestamps into the system prompt. "You are responding at [timestamp]. [Instructions...]" The cache fingerprint changes per request. Cache never hits. Strip dynamic content from the cached portion.
Marking everything for caching. The cache key is everything up to and including the marker. If you mark the very last content block (the user message itself), the cache key includes the user message, which makes it effectively useless — every request has a unique user message, so the cache never hits twice.
Caching prompts shorter than the minimum cacheable length. Anthropic documents a minimum prefix size for caching (1,024 tokens on most models, higher on some; check the docs for your model). A marked prefix below that size is processed normally without being cached, so the marker does nothing. Even just above the threshold, the per-request saving is small relative to the added complexity.
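For the timestamp anti-pattern, the usual fix is to move the dynamic value to a block after the marker. A minimal sketch, assuming the request shape used earlier in this article; build_request and STABLE_INSTRUCTIONS are illustrative names.

```python
from datetime import datetime, timezone

STABLE_INSTRUCTIONS = (
    "You are a customer support agent for Acme Corp. "
    "[...long stable instructions...]"
)


def build_request(question: str, now: datetime) -> dict:
    """Build request kwargs with the timestamp kept out of the cached prefix."""
    return {
        "system": [
            # Cached prefix: identical bytes on every request.
            {
                "type": "text",
                "text": STABLE_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},
            },
            # After the marker: changes per request without touching the cache key.
            {"type": "text", "text": f"Current time (UTC): {now.isoformat()}"},
        ],
        "messages": [{"role": "user", "content": question}],
    }


a = build_request("How do I reset my password?", datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc))
b = build_request("Where is my order?", datetime(2025, 1, 1, 9, 5, tzinfo=timezone.utc))
print(a["system"][0] == b["system"][0])  # cached block is unchanged
print(a["system"][1] == b["system"][1])  # timestamp block differs
```

The resulting dict can be unpacked into client.messages.create alongside model and max_tokens.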
When OpenAI's automatic prompt cache is the better fit
OpenAI's prompt caching engages automatically with no caller-side configuration. The discount is smaller (50% vs Anthropic's 90%) but the operational simplicity is real. The trade:
- If your application is OpenAI-heavy → no work needed; the discount applies automatically on prompts ≥1,024 tokens.
- If your application is Anthropic-heavy → adopt the cache_control marker discipline; the 90% discount is materially larger.
- If your application uses both → set up both patterns. Most production gateways (Prism included) handle this transparently — markers passed through to Anthropic, cached_tokens read back from both providers.
The deeper comparison: provider-native caching glossary.
How Prism handles Anthropic prompt caching
Prism's request handler passes cache_control markers from customer requests through to Anthropic unchanged. The cache_creation_input_tokens and cache_read_input_tokens from the upstream response are read into the billing path, so the customer's bill is calculated against the discounted base rather than the gross input-token count.
Specifically:
- Pass-through preservation. If your code attaches cache_control markers to a request, Prism forwards them to Anthropic. No marker stripping, no auto-modification.
- Discount pass-through. The 90% cache-read discount applies to the customer's bill, not absorbed as Prism margin. The X-Prism-Native-Cache-Saved-Cents response header surfaces the per-request saving.
- Auto-marking opt-in (planned for v1.9). For customers who don't want to manually attach markers, Prism will optionally inject markers on stable-prefix sections (system message + initial context blocks) based on heuristics. Currently customer-side opt-in; expanding behaviour TBD.
For broader prompt-caching context including the OpenAI equivalent: prompt caching glossary.
Decision framework
If you're standing up Anthropic prompt caching on a production workload:
- Identify your stable prefix. System prompt + static instructions + tool definitions. Sum the token count. If it's over roughly 1,024 tokens (Anthropic's documented minimum cacheable length on most models), the cache is probably worth setting up.
- Choose your TTL. Default 5-minute for active production traffic; 1-hour extended for spaced-out traffic that still recurs within the hour.
- Attach the marker. cache_control: { type: "ephemeral" } on the final content block of the cached portion.
- Verify hits. Read cache_read_input_tokens from the response usage block on the second and subsequent requests. Should be non-zero on cache hits.
- Avoid the anti-patterns. No timestamps in the cached portion. Don't mark the user message itself. Don't bother caching short prompts.
- Layer with response-level caching for full coverage. Prompt caching discounts the calls that go through; response caching avoids many of them entirely. Read AI API caching for the full layered strategy.
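The "verify hits" step can be automated with a small classifier over the usage block. This is a sketch: classify_cache_result is a hypothetical helper, and the SimpleNamespace objects stand in for a real response.usage.

```python
from types import SimpleNamespace


def classify_cache_result(usage) -> str:
    """Label a response by what the cache did, based on its usage block."""
    # The cache fields may be None on some responses, so default them to 0.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read and written:
        return "hit+write"  # e.g. inner prefix read, outer prefix newly written
    if read:
        return "hit"
    if written:
        return "write"
    return "none"  # caching did not engage on this request


# Stand-in usage objects for illustration; real ones come from response.usage.
first = SimpleNamespace(cache_creation_input_tokens=1200, cache_read_input_tokens=0)
second = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=1200)
print(classify_cache_result(first), classify_cache_result(second))
```

A steady stream of "write" or "none" labels on a prefix you expected to be warm usually points at one of the invalidation causes or anti-patterns covered earlier.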
The mechanic is simple once the pricing math is clear. The wedge is genuinely large — 90% off the dominant cost component on workloads where it applies. The discipline is keeping the cached prefix stable, which is mostly a code-hygiene problem.
Where to go next
For the parent layered caching framework: AI API caching. For the OpenAI equivalent: prompt caching glossary and the OpenAI-specific deep dive in OpenAI cost optimization. For the broader fingerprinting discipline: prompt cache fingerprinting pitfalls.
For modelling Anthropic-cached cost on your workload: savings calculator — the stable-prefix toggle drives the provider-native passthrough projection.
FAQ
What's the exact write premium?
25% above normal input price (1.25x) for the standard 5-minute TTL, and 100% above (2x) for the 1-hour extended TTL. Confirm against Anthropic's current pricing page, since pricing has moved historically. Both pay off within a small number of requests on most workloads.
Can I cache the user message?
You can, but it almost never makes sense. The cache key is everything up to and including the marker; if the user message is part of the key, the cache hits only on byte-identical user messages — which is rare in production. Mark the system prompt or tool definitions instead; let user messages stay uncached.
Does caching work with streaming responses?
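Yes. The stream parameter doesn't affect cache behaviour. On Anthropic's streaming API the usage counts arrive inside the stream events: cache_read_input_tokens and cache_creation_input_tokens are reported in the usage block of the message_start event, with output token counts following in message_delta. There is no stream_options.include_usage flag to set; that is OpenAI's mechanism. Streaming and prompt caching are independent.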
What happens if I change the system prompt — do I have to invalidate the cache manually?
No. The cache fingerprint includes the system prompt content; any change automatically generates a different cache key, so old entries are unreachable for new requests. Old entries age out via TTL. No manual invalidation needed.
Can I use prompt caching with function calling?
Yes — and tool definitions are commonly part of the cached prefix. If your tools array is stable across requests, mark it for caching; the cache hits on the tool definitions even when user messages vary. Changing tools invalidates the cache for the affected requests.
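To cache a stable tools array, the marker goes on the last tool definition, which covers every tool before it as part of the prefix. A minimal sketch: mark_last_tool is a hypothetical helper, and get_order_status is an invented tool used only for illustration.

```python
import copy


def mark_last_tool(tools: list) -> list:
    """Return a copy of the tools array with a cache marker on the final tool."""
    if not tools:
        return []
    marked = copy.deepcopy(tools)  # leave the caller's definitions untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


tools = [
    {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the status of an order by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
]

cached_tools = mark_last_tool(tools)
print(cached_tools[-1]["cache_control"])
```

Pass cached_tools as the tools argument to client.messages.create. Keep the array's order and contents identical between requests, since any change alters the prefix.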
Does the cache work across different models?
No. Cache entries are per-model. A request to claude-opus-4-7 doesn't hit cache entries from claude-sonnet-4-7. If you route between models per request (e.g. via a gateway like Prism), each model's cache warms independently.
What's the smallest prompt that benefits from caching?
Roughly 1,024 input tokens: that is Anthropic's documented minimum cacheable prefix on most models (some models require more, so check the docs for yours). A 200-token prefix marked with cache_control is accepted but processed without caching, so there is no write and there are no reads. Use it on prompts that are actually long.
How does Prism handle this for non-Anthropic providers?
Prism passes provider-specific cache markers through to the target provider. OpenAI's automatic caching engages without markers; Anthropic's requires the cache_control attachment shown above. Customer code attaches markers explicitly; Prism doesn't auto-modify request shapes (a potential auto-marking opt-in is covered in the Prism section above).
Anthropic's prompt cache is a real wedge on the right workloads. The AI API caching guide shows where it fits in the broader layered strategy; the savings calculator lets you model the impact on your bill.