A team I worked with shipped a Claude-powered support agent, turned on prompt caching, and watched their bill barely move. The cache hit rate hovered around 8%. The fix was three lines and one reordering — they were appending a timestamp to the top of their system prompt. Every request invalidated the entire cache on the first token.
Prompt caching with Claude is not a checkbox. It caches a prefix of your request, byte-for-byte, and the rules for what counts as a prefix are stricter and less intuitive than the docs make them look. Get the ordering wrong and you pay the write premium on every call while never collecting the read discount.
TL;DR
-
Prompt caching with Claude reuses the KV cache for an exact token prefix. The cache key is the literal prefix up to a
cache_controlbreakpoint; one changed byte anywhere before the breakpoint is a full miss from that point on. - Order matters: most-static first. Tools → system prompt → static context → conversation. Anything that changes per request goes after your last breakpoint, never before it.
- Cache writes cost more than base input; reads cost far less. The 5-minute write is ~1.25× base input tokens, the read is ~0.1×. You need roughly 2+ hits per write to break even — verify current multipliers in the API docs.
- Minimum cacheable prefix is ~1024 tokens (Sonnet/Opus) or ~2048 (Haiku). Below that, the breakpoint is silently ignored.
- TTL is 5 minutes by default, refreshed on every read. A steady stream of traffic keeps a cache alive indefinitely; a 1-hour TTL exists for bursty workloads at a higher write premium.
What does prompt caching with Claude actually cache?
It caches the attention key/value tensors for a prefix of your prompt, not the text. When Claude processes a request, every token's KV state depends on all tokens before it. If the first N tokens are identical to a request the server processed minutes ago, their KV states are identical too, so the server skips recomputing them and loads them from cache instead. That skipped prefill is where the savings come from — both the latency and the token cost.
The consequence that trips people up: caching is prefix-only and exact. There is no fuzzy matching, no "cache the parts that are similar." Token 5,000 can only be a cache hit if tokens 1 through 4,999 are byte-identical to the cached version. Change token 3 and tokens 3 through 5,000 are all recomputed.
That is why the timestamp-at-the-top bug is fatal. The model sees a different first token every request, so nothing after it can ever hit.
Where do you put the cache_control breakpoint?
Put it at the boundary between content that is stable across requests and content that changes. You mark the breakpoint on the last block you want included in the cached prefix:
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_INSTRUCTIONS, # stable across all users
},
{
"type": "text",
"text": LARGE_STATIC_CONTEXT, # docs, schema, examples
"cache_control": {"type": "ephemeral"}, # <-- breakpoint here
},
],
messages=conversation, # per-request, NOT cached by this breakpoint
)
Everything from the start of the request up to and including the block carrying cache_control becomes the cached prefix. The messages array — the part that grows every turn — sits after it and is prefilled fresh each time. That is exactly what you want: pay full price only for the new tokens.
You get up to four breakpoints. Use them to cache nested layers that change at different rates. A common layout for an agent:
[ tool definitions ] breakpoint 1 (changes almost never)
[ system prompt ] breakpoint 2 (changes on deploy)
[ retrieved documents ] breakpoint 3 (changes per session)
[ conversation history ] breakpoint 4 (changes per turn)
The server checks breakpoints from the longest matching prefix backward, so placing one at the end of the conversation history lets multi-turn chats reuse everything up to the previous turn. Each new user message extends the prefix; the prior turns are a cache read.
Why does tool and system-prompt ordering decide your hit rate?
Because tools come first in the request, and a single edit to a tool definition invalidates every breakpoint after it. The canonical request order Claude assembles is: tools, then system, then messages. Whatever sits earliest is the most expensive to change.
This forces a discipline. Put the things that never change — tool schemas, the immutable parts of your system prompt — at the very front. Put volatile fragments as late as possible. Concretely:
-
Do not interpolate a request ID, timestamp, user name, or A/B flag into the system prompt. Move it into the
messagesarray, after your last breakpoint. - Do not reorder tools dynamically (e.g. "show the most relevant tool first"). Tool order is part of the cached prefix. Keep it fixed; let the model pick.
-
Do not pretty-print JSON context with a non-deterministic key order.
json.dumps(obj, sort_keys=True)or it will silently miss.
I have seen a 70-point swing in hit rate from nothing but moving a per-user greeting out of the system prompt and into the first user message.
What does prompt caching cost, and when does it pay off?
Cache writes cost more than normal input tokens; cache reads cost much less. With the default 5-minute TTL, a write runs about 1.25× the base input price and a read about 0.1×. (Treat these as the documented multipliers and confirm against the current pricing page — they are the figures that drive the math, so it is worth checking before you commit.)
The break-even is simple. If base input is 1 unit, you pay 1.25 to write and 0.1 to read. One write plus one read = 1.35 versus 2.0 for two uncached calls — already a win on the second hit. The rough rule: you need a little more than one cache read per write to come out ahead. Workloads with a fat, reused prefix (long system prompt, big RAG context, stable tool set) and many requests against it are where caching turns into 5–10× cost reductions on the cached portion.
Where it loses: low-traffic endpoints where the 5-minute TTL expires before the next request arrives. You pay the write premium, the cache dies, and the next call writes again. If your traffic is sparse but bursty, the 1-hour TTL ({"type": "ephemeral", "ttl": "1h"}) trades a higher write premium for a longer life — worth it only when you are confident the prefix gets reused within the hour.
Why is the minimum cacheable length a silent failure mode?
Because a breakpoint on a prefix shorter than the minimum is ignored without an error. For Sonnet and Opus the floor is around 1,024 tokens; for Haiku it is about 2,048. Put cache_control after a 300-token system prompt and you will see cache_creation_input_tokens: 0 in the response, no warning, and wonder why nothing caches.
Always read back the usage fields. Every response tells you exactly what happened:
"usage": {
"input_tokens": 412,
"cache_creation_input_tokens": 4096,
"cache_read_input_tokens": 0,
"output_tokens": 218
}
-
cache_creation_input_tokens > 0— you wrote a cache this turn (paid the premium). -
cache_read_input_tokens > 0— you hit a cache (paid the discount). - both zero on the cached portion — your breakpoint is below the minimum, or your prefix changed.
Log these three numbers per request. Your hit rate is cache_read / (cache_read + cache_creation) over a window, and it is the single metric that tells you whether caching is doing anything. If cache_creation is consistently high and cache_read near zero, something before your breakpoint is mutating.
How does multi-turn caching stay alive across a conversation?
Each read refreshes the TTL. The 5-minute clock resets every time the cached prefix is hit, so an active conversation — or a busy endpoint sharing one system prefix — keeps the cache warm indefinitely. It only expires after 5 minutes of no hits.
For a chat agent, put your final breakpoint at the end of the latest assistant turn. On the next user message, the entire history up to that point is a single cache read, and you pay full price only for the new user tokens plus the new generation. The cost of a long conversation grows with the new tokens per turn, not with the whole transcript replayed every time. This is the difference between linear and quadratic cost scaling on long sessions, and it is the main reason caching matters for agents that accumulate context.
One caveat: if you trim or summarize history mid-conversation (a common context-window tactic), you have rewritten the prefix. The summarized version is a fresh cache write. Plan summarization at deliberate boundaries, not every turn, so you amortize the rewrite.
Direct answer: where does the cache breakpoint go?
Place the cache_control breakpoint at the exact boundary between what is stable across requests and what changes per request, with the most static content (tool definitions, then system prompt, then static context) ordered first and anything volatile pushed after your last breakpoint. Prompt caching with Claude reuses the KV-cache for a byte-exact token prefix, so a single changed token before the breakpoint forces a full recompute from that point and destroys the discount; writes cost about 1.25× base input and reads about 0.1×, so you break even after roughly one reuse. Keep tool order fixed, strip timestamps and per-user strings out of the system prompt, ensure the cached prefix clears the ~1,024-token (Sonnet/Opus) or ~2,048-token (Haiku) floor, and watch cache_read_input_tokens versus cache_creation_input_tokens on every response to confirm it is actually working.
Top comments (0)