Last month I ran a side-by-side test on an AI agent that processes about 4,000 requests a day. The agent has a long system prompt (roughly 2,800 tokens of rules, tool definitions, and examples) that gets sent with every single call. Before prompt caching: $47/day. After enabling caching on that system prompt block: $6.80/day.
That's not a rounding error. That's an 85% cost reduction with a single configuration change and zero changes to the agent's behavior.
Here's exactly how prompt caching works and how to set it up without the gotchas.
What prompt caching actually does (and doesn't do)
Anthropic's prompt caching works at the prefix level. When you send a request, the API checks whether the beginning of it (tool definitions, then the system prompt, then messages, in that order) exactly matches a prefix that was cached earlier. If it does, those tokens are read from the cache instead of being re-processed through the full model, and you pay a dramatically lower per-token rate for them.
The pricing structure (as of mid-2026 on Claude 3.5 Sonnet):
- Normal input tokens: $3.00 per million
- Cache write (first use, or cache miss): $3.75 per million (a 25% premium to write the cache)
- Cache read (cache hit): $0.30 per million (90% discount vs. normal)
The cache lasts 5 minutes between requests (with the TTL resetting on each hit). For any agent that gets called more often than every 5 minutes — which is most production agents — this is almost always a win.
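To make the three rates concrete, here is a minimal sketch of the input-side cost of a single request. The function name `input_cost_usd`, the constant names, and the 200-token user message are my own assumptions for illustration; the rates are hard-coded from the list above, so treat the result as an estimate, not a billing figure.

```python
# Rates in USD per million input tokens, copied from the pricing list above.
NORMAL_RATE = 3.00
CACHE_WRITE_RATE = 3.75
CACHE_READ_RATE = 0.30


def input_cost_usd(uncached_tokens: int, cache_write_tokens: int = 0,
                   cache_read_tokens: int = 0) -> float:
    """Estimate the input-side cost of one request in USD."""
    total = (
        uncached_tokens * NORMAL_RATE
        + cache_write_tokens * CACHE_WRITE_RATE
        + cache_read_tokens * CACHE_READ_RATE
    )
    return total / 1_000_000


# A 2,800-token system prompt plus an assumed 200-token user message:
no_cache = input_cost_usd(3_000)                             # all at the normal rate
first_call = input_cost_usd(200, cache_write_tokens=2_800)   # cache miss (write)
later_call = input_cost_usd(200, cache_read_tokens=2_800)    # cache hit (read)
print(f"no cache: ${no_cache:.5f}  write: ${first_call:.5f}  read: ${later_call:.5f}")
```

The first call is slightly more expensive than an uncached one; every hit after it is far cheaper.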
The exact API call
The key is the cache_control block. You add it as a "breakpoint" at the end of any message block you want cached. The API caches everything up to and including that breakpoint.
import anthropic

client = anthropic.Anthropic()

# Your long system prompt - tool definitions, rules, examples, etc.
SYSTEM_PROMPT = """
You are a support agent for Acme Corp...
[2,800 tokens of rules, tool definitions, persona, examples]
"""

# Placeholder input; in production this comes from the incoming request.
user_message = "Where is my order?"

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # <-- this is the entire setup
        }
    ],
    messages=[
        {"role": "user", "content": user_message}
    ],
)

# Check what actually happened
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
The cache_creation_input_tokens field tells you a cache was written (you pay the 25% premium). On subsequent calls within 5 minutes, cache_read_input_tokens will be populated instead, and you pay $0.30/M instead of $3.00/M.
Where it saves money and where it doesn't
High-ROI scenarios:
Large system prompts repeated on every call. If your system prompt is 1,000+ tokens and you're calling the API more than once every 5 minutes, caching it is almost always net positive.
Tool definitions. Tool schemas count as input tokens, and they can be surprisingly large. A set of 10 reasonably-described tools might run 800-1,200 tokens. Cache the tools block.
Few-shot examples in the system prompt. This is the big one. People add 5-10 worked examples to their system prompts to improve output quality. Those examples might be 2,000-4,000 tokens. Cache them.
Document analysis at scale. If you're analyzing the same document with many different questions (think: extracting 20 different fields from a contract), cache the document text as a user message and issue all 20 queries against the same cache.
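For the document-analysis case, a rough sketch of the request shape: the document goes in the first content block of the user message with the breakpoint on it, and the question follows in a second block. The helper name `build_document_request`, the placeholder contract text, the field names, and the `max_tokens` value are all assumptions for illustration; only the content-block structure mirrors the system-prompt example above.

```python
def build_document_request(document_text: str, question: str) -> dict:
    """Build keyword arguments for client.messages.create().

    The document carries the cache breakpoint; the question comes after it,
    so the question can change without altering the cached prefix.
    """
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 512,  # assumed value; size it for your extraction output
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": document_text,
                        "cache_control": {"type": "ephemeral"},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    }


contract = "[full contract text]"  # placeholder document
fields = ["governing law", "termination notice period"]  # hypothetical fields
calls = [build_document_request(contract, f"Extract the {name}.") for name in fields]
# Each entry would be sent as client.messages.create(**call).
```

Because the document block is byte-for-byte identical across calls, only the first request should pay the write rate, provided the rest arrive before the cache expires.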
Low or negative ROI scenarios:
- Requests spaced more than 5 minutes apart. The cache expires and you pay the write premium on every call with no reads to amortize it. Check your actual request cadence before enabling.
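- Very short system prompts (<500 tokens). Prompts below the model's minimum cacheable length (1,024 tokens on Sonnet models per Anthropic's docs; check the figure for your model) are processed without caching even if you add cache_control, so there is nothing to save regardless of volume.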
- One-shot or batch jobs that touch each prompt once. No repeated reads = no benefit.
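One way to check your cadence before enabling anything is to replay request timestamps from your logs. The sketch below is a rough estimate under two assumptions of mine: every request shares the same cacheable prefix, and every request (hit or miss) restarts the 5-minute window. The function name and the sample log are hypothetical.

```python
from datetime import datetime, timedelta

CACHE_TTL = timedelta(minutes=5)


def estimated_hit_rate(timestamps: list[datetime]) -> float:
    """Fraction of requests that arrive within the TTL of the previous one."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return 0.0
    hits = sum(
        1 for prev, cur in zip(ordered, ordered[1:]) if cur - prev < CACHE_TTL
    )
    # The first request can never be a hit, so divide by all requests.
    return hits / len(ordered)


start = datetime(2026, 1, 1, 9, 0)
# Hypothetical log: three quick requests, a 20-minute gap, then one more.
log = [start + timedelta(minutes=m) for m in (0, 1, 3, 23)]
print(estimated_hit_rate(log))  # 0.5
```

A rate near zero means you would mostly be paying the write premium; a rate near one means caching should pay off.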
Multiple cache breakpoints
You can have up to 4 cache breakpoints per request. This lets you cache different parts of the prompt independently:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": BASE_RULES,  # Always the same
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": TOOL_DEFINITIONS,  # Changes rarely
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": dynamic_context,  # Changes per request: NOT cached
        },
    ],
    messages=[...]
)
The prefix caching rule is strict: the API caches everything up to the last marked breakpoint in sequence. If your dynamic context goes between two cached blocks, the second cache hit won't work — the prefix has to be identical. Always put dynamic content at the end.
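A toy model of the ordering rule, for intuition only. It compares two requests block by block and counts how many leading blocks are identical; the real API matches on the full prefix, and this is not how it is implemented. The function name and the sample block contents are made up.

```python
def matching_prefix_blocks(previous: list[str], current: list[str]) -> int:
    """Count how many leading blocks are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # everything after the first difference is outside the prefix
        count += 1
    return count


rules, tools = "BASE RULES...", "TOOL DEFINITIONS..."

# Dynamic context last: both static blocks stay in the shared prefix.
good_a = [rules, tools, "context for request A"]
good_b = [rules, tools, "context for request B"]
print(matching_prefix_blocks(good_a, good_b))  # 2

# Dynamic context in the middle: only the first block still matches.
bad_a = [rules, "context for request A", tools]
bad_b = [rules, "context for request B", tools]
print(matching_prefix_blocks(bad_a, bad_b))  # 1
```

In the second layout the tool definitions are identical in both requests, but they sit after a block that changed, so they can no longer be served from cache.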
The gotcha that will burn you
Whitespace and character-level identity matter.
The cache key is the exact token sequence of the prefix. If your system prompt is generated dynamically — say, you interpolate a user's name or account tier into it — each variation produces a different token sequence and you get zero cache hits even though 95% of the content is identical.
The fix: move all dynamic content to the end, after your last cache breakpoint. Put only truly static content (rules, tool definitions, examples) in the cached block.
# Bad: dynamic content inside the cached block breaks caching.
# Interpolating company_name makes every request's prefix unique.
system = f"""
You are an agent for {company_name}.
[2,800 tokens of static rules]
"""

# Good: static block cached, dynamic content appended outside the cache
STATIC_BLOCK = """
[2,800 tokens of static rules]
"""
system = [
    {"type": "text", "text": STATIC_BLOCK, "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": f"Current context: working for {company_name}."},
]
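A cheap way to catch accidental drift in the static block is to log a fingerprint of it with each request and alert when it changes. This is a sketch with a hypothetical helper name; it hashes characters rather than tokens, so it is only a proxy for what the API compares, but any character-level change will show up.

```python
import hashlib


def prefix_fingerprint(static_block: str) -> str:
    """Short SHA-256 fingerprint of the text that is meant to be cached."""
    return hashlib.sha256(static_block.encode("utf-8")).hexdigest()[:12]


base = "You are an agent.\n[static rules]\n"
same = "You are an agent.\n[static rules]\n"
trailing_space = "You are an agent. \n[static rules]\n"  # one stray space

print(prefix_fingerprint(base) == prefix_fingerprint(same))            # True
print(prefix_fingerprint(base) == prefix_fingerprint(trailing_space))  # False
```

If the fingerprint differs between two requests that you expected to share a cache, something upstream (a template, a formatter, a timestamp) is rewriting the block.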
Calculating your break-even
Before enabling caching, run this math:
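Let:
T = tokens in your cached block
P = write premium over a normal call = T * ($3.75 - $3.00) / M = T * $0.75/M
S = savings per read = T * ($3.00 - $0.30) / M = T * $2.70/M
Break-even reads = P / S = $0.75 / $2.70 ≈ 0.28 reads per cache write
The first call would have cost $3.00/M without caching anyway, so only the $0.75/M premium has to be earned back, and a single cache hit more than covers it. T cancels out, so the break-even doesn't depend on prompt size. In practice: if at least one more request arrives within 5 minutes of each cache write, caching is net positive. Evenly spaced traffic needs a bit more than 12 requests/hour to keep the cache warm; bursty traffic needs a higher average. At 4,000 requests/day you're averaging about 14 requests per 5-minute window, so with reasonably steady traffic nearly every call is a cache read.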
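The comparison can also be checked numerically for a few window sizes. This is a minimal sketch that only prices the cached block itself (not the user message or output tokens), assumes the first request in a window is a write and the rest are reads, and uses the rates quoted earlier; the function name is mine.

```python
def window_cost_usd(n_requests: int, cached_tokens: int, use_cache: bool) -> float:
    """USD cost of the cacheable block across all requests in one cache window."""
    if n_requests <= 0:
        return 0.0
    if not use_cache:
        return n_requests * cached_tokens * 3.00 / 1_000_000
    write = cached_tokens * 3.75 / 1_000_000                      # first request
    reads = (n_requests - 1) * cached_tokens * 0.30 / 1_000_000   # the rest
    return write + reads


for n in (1, 2, 14):
    without = window_cost_usd(n, 2_800, use_cache=False)
    with_cache = window_cost_usd(n, 2_800, use_cache=True)
    print(f"{n:>2} requests: ${without:.5f} uncached vs ${with_cache:.5f} cached")
```

With a single request per window caching costs more; from the second request onward it is cheaper, and the gap widens with every additional hit.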
Verifying it's working
Always instrument your cache usage. The response usage object tells you exactly what happened:
usage = response.usage
total_input = usage.input_tokens
# "or 0" guards against the fields being present but set to None
cache_writes = getattr(usage, 'cache_creation_input_tokens', 0) or 0
cache_reads = getattr(usage, 'cache_read_input_tokens', 0) or 0
# A healthy caching ratio: most calls should be reads, not writes
print(f"Cache write: {cache_writes} tokens (paid at $3.75/M)")
print(f"Cache read: {cache_reads} tokens (paid at $0.30/M)")
print(f"Regular: {total_input} tokens (paid at $3.00/M)")
If you're seeing mostly cache_creation_input_tokens and few cache_read_input_tokens, your request cadence is slower than 5 minutes or your prompt isn't actually static. Fix the content, not the caching setup.
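Single responses are noisy, so it helps to aggregate over many calls. Below is a small sketch of a running tally; `CacheStats` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real `response.usage` values so the example runs without an API key.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    writes: int = 0
    reads: int = 0
    uncached: int = 0

    def record(self, usage) -> None:
        """Add one response's usage counts to the running totals."""
        self.uncached += getattr(usage, "input_tokens", 0) or 0
        self.writes += getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.reads += getattr(usage, "cache_read_input_tokens", 0) or 0

    @property
    def read_ratio(self) -> float:
        """Share of cached-block tokens that were reads rather than writes."""
        cached = self.writes + self.reads
        return self.reads / cached if cached else 0.0


stats = CacheStats()
# Stand-in usage objects: one cache write followed by three cache reads.
stats.record(SimpleNamespace(input_tokens=200, cache_creation_input_tokens=2800,
                             cache_read_input_tokens=0))
for _ in range(3):
    stats.record(SimpleNamespace(input_tokens=200, cache_creation_input_tokens=0,
                                 cache_read_input_tokens=2800))
print(f"read ratio: {stats.read_ratio:.2f}")  # read ratio: 0.75
```

In a real service you would call `stats.record(response.usage)` after each request and export the ratio to whatever metrics system you already use.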
The bottom line
Prompt caching is one of those rare API features where the implementation cost is 30 minutes and the payoff is immediate and ongoing. It doesn't change what your agent does — it just changes what you pay for the same work.
If your agent makes more than ~20 calls/hour with a system prompt above the model's minimum cacheable length (1,024 tokens on Sonnet models), you should be caching. The cache_control block is a one-liner. The usage fields tell you instantly whether it's working.
If you're building reliable AI agents at production scale, the free Reliable Agent Field Guide covers reliability patterns, cost controls, and testing strategies: penloomstudio.com/field-guide.html