I noticed a pattern looking at three months of Anthropic invoices. The same 8 KB system prompt was getting billed full price on every request. Same instructions, same tool definitions, same RAG context, charged again every turn. The fix takes about ten lines of code and cuts the input bill by roughly 90 percent on cached tokens. This guide is the version of that fix I wish I had bookmarked a year ago.
TL;DR
| Question | Short answer |
|---|---|
| What does it do? | Stores a prefix of your prompt server-side so later requests skip re-encoding it |
| How much does it save? | Cache reads cost 10 percent of the base input rate, so up to 90 percent off cached tokens |
| What does it cost to write? | First write costs 1.25x base input (5-minute TTL) or 2x (1-hour TTL) |
| When does it pay off? | Any prefix reused at least twice within the TTL window |
| Smallest cacheable chunk? | 1024 tokens for Sonnet and Opus, 2048 tokens for Haiku |
| Where do I put the marker? | On the last block of the chunk you want cached |
1. What prompt caching actually is
When you send a Message to the Claude API, Anthropic encodes your full prompt every time. System instructions, tool schemas, retrieved documents, prior turns, every part. Caching adds an opt-in: you mark a content block as a cache breakpoint, and Anthropic stores the encoded state of everything up to that point. The next request that starts with the same exact bytes reads from the cache instead of recomputing.
Three things to internalize:
The cache is a prefix cache. Order matters. Same system prompt, same tools, same first message in the array. If anything in the prefix differs by even one token, you get a cache miss and pay full input price.
Each breakpoint is ephemeral with a TTL of either 5 minutes (default) or 1 hour. The clock resets every time you read the cache, so a chatty conversation keeps its cache warm without re-paying the write cost.
You get up to four breakpoints per request. Smart layering matters: tools at the deepest level, then system, then static context, then the conversation tail. Each breakpoint extends the cached prefix one layer outward.
2. The cost math, with numbers
For Claude Sonnet 4.6 at the public rate of three dollars per million input tokens, here is what one mid-size system prompt costs across ten requests in a five-minute window.
| Scenario | Per request | 10 requests | Effective rate |
|---|---|---|---|
| No caching, 8 KB prefix (~2k tokens) | $0.0060 | $0.060 | $3.00/Mtok |
| 5m cache, first miss + 9 hits | $0.0075 first, $0.0006 after | $0.0129 | $0.65/Mtok |
| 1h cache, first miss + 9 hits | $0.0120 first, $0.0006 after | $0.0174 | $0.87/Mtok |
The 5-minute TTL pays for itself after the second request. The 1-hour TTL pays for itself after the third. Use the 1-hour TTL only when the gap between calls regularly exceeds the 5-minute window, so the default cache would expire before the next read; otherwise stick with 5m and let each read renew it.
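The arithmetic is simple enough to sanity-check in a few lines. A minimal sketch of the table above; the constants are illustrative names, and the numbers (Sonnet's $3/Mtok base rate, the 1.25x and 2x write multipliers, the 0.1x read rate, the ~2k-token prefix) all come from the table itself.

```typescript
// Sanity-check of the cost table: one ~2k-token prefix across ten requests.
const BASE_PER_MTOK = 3.0;   // Sonnet base input rate, $/Mtok
const PREFIX_TOKENS = 2_000; // the "8 KB (~2k tokens)" prefix
const REQUESTS = 10;

function cost(perMtok: number, tokens: number): number {
  return (tokens / 1_000_000) * perMtok;
}

// Every request pays the full rate.
const noCache = REQUESTS * cost(BASE_PER_MTOK, PREFIX_TOKENS);

// One write (1.25x) plus nine reads (0.1x) inside the 5-minute window.
const fiveMin =
  cost(BASE_PER_MTOK * 1.25, PREFIX_TOKENS) +
  (REQUESTS - 1) * cost(BASE_PER_MTOK * 0.1, PREFIX_TOKENS);

// One write (2x) plus nine reads (0.1x) inside the hour.
const oneHour =
  cost(BASE_PER_MTOK * 2, PREFIX_TOKENS) +
  (REQUESTS - 1) * cost(BASE_PER_MTOK * 0.1, PREFIX_TOKENS);

console.log({ noCache, fiveMin, oneHour });
// ≈ matches the table: $0.060, $0.0129, $0.0174
```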
Output tokens are not affected. Caching only changes the input side.
3. The smallest useful implementation
Take the static parts of your prompt and tag the last block with cache_control. Everything before that block gets cached; everything after stays dynamic.
TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a code reviewer for a TypeScript monorepo.
[... 1500 more tokens of style guide, examples, repo conventions ...]`;

async function review(diff: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // 5m TTL by default
      },
    ],
    messages: [{ role: "user", content: diff }],
  });

  console.log({
    cache_creation: response.usage.cache_creation_input_tokens,
    cache_read: response.usage.cache_read_input_tokens,
    fresh_input: response.usage.input_tokens,
  });

  return response;
}
```
The first call returns a non-zero cache_creation_input_tokens and bills the 1.25x write multiplier on the system prompt. Every later call within 5 minutes returns a non-zero cache_read_input_tokens and bills the 0.1x read rate. The diff itself stays uncached because it sits after the breakpoint.
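A quick way to confirm the breakpoint is actually doing its job: call the function twice back to back and compare the usage fields. A small sketch built on the review() function above; verifyCacheWarm is my own helper name, and the exact token counts will vary, but the shape (creation on call one, reads on call two) is the thing to check.

```typescript
// Smoke test: the first call should report cache_creation_input_tokens > 0,
// the second call within 5 minutes should report cache_read_input_tokens > 0.
async function verifyCacheWarm(diff: string) {
  const first = await review(diff);
  const second = await review(diff);

  const wrote = (first.usage.cache_creation_input_tokens ?? 0) > 0;
  const read = (second.usage.cache_read_input_tokens ?? 0) > 0;

  if (!wrote) console.warn("first call wrote no cache entry -- prefix may be under the size floor");
  if (!read) console.warn("second call missed the cache -- check for prefix drift");

  return { wrote, read };
}
```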
Python
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a code reviewer for a TypeScript monorepo.
[... 1500 more tokens of style guide, examples, repo conventions ...]"""

def review(diff: str):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # 5m default
            }
        ],
        messages=[{"role": "user", "content": diff}],
    )
    print({
        "cache_creation": response.usage.cache_creation_input_tokens,
        "cache_read": response.usage.cache_read_input_tokens,
        "fresh_input": response.usage.input_tokens,
    })
    return response
```
For a 1-hour TTL, change the breakpoint to {"type": "ephemeral", "ttl": "1h"}. That is the only difference.
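In TypeScript the same change looks like this; a drop-in replacement for the system array in the earlier example, with nothing else moving.

```typescript
// Same block as before, pinned to the 1-hour TTL: 2x on the write, 0.1x reads.
const system = [
  {
    type: "text",
    text: SYSTEM_PROMPT,
    cache_control: { type: "ephemeral", ttl: "1h" },
  },
] as const;
```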
4. Layering breakpoints in a real request
The four-breakpoint cap is generous when you stack the layers right. Here is the pattern that has worked across every production prompt I have audited.
```
┌─────────────────────────────┐
│ tools (rarely change)       │  breakpoint 1
├─────────────────────────────┤
│ system prompt               │  breakpoint 2
├─────────────────────────────┤
│ static context / RAG docs   │  breakpoint 3
├─────────────────────────────┤
│ conversation history        │  breakpoint 4
├─────────────────────────────┤
│ current user turn           │  uncached
└─────────────────────────────┘
```
Each breakpoint extends the cached prefix down one layer. If the user adds a turn but the docs and tools and system stay the same, you pay full rate only on the new user turn and the model's response.
```typescript
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  tools: [
    { name: "search_docs", description: "...", input_schema: {/*...*/} },
    {
      name: "run_query",
      description: "...",
      input_schema: {/*...*/},
      cache_control: { type: "ephemeral" }, // breakpoint 1
    },
  ],
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" }, // breakpoint 2
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: RETRIEVED_DOCS,
          cache_control: { type: "ephemeral" } },   // breakpoint 3
        { type: "text", text: priorTurnSummary,
          cache_control: { type: "ephemeral" } },   // breakpoint 4
        { type: "text", text: currentUserMessage }, // uncached
      ],
    },
  ],
});
```
A nuance worth knowing: a single breakpoint actually creates two cache entries internally, one for the 5m TTL and one for the 1h TTL if you have configured both anywhere in the request. The usage object reports both with cache_creation.ephemeral_5m_input_tokens and cache_creation.ephemeral_1h_input_tokens.
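If you want that breakdown in your own logs, it is a couple of lines on the response from the call above. A sketch that assumes your SDK version exposes the cache_creation object with the per-TTL fields named in the previous paragraph; older versions may only surface the aggregate counter, which is why the optional chaining is there.

```typescript
// Log the per-TTL write breakdown alongside the aggregate counters.
const u = response.usage;
console.log({
  total_write: u.cache_creation_input_tokens,
  write_5m: u.cache_creation?.ephemeral_5m_input_tokens,
  write_1h: u.cache_creation?.ephemeral_1h_input_tokens,
  read: u.cache_read_input_tokens,
  fresh: u.input_tokens,
});
```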
5. The failure modes
Caching is unforgiving about exact-prefix matching. The four ways I have seen it silently miss in production:
A timestamp slipped into the system prompt. Someone added Today is ${new Date().toISOString()} at the top. Every request now has a unique prefix. Every request misses cache. Move the timestamp to the user turn, or round it to the nearest hour.
Trailing whitespace. Two engineers concatenated the prompt with + "\n" in different places. The bytes differ. The cache misses.
Tool order changed. Tools are part of the prefix. If you sort them by frequency or load them from a Map (insertion order), make sure the order is stable across processes.
Below the size floor. The minimum cacheable prefix is 1024 tokens on Sonnet and Opus and 2048 on Haiku, measured from the start of the prompt up to the breakpoint, not per block. A 600-token system prompt will not cache at all, and you will see no cache_creation_input_tokens in the response. Either pad it or accept the full rate.
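The first three misses share a root cause: the prefix bytes are not deterministic. A sketch of the guards I would put in front of the prompt builder; the type and helper names here are illustrative, not from the SDK, and the point is simply that anything dynamic gets rounded, normalized, or sorted before it enters the prefix.

```typescript
// Guards against the first three failure modes above.
type ToolDef = { name: string; description: string; input_schema: object };

// Round to the hour so a timestamp in the prefix changes 24 times a day,
// not once per request.
function coarseTimestamp(d: Date = new Date()): string {
  const hour = new Date(d);
  hour.setMinutes(0, 0, 0);
  return hour.toISOString();
}

// One canonical join, so two call sites cannot disagree about trailing newlines.
function buildSystemPrompt(parts: string[]): string {
  return parts.map((p) => p.trim()).join("\n\n");
}

// Sort by name so Map insertion order or load order cannot reshuffle the prefix.
function stableTools(tools: ToolDef[]): ToolDef[] {
  return [...tools].sort((a, b) => a.name.localeCompare(b.name));
}
```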
6. When to pick 1h over 5m
Decision rule that has held up:
| Use case | TTL |
|---|---|
| Interactive chat, agent loops | 5m |
| Background batch jobs every minute | 5m |
| Cron job every 15 minutes | 1h |
| Long-running review session, hourly digests | 1h |
| Single-shot one-off | no cache |
The 1-hour TTL costs 2x base on the write, so it needs at least two cache reads (three requests inside the hour) to beat paying the full rate, where the 5m write needs only one. If your traffic is spiky and unpredictable, 5m is the safer default.
7. The honest tradeoffs
Four caveats nobody mentions in the marketing copy.
Caching shifts cost; it does not remove it. Output tokens still bill at the full rate. If your bill leans on output (long generated reports, code synthesis), caching helps less than the 90 percent headline suggests.
Cache hit rate is observable but not always predictable. Anthropic does not guarantee a write will be readable on the next request; in practice it always lands for me, but the docs are clear that the cache write runs on a best-effort basis. Build with the assumption that any individual request might miss.
Tool definitions count as part of the prefix. Adding a new tool invalidates the cache for every prompt that uses tools. Plan tool schema changes during low-traffic windows.
No public API exists for manually evicting a cache. You either let the TTL expire or change the prefix. This matters when you ship a system-prompt update and want to confirm the old version no longer lives in the cache.
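The workaround I lean on: since any byte change to the prefix forces a miss, a version marker at the top of the system prompt acts as a manual eviction switch. A sketch; PROMPT_VERSION is an illustrative constant you would bump on deploy, not an SDK feature.

```typescript
// No eviction API, but changing the prefix bytes has the same effect: bump
// PROMPT_VERSION on deploy and every request after it misses the old entry
// and writes a fresh one. The stale entry ages out when its TTL expires.
const PROMPT_VERSION = "2025-06-reviewer-v3"; // illustrative; bump per deploy

const SYSTEM_PROMPT = `[prompt ${PROMPT_VERSION}]
You are a code reviewer for a TypeScript monorepo.
[... style guide, examples, repo conventions ...]`;
```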
8. The bottom line
Prompt caching is the highest-ROI single change you can make to a Claude API integration today. Eight extra lines of config, no architecture change, and a 90 percent cut on the largest line item in most invoices. The reason for the under-use is not technical; the docs treat caching as a feature flag, not as the default.
Set the breakpoint, log cache_read_input_tokens for a week, and watch what your bill does. If you are running tools, set four breakpoints, not one. If you are running interactive sessions, stick with the 5m TTL.
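If you want that week of logging to produce one number for a dashboard, here is a minimal sketch of the aggregation, assuming you collect the usage objects somewhere; summarize is my own helper, and the dollar rates are the Sonnet list prices from section 2 with the 5m write multiplier.

```typescript
// Fold a week of usage objects into a cache hit rate and a blended input rate.
type Usage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

function summarize(samples: Usage[]) {
  let fresh = 0, written = 0, read = 0;
  for (const u of samples) {
    fresh += u.input_tokens;
    written += u.cache_creation_input_tokens ?? 0;
    read += u.cache_read_input_tokens ?? 0;
  }
  const total = fresh + written + read;
  // Sonnet list prices: $3 fresh, $3.75 5m-TTL writes, $0.30 reads per Mtok.
  const dollars = (fresh * 3.0 + written * 3.75 + read * 0.3) / 1_000_000;
  return {
    hitRate: total ? read / total : 0,            // share of input served from cache
    effectivePerMtok: total ? (dollars / total) * 1_000_000 : 0,
  };
}
```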
The cheaper the input gets, the more the architecture starts to look like the cache is the prompt and the prompt is the cache. That shift is worth thinking about before the next round of API price changes lands.
GDS K S · thegdsks.com · building Glincker · follow on X @thegdsks
The bill that caching hides is not the next one; it is the one you would have paid in twelve months if you never set the breakpoint.