System-prompt caching alone cut repeat-call costs by half
Tool definitions cache separately, perfect for agent loops
Conversation history caching pays off after turn three
1-hour TTL beats the default 5 minutes for batch jobs
My Anthropic API bill dropped 70 percent last month and I did not change a single model. I changed where the cache breakpoints went. Here are the five patterns I now use on every Claude integration I ship.
Pattern 1: Cache The System Prompt First
The system prompt is the cheapest win and most people skip it. My agents run with a 4,000 token system prompt that explains the role, the output format, the safety rules, and a few examples. That prompt never changes inside a session. Before caching, I paid full input price for those 4,000 tokens on every single call. With an agent that loops 30 times to finish a task, that is 120,000 tokens of pure repetition.
The fix is one parameter. I add a cache_control block with type: "ephemeral" to the last content item in the system prompt array. The first call writes the cache and costs slightly more (cache writes carry a small premium). Every call after that reads the cache at roughly one tenth the input price.
Here is the rule I follow: the cached block has to be at least 1,024 tokens for Claude Sonnet, or it gets ignored silently. My 4,000 token prompt clears that easily. If your system prompt is short, this pattern does nothing, so do not bother adding the breakpoint to a 200 token instruction.
The order matters more than people expect. The cache works as a prefix. Everything before the breakpoint gets stored. Everything after it is read fresh. So I put the stable stuff (role, rules, examples) up top and the volatile stuff (user query, current date) down below the breakpoint. Reorder this wrong and your cache hit rate collapses because the prefix changes on every call.
One real number from my logs: a document-classification job that runs 2,000 times a day. The system prompt is 3,800 tokens. Caching it saved me around 6.8 million billed input tokens daily. That is the single largest line item I cut. If you only do one thing from this article, cache the system prompt. It took me four minutes to add and the savings showed up in the next billing window.
I covered the broader setup in Claude Blueprint, which walks through how I structure these calls from the start.
Pattern 2: Cache Tool Definitions Separately
Tool definitions are the second forgotten cost. If you run Claude as an agent with function calling, your tool schemas can be huge. I have one agent with 14 tools and the JSON schemas alone come to 9,200 tokens. Those definitions sit at the very front of the request, before the system prompt, before anything else. They are also completely static across a session.
You can place a cache_control breakpoint right after the last tool definition. Now the tool block and the system prompt both get cached, but as two separate prefixes if you want. In practice I set one breakpoint after tools and a second one after the system prompt. Anthropic lets you set up to four breakpoints per request, so use them.
Why split into two instead of one big block? Flexibility. Sometimes I swap the system prompt for an A/B test but keep the same tools. With separate breakpoints, the tool cache survives the system-prompt change. Only the part that actually changed gets re-billed at full price. The tool prefix keeps reading from cache.
The math here is brutal in your favor for agent loops. A single agent task might fire 20 to 40 model calls as it reasons, calls a tool, reads the result, reasons again. Every one of those calls resends the full tool definitions. With 9,200 tokens of tools and 30 calls, that is 276,000 tokens you would pay full price for. Cached, it is one write and 29 cheap reads.
I measured a customer-support agent over a week. It handled 1,400 sessions. Average 18 model calls per session. Tool definitions cached saved me an estimated 215 million billed input tokens that week. The cache write overhead across all those sessions was tiny by comparison because each session writes once.
The trap: if you generate tool schemas dynamically and the JSON key order shifts between calls, the prefix breaks. I now serialize tools with sorted keys so the bytes are identical every time. That one habit fixed a mysterious cache miss rate I chased for two days.
Pattern 3: Cache Conversation History As It Grows
This is the pattern that needs the most care and pays off the most in long chats. In a multi-turn conversation, the history grows every turn. Turn one sends 500 tokens. Turn ten might send 8,000 tokens of accumulated back-and-forth. Without caching, you pay full price for the entire growing history on every turn.
The trick is a moving breakpoint. After each assistant response, I move the cache_control marker to the latest stable point in the message list. The whole conversation up to that point becomes the cached prefix. The next user turn appends after the breakpoint and reads everything before it from cache.
There is real subtlety here. The cache is a prefix match. So the messages before the breakpoint have to be byte-identical to the previous call. If you trim, summarize, or reorder history mid-conversation, you invalidate the cache from that point forward and pay full price to rebuild it. I learned to never edit history in place. I only append.
My rule: place the breakpoint on the second-to-last message, not the last. That keeps a stable boundary while the newest exchange stays uncached and cheap to write next round. By turn three, the cache is reliably paying for itself. Before turn three the write overhead can slightly exceed the read savings, so for very short interactions I skip history caching entirely.
A concrete result: a coding assistant I run averages 11 turns per session, and history reaches about 14,000 tokens by the end. Caching the growing prefix cut the per-session input billing by roughly 60 percent compared to resending everything fresh. Across 600 sessions a month that was a serious chunk of the bill.
If you want the deeper context on running Claude as a long-lived agent, see Claude Agent SDK in Production. It goes through the message-management side in more detail. The short version: append-only history plus a moving breakpoint is the combination that works.
Pattern 4: Use The 1-Hour TTL For Batch And Bursty Work
The default cache lifetime is five minutes. Every cache read resets that clock, so an active conversation keeps the cache warm on its own. The problem is bursty or scheduled work. If your job pauses for seven minutes between batches, the cache expires and you pay the write premium again on the next batch.
Anthropic offers a one-hour TTL. You opt into it per cache block by setting the TTL on the cache_control object. The write costs a bit more than the five-minute write, but the cache survives gaps up to an hour with no reads at all.
I use the one-hour TTL on three kinds of work. First, scheduled batch jobs that run every few minutes against the same system prompt. The five-minute cache kept dying in the gaps; the one-hour cache stays alive across the whole batch window. Second, user sessions with long think-time, like a research tool where someone reads a 2,000 word answer for ten minutes before replying. The five-minute cache was always cold by their next message. Third, multi-stage pipelines where stage one finishes, a separate process does work for several minutes, then stage two reuses the same cached context.
The decision is simple arithmetic in your head. If your average gap between calls is under five minutes, the default is free and fine. If your gap is regularly between five minutes and an hour, the one-hour TTL almost always wins despite the higher write cost, because you avoid repeated full-price rebuilds. If your gap is over an hour, no caching strategy helps and you should accept the fresh cost.
One detail that bit me: the TTL applies to the specific breakpoint. So if I have a one-hour system prompt cache and a five-minute history cache, they expire independently. I had a job where the history cache died but the system cache lived, and the partial cache hit confused my cost dashboard until I aligned both to one hour.
For scheduling the social side of all this around batch windows, I run posts through Buffer so the publishing cadence does not collide with my heavy API runs.
Pattern 5: Structure The Whole Prompt For Cache Friendliness
The last pattern is not a single trick. It is a layout discipline that makes the other four work together. Order your request from most stable to least stable, top to bottom. Tools first (rarely change), then system prompt (stable per session), then conversation history (grows but append-only), then the current user input (always new). Place breakpoints at the boundaries between these layers.
This ordering matters because the cache only matches a contiguous prefix from the start. The moment one byte changes, everything after it is uncached. So you want the volatile content as far down as possible and the stable content as far up as possible. I think of it like a stack: heavy, unchanging stuff on the bottom, light, changing stuff on top.
Two things ruin cache hits even with perfect ordering. Timestamps and random IDs injected into the stable layers. I once put the current date into my system prompt, thinking it harmless. It changed the prefix at midnight every day and silently halved my hit rate. Now any time-sensitive value lives below the lowest breakpoint, in the user turn where it belongs.
The second ruiner is non-deterministic serialization. JSON libraries do not always emit keys in the same order. Floating point formatting can differ. I normalize everything that goes into a cached block: sorted keys, fixed number formatting, no trailing whitespace differences. After I enforced this, my cache hit rate went from a frustrating 71 percent to a steady 96 percent on the same workload.
Measure your hit rate. The API response returns cache_creation_input_tokens and cache_read_input_tokens. I log both on every call and compute the read ratio nightly. When that ratio drops, something in my stable layers changed and I go hunting. This monitoring is what turned caching from a one-time setup into a thing I trust. Background: Claude Blueprint goes through how I wire this logging into the rest of the stack.
If your image pipeline also hits an API, the same prefix logic applies; I use Magnific for upscaling and keep its static config out of the per-image payload for the same reason.
Bottom Line
Caching is the highest-return change I have made to my Claude bill, and it touched zero of my actual logic. The five patterns stack: cache the system prompt, cache tool definitions on their own breakpoint, cache conversation history with a moving append-only marker, reach for the one-hour TTL when your calls are bursty, and lay out every request from stable to volatile so the prefix holds.
The combined effect on my workload was a 70 percent drop in billed input tokens. Most of that came from the system prompt and tool definition patterns, because agent loops repeat those on every single call. The history and TTL patterns cleaned up the rest.
Start with the system prompt this week. It is one parameter and four minutes of work. Then add tool caching if you run agents. Watch your cache_read_input_tokens climb, and let the number tell you whether the boundaries are placed right. If you want the full picture of how I structure Claude work end to end, Claude Blueprint is where I keep the running notes.
This article contains affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. (Ad)
Top comments (0)