If you're running AI agents in production, there's a cost you're probably not thinking about.
Every turn in an agentic conversation sends the full prompt to the model. That includes the system instructions, all the tool definitions, any project context that was loaded earlier, and the entire conversation history. The model processes all of it. From the top. Every single time.
For a quick two-turn interaction, this doesn't matter much. But for a 50-turn coding session where the system prompt alone is 20,000 tokens? That's 1 million tokens of repeated computation across the session, all billed at full input price, all producing zero new insight. The model already processed that system prompt 49 turns ago. It's just doing it again because nothing told it not to.
This is the problem prompt caching solves. And Claude Code is probably the best case study of how to do it right.
The Two Parts of Every Prompt
The first thing to understand is that not all tokens in a prompt are created equal.
Look at any agentic API call and you'll see two distinct layers:
The foundation. This is everything that stays the same from turn to turn. System instructions, tool schemas, project-level context like a CLAUDE.md file, behavioral rules. If you looked at turn 1 and turn 47 side by side, this part would be identical.
The conversation. This is everything that's different each turn. The user's latest message, tool call results, file contents that were just read, terminal output. This grows with every interaction and is genuinely new information the model needs to process.
The entire trick behind prompt caching is recognizing that the foundation doesn't need to be reprocessed. You compute it once, store the result, and reuse it on every subsequent turn. The model only does fresh work on the conversation layer.
What's Actually Being Cached: The Transformer Angle
This isn't just skipping a string comparison. To understand why caching cuts costs so dramatically, you need to know what the model does when it reads a prompt.
LLM inference has two stages:
Prefill: the model takes your entire input and runs it through dense matrix multiplications, token by token, building an internal representation. This is computationally expensive and it's where most of the time and cost goes.
Decode: the model generates its response one token at a time, mostly just reading from the state it already built during prefill.
During prefill, the model computes three vectors for every token: Query, Key, and Value. These are the building blocks of the attention mechanism, how the model figures out which parts of the input matter for which other parts.
The important property: Key and Value vectors for any given token only depend on the tokens before it. They're deterministic. If the input is the same, the output is the same.
So once you've computed the Key-Value pairs for a 20,000-token system prompt, you can store them. Next time a request comes in with that same prefix, you skip the entire prefill computation for those 20,000 tokens and go straight to processing the new content.
Anthropic's infrastructure does this by hashing the input prefix. Same hash, same cached tensors, no recomputation. Different hash (even one byte different), full recomputation.
The Economics
Here's where this gets concrete. Anthropic's caching pricing has three tiers:
| Operation | Multiplier | What it means |
|---|---|---|
| Cache reads | 0.1x base input price | 90% discount on every cached token |
| 5-minute cache writes | 1.25x base input price | Small premium to store the KV tensors |
| 1-hour cache writes | 2x base input price | Extended TTL for longer sessions |
For Claude Sonnet 4.6 ($3/MTok base input), here's what that looks like in practice:
Standard input: $3.00 / MTok
Cache read: $0.30 / MTok (90% savings)
5-min cache write: $3.75 / MTok (25% premium, one-time)
1-hour cache write: $6.00 / MTok (2x premium, one-time)
A cache hit costs 10% of standard input. That means caching pays for itself after just one subsequent read for the 5-minute duration. For a 50-turn session reusing a 20,000-token prefix, the savings compound on every single turn.
Tracking a Real Claude Code Session
Theory is nice. Let's trace the actual token economics of a single debugging session to see where the money goes.
You open Claude Code in a Next.js project. The moment the session starts, it loads the system prompt, all available tool definitions (file read, file write, bash, grep, glob, and others), and your project's CLAUDE.md. That initial payload lands somewhere around 20,000 tokens. Every single one of those tokens is processed fresh. This is the only time you pay full price for them.
You type:
"There's a race condition in the checkout flow. Orders are occasionally duplicating when users double-click the submit button."
Claude Code doesn't just start editing files. First, it spins up an Explore subagent to understand the codebase. That subagent reads your API routes, checks your database schema, looks at your order processing logic, and examines the frontend form handler. All of those file reads and grep results get appended to the growing conversation as tool outputs.
Here's the key: none of that new content touches the 20,000-token prefix. The system prompt, the tool definitions, the CLAUDE.md, all of that is still sitting in cache from turn one. Every subsequent API call reads those 20,000 tokens at $0.30/MTok instead of $3.00/MTok. You're only paying full price for the new stuff: your message and the tool outputs.
The Explore subagent finishes and hands its findings back to the main agent. But it doesn't dump 15,000 tokens of raw file contents into the conversation. It passes a condensed summary: which files are relevant, what the current logic does, where the race condition likely lives. This is a deliberate design choice. Keeping the dynamic tail compact means the cache ratio stays high.
Now the Plan subagent kicks in. It takes the summary, reasons through the fix (idempotency key on the frontend, deduplication check on the API, database unique constraint as a safety net), and produces a step-by-step implementation plan. You approve it. Claude Code starts writing code.
Over the next 15 minutes, you go back and forth. It writes the idempotency logic, you ask it to also handle the case where the page refreshes mid-checkout, it adjusts. Each of these turns adds new content to the dynamic tail. But the foundation, those 20,000 tokens, is read from cache every single time. Each cache hit also resets the TTL, so the cache never expires as long as you keep working.
By the end of the session, you've gone through maybe 25 turns. The total tokens processed easily exceeds 1.5 million. But if you run /cost, the bill tells a very different story than 1.5M tokens at full price. The vast majority were cache reads at a 90% discount.
That's the difference between a $4.50 session and a $0.90 session. For one debugging task.
The Production Numbers
This isn't theoretical. Claude Code's production metrics:
| Metric | Value |
|---|---|
| Cache hit rate | 92% |
| Cost reduction | 81% |
| First-token latency reduction | 79% |
In active sessions, 95%+ of input tokens are typically cache hits, billed at 0.1x the base price. Out of 400K tokens in a session, maybe 20K to 40K are billed at full price.
Without prompt caching, a long Opus coding session (100 turns with compaction cycles) can cost $50 to $100 in input tokens. With it, $10 to $19.
The One Thing That Will Tank Your Cache Hit Rate
Prompt caching has a gotcha that trips up almost everyone the first time.
The cache key is a hash of the exact byte sequence of your prompt prefix. Not the meaning. Not the content. The exact bytes, in the exact order. If you rearrange two paragraphs in your system prompt, the hash changes. Full cache miss. Everything recomputed at full price.
This has three practical consequences:
1. Don't change your tool set mid-session
Tool definitions are part of the cached prefix. If you add a tool on turn 12 that wasn't there on turn 1, every token after the change point is a cache miss. Load everything you might need at the start.
2. Don't switch models mid-conversation
Each model has its own cache. Moving from Opus to Sonnet to save money on a later turn means rebuilding the cache from zero for the new model. You'll spend more on the rebuild than you saved on the cheaper rate.
3. Don't edit the system prompt to update state
If your agent needs to track something (like "user is now authenticated"), don't inject that into the system prompt. Append it as a note in the next user message instead. The system prompt stays byte-identical, the cache stays valid.
Claude Code follows all three of these rules religiously. That's how it maintains a 92% hit rate across millions of sessions.
Applying This to Your Own Agents
If you're building on the Anthropic API, the same principles apply. Here's the practical playbook.
Prompt structure matters
Put the most stable content at the top:
1. System instructions and rules (most stable, cached first)
2. Tool definitions (stable for session duration)
3. Reference documents / retrieved context
4. Conversation history + tool outputs (dynamic, grows each turn)
The cache works from the top down. Everything above the first change point stays cached. Everything below it gets recomputed.
Use auto-caching
Anthropic's API now supports automatic cache management. You add a single cache_control field to your request and the system handles breakpoint placement for you:
{
"model": "claude-sonnet-4-6-20260514",
"max_tokens": 1024,
"cache_control": { "type": "ephemeral" },
"system": "Your system prompt here...",
"messages": [...]
}
It moves the cache boundary forward as the conversation grows and more content becomes stable. Before this existed, you had to manually calculate token boundaries. Getting it wrong meant missing the cache entirely.
Compact without breaking the cache
When your conversation hits the context limit and you need to summarize it down, keep the system prompt and tool definitions identical. Add the compaction instruction as a new user message. The cached prefix stays valid. You only pay fresh tokens for the compaction prompt itself.
Monitor your hit rate
Every API response includes three fields you should be tracking:
{
"usage": {
"cache_creation_input_tokens": 15200,
"cache_read_input_tokens": 184800,
"input_tokens": 3400
}
}
-
cache_creation_input_tokens: tokens written to cache (first time processing) -
cache_read_input_tokens: tokens read from cache (the cheap ones) -
input_tokens: tokens processed at full price (no cache available)
The ratio of cache_read_input_tokens to total input tokens is your cache efficiency score. Track it like you'd track uptime. A sudden drop means something in your prompt structure changed and invalidated the cache.
Key Takeaways
Prompt caching isn't a setting you flip on and forget about. It's an architectural pattern that has to be baked into how your agent constructs its prompts, manages its tools, and handles long conversations.
Claude Code shows what this looks like when it's done well: 92% cache hit rate, 81% cost reduction, built on stable prefixes, subagent summarization, and cache-aware context management.
If you're building agents and not thinking about your cache architecture, you're leaving most of your budget on the table.
We break down AI infrastructure and tooling like this regularly at Web After AI. Practical, no hype, explained so it actually makes sense.
Top comments (0)