The team behind Claude Code just explained — publicly, casually, in a tweet — that prompt caching is the single piece of infrastructure that makes the entire product viable. Not a nice optimization. Not a cost-saving hack. The load-bearing wall.
TL;DR: Every time you send a message to Claude (API or Claude Code), the system checks if it already processed the beginning of your prompt. If yes: 10x cheaper, 85% faster. If no: full price, full wait. The Claude Code team monitors this metric like a heartbeat and declares incidents when it drops. Most devs are breaking their cache without knowing it. This article explains what prompt caching is, why it matters, and the mistakes that are costing you money right now.

Their motto? “Cache Rules Everything Around Me.”
If you grew up on Wu-Tang, you recognize the reference. If you didn’t — CREAM was about cash ruling everything. The Claude Code version is about KV tensors ruling everything. Same energy, different currency. And just like the original, if you ignore the message, you’re gonna pay for it.
Most devs I talk to don’t know what prompt caching is. They definitely don’t know they’ve been benefiting from it every single time they open a Claude Code session. And they absolutely don’t know that the way they structure their prompts can either hit the cache or miss it — costing them speed, money, or both.
This is the simplified version.
And yes, you’re probably doing it wrong.
$0.045 vs $0.005 — same request, same model, same result
I’ll start with the number that made me actually care about this topic.
One of my Convex projects has an AI assistant — ~6,000 token system prompt, 12 tool definitions (Clerk auth, Supabase queries, a few custom actions), typical conversations running 15–20 turns. Standard stuff. Nothing exotic.
I pulled my API usage after a week and ran the math on a 20-turn conversation:
- Without caching: each late-conversation request processes ~15,000 tokens of context at full price. ~$0.045 per request.
- With caching: only the last ~500 tokens (new message + previous response) need fresh computation. Everything else hits the cache. ~$0.005 per request.
Nine. Times. Cheaper.
Scale that to 1,000 conversations a day and you’re looking at $1,350/month vs $150/month. I didn’t discover prompt caching through a blog post or a conference talk. I discovered it by staring at a Stripe dashboard wondering why my test environment was burning through credits like a kid with a birthday gift card at GameStop. Anyway — the point is, I would have saved myself two weeks of confusion if someone had explained this to me the way I’m about to explain it to you.
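Here's the math above as a runnable sketch — assuming Sonnet-class pricing ($3 per million input tokens) and the standard 0.1x rate for cache reads. The exact ratio shifts with your model and how much of the tail is fresh on each turn:

```typescript
// Back-of-envelope cost for one late-conversation request.
// Assumed pricing (Sonnet-class): $3 per million input tokens,
// cache reads at 0.1x that rate. Swap in your model's real numbers.
const INPUT_PER_MTOK = 3.0;
const CACHE_READ_PER_MTOK = INPUT_PER_MTOK * 0.1;

const contextTokens = 15_000; // system prompt + tools + history
const freshTokens = 500;      // new message + previous response

// No caching: every token billed at the full input rate.
const uncached = (contextTokens / 1e6) * INPUT_PER_MTOK;

// With caching: only the tail is fresh; the prefix is a cheap cache read.
const cached =
  ((contextTokens - freshTokens) / 1e6) * CACHE_READ_PER_MTOK +
  (freshTokens / 1e6) * INPUT_PER_MTOK;

console.log(`uncached: $${uncached.toFixed(4)} per request`); // $0.0450
console.log(`cached:   $${cached.toFixed(4)} per request`);
```

Multiply either number by requests per day and the monthly gap writes itself.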
What prompt caching actually is (the Wu-Tang version)
Every time you call the Claude API — or every time Claude Code runs an action under the hood — a prompt gets assembled. That prompt has four layers:
- The system prompt (your instructions, personality, constraints)
- Tool definitions (all the functions Claude can call)
- Conversation history (previous messages in the session)
- The new user message (what you just typed)
Without caching, Claude reprocesses every single one of those tokens from scratch. Every time. Your 8,000-token system prompt? Reprocessed. Your 15 tool definitions? Reprocessed. The 40 previous messages in the conversation? All of it, from zero — like a goldfish reading the same book for the first time, forever.
Prompt caching stores the computed Key-Value tensors (the heavy math the model does during the “prefill” phase) for an exact prefix of your prompt. Next request with the same prefix? Skip the math, reuse the stored result.
The cache lives for 5 minutes by default. The newer Claude 4.5 models support a 1-hour TTL at 2x write cost. And the matching is exact — one different character in your prefix and the entire cache is invalidated. It’s like a fingerprint scanner that also checks your mood.
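A toy simulation of that exact-prefix rule — this is not Anthropic's implementation, just the failure mode it implies:

```typescript
// Toy model of exact-prefix cache matching. A "hit" means only the
// tail after the cached prefix needs fresh prefill compute.
function prefill(prompt: string, cachedPrefix: string | null) {
  if (cachedPrefix !== null && prompt.startsWith(cachedPrefix)) {
    return { freshTokens: prompt.length - cachedPrefix.length }; // hit
  }
  return { freshTokens: prompt.length }; // miss: recompute everything
}

const system = "You are a code review assistant.";

// Same prefix -> only the new tail is processed.
const hit = prefill(system + " Review this diff.", system);

// One injected timestamp at the front -> the prefix no longer matches,
// and the entire prompt is reprocessed at full price.
const miss = prefill("[2026-02-19] " + system + " Review this diff.", system);
```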

CREAM. Cache Rules Everything Around Me. Every request you send either hits the cache and pays 0.1x the input price, or misses it and pays full price. There is no in-between.
The 3 mistakes that are burning your money
This is the part where I stop being nice. A research paper called “Don’t Break the Cache” tested caching strategies across OpenAI, Anthropic, and Google with 500+ agent sessions. Combined with what the Claude Code team shared, the pattern is clear: most devs break their cache the same three ways.
Mistake #1: Injecting dynamic crap into your system prompt.
Timestamps. Request IDs. User-specific data. Session tokens. Every time you shove something dynamic into your system instructions, the entire prefix becomes unique. Cache miss. Full price. Every single time.
I’ve done this. I had a "Current time: ${new Date().toISOString()}" in my system prompt for "context." Cost me probably $40 in unnecessary API calls before I realized. The fix took 8 seconds — move the timestamp to the user message instead.
```json
{
  "system": [
    {
      "type": "text",
      "text": "You are a code review assistant for a Next.js + Convex app...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Current time: 2026-02-19T14:30:00Z\n\nReview this function..."}
  ]
}
```
The cache_control tag tells Anthropic: "cache everything up to here." Static system prompt = cached. Dynamic timestamp in the user message = doesn't break the prefix. One restructure. 90% savings on that chunk.
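The same restructure as application code — a minimal sketch, where `buildRequest` is a hypothetical helper of mine, not an SDK function:

```typescript
// Keep the cacheable prefix static; dynamic values ride in the user
// message, after the cached prefix. `buildRequest` is illustrative.
type CacheControl = { type: "ephemeral" };
type SystemBlock = { type: "text"; text: string; cache_control?: CacheControl };

// Static: byte-identical on every request, so the prefix always matches.
const SYSTEM_PROMPT =
  "You are a code review assistant for a Next.js + Convex app...";

function buildRequest(userText: string) {
  const system: SystemBlock[] = [
    { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ];
  return {
    system,
    messages: [
      // Dynamic timestamp lives here — it never touches the cached prefix.
      {
        role: "user",
        content: `Current time: ${new Date().toISOString()}\n\n${userText}`,
      },
    ],
  };
}
```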
If you want the full framework for structuring system prompts that don’t just cache well but actually produce reliable output, I wrote about Prompt Contracts — same philosophy, different angle.
Mistake #2: Reordering tools between requests.
Anthropic processes the cacheable prefix in a specific order: tools → system → messages. If your code assembles tool definitions from an object, a plugin registry, or any source whose iteration order isn’t pinned down, the order can shuffle between requests. Different order = different prefix = cache miss. You didn’t change a single tool. You didn’t add or remove anything. You just let serialization order cost you money.
Fix: sort your tools alphabetically or by a fixed index before sending. Dead boring. Extremely profitable. And if you’re wondering whether you even need all those tool definitions in the first place — CLIs might be a better bet than MCPs. Fewer tools = smaller prefix = easier cache hit.
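The fix really is a one-liner — a sketch with illustrative tool shapes:

```typescript
// Sort tool definitions by name before every request so the serialized
// prefix is byte-identical across calls. Tool shapes are illustrative.
type Tool = { name: string; description: string };

function stableTools(tools: Tool[]): Tool[] {
  return [...tools].sort((a, b) => a.name.localeCompare(b.name));
}

// Same set of tools, different insertion order...
const runA = stableTools([
  { name: "query_db", description: "" },
  { name: "auth_check", description: "" },
]);
const runB = stableTools([
  { name: "auth_check", description: "" },
  { name: "query_db", description: "" },
]);

// ...identical serialization, so the prefix matches and the cache hits.
const identical = JSON.stringify(runA) === JSON.stringify(runB); // true
```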
Mistake #3: Not using breakpoints at all (Anthropic-specific).
Unlike OpenAI (which caches automatically for prompts over 1,024 tokens), Anthropic requires you to explicitly mark what to cache with cache_control breakpoints. If you don't add them, nothing gets cached. You're paying full price and wondering why Claude feels slow.
For multi-turn conversations, you get up to 4 breakpoints total. The research found the optimal split: 1 on the system prompt, 3 on recent user messages. This caches both your static instructions and your recent conversation history. Without the user message breakpoints, heavy tool use between turns creates uncached gaps that keep growing — like a memory leak but for your wallet.
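A sketch of that placement logic, with deliberately simplified message shapes — in the real Messages API, `cache_control` attaches to content blocks and the SDK types differ, but the placement strategy is the same:

```typescript
// Place breakpoints as the research suggests: 1 on the system prompt
// (not shown here) + 3 on the most recent user messages. Simplified types.
type Msg = {
  role: "user" | "assistant";
  content: string;
  cache_control?: { type: "ephemeral" };
};

function placeBreakpoints(messages: Msg[]): Msg[] {
  const out: Msg[] = messages.map((m) => ({ ...m, cache_control: undefined }));
  let remaining = 3; // 4 total, minus the one spent on the system prompt
  // Walk backwards so the breakpoints land on the newest user turns.
  for (let i = out.length - 1; i >= 0 && remaining > 0; i--) {
    if (out[i].role === "user") {
      out[i].cache_control = { type: "ephemeral" };
      remaining--;
    }
  }
  return out;
}
```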
Why your Claude Code sessions feel fast (until they don’t)
Quick detour for the subscription crowd. If you’re on Claude Code (not the API), you don’t configure any of this manually. The harness handles cache optimization — it’s literally what the team spends their engineering time on, and apparently declares emergencies about.
But understanding how it works explains a few things you’ve probably noticed:
The first message in a new session is always slower. That’s the cache write. The system prompt, your CLAUDE.md, all the tool definitions — getting computed and stored for the first time. Every message after that benefits from the cache. The cold start is the warm-up.
Switching projects mid-session feels sluggish. Different project context = different system prompt = cache miss. The system has to build a new cache from scratch. Not a bug. Just physics.
Long sessions generally feel fast. The conversation history keeps growing but mostly hitting the cache. Turn 30 reuses the cached prefix from turns 1–29 and only processes the new stuff. This is why the Claude Code team built their entire architecture around this — without caching, your 50th message would process 50x more tokens than your first. The product would be dead by message 15. Like a Slack thread that loads every previous message from scratch every time someone types “sounds good” — but I digress.
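A toy calculation makes the "dead by message 15" claim concrete — assumed per-turn sizes, not real measurements:

```typescript
// Without caching, turn N reprocesses the entire history, so total
// prefill work grows quadratically with conversation length.
const PER_TURN_TOKENS = 500; // new tokens added each turn (assumed)
const BASE_CONTEXT = 7_000;  // system prompt + tools (assumed)

function totalProcessed(turns: number, cached: boolean): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    const history = BASE_CONTEXT + (n - 1) * PER_TURN_TOKENS;
    // Cached: history is a cheap prefix read; only new tokens are prefilled.
    // Uncached: the whole thing is prefilled again, every single turn.
    total += cached ? PER_TURN_TOKENS : history + PER_TURN_TOKENS;
  }
  return total;
}

console.log(totalProcessed(50, false)); // ~40x more prefill work...
console.log(totalProcessed(50, true));  // ...than the cached session
```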
And sometimes it randomly slows down. If the team ships a change that accidentally modifies the prompt assembly order, or if an internal update shifts a tool definition, every user’s cache breaks simultaneously. That’s the SEV. That’s the pager going off. “Cache Rules Everything Around Me” isn’t a cute motto. It’s an operations manual.
What this actually means for you
If you’re building on the Claude API: go check your system prompts right now. If there’s anything dynamic in them — a date, a user ID, a random seed, anything that changes between requests — move it to the user message. Add cache_control breakpoints. Sort your tools. Do it today. The ROI is immediate and embarrassing. I learned this the hard way when Anthropic killed my $200/month OpenClaw setup — once you rebuild on API instead of a subscription, every dollar of optimization matters.
If you’re using Claude Code on a subscription: you don’t need to do anything. But now you understand why the team treats prompt caching like the oxygen supply on a space station. And next time your first message takes 6 seconds and your tenth message takes 1.5, you’ll know exactly what happened.
Either way, CREAM. Cache rules everything around you, whether you know it or not. The only question is whether you’re on the cheap side or the expensive side.
The devs who understand their infra ship cheaper. The ones who don’t subsidize everyone else’s rate limits.
Next up: Anthropic just killed third-party harnesses using Claude subscriptions. My $200/month OpenClaw setup died overnight. I rebuilt the whole thing for $15 — and prompt caching is half the reason it works.