By Jessie, COO at EvoLink
You open Claude Code at 9am. By noon, you're rate-limited. Your colleague does twice the work and still has quota left at 5pm. Same Max subscription. What's going on?
I ran into this exact situation and went digging. Turns out Anthropic published an internal engineering blog — "Lessons from building Claude Code: Prompt Caching is Everything" — that explains the whole thing. The short version: your daily habits are probably destroying your cache hit rate, and that's costing you 10-20x more tokens per message than necessary.
Here's what I learned and what I changed.
The Core Mechanic: Prefix Caching
Every request Claude Code sends to the model follows this structure:
System prompt + Tool definitions → Project docs (CLAUDE.md) → Session context → Messages
The API caches this sequence from the front. On the next request, if the prefix matches what was cached before, the prior computation is reused. A cache hit costs one-tenth of the normal price for those tokens.
But if a single byte in the prefix changes, everything from that point onward is invalidated and recomputed at full price.
The ordering is intentional. Anthropic's design principle: the less something changes, the earlier it goes. System prompt and tool definitions rarely change — they sit at the front. CLAUDE.md changes occasionally — middle. Messages change every turn — last. Each new turn just appends to the end. Everything before it stays cached.
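To make the layering concrete, here's a minimal sketch using the public Anthropic Messages API and its cache_control breakpoints. This is not Claude Code's internal code; the model name, tool definition, and prompt text are placeholders, but the shape is the same: stable blocks first, marked cacheable, messages last.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# Stable prefix, ordered by how rarely it changes. Each block marked with
# cache_control is a cache breakpoint: identical bytes on the next request
# are read from cache instead of being reprocessed at full price.
system_blocks = [
    {"type": "text", "text": "You are a coding assistant."},               # system prompt
    {
        "type": "text",
        "text": "Project conventions go here (the CLAUDE.md contents).",   # placeholder
        "cache_control": {"type": "ephemeral"},
    },
]

tools = [
    {
        "name": "read_file",  # illustrative tool, not Claude Code's real definition
        "description": "Read a file from the project.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "cache_control": {"type": "ephemeral"},  # caches the tool definitions
    },
]

# Volatile suffix: messages change every turn, so they sit at the very end.
messages = [{"role": "user", "content": "Why is the build failing?"}]

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=system_blocks,
    tools=tools,
    messages=messages,
)

# usage.cache_read_input_tokens shows how much of the prefix came from cache.
print(response.usage)
```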
Four Things That Kill Your Cache
1. Switching Models Mid-Conversation
This one hurts the most. You're mid-session with Opus, a simple task comes up, you run /model to switch to Haiku, handle it, switch back.
The cache is bound to the model. One switch = all accumulated cache invalidated, rebuilt from scratch. The rebuild cost often exceeds what letting Opus answer the simple question would have cost.
Anthropic's internal approach: keep one model for the main conversation. When a smaller model is needed, use a sub-agent — independent context and cache, does its work, passes the result back without touching the main session's cache chain.
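Here's what that pattern can look like in plain API terms. This is an illustrative sketch, not Anthropic's implementation: the main session keeps its model and its cached prefix, a one-off request to a smaller model handles the side task in its own context, and only the text result flows back. Model names and the helper function are placeholders.

```python
import anthropic

client = anthropic.Anthropic()

MAIN_MODEL = "claude-opus-4-1"               # placeholder main-session model
SUB_AGENT_MODEL = "claude-3-5-haiku-latest"  # placeholder small model

SYSTEM = "You are a coding assistant."  # stable prefix of the main session
main_history = []                       # messages accumulated in the main session

def ask_sub_agent(task: str) -> str:
    """Run a side task on a cheaper model in its own context.

    The sub-agent request has its own (small) prefix and its own cache;
    it never touches the main session's cached prefix or its model.
    """
    reply = client.messages.create(
        model=SUB_AGENT_MODEL,
        max_tokens=512,
        system="You answer small, self-contained questions briefly.",
        messages=[{"role": "user", "content": task}],
    )
    return reply.content[0].text

# Delegate the quick question, then feed only its result back into the main turn.
note = ask_sub_agent("What does HTTP status 418 mean?")
main_history.append({
    "role": "user",
    "content": f"A helper looked this up: {note}\n\nNow continue the refactor.",
})

response = client.messages.create(
    model=MAIN_MODEL,      # never changes mid-session, so the prefix cache survives
    max_tokens=1024,
    system=SYSTEM,
    messages=main_history,
)
```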
2. Changing Tool Configuration Mid-Session
Tool definitions are part of the cached prefix, so adding an MCP tool, removing one, or updating its parameters breaks the chain.
This is why Claude Code keeps tool definitions in place even when unused. The cost of extra definition tokens is negligible compared to a full cache invalidation.
Plan Mode follows the same logic: instead of removing execution tools when entering planning mode, it adds EnterPlanMode/ExitPlanMode as special tools. The tool set never changes. The cache stays valid.
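A toy illustration of that design choice (not actual Claude Code source): the tool list is fixed at session start, plan-mode tools included, and mode changes only affect which calls are honored, never the cached definitions.

```python
# Fixed tool set, defined once at session start and never mutated afterwards.
# The plan-mode tools are in the list from the beginning, so toggling plan mode
# never changes the cached tool definitions. Tool names here are illustrative.
ALL_TOOLS = ["ReadFile", "EditFile", "RunCommand", "EnterPlanMode", "ExitPlanMode"]

plan_mode = False  # flipped by EnterPlanMode / ExitPlanMode calls

def is_allowed(tool_name: str) -> bool:
    """Gate execution tools by mode instead of removing their definitions."""
    if plan_mode and tool_name in {"EditFile", "RunCommand"}:
        return False  # refused at execution time; the cached definition is untouched
    return tool_name in ALL_TOOLS
```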
For users with many MCP tools, Claude Code uses lazy loading: start with lightweight stubs (tool name + one-line description), pull full schemas only when the model actually needs to call a tool.
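Roughly, and purely as an assumption about how such a scheme could be wired (the post doesn't show Claude Code's internals): stubs live in the stable prefix, and the full schema is delivered on demand as message content so the prefix never changes.

```python
from dataclasses import dataclass

@dataclass
class ToolStub:
    """Lightweight entry kept in the cached prefix: name plus a one-line description."""
    name: str
    description: str

# Stubs sit in the stable prefix; full JSON schemas stay out of it.
STUBS = [
    ToolStub("jira_search", "Search Jira issues."),
    ToolStub("jira_create", "Create a Jira issue."),
]

# Hypothetical store of full schemas; in practice this would come from the MCP server.
FULL_SCHEMAS = {
    "jira_search": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def schema_as_message(name: str) -> dict:
    """Deliver the full schema inside a message, leaving tool definitions (and cache) untouched."""
    return {
        "role": "user",
        "content": f"<tool-schema name={name!r}>{FULL_SCHEMAS[name]}</tool-schema>",
    }

print(schema_as_message("jira_search"))
```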
3. Opening New Sessions Constantly
Every fresh claude invocation starts the cache from zero. If your habit is "ask two questions, quit, reopen," you never accumulate any cache benefit.
4. Switching Between Accounts
Cache is isolated per account. Rotating through account pools resets the cache each time.
What to Do Instead
Keep conversations long. Longer conversation = thicker cache = cheaper messages toward the end. Stop opening new sessions unnecessarily.
You might worry about context window overflow. Don't. Claude Code has built-in compaction: automatic history compression when the context gets too long. Anthropic designed Cache-Safe Forking: the compaction request reuses the exact same system prompt and tool definitions, sharing the same cache chain. The only new cost is the compression instruction itself.
Long conversations don't get more expensive. They get cheaper.
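Here's a sketch of why the compaction turn is cheap, under the same assumptions as the earlier examples and again not Claude Code's actual code: the request reuses byte-identical system blocks and tools, so its prefix is a cache read; only the compression instruction and the resulting summary are new tokens.

```python
def compact(client, model, system_blocks, tools, history):
    """Summarize a long history while reusing the session's cached prefix.

    Assumed helper, not part of any SDK. The request below shares byte-identical
    system blocks and tool definitions with normal turns, so that prefix is read
    from cache; only the compression instruction is billed as new input.
    """
    summary = client.messages.create(
        model=model,
        max_tokens=2048,
        system=system_blocks,  # byte-identical to every other turn
        tools=tools,           # byte-identical to every other turn
        messages=history + [{
            "role": "user",
            "content": "Summarize the conversation so far so we can continue from a shorter context.",
        }],
    )
    # The new, shorter history starts from the summary and keeps caching forward.
    return [{"role": "user", "content": "Summary of earlier work: " + summary.content[0].text}]
```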
Don't switch models mid-conversation. If you need a different model, open a separate conversation for that task.
Configure MCP tools before the session starts. Don't add or remove mid-session.
Use --resume to continue previous sessions.
claude --resume
This restores a previous session. The cache chain picks up where it left off. No rebuild. This single flag is probably the most underrated cost-saving habit in Claude Code.
Quick Reference
| Action | Cache Impact | Cost Impact |
|---|---|---|
| Switch model mid-conversation | Full invalidation | Up to 20x |
| Add/remove MCP tools | Full invalidation | 10-20x |
| Open new session | Start from zero | First turns at full price |
| Switch accounts | Full invalidation | 10-20x |
| Long continuous conversation | Accumulates | Gets cheaper over time |
| Use --resume | Continues chain | Near-free |
One More Detail Worth Knowing
Claude Code never modifies the system prompt to update state information (current time, file changes). Instead, it injects updates using `<system-reminder>` tags inside messages, because modifying the prompt would break the cache. The prompt is treated as immutable infrastructure. Messages are the fluid information layer.
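Sketched the same way (illustrative, not Claude Code's source): volatile state rides along inside the message, wrapped in `<system-reminder>` tags, while the system prompt bytes stay frozen and the cached prefix stays intact.

```python
import datetime

SYSTEM = "You are a coding assistant."  # frozen after session start; never edited

def with_state_reminder(user_text: str) -> dict:
    """Attach volatile state to the message instead of mutating the system prompt."""
    reminder = (
        "<system-reminder>"
        f"Current time: {datetime.datetime.now().isoformat()}. "
        "Files changed since last turn: src/app.py."  # placeholder state
        "</system-reminder>"
    )
    return {"role": "user", "content": reminder + "\n\n" + user_text}

messages = [with_state_reminder("Run the tests again and fix any failures.")]
print(messages[0]["content"])
```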
That's the level of obsession Anthropic has about this. They monitor cache hit rate with the same severity as server uptime. A drop is treated as an incident.
The Model-Switching Problem
"Never switch models" is painful advice in practice. Sonnet for everyday coding, Opus for architecture decisions, Haiku for quick questions — that's a normal workflow.
Anthropic's answer is "use sub-agents," but most users can't orchestrate sub-agents themselves. If you're running Claude Code through a gateway like EvoLink, model routing can happen at the infrastructure level without breaking your session's cache chain. Worth knowing that option exists.
Caching is not an optimization technique. It is the foundation of the entire system. Now you know what Anthropic knows.
Jessie is COO at EvoLink, a Claude API gateway for teams and developers.