I'm building an app that monitors my token consumption in Claude Code. A few days ago, looking at the raw numbers, I found this:
cacheReadInputTokens: 4,241,579,174
inputTokens: 1,293,019
Four point two billion tokens read from cache. One point three million "fresh" tokens. That's a 99.97% cache hit rate.
My first reaction was that something had to be broken. Nobody has a 99% cache hit rate. Not Redis. Not Cloudflare. Not your mom when she claims she already knows what you're going to ask for dinner.
But it turns out it's not broken. This is exactly how it works. And the reason is as elegant as it is counterintuitive.
What Gets Cached Isn't Text
This is where most explanations fall short. When you read "prompt caching," you think of something like Redis: store the question, store the answer, if someone asks the same question, return the same answer.
Not at all.
What gets cached are KV tensors — the Key and Value matrices that the transformer calculates during the prefill phase. In simpler terms: when an LLM receives your prompt, the first thing it does is convert all that text into internal numerical representations (embeddings) and multiply them by weight matrices to get the "keys" (K) and "values" (V) that the attention mechanism needs to generate the response.
That calculation is expensive. In a 200,000-token prompt (normal for Claude Code, where conversation history accumulates), we're talking about billions of matrix multiplication operations. It's the most GPU-intensive part, the slowest part, the most expensive part.
The key insight: between one of your messages and the next, 99% of that prompt doesn't change. The system prompt is identical. The previous conversation history is identical. The files it read are the same. The only new thing is your latest message.
Why recalculate what you already calculated 30 seconds ago?
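The idea can be sketched in a few lines of Python. This is my toy model, not Anthropic's code: it treats each token's K/V projection as one expensive, cacheable unit of work, keyed by position. In a real transformer a position's K/V depends on its entire prefix, which is exactly why only an unchanged prefix can be reused.

```python
# Toy sketch: cache one (K, V) pair per prompt position, recompute
# only the positions past the longest matching prefix.

def project_kv(token):
    """Stand-in for the billions of matrix ops done per token at prefill."""
    return (hash(("K", token)), hash(("V", token)))

kv_cache = []   # list of (token, (K, V)) pairs, one per prompt position

def prefill(prompt_tokens):
    """Compute K/V only for positions past the cached prefix."""
    cached = 0
    while (cached < len(kv_cache) and cached < len(prompt_tokens)
           and kv_cache[cached][0] == prompt_tokens[cached]):
        cached += 1
    del kv_cache[cached:]                     # drop stale positions
    for tok in prompt_tokens[cached:]:
        kv_cache.append((tok, project_kv(tok)))
    return cached, len(prompt_tokens) - cached

hit, computed = prefill(["sys"] * 100 + ["hello"])             # cold start
hit2, computed2 = prefill(["sys"] * 100 + ["hello", "hi", "next"])
print(hit2, computed2)   # 101 positions reused, only 2 computed
```

The second call pays for two tokens instead of 103. Scale those numbers up to a 200K-token prompt and that's the entire economics of the feature.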
How Matching Works
Caching alone isn't enough; you also need to know when the cache is still valid. Anthropic uses an elegant trick: cumulative prefix hashing.
Each block of the prompt (system, tools, messages) generates a hash. But not an individual hash: a cumulative hash. The hash of block 3 includes the content of blocks 1, 2, and 3. If anything changes in a previous block, the hash of all following blocks changes too.
When a new request arrives, the system searches backwards from the point marked with cache_control, comparing hashes block by block, until it finds the longest matching prefix. Everything that matches → read from cache. Only the new stuff → gets calculated.
It's like a movie you've seen 40 times. You don't need to watch the whole thing to know what happens. You only need to watch from the point where it differs from what you remember.
The system only checks up to 20 blocks backwards. Beyond that, it stops searching. This is a practical decision to avoid spending more time searching the cache than calculating tensors directly.
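Here's a minimal sketch of that matching scheme, assuming SHA-256 and a flat list of text blocks. The real hashing, block granularity, and lookback logic are Anthropic internals; this just shows why cumulative hashes make prefix matching a single comparison.

```python
import hashlib

def cumulative_hashes(blocks):
    """Hash of block i covers blocks 0..i, so a change in any earlier
    block invalidates every later hash automatically."""
    h, out = hashlib.sha256(), []
    for block in blocks:
        h.update(block.encode())
        out.append(h.hexdigest())
    return out

def longest_cached_prefix(cached, incoming, max_lookback=20):
    """Walk backwards up to 20 blocks; the first cumulative hash that
    matches proves the entire prefix up to that point matches."""
    limit = min(len(cached), len(incoming))
    start = max(0, limit - max_lookback)
    for i in range(limit, start, -1):
        if cached[i - 1] == incoming[i - 1]:
            return i
    return 0

old = cumulative_hashes(["system", "tools", "msg1", "resp1"])
new = cumulative_hashes(["system", "tools", "msg1", "resp1", "msg2"])
print(longest_cached_prefix(old, new))  # 4 blocks served from cache
```

Note the payoff of the cumulative design: one equal hash at position i means blocks 1 through i all match, so there's no need to compare them individually.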
Why Claude Code Has a 99% Cache Hit Rate
Now that you know how matching works, the 99% stops being mysterious. Look at what happens in a typical Claude Code session:
Message 1 (first in the session):
System prompt (8K tokens) + Tools (2K tokens) + Your message (500 tokens)
= 10,500 tokens → EVERYTHING calculated, EVERYTHING written to cache
Message 2:
System prompt (8K) + Tools (2K) + Message 1 (500) + Response 1 (3K) + Your message 2 (500)
= 14,000 tokens
→ First 10,500 → CACHE HIT (already calculated before)
→ The 3,500 new ones → calculated and added to cache
Cache hit: 75%
Message 10:
System prompt + Tools + 9 messages + 9 responses + Your message 10
= ~150,000 tokens
→ First ~149,500 → CACHE HIT
→ The ~500 new ones → calculated
Cache hit: 99.7%
See it? The conversation history only grows. Each new message is a tiny fraction of the accumulated total, so the cache ratio climbs inexorably toward 99%.
It's not magic. It's arithmetic: the new tokens per message stay roughly constant, while the history they're appended to grows with every exchange. The numerator barely moves; the denominator keeps pulling away.
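A few lines of Python reproduce that arithmetic. I'm assuming a fixed ~500-token message and ~3,000-token response per exchange and ignoring file reads and tool output, so the percentages climb more slowly than in my real sessions, but the convergence is the same.

```python
# Fixed-size turns, growing history: watch the hit rate converge.
base = 10_500            # system prompt + tools + first user message
msg, resp = 500, 3_000   # assumed per-turn sizes

total = base
for n in range(2, 41):
    cached = total               # the entire previous prompt is a hit
    total += resp + msg          # history grows by one exchange
    if n in (2, 10, 40):
        print(f"message {n}: {cached / total:.1%} cache hit")
```

With real sessions, where a single file read can dump tens of thousands of tokens into the history, the denominator inflates even faster and the ratio crosses 99% much sooner.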
Where Those Tensors Live
This is where it gets interesting. Because caching KV tensors isn't like caching strings in Redis. We're talking about gigabytes of numerical data that need to be available with microsecond latency.
Anthropic uses a two-level system:
Level 1: VRAM (5-minute TTL)
The tensors live directly in the GPU memory that will serve the next request. Zero copy, zero network latency. Cache hits are nearly instantaneous because the data is already where it's needed.
TTL: 5 minutes. If nobody makes a request in 5 minutes, they get evicted. This is the cache you use with the standard API. Cache write price: 1.25x normal input price.
Level 2: GPU Node SSD (1-hour TTL)
If you pay for extended cache write (2x input price), tensors don't get evicted after 5 minutes. Instead, when they leave VRAM due to memory pressure, they get offloaded to the local SSD of the GPU node.
When a cache hit comes in, they're reloaded from SSD to VRAM. Slower than level 1, but orders of magnitude faster than recalculating the tensors from scratch.
The interesting part: no network involved. It's not a remote Redis. It's not S3. It's an SSD physically attached to the server that has the GPU. The architecture is designed to minimize data movement.
Request → In VRAM? → Yes → Instant cache hit
→ No → In local SSD? → Yes → Load to VRAM → Cache hit (~ms)
→ No → Calculate KV tensors → Cache miss
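The decision flow above can be mirrored in a handful of lines. Heavy caveat: Anthropic's real routing, eviction, and storage formats are not public; this sketch only captures the two-tier lookup order.

```python
# Two-tier KV cache lookup: VRAM first, local SSD second, prefill last.
VRAM, SSD = {}, {}   # prefix_hash -> KV tensors (placeholders here)

def compute_kv(prefix_hash):
    return f"tensors<{prefix_hash}>"      # stand-in for a full prefill

def lookup(prefix_hash):
    if prefix_hash in VRAM:               # level 1: already on the GPU
        return VRAM[prefix_hash], "vram_hit"
    if prefix_hash in SSD:                # level 2: reload from local SSD
        VRAM[prefix_hash] = SSD.pop(prefix_hash)
        return VRAM[prefix_hash], "ssd_hit"
    VRAM[prefix_hash] = compute_kv(prefix_hash)   # miss: pay full price
    return VRAM[prefix_hash], "miss"

_, first = lookup("session-42")             # cold: full prefill
_, second = lookup("session-42")            # still hot in VRAM
SSD["session-42"] = VRAM.pop("session-42")  # simulate offload under pressure
_, third = lookup("session-42")             # reloaded from SSD
print(first, second, third)                 # miss vram_hit ssd_hit
```

The design choice worth noticing: the SSD tier only exists to survive VRAM eviction, which is why it's local to the GPU node rather than a shared store across the cluster.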
Since February 2026, isolation is per workspace (previously per organization). This means tensors from your development team don't mix with the marketing team's, even if they're in the same Anthropic organization.
The Numbers
If you're evaluating whether this matters for your use case, here are the hard facts:
| Concept | Value |
|---|---|
| Cache read | 0.1x input price (90% discount) |
| Cache write 5 min | 1.25x input price |
| Cache write 1 hour | 2x input price |
| Latency reduction | ~85% on long prompts |
| Minimum cacheable | 1,024 tokens per checkpoint |
With Sonnet, input costs $3.00/M tokens. A cache read costs $0.30/M. In a Claude Code session with 200K tokens of history, the difference between recalculating and reading from cache is the difference between $0.60 and $0.06 per message.
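The back-of-envelope check, using the Sonnet prices from the table:

```python
# Cost of reprocessing 200K tokens of history: fresh vs. cached.
input_price = 3.00 / 1_000_000    # $ per input token (Sonnet)
cache_read  = 0.1 * input_price   # 90% discount on cache reads

history = 200_000                 # tokens of session history
print(f"recalculate: ${history * input_price:.2f}")
print(f"cache read:  ${history * cache_read:.2f}")
```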
Multiply that by the hundreds of messages you might exchange in a long session and you understand why Anthropic invested in building this: without prompt caching, long conversations with huge context would be economically unfeasible.
My Real Data
Back to my numbers from the beginning. In my Claude Code usage over a month:
cacheReadInputTokens: 4,241,579,174 (4.2 billion — read from cache)
cacheCreationInputTokens: 196,596,243 (197 million — written to cache)
inputTokens: 1,293,019 (1.3 million — calculated without cache)
outputTokens: 2,517,666 (2.5 million — generated by the model)
Global cache hit rate: 95.5%. And within individual long sessions, it easily exceeds 99%.
Notice the asymmetry: I've read 4.2 billion tokens from cache, but the model has only generated 2.5 million tokens of output. The cache-read to actual-work ratio is 1,685:1. For every token the model produces, it reuses 1,685 tokens of previous context.
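If you want to check those ratios yourself, the arithmetic from my usage data fits in a few lines:

```python
# Verifying the hit rate and read-to-output ratio from my monthly data.
cache_read  = 4_241_579_174
cache_write =   196_596_243
fresh_input =     1_293_019
output      =     2_517_666

total_input = cache_read + cache_write + fresh_input
print(f"hit rate: {cache_read / total_input:.1%}")
print(f"read-to-output ratio: {cache_read / output:.0f}:1")
```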
This also means cacheReadInputTokens isn't a good productivity metric. It doesn't measure how much you've "used" the model. It measures how much history the model has reread. It's like measuring your productivity by how many times you've opened the same file in your editor.
What Anthropic Doesn't Tell You
There are things that aren't public:
- User→GPU affinity: How do they ensure your next request lands on the same node that has your cache? Probably sticky routing per session, but they don't confirm it.
- SSD type: NVMe? CXL-attached? KV tensors for a 200K token prompt take up several GB. SSD speed matters a lot.
- PagedAttention: vLLM (the most popular open-source serving engine) uses a technique called PagedAttention that manages KV tensors like virtual memory pages. Does Anthropic use something similar, or do they have something proprietary? Unknown.
- Cluster topology: How many GPUs, how they're interconnected, whether they use InfiniBand or Ethernet. Nothing public.
The Analogy That Explains Everything
Think of prompt caching as a surgeon's working memory during an operation.
The surgeon (the model) has to process all the patient information (the prompt) to decide each move (the output). Without cache, they'd have to reread the complete medical history before each cut. With cache, they remember everything they already read and only need to process new information — the latest blood work, the tissue's response to the previous cut.
What gets saved isn't the patient's documents (the text). It's the intermediate conclusions the surgeon already extracted from those documents (the KV tensors). They don't need to reread the blood work. They already know what it says. They just need to integrate the new information with what they already know.
The 99% cache hit rate simply reflects that, in a conversation with an LLM, the amount of "what we already know" grows much faster than the amount of "new stuff to process."
And that's what makes it possible to have 200K token context conversations without each message costing you an arm and a leg.
Related: If you're interested in what happens when the app monitoring those tokens is based on data invented by the AI itself, read Silent failure: when your AI makes things up and tests say everything's fine. And if you want to see how I manage API secrets without 1Password asking for Touch ID every 30 seconds, read Authorization fatigue and a 40-line cache.