Tawan Shamsanor

How Prompt Caching Cuts AI Costs by 90%

The 90% Discount Most API Users Never Claim

Anthropic's cache cuts API costs by 90%, yet most developers sending requests to Claude, GPT, or Gemini have never configured it. Prompt caching, which Anthropic launched in August 2024, reduces input token costs from $3 per million to $0.30 per million for cached portions on Claude 3.5 Sonnet. That's not a theoretical optimization. It's a billing line item that appears in your API usage the moment you add a single cache_control parameter to your request.

Every major LLM provider now offers some form of prompt caching. OpenAI does it automatically for GPT-4o and o3 when your prefix exceeds 1,024 tokens. Google Gemini 1.5 Pro offers a 75% discount but only for prompts exceeding 32,768 tokens. The mechanics differ, but the principle is the same: when the model has already computed the internal state (the KV-cache) for a portion of your prompt, it doesn't need to recompute it on the next request. You pay less. It responds faster. Everyone wins.

But how does the system actually know it's seen your prompt before? What happens inside the API infrastructure when a cache hit occurs versus a miss? And why does the cache expire after exactly 5 minutes of inactivity on Claude? Let's break down the 8-step pipeline that makes prompt caching work under the hood.

Step 1: The API Hashes Your Prompt Prefix

When your request hits the API endpoint, the first thing the system does is compute a SHA-256 hash over the prompt prefix — the portion of your prompt that you've marked as cacheable using the cache_control parameter. This hash becomes the unique cache identifier. The prefix must match exactly between requests: same tokens, same order, same content blocks. A single character change in a 10,000-token system prompt produces a completely different hash, and the cache won't match.

This is why prompt structure matters. If you're building a multi-turn chatbot and your system prompt changes every message (say, injecting a timestamp), you'll never get a cache hit. The fix: put dynamic content after the cache breakpoint. Keep your system instructions, tool definitions, and few-shot examples in the prefix, and append variable content in the suffix.
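Here's a minimal sketch of that structure using the Anthropic Python SDK: static instructions in the cached prefix, the user's message in the uncached suffix. The model ID and prompt text are placeholders, and older SDK versions may also require the prompt-caching beta header.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Static content lives in the prefix and is marked cacheable. Anything that
# changes per request (timestamps, user IDs, the user's question) stays after
# the cache breakpoint so it never changes the prefix hash.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."  # imagine 5,000+ tokens here

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Cache breakpoint: everything up to and including this block is the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Where is order #12345?"}  # dynamic suffix, processed fresh each time
    ],
)
print(response.content[0].text)
```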

Step 2: The System Queries the Distributed Cache

With the SHA-256 hash in hand, the system queries a distributed Redis cache cluster to check for existing KV-cache tensor states. These aren't simple key-value lookups: each cache entry references precomputed attention key-value tensors for every layer of the transformer, for every token in the prefix. On a 28-layer model, that's 28 pairs of key and value tensors stored per cache entry.

The Redis cluster is distributed across availability zones for redundancy. If the node holding your cache entry goes down, the system falls back to a replica, or simply takes the cache miss and recomputes. The lookup itself takes under 2 milliseconds, which is why cached responses feel near-instantaneous compared to the 500ms+ latency of processing the full prompt from scratch.
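A toy sketch of that lookup pattern, with a Redis-style store keyed by the prefix hash. The key format, host name, and the use of the redis-py client here are illustrative assumptions, not Anthropic's actual implementation.

```python
import hashlib

import redis  # redis-py client, standing in for the provider's distributed cache

r = redis.Redis(host="kv-cache.internal", port=6379)

def lookup_cached_prefix(prefix_bytes: bytes) -> bytes | None:
    """Return the serialized KV tensors if this exact prefix was seen recently."""
    key = "kvcache:" + hashlib.sha256(prefix_bytes).hexdigest()
    return r.get(key)  # None on a miss, the compressed tensor blob on a hit

def store_cached_prefix(prefix_bytes: bytes, kv_blob: bytes, ttl_seconds: int = 300) -> None:
    """Write the compressed KV tensors with a 5-minute time-to-live."""
    key = "kvcache:" + hashlib.sha256(prefix_bytes).hexdigest()
    r.set(key, kv_blob, ex=ttl_seconds)
```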

Step 3: On a Cache Miss, the Model Processes Everything

If the hash doesn't match any entry in Redis (a cache miss), the transformer model processes the entire prompt through all attention layers, generating key-value pairs at each layer for each token. This is the expensive path — every token in your 100,000-token system prompt gets embedded, projected through Q/K/V matrices, and run through multi-head attention. On Claude 3.5 Sonnet, processing 100,000 input tokens costs $0.30 at standard rates and takes roughly 1-2 seconds of GPU compute time.

This is also where the cache write surcharge comes in. On Anthropic's pricing, writing a cache entry costs 25% more than standard input tokens ($3.75 per million vs. $3.00 per million for Claude 3.5 Sonnet). You pay this premium once. Every subsequent cache read costs just 10% of the standard rate.
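Putting numbers on the miss path for a 100,000-token prefix at the Claude 3.5 Sonnet rates quoted above:

```python
PREFIX_TOKENS = 100_000

STANDARD_RATE = 3.00 / 1_000_000     # $ per input token
CACHE_WRITE_RATE = 3.75 / 1_000_000  # 25% premium, paid once on the cache miss that writes the entry
CACHE_READ_RATE = 0.30 / 1_000_000   # 10% of standard, paid on every subsequent hit

print(f"uncached request:      ${PREFIX_TOKENS * STANDARD_RATE:.2f}")     # $0.30
print(f"cache-writing request: ${PREFIX_TOKENS * CACHE_WRITE_RATE:.3f}")  # $0.375
print(f"cache-read request:    ${PREFIX_TOKENS * CACHE_READ_RATE:.2f}")   # $0.03
```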

Step 4: KV-Cache Tensors Get Compressed

Once the model has computed the KV-cache tensors for your prefix, those tensors need to be stored efficiently. Raw KV-cache data for a 100,000-token prompt on a 28-layer model occupies roughly 1.5-2 GB of memory in FP16 format. The system serializes these tensors into a compressed binary format using LZ4 compression, reducing storage by 60-70%. That brings the per-entry footprint down to 450-800 MB — still substantial, but manageable across a distributed cache fleet.

LZ4 is chosen specifically for its decompression speed (over 3 GB/s on modern hardware). When a cache hit occurs, the system needs to load tensors into GPU memory as fast as possible. A slower compression algorithm like Zstandard might save more space, but the decompression overhead would negate the latency benefit of caching.
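The storage format isn't public either, but the compress-and-restore round trip looks roughly like this with the lz4 Python package. The zero-filled toy tensor stands in for real KV data; actual compression ratios depend entirely on the tensor statistics.

```python
import lz4.frame
import numpy as np

# Toy stand-in for one layer's key/value tensors in FP16. Real KV activations
# are far from random noise, which is what makes meaningful compression possible.
kv_tensor = np.zeros((100_000, 128), dtype=np.float16)

raw = kv_tensor.tobytes()
compressed = lz4.frame.compress(raw)
restored = np.frombuffer(lz4.frame.decompress(compressed), dtype=np.float16).reshape(kv_tensor.shape)

print(f"raw: {len(raw) / 1e6:.1f} MB, compressed: {len(compressed) / 1e6:.3f} MB")
assert np.array_equal(kv_tensor, restored)  # LZ4 is lossless, so the round trip is bit-exact
```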

Step 5: The Cache Entry Is Written With a 300-Second TTL

The compressed cache entry is written to Redis with a TTL (time-to-live) of exactly 300 seconds, i.e. 5 minutes. It's also tagged with a model version identifier. If Anthropic updates the model weights (even a minor deployment), cached entries from the previous version become invalid. This tag prevents stale KV-cache tensors from being loaded into a model with different weights, which would produce garbage outputs.

Cached prompts on Claude cost 10% of the standard input rate to read, but the entry expires after 5 minutes of inactivity, so keeping the savings requires deliberate request timing. If your application sends requests more than 5 minutes apart, every request is a cache miss. The solution: implement a keep-alive mechanism that sends a lightweight cached request every 4.5 minutes during active sessions (each read refreshes the TTL), or batch your API calls into tighter windows.
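A minimal keep-alive sketch along those lines: a background thread re-touches the cached prefix every 4.5 minutes while the session is active. Here `send_cached_request` is assumed to be whatever function in your app issues a tiny request reusing the exact cached prefix.

```python
import threading

KEEPALIVE_INTERVAL_SECONDS = 270  # 4.5 minutes, comfortably inside the 300-second TTL

def keep_cache_warm(send_cached_request, stop_event: threading.Event) -> None:
    """Ping the cached prefix before its 5-minute TTL lapses.

    Each cache read refreshes the TTL, so as long as this loop runs,
    the entry never expires mid-session.
    """
    while not stop_event.wait(KEEPALIVE_INTERVAL_SECONDS):
        send_cached_request()

# Usage sketch:
# stop = threading.Event()
# threading.Thread(target=keep_cache_warm, args=(my_request_fn, stop), daemon=True).start()
# ... conversation runs ...
# stop.set()  # user went idle; stop paying for keep-alives and let the cache lapse
```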

Step 6: On a Cache Hit, Tensors Load Directly Into GPU Memory

When the SHA-256 hash matches an entry in Redis (a cache hit), the system deserializes the cached KV tensors and loads them directly into GPU memory, bypassing the entire transformer computation for the prefix. This is the core insight: the model doesn't re-read your system prompt, doesn't re-encode your tool definitions, doesn't recompute attention patterns for your few-shot examples. The GPU simply resumes from where the cached prefix ends.

The latency reduction is dramatic. A 100,000-token prompt that takes 1.5 seconds to process from scratch can be "replayed" from cache in under 50 milliseconds. For applications with long system prompts (legal document analysis, code review with full repository context, multi-tool agent workflows), this isn't just a cost optimization — it's the difference between a responsive application and an unusable one.
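You don't have to infer hits from latency alone: the Anthropic API reports cache activity in the response's usage block, so you can verify that the prefix was actually replayed. Continuing from the request sketch in Step 1:

```python
usage = response.usage  # `response` from the client.messages.create(...) call above

print("regular input tokens:   ", usage.input_tokens)
print("tokens written to cache:", usage.cache_creation_input_tokens)  # > 0 on the first request (the miss)
print("tokens read from cache: ", usage.cache_read_input_tokens)      # > 0 on later requests (the hits)

if usage.cache_read_input_tokens:
    print("cache hit: the prefix was replayed instead of recomputed")
```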

Step 7: Only the Uncached Suffix Gets Processed

After loading the cached prefix tensors, the model processes only the uncached suffix tokens through the attention mechanism, using the cached keys and values as context. If your system prompt is 50,000 tokens and your new user message is 200 tokens, the model runs full attention computation on just those 200 tokens, attending to the cached key-value pairs from the prefix for the rest of the context.

This is also where the economics become compelling. A 200,000-token context window with a 90% cache hit rate reduces costs from $600 to $78 per thousand requests on Claude 3.5 Sonnet, according to Anthropic's August 2024 benchmarks. That's an 87% reduction, and it compounds across every request in a session. A customer support chatbot that processes 10,000 conversations per day with a 5,000-token system prompt could cut its monthly API bill from $4,500 to under $600 with caching enabled.
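Rough monthly math for that chatbot, counting only the system-prompt tokens and assuming a high (here, 98%) cache hit rate across conversations:

```python
CONVERSATIONS_PER_DAY = 10_000
SYSTEM_PROMPT_TOKENS = 5_000
DAYS_PER_MONTH = 30
CACHE_HIT_RATE = 0.98  # assumed share of prefix tokens served from cache

STANDARD = 3.00 / 1_000_000     # $ per token
CACHE_WRITE = 3.75 / 1_000_000
CACHE_READ = 0.30 / 1_000_000

monthly_prefix_tokens = CONVERSATIONS_PER_DAY * SYSTEM_PROMPT_TOKENS * DAYS_PER_MONTH  # 1.5 billion

without_caching = monthly_prefix_tokens * STANDARD
blended_rate = (1 - CACHE_HIT_RATE) * CACHE_WRITE + CACHE_HIT_RATE * CACHE_READ
with_caching = monthly_prefix_tokens * blended_rate

print(f"without caching: ${without_caching:,.0f}")  # $4,500
print(f"with caching:    ${with_caching:,.0f}")     # roughly $550 at a 98% hit rate
```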

Step 8: The Billing System Applies Separate Rate Multipliers

The final step happens in the billing layer. The system calculates charges using separate rate multipliers: 1.0× for uncached tokens, 0.1× for cache-read tokens, and 1.25× for cache-write tokens. On Claude 3.5 Sonnet specifically:

  • Standard input: $3.00 per million tokens
  • Cache write: $3.75 per million tokens (25% premium for first request)
  • Cache read: $0.30 per million tokens (90% discount on subsequent hits)

The breakeven point is remarkably low. If you send the same prefix twice within the 5-minute TTL window, you've already saved money. The first request costs 1.25× (the write premium). The second request costs 0.1× for the cached portion. Total for two requests: 1.35× versus 2.0× without caching — a 32.5% saving after just one cache hit. By the third request, you're at 1.45× total versus 3.0× — a 51.7% cumulative saving.
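Here's that breakeven arithmetic generalized to n requests, under the same simplification that the entire prompt sits in the cached prefix:

```python
def cumulative_multipliers(n_requests: int) -> tuple[float, float]:
    """Total cost multiplier after n requests: (with caching, without caching)."""
    with_cache = 1.25 + 0.10 * (n_requests - 1)  # one cache write, then cache reads
    without_cache = 1.00 * n_requests
    return with_cache, without_cache

for n in (1, 2, 3, 5, 10):
    cached, uncached = cumulative_multipliers(n)
    saving = 1 - cached / uncached
    print(f"{n:>2} requests: {cached:.2f}x vs {uncached:.1f}x ({saving:.1%} saved)")
# A single request with no reuse costs 25% more (the write premium); savings begin with the first hit.
```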

Provider Comparison: Not All Caching Is Equal

The three major providers implement caching differently, and the differences matter for your architecture decisions:

  • Anthropic (Claude): Explicit cache breakpoints with the cache_control parameter. Minimum 1,024 tokens for Sonnet and Opus, 2,048 for Haiku. 5-minute TTL, refreshed on each use. 90% read discount. Supports up to 4 cache breakpoints per request.
  • OpenAI (GPT-4o/o3): Automatic caching, no code changes needed. Minimum 1,024 tokens in prefix. Entries typically evicted after 5-10 minutes of inactivity. 50% read discount (less aggressive than Anthropic). Only caches the longest common prefix across requests.
  • Google (Gemini 1.5 Pro): Explicit context caching API. Minimum 32,768 tokens (a far higher threshold). Configurable TTL (default 1 hour) plus a per-token-hour storage fee. 75% read discount on cached tokens. Best for very long documents that don't change frequently.

When Prompt Caching Doesn't Help

Prompt caching isn't a silver bullet. It provides zero benefit when:

  • Every request is unique — one-shot analysis tasks with no repeated prefix
  • Requests are spaced more than 5 minutes apart — the TTL expires before the next call
  • Your prefix is under 1,024 tokens — it never gets cached in the first place
  • Your system prompt changes per request — dynamic timestamps, user IDs in the prefix

For these cases, other cost optimizations (shorter prompts, model distillation, batching) are more appropriate. Caching rewards repetitive, structured API usage — which happens to be exactly what production applications generate.

The Bottom Line

Prompt caching is the single highest-ROI optimization available for LLM API costs today. It requires minimal code changes (one parameter on Anthropic, zero on OpenAI), delivers up to 90% cost reduction on cached tokens, and simultaneously reduces latency by 95%+ for long prompts. The 5-minute TTL means it's most effective for conversational applications, batch processing pipelines, and agent workflows that make frequent, structured API calls.

The math is simple: if your application sends the same prompt prefix more than twice within 5 minutes, enable caching. You'll save money on the second request and keep saving on every request after that. But there's a hidden cost when your cache expires mid-conversation that nobody talks about.
