If you ship a chatbot, a RAG app, or an AI agent against a large language model, prompt caching is the single optimization that gives you back 50–90% of input cost and 3–10× of time-to-first-token at no quality cost. It isn't a bolt-on trick — it falls directly out of how Transformer attention is defined. Once you understand that, the rest of the stack (TTLs, provider differences, prompt structure) lines up cleanly.
This page is the index to a four-part series that takes you from the theory to a production decision matrix. Pick where to enter based on what you already know.
Where to enter
| If you want to... | Start at |
|---|---|
| Understand why caching exists and what KV cache actually is | Part 1 — How KV Cache & TTL Work |
| Pick a provider and know what's different about each | Part 2 — Compare Claude, GPT, Gemini, DeepSeek |
| Copy-paste working Python and measure your own numbers | Part 3 — Working Python Tutorial |
| Match a chatbot / RAG / agent workload to the right model | Part 4 — Best Model for Chat, RAG & Agents |
Each part stands alone but they're written so reading them in order builds the picture without redundancy.
Part 1 — How LLM Prompt Caching Works
LLM Prompt Caching #1: How KV Cache & TTL Work →
The architectural article. Walks through self-attention as a single equation, explains why the K and V vectors of a stable prefix are mathematically reusable, and shows how the memory-vs-compute tradeoff produces the TTL behavior every developer has to design around.
Key takeaways:
- Prompt caching isn't an optimization layered on top — it's a direct consequence of causal-masked attention. K/V at position
iis a deterministic function of tokens1…i, so identical prefixes give bit-identical K/V. - Prefill (compute-bound, O(N²)) is what caching saves; decode (memory-bandwidth-bound, O(N) per token) is what every inference engine already optimizes.
- TTLs exist because KV cache is enormous (~10 GB for a 32K context on a 70B model). 5 minutes is the GPU memory-pressure horizon; hours-to-days are only possible with disk-backed caches (DeepSeek's MLA architecture).
- Caching wins both cost (50–90% off input on cache hits) and latency (TTFT drops 3–10× for prompts in the 5–10K-token range and much more for 100K+).
Part 2 — Compare LLM Prompt Caching Across Providers
LLM Prompt Caching #2: Compare Claude, GPT, Gemini, DeepSeek →
The buyer's guide. Five providers expose prompt caching in five very different shapes — explicit markers (Claude), fully automatic (GPT-5, DeepSeek-v4), hybrid implicit+explicit (Gemini, Qwen), or architectural disk-backing (DeepSeek's MLA). The article gives a feature-by-feature comparison plus a 5-dimension evaluation framework to score them for your specific workload.
Key takeaways:
- Don't compare base prices — compare effective cost weighted by your hit rate (formula in §4.1).
- Claude has the deepest single-call discount (~90%) but requires explicit
cache_controlmarkers. - DeepSeek-v4 is the only provider with disk-backed caches at scale; partial-prefix matches earn discounts because the granularity is 64 tokens instead of 1,024.
- Gemini's explicit cache costs hourly storage fees — break-even depends on call frequency.
- API ergonomics, hit-rate predictability, TTL fit, latency under miss, and migration cost are the five dimensions that actually distinguish providers once you control for hit rate.
Part 3 — Working Python Tutorial
LLM Prompt Caching #3: Working Python Tutorial →
The hands-on article. One OpenAI SDK + one Anthropic SDK against a single gateway, with measured numbers from 2026-05-25 across the full Claude family (haiku-4-5 through opus-4-7), GPT-5.x, Gemini 2.5, DeepSeek-v4, and Qwen3.
Key takeaways:
-
Claude with
cache_controlmarkers: measured 88–89% cost reduction uniformly across haiku/sonnet/opus 4-x. Use the Anthropic SDK withbase_url="https://synthorai.io/". - GPT-5.4-mini auto-cache: 5× TTFT improvement (3.6 s → 0.73 s on a 7K-token prompt), 93% cache hit rate on the system tokens.
- Gemini 2.5-flash implicit: 88% cost reduction on cache hits when streaming usage is captured.
- DeepSeek-v4-flash: 74% off, disk-backed (cache survives hour-scale idle).
- TTL-aware patterns: keep-alive heartbeat for cron, prefix stability rules, what to log per call.
Part 4 — Best Model by Use Case
LLM Prompt Caching #4: Best Model for Chat, RAG & Agents →
The decision article. Different workloads pull the cost/latency levers differently — chat is naturally cache-friendly, RAG fights the prefix-stability problem, agents depend on cumulative prefix discipline. The article gives a model recommendation by workload shape with cost estimates.
Key takeaways:
-
Chatbots: any model with auto-cache works; sessions hit naturally. Pick on cost/quality.
gpt-5.4-nanocheapest,gpt-5.4-minifastest cached TTFT,claude-haiku-4-5best instruction-following at modest premium. -
RAG: retrieved-doc reordering kills mid-prompt cache hits. Three fixes — push references to the end, deterministic chunk ordering, or Claude's multi-
cache_controlbreakpoints. -
Agents: tool calls and results must be append-only and byte-identical step-to-step.
claude-sonnet-4-5with 4cache_controlmarkers gives the strongest cumulative-prefix discount;gpt-5.4-miniworks without code changes at 50% savings. - TTL match: 5 min for chat, 1 hour for agents with human-in-the-loop steps, disk-backed for sporadic batch.
How to read this
- Engineer new to the topic: read in order. The architecture in Part 1 makes Parts 2–4 click instantly.
- PM or architect doing vendor selection: jump to Part 2 + Part 4. Reference Part 1 if a teammate asks "but why TTL exists".
- Engineer with a specific workload to ship today: Part 4 first (find your row in the matrix), then Part 3 for the exact code.
- Anyone optimizing an existing app: Part 3 §6 cross-provider benchmark — reproduce it against your own prompt; that's a one-day exercise, not a multi-week migration.
Numbers in this series
All measured numbers were captured on 2026-05-25 against the Synthorai gateway (https://synthorai.io/v1 for OpenAI-compat, https://synthorai.io/ for Anthropic-native), single-tenant, single sequential run, no concurrent load. Your numbers will move with region, time-of-day, and competing tenant load — treat them as a starting point and reproduce against your own traffic before quoting them.
Pricing tables and TTL behavior reflect vendor public documentation as of 2026-05. Providers update these every few months; the architectural reasoning (Part 1) is stable, the comparative numbers (Part 2 & 3) drift.
Top comments (0)