DEV Community

synthorai
synthorai

Posted on • Originally published at synthorai.io

LLM Prompt Caching: The Complete 2026 Guide

If you ship a chatbot, a RAG app, or an AI agent against a large language model, prompt caching is the single optimization that gives you back 50–90% of input cost and 3–10× of time-to-first-token at no quality cost. It isn't a bolt-on trick — it falls directly out of how Transformer attention is defined. Once you understand that, the rest of the stack (TTLs, provider differences, prompt structure) lines up cleanly.

This page is the index to a four-part series that takes you from the theory to a production decision matrix. Pick where to enter based on what you already know.


Where to enter

If you want to... Start at
Understand why caching exists and what KV cache actually is Part 1 — How KV Cache & TTL Work
Pick a provider and know what's different about each Part 2 — Compare Claude, GPT, Gemini, DeepSeek
Copy-paste working Python and measure your own numbers Part 3 — Working Python Tutorial
Match a chatbot / RAG / agent workload to the right model Part 4 — Best Model for Chat, RAG & Agents

Each part stands alone but they're written so reading them in order builds the picture without redundancy.


Part 1 — How LLM Prompt Caching Works

LLM Prompt Caching #1: How KV Cache & TTL Work →

The architectural article. Walks through self-attention as a single equation, explains why the K and V vectors of a stable prefix are mathematically reusable, and shows how the memory-vs-compute tradeoff produces the TTL behavior every developer has to design around.

Key takeaways:

  • Prompt caching isn't an optimization layered on top — it's a direct consequence of causal-masked attention. K/V at position i is a deterministic function of tokens 1…i, so identical prefixes give bit-identical K/V.
  • Prefill (compute-bound, O(N²)) is what caching saves; decode (memory-bandwidth-bound, O(N) per token) is what every inference engine already optimizes.
  • TTLs exist because KV cache is enormous (~10 GB for a 32K context on a 70B model). 5 minutes is the GPU memory-pressure horizon; hours-to-days are only possible with disk-backed caches (DeepSeek's MLA architecture).
  • Caching wins both cost (50–90% off input on cache hits) and latency (TTFT drops 3–10× for prompts in the 5–10K-token range and much more for 100K+).

Part 2 — Compare LLM Prompt Caching Across Providers

LLM Prompt Caching #2: Compare Claude, GPT, Gemini, DeepSeek →

The buyer's guide. Five providers expose prompt caching in five very different shapes — explicit markers (Claude), fully automatic (GPT-5, DeepSeek-v4), hybrid implicit+explicit (Gemini, Qwen), or architectural disk-backing (DeepSeek's MLA). The article gives a feature-by-feature comparison plus a 5-dimension evaluation framework to score them for your specific workload.

Key takeaways:

  • Don't compare base prices — compare effective cost weighted by your hit rate (formula in §4.1).
  • Claude has the deepest single-call discount (~90%) but requires explicit cache_control markers.
  • DeepSeek-v4 is the only provider with disk-backed caches at scale; partial-prefix matches earn discounts because the granularity is 64 tokens instead of 1,024.
  • Gemini's explicit cache costs hourly storage fees — break-even depends on call frequency.
  • API ergonomics, hit-rate predictability, TTL fit, latency under miss, and migration cost are the five dimensions that actually distinguish providers once you control for hit rate.

Part 3 — Working Python Tutorial

LLM Prompt Caching #3: Working Python Tutorial →

The hands-on article. One OpenAI SDK + one Anthropic SDK against a single gateway, with measured numbers from 2026-05-25 across the full Claude family (haiku-4-5 through opus-4-7), GPT-5.x, Gemini 2.5, DeepSeek-v4, and Qwen3.

Key takeaways:

  • Claude with cache_control markers: measured 88–89% cost reduction uniformly across haiku/sonnet/opus 4-x. Use the Anthropic SDK with base_url="https://synthorai.io/".
  • GPT-5.4-mini auto-cache: 5× TTFT improvement (3.6 s → 0.73 s on a 7K-token prompt), 93% cache hit rate on the system tokens.
  • Gemini 2.5-flash implicit: 88% cost reduction on cache hits when streaming usage is captured.
  • DeepSeek-v4-flash: 74% off, disk-backed (cache survives hour-scale idle).
  • TTL-aware patterns: keep-alive heartbeat for cron, prefix stability rules, what to log per call.

Part 4 — Best Model by Use Case

LLM Prompt Caching #4: Best Model for Chat, RAG & Agents →

The decision article. Different workloads pull the cost/latency levers differently — chat is naturally cache-friendly, RAG fights the prefix-stability problem, agents depend on cumulative prefix discipline. The article gives a model recommendation by workload shape with cost estimates.

Key takeaways:

  • Chatbots: any model with auto-cache works; sessions hit naturally. Pick on cost/quality. gpt-5.4-nano cheapest, gpt-5.4-mini fastest cached TTFT, claude-haiku-4-5 best instruction-following at modest premium.
  • RAG: retrieved-doc reordering kills mid-prompt cache hits. Three fixes — push references to the end, deterministic chunk ordering, or Claude's multi-cache_control breakpoints.
  • Agents: tool calls and results must be append-only and byte-identical step-to-step. claude-sonnet-4-5 with 4 cache_control markers gives the strongest cumulative-prefix discount; gpt-5.4-mini works without code changes at 50% savings.
  • TTL match: 5 min for chat, 1 hour for agents with human-in-the-loop steps, disk-backed for sporadic batch.

How to read this

  • Engineer new to the topic: read in order. The architecture in Part 1 makes Parts 2–4 click instantly.
  • PM or architect doing vendor selection: jump to Part 2 + Part 4. Reference Part 1 if a teammate asks "but why TTL exists".
  • Engineer with a specific workload to ship today: Part 4 first (find your row in the matrix), then Part 3 for the exact code.
  • Anyone optimizing an existing app: Part 3 §6 cross-provider benchmark — reproduce it against your own prompt; that's a one-day exercise, not a multi-week migration.

Numbers in this series

All measured numbers were captured on 2026-05-25 against the Synthorai gateway (https://synthorai.io/v1 for OpenAI-compat, https://synthorai.io/ for Anthropic-native), single-tenant, single sequential run, no concurrent load. Your numbers will move with region, time-of-day, and competing tenant load — treat them as a starting point and reproduce against your own traffic before quoting them.

Pricing tables and TTL behavior reflect vendor public documentation as of 2026-05. Providers update these every few months; the architectural reasoning (Part 1) is stable, the comparative numbers (Part 2 & 3) drift.

Top comments (0)