Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%
Your chatbot runs a 70B Llama on 8x H100s. Steady-state TTFT sits around 180 ms for short prompts, and the team is fine with that. Then you turn on a RAG feature: every request sends a 6,000-token context stuffed with retrieved documents, plus a short system prompt, plus the user's question. TTFT jumps to 1.4 seconds. p99 hits 2.1 s. A surprising share of those tokens are the same on every request: the system prompt, the tool definitions, and, for the top queries, the same 6k tokens of retrieved chunks. The model is recomputing the same attention state over and over, then throwing it away. This is the problem prefix caching solves. Last week's post on KV cache quantization closed with it as the next topic because the two features compose: a quantized prefix cache is cheaper to keep warm than a BF16 one, and the saved memory buys you either more concurrent users or a longer shared prefix.
Here's what prefix caching actually is, how vLLM and SGLang implement it differently, and where production deployments quietly lose most of the benefit.
Why this matters in practice
A modern LLM serving stack has two phases per request: prefill (process the entire prompt to build the KV cache) and decode (generate one token at a time, attending against the growing cache). For long-context workloads, prefill dominates. On a 70B Llama-3 with 8k of input, prefill accounts for roughly 70–85% of TTFT — decode is fast in comparison.
Most "long input" workloads are not actually long and unique on every request. They're long and repetitive:
- RAG pipelines. The same retrieved chunks hit the same top queries. The system prompt and tool schema are byte-for-byte identical across every request. The user question is the only variable part, and it's tiny.
- Multi-turn chat. Each turn is a strict prefix extension of the previous one. Round 2 shares everything except the latest assistant message and the new user turn.
- Agent loops. The same tool schema, planning prompt, and few-shot examples get prepended every step. Only the latest tool result differs.
- Long-document QA. Users repeatedly ask questions about the same 200-page PDF. The document is the prefix; the question is the suffix.
Prefix caching is the optimization that says: if the first N tokens of this request match a request I already processed, hand me back the KV cache for those N tokens instead of recomputing them. In the textbook case, the model output is bit-identical to a no-cache run, but prefill drops to a fraction of the cost. The reported "80% prefill saved" numbers come from RAG with 90%+ prefix overlap. The 5% numbers come from workloads where the prefix rarely matches, or the cache is constantly evicted before reuse.
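To see how a prefill saving translates into a TTFT saving, here is a back-of-the-envelope sketch. It assumes TTFT splits into a prefill part and everything else, that only the prefill part shrinks, and that prefill cost scales roughly linearly with the tokens actually computed. The function name and the linear model are illustrative assumptions, not measurements.

```python
def ttft_reduction(prefill_share: float, prefill_saved: float) -> float:
    """Fraction of TTFT removed when `prefill_saved` of the prefill work is skipped.

    Assumption: TTFT = prefill + everything else, and only prefill shrinks.
    """
    if not (0.0 <= prefill_share <= 1.0 and 0.0 <= prefill_saved <= 1.0):
        raise ValueError("both arguments must be fractions in [0, 1]")
    return prefill_share * prefill_saved


if __name__ == "__main__":
    # RAG-like case: prefill is 80% of TTFT and 90% of it is served from cache.
    print(f"{ttft_reduction(0.80, 0.90):.0%}")
    # Poorly matched workload: same prefill share, only 5% of prefill skipped.
    print(f"{ttft_reduction(0.80, 0.05):.0%}")
```

Under this model the TTFT win is always smaller than the prefill win, which is why it helps to report the two numbers separately.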
What "prefix caching" actually is
The high-level idea is simple. The implementation has three decisions that drive the rest of the system: what unit do you hash on, how do you look it up, and what do you do when the cache is full.
flowchart LR
A[New request<br/>tokens 0..N-1] --> B[Tokenize &<br/>split into blocks]
B --> C[Hash each block<br/>tokens + parent hash]
C --> D{Lookup in<br/>block table}
D -- hit --> E[Reuse KV blocks<br/>skip prefill]
D -- miss --> F[Compute KV<br/>for that block]
F --> G[Insert block<br/>into table]
E --> H[Continue with<br/>remaining prefill]
G --> H
H --> I[Decode normally<br/>+ append new blocks]
Three things matter. First, prefix caching is prefix-only: you can only skip the leading tokens, never a middle substring. If two requests share tokens 1000–2000 but differ on 0–999, you reuse nothing. Second, the cache is block-grained, not token-grained. A request has to match a whole block (default 16 tokens) to get a hit, so reuse always rounds down to a block boundary. A request that diverges at token 14,003 of a 14,016-token shared prefix reuses the 875 full blocks before the divergence (14,000 tokens) and recomputes everything from token 14,000 onward, while a request that shares only its first 15 tokens reuses nothing. Third, prefix caching does not change decoding: every saved token is a saved prefill token.
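The flow above can be sketched in a few lines. This is a toy model under stated assumptions, not any engine's implementation: `block_hashes` and `ToyPrefixCache` are hypothetical names, the table stores a placeholder integer instead of real KV tensors, and there is no eviction.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per block, matching the default mentioned above


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every full block; each hash also covers the parent block's hash."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # trailing partial block is ignored
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Maps block hash -> placeholder for a KV block (hypothetical, no eviction)."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.table: Dict[str, int] = {}

    def cached_prefix_len(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV blocks are already in the table."""
        hits = 0
        for h in block_hashes(tokens, self.block_size):
            if h not in self.table:
                break  # prefix-only: stop at the first miss
            hits += 1
        return hits * self.block_size

    def insert(self, tokens: List[int]) -> None:
        for index, h in enumerate(block_hashes(tokens, self.block_size)):
            self.table.setdefault(h, index)


if __name__ == "__main__":
    cache = ToyPrefixCache()
    first = list(range(100))
    cache.insert(first)

    # Same first 40 tokens, then different content: two full blocks match.
    second = first[:40] + [9000 + i for i in range(60)]
    print(cache.cached_prefix_len(second))  # 32

    # Same tokens 16..99 but a different first block: nothing is reusable.
    third = [7777] * 16 + first[16:]
    print(cache.cached_prefix_len(third))  # 0
```

Because each hash folds in its parent, changing the first block changes every later hash, which is what makes the cache prefix-only.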
How vLLM does it: hash-based blocks
vLLM's Automatic Prefix Caching (APC) is block-based and content-addressed. Each KV-cache block (default 16 tokens) is keyed by a hash of three things: the parent block's hash, the tokens in the block, and a small set of "extra hashes" for LoRA adapter IDs, multimodal input hashes, and per-tenant cache salts.
The block-size choice is the lever most teams miss. A small block (4–8 tokens) gives finer reuse — a divergence only kills the divergent block. A large block (32–64 tokens) cuts hash-table overhead and improves batching, but wastes more work on partial-prefix misses. The 16-token default is a reasonable middle for chat; for RAG with 4k–8k chunks, 16 or 32 is common.
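A quick way to reason about the block-size lever is to compute, for a given shared prefix, how many tokens each block size can reuse and how many block hashes a request costs. The helper names below are illustrative, and the model ignores batching effects entirely.

```python
def reused_tokens(shared_prefix_len: int, block_size: int) -> int:
    """Tokens served from cache: the shared prefix rounded down to a block boundary."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return (shared_prefix_len // block_size) * block_size


def blocks_to_hash(prompt_len: int, block_size: int) -> int:
    """Full blocks that must be hashed and looked up for one request."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return prompt_len // block_size


if __name__ == "__main__":
    shared, prompt = 1000, 1200  # hypothetical request: 1,000 shared tokens of 1,200
    for size in (4, 16, 64):
        reused = reused_tokens(shared, size)
        print(
            f"block={size:>2}  reused={reused}  "
            f"lost_to_rounding={shared - reused}  lookups={blocks_to_hash(prompt, size)}"
        )
```

Smaller blocks lose fewer tokens to rounding but multiply the number of lookups per request, which is the trade-off described above.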
The hash function got a security upgrade in v0.11 (April 2026). Before that, the default used Python's hash() of the serialized block — a salted SipHash, randomized per process, fine for collision avoidance but non-reproducible across restarts. As of v0.22.1, the default is sha256, with a new --prefix-caching-hash-algo CLI flag:
| Algorithm | Hash | Serialization | Reproducible | Notes |
|---|---|---|---|---|
| sha256 | SHA-256 | pickle | No | Default. Secure, but pickle is Python-version-sensitive. |
| sha256_cbor | SHA-256 | cbor2 | Yes | Recommended for multi-process or multi-language tiers. |
| xxhash | xxHash 128-bit | pickle | No | Faster, non-cryptographic. Multi-tenant risk must be assessed. |
| xxhash_cbor | xxHash 128-bit | cbor2 | Yes | Fastest with reproducibility. Same caveat. |
The multi-tenant caveat is the one to take seriously. If you serve multiple customers out of one engine and your hash function is non-cryptographic, a deliberate collision in a crafted prompt can evict another tenant's cache, or — in pathological cases — substitute their KV blocks with attacker-controlled values. If you don't control the prompts, stay on sha256 or sha256_cbor.
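One way to picture per-tenant isolation is to seed the hash chain with a tenant-specific salt, so identical prompts from different tenants never map to the same block keys. The sketch below is an assumption-laden illustration: `salted_block_hashes` and its `cache_salt` argument are hypothetical and do not mirror vLLM's internal key layout.

```python
import hashlib
from typing import List, Sequence


def salted_block_hashes(
    tokens: Sequence[int], cache_salt: str = "", block_size: int = 16
) -> List[str]:
    """Chained SHA-256 block hashes whose chain starts from a per-tenant salt."""
    hashes: List[str] = []
    # The salt seeds the chain, so it influences every block hash that follows.
    parent = hashlib.sha256(("salt:" + cache_salt).encode("utf-8")).hexdigest()
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


if __name__ == "__main__":
    prompt = list(range(64))
    tenant_a = salted_block_hashes(prompt, cache_salt="tenant-a")
    tenant_b = salted_block_hashes(prompt, cache_salt="tenant-b")
    print(tenant_a == salted_block_hashes(prompt, cache_salt="tenant-a"))  # same tenant reuses
    print(set(tenant_a).isdisjoint(tenant_b))  # different tenants never share keys
```

The cost of salting is that tenants stop sharing genuinely common prefixes such as a global system prompt, so it trades hit rate for isolation.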
A typical vLLM deploy turns APC on at serve time:
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 8 \
--enable-prefix-caching \
--prefix-caching-hash-algo sha256_cbor \
--max-model-len 32768 \
--gpu-memory-utilization 0.92
APC is enabled per server, not per request, which is the right granularity because the cache is a shared resource.
How SGLang does it: a radix tree
SGLang keeps a radix tree of cached prefixes. Each node represents a shared prefix across one or more requests; each leaf is a request-specific tail. The engine traverses the tree per request, reuses the longest matching prefix, and forks new branches where requests diverge.
The practical differences that matter in production:
- Match granularity is one token, not one block. SGLang reuses down to a single divergent token, recovering more of the cache than vLLM's block-level scheme on chatty workloads with mid-prompt variations (for example, an inserted tool result). The trade-off is per-token tree-walk overhead on every request.
- Eviction is LRU on nodes, not blocks. When memory pressure forces a prune, the whole subtree under the coldest node goes. Faster than vLLM's per-block LRU but coarser — a cold tail can take a warm subtree with it.
- Multi-LoRA / multimodal. SGLang stores per-request metadata at the leaves, so different LoRA adapters and image inputs sit naturally on different branches. vLLM achieves the same via the "extra hashes" component.
For most RAG and chat workloads, the two implementations deliver comparable hit rates. SGLang tends to win on many short shared prefixes (per-token matching helps); vLLM tends to win on very long shared prefixes (block-hash lookups are O(1) with a tiny constant).
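The token-level matching idea can be illustrated with a plain trie. This is a simplified stand-in, not SGLang's data structure: a real radix tree compresses single-child chains into one edge and attaches KV storage and eviction metadata to nodes, none of which is modeled here. `TokenTrie` is a hypothetical name.

```python
from typing import Dict, Sequence


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class TokenTrie:
    """Uncompressed token trie; a radix tree would merge single-child chains."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def longest_prefix(self, tokens: Sequence[int]) -> int:
        """Length of the longest cached prefix of `tokens`, in tokens."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


if __name__ == "__main__":
    trie = TokenTrie()
    first = list(range(100))
    trie.insert(first)

    second = first[:40] + [9000 + i for i in range(60)]
    token_level = trie.longest_prefix(second)
    block_level = (token_level // 16) * 16  # what a 16-token block scheme would keep
    print(token_level, block_level)  # 40 32
```

The gap between the two numbers is at most one block per request, which is why the difference matters most when shared prefixes are short.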
What you actually get at the metric level
| Workload | Median prefill saved | TTFT reduction | Caveat |
|---|---|---|---|
| RAG with 6k static context | 88–94% | 70–85% | Hit rate near 1.0 if the retrieved set is stable |
| Multi-turn chat, 8 turns | 60–80% (avg) | 30–55% | First turn is a miss; later turns reuse aggressively |
| Long-doc QA on a single PDF | 92–97% after first query | 75–90% | First query is a miss, all subsequent reuse |
| Open-ended Q&A (no shared prefix) | 0–5% | 0–5% | Don't bother enabling it |
| Tool-using agent loop | 40–70% per step | 20–45% | Tool result insertion breaks prefix mid-prompt |
Hit rate — the fraction of blocks already in the cache when a request arrived — is the single most useful number to instrument. If you turn on APC and your hit rate is below 30%, something is wrong: prefixes don't match, or the cache is being evicted before reuse.
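If your engine exports cumulative counters for blocks queried and blocks hit, a windowed hit rate is just the ratio of their deltas. The sketch below assumes such counters exist; `CacheCounters` and its field names are hypothetical and should be mapped to whatever your metrics endpoint actually exposes. The 30% threshold is the one suggested above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheCounters:
    """Snapshot of two cumulative counters (hypothetical field names)."""
    queried_blocks: int
    hit_blocks: int


def window_hit_rate(prev: CacheCounters, curr: CacheCounters) -> Optional[float]:
    """Hit rate between two snapshots, or None if no blocks were queried
    (or a counter reset, such as a restart, makes the delta meaningless)."""
    queried = curr.queried_blocks - prev.queried_blocks
    hits = curr.hit_blocks - prev.hit_blocks
    if queried <= 0 or hits < 0:
        return None
    return hits / queried


def needs_attention(rate: Optional[float], threshold: float = 0.30) -> bool:
    """True when a measured window falls below the threshold."""
    return rate is not None and rate < threshold


if __name__ == "__main__":
    morning = CacheCounters(queried_blocks=10_000, hit_blocks=8_500)
    evening = CacheCounters(queried_blocks=30_000, hit_blocks=12_500)
    rate = window_hit_rate(morning, evening)
    print(rate, needs_attention(rate))  # 0.2 True
```

Computing the rate per window rather than since startup is what makes a slow decay over the day visible.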
Common pitfalls
- Eviction is a silent killer. vLLM evicts blocks under GPU memory pressure with LRU. A mix of long-prefix and short-prefix traffic often evicts long-prefix blocks first (they take more slots), and they're the only ones whose loss actually hurts. Raise --gpu-memory-utilization from 0.85 to 0.92 and the working set of cached prefixes typically doubles. Monitor cache hit rate after 60 seconds of warmup: a rate that decays over the day is an eviction problem, not a workload problem.
- LoRA and multimodal mix badly if you forget the salt. vLLM's block hash includes LoRA IDs and image hashes; swap adapters at request time and you get cache thrash. Same for image inputs that vary per request: caching the multimodal prefix is essentially useless.
- Prefix caching does not save decode. A common dashboard mistake is to credit the entire speedup to APC. Decode time is unchanged. If your workload is decode-bound, APC helps very little.
- Hash algorithm migrations are not transparent. Changing --prefix-caching-hash-algo between deploys makes the new engine see zero hits until it warms back up. One-time cost, but a real incident if unexpected. Bake the algo into your Helm chart.
- Cross-replica cache sharing is hard. vLLM's APC lives in GPU memory; each replica has its own cache. A request landing on a cold replica pays full prefill. Disaggregated architectures (vLLM v0.22's kv_connector, SGLang's DistServe) can route prefix-matched requests to warm replicas, but that needs explicit config.
- The "first request after restart" problem. A rolling deploy invalidates the entire cache. The first 30–60 seconds after each deploy are prefill-bound. Schedule rolling deploys during low-traffic windows, or pre-warm with a synthetic-traffic sidecar.
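The eviction cliff is easy to reproduce in a toy simulation. The model below is deliberately crude and rests on assumptions: it caches whole prefixes rather than individual blocks, the traffic mix (three 400-block documents, each followed by five one-off 20-block prompts) is invented for illustration, and `PrefixLRU` is a hypothetical class, not an engine component.

```python
from collections import OrderedDict


class PrefixLRU:
    """Whole-prefix LRU with a capacity counted in KV blocks (toy model)."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self.used = 0
        self.entries: "OrderedDict[str, int]" = OrderedDict()

    def request(self, key: str, blocks: int) -> bool:
        """Return True on a hit; on a miss, insert and evict least-recent entries."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        if blocks > self.capacity:
            return False  # can never fit, so it is never cached
        while self.used + blocks > self.capacity:
            _, freed = self.entries.popitem(last=False)
            self.used -= freed
        self.entries[key] = blocks
        self.used += blocks
        return False


def long_prefix_hit_rate(capacity_blocks: int, rounds: int = 10) -> float:
    """Hit rate on the long documents only, under the invented traffic mix."""
    cache = PrefixLRU(capacity_blocks)
    docs = ["doc-a", "doc-b", "doc-c"]
    long_hits = long_total = short_id = 0
    for _ in range(rounds):
        for doc in docs:
            long_total += 1
            long_hits += cache.request(doc, 400)
            for _ in range(5):  # one-off short prompts that never repeat
                cache.request(f"short-{short_id}", 20)
                short_id += 1
    return long_hits / long_total


if __name__ == "__main__":
    for capacity in (1200, 2000):
        print(capacity, long_prefix_hit_rate(capacity))
```

In this setup the reuse distance of each document is 1,500 blocks, so a 1,200-block budget evicts every document just before it is needed again, while a 2,000-block budget misses only on the three cold starts. The hit rate falls off a cliff rather than degrading smoothly once the budget drops below the working set.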
When NOT to use it
Prefix caching is the wrong choice (or a wasted flag) if:
- Your prompts have no shared structure. Open-ended completion APIs, code-gen on a fresh repo per request, single-turn Q&A with no system prompt — there's nothing to reuse. Hit rate near zero, and you're paying hash-table overhead for nothing.
- You're under a strict determinism SLO that includes cache state. A cache hit and a cache miss should produce the same output for the same model and prompt, but float rounding in the attention kernel can occasionally yield a divergent token at extreme depths. If you need bit-exact reproducibility across requests, disable APC and accept the prefill cost.
- You can't budget enough GPU memory for the working set. A cache that misses more than it hits is worse than no cache: you spent memory on entries that never get reused, pushing decode batch sizes down. Measure first, enable second.
- Your traffic is dominated by mid-prompt insertions. Agent loops, multi-modal chat with per-turn image insertion, RAG with dynamic chunk re-ordering — these frequently insert new tokens mid-prompt, breaking the prefix. SGLang's per-token matching recovers more here, but workloads that are 50%+ mid-prompt insertions still see sub-30% hit rates in either engine.
- You're already prefill-bound on a single giant request. A 100k-token analysis pass per request, one request at a time, will hit a 100% miss on the first call and a 100% hit on the second if it ever comes. The amortized win depends entirely on whether those requests repeat, and most one-shot analytics workloads don't repeat.
TL;DR
- Prefix caching reuses the KV cache for the leading tokens of a request when a previous request already computed them. It only affects prefill; decode is unchanged.
- vLLM's Automatic Prefix Caching (APC) is a content-addressed block store. Each block is hashed by parent hash + block tokens + LoRA/multimodal/salt extras. Default block size is 16 tokens. Default hash since v0.22.1 is SHA-256, with sha256_cbor, xxhash, and xxhash_cbor available via --prefix-caching-hash-algo.
- SGLang uses a radix tree of token-level prefixes, which gives finer-grained matching at the cost of per-request tree-walk overhead.
- The win is real but workload-shaped. RAG with a stable retrieved set: 88–94% prefill saved. Multi-turn chat: 60–80% averaged. Open-ended Q&A: 0–5%. Measure your hit rate before you trust the marketing numbers.
- Eviction is the silent killer. Long-prefix blocks get evicted first under memory pressure. Size the cache budget explicitly and monitor hit rate over the day, not just at startup.
- Don't enable it on open-ended workloads, on a multi-tenant engine with a non-cryptographic hash, or when you can't afford the working-set memory. Measure first.
Next post: structured output at the decoding layer — JSON mode vs grammar-constrained decoding vs function calling, where the three diverge in latency and reliability, and the failure modes that show up only in production.