With a closed model, prompt caching is one documented contract. Claude has cache_control breakpoints; OpenAI and Gemini cache automatically above a token floor; the discounts are published and stable. You read one page and you're done.
Open weights break that assumption. The same Qwen or Llama checkpoint is served by a dozen hosts, and caching is not a property of the model — it's a property of where the model runs. To show how far that goes, here's one measured request — an identical ~4.7K-token prompt sent to the same Qwen model through a multi-provider router six times, no upstream pinned:
| Call | Upstream the router picked | Cost | Cached tokens |
|---|---|---|---|
| 1 | Upstream A | $0.0141 | 0 |
| 2 | Upstream B | $0.000709 | 0 (cold) |
| 3–6 | Upstream B | $0.000286 | 4,224 (warm) |
Same model, same router, same prompt: the bill ranged from $0.0141 to $0.000286 — a 49× spread — purely on which upstream the router chose and whether that upstream had the prefix warm.
Summary
- Prompt caching for open-weight models is a routing outcome, not a model feature. It's implemented — free and automatic — in the inference engine, then preserved or destroyed by every layer above it.
- Five layers: one provides caching, three can break it. The model (sets cacheability, serves no cache) → the inference engine (caching, free) → the compute host (productizes it, unevenly) → the gateway (multi-cluster routing) → the router (scatters across vendors with disjoint caches).
- Measured. An identical request, scattered by a router, cost 49× more on one pick than another; on one model, one host delivered 59.6% off and another 0%; published cache discounts span 0% to ~98% across models.
-
What to do. Pin your route so repeated prefixes hit the same warm cache; audit by the cost delta, not the
cached_tokensfield (which often reads 0 on a real hit); weigh latency separately — warm prefills run 2–10× faster even at ~0% cost discount.
Live figures were measured on 2026-06-14 against a multi-provider router and our own gateway, with a fixed ~4.7K-token English prompt, small
max_tokens, sequential runs. Documented pricing was checked against primary provider docs the same day and cross-verified adversarially. Ratios (percent discount, latency change) are the portable part; absolute dollars depend on the venue, your prompt, and load. Reproduce before quoting.
The cache types you'll actually meet
Before the stack, the vocabulary. Across open-weight hosts there are four distinct cache shapes, and they bill differently.
1. Automatic prefix caching (no markers). The dominant pattern. The server hashes your prompt prefix, reuses the KV state if it matches an earlier request, and applies the discount on its own — no cache_control, no code change, often impossible to disable. DeepSeek, Zhipu GLM, and most open-weight hosts do this. Writes are free; the cache lives anywhere from VRAM (minutes) to disk (DeepSeek keeps prefixes "a few hours to a few days").
2. Explicit breakpoint caching (cache_control). The Anthropic shape, which a few open-weight hosts also offer. Alibaba's Model Studio takes "cache_control": {"type": "ephemeral"} on a Qwen message block; some serving platforms expose an equivalent marker. You mark the boundary, pay a write surcharge, and get a deeper read discount in return.
3. Rented cache objects (with a storage fee). The one to watch. Moonshot's legacy moonshot-v1 family makes you POST /v1/caching to create a cache, then bills a write fee, a per-token-per-minute storage fee, and a per-call hit fee. Google's explicit Gemini caching is the same idea — input cost plus storage at roughly $1.00–$4.50 per 1M-tokens per hour. The cache is a resource you rent and must garbage-collect.
4. Self-host KV reuse (free). Run the weights yourself and the inference engine caches for free and automatically. No write fee, no read rate, no storage rental — a hit just skips prefill.
| Cache type | Markers? | Write fee | Storage fee | Where you meet it |
|---|---|---|---|---|
| Automatic prefix | No | Free | No | Most open-weight hosts; DeepSeek, GLM |
| Explicit breakpoint | cache_control |
Surcharge | No | Qwen (explicit mode); some platforms |
| Rented cache object | Create/TTL/delete | Yes | Yes | Moonshot moonshot-v1, Gemini explicit |
| Self-host KV reuse | No | Free | No | vLLM, SGLang, TensorRT-LLM |
Qwen on Model Studio offers both automatic and explicit modes, with a real tradeoff: implicit bills a hit at 20% of input with free writes; explicit bills a hit at 10% of input but charges 125% on the write and bounds the entry to a 5-minute TTL. Deeper discount, but you pay to populate and pay again each time it expires.
Where caching lives in the stack
Here is the key idea. Prompt caching for open weights is solved at exactly one layer and endangered at every layer above it. Walk the stack from the weights up, and at each layer ask: does this layer provide caching, or merely forward it — and can it break what the layer below already did?
request
|
v
+--------------------------------------------------+
| L5 router scatters across vendors | can break it
| L4 gateway multi-cluster routing | can break it
| L3 compute host uneven delivery | can break it
|==================================================|
| L2 inference engine CACHING LIVES HERE, free | <-- the cache is born here
|==================================================|
| L1 model cacheability: MLA / GQA | sets the ceiling
+--------------------------------------------------+
A cache hit is born at L2 and must survive L3-L5 routing to reach you;
every layer above L2 is a chance to land where your prefix isn't.
Layer 1 — The model: cacheability, not a cache
This is the layer most people think caching lives in — "DeepSeek has caching" — so it's the first one to get precise about. A checkpoint is a bag of weights; it runs the same attention whether or not a KV cache exists. It ships no cache, no discount, no TTL, no cache_control marker — those are serving-layer features. In that strict sense the weights provide no caching product.
But the weights are not neutral, and DeepSeek is the perfect example of why. The model's attention architecture decides how big the KV cache is, and therefore how cheap caching can ever be:
- DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache into a low-rank latent — to roughly 4–14% of a standard multi-head cache. That compression is exactly what lets DeepSeek's API persist prefixes to disk and price a cache read at ~2% of input. The architecture is the enabler; the disk cache is a product built on top of it.
- Grouped-Query Attention (GQA) — used by Llama, Qwen, Mistral, and DeepSeek — shares KV heads to shrink the cache by the group factor (≈8× on Llama-3).
So Layer 1's contribution is cacheability, not a cache: the architecture sets the ceiling on how cheap every layer above can make caching, but the weights never serve a cached token themselves. And "DeepSeek has caching" quietly merges two different things wearing the same name — the weights (this layer, which give you MLA) and DeepSeek's API and serving stack (Layers 2–3, which give you the disk cache, the discount, and the usage fields). Download the open weights and run them yourself and you keep MLA's small KV cache, but the disk-cache product stays on DeepSeek's servers — you inherit whatever Layer 2 you deploy in its place. So the operational move still holds: stop asking whether a model caches and start asking where it's served — just don't mistake that for the architecture not mattering. It sets the ceiling; the path sets what you actually get.
Layer 2 — The inference engine: where caching is built, and free
One layer up, caching is not just present — it's solved, and free. Modern inference engines cache prefixes automatically:
- vLLM — Automatic Prefix Caching: hashes each KV block, reuses any block whose prefix hash it has seen, LRU-evicts. On by default in V1.
- SGLang — RadixAttention: stores the KV cache in a radix tree so any shared prefix is reused, with cache-aware scheduling.
-
TensorRT-LLM — block reuse (
enable_block_reuse, default on), with optional offload of KV blocks to host memory.
Projects like LMCache extend this further — offloading KV to CPU/disk and sharing it across instances, which is the seed of solving the routing problem we're about to hit. The point: if you self-host, you are done. Caching is automatic, costs nothing beyond the GPUs you already run, evicts by LRU, and you own it — a hit simply skips prefill, lowering TTFT and raising throughput. There is no cached_tokens billing field because nothing is billed; the payoff shows up in your own latency metrics. For a closed model you rent caching; for an open one you can own it outright. The catch is the inverse of the hosted world: the cache is ephemeral (VRAM, LRU), so it survives only while the prefix stays hot — which is precisely what the layers above must preserve.
Layer 3 — The compute host: productizing it, unevenly
Commercial inference hosts wrap Layer 2 and run fleets of replicas. They inherit free automatic caching — the question is whether they implement it well, and the answer is mixed on two axes.
First, exposure and price vary wildly. Among the major open-weight hosts: one applies a flat 50% to cached input and lets cached tokens skip rate limits; another defaults to 50% off on serverless; a third prices cached input per model (e.g. a Qwen tier at ~80% off) and exposes a cache-key hint to improve affinity; a fourth makes caching always-on and undiscloseable on dedicated endpoints. Same underlying engine, four pricing philosophies.
Second — and this is the first place caching breaks — the multi-replica problem. Your warm prefix lives in the VRAM of the one replica that served the cold request. The host's own load balancer may send your next request to a different replica with a cold cache. We saw exactly this: pinning the same Qwen model to one upstream at a time and running cold→warm:
| Pinned upstream | Cold | Warm | Discount | cached_tokens |
|---|---|---|---|---|
| Provider A | $0.000709 | $0.000286 | 59.6% | 4,224 ✓ |
| Provider B | $0.000662 | $0.000662 | 0% | 0 |
Provider A cached cleanly and reported it. Provider B — which advertises a cache-read price for this model — returned no discount across a cold call and two warm calls in our test. Whether that's eligibility, replica fan-out, or a longer warm-up than two requests, the measured result on this path was zero. The capability is solved at Layer 2; whether you actually receive it is a Layer-3 execution detail, and it differs by host.
Layer 4 — The gateway: the multi-cluster problem
A gateway sits in front of one or more upstreams and multiplies the replica problem into a cluster problem. If it round-robins requests across clusters or providers without cache affinity, the warm cache becomes structurally unreachable — every request lands somewhere the prefix isn't. A cache-aware gateway must route by prefix hash so identical prefixes stick to the same upstream, the same way Layer 2 sticks them to the same KV blocks.
We ran a cold→warm battery across open-weight models on a third-party gateway, reading the per-request cost directly:
| Model | Cold | Warm | Discount | Latency |
|---|---|---|---|---|
deepseek-v4-pro |
$0.00189 | $0.0000155 | 99.2% | 6.0s → 1.1s |
deepseek-v4-flash |
$0.000564 | $0.0000116 | 97.9% | 4.9s → 1.2s |
qwen3.5-flash |
$0.000561 | $0.0000853 | 84.8% | 10.2s → 1.0s |
kimi-k2.5 |
$0.00242 | $0.000469 | 80.6% | 3.2s → 1.2s |
qwen3-max |
$0.00350 | $0.00336 | 3.8% | 2.2s → 1.1s |
qwen3.5-plus |
$0.00114 | $0.00114 | 0.0% | 1.8s → 1.0s |
DeepSeek-V4 hit 97–99% (affinity working end to end); qwen3.5-plus and qwen3-max returned ~0% on the warm call despite carrying a cache-read price in the catalog. Two more gateway lessons hide in this table:
-
The usage field lies; the cost doesn't.
cached_tokensread 0 on every call here, including the 99% cost drops. Many OpenAI-compatible gateways don't populate the cached-token field for upstreams that cache automatically. Audit by thecostdelta between a cold and warm call, not by the token field — the same lesson as auditing a gateway's cache claims. -
Latency improves even when cost doesn't. Every warm call was 2–10× faster —
qwen3.5-flashwent 10.2s→1.0s — including the ~0%-discount ones. A hit skips prefill regardless of how the host prices it, so caching can pay off in TTFT on a gateway that gives you nothing on the bill.
A gateway that doesn't preserve affinity hands you a cache you can't reach; one that doesn't surface cache cost hands you one you can't verify.
Layer 5 — The router: random distribution across providers
At the top, a multi-provider router load-balances one model ID across different companies' clusters — each with a separate cache. Now even perfect affinity within a provider can't save you: if call 1 goes to one vendor and call 2 to another, there is no shared cache to hit. This is the scatter from the top of this post, and it compounds Layer 4 — not just multiple clusters, but multiple vendors with disjoint cache state and disjoint prices (the priciest pick billed 20× the cheapest upstream's base rate). The cache only engaged once routing happened to stick to one provider.
The fix is to collapse the randomness — make routing deterministic so repeated prefixes land on the same warm cache:
# Pin the upstream; otherwise load-balancing scatters you across disjoint caches.
# (field names follow a common multi-provider router's API)
import requests
requests.post(f"{ROUTER_BASE}/chat/completions",
headers={"Authorization": f"Bearer {API_KEY}"},
json={
"model": "qwen/qwen3.5-35b-a3b",
"messages": messages,
"usage": {"include": True}, # return cost + cached_tokens
"provider": { # the part that makes caching work
"order": ["<your-chosen-upstream>"],
"allow_fallbacks": False,
},
})
To its credit, the router did report cached_tokens (4,224 on the hit) and a per-request cost, so you can verify both — better than the Layer-4 gateway that read 0. But the routing is yours to constrain. Caching is a routing problem dressed up as a pricing feature: the cache is free at Layer 2, and Layers 3, 4, and 5 are three escalating ways to route yourself away from it.
How deep is the discount? It's all over the map
When the routing does line up, how much do you save? For closed models the cache-read discount clusters near 90%. For open weights the published cache-read price ranges from a token gesture to near-total, even within one vendor's lineup. First-party published rates:
| Model (first-party / mode) | Input $/M | Cache read $/M | Discount | Layer-2 type |
|---|---|---|---|---|
| DeepSeek-v4-flash | 0.14 | 0.0028 | ~98% | auto disk |
| DeepSeek-v4-pro | 1.74 | 0.145 | ~92% | auto disk |
| Qwen (explicit mode) | base | 0.10× base | 90% | explicit |
| Kimi K2.6 | 0.95 | 0.16 | ~83% | auto |
| GLM-5 | 1.0 | 0.20 | 80% | auto implicit |
| Qwen (implicit mode) | base | 0.20× base | 80% | auto |
DeepSeek's automatic disk cache is the deepest in the field — deepseek-v4-flash reads cached input at $0.0028/M against a $0.14/M miss, a 1:50 ratio, which our Layer-4 test reproduced at 97.9%. Third-party hosts of these same open weights price cached input independently — some apply a flat ~50%, others vary per model from ~50% to ~90% — so the discount you get is a function of which host you land on, not just the model. Same feature name, a 48-point spread.
Because the discount is a venue property, one model carries different cache economics everywhere it's served. deepseek-v4-pro, four ways:
| Where (layer) | Cache-read discount | Source |
|---|---|---|
| First-party API (L3) | ~92% ($1.74 → $0.145) | documented |
| Third-party host A (L3) | ~89% ($1.74 → $0.20) | documented |
| Third-party host B (L3) | ~92% ($1.6 → $0.135) | documented |
| Third-party gateway (L4) | 99.2% | measured (cold→warm) |
"DeepSeek-V4-Pro supports caching" is true and nearly useless; the operational question is "supports caching where, at what rate, reported how."
A decision checklist
- ✅ The model sets the ceiling, not the cache (Layer 1). Its attention architecture (MLA, GQA) decides how cheap caching can be, but it never serves a cached token — so still ask where it's served and what that host's stack does.
- ✅ Self-hosting? You already have it free (Layer 2). Confirm automatic prefix caching is on (it is by default in vLLM/SGLang) and watch your prefix hit rate.
- ✅ On a compute host, verify delivery, not the price column (Layer 3). A cache-read price is a claim; measure a cold→warm cost delta. Use a cache-key affinity hint where the host offers one.
- ✅ Through a gateway, demand cache-affinity routing and cost reporting (Layer 4). If identical prefixes don't stick to one upstream, or
costdoesn't drop on a warm call, the cache is unreachable or unverifiable. - ✅ On a router, pin the upstream (Layer 5). Constrain routing (e.g. a provider-order field with fallbacks off), or you forfeit hits to load-balancing across disjoint caches — and risk a 20–50× pricier upstream.
- ✅ Weigh latency separately from cost. Warm prefills are 2–10× faster even when the dollar discount is ~0.
- ✅ Watch for storage-fee cache types. Rented caches (Moonshot
moonshot-v1, Gemini explicit) bill per-token-time for an idle cache; automatic prefix caches don't.
Bottom line
For closed models, "does it cache?" has one answer. For open weights the capability was solved years ago at the inference-engine layer — vLLM and SGLang cache every prefix, for free, automatically. Everything above that layer is plumbing that either preserves the hit or scatters you away from it: a compute host's replica balancer, a gateway's cluster routing, a router's random spread across vendors. The model's architecture sets the ceiling on how cheap caching can be — MLA and GQA are real, model-level wins — but the path your request takes decides what you actually get. Treat cache behavior as a routing property — measure it in cost terms on the exact path you'll run, pin the route so the cache you warmed is the one you hit, and remember that the deepest discount in the world is worth nothing if request two lands somewhere request one never touched.
For the mechanics of why a KV cache exists and how TTLs work, start with How KV Cache & TTL Work; to audit a gateway's cache claims, see Does Your LLM Gateway Lie About Cache?.
FAQ
Do open-weight models support prompt caching?
The weights set how cheap caching can be — attention architectures like MLA and GQA shrink the KV cache — but the cache itself, the discount, and the API come from the serving stack. Caching is implemented in the inference engine (vLLM, SGLang, TensorRT-LLM), inherited by compute hosts, and forwarded (or scattered) by gateways and routers. Ship the same checkpoint to three hosts and you can get free automatic caching, none, or explicit-only.
Why did the same model cost 49× more on one call than another?
On a multi-provider router, an un-pinned request is load-balanced across different vendors' clusters with different base prices and disjoint cache state. One call hit a pricey provider cold; another hit a cheap one warm. Pin the upstream (constrain provider order, fallbacks off) to control both.
If I self-host, do I need to pay for caching?
No. Automatic prefix caching in vLLM, SGLang, and TensorRT-LLM is on by default and free — a hit just skips prefill. You pay only for the GPUs you already run, and the cache is yours, evicted by LRU when VRAM is needed.
The API says cached_tokens: 0 but my bill dropped — did caching work?
Probably yes. Many gateways don't populate cached_tokens for upstreams that cache automatically. Trust the cost field: a large drop between a cold and an identical warm call means the cache hit.
Which open-weight model has the deepest cache discount?
DeepSeek's automatic disk cache: deepseek-v4-flash reads cached input at ~$0.0028/M against $0.14/M uncached (~98% off), reproduced at 97.9–99.2% across the V4 line in our cold→warm tests. Many third-party hosts apply a flat ~50% instead.
Is there a catch with caches that charge a storage fee?
Yes — Moonshot's moonshot-v1 explicit cache and Gemini's explicit cache bill per-token-time to keep the cache alive (Gemini ~$1–4.50 / 1M-tokens / hour). An idle cache you forgot to delete keeps charging. Automatic prefix caches have no storage fee.
Verification: live cost/latency figures measured 2026-06-14 against a multi-provider router and our own gateway, using a fixed ~4.7K-token prompt, small max_tokens, sequential cold→warm runs; discounts computed from the returned per-request cost. Documented pricing and cache mechanics checked against primary provider docs the same day and cross-verified adversarially; a few vendor figures (notably Moonshot's explicit-cache fees) move frequently — confirm current values before quoting. Your numbers will vary with provider, prompt, region, and load.
Sources
- DeepSeek — Pricing
- DeepSeek — KV cache / Context Caching guide
- DeepSeek-V3 Technical Report — MLA (KV-cache compression)
- GQA: Training Generalized Multi-Query Transformer Models (Ainslie et al.)
- Alibaba Cloud Model Studio — context cache & pricing
- Moonshot AI — Context Caching
- Zhipu / Z.AI — pricing & caching
- vLLM — Automatic Prefix Caching
- SGLang — RadixAttention / cache
- LMCache — KV cache offloading & sharing
- Google — Gemini context caching
All checked 2026-06-14. Not financial advice; verify current pricing before relying on it.
Top comments (0)