You turned on prompt caching, the hit counter ticks now and then, but your bill barely moved. Before blaming your prompt structure, look at something the dashboard hides: which upstream actually served each request.
Multi-provider gateways spread a single model across several upstream providers and pick one per request. Prompt caches are per-provider (often per-node inside a provider). So when your second identical request lands on a different upstream than the first, it is a cache miss, even though your prompt did not change by one byte. This is provider drift, and under pay-per-token pricing it quietly multiplies your cost.
The two conditions that trigger it
This is not a misconfiguration you opted into. It is what you get out of the box:
- Default auto routing. The request is sent to the model without pinning an upstream, so the gateway chooses one per call.
- Default provider sort = "default (balanced)". The gateway load-balances across eligible upstreams rather than sticking to one.
Both are the factory defaults. You do not have to touch anything to get drift; you have to touch settings to avoid it.
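As a rough sketch of what "touching settings" can look like: some multi-provider gateways accept a provider-preference object in the request body. The field names below (`provider`, `order`, `allow_fallbacks`) and the upstream name are assumptions for illustration, not confirmed for the gateway tested here; check your gateway's routing documentation for the real ones.

```python
def pinned_extra_body(upstream: str) -> dict:
    """Build request-body extras that ask the gateway to use one upstream only.

    ASSUMPTION: the `provider`, `order`, and `allow_fallbacks` field names are
    illustrative; your gateway may use different names or not support pinning.
    """
    return {
        "provider": {
            "order": [upstream],       # try this upstream first
            "allow_fallbacks": False,  # do not silently reroute elsewhere
        },
        "usage": {"include": True},    # same usage flag the probe in this article sends
    }


body = pinned_extra_body("example-upstream")  # hypothetical upstream name
print(body)
```

With the OpenAI Python SDK, a dict like this would be passed as `extra_body=body` on `client.chat.completions.create(...)`. Whether the gateway honors it is something to verify by measurement, not assume.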
What 20 identical requests look like
We sent the same ~8K-token prefix 20 times in a row to one popular multi-provider gateway, on the defaults above, asking for the upstream's own reported provider and cache fields each time. For a disk-cached model in the DeepSeek family:
- 9 distinct upstreams served the 20 calls: N***a, S***w, M***h, D***a, A***L, P***l, S***e, V***e, A***d.
- Cache hit rate: 4/20 (20%). You only hit on the calls that happened to land on an upstream that had already cached your prefix.
Run the same 20 calls against a single-backend gateway (one model, one upstream, no balancing) and the hit rate is 19/20 (95%) on the identical workload. Same model, same prompt, same number of calls. The only variable is whether routing drifts.
For contrast, on the very same multi-provider gateway a GPT-class model was routed to one upstream (A***e) for all 20 calls and hit 19/20. Drift is not uniform; it bites whichever model the gateway happens to spread, and on this run that was the DeepSeek-family model.
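To make the mechanism concrete, here is a toy simulation, not a model of any real gateway. It assumes uniform random routing, one cache per upstream, and caches that never expire; the function name and parameters are ours. Under those assumptions hits equal calls minus distinct upstreams touched, so a single backend yields 19/20, and nine upstreams cap the run at 11 hits. The measured 4/20 is lower still, which fits the per-node caveat mentioned earlier.

```python
import random


def simulate_hits(n_calls: int, n_upstreams: int, seed: int = 0) -> tuple[int, int]:
    """Toy model: each call goes to a uniformly random upstream, and an upstream
    serves a cache hit only if it has seen the prefix before.

    ASSUMPTION: caches never expire and each upstream is a single node, so this
    is an upper bound on hits, not a prediction.
    Returns (hits, distinct_upstreams_used).
    """
    rng = random.Random(seed)
    warm: set[int] = set()  # upstreams that already hold the prefix
    hits = 0
    for _ in range(n_calls):
        upstream = rng.randrange(n_upstreams)
        if upstream in warm:
            hits += 1
        else:
            warm.add(upstream)  # first visit is a miss that warms the cache
    return hits, len(warm)


single = simulate_hits(20, 1)
spread = simulate_hits(20, 9)
print("single backend:", single)
print("nine upstreams:", spread)
```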
Conclusion A: the cost you expected vs the cost you paid
Per-call cost on the drifting model split cleanly by cache outcome:
| call type | median cost / call |
|---|---|
| cache hit | ~$0.00015 |
| cache miss | ~$0.00062 |
A miss costs about 4x a hit on this model (on raw input tokens the published gap is wider still, roughly 50x). Now total it across the 20 calls:
| scenario | hit rate | cost for 20 identical calls |
|---|---|---|
| expected (cache reachable) | 95% | $0.0026 |
| actual (default drift) | 20% | $0.0102 |
Same model, same prompt, same 20 requests. Provider drift made the run cost ~3.9x more. The caching was "on" the whole time; the routing layer simply billed most of your tokens at the miss rate. Scale that to a production endpoint replaying a large stable prefix all day and the gap is the bulk of your input spend.
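For projecting this onto your own traffic, a back-of-envelope estimator can blend the per-call hit and miss costs by hit rate. This is a sketch with names of our choosing. Because it plugs in the median per-call costs rather than summing the actual charges, its output will not match the measured totals in the table exactly; treat it as a rough guide to direction and scale.

```python
def estimated_cost(n_calls: int, hit_rate: float,
                   hit_cost: float, miss_cost: float) -> float:
    """Blend per-call hit and miss costs by hit rate."""
    hits = n_calls * hit_rate
    return hits * hit_cost + (n_calls - hits) * miss_cost


HIT, MISS = 0.00015, 0.00062  # median per-call costs reported in this article

drift = estimated_cost(20, 0.20, HIT, MISS)
pinned = estimated_cost(20, 0.95, HIT, MISS)
print(f"20% hit rate: ${drift:.5f}")
print(f"95% hit rate: ${pinned:.5f}")
print(f"measured totals ratio: {0.0102 / 0.0026:.1f}x")
```

Swap in your own call volume and per-call costs to see how much of your spend depends on the hit rate alone.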
Conclusion B: no cache also means no latency win
Caching is not only a cost lever. A warm prefill returns the first token sooner. When drift denies you the cache, you forfeit that speedup too. We measured time-to-first-token (TTFT) on repeated identical calls:
GPT-class model (routed to one consistent upstream, cache reachable):
| call | TTFT |
|---|---|
| 1st (cold, miss) | ~1760 ms |
| subsequent (warm, hit) | ~1130 ms |
Caching buys roughly a 36% faster first token, and it is steady: every warm call lands in a tight band.
DeepSeek-family model (default drift, cache rarely reachable):
- Cache hits across a 10-call repeat: 0.
- TTFT swung from ~1000 ms to ~4500 ms call to call, with occasional empty responses.
Because almost every request lands on a fresh upstream, you stay at cold-prefill latency and inherit the variance of whichever provider answered. The GPT-class model got a 36% TTFT improvement from a reachable cache; the drifting model got none, plus a 4.5x spread between its fastest and slowest call.
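If you want to take the same kind of measurement, one possible approach is to time a streamed request until its first chunk arrives. The helper names below are ours, and the fake stream stands in for a real response so the sketch runs offline. Note that on some gateways the first chunk may carry no text, so this approximates TTFT.

```python
import time
from typing import Callable, Iterable


def ttft_ms(start_stream: Callable[[], Iterable]) -> float:
    """Milliseconds from starting a streamed request until its first chunk arrives."""
    t0 = time.perf_counter()
    for _ in start_stream():  # the request is issued inside start_stream()
        break                 # stop as soon as the first chunk arrives
    return (time.perf_counter() - t0) * 1000.0


def speedup_pct(cold_ms: float, warm_ms: float) -> float:
    """Percentage reduction in TTFT going from a cold call to a warm one."""
    return (cold_ms - warm_ms) / cold_ms * 100.0


def fake_stream():
    """Stand-in for a real streamed response so the sketch runs offline."""
    time.sleep(0.05)
    yield "first chunk"


measured = ttft_ms(fake_stream)
print(f"fake TTFT: {measured:.0f} ms")
print(f"warm vs cold: {speedup_pct(1760, 1130):.0f}% faster")
```

Against a real gateway, `start_stream` could be a lambda wrapping `client.chat.completions.create(..., stream=True)`, so the timer covers sending the request as well as waiting for the first chunk.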
Audit your own setup in five minutes
Do not trust these numbers, or anyone's. Send the same long prefix several times and watch two fields. No domains hardcoded; point it at your own gateway with env vars.
import os
import uuid

from openai import OpenAI

# GW_KEY, GW_BASE and GW_MODEL come from your environment; nothing is hardcoded.
client = OpenAI(api_key=os.environ["GW_KEY"], base_url=os.environ["GW_BASE"])

# A unique probe ID keeps this prefix from matching anything cached earlier.
SYS = f"[probe {uuid.uuid4().hex}]\n\n" + ("You are a support assistant. " * 300)

seen, hits = {}, 0
for i in range(20):
    r = client.chat.completions.create(
        model=os.environ["GW_MODEL"], max_tokens=16,
        messages=[{"role": "system", "content": SYS},
                  {"role": "user", "content": f"q{i}"}],
        extra_body={"usage": {"include": True}})
    # Cached-token count; treated as 0 when the gateway omits usage details.
    det = r.usage.prompt_tokens_details if r.usage else None
    cached = (getattr(det, "cached_tokens", 0) or 0) if det else 0
    # "provider" is a gateway extension; it is None when not exposed.
    provider = r.model_dump().get("provider")
    seen[provider] = seen.get(provider, 0) + 1
    hits += 1 if cached else 0

print(f"hit rate {hits}/20; upstreams seen: {len(seen)}")
More than one upstream for the same model means drift. A hit rate well below the share of your requests that reuse an unchanged prefix means drift is taxing you. The fuller method is in "Does Your LLM Gateway Lie About Cache?"
What to look for
The cure for drift is structural: route a given model to a consistent backend so a warm cache is actually reachable on the next request, instead of load-balancing each call onto a fresh upstream that has never seen your prefix. When you evaluate a gateway, send the same prefix 20 times and count the upstreams. One is what you want. Nine is a tax.
A fair caveat: prompt caching is best-effort everywhere, and on disk-cached models the hit rate still softens over long idle gaps even with a single backend. Eliminating drift does not hand you an infinite cache. It removes the largest and most wasteful source of misses, the one you never agreed to and cannot see.
Closing
"Supports prompt caching" and "your cache is reachable" are different claims. A gateway that scatters one model across a rotating cast of upstreams can report cache support truthfully while delivering a 20% hit rate, a ~4x bill, and first-token latency that swings 4.5x. The number to watch is not whether caching is advertised. It is your measured hit rate and how many upstreams your identical requests touch. Run the probe and let the data settle it.
For the broader audit method, see "Does Your LLM Gateway Lie About Cache?"; for why caches exist at all, see "How KV Cache & TTL Work."
FAQ
Is this a misconfiguration on my side?
No. It happens on the factory defaults: auto routing with the provider sort left at "default (balanced)." Avoiding drift requires actively pinning an upstream, not the other way around.
Does pinning one upstream fix it?
It removes cross-provider drift, but a single upstream often runs multiple replicas without prefix affinity, so hits can still flip-flop. Measure after pinning rather than assuming.
Why did the GPT-class model not drift?
On this run the gateway happened to route it to a single upstream. Drift is per-model and depends on how many eligible upstreams the gateway balances across; it is not uniform.
Is the cost gap really ~4x?
On the per-call totals we measured, a miss was ~4x a hit; on raw input-token pricing for this model class the published hit-vs-miss gap is closer to 50x. Either way, turning expected hits into misses is the expensive part.
What single metric should I monitor?
Cache hit rate per model over time, alongside the count of distinct upstreams per model. If hit rate falls or upstream count rises, your effective token cost just went up.
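A minimal sketch of tracking both numbers from your own request logs follows. The record keys (`model`, `provider`, `cached_tokens`) and the sample data are hypothetical; adapt them to whatever your gateway actually returns.

```python
from collections import defaultdict


def summarize(records):
    """Per-model cache hit rate and distinct-upstream count.

    Each record is a dict with hypothetical keys: "model", "provider"
    (upstream name as reported by the gateway) and "cached_tokens".
    """
    calls = defaultdict(int)
    hits = defaultdict(int)
    upstreams = defaultdict(set)
    for rec in records:
        model = rec["model"]
        calls[model] += 1
        hits[model] += 1 if rec.get("cached_tokens") else 0
        upstreams[model].add(rec.get("provider"))
    return {
        m: {"hit_rate": hits[m] / calls[m], "upstreams": len(upstreams[m])}
        for m in calls
    }


# Made-up sample log for illustration only.
sample = [
    {"model": "model-a", "provider": "up-1", "cached_tokens": 0},
    {"model": "model-a", "provider": "up-1", "cached_tokens": 7936},
    {"model": "model-b", "provider": "up-1", "cached_tokens": 0},
    {"model": "model-b", "provider": "up-2", "cached_tokens": 0},
]
report = summarize(sample)
for model, stats in report.items():
    print(model, stats)
```

Running this over a rolling window and alerting when `hit_rate` drops or `upstreams` climbs is one way to catch drift before it shows up on the bill.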