DEV Community

synthorai

Posted on • Originally published at synthorai.io

Does Your LLM Gateway Lie About Cache? A 5-Min Audit

A gateway sits between your code and the model provider. You read cached_tokens back from the response, you see a smaller number, and you trust the dollars saved are real. But you never see the upstream call. The gateway could report a cache hit and still bill the full input rate. It could fail to cache at all behind a perfectly clean response. It could strip usage metadata on streaming, the path most of your production traffic runs on, so you can't tell either way.

This isn't hypothetical. A Hacker News PSA reported that routing DeepSeek V4 through a popular gateway returned 2–3× fewer cached tokens than calling DeepSeek directly; one commenter posted bills showing the caching stats weren't reported through the gateway at all. The gateway's team replied that they couldn't reproduce it and were investigating. That disagreement is the whole point. When two parties can't agree on whether your cache is working, the only tiebreaker is a measurement you ran yourself.

Usually this isn't malice. It's a translation gap or an unfinished code path. The effect on your invoice is the same either way. This post is one runnable script that audits both styles of prompt caching, automatic (DeepSeek) and marker-based (Claude), against any gateway, including this one. It prints a side-by-side scorecard in under five minutes.


Four ways a gateway can lie about cache

| Failure mode | What you see | What's actually happening |
|---|---|---|
| Silent no-cache | A clean response, no error | Nothing was cached; you pay full price every call |
| Cache theater | cached_tokens > 0 in the response | …but the billed cost is the full input rate |
| Markup creep | A plausible cost number | The gateway's markup quietly eats the discount |
| Metadata blackout | Clean text output | Usage fields stripped (esp. on streaming), so you can't audit it |

The dangerous ones are the first two: the response looks like caching is working. You find out at the end of the month.


Two cache mechanisms, one audit

Providers expose caching in two shapes, and a real gateway has to pass both through faithfully:

  • Automatic (DeepSeek, GPT, Gemini, Qwen): the provider caches any sufficiently long prefix on its own. No markers. Hits appear in usage.prompt_tokens_details.cached_tokens.
  • Marker-based (Anthropic Claude): you tag cacheable spans with cache_control. Hits appear as cache_read_input_tokens.
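To make the two shapes concrete, here is a minimal sketch that reads the cached count from either usage payload, treated as a plain dict. The helper name `cached_tokens_from_usage` is hypothetical, and the sample payloads are illustrative values shaped like the fields named above, not captured responses.

```python
def cached_tokens_from_usage(usage: dict) -> int:
    """Return the cached-token count from either usage shape, or 0 if absent."""
    # Marker-based shape: a top-level cache_read_input_tokens field.
    if usage.get("cache_read_input_tokens") is not None:
        return int(usage["cache_read_input_tokens"])
    # Automatic shape: nested under prompt_tokens_details.cached_tokens.
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


# Illustrative payloads (assumed values, not real responses).
auto_usage = {"prompt_tokens": 7870, "prompt_tokens_details": {"cached_tokens": 7552}}
marker_usage = {"input_tokens": 8, "cache_read_input_tokens": 12446,
                "cache_creation_input_tokens": 0}

print(cached_tokens_from_usage(auto_usage))    # 7552
print(cached_tokens_from_usage(marker_usage))  # 12446
```

A missing or null field falls through to 0, so a stripped usage block reads as "no hit" instead of raising.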

The script hides that difference behind a thin Lane adapter, then runs all five checks against both. Here is the whole thing: two lanes and one audit() that performs every check.

import os, time, uuid
from openai import OpenAI
from anthropic import Anthropic

KEY  = os.environ["GATEWAY_KEY"]
oai  = OpenAI(api_key=KEY,    base_url="https://synthorai.io/v1")   # auto lane
anth = Anthropic(api_key=KEY, base_url="https://synthorai.io/")     # marker lane

class AutoLane:      # DeepSeek / GPT / Gemini / Qwen: provider caches automatically
    mode = "auto"
    def __init__(self, model): self.model = model
    def call(self, sys, q, stream=False):
        if stream:
            cached = cost = None
            s = oai.chat.completions.create(model=self.model, max_tokens=48, stream=True,
                stream_options={"include_usage": True},
                messages=[{"role":"system","content":sys},{"role":"user","content":q}])
            for ev in s:
                if ev.usage:
                    d = ev.usage.prompt_tokens_details
                    cached, cost = (d.cached_tokens if d else None), getattr(ev.usage,"cost",None)
            return {"cached": cached or 0, "cost": cost, "prompt_total": None}
        u = oai.chat.completions.create(model=self.model, max_tokens=48,
            messages=[{"role":"system","content":sys},{"role":"user","content":q}]).usage
        cached = u.prompt_tokens_details.cached_tokens if u.prompt_tokens_details else 0
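        # "cost" is a gateway extension, not a standard usage field; tolerate its absence.
        return {"cached": cached or 0, "cost": getattr(u, "cost", None), "prompt_total": u.prompt_tokens}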

class MarkerLane:    # Anthropic Claude: explicit cache_control markers
    mode = "marker"
    def __init__(self, model): self.model = model
    def call(self, sys, q, stream=False):
        block = {"type":"text","text":sys,"cache_control":{"type":"ephemeral"}}
        if stream:
            with anth.messages.stream(model=self.model, max_tokens=48, system=[block],
                    messages=[{"role":"user","content":q}]) as s:
                for _ in s.text_stream: pass
                u = s.get_final_message().usage.model_dump()
            return {"cached": u.get("cache_read_input_tokens") or 0,
                    "cost": u.get("cost"), "prompt_total": None}
        u = anth.messages.create(model=self.model, max_tokens=48, system=[block],
            messages=[{"role":"user","content":q}]).usage.model_dump()
        # model_dump() can return None for these keys, so coerce to 0 before adding.
        read, created = (u.get("cache_read_input_tokens") or 0), (u.get("cache_creation_input_tokens") or 0)
        return {"cached": read, "cost": u.get("cost"),
                "prompt_total": u.get("input_tokens",0) + read + created}

def audit(lane, long_prompt):
    SYS = f"[audit {uuid.uuid4().hex}]\n\n" + long_prompt    # unique => guaranteed cold start
    r = {"lane": lane.model, "mode": lane.mode}

    # CHECK 1: cache engages. Cold misses; a repeat should hit. A cache can
    # take a moment to become readable, so poll the warm read (sleep 1s between
    # attempts) before concluding "no cache".
    cold = lane.call(SYS, "Q1")
    warm = cold
    for i in range(4):
        warm = lane.call(SYS, f"warm {i}")
        if warm["cached"] > 0: break
        time.sleep(1.0)
    r["cold"], r["warm"] = cold, warm
    r["check1"] = cold["cached"] == 0 and warm["cached"] > 0

    # CHECK 2: cost reflects the discount (catches "cache theater").
    disc = (1 - warm["cost"]/cold["cost"])*100 if cold["cost"] and warm["cost"] else None
    r["discount"], r["check2"] = disc, (disc is not None and disc > 30)

    # CHECK 3: token accounting. cached fits inside the prompt total.
    r["check3"] = warm["prompt_total"] is None or warm["cached"] <= warm["prompt_total"]

    # CHECK 4: streaming preserves usage metadata (cache count AND cost).
    st = lane.call(SYS, "stream", stream=True)
    r["stream_cached"], r["stream_cost"] = st["cached"] > 0, st["cost"] is not None
    r["check4"] = r["stream_cached"] and r["stream_cost"]

    # CHECK 5: negative control. a unique prefix must always miss.
    n1 = lane.call(f"[uniq {uuid.uuid4().hex}]\n\n"+long_prompt, "x")
    n2 = lane.call(f"[uniq {uuid.uuid4().hex}]\n\n"+long_prompt, "y")
    r["check5"] = n1["cached"] == 0 and n2["cached"] == 0
    return r

# Any long, STABLE text works as the cacheable prefix: a system prompt, tool
# schemas, or a retrieved document. It only needs to clear the provider's
# minimum cacheable size (see Check 1). Load yours however you like.
LONG_SYSTEM_PROMPT = open("system_prompt.txt").read()   # ~8K+ tokens

for lane in [AutoLane("deepseek-v4-flash"), MarkerLane("claude-opus-4-8")]:
    print(audit(lane, LONG_SYSTEM_PROMPT))

The rest of the post walks each check: the lines that implement it, what both lanes returned, and how to read the result.


Check 1: does the cache engage?

cold = lane.call(SYS, "Q1")
warm = cold
for i in range(4):                       # poll: a cache may take a beat to be readable
    warm = lane.call(SYS, f"warm {i}")
    if warm["cached"] > 0: break
    time.sleep(1.0)
r["check1"] = cold["cached"] == 0 and warm["cached"] > 0
| Model | Cold cached | Warm cached | Result |
|---|---|---|---|
| deepseek-v4-flash | 0 | 7,552 / 7,870 (96%) | PASS |
| claude-opus-4-8 | 0 | 12,446 / 12,454 (99.9%) | PASS |

A cold call on a unique prefix must cache nothing; a repeat must hit. The single most common false alarm is declaring "no cache" after one warm call, because caches don't always become readable instantly. The loop polls a few times with a 1-second pause, which removes the flakiness. If you still get 0 after several warm calls on a prompt above the size floor (~1,024 tokens for most providers; DeepSeek matches at a finer 64), the cache genuinely isn't engaging.


Check 2: does the cost reflect the discount?

disc = (1 - warm["cost"]/cold["cost"])*100 if cold["cost"] and warm["cost"] else None
r["check2"] = disc is not None and disc > 30
| Model | Cold cost | Warm cost | Discount | Result |
|---|---|---|---|---|
| deepseek-v4-flash | $0.00107 | $0.00030 | 72.3% | PASS |
| claude-opus-4-8 | $0.07112 | $0.00672 | 90.6% | PASS |

This is the check that catches cache theater. The warm call's cost must actually drop. DeepSeek's per-call total fell ~72% (the cached input is discounted more steeply; output and the uncached remainder dilute the headline). Claude's cached read is ~90% off. The failure signal is unmistakable: cached_tokens > 0 with identical cold and warm cost means the gateway is reporting a hit it isn't pricing. You're paying full freight for a cache that "works" on paper.
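The dilution effect is easy to see with arithmetic. The sketch below is a simplified cost model, not any provider's billing formula: the function name and all three rates are hypothetical, and it ignores any surcharge a provider applies for writing the cache on the cold call.

```python
def per_call_discount(prompt_total, cached, output_tokens,
                      input_rate, cached_rate, output_rate):
    """Percent drop in total call cost from a cold call to a warm call.

    Rates are prices per token in any consistent unit (hypothetical here).
    """
    cold = prompt_total * input_rate + output_tokens * output_rate
    warm = ((prompt_total - cached) * input_rate   # uncached remainder, full rate
            + cached * cached_rate                 # cached span, discounted rate
            + output_tokens * output_rate)         # output is never discounted
    return (1 - warm / cold) * 100


# Assumed rates: cached input is 90% cheaper than normal input.
print(round(per_call_discount(1000, 900, 50, 1.0, 0.1, 2.0), 1))  # 73.6
```

With a 90% read discount on the cached span, the per-call total still drops by only about 74% in this example, because the 100 uncached input tokens and the 50 output tokens are billed at full rate.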


Check 3: do the token counts add up?

r["check3"] = warm["prompt_total"] is None or warm["cached"] <= warm["prompt_total"]
| Model | Cached | Prompt total | Result |
|---|---|---|---|
| deepseek-v4-flash | 7,552 | 7,870 | PASS |
| claude-opus-4-8 | 12,446 | 12,454 | PASS |

cached has to sit inside the prompt total, with the remainder billed as uncached input. Both reconcile. If cached_tokens exceeds prompt_tokens, or the uncached remainder is implausibly large for a stable prefix, the gateway is mis-accounting: re-tokenizing or double-counting somewhere in the translation.
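The one-line check in the script only catches the over-count case. A stricter version could also flag a large uncached remainder. This is a sketch under assumptions: the `reconcile` helper is hypothetical, and the 10% threshold is an arbitrary starting point you would tune to your own prompt shape.

```python
def reconcile(cached, prompt_total, max_uncached_frac=0.10):
    """Classify a warm call's token accounting."""
    if prompt_total is None or prompt_total <= 0:
        return "unknown"            # no total reported, nothing to reconcile
    if cached > prompt_total:
        return "over-count"         # cached cannot exceed the whole prompt
    uncached_frac = (prompt_total - cached) / prompt_total
    if uncached_frac > max_uncached_frac:
        return "large-remainder"    # suspicious for a stable prefix
    return "ok"


print(reconcile(7552, 7870))     # ok
print(reconcile(12446, 12454))   # ok
print(reconcile(9000, 7870))     # over-count
```

Both warm calls from the table above land in "ok": the uncached remainders are roughly 4% and under 0.1% of the prompt.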


Check 4: does streaming preserve the metadata?

st = lane.call(SYS, "stream", stream=True)
r["stream_cached"], r["stream_cost"] = st["cached"] > 0, st["cost"] is not None
r["check4"] = r["stream_cached"] and r["stream_cost"]
| Model | Stream cached | Stream cost | Result |
|---|---|---|---|
| deepseek-v4-flash | preserved | preserved | PASS |
| claude-opus-4-8 | preserved | preserved | PASS |

Most production chat traffic is streamed, so this is the path that matters most. On both lanes the cache-hit signal and the cost survive the stream: cached_tokens and cost come through in the final usage chunk, so your highest-volume path stays auditable. The failure mode to watch for is a gateway that drops usage on streaming. Clean token output with no cached_tokens or cost means you're flying blind on the path you run most. (On the OpenAI-compatible lane, pass stream_options={"include_usage": True} so the usage chunk is emitted at all.)


Check 5: the negative control

n1 = lane.call(f"[uniq {uuid.uuid4().hex}]\n\n"+long_prompt, "x")
n2 = lane.call(f"[uniq {uuid.uuid4().hex}]\n\n"+long_prompt, "y")
r["check5"] = n1["cached"] == 0 and n2["cached"] == 0
| Model | Unique-prefix A | Unique-prefix B | Result |
|---|---|---|---|
| deepseek-v4-flash | cached 0 | cached 0 | PASS |
| claude-opus-4-8 | cached 0 | cached 0 | PASS |

Send a unique prefix every call; it must never hit. Both lanes correctly reported cached=0 at full cost for distinct prefixes. A "hit" here would mean the gateway reports false positives, and then no cached count it returns could be trusted. The clean negative control is what makes the positive results in Checks 1–2 meaningful in the first place.


Reading your scorecard

| Check | Healthy result | Red flag |
|---|---|---|
| 1. cache engages | 0 cold, >0 warm (after polling) | 0 after several warm calls, above the size floor |
| 2. cost reflects discount | warm cost ≪ cold cost | cached > 0 but costs equal |
| 3. token accounting | cached ≤ prompt_total, reconciles | counts don't add up |
| 4. streaming metadata | cache + cost survive the stream | usage missing on streamed calls |
| 5. negative control | unique prefix always misses | a distinct prefix "hits" |

The two that cost money silently are 2 (full price for a reported hit) and 1 (no caching behind a clean response). Run both on every model you bill against.
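If you would rather read flags than raw dicts, something like the following could turn an audit() result into a short list of red flags. The `red_flags` helper and its wording are hypothetical; it assumes only the keys that audit() sets above (check1 through check5 and discount). The two sample dicts are hand-written for illustration.

```python
def red_flags(r):
    """Map an audit() result dict to readable red flags (empty list = healthy)."""
    flags = []
    if not r["check1"]:
        flags.append("1: no cache hit after polling")
    if r["discount"] is None:
        flags.append("2: cost missing, discount cannot be computed")
    elif not r["check2"]:
        flags.append(f"2: discount only {r['discount']:.1f}%")
    if not r["check3"]:
        flags.append("3: cached exceeds prompt total")
    if not r["check4"]:
        flags.append("4: usage or cost missing on streamed call")
    if not r["check5"]:
        flags.append("5: unique prefix reported a hit")
    return flags


# Hand-written sample results (assumed values).
healthy = {"check1": True, "discount": 72.3, "check2": True,
           "check3": True, "check4": True, "check5": True}
theater = {"check1": True, "discount": 0.0, "check2": False,
           "check3": True, "check4": True, "check5": True}

print(red_flags(healthy))   # []
print(red_flags(theater))   # ['2: discount only 0.0%']
```

The second sample is the cache-theater signature: a hit is reported, but the cost never moves.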


Closing

Caching is the highest-leverage cost lever in an LLM app, which is exactly why "the cache is working" deserves a test, not an assumption. Wire Check 1 + Check 2 into CI against each model you bill against, alert if the discount drifts below your expected band, and you'll catch a silent regression the day a gateway or upstream provider changes behavior, instead of at the end of the billing cycle. And whatever your audit does, poll the warm read before you call a cache broken.

For the mechanics behind these numbers (prefill, KV cache, TTLs) start with How KV Cache & TTL Work. For working caching patterns per provider, see the tutorial.


FAQ

My Check 1 shows 0 on the warm call. Is my gateway lying?
Check three things first. (1) Does your prompt clear the provider's minimum cacheable size (~1,024 tokens for most; DeepSeek matches at finer 64-token granularity)? (2) Did you poll the warm read a few times? Caches don't always become readable on the very next call. (3) Is the prefix byte-identical between calls, with no timestamps or per-request IDs at the front? Only after all three should you suspect the gateway.
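For point (3), a quick way to spot a drifting prefix is to measure how much of two consecutive prompts actually matches from the start. This is a rough sketch: the `shared_prefix_len` helper is hypothetical, and it counts characters, which only approximates what the provider's tokenizer sees.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


stable = "You are a careful assistant. Follow the policy below. " * 20

# A timestamp at the front changes on every request and kills the shared prefix.
bad_a = "[2024-01-01T00:00:00] " + stable
bad_b = "[2024-01-01T00:00:05] " + stable
# Putting the variable part last keeps the whole stable block shared.
good_a = stable + "alpha"
good_b = stable + "beta"

print(shared_prefix_len(bad_a, bad_b) < 25)             # True
print(shared_prefix_len(good_a, good_b) == len(stable)) # True
```

If the shared prefix between two "identical" requests is only a few dozen characters, the cache has almost nothing to match on, and the gateway is not the problem.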

What does "cache theater" cost me in practice?
You pay the full input rate on every call while believing you pay a fraction. On a high-volume endpoint with a large stable prefix, your bill can be several times what you modeled. Check 2 is the one to alert on.

Why is DeepSeek's discount lower than Claude's here?
Different things are being measured. Claude's ~90% is the read discount on cached input. DeepSeek's ~72% is the per-call total reduction, where output and the uncached remainder are billed at full rate and dilute the headline. Compare like with like for your own prompt shape.

Does this work for GPT, Gemini, Qwen too?
Yes. They're all automatic, so they use the AutoLane unchanged with a different model. Only Claude needs the MarkerLane. Same five checks either way.

Should this live in CI?
Yes. Run Check 1 + Check 2 against every model you bill against, on a schedule, and alert when the observed discount drifts outside your expected band. A standing audit turns a silent regression into a notification.
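A CI gate along these lines could sit on top of audit(). Treat it as a sketch: the helper names are hypothetical, and the band values are placeholders you would replace with numbers from your own baseline runs.

```python
# Hypothetical expected discount bands in percent; set these from your own baselines.
EXPECTED_BAND = {
    "deepseek-v4-flash": (60.0, 85.0),
    "claude-opus-4-8": (80.0, 95.0),
}


def discount_in_band(model, discount, bands=EXPECTED_BAND):
    """True when the observed discount falls inside the model's expected band."""
    if discount is None or model not in bands:
        return False                 # missing data counts as a failure, not a pass
    low, high = bands[model]
    return low <= discount <= high


def ci_exit_code(results, bands=EXPECTED_BAND):
    """Return 0 if every result passes Check 1 and stays in band, else 1."""
    ok = all(r["check1"] and discount_in_band(r["lane"], r["discount"], bands)
             for r in results)
    return 0 if ok else 1


# Sample results shaped like audit() output, using the discounts reported above.
results = [
    {"lane": "deepseek-v4-flash", "check1": True, "discount": 72.3},
    {"lane": "claude-opus-4-8", "check1": True, "discount": 90.6},
]
print(ci_exit_code(results))   # 0
```

In a scheduled job you would build `results` from real audit() calls and pass the return value to sys.exit, so a discount that drifts out of band fails the run.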
