<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: synthorai</title>
    <description>The latest articles on DEV Community by synthorai (@synthorai).</description>
    <link>https://dev.to/synthorai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3954184%2Ff7a20b6f-3f1e-4eed-85a3-486012422cbd.png</url>
      <title>DEV Community: synthorai</title>
      <link>https://dev.to/synthorai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/synthorai"/>
    <language>en</language>
    <item>
      <title>Prompt Caching for Open-Weight LLMs: Provider Roulette</title>
      <dc:creator>synthorai</dc:creator>
      <pubDate>Mon, 15 Jun 2026 11:52:43 +0000</pubDate>
      <link>https://dev.to/synthorai/prompt-caching-for-open-weight-llms-provider-roulette-4faf</link>
      <guid>https://dev.to/synthorai/prompt-caching-for-open-weight-llms-provider-roulette-4faf</guid>
      <description>&lt;p&gt;With a closed model, prompt caching is one documented contract. Claude has &lt;code&gt;cache_control&lt;/code&gt; breakpoints; OpenAI and Gemini cache automatically above a token floor; the discounts are published and stable. You read one page and you're done.&lt;/p&gt;

&lt;p&gt;Open weights break that assumption. The same Qwen or Llama checkpoint is served by a dozen hosts, and &lt;strong&gt;caching is not a property of the model — it's a property of where the model runs.&lt;/strong&gt; To show how far that goes, here's one measured request — an identical ~4.7K-token prompt sent to the same Qwen model through a multi-provider router six times, no upstream pinned:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;Upstream the router picked&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Cached tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Upstream A&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0141&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Upstream B&lt;/td&gt;
&lt;td&gt;$0.000709&lt;/td&gt;
&lt;td&gt;0 (cold)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3–6&lt;/td&gt;
&lt;td&gt;Upstream B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.000286&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,224 (warm)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same model, same router, same prompt: the bill ranged from &lt;strong&gt;$0.0141 to $0.000286 — a 49× spread&lt;/strong&gt; — purely on which upstream the router chose and whether that upstream had the prefix warm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt caching for open-weight models is a routing outcome, not a model feature.&lt;/strong&gt; It's implemented — free and automatic — in the inference engine, then preserved or destroyed by every layer above it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five layers: one provides caching, three can break it.&lt;/strong&gt; The model (sets cacheability, serves no cache) → the inference engine (caching, free) → the compute host (productizes it, unevenly) → the gateway (multi-cluster routing) → the router (scatters across vendors with disjoint caches).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measured.&lt;/strong&gt; An identical request, scattered by a router, cost &lt;strong&gt;49× more&lt;/strong&gt; on one pick than another; on one model, one host delivered &lt;strong&gt;59.6% off&lt;/strong&gt; and another &lt;strong&gt;0%&lt;/strong&gt;; published cache discounts span &lt;strong&gt;0% to ~98%&lt;/strong&gt; across models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to do.&lt;/strong&gt; Pin your route so repeated prefixes hit the same warm cache; audit by the &lt;strong&gt;cost&lt;/strong&gt; delta, not the &lt;code&gt;cached_tokens&lt;/code&gt; field (which often reads 0 on a real hit); weigh latency separately — warm prefills run 2–10× faster even at ~0% cost discount.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Live figures were measured on 2026-06-14 against a multi-provider router and our own gateway, with a fixed ~4.7K-token English prompt, small &lt;code&gt;max_tokens&lt;/code&gt;, sequential runs. Documented pricing was checked against primary provider docs the same day and cross-verified adversarially. &lt;strong&gt;Ratios&lt;/strong&gt; (percent discount, latency change) are the portable part; absolute dollars depend on the venue, your prompt, and load. Reproduce before quoting.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The cache types you'll actually meet
&lt;/h2&gt;

&lt;p&gt;Before the stack, the vocabulary. Across open-weight hosts there are four distinct cache shapes, and they bill differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Automatic prefix caching (no markers).&lt;/strong&gt; The dominant pattern. The server hashes your prompt prefix, reuses the KV state if it matches an earlier request, and applies the discount on its own — no &lt;code&gt;cache_control&lt;/code&gt;, no code change, often impossible to disable. DeepSeek, Zhipu GLM, and most open-weight hosts do this. Writes are free; the cache lives anywhere from VRAM (minutes) to disk (DeepSeek keeps prefixes "a few hours to a few days").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Explicit breakpoint caching (&lt;code&gt;cache_control&lt;/code&gt;).&lt;/strong&gt; The Anthropic shape, which a few open-weight hosts also offer. Alibaba's Model Studio takes &lt;code&gt;"cache_control": {"type": "ephemeral"}&lt;/code&gt; on a Qwen message block; some serving platforms expose an equivalent marker. You mark the boundary, pay a &lt;strong&gt;write surcharge&lt;/strong&gt;, and get a &lt;strong&gt;deeper read discount&lt;/strong&gt; in return.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Rented cache objects (with a storage fee).&lt;/strong&gt; The one to watch. Moonshot's legacy &lt;code&gt;moonshot-v1&lt;/code&gt; family makes you &lt;code&gt;POST /v1/caching&lt;/code&gt; to create a cache, then bills a write fee, a &lt;strong&gt;per-token-per-minute storage fee&lt;/strong&gt;, and a per-call hit fee. Google's &lt;em&gt;explicit&lt;/em&gt; Gemini caching is the same idea — input cost &lt;strong&gt;plus storage&lt;/strong&gt; at roughly $1.00–$4.50 per 1M-tokens per hour. The cache is a resource you rent and must garbage-collect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Self-host KV reuse (free).&lt;/strong&gt; Run the weights yourself and the inference engine caches for free and automatically. No write fee, no read rate, no storage rental — a hit just skips prefill.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;Markers?&lt;/th&gt;
&lt;th&gt;Write fee&lt;/th&gt;
&lt;th&gt;Storage fee&lt;/th&gt;
&lt;th&gt;Where you meet it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Automatic prefix&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Most open-weight hosts; DeepSeek, GLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit breakpoint&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cache_control&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Surcharge&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Qwen (explicit mode); some platforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rented cache object&lt;/td&gt;
&lt;td&gt;Create/TTL/delete&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moonshot &lt;code&gt;moonshot-v1&lt;/code&gt;, Gemini explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host KV reuse&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;vLLM, SGLang, TensorRT-LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Qwen on Model Studio offers &lt;strong&gt;both&lt;/strong&gt; automatic and explicit modes, with a real tradeoff: implicit bills a hit at &lt;strong&gt;20% of input&lt;/strong&gt; with free writes; explicit bills a hit at &lt;strong&gt;10% of input&lt;/strong&gt; but charges &lt;strong&gt;125% on the write&lt;/strong&gt; and bounds the entry to a 5-minute TTL. Deeper discount, but you pay to populate and pay again each time it expires.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where caching lives in the stack
&lt;/h2&gt;

&lt;p&gt;Here is the key idea. Prompt caching for open weights is &lt;strong&gt;solved at exactly one layer and endangered at every layer above it.&lt;/strong&gt; Walk the stack from the weights up, and at each layer ask: does this layer &lt;em&gt;provide&lt;/em&gt; caching, or merely &lt;em&gt;forward&lt;/em&gt; it — and can it &lt;em&gt;break&lt;/em&gt; what the layer below already did?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  request
     |
     v
  +--------------------------------------------------+
  | L5  router             scatters across vendors   |  can break it
  | L4  gateway            multi-cluster routing     |  can break it
  | L3  compute host       uneven delivery           |  can break it
  |==================================================|
  | L2  inference engine   CACHING LIVES HERE, free  |  &amp;lt;-- the cache is born here
  |==================================================|
  | L1  model              cacheability: MLA / GQA   |  sets the ceiling
  +--------------------------------------------------+

  A cache hit is born at L2 and must survive L3-L5 routing to reach you;
  every layer above L2 is a chance to land where your prefix isn't.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 1 — The model: cacheability, not a cache
&lt;/h3&gt;

&lt;p&gt;This is the layer most people &lt;em&gt;think&lt;/em&gt; caching lives in — "DeepSeek has caching" — so it's the first one to get precise about. A checkpoint is a bag of weights; it runs the same attention whether or not a KV cache exists. It ships no cache, no discount, no TTL, no &lt;code&gt;cache_control&lt;/code&gt; marker — those are serving-layer features. In that strict sense the weights provide no caching &lt;em&gt;product&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But the weights are not neutral, and DeepSeek is the perfect example of why. &lt;strong&gt;The model's attention architecture decides how big the KV cache is, and therefore how cheap caching can ever be:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek's &lt;strong&gt;Multi-head Latent Attention (MLA)&lt;/strong&gt; compresses the KV cache into a low-rank latent — to roughly 4–14% of a standard multi-head cache. That compression is exactly what lets DeepSeek's API persist prefixes to disk and price a cache read at ~2% of input. The architecture is the &lt;em&gt;enabler&lt;/em&gt;; the disk cache is a &lt;em&gt;product built on top of it&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grouped-Query Attention (GQA)&lt;/strong&gt; — used by Llama, Qwen, Mistral, and DeepSeek — shares KV heads to shrink the cache by the group factor (≈8× on Llama-3).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So Layer 1's contribution is &lt;strong&gt;cacheability, not a cache&lt;/strong&gt;: the architecture sets the ceiling on how cheap every layer above can make caching, but the weights never serve a cached token themselves. And "DeepSeek has caching" quietly merges two different things wearing the same name — the &lt;em&gt;weights&lt;/em&gt; (this layer, which give you MLA) and DeepSeek's &lt;em&gt;API and serving stack&lt;/em&gt; (Layers 2–3, which give you the disk cache, the discount, and the usage fields). Download the open weights and run them yourself and you keep MLA's small KV cache, but the disk-cache &lt;em&gt;product&lt;/em&gt; stays on DeepSeek's servers — you inherit whatever Layer 2 you deploy in its place. So the operational move still holds: stop asking whether a &lt;em&gt;model&lt;/em&gt; caches and start asking where it's &lt;em&gt;served&lt;/em&gt; — just don't mistake that for the architecture not mattering. It sets the ceiling; the path sets what you actually get.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2 — The inference engine: where caching is built, and free
&lt;/h3&gt;

&lt;p&gt;One layer up, caching is not just present — it's &lt;strong&gt;solved, and free.&lt;/strong&gt; Modern inference engines cache prefixes automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — Automatic Prefix Caching: hashes each KV block, reuses any block whose prefix hash it has seen, LRU-evicts. On by default in V1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SGLang&lt;/strong&gt; — RadixAttention: stores the KV cache in a radix tree so any shared prefix is reused, with cache-aware scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TensorRT-LLM&lt;/strong&gt; — block reuse (&lt;code&gt;enable_block_reuse&lt;/code&gt;, default on), with optional offload of KV blocks to host memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Projects like LMCache extend this further — offloading KV to CPU/disk and &lt;em&gt;sharing it across instances&lt;/em&gt;, which is the seed of solving the routing problem we're about to hit. The point: if you self-host, you are done. Caching is automatic, costs nothing beyond the GPUs you already run, evicts by LRU, and &lt;strong&gt;you own it&lt;/strong&gt; — a hit simply skips prefill, lowering TTFT and raising throughput. There is no &lt;code&gt;cached_tokens&lt;/code&gt; billing field because nothing is billed; the payoff shows up in your own latency metrics. For a closed model you rent caching; for an open one you can own it outright. The catch is the inverse of the hosted world: the cache is ephemeral (VRAM, LRU), so it survives only while the prefix stays hot — which is precisely what the layers above must preserve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3 — The compute host: productizing it, unevenly
&lt;/h3&gt;

&lt;p&gt;Commercial inference hosts wrap Layer 2 and run &lt;strong&gt;fleets of replicas&lt;/strong&gt;. They inherit free automatic caching — the question is whether they implement it &lt;em&gt;well&lt;/em&gt;, and the answer is mixed on two axes.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;exposure and price vary wildly&lt;/strong&gt;. Among the major open-weight hosts: one applies a flat 50% to cached input and lets cached tokens skip rate limits; another defaults to 50% off on serverless; a third prices cached input per model (e.g. a Qwen tier at ~80% off) and exposes a cache-key hint to improve affinity; a fourth makes caching always-on and undiscloseable on dedicated endpoints. Same underlying engine, four pricing philosophies.&lt;/p&gt;

&lt;p&gt;Second — and this is the first place caching &lt;em&gt;breaks&lt;/em&gt; — the &lt;strong&gt;multi-replica problem&lt;/strong&gt;. Your warm prefix lives in the VRAM of the one replica that served the cold request. The host's own load balancer may send your next request to a different replica with a cold cache. We saw exactly this: pinning the same Qwen model to one upstream at a time and running cold→warm:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pinned upstream&lt;/th&gt;
&lt;th&gt;Cold&lt;/th&gt;
&lt;th&gt;Warm&lt;/th&gt;
&lt;th&gt;Discount&lt;/th&gt;
&lt;th&gt;&lt;code&gt;cached_tokens&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Provider A&lt;/td&gt;
&lt;td&gt;$0.000709&lt;/td&gt;
&lt;td&gt;$0.000286&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,224 ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider B&lt;/td&gt;
&lt;td&gt;$0.000662&lt;/td&gt;
&lt;td&gt;$0.000662&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Provider A cached cleanly and reported it. Provider B — which &lt;em&gt;advertises&lt;/em&gt; a cache-read price for this model — returned &lt;strong&gt;no discount across a cold call and two warm calls&lt;/strong&gt; in our test. Whether that's eligibility, replica fan-out, or a longer warm-up than two requests, the measured result on this path was zero. The capability is solved at Layer 2; whether you actually receive it is a Layer-3 execution detail, and it differs by host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4 — The gateway: the multi-cluster problem
&lt;/h3&gt;

&lt;p&gt;A gateway sits in front of one or more upstreams and multiplies the replica problem into a &lt;strong&gt;cluster problem&lt;/strong&gt;. If it round-robins requests across clusters or providers without &lt;strong&gt;cache affinity&lt;/strong&gt;, the warm cache becomes structurally unreachable — every request lands somewhere the prefix isn't. A cache-aware gateway must route by prefix hash so identical prefixes stick to the same upstream, the same way Layer 2 sticks them to the same KV blocks.&lt;/p&gt;

&lt;p&gt;We ran a cold→warm battery across open-weight models on a third-party gateway, reading the per-request &lt;code&gt;cost&lt;/code&gt; directly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cold&lt;/th&gt;
&lt;th&gt;Warm&lt;/th&gt;
&lt;th&gt;Discount&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.00189&lt;/td&gt;
&lt;td&gt;$0.0000155&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.0s → 1.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.000564&lt;/td&gt;
&lt;td&gt;$0.0000116&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.9s → 1.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen3.5-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.000561&lt;/td&gt;
&lt;td&gt;$0.0000853&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.2s → 1.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kimi-k2.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.00242&lt;/td&gt;
&lt;td&gt;$0.000469&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.2s → 1.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen3-max&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.00350&lt;/td&gt;
&lt;td&gt;$0.00336&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.2s → 1.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen3.5-plus&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.00114&lt;/td&gt;
&lt;td&gt;$0.00114&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.8s → 1.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DeepSeek-V4 hit 97–99% (affinity working end to end); &lt;code&gt;qwen3.5-plus&lt;/code&gt; and &lt;code&gt;qwen3-max&lt;/code&gt; returned ~0% on the warm call despite carrying a cache-read price in the catalog. Two more gateway lessons hide in this table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The usage field lies; the cost doesn't.&lt;/strong&gt; &lt;code&gt;cached_tokens&lt;/code&gt; read &lt;strong&gt;0&lt;/strong&gt; on &lt;em&gt;every&lt;/em&gt; call here, including the 99% cost drops. Many OpenAI-compatible gateways don't populate the cached-token field for upstreams that cache automatically. Audit by the &lt;code&gt;cost&lt;/code&gt; delta between a cold and warm call, not by the token field — the same lesson as auditing a &lt;a href="https://dev.to/blog/llm-gateway-cache-audit/"&gt;gateway's cache claims&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency improves even when cost doesn't.&lt;/strong&gt; Every warm call was 2–10× faster — &lt;code&gt;qwen3.5-flash&lt;/code&gt; went 10.2s→1.0s — including the ~0%-discount ones. A hit skips prefill regardless of how the host prices it, so caching can pay off in TTFT on a gateway that gives you nothing on the bill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A gateway that doesn't preserve affinity hands you a cache you can't reach; one that doesn't surface cache cost hands you one you can't verify.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5 — The router: random distribution across providers
&lt;/h3&gt;

&lt;p&gt;At the top, a multi-provider router load-balances one model ID across &lt;em&gt;different companies'&lt;/em&gt; clusters — each with a &lt;strong&gt;separate cache&lt;/strong&gt;. Now even perfect affinity within a provider can't save you: if call 1 goes to one vendor and call 2 to another, there is no shared cache to hit. This is the scatter from the top of this post, and it compounds Layer 4 — not just multiple clusters, but multiple vendors with disjoint cache state and disjoint prices (the priciest pick billed 20× the cheapest upstream's base rate). The cache only engaged once routing happened to stick to one provider.&lt;/p&gt;

&lt;p&gt;The fix is to collapse the randomness — make routing deterministic so repeated prefixes land on the same warm cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pin the upstream; otherwise load-balancing scatters you across disjoint caches.
# (field names follow a common multi-provider router's API)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ROUTER_BASE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3.5-35b-a3b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;include&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;              &lt;span class="c1"&gt;# return cost + cached_tokens
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;                            &lt;span class="c1"&gt;# the part that makes caching work
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your-chosen-upstream&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allow_fallbacks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To its credit, the router &lt;em&gt;did&lt;/em&gt; report &lt;code&gt;cached_tokens&lt;/code&gt; (4,224 on the hit) and a per-request &lt;code&gt;cost&lt;/code&gt;, so you can verify both — better than the Layer-4 gateway that read 0. But the routing is yours to constrain. &lt;strong&gt;Caching is a routing problem dressed up as a pricing feature:&lt;/strong&gt; the cache is free at Layer 2, and Layers 3, 4, and 5 are three escalating ways to route yourself away from it.&lt;/p&gt;




&lt;h2&gt;
  
  
  How deep is the discount? It's all over the map
&lt;/h2&gt;

&lt;p&gt;When the routing &lt;em&gt;does&lt;/em&gt; line up, how much do you save? For closed models the cache-read discount clusters near 90%. For open weights the published cache-read price ranges from a token gesture to near-total, even within one vendor's lineup. First-party published rates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model (first-party / mode)&lt;/th&gt;
&lt;th&gt;Input $/M&lt;/th&gt;
&lt;th&gt;Cache read $/M&lt;/th&gt;
&lt;th&gt;Discount&lt;/th&gt;
&lt;th&gt;Layer-2 type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-v4-flash&lt;/td&gt;
&lt;td&gt;0.14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0028&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~98%&lt;/td&gt;
&lt;td&gt;auto disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-v4-pro&lt;/td&gt;
&lt;td&gt;1.74&lt;/td&gt;
&lt;td&gt;0.145&lt;/td&gt;
&lt;td&gt;~92%&lt;/td&gt;
&lt;td&gt;auto disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen (explicit mode)&lt;/td&gt;
&lt;td&gt;base&lt;/td&gt;
&lt;td&gt;0.10× base&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.16&lt;/td&gt;
&lt;td&gt;~83%&lt;/td&gt;
&lt;td&gt;auto&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;auto implicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen (implicit mode)&lt;/td&gt;
&lt;td&gt;base&lt;/td&gt;
&lt;td&gt;0.20× base&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;auto&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DeepSeek's automatic disk cache is the deepest in the field — &lt;code&gt;deepseek-v4-flash&lt;/code&gt; reads cached input at &lt;strong&gt;$0.0028/M against a $0.14/M miss, a 1:50 ratio&lt;/strong&gt;, which our Layer-4 test reproduced at 97.9%. &lt;strong&gt;Third-party hosts of these same open weights price cached input independently&lt;/strong&gt; — some apply a flat ~50%, others vary per model from ~50% to ~90% — so the discount you get is a function of which host you land on, not just the model. Same feature name, a 48-point spread.&lt;/p&gt;

&lt;p&gt;Because the discount is a venue property, one model carries different cache economics everywhere it's served. &lt;code&gt;deepseek-v4-pro&lt;/code&gt;, four ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Where (layer)&lt;/th&gt;
&lt;th&gt;Cache-read discount&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First-party API (L3)&lt;/td&gt;
&lt;td&gt;~92% ($1.74 → $0.145)&lt;/td&gt;
&lt;td&gt;documented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party host A (L3)&lt;/td&gt;
&lt;td&gt;~89% ($1.74 → $0.20)&lt;/td&gt;
&lt;td&gt;documented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party host B (L3)&lt;/td&gt;
&lt;td&gt;~92% ($1.6 → $0.135)&lt;/td&gt;
&lt;td&gt;documented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third-party gateway (L4)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;measured (cold→warm)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;"DeepSeek-V4-Pro supports caching" is true and nearly useless; the operational question is "supports caching &lt;em&gt;where, at what rate, reported how&lt;/em&gt;."&lt;/p&gt;




&lt;h2&gt;
  
  
  A decision checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;The model sets the ceiling, not the cache&lt;/strong&gt; (Layer 1). Its attention architecture (MLA, GQA) decides how cheap caching &lt;em&gt;can&lt;/em&gt; be, but it never serves a cached token — so still ask where it's served and what that host's stack does.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Self-hosting? You already have it free&lt;/strong&gt; (Layer 2). Confirm automatic prefix caching is on (it is by default in vLLM/SGLang) and watch your prefix hit rate.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;On a compute host, verify delivery, not the price column&lt;/strong&gt; (Layer 3). A cache-read price is a claim; measure a cold→warm cost delta. Use a cache-key affinity hint where the host offers one.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Through a gateway, demand cache-affinity routing and cost reporting&lt;/strong&gt; (Layer 4). If identical prefixes don't stick to one upstream, or &lt;code&gt;cost&lt;/code&gt; doesn't drop on a warm call, the cache is unreachable or unverifiable.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;On a router, pin the upstream&lt;/strong&gt; (Layer 5). Constrain routing (e.g. a provider-order field with fallbacks off), or you forfeit hits to load-balancing across disjoint caches — and risk a 20–50× pricier upstream.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Weigh latency separately from cost.&lt;/strong&gt; Warm prefills are 2–10× faster even when the dollar discount is ~0.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Watch for storage-fee cache types.&lt;/strong&gt; Rented caches (Moonshot &lt;code&gt;moonshot-v1&lt;/code&gt;, Gemini explicit) bill per-token-time for an idle cache; automatic prefix caches don't.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;For closed models, "does it cache?" has one answer. For open weights the capability was solved years ago at the inference-engine layer — vLLM and SGLang cache every prefix, for free, automatically. Everything above that layer is plumbing that either preserves the hit or scatters you away from it: a compute host's replica balancer, a gateway's cluster routing, a router's random spread across vendors. The model's architecture sets the ceiling on how cheap caching can be — MLA and GQA are real, model-level wins — but the path your request takes decides what you actually get. Treat cache behavior as a &lt;strong&gt;routing property&lt;/strong&gt; — measure it in cost terms on the exact path you'll run, pin the route so the cache you warmed is the one you hit, and remember that the deepest discount in the world is worth nothing if request two lands somewhere request one never touched.&lt;/p&gt;

&lt;p&gt;For the mechanics of &lt;em&gt;why&lt;/em&gt; a KV cache exists and how TTLs work, start with &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;How KV Cache &amp;amp; TTL Work&lt;/a&gt;; to audit a gateway's cache claims, see &lt;a href="https://dev.to/blog/llm-gateway-cache-audit/"&gt;Does Your LLM Gateway Lie About Cache?&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do open-weight models support prompt caching?&lt;/strong&gt;&lt;br&gt;
The weights set how cheap caching can be — attention architectures like MLA and GQA shrink the KV cache — but the &lt;em&gt;cache itself&lt;/em&gt;, the discount, and the API come from the serving stack. Caching is implemented in the inference engine (vLLM, SGLang, TensorRT-LLM), inherited by compute hosts, and forwarded (or scattered) by gateways and routers. Ship the same checkpoint to three hosts and you can get free automatic caching, none, or explicit-only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did the same model cost 49× more on one call than another?&lt;/strong&gt;&lt;br&gt;
On a multi-provider router, an un-pinned request is load-balanced across different vendors' clusters with different base prices and disjoint cache state. One call hit a pricey provider cold; another hit a cheap one warm. Pin the upstream (constrain provider order, fallbacks off) to control both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If I self-host, do I need to pay for caching?&lt;/strong&gt;&lt;br&gt;
No. Automatic prefix caching in vLLM, SGLang, and TensorRT-LLM is on by default and free — a hit just skips prefill. You pay only for the GPUs you already run, and the cache is yours, evicted by LRU when VRAM is needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The API says &lt;code&gt;cached_tokens: 0&lt;/code&gt; but my bill dropped — did caching work?&lt;/strong&gt;&lt;br&gt;
Probably yes. Many gateways don't populate &lt;code&gt;cached_tokens&lt;/code&gt; for upstreams that cache automatically. Trust the &lt;code&gt;cost&lt;/code&gt; field: a large drop between a cold and an identical warm call means the cache hit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which open-weight model has the deepest cache discount?&lt;/strong&gt;&lt;br&gt;
DeepSeek's automatic disk cache: &lt;code&gt;deepseek-v4-flash&lt;/code&gt; reads cached input at ~$0.0028/M against $0.14/M uncached (~98% off), reproduced at 97.9–99.2% across the V4 line in our cold→warm tests. Many third-party hosts apply a flat ~50% instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a catch with caches that charge a storage fee?&lt;/strong&gt;&lt;br&gt;
Yes — Moonshot's &lt;code&gt;moonshot-v1&lt;/code&gt; explicit cache and Gemini's explicit cache bill per-token-time to keep the cache alive (Gemini ~$1–4.50 / 1M-tokens / hour). An idle cache you forgot to delete keeps charging. Automatic prefix caches have no storage fee.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verification: live cost/latency figures measured 2026-06-14 against a multi-provider router and our own gateway, using a fixed ~4.7K-token prompt, small &lt;code&gt;max_tokens&lt;/code&gt;, sequential cold→warm runs; discounts computed from the returned per-request &lt;code&gt;cost&lt;/code&gt;. Documented pricing and cache mechanics checked against primary provider docs the same day and cross-verified adversarially; a few vendor figures (notably Moonshot's explicit-cache fees) move frequently — confirm current values before quoting. Your numbers will vary with provider, prompt, region, and load.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek — Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/guides/kv_cache" rel="noopener noreferrer"&gt;DeepSeek — KV cache / Context Caching guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2412.19437" rel="noopener noreferrer"&gt;DeepSeek-V3 Technical Report — MLA (KV-cache compression)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2305.13245" rel="noopener noreferrer"&gt;GQA: Training Generalized Multi-Query Transformer Models (Ainslie et al.)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.alibabacloud.com/help/en/model-studio/" rel="noopener noreferrer"&gt;Alibaba Cloud Model Studio — context cache &amp;amp; pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.moonshot.ai/docs/guide/use-context-caching" rel="noopener noreferrer"&gt;Moonshot AI — Context Caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.z.ai/" rel="noopener noreferrer"&gt;Zhipu / Z.AI — pricing &amp;amp; caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html" rel="noopener noreferrer"&gt;vLLM — Automatic Prefix Caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.sglang.ai/" rel="noopener noreferrer"&gt;SGLang — RadixAttention / cache&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/LMCache/LMCache" rel="noopener noreferrer"&gt;LMCache — KV cache offloading &amp;amp; sharing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/caching" rel="noopener noreferrer"&gt;Google — Gemini context caching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;All checked 2026-06-14. Not financial advice; verify current pricing before relying on it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>llm</category>
    </item>
    <item>
      <title>Claude Fable 5's 30-Day Retention: ZDR, HIPAA, COPPA</title>
      <dc:creator>synthorai</dc:creator>
      <pubDate>Fri, 12 Jun 2026 06:54:46 +0000</pubDate>
      <link>https://dev.to/synthorai/claude-fable-5s-30-day-retention-zdr-hipaa-coppa-3d5d</link>
      <guid>https://dev.to/synthorai/claude-fable-5s-30-day-retention-zdr-hipaa-coppa-3d5d</guid>
      <description>&lt;p&gt;If your organization runs Claude under a zero-data-retention (ZDR) agreement, your first request to &lt;code&gt;claude-fable-5&lt;/code&gt; didn't return a completion. It returned &lt;code&gt;400 invalid_request_error&lt;/code&gt;. That's not an outage — it's policy. Fable 5 is the first generally available Claude model that &lt;strong&gt;cannot be used without 30-day data retention&lt;/strong&gt;, and the requirement follows the model onto every platform: the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry each gate it behind an explicit retention opt-in.&lt;/p&gt;

&lt;p&gt;For teams that treated "we don't retain your data" as a settled property of their LLM stack, this is an architectural event. This post covers what the policy says, why the window exists, how each cloud implements it, and what it changes for consumer products and sensitive-data industries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Policy details were checked against Anthropic's, AWS's, Google's, and Microsoft's published documentation on 2026-06-12. Policies change; verify against the linked primary sources and your own contracts. This is an engineering overview, not legal advice.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What the policy actually says
&lt;/h2&gt;

&lt;p&gt;Anthropic designates Claude Fable 5 and Claude Mythos 5 as &lt;a href="https://support.claude.com/en/articles/15425695" rel="noopener noreferrer"&gt;Covered Models&lt;/a&gt;. Per the &lt;a href="https://platform.claude.com/docs/en/manage-claude/api-and-data-retention" rel="noopener noreferrer"&gt;API data retention docs&lt;/a&gt; and the &lt;a href="https://support.claude.com/en/articles/15425996-data-retention-practices-for-mythos-class-models" rel="noopener noreferrer"&gt;Mythos-class retention practices article&lt;/a&gt; (effective 2026-06-09):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompts and completions are retained for 30 days&lt;/strong&gt;, then automatically deleted — unless flagged for an active safety investigation or required by law.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;There is no opt-out.&lt;/strong&gt; Retention is a condition of using the model. A request from an organization whose retention configuration doesn't meet the requirement returns &lt;code&gt;400 invalid_request_error&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access is narrow by design.&lt;/strong&gt; Automated safety systems screen the data; only a small group of approved personnel can review flagged conversations, they cannot export, copy, or download it, and every access lands in tamper-proof logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing ZDR agreements do not carry over&lt;/strong&gt; to Covered Model traffic — including through cloud platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consumer plans (Claude Free/Pro/Max) are unaffected — they already operate under their own retention terms. This policy targets the commercial API surface, exactly where "we never retain" promises tend to live.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why a 30-day window exists
&lt;/h2&gt;

&lt;p&gt;The rationale in the &lt;a href="https://support.claude.com/en/articles/15425695" rel="noopener noreferrer"&gt;Covered Models article&lt;/a&gt; is specific: these models have substantially advanced capabilities in software engineering, agentic workflows, and cybersecurity, and &lt;strong&gt;"some forms of misuse only become detectable across many requests."&lt;/strong&gt; The cited examples — best-of-N jailbreaking, state-sponsored espionage — are attack patterns where each prompt looks benign and only the sequence is diagnostic. You can't detect a sequence you've deleted.&lt;/p&gt;

&lt;p&gt;Two things the window is &lt;strong&gt;not&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not training data.&lt;/strong&gt; Anthropic states retained data is never used for training without express permission. The purpose is abuse detection, full stop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not new in kind — new in enforceability.&lt;/strong&gt; A ~30-day abuse-monitoring window has been the industry default for years: &lt;a href="https://openai.com/enterprise-privacy/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; keeps API abuse logs up to 30 days (ZDR by approval); &lt;a href="https://learn.microsoft.com/en-us/answers/questions/2156579/azure-openai-data-management-and-abuse-monitoring" rel="noopener noreferrer"&gt;Azure OpenAI&lt;/a&gt; stores prompts up to 30 days unless approved for modified abuse monitoring. What changed is that the window became &lt;strong&gt;non-negotiable for one model class&lt;/strong&gt; — previously every provider offered a zero-retention escape hatch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One pre-existing caveat that surprises people: even under ZDR, Anthropic retains safety classifier results, and content flagged for Usage Policy violations can be kept &lt;strong&gt;up to 2 years&lt;/strong&gt;. Zero data retention has never meant zero data — it means zero retention of unflagged content in the normal path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Same requirement, three clouds, three mechanisms
&lt;/h2&gt;

&lt;p&gt;The retention applies wherever the model runs, but each platform wires the opt-in differently — and the differences decide who processes your data and where your controls live.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Opt-in mechanism&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Without opt-in&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude API&lt;/td&gt;
&lt;td&gt;30-day retention in Privacy controls&lt;/td&gt;
&lt;td&gt;Organization or workspace&lt;/td&gt;
&lt;td&gt;&lt;code&gt;400 invalid_request_error&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Bedrock&lt;/td&gt;
&lt;td&gt;&lt;code&gt;data_retention_mode: provider_data_share&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Account or project&lt;/td&gt;
&lt;td&gt;Model listed &lt;code&gt;unavailable&lt;/code&gt;; requests blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Vertex AI&lt;/td&gt;
&lt;td&gt;Anthropic data sharing + Model Garden terms&lt;/td&gt;
&lt;td&gt;Project&lt;/td&gt;
&lt;td&gt;Requests blocked until enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft Foundry&lt;/td&gt;
&lt;td&gt;Anthropic's terms accepted at deployment&lt;/td&gt;
&lt;td&gt;Subscription/deployment&lt;/td&gt;
&lt;td&gt;Not covered by Azure's ZDR program at all&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;AWS Bedrock&lt;/strong&gt; is the most explicit. &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/data-retention.html" rel="noopener noreferrer"&gt;Data retention is a configurable mode&lt;/a&gt; (&lt;code&gt;default&lt;/code&gt; / &lt;code&gt;provider_data_share&lt;/code&gt; / &lt;code&gt;none&lt;/code&gt;), resolved project → account → model default. Fable 5 declares &lt;code&gt;allowed_modes: ["provider_data_share"]&lt;/code&gt;: prompts and completions are shared with Anthropic and retained up to 30 days. Under any other mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic.claude-fable-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unavailable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"This model is not available under data retention mode 'default'."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data_retention"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"account"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allowed_modes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"provider_data_share"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing changed for pre-Fable-5 models, and an SCP on the &lt;code&gt;bedrock:DataRetentionMode&lt;/code&gt; condition key can enforce your posture org-wide — nobody quietly flips the account to try the new model. Note: with cross-region inference, the retained copy lives in the &lt;em&gt;destination&lt;/em&gt; region, which matters if you carry residency commitments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Vertex AI&lt;/strong&gt; gates the model behind a project-level Anthropic data-sharing setting (&lt;code&gt;setPublisherModelConfig&lt;/code&gt; with &lt;code&gt;dataSharingEnabledProvider: "anthropic"&lt;/code&gt;) plus terms acceptance in Model Garden, per &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/partner-models/claude/fable-5" rel="noopener noreferrer"&gt;Google's Fable 5 documentation&lt;/a&gt;. General data handling follows &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/data-governance" rel="noopener noreferrer"&gt;Vertex AI's data-governance policy&lt;/a&gt;; for residency-sensitive workloads, Vertex's regional and multi-region endpoints control where inference runs — which now includes where the retained copy lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Foundry&lt;/strong&gt; is structurally different. Microsoft's &lt;a href="https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/claude-models/data-privacy" rel="noopener noreferrer"&gt;data and privacy documentation&lt;/a&gt; is explicit that Claude models are third-party marketplace services: you accept Anthropic's terms at deployment, and &lt;strong&gt;Anthropic — not Microsoft — is the data processor&lt;/strong&gt;. Azure OpenAI's ZDR and modified-abuse-monitoring programs don't extend to Claude deployments. Organizations with ZDR postures elsewhere typically isolate Covered Model use in a dedicated subscription, making the retention boundary structural rather than procedural.&lt;/p&gt;

&lt;p&gt;The pattern across all three: &lt;strong&gt;retention class became a first-class, machine-readable model attribute&lt;/strong&gt; — a mode, a flag, a terms gate — rather than a paragraph in a contract. Your infrastructure can now enforce your data posture, and it should.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it means for enterprise deployments
&lt;/h2&gt;

&lt;p&gt;With no ZDR agreement, nothing changes mechanically — you were already in a 30-day-style posture, possibly without realizing it. The work is making it &lt;em&gt;explicit&lt;/em&gt; in your vendor documentation.&lt;/p&gt;

&lt;p&gt;With a ZDR agreement, you have a three-way choice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Skip Covered Models.&lt;/strong&gt; ZDR stays uniform; you give up the model. Viable if your workloads don't need it — see our &lt;a href="https://dev.to/blog/claude-fable-5-prompt-caching/"&gt;measured Fable 5 evaluation&lt;/a&gt; for what it costs and where it differs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split by workspace or project.&lt;/strong&gt; Every platform supports a scoped opt-in: a designated Claude API workspace (Console → Settings → Workspaces → Privacy controls), a Bedrock project with &lt;code&gt;provider_data_share&lt;/code&gt;, a separate Vertex project or Azure subscription. Route only retention-tolerant workloads there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept retention org-wide.&lt;/strong&gt; Simplest to operate, but it silently downgrades the guarantee for &lt;em&gt;every&lt;/em&gt; workload — including the ones whose sensitivity justified ZDR. That's a decision for your data-protection owner, not a config change.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Regardless of provider: &lt;strong&gt;your own logging is a second retention surface.&lt;/strong&gt; If your gateway or observability stack logs full prompts, you're running a longer window than your provider, under your own roof. Provider guarantees are only as meaningful as the layer in front of them — the same audit logic we applied to &lt;a href="https://dev.to/blog/llm-gateway-cache-audit/"&gt;cache claims&lt;/a&gt; applies here.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it means for consumer-facing products
&lt;/h2&gt;

&lt;p&gt;If you serve consumers and route their content through a Covered Model, the change propagates into your own legal surface — ZDR agreement or not. Three concrete consequences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Your privacy notice probably needs an update.&lt;/strong&gt; Most regimes require disclosing retention, not just collection: GDPR Article 13(2)(a) requires the storage period (or criteria) at collection time; California's CPRA requires the notice at collection to state retention per category of personal information. If your notice says — or implies — that conversation data isn't retained anywhere, a processor holding a 30-day copy makes it wrong. Update the notice, the records of processing, and the DPA inventory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. You cannot offer users an opt-out you don't have.&lt;/strong&gt; The retention has no exception mechanism, so there is no toggle you can build that exempts a user's prompts &lt;em&gt;while still using that model&lt;/em&gt;. The lever you actually hold is &lt;strong&gt;routing&lt;/strong&gt;: a consent-aware gateway sends users who decline data sharing to ZDR-eligible models and everyone else to the Covered Model — a legal constraint turned into an ordinary routing rule. Far better than a preference checkbox that does nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Deletion requests need accurate plumbing.&lt;/strong&gt; Erasure obligations (GDPR Art. 17, CPRA deletion, and their equivalents) extend to processors. A bounded window that auto-deletes within 30 days is generally a defensible processor posture — but your DSAR playbook should say that, not promise immediate downstream deletion you can't execute.&lt;/p&gt;

&lt;p&gt;The global dimension compounds this: the same disclosure-and-processor logic appears in the UK GDPR, Brazil's LGPD, and the spreading family of US state privacy laws. For users in China, PIPL adds two sharper edges — providing personal information to another processor generally requires separate consent, and routing Chinese users' content to an overseas LLM endpoint is a cross-border transfer needing a recognized mechanism (security assessment, standard contract, or certification). A model upgrade that changes who retains what, where, for how long is exactly the change these frameworks expect you to re-paper.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sensitive-data industries: where 30 days bites hardest
&lt;/h2&gt;

&lt;p&gt;For most products the provider's window is a documentation problem. For industries whose data is itself regulated, it's an architecture problem: the retained copy is regulated data at rest at a vendor, and sector rules govern exactly that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Healthcare (HIPAA)
&lt;/h3&gt;

&lt;p&gt;HIPAA doesn't require zero retention — it requires that any vendor holding protected health information does so &lt;strong&gt;under a Business Associate Agreement (BAA)&lt;/strong&gt; with appropriate safeguards. The 30-day copy of your prompts is PHI at rest at a business associate; the question is whether your BAA covers it. The two major API vendors structure this differently, and the difference now matters: &lt;a href="https://platform.claude.com/docs/en/manage-claude/api-and-data-retention#hipaa-readiness" rel="noopener noreferrer"&gt;Anthropic's HIPAA-ready API access&lt;/a&gt; explicitly &lt;em&gt;doesn't&lt;/em&gt; require ZDR — it's built on retention-with-safeguards (encryption, access controls, audit logging, enforced feature restrictions). &lt;a href="https://help.openai.com/en/articles/8660679-how-can-i-get-a-business-associate-agreement-baa-with-openai" rel="noopener noreferrer"&gt;OpenAI's API BAA&lt;/a&gt; covers endpoints eligible for zero data retention — and a BAA scoped to ZDR endpoints structurally cannot cover a model that mandates retention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A model's retention class is now a BAA-eligibility question.&lt;/strong&gt; Confirm in writing that your BAA covers the specific model before routing PHI to it — and remember the chain shifts on clouds: on Bedrock the platform is your business associate; on Foundry, Anthropic processes the data directly. One sharp edge: PHI must never appear in JSON schema definitions for structured outputs — cached schemas don't get the same protections as message content.&lt;/p&gt;

&lt;h3&gt;
  
  
  Children's products (COPPA)
&lt;/h3&gt;

&lt;p&gt;The timing is awkward: the FTC's &lt;a href="https://www.ftc.gov/news-events/news/press-releases/2025/01/ftc-finalizes-changes-childrens-privacy-rule-limiting-companies-ability-monetize-kids-data" rel="noopener noreferrer"&gt;amended COPPA Rule&lt;/a&gt; took effect June 23, 2025, with compliance on most provisions due April 22, 2026 — the first model with mandatory provider-side retention arrived just as operators finished implementing the new retention obligations. Two of those interact directly with the 30-day window: a &lt;strong&gt;written, public data retention policy&lt;/strong&gt; is now mandatory (§312.10) — what children's data is collected, why, and when it's deleted — and &lt;strong&gt;indefinite retention is prohibited&lt;/strong&gt;, with retention limited to what's reasonably necessary for the collected purpose.&lt;/p&gt;

&lt;p&gt;A bounded 30-day window with automatic deletion is the &lt;em&gt;compatible&lt;/em&gt; shape — but the provider retains for &lt;em&gt;its&lt;/em&gt; trust-and-safety purpose, not the purpose you collected the child's data for, and your notice must describe the processor relationship accurately. For child-directed products that adopted ZDR specifically to minimize the data trail, the routing answer applies with higher stakes: children's traffic stays on ZDR-eligible models, or the Covered Model window goes into your §312.10 policy first.&lt;/p&gt;

&lt;h3&gt;
  
  
  The same pattern, other sectors
&lt;/h3&gt;

&lt;p&gt;Once you see the structure — &lt;em&gt;regulated data, retained copy at a vendor, sector rule governing retention&lt;/em&gt; — it recurs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Biometrics (Illinois BIPA):&lt;/strong&gt; operators need a written, publicly available retention schedule and destruction guidelines for biometric data. A provider's 30-day copy of prompts containing biometric identifiers belongs in that schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payments (PCI DSS / GLBA):&lt;/strong&gt; PCI DSS prohibits storing sensitive authentication data after authorization — anywhere. Card data pasted into a prompt becomes card data retained at a provider for 30 days. The clean answer is upstream redaction, not downstream paperwork.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Education (FERPA):&lt;/strong&gt; vendors handling student records under the school-official exception must remain under the school's &lt;em&gt;direct control&lt;/em&gt;. A safety-retention copy the school cannot access or delete early sits uneasily with that standard — a question for counsel before EdTech traffic hits a Covered Model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial services — the inversion (SEC/FINRA):&lt;/strong&gt; broker-dealers must &lt;em&gt;retain&lt;/em&gt; business communications under books-and-records rules. For them the provider's window isn't the problem; capturing their own compliant copy is. Same retention question, opposite sign.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: &lt;strong&gt;sector rules regulate retention in both directions&lt;/strong&gt;, and a provider-side window you don't control must be mapped into whichever direction your sector points.&lt;/p&gt;




&lt;h2&gt;
  
  
  A decision checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Inventory which models your traffic actually touches.&lt;/strong&gt; Retention class is now a per-model attribute, not a per-provider one.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;If you have ZDR: decide deliberately&lt;/strong&gt; — skip Covered Models, split by workspace/project/subscription, or accept retention org-wide. Don't let it happen implicitly.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Enforce the posture in infrastructure&lt;/strong&gt; — Bedrock SCPs, workspace privacy controls, separate cloud projects — not in a wiki page.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;B2C: update privacy notices and DSAR playbooks&lt;/strong&gt;; route non-consenting users to ZDR-eligible models instead of building opt-outs that can't work.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Regulated data: confirm coverage per model, in writing&lt;/strong&gt; — BAA for PHI, §312.10 policy for children's data, retention schedules for biometrics — before routing that data to a retention-required model.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Audit your own logging.&lt;/strong&gt; A provider's 30-day window is irrelevant if your gateway logs prompts indefinitely.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;The 30-day window attached to Fable 5 is not a data grab — it's bounded, purpose-limited abuse monitoring, consistent with what most of the industry already does by default, made mandatory for one model class because cross-request misuse detection doesn't work on deleted data. For most teams the engineering impact is zero and the governance impact is a paragraph in a vendor review.&lt;/p&gt;

&lt;p&gt;But for organizations whose compliance position assumed zero retention — ZDR-scoped BAAs, privacy notices that say nothing persists, children's products built on data minimization — Fable 5 is the moment that assumption stopped being uniform across models. The fix isn't avoiding the model. It's making retention class an explicit, per-model input to routing decisions, the same way you already treat price and context window.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can I use Claude Fable 5 under a zero-data-retention agreement?&lt;/strong&gt;&lt;br&gt;
No. Fable 5 and Mythos 5 are Covered Models requiring 30-day retention; ZDR organizations get a &lt;code&gt;400 invalid_request_error&lt;/code&gt; unless they enable 30-day retention for a workspace and route Fable 5 traffic through it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does going through AWS Bedrock, Vertex AI, or Microsoft Foundry avoid the requirement?&lt;/strong&gt;&lt;br&gt;
No. Each platform gates the model behind its own retention opt-in: &lt;code&gt;provider_data_share&lt;/code&gt; on Bedrock, Anthropic data sharing plus Model Garden terms on Vertex, Anthropic's terms at deployment on Foundry (where Anthropic, not Microsoft, is the data processor). Existing ZDR arrangements don't carry over on any of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can my end users opt out of the retention?&lt;/strong&gt;&lt;br&gt;
No — there is no opt-out mechanism. The lever you hold is routing: send users who decline data sharing to ZDR-eligible models. Don't ship a preference toggle that doesn't change anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the retained data used to train models?&lt;/strong&gt;&lt;br&gt;
Anthropic states retained data is never used for training without express permission. The purpose is trust-and-safety review: automated screening, with flagged conversations reviewable only by approved personnel who cannot export the data, under tamper-proof access logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the 30-day retention change how prompt caching works?&lt;/strong&gt;&lt;br&gt;
No. Cache entries follow their own short TTLs (5 minutes or 1 hour) and the caching contract on Fable 5 is unchanged — see our &lt;a href="https://dev.to/blog/claude-fable-5-prompt-caching/"&gt;measured evaluation&lt;/a&gt;. The 30-day window is a separate, parallel retention for safety review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/manage-claude/api-and-data-retention" rel="noopener noreferrer"&gt;Anthropic — API and data retention&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.claude.com/en/articles/15425695" rel="noopener noreferrer"&gt;Anthropic — Covered Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.claude.com/en/articles/15425996-data-retention-practices-for-mythos-class-models" rel="noopener noreferrer"&gt;Anthropic — Data retention practices for Mythos-class models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/data-retention.html" rel="noopener noreferrer"&gt;AWS — Amazon Bedrock data retention&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/partner-models/claude/fable-5" rel="noopener noreferrer"&gt;Google Cloud — Claude Fable 5 (partner models)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/data-governance" rel="noopener noreferrer"&gt;Google Cloud — Vertex AI data governance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/claude-models/data-privacy" rel="noopener noreferrer"&gt;Microsoft — Claude in Foundry: data, privacy, and security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/enterprise-privacy/" rel="noopener noreferrer"&gt;OpenAI — Enterprise privacy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.openai.com/en/articles/8660679-how-can-i-get-a-business-associate-agreement-baa-with-openai" rel="noopener noreferrer"&gt;OpenAI — BAA for API services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ftc.gov/news-events/news/press-releases/2025/01/ftc-finalizes-changes-childrens-privacy-rule-limiting-companies-ability-monetize-kids-data" rel="noopener noreferrer"&gt;FTC — COPPA Rule amendments (press release)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.federalregister.gov/documents/2025/04/22/2025-05904/childrens-online-privacy-protection-rule" rel="noopener noreferrer"&gt;Federal Register — Children's Online Privacy Protection Rule&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;All checked 2026-06-12. Policies change — verify against current documents and your own contracts. Not legal advice.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>fable5</category>
    </item>
    <item>
      <title>Claude Fable 5: Caching, Tokenizer &amp; Cost vs Opus 4.6</title>
      <dc:creator>synthorai</dc:creator>
      <pubDate>Thu, 11 Jun 2026 15:49:46 +0000</pubDate>
      <link>https://dev.to/synthorai/claude-fable-5-caching-tokenizer-cost-vs-opus-46-43ce</link>
      <guid>https://dev.to/synthorai/claude-fable-5-caching-tokenizer-cost-vs-opus-46-43ce</guid>
      <description>&lt;p&gt;&lt;code&gt;claude-fable-5&lt;/code&gt; is now available on the Synthorai gateway. If you cache against the Claude line, the good news is that the caching and TTL contract is a carry-over: same &lt;code&gt;cache_control&lt;/code&gt; markers, same 5-minute and 1-hour TTLs, same write premiums, same deep read discount. Your caching code moves over by changing one string.&lt;/p&gt;

&lt;p&gt;The thing to budget for isn't the cache mechanics — it's the bill. Fable 5 lists at &lt;strong&gt;2x the Opus token price&lt;/strong&gt;, and it tokenizes the same English text into &lt;strong&gt;~45% more tokens than Opus 4.6&lt;/strong&gt; (it's on the post-4.6 tokenizer, identical to Opus 4.8). Those two multipliers stack. This post measures all of it so you don't have to.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All numbers below were measured against &lt;code&gt;https://synthorai.io/&lt;/code&gt; (Anthropic-native &lt;code&gt;/v1/messages&lt;/code&gt;) on 2026-06-10 with a stable ~6.6–9.6K-token English system prompt, &lt;code&gt;max_tokens&lt;/code&gt; small, single sequential run. Cost figures are read from the gateway &lt;code&gt;usage.cost&lt;/code&gt; field; &lt;strong&gt;ratios&lt;/strong&gt; (token counts, write premium, read discount, cross-model cost) are the portable part — absolute dollars scale with your prompt. Reproduce against your own prompt before quoting them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;anth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SYNTHORAI_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://synthorai.io/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# SDK appends /v1/messages
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-fable-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# the only line that changes
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# input_tokens, cache_creation_input_tokens, cache_read_input_tokens, cost
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Swap &lt;code&gt;claude-opus-4-6&lt;/code&gt; → &lt;code&gt;claude-fable-5&lt;/code&gt; and nothing in your caching path needs to move. Fable 5 is an Anthropic-native model with a 1M-token context window. One behavioral note: it is a reasoning model and &lt;strong&gt;emits thinking tokens by default&lt;/strong&gt; — even a trivial "reply OK" returned &lt;code&gt;output_tokens_details.thinking_tokens &amp;gt; 0&lt;/code&gt; in our runs, where Opus 4.6/4.8 returned zero. Budget output tokens accordingly. The mechanics behind &lt;code&gt;cache_control&lt;/code&gt; are covered in &lt;a href="https://dev.to/blog/prompt-caching-tutorial-code-examples/"&gt;the caching tutorial&lt;/a&gt;; the architecture of &lt;em&gt;why&lt;/em&gt; the cache exists is in &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;Part 1 of the series&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The headline: Fable 5 is on the new tokenizer
&lt;/h2&gt;

&lt;p&gt;The token count for the Opus line jumped at the 4.7 generation: the same English text that counted as ~6.6K tokens on 4.6 counts as ~9.6K on 4.8. &lt;strong&gt;Fable 5 lands on the new side&lt;/strong&gt; — identical text reports the exact same token count as Opus 4.8.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input tokens (identical text)&lt;/th&gt;
&lt;th&gt;Tokenizer generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6,614&lt;/td&gt;
&lt;td&gt;pre-4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9,619&lt;/td&gt;
&lt;td&gt;post-4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-fable-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9,619&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;post-4.7 (identical to 4.8)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The same system prompt is &lt;strong&gt;~45% more tokens on Fable 5 than on Opus 4.6&lt;/strong&gt; (9,619 / 6,614 = 1.45). This is the single most important number to internalize before you migrate, because every downstream figure — cost, the 1,024-token cache-eligibility floor, your per-call budget — is computed in tokens.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We're describing a measured observation — identical text, identical token count on Fable 5 and Opus 4.8, ~45% above Opus 4.6 — most consistent with the tokenizer/vocabulary update that shipped at the 4.7 generation. If you're coming from 4.6 or earlier, re-measure; if you're coming from 4.7/4.8, expect parity.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Caching behavior: the contract is unchanged
&lt;/h2&gt;

&lt;p&gt;We ran the same no-cache / cold-write / warm-read sequence on each model. The discount structure is identical end to end — Fable 5 honors &lt;code&gt;cache_control&lt;/code&gt; and reports the same usage fields (&lt;code&gt;cache_creation_input_tokens&lt;/code&gt;, &lt;code&gt;cache_read_input_tokens&lt;/code&gt;, and the &lt;code&gt;ephemeral_5m&lt;/code&gt; / &lt;code&gt;ephemeral_1h&lt;/code&gt; buckets).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;5m cache write&lt;/th&gt;
&lt;th&gt;1h cache write&lt;/th&gt;
&lt;th&gt;Warm read&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.25x&lt;/td&gt;
&lt;td&gt;2.00x&lt;/td&gt;
&lt;td&gt;~9% of no-cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.25x&lt;/td&gt;
&lt;td&gt;2.00x&lt;/td&gt;
&lt;td&gt;~6% of no-cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-fable-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.24x&lt;/td&gt;
&lt;td&gt;1.99x&lt;/td&gt;
&lt;td&gt;~6% of no-cache&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two invariants hold across all three:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write premium ≈ 1.25x (5m), ≈ 2x (1h).&lt;/strong&gt; The first (cold) call costs ~1.25x the no-cache price to populate a 5-minute entry, or ~2x for a 1-hour entry. Break-even is one hit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read discount ≈ 90%+.&lt;/strong&gt; A warm cache read on Fable 5 cost ~6% of the no-cache call — a ~94% discount, in line with (slightly better than) Anthropic's documented ~90% cached-read economics. Reads stay deeply discounted regardless of TTL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The percentages are flat across the line. As with the Opus 4.7 → 4.8 step, the higher &lt;em&gt;absolute&lt;/em&gt; bill on Fable 5 is a price-and-token story, not a cache-economics story — covered next.&lt;/p&gt;




&lt;h2&gt;
  
  
  TTL behavior: both windows honored
&lt;/h2&gt;

&lt;p&gt;Fable 5 supports the same two TTLs as the rest of the line: a 5-minute sliding default and an opt-in 1-hour window. We isolated each TTL with a unique prefix per call (so no stale entry could contaminate the result) and confirmed the usage object reports the correct bucket — &lt;code&gt;cache_creation.ephemeral_5m_input_tokens&lt;/code&gt; or &lt;code&gt;ephemeral_1h_input_tokens&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1-hour TTL — same marker syntax on Fable 5 as on the Opus line
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 1-hour write costs ~2x no-cache (vs ~1.25x for the 5-minute write), and reads stay at the deep discount regardless of TTL — identical to Opus 4.6/4.8. If you picked &lt;code&gt;5m&lt;/code&gt; for live chat and &lt;code&gt;1h&lt;/code&gt; for agents with human-in-the-loop pauses on Opus, keep those choices on Fable 5.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cost story: 2x price x 1.45x tokens
&lt;/h2&gt;

&lt;p&gt;Here is where Fable 5 actually differs. Two things push the bill up, and they multiply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. List price is 2x the Opus tier.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;th&gt;Cache read ($/M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;claude-opus-4-6&lt;/code&gt; / &lt;code&gt;4-8&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-fable-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2. The same text is ~45% more tokens than on 4.6&lt;/strong&gt; (the tokenizer shift above).&lt;/p&gt;

&lt;p&gt;Multiply them and the same English prompt costs materially more. Measured against the identical system prompt on each model (gateway &lt;code&gt;usage.cost&lt;/code&gt;, same single run):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison&lt;/th&gt;
&lt;th&gt;Token ratio&lt;/th&gt;
&lt;th&gt;Price ratio&lt;/th&gt;
&lt;th&gt;Same-prompt cost ratio (measured)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fable 5 vs &lt;strong&gt;Opus 4.8&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;1.00x&lt;/td&gt;
&lt;td&gt;2.0x&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fable 5 vs &lt;strong&gt;Opus 4.6&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;1.45x&lt;/td&gt;
&lt;td&gt;2.0x&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So against Opus 4.8 (same tokenizer), Fable 5 is a clean &lt;strong&gt;2x&lt;/strong&gt; — pure price premium. Against Opus 4.6, the tokenizer change compounds the price change into roughly &lt;strong&gt;2.9x&lt;/strong&gt; the cost for the same prompt. Your cache &lt;em&gt;discount&lt;/em&gt; is unchanged, but the absolute base it applies to is ~2.9x larger than it was on 4.6. If you sized a per-call budget against 4.6, re-do it.&lt;/p&gt;

&lt;p&gt;A practical consequence: &lt;strong&gt;re-check the 1,024-token cache-eligibility floor.&lt;/strong&gt; Anthropic only caches prefixes at or above a minimum size. A prompt that sat just under the floor on 4.6 (in old-tokenizer tokens) may clear it on Fable 5 (~45% more tokens) — and vice versa for size estimates built on the old count. Always read &lt;code&gt;cache_creation_input_tokens&lt;/code&gt; / &lt;code&gt;cache_read_input_tokens&lt;/code&gt; from the live response rather than estimating from a local tokenizer that may not match.&lt;/p&gt;




&lt;h2&gt;
  
  
  Migration checklist (Opus → Fable 5)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Caching code carries over verbatim.&lt;/strong&gt; &lt;code&gt;cache_control&lt;/code&gt; markers, breakpoint count (up to 4), &lt;code&gt;ttl: "1h"&lt;/code&gt;, usage-field names — all identical.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;TTL choices carry over.&lt;/strong&gt; 5m for live/session workloads, 1h for bursty/agent-with-pauses.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Discount economics carry over.&lt;/strong&gt; ~90%+ read, ~1.25x write (5m), ~2x write (1h).&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Re-budget on absolute cost.&lt;/strong&gt; Fable 5 is ~2x Opus per token, and ~2.9x the same-prompt cost vs Opus 4.6. The discount percentage is unchanged; the base it applies to is not.&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Re-measure token counts&lt;/strong&gt; if coming from 4.6 or earlier (expect ~45% more for the same text). From 4.7/4.8, expect parity.&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Account for default thinking tokens.&lt;/strong&gt; Fable 5 emits reasoning tokens by default — they bill at the output rate ($50/M). Cap or disable thinking if you don't need it.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;For a team already caching against Claude, &lt;code&gt;claude-fable-5&lt;/code&gt; is an easy &lt;em&gt;integration&lt;/em&gt;: the entire caching and TTL surface is stable, so there's nothing to relearn and no code to rewrite. It is not an easy &lt;em&gt;budget&lt;/em&gt; swap from Opus 4.6 — between the 2x token price and the ~45% tokenizer inflation, the same prompt runs ~2.9x the cost. Confirm your numbers against the live &lt;code&gt;usage&lt;/code&gt; object, decide whether you need the default thinking tokens, and size the cache breakpoints against the new token counts.&lt;/p&gt;

&lt;p&gt;For the full caching playbook — prompt structure, hit-rate debugging, TTL-aware patterns — see the four-part series starting with &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;How KV Cache &amp;amp; TTL Work&lt;/a&gt; and the &lt;a href="https://dev.to/blog/prompt-caching-tutorial-code-examples/"&gt;working Python tutorial&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need to change my &lt;code&gt;cache_control&lt;/code&gt; code to use Fable 5?&lt;/strong&gt;&lt;br&gt;
No. The marker syntax, breakpoint limit, and TTL options are identical to the Opus line. Change the &lt;code&gt;model&lt;/code&gt; field and nothing else in the caching path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the cache read discount change on Fable 5?&lt;/strong&gt;&lt;br&gt;
No. A warm read is a small single-digit fraction of the no-cache input price (~90%+ off) — we measured ~94% on Fable 5, consistent with Anthropic's documented cached-read economics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Fable 5 support the 1-hour TTL?&lt;/strong&gt;&lt;br&gt;
Yes. &lt;code&gt;{"type": "ephemeral", "ttl": "1h"}&lt;/code&gt; works exactly as on Opus. The 1-hour write costs ~2x no-cache; the 5-minute write ~1.25x. Reads stay deeply discounted on both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is the same prompt so much more expensive on Fable 5 than on Opus 4.6?&lt;/strong&gt;&lt;br&gt;
Two stacked multipliers: Fable 5 lists at 2x the per-token price, and the same English text counts as ~45% more tokens (it uses the post-4.6 tokenizer). Together that's ~2.9x the cost for an identical prompt. The cache &lt;em&gt;discount&lt;/em&gt; is unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Fable 5 a drop-in replacement for Opus 4.8?&lt;/strong&gt;&lt;br&gt;
On the caching/TTL surface and token counts, yes — token counts are identical, so the only delta is the 2x price and Fable 5's default thinking tokens. We don't publish capability benchmarks we haven't run; for quality and reasoning claims, see Anthropic's model card.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verification: all token-count, cost, write-premium, and read-discount figures measured against &lt;code&gt;https://synthorai.io/&lt;/code&gt; on 2026-06-10 using the official &lt;code&gt;anthropic&lt;/code&gt; SDK, single tenant, single sequential run. Cost is read from the gateway &lt;code&gt;usage.cost&lt;/code&gt; field; cross-model and premium/discount ratios are computed from those measured costs and are independent of any account-level promotion. Discount/premium ratios cross-checked against &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic Prompt Caching docs&lt;/a&gt;. Warm-read latency (TTFT) was dominated by network jitter in our run and is omitted as unreliable. Your numbers will vary with prompt, region, and load.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>llm</category>
      <category>claude</category>
    </item>
    <item>
      <title>Provider Drift: How Default Routing Inflates LLM Cost 3.9 — A Measurement</title>
      <dc:creator>synthorai</dc:creator>
      <pubDate>Fri, 05 Jun 2026 09:34:34 +0000</pubDate>
      <link>https://dev.to/synthorai/provider-drift-how-default-routing-inflates-llm-cost-39x-a-measurement-2017</link>
      <guid>https://dev.to/synthorai/provider-drift-how-default-routing-inflates-llm-cost-39x-a-measurement-2017</guid>
      <description>&lt;p&gt;You turned on prompt caching, the hit counter ticks now and then, but your bill barely moved. Before blaming your prompt structure, look at something the dashboard hides: which upstream actually served each request.&lt;/p&gt;

&lt;p&gt;Multi-provider gateways spread a single model across several upstream providers and pick one per request. Prompt caches are per-provider (often per-node inside a provider). So when your second identical request lands on a different upstream than the first, it is a cache miss, even though your prompt did not change one byte. This is &lt;strong&gt;provider drift&lt;/strong&gt;, and on a pay-per-token model it quietly multiplies your cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two conditions that trigger it
&lt;/h2&gt;

&lt;p&gt;This is not a misconfiguration you opted into. It is what you get out of the box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default auto routing.&lt;/strong&gt; The request is sent to the model without pinning an upstream, so the gateway chooses one per call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default provider sort = "default (balanced)".&lt;/strong&gt; The gateway load-balances across eligible upstreams rather than sticking to one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both are the factory defaults. You do not have to touch anything to get drift; you have to touch settings to avoid it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 20 identical requests look like
&lt;/h2&gt;

&lt;p&gt;We sent the &lt;strong&gt;same&lt;/strong&gt; ~8K-token prefix 20 times in a row to one popular multi-provider gateway, on the defaults above, asking for the upstream's own reported provider and cache fields each time. For a disk-cached model in the DeepSeek family:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;9 distinct upstreams&lt;/strong&gt; served the 20 calls: &lt;code&gt;N***a&lt;/code&gt;, &lt;code&gt;S***w&lt;/code&gt;, &lt;code&gt;M***h&lt;/code&gt;, &lt;code&gt;D***a&lt;/code&gt;, &lt;code&gt;A***L&lt;/code&gt;, &lt;code&gt;P***l&lt;/code&gt;, &lt;code&gt;S***e&lt;/code&gt;, &lt;code&gt;V***e&lt;/code&gt;, &lt;code&gt;A***d&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hit rate: 4/20 (20%).&lt;/strong&gt; You only hit on the calls that happened to land on an upstream that had already cached your prefix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run the same 20 calls against a &lt;strong&gt;single-backend&lt;/strong&gt; gateway (one model, one upstream, no balancing) and the hit rate is &lt;strong&gt;19/20 (95%)&lt;/strong&gt; on the identical workload. Same model, same prompt, same number of calls. The only variable is whether routing drifts.&lt;/p&gt;

&lt;p&gt;For contrast, on the very same multi-provider gateway a GPT-class model was routed to &lt;strong&gt;one&lt;/strong&gt; upstream (&lt;code&gt;A***e&lt;/code&gt;) for all 20 calls and hit &lt;strong&gt;19/20&lt;/strong&gt;. Drift is not uniform; it bites whichever model the gateway happens to spread, and on this run that was the DeepSeek-family model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion A: the cost you expected vs the cost you paid
&lt;/h2&gt;

&lt;p&gt;Per-call cost on the drifting model split cleanly by cache outcome:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;call type&lt;/th&gt;
&lt;th&gt;median cost / call&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cache hit&lt;/td&gt;
&lt;td&gt;~$0.00015&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cache miss&lt;/td&gt;
&lt;td&gt;~$0.00062&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A miss costs about &lt;strong&gt;4x a hit&lt;/strong&gt; on this model (on raw input tokens the published gap is wider still, roughly 50x). Now total it across the 20 calls:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;scenario&lt;/th&gt;
&lt;th&gt;hit rate&lt;/th&gt;
&lt;th&gt;cost for 20 identical calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;expected&lt;/strong&gt; (cache reachable)&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0026&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;actual&lt;/strong&gt; (default drift)&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0102&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same model, same prompt, same 20 requests. Provider drift made the run cost &lt;strong&gt;~3.9x more&lt;/strong&gt;. The caching was "on" the whole time; the routing layer simply billed most of your tokens at the miss rate. Scale that to a production endpoint replaying a large stable prefix all day and the gap is the bulk of your input spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion B: no cache also means no latency win
&lt;/h2&gt;

&lt;p&gt;Caching is not only a cost lever. A warm prefill returns the first token sooner. When drift denies you the cache, you forfeit that speedup too. We measured time-to-first-token (TTFT) on repeated identical calls:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-class model (routed to one consistent upstream, cache reachable):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;call&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st (cold, miss)&lt;/td&gt;
&lt;td&gt;~1760 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;subsequent (warm, hit)&lt;/td&gt;
&lt;td&gt;~1130 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Caching buys roughly a &lt;strong&gt;36% faster first token&lt;/strong&gt;, and it is steady: every warm call lands in a tight band.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek-family model (default drift, cache rarely reachable):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hits across a 10-call repeat: &lt;strong&gt;0&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;TTFT swung from &lt;strong&gt;~1000 ms to ~4500 ms&lt;/strong&gt; call to call, with occasional empty responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because almost every request is a fresh upstream, you stay at cold-prefill latency and inherit the variance of whichever provider answered. The GPT model got a 36% TTFT improvement from a reachable cache; the drifting model got none, plus a 4.5x spread between its fastest and slowest call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audit your own setup in five minutes
&lt;/h2&gt;

&lt;p&gt;Do not trust these numbers, or anyone's. Send the same long prefix several times and watch two fields. No domains hardcoded; point it at your own gateway with env vars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GW_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GW_BASE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;SYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[probe &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a support assistant. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GW_MODEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYS&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;include&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;det&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens_details&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;det&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;det&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;# populated when exposed
&lt;/span&gt;    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hit rate &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/20; upstreams seen: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More than one upstream for the same model means drift. A hit rate well below your prompt stability means it is taxing you. The fuller method is in &lt;a href="https://dev.to/blog/llm-gateway-cache-audit/"&gt;Does Your LLM Gateway Lie About Cache?&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to look for
&lt;/h2&gt;

&lt;p&gt;The cure for drift is structural: route a given model to a consistent backend so a warm cache is actually reachable on the next request, instead of load-balancing each call onto a fresh upstream that has never seen your prefix. When you evaluate a gateway, send the same prefix 20 times and count the upstreams. One is what you want. Nine is a tax.&lt;/p&gt;

&lt;p&gt;A fair caveat: prompt caching is best-effort everywhere, and on disk-cached models the hit rate still softens over long idle gaps even with a single backend. Eliminating drift does not hand you an infinite cache. It removes the largest and most wasteful source of misses, the one you never agreed to and cannot see.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;"Supports prompt caching" and "your cache is reachable" are different claims. A gateway that scatters one model across a rotating cast of upstreams can report cache support truthfully while delivering a 20% hit rate, a ~4x bill, and first-token latency that swings 4.5x. The number to watch is not whether caching is advertised. It is your measured hit rate and how many upstreams your identical requests touch. Run the probe and let the data settle it.&lt;/p&gt;

&lt;p&gt;For the broader audit method see &lt;a href="https://dev.to/blog/llm-gateway-cache-audit/"&gt;Does Your LLM Gateway Lie About Cache?&lt;/a&gt;; for why caches exist at all, see &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;How KV Cache &amp;amp; TTL Work&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is this a misconfiguration on my side?&lt;/strong&gt;&lt;br&gt;
No. It happens on the factory defaults: auto routing with the provider sort left at "default (balanced)." Avoiding drift requires actively pinning an upstream, not the other way around.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does pinning one upstream fix it?&lt;/strong&gt;&lt;br&gt;
It removes cross-provider drift, but a single upstream often runs multiple replicas without prefix affinity, so hits can still flip-flop. Measure after pinning rather than assuming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did the GPT-class model not drift?&lt;/strong&gt;&lt;br&gt;
On this run the gateway happened to route it to a single upstream. Drift is per-model and depends on how many eligible upstreams the gateway balances across; it is not uniform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the cost gap really ~4x?&lt;/strong&gt;&lt;br&gt;
On the per-call totals we measured, a miss was ~4x a hit; on raw input-token pricing for this model class the published hit-vs-miss gap is closer to 50x. Either way, turning expected hits into misses is the expensive part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What single metric should I monitor?&lt;/strong&gt;&lt;br&gt;
Cache hit rate per model over time, alongside the count of distinct upstreams per model. If hit rate falls or upstream count rises, your effective token cost just went up.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>openrouter</category>
    </item>
    <item>
      <title>Does Your LLM Gateway Lie About Cache? A 5-Min Audit</title>
      <dc:creator>synthorai</dc:creator>
      <pubDate>Tue, 02 Jun 2026 15:30:00 +0000</pubDate>
      <link>https://dev.to/synthorai/does-your-llm-gateway-lie-about-cache-a-5-min-audit-k3l</link>
      <guid>https://dev.to/synthorai/does-your-llm-gateway-lie-about-cache-a-5-min-audit-k3l</guid>
      <description>&lt;p&gt;A gateway sits between your code and the model provider. You read &lt;code&gt;cached_tokens&lt;/code&gt; back from the response, you see a smaller number, and you trust the dollars saved are real. But you never see the upstream call. The gateway could report a cache hit and still bill the full input rate. It could fail to cache at all behind a perfectly clean response. It could strip usage metadata on streaming, the path most of your production traffic runs on, so you can't tell either way.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. A &lt;a href="https://news.ycombinator.com/item?id=48319827" rel="noopener noreferrer"&gt;Hacker News PSA&lt;/a&gt; reported that routing DeepSeek V4 through a popular gateway returned &lt;strong&gt;2–3× fewer cached tokens&lt;/strong&gt; than calling DeepSeek directly; one commenter posted bills showing the caching stats weren't reported through the gateway at all. The gateway's team replied that they couldn't reproduce it and were investigating. That disagreement is the whole point. When two parties can't agree on whether your cache is working, the only tiebreaker is a measurement you ran yourself.&lt;/p&gt;

&lt;p&gt;Usually this isn't malice. It's a translation gap or an unfinished code path. The effect on your invoice is the same either way. This post is one runnable script that audits both styles of prompt caching, automatic (DeepSeek) and marker-based (Claude), against any gateway, including this one. It prints a side-by-side scorecard in under five minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four ways a gateway can lie about cache
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;th&gt;What you see&lt;/th&gt;
&lt;th&gt;What's actually happening&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Silent no-cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A clean response, no error&lt;/td&gt;
&lt;td&gt;Nothing was cached; you pay full price every call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache theater&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cached_tokens&lt;/code&gt; &amp;gt; 0 in the response&lt;/td&gt;
&lt;td&gt;…but the billed cost is the full input rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markup creep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A plausible cost number&lt;/td&gt;
&lt;td&gt;The gateway's markup quietly eats the discount&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata blackout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clean text output&lt;/td&gt;
&lt;td&gt;Usage fields stripped (esp. on streaming), so you can't audit it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dangerous ones are the first two: the response &lt;em&gt;looks&lt;/em&gt; like caching is working. You find out at the end of the month.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two cache mechanisms, one audit
&lt;/h2&gt;

&lt;p&gt;Providers expose caching in two shapes, and a real gateway has to pass both through faithfully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic&lt;/strong&gt; (DeepSeek, GPT, Gemini, Qwen): the provider caches any sufficiently long prefix on its own. No markers. Hits appear in &lt;code&gt;usage.prompt_tokens_details.cached_tokens&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marker-based&lt;/strong&gt; (Anthropic Claude): you tag cacheable spans with &lt;code&gt;cache_control&lt;/code&gt;. Hits appear as &lt;code&gt;cache_read_input_tokens&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The script hides that difference behind a thin &lt;code&gt;Lane&lt;/code&gt; adapter, then runs all five checks against both. Here is the whole thing: two lanes and one &lt;code&gt;audit()&lt;/code&gt; that performs every check.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;KEY&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GATEWAY_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;oai&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://synthorai.io/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# auto lane
&lt;/span&gt;&lt;span class="n"&gt;anth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://synthorai.io/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# marker lane
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AutoLane&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="c1"&gt;# DeepSeek / GPT / Gemini / Qwen: provider caches automatically
&lt;/span&gt;    &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;stream_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;include_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens_details&lt;/span&gt;
                    &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cached_tokens&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;},{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;}]).&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;
        &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cached_tokens&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens_details&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MarkerLane&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="c1"&gt;# Anthropic Claude: explicit cache_control markers
&lt;/span&gt;    &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;marker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;anth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;pass&lt;/span&gt;
                &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_final_message&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_read_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;}]).&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_read_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_creation_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;long_prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;SYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[audit &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;long_prompt&lt;/span&gt;    &lt;span class="c1"&gt;# unique =&amp;gt; guaranteed cold start
&lt;/span&gt;    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 1: cache engages. Cold misses; a repeat should hit. A cache can
&lt;/span&gt;    &lt;span class="c1"&gt;# take a moment to become readable, so poll the warm read (sleep 1s between
&lt;/span&gt;    &lt;span class="c1"&gt;# attempts) before concluding "no cache".
&lt;/span&gt;    &lt;span class="n"&gt;cold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SYS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;warm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cold&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;warm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SYS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warm &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 2: cost reflects the discount (catches "cache theater").
&lt;/span&gt;    &lt;span class="n"&gt;disc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;discount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;disc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;disc&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;disc&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 3: token accounting. cached fits inside the prompt total.
&lt;/span&gt;    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 4: streaming preserves usage metadata (cache count AND cost).
&lt;/span&gt;    &lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SYS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# CHECK 5: negative control. a unique prefix must always miss.
&lt;/span&gt;    &lt;span class="n"&gt;n1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[uniq &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;long_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;n2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[uniq &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;long_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;n2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;

&lt;span class="c1"&gt;# Any long, STABLE text works as the cacheable prefix: a system prompt, tool
# schemas, or a retrieved document. It only needs to clear the provider's
# minimum cacheable size (see Check 1). Load yours however you like.
&lt;/span&gt;&lt;span class="n"&gt;LONG_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# ~8K+ tokens
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AutoLane&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;MarkerLane&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LONG_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the post walks each check: the lines that implement it, what both lanes returned, and how to read the result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Check 1: does the cache engage?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SYS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;warm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cold&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;                       &lt;span class="c1"&gt;# poll: a cache may take a beat to be readable
&lt;/span&gt;    &lt;span class="n"&gt;warm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SYS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warm &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;cold cached&lt;/th&gt;
&lt;th&gt;warm cached&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;7,552 / 7,870 (96%)&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;12,446 / 12,454 (99.9%)&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A cold call on a unique prefix must cache nothing; a repeat must hit. The single most common false alarm is &lt;strong&gt;declaring "no cache" after one warm call&lt;/strong&gt;, because caches don't always become readable instantly. The loop polls a few times with a 1-second pause, which removes the flakiness. If you still get &lt;code&gt;0&lt;/code&gt; after several warm calls on a prompt above the size floor (~1,024 tokens for most providers; DeepSeek matches at a finer 64), the cache genuinely isn't engaging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Check 2: does the cost reflect the discount?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;disc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;disc&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;disc&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;cold cost&lt;/th&gt;
&lt;th&gt;warm cost&lt;/th&gt;
&lt;th&gt;discount&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.00107&lt;/td&gt;
&lt;td&gt;$0.00030&lt;/td&gt;
&lt;td&gt;72.3%&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.07112&lt;/td&gt;
&lt;td&gt;$0.00672&lt;/td&gt;
&lt;td&gt;90.6%&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the check that catches &lt;strong&gt;cache theater&lt;/strong&gt;. The warm call's cost must actually drop. DeepSeek's per-call total fell ~72% (the cached input is discounted more steeply; output and the uncached remainder dilute the headline). Claude's cached &lt;em&gt;read&lt;/em&gt; is ~90% off. The failure signal is unmistakable: &lt;code&gt;cached_tokens &amp;gt; 0&lt;/code&gt; with &lt;strong&gt;identical&lt;/strong&gt; cold and warm cost means the gateway is reporting a hit it isn't pricing. You're paying full freight for a cache that "works" on paper.&lt;/p&gt;




&lt;h2&gt;
  
  
  Check 3: do the token counts add up?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;warm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;cached&lt;/th&gt;
&lt;th&gt;prompt total&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7,552&lt;/td&gt;
&lt;td&gt;7,870&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12,446&lt;/td&gt;
&lt;td&gt;12,454&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;cached&lt;/code&gt; has to sit inside the prompt total, with the remainder billed as uncached input. Both reconcile. If &lt;code&gt;cached_tokens&lt;/code&gt; exceeds &lt;code&gt;prompt_tokens&lt;/code&gt;, or the uncached remainder is implausibly large for a stable prefix, the gateway is mis-accounting: re-tokenizing or double-counting somewhere in the translation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Check 4: does streaming preserve the metadata?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SYS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;stream cached&lt;/th&gt;
&lt;th&gt;stream cost&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;preserved&lt;/td&gt;
&lt;td&gt;preserved&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;preserved&lt;/td&gt;
&lt;td&gt;preserved&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most production chat streams, so this is the path that matters most. On both lanes the cache hit signal and the cost both survive the stream. &lt;code&gt;cached_tokens&lt;/code&gt; and &lt;code&gt;cost&lt;/code&gt; come through in the final usage chunk, so your highest-volume path stays auditable. The failure mode to watch for is a gateway that drops usage on streaming: a clean token output with no &lt;code&gt;cached_tokens&lt;/code&gt; or &lt;code&gt;cost&lt;/code&gt; means you're flying blind on the path you run most. (Pass &lt;code&gt;stream_options={"include_usage": True}&lt;/code&gt; so the usage chunk is emitted at all.)&lt;/p&gt;




&lt;h2&gt;
  
  
  Check 5: the negative control
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;n1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[uniq &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;long_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[uniq &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;long_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;n2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;unique-prefix A&lt;/th&gt;
&lt;th&gt;unique-prefix B&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;cached 0&lt;/td&gt;
&lt;td&gt;cached 0&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;cached 0&lt;/td&gt;
&lt;td&gt;cached 0&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Send a unique prefix every call; it must never hit. Both lanes correctly reported &lt;code&gt;cached=0&lt;/code&gt; at full cost for distinct prefixes. A "hit" here would make the cache reporting a false positive you could never trust. The clean negative control is what makes the &lt;em&gt;positive&lt;/em&gt; results in Checks 1–2 meaningful in the first place.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading your scorecard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Healthy result&lt;/th&gt;
&lt;th&gt;Red flag&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. cache engages&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0&lt;/code&gt; cold, &lt;code&gt;&amp;gt;0&lt;/code&gt; warm (after polling)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0&lt;/code&gt; after several warm calls, above the size floor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. cost reflects discount&lt;/td&gt;
&lt;td&gt;warm cost ≪ cold cost&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cached &amp;gt; 0&lt;/code&gt; but costs equal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. token accounting&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cached ≤ prompt_total&lt;/code&gt;, reconciles&lt;/td&gt;
&lt;td&gt;counts don't add up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. streaming metadata&lt;/td&gt;
&lt;td&gt;cache + cost survive the stream&lt;/td&gt;
&lt;td&gt;usage missing on streamed calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5. negative control&lt;/td&gt;
&lt;td&gt;unique prefix always misses&lt;/td&gt;
&lt;td&gt;a distinct prefix "hits"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two that cost money silently are &lt;strong&gt;2&lt;/strong&gt; (full price for a reported hit) and &lt;strong&gt;1&lt;/strong&gt; (no caching behind a clean response). Run both on every model you bill against.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Caching is the highest-leverage cost lever in an LLM app, which is exactly why "the cache is working" deserves a test, not an assumption. Wire Check 1 + Check 2 into CI against each model you bill against, alert if the discount drifts below your expected band, and you'll catch a silent regression the day a gateway or upstream provider changes behavior, instead of at the end of the billing cycle. And whatever your audit does, &lt;strong&gt;poll the warm read&lt;/strong&gt; before you call a cache broken.&lt;/p&gt;

&lt;p&gt;For the mechanics behind these numbers (prefill, KV cache, TTLs) start with &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;How KV Cache &amp;amp; TTL Work&lt;/a&gt;. For working caching patterns per provider, see the &lt;a href="https://dev.to/blog/prompt-caching-tutorial-code-examples/"&gt;tutorial&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;My Check 1 shows &lt;code&gt;0&lt;/code&gt; on the warm call. Is my gateway lying?&lt;/strong&gt;&lt;br&gt;
Check three things first. (1) Does your prompt clear the provider's minimum cacheable size (~1,024 tokens for most; DeepSeek matches at finer 64-token granularity)? (2) Did you &lt;strong&gt;poll&lt;/strong&gt; the warm read a few times? Caches don't always become readable on the very next call. (3) Is the prefix byte-identical between calls, with no timestamps or per-request IDs at the front? Only after all three should you suspect the gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does "cache theater" cost me in practice?&lt;/strong&gt;&lt;br&gt;
You pay the full input rate on every call while believing you pay a fraction. On a high-volume endpoint with a large stable prefix, that's your bill being several times what you modeled. Check 2 is the one to alert on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is DeepSeek's discount lower than Claude's here?&lt;/strong&gt;&lt;br&gt;
Different things are being measured. Claude's ~90% is the &lt;em&gt;read&lt;/em&gt; discount on cached input. DeepSeek's ~72% is the &lt;em&gt;per-call total&lt;/em&gt; reduction, where output and the uncached remainder are billed at full rate and dilute the headline. Compare like with like for your own prompt shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this work for GPT, Gemini, Qwen too?&lt;/strong&gt;&lt;br&gt;
Yes. They're all automatic, so they use the &lt;code&gt;AutoLane&lt;/code&gt; unchanged with a different &lt;code&gt;model&lt;/code&gt;. Only Claude needs the &lt;code&gt;MarkerLane&lt;/code&gt;. Same five checks either way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should this live in CI?&lt;/strong&gt;&lt;br&gt;
Yes. Run Check 1 + Check 2 against every model you bill against, on a schedule, and alert when the observed discount drifts outside your expected band. A standing audit turns a silent regression into a notification.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Claude Opus 4.8 on Synthorai: Caching &amp; TTL vs 4.7/4.6</title>
      <dc:creator>synthorai</dc:creator>
      <pubDate>Fri, 29 May 2026 14:47:14 +0000</pubDate>
      <link>https://dev.to/synthorai/claude-opus-48-on-synthorai-caching-ttl-vs-4746-3mhm</link>
      <guid>https://dev.to/synthorai/claude-opus-48-on-synthorai-caching-ttl-vs-4746-3mhm</guid>
      <description>&lt;p&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt; is now available on the Synthorai gateway. If you already run prompt caching against the Opus line, the headline is reassuring and slightly boring: &lt;strong&gt;nothing about the caching or TTL contract changed from 4.7 or 4.6.&lt;/strong&gt; Same &lt;code&gt;cache_control&lt;/code&gt; markers, same 5-minute and 1-hour TTLs, same read discount, same write premiums. Your caching code is a drop-in carry-over.&lt;/p&gt;

&lt;p&gt;There is exactly one thing that &lt;em&gt;did&lt;/em&gt; change — and it changed back at 4.7, not at 4.8 — that affects your token budget. This post measures it so you don't have to.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All numbers below were measured against &lt;code&gt;https://synthorai.io/&lt;/code&gt; (Anthropic-native &lt;code&gt;/v1/messages&lt;/code&gt;) on 2026-05-29 with a ~8K-character English system prompt, &lt;code&gt;max_tokens&lt;/code&gt; small, single sequential run. Reproduce against your own prompt before quoting them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;anth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SYNTHORAI_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://synthorai.io/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# SDK appends /v1/messages
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# the only line that changes
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# cache_creation_input_tokens, cache_read_input_tokens, cost
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Swap &lt;code&gt;claude-opus-4-7&lt;/code&gt; → &lt;code&gt;claude-opus-4-8&lt;/code&gt; and nothing else in your caching path needs to move. The mechanics behind &lt;code&gt;cache_control&lt;/code&gt; are covered in &lt;a href="https://dev.to/blog/prompt-caching-tutorial-code-examples/"&gt;the caching tutorial&lt;/a&gt;; the architecture of &lt;em&gt;why&lt;/em&gt; the cache exists is in &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;Part 1 of the series&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Caching behavior: unchanged from 4.7/4.6
&lt;/h2&gt;

&lt;p&gt;We ran the same cache write / cache read / no-cache sequence across the recent Opus line. The discount structure is identical end to end.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;No-cache cost&lt;/th&gt;
&lt;th&gt;5m cache write&lt;/th&gt;
&lt;th&gt;Cache read&lt;/th&gt;
&lt;th&gt;Read discount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.0364&lt;/td&gt;
&lt;td&gt;$0.0452&lt;/td&gt;
&lt;td&gt;$0.0041&lt;/td&gt;
&lt;td&gt;88.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.0364&lt;/td&gt;
&lt;td&gt;$0.0452&lt;/td&gt;
&lt;td&gt;$0.0041&lt;/td&gt;
&lt;td&gt;88.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.0522&lt;/td&gt;
&lt;td&gt;$0.0654&lt;/td&gt;
&lt;td&gt;$0.0059&lt;/td&gt;
&lt;td&gt;88.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.0520&lt;/td&gt;
&lt;td&gt;$0.0654&lt;/td&gt;
&lt;td&gt;$0.0059&lt;/td&gt;
&lt;td&gt;88.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two invariants hold across all four versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read discount ≈ 89%.&lt;/strong&gt; A warm cache read costs ~11% of the no-cache input price. This is Anthropic's documented 10% cached-read rate, unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write premium ≈ 25%.&lt;/strong&gt; The first (cold) call costs ~1.25× the no-cache price to populate the cache. Break-even is one hit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The absolute dollar figures for 4.7 and 4.8 are higher than 4.5/4.6, but as we'll see in a moment that's a token-count story, not a cache-economics story — the &lt;em&gt;percentages&lt;/em&gt; are flat.&lt;/p&gt;




&lt;h2&gt;
  
  
  TTL behavior: unchanged from 4.7/4.6
&lt;/h2&gt;

&lt;p&gt;Opus 4.8 honors the same two TTLs as the rest of the line: a 5-minute sliding default and an opt-in 1-hour window. We isolated the TTL path with a unique prefix per call (so no stale cache entry could contaminate the result) and measured the write premium for each TTL:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;TTL&lt;/th&gt;
&lt;th&gt;Cache write&lt;/th&gt;
&lt;th&gt;Write premium vs no-cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5m&lt;/td&gt;
&lt;td&gt;$0.0650&lt;/td&gt;
&lt;td&gt;~1.25×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1h&lt;/td&gt;
&lt;td&gt;$0.1036&lt;/td&gt;
&lt;td&gt;~2×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5m&lt;/td&gt;
&lt;td&gt;$0.0650&lt;/td&gt;
&lt;td&gt;~1.25×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1h&lt;/td&gt;
&lt;td&gt;$0.1036&lt;/td&gt;
&lt;td&gt;~2×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1-hour TTL — same marker syntax on 4.8 as on 4.7/4.6
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The usage object reports the TTL bucket exactly as before — &lt;code&gt;cache_creation.ephemeral_5m_input_tokens&lt;/code&gt; or &lt;code&gt;ephemeral_1h_input_tokens&lt;/code&gt;. The 1-hour write costs ~2× no-cache (vs ~1.25× for the 5-minute write), and reads stay at ~11% regardless of TTL. Identical to 4.7. If you picked &lt;code&gt;5m&lt;/code&gt; for live chat and &lt;code&gt;1h&lt;/code&gt; for agents with human-in-the-loop pauses on 4.7, keep those choices on 4.8.&lt;/p&gt;




&lt;h2&gt;
  
  
  Time-to-first-token: flat across the line
&lt;/h2&gt;

&lt;p&gt;We measured warm-read TTFT with a streaming call (5 samples per model after a gateway warm-up, median reported). On this ~8–11K-token prompt, TTFT sits in a ~2.2–2.8 s band with no material per-version trend — the sample ranges overlap, so the differences are jitter, not a version effect.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Warm-read TTFT (median)&lt;/th&gt;
&lt;th&gt;Range (n=5)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.72 s&lt;/td&gt;
&lt;td&gt;2.58 – 2.78 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.76 s&lt;/td&gt;
&lt;td&gt;2.65 – 3.01 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.21 s&lt;/td&gt;
&lt;td&gt;1.98 – 2.97 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.47 s&lt;/td&gt;
&lt;td&gt;2.23 – 4.38 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two caveats worth stating plainly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't read a ranking into this.&lt;/strong&gt; The ranges overlap heavily (4.8's high sample was an outlier at 4.38 s); on this prompt size TTFT is dominated by network and queueing jitter, not the model version. Treat ~2.2–2.8 s as the warm band for all four.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cache TTFT win scales with prompt length.&lt;/strong&gt; At ~8–11K tokens the prefill saved by a cache hit is small, so cold and warm TTFT are close (both ~2–3 s on a warmed gateway). The gap widens substantially at 100K+ tokens, where prefill dominates — that's where a warm cache turns a multi-second wait into a fast first token. The mechanics are in &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;Part 1: How KV Cache &amp;amp; TTL Work&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The one real change: tokenization (since 4.7)
&lt;/h2&gt;

&lt;p&gt;Here is the thing to re-check before you migrate. The &lt;strong&gt;same system text reports ~43% more input tokens on 4.7/4.8 than on 4.5/4.6.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input tokens (identical text)&lt;/th&gt;
&lt;th&gt;No-cache cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~7,976&lt;/td&gt;
&lt;td&gt;$0.0364&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~7,977&lt;/td&gt;
&lt;td&gt;$0.0364&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~11,393&lt;/td&gt;
&lt;td&gt;$0.0522&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;claude-opus-4-8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~11,394&lt;/td&gt;
&lt;td&gt;$0.0520&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The token count jumps at the 4.7 generation and carries into 4.8. The cost tracks the token count almost exactly: the cost ratio (4.8 / 4.5) is 1.43, and the token ratio is 1.429. In other words, &lt;strong&gt;the per-token price is the same across the whole line&lt;/strong&gt; — the higher bill on 4.7/4.8 comes entirely from the same text counting as more tokens.&lt;/p&gt;

&lt;p&gt;Two practical consequences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Re-budget on absolute cost, not on discount.&lt;/strong&gt; Your cache &lt;em&gt;discount&lt;/em&gt; is unchanged (~89% read), but the same English prompt is ~43% more expensive in absolute terms on 4.7/4.8 than it was on 4.6. If you sized a per-call budget against 4.6 token counts, it will be off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-check the 1,024-token cache-eligibility floor.&lt;/strong&gt; Anthropic only caches prefixes at or above a minimum size. A prompt that sat just under the floor on 4.6 may clear it on 4.7/4.8 (more tokens), and a prompt sized in tokens for the old tokenizer needs re-measuring. Always read &lt;code&gt;cache_creation_input_tokens&lt;/code&gt; / &lt;code&gt;cache_read_input_tokens&lt;/code&gt; from the live response rather than estimating from a local tokenizer that may not match.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;We're describing a measured observation — identical text, ~43% more reported input tokens on 4.7/4.8 — most consistent with a tokenizer/vocabulary update at the 4.7 generation. The takeaway doesn't depend on the root cause: re-measure token counts when you migrate, because the cache math is token-based.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Migration checklist (4.6/4.7 → 4.8)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Caching code carries over verbatim.&lt;/strong&gt; &lt;code&gt;cache_control&lt;/code&gt; markers, breakpoint count (up to 4), &lt;code&gt;ttl: "1h"&lt;/code&gt;, usage-field names — all identical.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;TTL choices carry over.&lt;/strong&gt; 5m for live/session workloads, 1h for bursty/agent-with-pauses.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Discount economics carry over.&lt;/strong&gt; ~89% read, ~1.25× write (5m), ~2× write (1h).&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Re-measure token counts.&lt;/strong&gt; If you're coming from 4.5/4.6, expect ~40%+ more input tokens for the same text (this happened at 4.7). Coming from 4.7, expect parity.&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Re-validate cost dashboards.&lt;/strong&gt; Trust &lt;code&gt;usage.cost&lt;/code&gt; and the &lt;code&gt;*_input_tokens&lt;/code&gt; fields from the live response, not a cached estimate from the old generation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;For an engineering team already caching against Opus, &lt;code&gt;claude-opus-4-8&lt;/code&gt; is the easy kind of upgrade: the entire caching and TTL surface is stable, so there's nothing to relearn and no code to rewrite. Budget for the tokenizer shift if you're jumping from 4.6 or earlier, confirm your numbers against the live &lt;code&gt;usage&lt;/code&gt; object, and ship.&lt;/p&gt;

&lt;p&gt;For the full caching playbook — prompt structure, hit-rate debugging, TTL-aware patterns — see the four-part series starting with &lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;How KV Cache &amp;amp; TTL Work&lt;/a&gt; and the &lt;a href="https://dev.to/blog/prompt-caching-tutorial-code-examples/"&gt;working Python tutorial&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need to change my &lt;code&gt;cache_control&lt;/code&gt; code to use Opus 4.8?&lt;/strong&gt;&lt;br&gt;
No. The marker syntax, breakpoint limit, and TTL options are identical to 4.7/4.6. Change the &lt;code&gt;model&lt;/code&gt; field and nothing else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the cache read discount change on 4.8?&lt;/strong&gt;&lt;br&gt;
No. A warm read is ~11% of the no-cache input price (~89% off) on 4.5 through 4.8, matching Anthropic's documented rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the 1-hour TTL premium change?&lt;/strong&gt;&lt;br&gt;
No. The 1-hour write costs ~2× the no-cache input price; the 5-minute write costs ~1.25×. Reads are ~11% regardless of TTL. Same as 4.7.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is the same prompt more expensive on 4.8 than on 4.6?&lt;/strong&gt;&lt;br&gt;
The per-token price is the same — the prompt simply counts as more tokens. Identical text reported ~8.0K tokens on 4.5/4.6 and ~11.4K on 4.7/4.8 in our measurements (a ~43% increase), most consistent with a tokenizer change at the 4.7 generation. The cache &lt;em&gt;discount&lt;/em&gt; is unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is 4.8 a drop-in replacement for 4.7?&lt;/strong&gt;&lt;br&gt;
On the caching/TTL surface, yes — token counts and economics were already at the 4.7 level, so migration from 4.7 is parity. We don't publish capability benchmarks we haven't run; for quality and reasoning claims, see Anthropic's model card.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verification: all caching, TTL, token-count, cost, and TTFT figures measured against &lt;code&gt;https://synthorai.io/&lt;/code&gt; on 2026-05-29 using the official &lt;code&gt;anthropic&lt;/code&gt; SDK, single tenant. Cost/token figures are a single sequential run; TTFT is a 5-sample median per model after gateway warm-up. Discount/premium ratios cross-checked against &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic Prompt Caching docs&lt;/a&gt;. Your numbers will vary with prompt, region, and load.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>LLM Prompt Caching: The Complete 2026 Guide</title>
      <dc:creator>synthorai</dc:creator>
      <pubDate>Wed, 27 May 2026 15:30:00 +0000</pubDate>
      <link>https://dev.to/synthorai/llm-prompt-caching-the-complete-2026-guide-3mmb</link>
      <guid>https://dev.to/synthorai/llm-prompt-caching-the-complete-2026-guide-3mmb</guid>
      <description>&lt;p&gt;If you ship a chatbot, a RAG app, or an AI agent against a large language model, prompt caching is the single optimization that gives you back &lt;strong&gt;50–90% of input cost and 3–10× of time-to-first-token&lt;/strong&gt; at no quality cost. It isn't a bolt-on trick — it falls directly out of how Transformer attention is defined. Once you understand that, the rest of the stack (TTLs, provider differences, prompt structure) lines up cleanly.&lt;/p&gt;

&lt;p&gt;This page is the index to a four-part series that takes you from the theory to a production decision matrix. Pick where to enter based on what you already know.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to enter
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you want to...&lt;/th&gt;
&lt;th&gt;Start at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Understand &lt;em&gt;why&lt;/em&gt; caching exists and what KV cache actually is&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;Part 1 — How KV Cache &amp;amp; TTL Work&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pick a provider and know what's different about each&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/blog/provider-caching-comparison/"&gt;Part 2 — Compare Claude, GPT, Gemini, DeepSeek&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copy-paste working Python and measure your own numbers&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/blog/prompt-caching-tutorial-code-examples/"&gt;Part 3 — Working Python Tutorial&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Match a chatbot / RAG / agent workload to the right model&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/blog/best-llm-by-use-case-chat-api-agent/"&gt;Part 4 — Best Model for Chat, RAG &amp;amp; Agents&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each part stands alone but they're written so reading them in order builds the picture without redundancy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1 — How LLM Prompt Caching Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/llm-prompt-caching-explained/"&gt;&lt;strong&gt;LLM Prompt Caching #1: How KV Cache &amp;amp; TTL Work →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architectural article. Walks through self-attention as a single equation, explains &lt;em&gt;why&lt;/em&gt; the K and V vectors of a stable prefix are mathematically reusable, and shows how the memory-vs-compute tradeoff produces the TTL behavior every developer has to design around.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt caching isn't an optimization layered on top — it's a direct consequence of causal-masked attention. K/V at position &lt;code&gt;i&lt;/code&gt; is a deterministic function of tokens &lt;code&gt;1…i&lt;/code&gt;, so identical prefixes give bit-identical K/V.&lt;/li&gt;
&lt;li&gt;Prefill (compute-bound, O(N²)) is what caching saves; decode (memory-bandwidth-bound, O(N) per token) is what every inference engine already optimizes.&lt;/li&gt;
&lt;li&gt;TTLs exist because KV cache is enormous (~10 GB for a 32K context on a 70B model). 5 minutes is the GPU memory-pressure horizon; hours-to-days are only possible with disk-backed caches (DeepSeek's MLA architecture).&lt;/li&gt;
&lt;li&gt;Caching wins both &lt;strong&gt;cost&lt;/strong&gt; (50–90% off input on cache hits) and &lt;strong&gt;latency&lt;/strong&gt; (TTFT drops 3–10× for prompts in the 5–10K-token range and much more for 100K+).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 2 — Compare LLM Prompt Caching Across Providers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/provider-caching-comparison/"&gt;&lt;strong&gt;LLM Prompt Caching #2: Compare Claude, GPT, Gemini, DeepSeek →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The buyer's guide. Five providers expose prompt caching in five very different shapes — explicit markers (Claude), fully automatic (GPT-5, DeepSeek-v4), hybrid implicit+explicit (Gemini, Qwen), or architectural disk-backing (DeepSeek's MLA). The article gives a feature-by-feature comparison plus a &lt;strong&gt;5-dimension evaluation framework&lt;/strong&gt; to score them for your specific workload.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't compare base prices — compare effective cost weighted by your hit rate (formula in §4.1).&lt;/li&gt;
&lt;li&gt;Claude has the deepest single-call discount (~90%) but requires explicit &lt;code&gt;cache_control&lt;/code&gt; markers.&lt;/li&gt;
&lt;li&gt;DeepSeek-v4 is the only provider with disk-backed caches at scale; partial-prefix matches earn discounts because the granularity is 64 tokens instead of 1,024.&lt;/li&gt;
&lt;li&gt;Gemini's explicit cache costs hourly storage fees — break-even depends on call frequency.&lt;/li&gt;
&lt;li&gt;API ergonomics, hit-rate predictability, TTL fit, latency under miss, and migration cost are the five dimensions that actually distinguish providers once you control for hit rate.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 3 — Working Python Tutorial
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/prompt-caching-tutorial-code-examples/"&gt;&lt;strong&gt;LLM Prompt Caching #3: Working Python Tutorial →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hands-on article. One OpenAI SDK + one Anthropic SDK against a single gateway, with measured numbers from 2026-05-25 across the full Claude family (haiku-4-5 through opus-4-7), GPT-5.x, Gemini 2.5, DeepSeek-v4, and Qwen3.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude with &lt;code&gt;cache_control&lt;/code&gt; markers&lt;/strong&gt;: measured &lt;strong&gt;88–89% cost reduction&lt;/strong&gt; uniformly across haiku/sonnet/opus 4-x. Use the Anthropic SDK with &lt;code&gt;base_url="https://synthorai.io/"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.4-mini auto-cache&lt;/strong&gt;: 5× TTFT improvement (3.6 s → 0.73 s on a 7K-token prompt), 93% cache hit rate on the system tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 2.5-flash implicit&lt;/strong&gt;: 88% cost reduction on cache hits when streaming usage is captured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-v4-flash&lt;/strong&gt;: 74% off, disk-backed (cache survives hour-scale idle).&lt;/li&gt;
&lt;li&gt;TTL-aware patterns: keep-alive heartbeat for cron, prefix stability rules, what to log per call.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 4 — Best Model by Use Case
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/best-llm-by-use-case-chat-api-agent/"&gt;&lt;strong&gt;LLM Prompt Caching #4: Best Model for Chat, RAG &amp;amp; Agents →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The decision article. Different workloads pull the cost/latency levers differently — chat is naturally cache-friendly, RAG fights the prefix-stability problem, agents depend on cumulative prefix discipline. The article gives a model recommendation by workload shape with cost estimates.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chatbots&lt;/strong&gt;: any model with auto-cache works; sessions hit naturally. Pick on cost/quality. &lt;code&gt;gpt-5.4-nano&lt;/code&gt; cheapest, &lt;code&gt;gpt-5.4-mini&lt;/code&gt; fastest cached TTFT, &lt;code&gt;claude-haiku-4-5&lt;/code&gt; best instruction-following at modest premium.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt;: retrieved-doc reordering kills mid-prompt cache hits. Three fixes — push references to the end, deterministic chunk ordering, or Claude's multi-&lt;code&gt;cache_control&lt;/code&gt; breakpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt;: tool calls and results must be append-only and byte-identical step-to-step. &lt;code&gt;claude-sonnet-4-5&lt;/code&gt; with 4 &lt;code&gt;cache_control&lt;/code&gt; markers gives the strongest cumulative-prefix discount; &lt;code&gt;gpt-5.4-mini&lt;/code&gt; works without code changes at 50% savings.&lt;/li&gt;
&lt;li&gt;TTL match: 5 min for chat, 1 hour for agents with human-in-the-loop steps, disk-backed for sporadic batch.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to read this
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engineer new to the topic&lt;/strong&gt;: read in order. The architecture in Part 1 makes Parts 2–4 click instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PM or architect doing vendor selection&lt;/strong&gt;: jump to Part 2 + Part 4. Reference Part 1 if a teammate asks "but why TTL exists".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineer with a specific workload to ship today&lt;/strong&gt;: Part 4 first (find your row in the matrix), then Part 3 for the exact code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone optimizing an existing app&lt;/strong&gt;: Part 3 §6 cross-provider benchmark — reproduce it against your own prompt; that's a one-day exercise, not a multi-week migration.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Numbers in this series
&lt;/h2&gt;

&lt;p&gt;All measured numbers were captured on &lt;strong&gt;2026-05-25&lt;/strong&gt; against the Synthorai gateway (&lt;code&gt;https://synthorai.io/v1&lt;/code&gt; for OpenAI-compat, &lt;code&gt;https://synthorai.io/&lt;/code&gt; for Anthropic-native), single-tenant, single sequential run, no concurrent load. Your numbers will move with region, time-of-day, and competing tenant load — treat them as a starting point and reproduce against your own traffic before quoting them.&lt;/p&gt;

&lt;p&gt;Pricing tables and TTL behavior reflect vendor public documentation as of 2026-05. Providers update these every few months; the architectural reasoning (Part 1) is stable, the comparative numbers (Part 2 &amp;amp; 3) drift.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
