John Medina

Posted on Jun 1

Cache-hit dispersion is the 7th vendor-risk axis — and the one your invoice can't see

#llm #costs #saas #monitoring

stavros dropped a comment on hn yesterday that should have ended the per-token billing conversation for anyone running a multi-tenant llm product, but it didn't, because the implication is too inconvenient to take seriously yet (thread, 581 pts / 243 c on the deepseek reasonix front page).

his numbers:

"the prices are what equivalent Sonnet usage would have cost, the actual amount I paid was $10. On performance, DeepSeek V4 Pro is comparable to Sonnet for me. 97.27% cache hit rate."

ten dollars actual. two hundred forty one dollars sonnet-equivalent. same task, same token volume, priced two ways. the variable this post cares about: how many of his calls landed on warm cache.

twenty four x.

and three more accounts in the same sub-thread confirmed similar dispersion — embedding-shape at 96.4% through a codex bridge, estebarb at 98.6% on opencode, metalspot saying his own steady-state agent loop sits "consistently above 95 once the context is primed." different stacks, different bridges, all converging on the same shape: once you're cache-warm, the per-token sticker price stops describing what you pay.

if you run a saas where customers consume llm tokens — chat, agent, copilot, anything — that 24× spread is between your tenants on the same model. and your vendor dashboard is reporting the aggregate. you have no idea which customers actually cost you money.

the seven axes, written down in one place

we've been mapping a vendor-risk taxonomy on this blog for about six weeks, one axis per hn front-page incident. people keep asking for the consolidated version, so here it is, with the originating thread for each:

| # | axis | originating signal | what it costs you |
|---|------|--------------------|-------------------|
| 1 | acquihire-eol | helicone → mintlify (2025-08); stainless → anthropic (2026-05-19, HN 48182281) | observability vendor goes dark, you migrate under duress |
| 2 | multiplier-creep | gemini 3.5 flash repriced ~14× from flash 3 at launch (2026-05-20); copilot 27× credit multiplier; cursor team plan 5× | unit cost moves under a stable model name |
| 3 | suspension-without-recourse | railway / gcp account terminations posted as ask hn (2026-05-21) | provider kills your service with no human escalation |
| 4 | tco opacity | "was my $48k gpu server worth it" devto 2026-05-22 | rent-vs-own per-experiment cost unknowable until you've already chosen |
| 5 | budget-blowout-at-scale | microsoft killing claude code internally because a december pilot ate 2026's ai budget (HN 48238979, 285 pts) | pilot becomes annual line item before kill-switch fires |
| 6 | support-function-attrition | aws "four years and out" (HN 48254475, 219 pts): ex-ossm liaison leaves because human-in-the-loop roles are getting llm-restructured | the non-fungible human who reverses a wrongful suspension isn't there next quarter |
| 7 | cache-hit-rate dispersion | deepseek reasonix (HN 48261733, 581 pts, 2026-05-24): $10 vs $241 / 24× at 97% cache | unit cost spread of 10-25× between tenants on the same model is invisible until you measure it harness-side |

axes 1–6 are detectable from billing data. you can look at your invoice and notice the change. they're slow, but they're legible.

axis 7 is structurally different. the vendor invoice is already weighted by the actual cache hit ratio you got. it doesn't tell you that customer A is costing you $0.04/request at 98% cache while customer B is costing you $0.95/request at 61% cache on the same prompts. you see one line: "this month's deepseek bill: $3,500." you don't see that 3 of your 200 tenants generated 70% of the marginal cost.

why this kept being invisible

three reasons.

first, the vendor pricing card lists per-token rates with a "cached input" line at 10-25% of the regular input rate. it implies cache is a 4-10× discount applied to your invoice in aggregate. it isn't. cache hit rate is a property of the workload, not the model — and workloads vary across tenants by an order of magnitude on the same product.

a tenant whose conversation loop reuses 30k tokens of system prompt + scratchpad + tool definitions on every turn lives at 95-98%. a tenant whose product spawns one-shot calls with fresh context per request lives at 0-15%. the first tenant collects nearly all of that 4-10× discount on input tokens; the second collects almost none of it.
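a rough way to see why loops sit so high: the sketch below models an idealized conversation where every token sent on a previous turn is a cache hit on the next one. the function name and the per-turn token counts are assumptions for illustration; real caches add ttl expiry, minimum cacheable lengths, and prefix-match rules that this ignores.

```javascript
// Idealized model: every token sent on a previous turn is a cache hit on the next one.
// Real caches have TTLs, minimum cacheable lengths, and prefix-match rules this ignores.
function loopCacheHitRate({ prefixTokens, newTokensPerTurn, turns }) {
  let cached = 0
  let total = 0
  let context = prefixTokens // shared system prompt + tools + scratchpad
  for (let turn = 1; turn <= turns; turn++) {
    const sent = context + newTokensPerTurn
    // turn 1 is a cold start; later turns re-read everything already sent
    if (turn > 1) cached += context
    total += sent
    context = sent // the conversation grows by the new tokens each turn
  }
  return total === 0 ? 0 : cached / total
}

// hypothetical shapes: a 20-turn loop over a 30k prefix vs. a one-shot call
const looped = loopCacheHitRate({ prefixTokens: 30_000, newTokensPerTurn: 1_000, turns: 20 })
const oneShot = loopCacheHitRate({ prefixTokens: 30_000, newTokensPerTurn: 1_000, turns: 1 })
console.log(looped.toFixed(3), oneShot.toFixed(3)) // → 0.938 0.000
```

longer loops push the ratio higher, which is consistent with the 95%+ figures people reported once the context is primed.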

second, vendor dashboards aggregate. anthropic's usage view, openai's billing dashboard, deepseek's portal — all report cache hit across your entire account. for a single-tenant product that's fine. for a saas with 200 customers, you're looking at the average of two populations: the warm-cache power users and the cold-cache thrash users. the average tells you neither.

third, "cost equivalence" reporting masks it further. when deepseek tells stavros his calls would have cost $241 on sonnet, that's a sonnet-equivalent price calculated on token volume. it doesn't subtract anthropic's prompt caching, which also offers 90% discounts on cache hits. the apples-to-apples number on sonnet would be lower than $241 — but stavros wouldn't know that without re-running on sonnet and measuring his own cache rate there too. the sticker comparison is doing what stickers do: hiding the variance.

what cache-hit dispersion does to your p&l

let me run the math on a synthetic but realistic shape.

assume:

100 tenants on your product
identical pricing: $50/mo per tenant
identical surface workload: ~2M input tokens per tenant per month
cache hit rates distributed: 40% of tenants at 90-98%, 40% at 50-70%, 20% at 10-30%
deepseek v4 pro pricing: $0.27/M input (uncached), $0.07/M cache hit

the warm tenants cost you roughly $0.15-$0.18/month on input tokens.
the mid tenants cost you $0.26-$0.34/month.
the cold tenants cost you $0.42-$0.50/month.

on a $50 sticker price these all look profitable, and at this volume they are. but the shape is already there: per tenant, the cold cohort costs about 3× what the warm cohort does, so your 20 cold-cache tenants eat ~$9/month combined while your 40 warm-cache tenants eat ~$7. one cohort is subsidizing the other and you can't see it because you priced on average tokens.
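if you want to rerun this with your own numbers, here is a minimal sketch that recomputes per-tenant and per-cohort input cost from the assumptions listed above. it counts input tokens only; the cohort midpoints (94%, 60%, 20%) and the function name are assumptions, not measurements.

```javascript
// rates from the assumptions above, in USD per million input tokens
const RATES = { uncachedPerM: 0.27, cachedPerM: 0.07 }

// monthly input cost for one tenant; output tokens and cache-write surcharges are ignored
function monthlyInputCost(inputTokens, hitRate, rates = RATES) {
  const cachedM = (inputTokens * hitRate) / 1e6
  const uncachedM = (inputTokens * (1 - hitRate)) / 1e6
  return cachedM * rates.cachedPerM + uncachedM * rates.uncachedPerM
}

// cohort midpoints are an assumption; swap in your measured distribution
const cohorts = [
  { name: "warm", tenants: 40, hitRate: 0.94 },
  { name: "mid", tenants: 40, hitRate: 0.60 },
  { name: "cold", tenants: 20, hitRate: 0.20 },
]

for (const c of cohorts) {
  const perTenant = monthlyInputCost(2_000_000, c.hitRate)
  // prints: cohort name, cost per tenant, cost for the whole cohort
  console.log(c.name, perTenant.toFixed(2), (perTenant * c.tenants).toFixed(2))
}
```

swap `RATES` and the token volume for your own price card and traffic; the cohort totals are the part your vendor dashboard never shows.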

now scale that to coding agents, where prompt sizes are 50k–200k and cache hit rate dispersion is even wider, and the math gets worse. an agent loop on a 200k context can cost you $3-$8 per task at low cache or $0.10-$0.20 at high cache. that's well over an order of magnitude for the same kind of task on the same model.

this is the structural reason "what's our cogs per customer" stops being answerable from vendor dashboards in 2026. the question isn't "how much did i spend on llms" anymore. it's "how is that spend distributed across the customers who generated it" — and the answer lives in your runtime, not in the bill.

instrumenting axis 7 in your stack today

three changes, low effort, you can ship this week:

1. capture `cache_read_input_tokens` separately on every call

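most providers return an equivalent count. anthropic names it cache_read_input_tokens; openai and deepseek expose the same number under different field names. log it. don't roll cached and uncached input together: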

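// provider, ledger, and priceWithCache are assumed to exist elsewhere in your stack
async function attributedCall(req, ctx) {
  const res = await provider.messages.create(req)
  const usage = res.usage
  const cached = usage.cache_read_input_tokens ?? 0
  const cacheWrites = usage.cache_creation_input_tokens ?? 0
  // anthropic-style usage: input_tokens excludes cached tokens,
  // so the full prompt size is the sum of all three counters
  const totalInput = usage.input_tokens + cached + cacheWrites
  await ledger.insert({
    ts: new Date(),
    tenant_id: ctx.tenant,
    feature_id: ctx.feature,
    model: req.model,
    input_tokens: usage.input_tokens,
    cached_input_tokens: cached,
    cache_write_tokens: cacheWrites,
    output_tokens: usage.output_tokens,
    // dividing by input_tokens alone would push the ratio above 1 on warm calls
    cache_hit_ratio: cached / Math.max(totalInput, 1),
    cost_usd: priceWithCache(req.model, usage),
  })
  return res
}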

5-10 lines. it gives you the only field that actually predicts your invoice variance per tenant.
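the snippet calls priceWithCache without defining it. one possible shape, as a sketch: the model key is a hypothetical id, the input and cache-read rates are the deepseek figures quoted earlier in this post, and the output and cache-write rates are placeholders you'd replace from your provider's price card. it assumes anthropic-style usage fields, where input_tokens excludes cached tokens.

```javascript
// USD per million tokens. input and cache-read rates are the deepseek figures quoted
// in this post; the output and cache-write rates are placeholders, not real prices.
const PRICES = {
  "deepseek-v4-pro": { input: 0.27, cacheRead: 0.07, cacheWrite: 0.27, output: 1.0 },
}

// assumes anthropic-style usage, where input_tokens excludes cached tokens
function priceWithCache(model, usage, prices = PRICES) {
  const p = prices[model]
  if (!p) throw new Error(`no price card for model: ${model}`)
  // missing counters are treated as zero tokens
  const perM = (tokens, rate) => ((tokens ?? 0) / 1e6) * rate
  return (
    perM(usage.input_tokens, p.input) +
    perM(usage.cache_read_input_tokens, p.cacheRead) +
    perM(usage.cache_creation_input_tokens, p.cacheWrite) +
    perM(usage.output_tokens, p.output)
  )
}
```

throwing on an unknown model is deliberate: a silent zero in cost_usd would corrupt every rollup built on the ledger.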

2. roll up cache hit rate per tenant, not per account

a daily job that computes tenant_id → median cache_hit_ratio, p10 cache_hit_ratio, n_requests. that's the table that tells you which customers are on the wrong end of axis 7. if the gap between p10 and median is wider than 30 percentage points inside a single tenant, that tenant has internal workload variance worth understanding — usually a bursty integration or a feature flagged on for them only.
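a minimal sketch of that rollup, assuming ledger rows shaped like the ones logged in step 1. the function names and the wide_internal_spread flag are made up for illustration, and the percentile uses the nearest-rank method, so on an even-sized sample the median is the lower of the two middle values.

```javascript
// nearest-rank percentile on an ascending-sorted array, q in [0, 1]
function percentile(sorted, q) {
  if (sorted.length === 0) return null
  const rank = Math.max(1, Math.ceil(q * sorted.length))
  return sorted[rank - 1]
}

// rows: ledger records with at least { tenant_id, cache_hit_ratio }
function rollupByTenant(rows) {
  const byTenant = new Map()
  for (const row of rows) {
    if (!byTenant.has(row.tenant_id)) byTenant.set(row.tenant_id, [])
    byTenant.get(row.tenant_id).push(row.cache_hit_ratio)
  }
  const out = []
  for (const [tenant_id, ratios] of byTenant) {
    ratios.sort((a, b) => a - b)
    const median = percentile(ratios, 0.5)
    const p10 = percentile(ratios, 0.1)
    out.push({
      tenant_id,
      n_requests: ratios.length,
      median_cache_hit_ratio: median,
      p10_cache_hit_ratio: p10,
      // flag tenants whose p10 sits more than 30 points below their median
      wide_internal_spread: median - p10 > 0.3,
    })
  }
  return out
}
```

in production you'd more likely run this as a sql aggregate over the ledger table; the in-memory version just makes the logic explicit.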

3. set a per-tenant marginal cogs threshold, not a total spend alert

alert on "tenant T's marginal cost-per-request crossed $0.50 for the rolling 7-day window," not "this month's bill is up 20%." by the time the second alert fires, the bill is already up 20%. the first one fires while the workload is still in progress and you can intervene — change the model, throttle, route, talk to the customer about what changed.
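the first alert can be sketched as a pass over the same ledger rows. this assumes each row carries ts, tenant_id, and cost_usd, and it reads "marginal cost-per-request" as the tenant's average cost per request over the window; the function name and defaults are illustrative.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000

// rows: ledger records with { ts: Date, tenant_id, cost_usd }
function tenantsOverThreshold(rows, now = new Date(), thresholdUsd = 0.5, windowDays = 7) {
  const cutoff = now.getTime() - windowDays * DAY_MS
  const totals = new Map() // tenant_id -> { cost, requests }
  for (const row of rows) {
    const t = row.ts.getTime()
    // keep only rows inside the rolling window
    if (t < cutoff || t > now.getTime()) continue
    const agg = totals.get(row.tenant_id) ?? { cost: 0, requests: 0 }
    agg.cost += row.cost_usd
    agg.requests += 1
    totals.set(row.tenant_id, agg)
  }
  const alerts = []
  for (const [tenant_id, { cost, requests }] of totals) {
    const costPerRequest = cost / requests
    if (costPerRequest > thresholdUsd) alerts.push({ tenant_id, costPerRequest, requests })
  }
  return alerts
}
```

run it on a schedule shorter than the window (hourly, say) so a tenant that changes workload shape shows up within the day.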

why dashboards from observability vendors will continue to miss this

datadog, new relic, sentry, helicone, langfuse, portkey, langsmith — almost all of them sit at the model layer or the gateway layer. they see calls. they tag calls. they aggregate calls. what they don't do is own the harness-side attribution: which session, which feature, which tenant, which agent loop iteration, which retry — the keys that let you join cache_hit_ratio to your business object.

the vendors that ship at the gateway layer have a structural conflict, too: most of them are owned by, acquired by, or routing through the same providers whose pricing card they'd need to interrogate. helicone is mintlify property since 2025-08. langfuse is clickhouse property since 2026-01. stainless is part of anthropic as of 2026-05-19. portkey is mid-acquisition by palo alto networks per their 2026-04-30 release. axis 1 (acquihire-eol) and axis 7 (cache dispersion) collide here: the layer that should measure dispersion is the layer that's getting acquired by the entity whose dispersion you're trying to measure.

self-hosting the attribution layer — agpl, in your stack, owned by you — is the only configuration where the answer to "which tenant cost me what" stays answerable across provider acquisitions, pricing changes, and dashboard re-skins.

what to do this week, no tool required

pull last 30 days of llm calls from your logs. group by tenant. compute median cache_hit_ratio per tenant. if you don't have cache_read_input_tokens logged, add it today — 5 lines per call site.
find the 5 tenants with the worst cache hit rate. what's their workload shape? thrashing context? cold-start agents? you probably have a product problem, not just a cost problem.
find the 5 tenants with the best cache hit rate. how are they using your product? this is your retention shape. they're the ones priced correctly under your current sticker.
compute marginal cost per active tenant per day, sum it to a monthly figure, and divide by your monthly sticker price. anything above 25% is a margin red flag. anything above 100% is a customer you're paying to keep.
write down the calendar. the june 15 anthropic agent sdk credit-pool split changes cache accounting semantics for everyone on claude. the deepseek v4-pro 75% promo expires 2026-05-31 15:59 UTC and the post-promo per-token rate quadruples. if you don't already model what your cache-hit distribution does at the new prices, you'll find out on the july invoice.
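the margin check from the list above fits in a few lines. this is a sketch that assumes cost and sticker price are put on the same monthly basis before dividing; the function names and flag labels are made up for illustration.

```javascript
// dailyCosts: one tenant's per-day marginal cost in USD over the billing month
function monthlyCost(dailyCosts) {
  return dailyCosts.reduce((sum, c) => sum + c, 0)
}

// thresholds from the checklist: above 25% of sticker is a red flag, above 100% is a loss
function marginFlag(monthlyCostUsd, monthlyStickerUsd) {
  const share = monthlyCostUsd / monthlyStickerUsd
  if (share > 1) return { share, flag: "paying-to-keep" }
  if (share > 0.25) return { share, flag: "margin-red-flag" }
  return { share, flag: "ok" }
}

// hypothetical tenant on a $50/mo plan with three days of recorded cost
console.log(marginFlag(monthlyCost([4, 6, 5]), 50))
```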

the point

per-token billing was a fine abstraction in 2023 when context windows were 4-8k, cache wasn't a line item, and most products had one workload shape. it stopped describing reality somewhere around the time agent loops normalized 200k contexts and providers shipped 90% prompt-cache discounts. the unit cost spread between cache-warm and cache-cold tenants on the same model is now larger than the spread between different models, and nothing in the vendor's billing surface tells you which side any given tenant is on.

axes 1-6 of the taxonomy say "your vendor will surprise you on the bill." axis 7 says "your vendor will surprise you on which customers generated the bill" — and that one is worse, because customer p&l drives the product decisions you make from here.

if your cogs report rolls cache and non-cache together, you don't have an attribution model. you have an average that lies about your distribution.

i build llmeter — open-source (agpl-3.0) attribution at the harness layer. per-tenant rollups, per-feature cogs, cache-hit ratio surfaced as a first-class metric. it's not a proxy and doesn't sit in your request path. genuinely curious what cache-hit-rate distribution looks like across your own tenant base — drop a number if you've measured it.
