DEV Community: John Medina

Stop sharing one OpenAI key across all your users

John Medina — Mon, 01 Jun 2026 14:09:04 +0000

I see this pattern everywhere. A startup launches their AI feature, they drop a single OPENAI_API_KEY in their .env, and call it a day.

tbh, it works fine for the first 100 users. Then user 101 figures out how to write a 50-turn loop that triggers your agent to summarize War and Peace every hour, and your Stripe balance goes negative.

The problem isn't the API cost. The problem is you have zero multi-tenant attribution. When the $5k bill hits, all you see is gpt-4o usage. You have no idea who caused it.

If you are building B2B SaaS, you need to track cost per tenant from day one. Not per endpoint. Not per model. Per tenant.

How to actually fix this:

Stop using the raw OpenAI client everywhere. Wrap it.
Inject tenantId and userId into every single completion request as metadata or a tag.
Log the usage object from the response asynchronously. Don't block the critical path.
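
A minimal sketch of those three steps, under loud assumptions: `client.complete` and `logUsage` are hypothetical stand-ins rather than a real SDK API, and the `metadata` field is just one way to carry the tags. Swap in your provider client and your own store.

```javascript
// Hypothetical wrapper: `client.complete` and `logUsage` are stand-ins,
// not a real SDK API. Swap in your provider client and your own store.
function createTenantClient(client, logUsage) {
  return async function complete(request, { tenantId, userId }) {
    if (!tenantId || !userId) {
      throw new Error("tenantId and userId are required on every call");
    }
    // Tag the outgoing request so attribution travels with it.
    const tagged = {
      ...request,
      metadata: { ...request.metadata, tenantId, userId },
    };
    const response = await client.complete(tagged);

    // Fire-and-forget: the caller gets the response without waiting on the log write.
    Promise.resolve()
      .then(() =>
        logUsage({
          tenantId,
          userId,
          model: request.model,
          usage: response.usage,
          ts: Date.now(),
        })
      )
      .catch((err) => console.error("usage log failed", err));

    return response;
  };
}
```

Refusing calls that lack a tenantId is the design choice that matters here: an untagged call is an unattributable line on the bill.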

I built LLMeter exactly for this because I got tired of building the same tracking wrapper at every company. It's open source (AGPL), uses Supabase, and tracks cost per user and per day out of the box. ymmv with other tools, but you need something that gives you a dashboard of which users are burning your margin.

Stop flying blind.

Cache-hit dispersion is the 7th vendor-risk axis — and the one your invoice can't see

John Medina — Mon, 01 Jun 2026 14:04:58 +0000

stavros dropped a comment on hn yesterday that should have ended the per-token billing conversation for anyone running a multi-tenant llm product, but it didn't, because the implication is too inconvenient to take seriously yet (hn thread on the deepseek reasonix front-page post, 581 points / 243 comments).

his numbers:

"the prices are what equivalent Sonnet usage would have cost, the actual amount I paid was $10. On performance, DeepSeek V4 Pro is comparable to Sonnet for me. 97.27% cache hit rate."

ten dollars actual. two hundred forty one dollars sonnet-equivalent. same task, same token volume. the variable the sticker comparison hides: how many of his calls landed on warm cache.

twenty four x.

and three more accounts in the same sub-thread confirmed similar dispersion — embedding-shape at 96.4% through a codex bridge, estebarb at 98.6% on opencode, metalspot saying his own steady-state agent loop sits "consistently above 95 once the context is primed." different stacks, different bridges, all converging on the same shape: once you're cache-warm, the per-token sticker price stops describing what you pay.

if you run a saas where customers consume llm tokens — chat, agent, copilot, anything — that 24× spread is between your tenants on the same model. and your vendor dashboard is reporting the aggregate. you have no idea which customers actually cost you money.

the seven axes, written down in one place

we've been mapping a vendor-risk taxonomy on this blog for about six weeks, one axis per hn front-page incident. people keep asking for the consolidated version, so here it is, with the originating thread for each:

#	axis	originating signal	what it costs you
1	acquihire-eol	helicone → mintlify (2025-08); stainless → anthropic (2026-05-19, HN 48182281)	observability vendor goes dark, you migrate under duress
2	multiplier-creep	gemini 3.5 flash repriced ~14× from flash 3 at launch (2026-05-20); copilot 27× credit multiplier; cursor team plan 5×	unit cost moves under a stable model name
3	suspension-without-recourse	railway / gcp account terminations posted as ask hn (2026-05-21)	provider kills your service with no human escalation
4	tco opacity	"was my $48k gpu server worth it" devto 2026-05-22	rent-vs-own per-experiment cost unknowable until you've already chosen
5	budget-blowout-at-scale	microsoft killing claude code internally because a december pilot ate 2026's ai budget (HN 48238979, 285 pts)	pilot becomes annual line item before kill-switch fires
6	support-function-attrition	aws "four years and out" (HN 48254475, 219 pts) — ex-ossm liaison leaves because human-in-the-loop roles are getting llm-restructured	the non-fungible human who reverses a wrongful suspension isn't there next quarter
7	cache-hit-rate dispersion	deepseek reasonix (HN 48261733, 581 pts, 2026-05-24) — $10 vs $241 / 24× at 97% cache	unit cost spread of 10-25× between tenants on the same model is invisible until you measure it harness-side

axes 1–6 are detectable from billing data. you can look at your invoice and notice the change. they're slow, but they're legible.

axis 7 is structurally different. the vendor invoice is already weighted by the actual cache hit ratio you got. it doesn't tell you that customer A is costing you $0.04/request at 98% cache while customer B is costing you $0.95/request at 61% cache on the same prompts. you see one line: "this month's deepseek bill: $3,500." you don't see that 3 of your 200 tenants generated 70% of the marginal cost.

why this kept being invisible

three reasons.

first, the vendor pricing card lists per-token rates with a "cached input" line at 10-25% of the regular input rate. it implies cache is a 4-10× discount applied to your invoice in aggregate. it isn't. cache hit rate is a property of the workload, not the model — and workloads vary across tenants by an order of magnitude on the same product.

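a tenant whose conversation loop reuses 30k tokens of system prompt + scratchpad + tool definitions on every turn lives at 95-98%. a tenant whose product spawns one-shot calls with fresh context per request lives at 0-15%. with cached input at 10% of the list rate, the warm tenant pays about 12-15% of list per input token and the cold tenant 87-100%. that's not a flat 4× discount, that's a 6-8× spread between tenants on input tokens alone.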

second, vendor dashboards aggregate. anthropic's usage view, openai's billing dashboard, deepseek's portal — all report cache hit across your entire account. for a single-tenant product that's fine. for a saas with 200 customers, you're looking at the average of two populations: the warm-cache power users and the cold-cache thrash users. the average tells you neither.

third, "cost equivalence" reporting masks it further. when deepseek tells stavros his calls would have cost $241 on sonnet, that's a sonnet-equivalent price calculated on token volume. it doesn't subtract anthropic's prompt caching, which also offers 90% discounts on cache hits. the apples-to-apples number on sonnet would be lower than $241 — but stavros wouldn't know that without re-running on sonnet and measuring his own cache rate there too. the sticker comparison is doing what stickers do: hiding the variance.

what cache-hit dispersion does to your p&l

let me run the math on a synthetic but realistic shape.

assume:

100 tenants on your product
identical pricing: $50/mo per tenant
identical surface workload: ~2M input tokens per tenant per month
cache hit rates distributed: 40% of tenants at 90-98%, 40% at 50-70%, 20% at 10-30%
deepseek v4 pro pricing: $0.27/M input (uncached), $0.07/M cache hit

on input tokens alone, that's 2 × ($0.27 × miss rate + $0.07 × hit rate) per tenant per month:

the warm tenants cost you roughly $0.15-$0.18/month on inference.
the mid tenants cost you $0.26-$0.34/month.
the cold tenants cost you $0.42-$0.50/month.

on a $50 sticker price these all look profitable. but: your 20 cold-cache tenants are eating ~$9/month combined while your 40 warm-cache tenants cost ~$6.50. half as many tenants, a bigger share of the bill, roughly 3× the unit cost on identical token volume. one cohort is subsidizing the other and you can't see it because you priced on average tokens.
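
a sketch of that arithmetic, using the rates and volume from the assumptions above. it covers input tokens only (output tokens, cache writes and retries are left out), and the single representative hit rate per cohort is my assumption, picked from inside each stated range:

```javascript
// Rates and volume come from the assumptions listed above (input tokens only).
const UNCACHED_PER_M = 0.27; // $ per 1M uncached input tokens
const CACHED_PER_M = 0.07;   // $ per 1M cache-hit input tokens

// Monthly input cost in dollars for one tenant at a given cache hit rate (0..1).
function monthlyInputCost(inputTokens, hitRate) {
  const millions = inputTokens / 1e6;
  return millions * ((1 - hitRate) * UNCACHED_PER_M + hitRate * CACHED_PER_M);
}

// One representative hit rate per cohort (an assumption within each range).
const cohorts = [
  { name: "warm", tenants: 40, hitRate: 0.95 },
  { name: "mid", tenants: 40, hitRate: 0.60 },
  { name: "cold", tenants: 20, hitRate: 0.20 },
];

for (const c of cohorts) {
  const perTenant = monthlyInputCost(2e6, c.hitRate);
  // Prints: cohort name, per-tenant monthly cost, cohort total.
  console.log(c.name, perTenant.toFixed(2), (perTenant * c.tenants).toFixed(2));
}
```

swap in your own rate card and per-tenant hit rates from the ledger and the same function gives you the per-tenant input cogs.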

now scale that to coding agents — where prompt sizes are 50k–200k and cache hit rate dispersion is even wider — and the math gets worse. an agent loop on a 200k context can cost you $3-$8 per task at low cache or $0.10-$0.20 at high cache. two orders of magnitude on the same workload on the same model.

this is the structural reason "what's our cogs per customer" stops being answerable from vendor dashboards in 2026. the question isn't "how much did i spend on llms" anymore. it's "how is that spend distributed across the customers who generated it" — and the answer lives in your runtime, not in the bill.

instrumenting axis 7 in your stack today

three changes, low effort, you can ship this week:

1. capture `cache_read_input_tokens` separately on every call

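most providers that support prompt caching return a cached-token count in the usage object, though the field name varies (`cache_read_input_tokens` is the anthropic-style name). log it. don't roll cached and uncached input together: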

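async function attributedCall(req, ctx) {
  const res = await provider.messages.create(req)
  const u = res.usage
  // assumes anthropic-style usage: input_tokens excludes cache reads and
  // cache writes, so the ratio's denominator has to add them back in
  const cached = u.cache_read_input_tokens ?? 0
  const totalInput = u.input_tokens + cached + (u.cache_creation_input_tokens ?? 0)
  await ledger.insert({
    ts: new Date(),
    tenant_id: ctx.tenant,
    feature_id: ctx.feature,
    model: req.model,
    input_tokens: u.input_tokens,
    cached_input_tokens: cached,
    output_tokens: u.output_tokens,
    // share of all input tokens served from cache, always between 0 and 1
    cache_hit_ratio: cached / Math.max(totalInput, 1),
    cost_usd: priceWithCache(req.model, u),
  })
  return res
}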

5-10 lines. it gives you the only field that actually predicts your invoice variance per tenant.

2. roll up cache hit rate per tenant, not per account

a daily job that computes tenant_id → median cache_hit_ratio, p10 cache_hit_ratio, n_requests. that's the table that tells you which customers are on the wrong end of axis 7. if the gap between p10 and median is wider than 30 percentage points inside a single tenant, that tenant has internal workload variance worth understanding — usually a bursty integration or a feature flagged on for them only.
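
a minimal in-memory sketch of that rollup, assuming one day of ledger rows shaped like `{ tenant_id, cache_hit_ratio }`. the function names and the nearest-rank percentile are my choices for illustration; in production this is more likely a sql query over the ledger table:

```javascript
// Nearest-rank percentile on an ascending-sorted array (p in 0..100).
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
}

// rows: [{ tenant_id, cache_hit_ratio }] for one day of calls.
function rollupByTenant(rows) {
  const byTenant = new Map();
  for (const row of rows) {
    if (!byTenant.has(row.tenant_id)) byTenant.set(row.tenant_id, []);
    byTenant.get(row.tenant_id).push(row.cache_hit_ratio);
  }
  const out = [];
  for (const [tenant_id, ratios] of byTenant) {
    ratios.sort((a, b) => a - b);
    const median = percentile(ratios, 50);
    const p10 = percentile(ratios, 10);
    out.push({
      tenant_id,
      n_requests: ratios.length,
      median_cache_hit_ratio: median,
      p10_cache_hit_ratio: p10,
      // Flag tenants whose p10 sits more than 30 points below their median.
      wide_gap: median - p10 > 0.30,
    });
  }
  return out;
}
```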

3. set a per-tenant marginal cogs threshold, not a total spend alert

alert on "tenant T's marginal cost-per-request crossed $0.50 for the rolling 7-day window," not "this month's bill is up 20%." by the time the second alert fires, the bill is already up 20%. the first one fires while the workload is still in progress and you can intervene — change the model, throttle, route, talk to the customer about what changed.
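
a sketch of the first kind of alert, assuming ledger rows shaped like `{ tenant_id, ts, cost_usd }` with `ts` in epoch milliseconds. "marginal cost-per-request" is taken here as the plain average cost per request inside the window, which is one reasonable reading rather than the only one:

```javascript
const WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // rolling 7-day window

// rows: [{ tenant_id, ts (ms epoch), cost_usd }]; returns tenants over the threshold.
function tenantsOverThreshold(rows, now, thresholdUsd = 0.5) {
  const totals = new Map();
  for (const row of rows) {
    if (row.ts <= now - WINDOW_MS || row.ts > now) continue; // outside the window
    const t = totals.get(row.tenant_id) ?? { cost: 0, requests: 0 };
    t.cost += row.cost_usd;
    t.requests += 1;
    totals.set(row.tenant_id, t);
  }
  const alerts = [];
  for (const [tenant_id, t] of totals) {
    const costPerRequest = t.cost / t.requests;
    if (costPerRequest > thresholdUsd) {
      alerts.push({ tenant_id, costPerRequest, requests: t.requests });
    }
  }
  return alerts;
}
```

run it on a short interval against recent rows and route the result to whatever pages you. the point is that it keys on the tenant, not on the account total.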

why dashboards from observability vendors will continue to miss this

datadog, new relic, sentry, helicone, langfuse, portkey, langsmith — almost all of them sit at the model layer or the gateway layer. they see calls. they tag calls. they aggregate calls. what they don't do is own the harness-side attribution: which session, which feature, which tenant, which agent loop iteration, which retry — the keys that let you join cache_hit_ratio to your business object.

the vendors that ship at the gateway layer have a structural conflict, too: most of them are owned by, acquired by, or routing through the same providers whose pricing card they'd need to interrogate. helicone is mintlify property since 2025-08. langfuse is clickhouse property since 2026-01. stainless is part of anthropic as of 2026-05-19. portkey is mid-acquisition by palo alto networks per their 2026-04-30 release. axis 1 (acquihire-eol) and axis 7 (cache dispersion) collide here: the layer that should measure dispersion is the layer that's getting acquired by the entity whose dispersion you're trying to measure.

self-hosting the attribution layer — agpl, in your stack, owned by you — is the only configuration where the answer to "which tenant cost me what" stays answerable across provider acquisitions, pricing changes, and dashboard re-skins.

what to do this week, no tool required

pull last 30 days of llm calls from your logs. group by tenant. compute median cache_hit_ratio per tenant. if you don't have cache_read_input_tokens logged, add it today — 5 lines per call site.
find the 5 tenants with the worst cache hit rate. what's their workload shape? thrashing context? cold-start agents? you probably have a product problem, not just a cost problem.
find the 5 tenants with the best cache hit rate. how are they using your product? this is your retention shape. they're the ones priced correctly under your current sticker.
compute marginal cost per active tenant per day. divide by that tenant's sticker price prorated to a day (a $50/mo plan is about $1.67/day). anything above 25% is a margin red flag. anything above 100% is a customer you're paying to keep.
write down the calendar. the june 15 anthropic agent sdk credit-pool split changes cache accounting semantics for everyone on claude. the deepseek v4-pro 75% promo expired 2026-05-31 15:59 UTC, so the per-token rate is now four times the promo rate. if you don't already model what your cache-hit distribution does at the new prices, you'll find out on the july invoice.
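
the margin check from that list can be sketched in a few lines. the function name and the 30-day month are assumptions for illustration; the 25% and 100% cut-offs are the ones stated above:

```javascript
// dailyCostUsd: a tenant's average marginal cost per active day.
// monthlyPriceUsd: what that tenant pays per month (the sticker price).
// The daily cost is scaled to a month so both sides use the same unit.
function marginFlag(dailyCostUsd, monthlyPriceUsd, daysPerMonth = 30) {
  const share = (dailyCostUsd * daysPerMonth) / monthlyPriceUsd;
  if (share > 1.0) return { share, flag: "paying-to-keep" };
  if (share > 0.25) return { share, flag: "margin-red-flag" };
  return { share, flag: "ok" };
}
```

for example, a tenant costing $0.50/day on a $50/mo plan lands at a 30% share and gets the red flag.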

the point

per-token billing was a fine abstraction in 2023 when context windows were 4-8k, cache wasn't a line item, and most products had one workload shape. it stopped describing reality somewhere around the time agent loops normalized 200k contexts and providers shipped 90% prompt-cache discounts. the unit cost spread between cache-warm and cache-cold tenants on the same model is now larger than the spread between different models, and nothing in the vendor's billing surface tells you which side any given tenant is on.

axes 1-6 of the taxonomy say "your vendor will surprise you on the bill." axis 7 says "your vendor will surprise you on which customers generated the bill" — and that one is worse, because customer p&l drives the product decisions you make from here.

if your cogs report rolls cache and non-cache together, you don't have an attribution model. you have an average that lies about your distribution.

i build llmeter — open-source (agpl-3.0) attribution at the harness layer. per-tenant rollups, per-feature cogs, cache-hit ratio surfaced as a first-class metric. it's not a proxy and doesn't sit in your request path. genuinely curious what cache-hit-rate distribution looks like across your own tenant base — drop a number if you've measured it.

Your prompt is getting longer without you knowing it (and it's killing your margins)

John Medina — Fri, 29 May 2026 14:02:11 +0000

I've been looking at LLM billing patterns lately, and there's a silent killer that creeps up on almost every team: prompt inflation.

When you first build an AI feature, your prompt is tight. Maybe 500 tokens for the system instructions and 100 for the user query. The math looks great. "This will cost us fractions of a cent per call," you tell the team.

Fast forward three months.

Someone added conversation history to make the bot "smarter." Another dev added a massive RAG context block because the model hallucinated once. Product asked for formatting instructions, so now the system prompt is a 2,000-word essay.

Suddenly, your baseline request is 8k tokens.

The worst part is that user value doesn't scale linearly with prompt size. But your OpenAI bill sure does. If you're running at scale, you're suddenly paying $0.05+ per request for a feature you modeled at $0.005.

If you only look at your monthly total on the provider dashboard, it reads as more usage. You think "growth is good" until the Stripe payout hits and you realize your margins are gone.

You need to track cost per user and cost per feature, not just total spend. If you see specific users driving crazy costs, they're probably accumulating massive context windows that you need to truncate.
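
Here is a rough sketch of that truncation step. The characters-divided-by-four token estimate is a crude assumption for illustration (use your provider's tokenizer for real counts), and the function names are hypothetical:

```javascript
// Rough token estimate (~4 characters per token for English text).
// An assumption for illustration; use a real tokenizer for accurate counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the system prompt plus the most recent messages that fit in `budget` tokens.
function truncateHistory(systemPrompt, messages, budget) {
  let used = estimateTokens(systemPrompt);
  const kept = [];
  // Walk backwards so the newest turns survive and the oldest are dropped first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > budget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return { messages: kept, estimatedTokens: used };
}
```

Logging `estimatedTokens` per user alongside cost is what makes the inflation visible: the number creeps up long before the invoice does.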

fwiw, I ran into this exact issue, which is why I built LLMeter (https://llmeter.org?utm_source=devto&utm_medium=article&utm_campaign=2026-04-21-prompt-inflation-margin-killer). It's an open-source, proxy-free way to track this stuff. It attributes costs down to the user ID level so you can actually see who is dragging around a 10k token history.

Stop assuming your prompt is the same size it was on day one. Track it.

You Don't Need Enterprise LLMOps, You Need a Better Dashboard

John Medina — Wed, 27 May 2026 14:02:57 +0000

Token bills are getting out of hand. Everyone knows it. The default response has been to reach for massive, venture-backed "LLMOps" platforms that promise to solve everything. They offer observability, caching, prompt versioning, evaluation, and a dozen other features.

tbh, for most of us, that's overkill. It's like buying a full-scale CI/CD platform when all you need is a simple cron job.

The real problem for 90% of devs isn't complex prompt A/B testing or fine-tuning workflows. It's answering one basic question: "Who or what is costing me so much money?"

Usually, the answer is buried in a CSV file from OpenAI or Anthropic. You end up writing custom scripts to parse it, attribute costs to users, and hope you catch the runaway agent that's stuck in a loop summarizing the same text 1,000 times.

This isn't an "observability" problem. It's a dashboard problem.

Before you invest in a complex system, you need a clear view of three things:

Cost per user: Which tenant is burning through your credits?
Cost per model: Is claude-3-opus really worth 15x more than haiku for that simple task?
Real-time alerts: Can you get a Slack notification when a user's spend hits $100, before it hits $1,000?

Most enterprise tools do this, but they bundle it with features you won't touch for months. And they aren't cheap.

This is why we built LLMeter as an open-source tool. It's not a massive platform. It's a focused, self-hostable dashboard (Next.js, Supabase) that does one thing well: monitor costs across different providers (OpenAI, Anthropic, DeepSeek, OpenRouter).

It gives you multi-tenant attribution and budget alerts without the enterprise complexity. You can see which user is calling which model and how much it's costing you, in real-time. AGPL-3.0, so you can host it yourself.

fwiw, the next time your bill spikes, don't assume you need a revolutionary AI-powered solution. You might just need a better dashboard. Check out the project at llmeter.org.

The Token Spiral: How One Runaway AI Agent Burned $2,847 in 4 Hours

John Medina — Mon, 25 May 2026 14:08:57 +0000

traditional monitoring is completely broken when it comes to AI agents.

we've all seen the dashboards. everything is green. HTTP 200s across the board. p99 latency looks fine. CPU is barely ticking.

meanwhile, your agent is stuck in an infinite retry loop, burning $80 per iteration because it keeps hallucinating an invalid JSON payload and asking the LLM to fix it.

this exact failure mode—the "token spiral"—recently burned $2,847 in just 4 hours for a dev team. and they only noticed because their card declined.

here is why standard observability tools miss this:
they track the container, the request, the database. they don't track the tokens per customer task.

when an agent starts spiraling, it's making valid API calls to OpenAI or Anthropic. the provider happily returns 200 OK. the latency might be slightly elevated, but not enough to trigger a generic PagerDuty alert. it just looks like heavy usage.

to catch a token spiral before it bankrupts you, you need runtime cost enforcement. not just a daily digest, but active circuit breakers.
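
a minimal sketch of what such a breaker could look like, per customer task. the class name and limits are hypothetical, and it assumes you can compute each call's cost from the usage object before the next iteration starts:

```javascript
// Per-task spend cap: trips once accumulated cost or iteration count crosses a limit.
class CostCircuitBreaker {
  constructor({ maxCostUsd, maxIterations }) {
    this.maxCostUsd = maxCostUsd;
    this.maxIterations = maxIterations;
    this.spentUsd = 0;
    this.iterations = 0;
  }

  // Call after every LLM response with that call's cost; throws once a limit is crossed.
  record(costUsd) {
    this.spentUsd += costUsd;
    this.iterations += 1;
    if (this.spentUsd > this.maxCostUsd || this.iterations > this.maxIterations) {
      throw new Error(
        `circuit open: $${this.spentUsd.toFixed(2)} spent over ${this.iterations} iterations`
      );
    }
  }
}
```

create one breaker per task, call `record` inside the agent loop, and let the thrown error end the loop. a retry spiral then stops at the cap instead of at the card limit.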

if you're at an enterprise, you buy Braintrust or Vantage.
if you're building a startup or just vibing in your garage, you can't afford those.

imo, you need open-source per-customer cost attribution. i built LLMeter to solve exactly this problem. it tracks costs by model, by user, by day. you can set budget alerts and actually see which specific tenant is spiraling out of control.

ymmv, but don't deploy agents without cost circuit breakers. the API providers aren't going to refund you for bad prompts.

The Token Spiral: How One Runaway AI Agent Burned $2,847 in 4 Hours

John Medina — Fri, 22 May 2026 14:02:11 +0000