<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John Medina</title>
    <description>The latest articles on DEV Community by John Medina (@amedinat).</description>
    <link>https://dev.to/amedinat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854284%2F73b7fb73-f118-4d37-b5a7-37581d43bd0a.png</url>
      <title>DEV Community: John Medina</title>
      <link>https://dev.to/amedinat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amedinat"/>
    <language>en</language>
    <item>
      <title>Stop sharing one OpenAI key across all your users</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:09:04 +0000</pubDate>
      <link>https://dev.to/amedinat/stop-sharing-one-openai-key-across-all-your-users-3g8g</link>
      <guid>https://dev.to/amedinat/stop-sharing-one-openai-key-across-all-your-users-3g8g</guid>
      <description>&lt;p&gt;I see this pattern everywhere. A startup launches their AI feature, they drop a single &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; in their &lt;code&gt;.env&lt;/code&gt;, and call it a day. &lt;/p&gt;

&lt;p&gt;tbh, it works fine for the first 100 users. Then user 101 figures out how to write a 50-turn loop that triggers your agent to summarize War and Peace every hour, and your Stripe balance goes negative.&lt;/p&gt;

&lt;p&gt;The problem isn't the API cost. The problem is you have zero multi-tenant attribution. When the $5k bill hits, all you see is &lt;code&gt;gpt-4o&lt;/code&gt; usage. You have no idea &lt;em&gt;who&lt;/em&gt; caused it.&lt;/p&gt;

&lt;p&gt;If you are building B2B SaaS, you need to track cost per tenant from day one. Not per endpoint. Not per model. Per tenant. &lt;/p&gt;

&lt;p&gt;How to actually fix this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop using the raw OpenAI client everywhere. Wrap it.&lt;/li&gt;
&lt;li&gt;Inject &lt;code&gt;tenantId&lt;/code&gt; and &lt;code&gt;userId&lt;/code&gt; into every single completion request as metadata or a tag. &lt;/li&gt;
&lt;li&gt;Log the &lt;code&gt;usage&lt;/code&gt; object from the response asynchronously. Don't block the critical path.&lt;/li&gt;
&lt;/ol&gt;
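
&lt;p&gt;A minimal sketch of those three steps, assuming a generic client that exposes a &lt;code&gt;complete(request)&lt;/code&gt; method and a log sink with a &lt;code&gt;write(row)&lt;/code&gt; method. Both shapes, and the &lt;code&gt;createTrackedClient&lt;/code&gt; name, are placeholders for illustration, not the OpenAI SDK's API or LLMeter's:&lt;/p&gt;

```javascript
// Sketch only: `client`, `sink`, and the `ctx` shape are hypothetical stand-ins,
// not the API of any particular SDK.
function createTrackedClient(client, sink) {
  return {
    async complete(request, ctx) {
      // Step 2: refuse calls that cannot be attributed to a tenant and user.
      if (!ctx || !ctx.tenantId || !ctx.userId) {
        throw new Error("tenantId and userId are required on every call");
      }
      const response = await client.complete(request);
      // Step 3: fire and forget, so the caller never waits on the log write.
      Promise.resolve()
        .then(() => sink.write({
          tenantId: ctx.tenantId,
          userId: ctx.userId,
          model: request.model,
          usage: response.usage,
          ts: Date.now(),
        }))
        .catch((err) => console.error("usage log failed", err));
      return response;
    },
  };
}
```

&lt;p&gt;The log write is deliberately not awaited, so a slow or failing sink never delays the response. The trade-off is that a crash between the response and the write can drop a usage row.&lt;/p&gt;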

&lt;p&gt;I built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-multi-tenant-cost-attribution-20260422" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; exactly for this because I got tired of building the same tracking wrapper at every company. It's open source (AGPL), uses Supabase, and tracks cost per user and per day out of the box. ymmv with other tools, but you need &lt;em&gt;something&lt;/em&gt; that gives you a dashboard of which users are burning your margin. &lt;/p&gt;

&lt;p&gt;Stop flying blind.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Cache-hit dispersion is the 7th vendor-risk axis — and the one your invoice can't see</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:04:58 +0000</pubDate>
      <link>https://dev.to/amedinat/cache-hit-dispersion-is-the-7th-vendor-risk-axis-and-the-one-your-invoice-cant-see-4b93</link>
      <guid>https://dev.to/amedinat/cache-hit-dispersion-is-the-7th-vendor-risk-axis-and-the-one-your-invoice-cant-see-4b93</guid>
      <description>&lt;p&gt;stavros dropped a comment on hn yesterday that should have ended the per-token billing conversation for anyone running a multi-tenant llm product, but it didn't, because the implication is too inconvenient to take seriously yet (&lt;a href="https://news.ycombinator.com/item?id=48261733" rel="noopener noreferrer"&gt;thread&lt;/a&gt;, 581 pts / 243 c on the deepseek reasonix front page).&lt;/p&gt;

&lt;p&gt;his numbers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"the prices are what equivalent Sonnet usage would have cost, the actual amount I paid was $10. On performance, DeepSeek V4 Pro is comparable to Sonnet for me. 97.27% cache hit rate."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ten dollars actual. two hundred forty one dollars sonnet-equivalent. same task, same model, same pricing card. the only variable: how many of his calls landed on warm cache.&lt;/p&gt;

&lt;p&gt;twenty four x.&lt;/p&gt;

&lt;p&gt;and three more accounts in the same sub-thread confirmed similar dispersion — embedding-shape at 96.4% through a codex bridge, estebarb at 98.6% on opencode, metalspot saying his own steady-state agent loop sits "consistently above 95 once the context is primed." different stacks, different bridges, all converging on the same shape: once you're cache-warm, the per-token sticker price stops describing what you pay.&lt;/p&gt;

&lt;p&gt;if you run a saas where customers consume llm tokens — chat, agent, copilot, anything — that 24× spread is &lt;em&gt;between your tenants on the same model&lt;/em&gt;. and your vendor dashboard is reporting the aggregate. you have no idea which customers actually cost you money.&lt;/p&gt;

&lt;h2&gt;
  
  
  the seven axes, written down in one place
&lt;/h2&gt;

&lt;p&gt;we've been mapping a vendor-risk taxonomy on this blog for about six weeks, one axis per hn front-page incident. people keep asking for the consolidated version, so here it is, with the originating thread for each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;axis&lt;/th&gt;
&lt;th&gt;originating signal&lt;/th&gt;
&lt;th&gt;what it costs you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;acquihire-eol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;helicone → mintlify (2025-08); stainless → anthropic (2026-05-19, &lt;a href="https://news.ycombinator.com/item?id=48182281" rel="noopener noreferrer"&gt;HN 48182281&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;observability vendor goes dark, you migrate under duress&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;multiplier-creep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gemini 3.5 flash repriced ~14× from flash 3 at launch (2026-05-20); copilot 27× credit multiplier; cursor team plan 5×&lt;/td&gt;
&lt;td&gt;unit cost moves under a stable model name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;suspension-without-recourse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;railway / gcp account terminations posted as ask hn (2026-05-21)&lt;/td&gt;
&lt;td&gt;provider kills your service with no human escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;tco opacity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"was my $48k gpu server worth it" devto 2026-05-22&lt;/td&gt;
&lt;td&gt;rent-vs-own per-experiment cost unknowable until you've already chosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;budget-blowout-at-scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;microsoft killing claude code internally because a december pilot ate 2026's ai budget (&lt;a href="https://news.ycombinator.com/item?id=48238979" rel="noopener noreferrer"&gt;HN 48238979&lt;/a&gt;, 285 pts)&lt;/td&gt;
&lt;td&gt;pilot becomes annual line item before kill-switch fires&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;support-function-attrition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;aws "four years and out" (&lt;a href="https://news.ycombinator.com/item?id=48254475" rel="noopener noreferrer"&gt;HN 48254475&lt;/a&gt;, 219 pts) — ex-ossm liaison leaves because human-in-the-loop roles are getting llm-restructured&lt;/td&gt;
&lt;td&gt;the non-fungible human who reverses a wrongful suspension isn't there next quarter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;cache-hit-rate dispersion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;deepseek reasonix (&lt;a href="https://news.ycombinator.com/item?id=48261733" rel="noopener noreferrer"&gt;HN 48261733&lt;/a&gt;, 581 pts, 2026-05-24) — $10 vs $241 / 24× at 97% cache&lt;/td&gt;
&lt;td&gt;unit cost spread of 10-25× between tenants on the same model is invisible until you measure it harness-side&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;axes 1–6 are detectable from billing data. you can look at your invoice and notice the change. they're slow, but they're legible.&lt;/p&gt;

&lt;p&gt;axis 7 is structurally different. the vendor invoice is &lt;strong&gt;already&lt;/strong&gt; weighted by the actual cache hit ratio you got. it doesn't tell you that customer A is costing you $0.04/request at 98% cache while customer B is costing you $0.95/request at 61% cache on the same prompts. you see one line: "this month's deepseek bill: $3,500." you don't see that 3 of your 200 tenants generated 70% of the marginal cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  why this kept being invisible
&lt;/h2&gt;

&lt;p&gt;three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;first&lt;/strong&gt;, the vendor pricing card lists per-token rates with a "cached input" line at 10-25% of the regular input rate. it implies cache is a 4-10× discount applied to your invoice in aggregate. it isn't. cache hit rate is a property of the &lt;em&gt;workload&lt;/em&gt;, not the model — and workloads vary across tenants by an order of magnitude on the same product.&lt;/p&gt;

&lt;p&gt;a tenant whose conversation loop reuses 30k tokens of system prompt + scratchpad + tool definitions on every turn lives at 95-98%. a tenant whose product spawns one-shot calls with fresh context per request lives at 0-15%. that's not a 4× spread, that's a 20-30× spread on input tokens alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;second&lt;/strong&gt;, vendor dashboards aggregate. anthropic's usage view, openai's billing dashboard, deepseek's portal — all report cache hit &lt;em&gt;across your entire account&lt;/em&gt;. for a single-tenant product that's fine. for a saas with 200 customers, you're looking at the average of two populations: the warm-cache power users and the cold-cache thrash users. the average tells you neither.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;third&lt;/strong&gt;, "cost equivalence" reporting masks it further. when deepseek tells stavros his calls would have cost $241 on sonnet, that's a &lt;em&gt;sonnet-equivalent&lt;/em&gt; price calculated on token volume. it doesn't subtract anthropic's prompt caching, which also offers 90% discounts on cache hits. the apples-to-apples number on sonnet would be lower than $241 — but stavros wouldn't know that without re-running on sonnet and measuring his own cache rate there too. the sticker comparison is doing what stickers do: hiding the variance.&lt;/p&gt;

&lt;h2&gt;
  
  
  what cache-hit dispersion does to your p&amp;amp;l
&lt;/h2&gt;

&lt;p&gt;let me run the math on a synthetic but realistic shape.&lt;/p&gt;

&lt;p&gt;assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 tenants on your product&lt;/li&gt;
&lt;li&gt;identical pricing: $50/mo per tenant&lt;/li&gt;
&lt;li&gt;identical surface workload: ~2M input tokens per tenant per month&lt;/li&gt;
&lt;li&gt;cache hit rates distributed: 40% of tenants at 90-98%, 40% at 50-70%, 20% at 10-30%&lt;/li&gt;
&lt;li&gt;deepseek v4 pro pricing: $0.27/M input (uncached), $0.07/M cache hit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;on input tokens alone, the warm tenants cost you roughly $0.15-$0.18/month on inference.&lt;br&gt;
the mid tenants cost you $0.26-$0.34/month.&lt;br&gt;
the cold tenants cost you $0.42-$0.50/month.&lt;/p&gt;

&lt;p&gt;on a $50 sticker price these all look profitable. but: your 20 cold-cache tenants are eating ~$9/month combined while your 40 warm-cache tenants, twice as many accounts, come to ~$7. per tenant that is close to a 3× gap for an identical token volume. one cohort is subsidizing the other and you can't see it because you priced on average tokens.&lt;/p&gt;
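
&lt;p&gt;a rough way to check this kind of arithmetic yourself, using the token volume and per-million rates from the assumptions above. it counts input tokens only, so output tokens and any other per-request costs are outside the sketch, and &lt;code&gt;monthlyInputCost&lt;/code&gt; is an illustrative name:&lt;/p&gt;

```javascript
// Input-token cost only, using the rates from the assumptions list.
const UNCACHED_PER_M = 0.27; // USD per million uncached input tokens
const CACHED_PER_M = 0.07;   // USD per million cache-hit input tokens

function monthlyInputCost(inputTokens, cacheHitRate) {
  const millions = inputTokens / 1e6;
  const cached = millions * cacheHitRate;
  const uncached = millions - cached;
  return uncached * UNCACHED_PER_M + cached * CACHED_PER_M;
}

// Print the per-tenant monthly cost at the edges of each cohort.
for (const rate of [0.98, 0.9, 0.7, 0.5, 0.3, 0.1]) {
  const cost = monthlyInputCost(2e6, rate);
  console.log(`hit rate ${Math.round(rate * 100)}%: $${cost.toFixed(2)}/month`);
}
```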

&lt;p&gt;now scale that to coding agents — where prompt sizes are 50k–200k and cache hit rate dispersion is even wider — and the math gets worse. an agent loop on a 200k context can cost you $3-$8 per task at low cache or $0.10-$0.20 at high cache. &lt;strong&gt;two orders of magnitude on the same workload on the same model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;this is the structural reason "what's our cogs per customer" stops being answerable from vendor dashboards in 2026. the question isn't "how much did i spend on llms" anymore. it's "how is that spend distributed across the customers who generated it" — and the answer lives in your runtime, not in the bill.&lt;/p&gt;
&lt;h2&gt;
  
  
  instrumenting axis 7 in your stack today
&lt;/h2&gt;

&lt;p&gt;three changes, low effort, you can ship this week:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. capture &lt;code&gt;cache_read_input_tokens&lt;/code&gt; separately on every call
&lt;/h3&gt;

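&lt;p&gt;every major provider returns an equivalent field, under different names: anthropic reports &lt;code&gt;cache_read_input_tokens&lt;/code&gt;, openai reports &lt;code&gt;prompt_tokens_details.cached_tokens&lt;/code&gt;, deepseek reports &lt;code&gt;prompt_cache_hit_tokens&lt;/code&gt;. log it. don't roll cached and uncached input together:&lt;/p&gt;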

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;attributedCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ledger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;feature_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;cached_input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache_read_input_tokens&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;cache_hit_ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache_read_input_tokens&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;priceWithCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5-10 lines. it gives you the only field that actually predicts your invoice variance per tenant.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. roll up cache hit rate &lt;strong&gt;per tenant&lt;/strong&gt;, not per account
&lt;/h3&gt;

&lt;p&gt;a daily job that computes &lt;code&gt;tenant_id → median cache_hit_ratio, p10 cache_hit_ratio, n_requests&lt;/code&gt;. that's the table that tells you which customers are on the wrong end of axis 7. if the gap between p10 and median is wider than 30 percentage points inside a single tenant, that tenant has internal workload variance worth understanding — usually a bursty integration or a feature flagged on for them only.&lt;/p&gt;
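
&lt;p&gt;one possible shape for that daily job, assuming the ledger rows carry &lt;code&gt;tenant_id&lt;/code&gt; and &lt;code&gt;cache_hit_ratio&lt;/code&gt;. the function names are made up for this sketch, and it uses nearest-rank percentiles, so on an even-sized sample the median is the lower middle value:&lt;/p&gt;

```javascript
// Nearest-rank percentile over an ascending-sorted array (p between 0 and 1).
function percentile(sorted, p) {
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

// rows: [{ tenant_id, cache_hit_ratio }], one row per request.
function rollupByTenant(rows) {
  const byTenant = new Map();
  for (const row of rows) {
    if (!byTenant.has(row.tenant_id)) byTenant.set(row.tenant_id, []);
    byTenant.get(row.tenant_id).push(row.cache_hit_ratio);
  }
  const out = [];
  for (const [tenant_id, ratios] of byTenant) {
    ratios.sort((a, b) => a - b);
    const median = percentile(ratios, 0.5);
    const p10 = percentile(ratios, 0.1);
    out.push({
      tenant_id,
      median,
      p10,
      n_requests: ratios.length,
      // The check from the text: p10-to-median gap wider than 30 percentage points.
      wide_gap: median - p10 > 0.3,
    });
  }
  return out;
}
```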

&lt;h3&gt;
  
  
  3. set a per-tenant marginal cogs threshold, not a total spend alert
&lt;/h3&gt;

&lt;p&gt;alert on "tenant T's marginal cost-per-request crossed $0.50 for the rolling 7-day window," not "this month's bill is up 20%." by the time the second alert fires, the bill is already up 20%. the first one fires while the workload is still in progress and you can intervene — change the model, throttle, route, talk to the customer about what changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  why dashboards from observability vendors will continue to miss this
&lt;/h2&gt;

&lt;p&gt;datadog, new relic, sentry, helicone, langfuse, portkey, langsmith — almost all of them sit at the model layer or the gateway layer. they see calls. they tag calls. they aggregate calls. what they don't do is &lt;strong&gt;own the harness-side attribution&lt;/strong&gt;: which session, which feature, which tenant, which agent loop iteration, which retry — the keys that let you join &lt;code&gt;cache_hit_ratio&lt;/code&gt; to your business object.&lt;/p&gt;

&lt;p&gt;the vendors that ship at the gateway layer have a structural conflict, too: most of them are owned by, acquired by, or routing through the same providers whose pricing card they'd need to interrogate. helicone is mintlify property since 2025-08. langfuse is clickhouse property since 2026-01. stainless is part of anthropic as of 2026-05-19. portkey is mid-acquisition by palo alto networks per their 2026-04-30 release. axis 1 (acquihire-eol) and axis 7 (cache dispersion) collide here: the layer that &lt;em&gt;should&lt;/em&gt; measure dispersion is the layer that's getting acquired by the entity whose dispersion you're trying to measure.&lt;/p&gt;

&lt;p&gt;self-hosting the attribution layer — agpl, in your stack, owned by you — is the only configuration where the answer to "which tenant cost me what" stays answerable across provider acquisitions, pricing changes, and dashboard re-skins.&lt;/p&gt;

&lt;h2&gt;
  
  
  what to do this week, no tool required
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;pull last 30 days of llm calls from your logs.&lt;/strong&gt; group by tenant. compute median cache_hit_ratio per tenant. if you don't have &lt;code&gt;cache_read_input_tokens&lt;/code&gt; logged, add it today — 5 lines per call site.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;find the 5 tenants with the worst cache hit rate.&lt;/strong&gt; what's their workload shape? thrashing context? cold-start agents? you probably have a product problem, not just a cost problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;find the 5 tenants with the best cache hit rate.&lt;/strong&gt; how are they using your product? this is your retention shape. they're the ones priced correctly under your current sticker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;compute marginal cost per active tenant per day.&lt;/strong&gt; divide by your sticker price. anything above 25% is a margin red flag. anything above 100% is a customer you're paying to keep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;write down the calendar.&lt;/strong&gt; the june 15 anthropic agent sdk credit-pool split changes cache accounting semantics for everyone on claude. the deepseek v4-pro 75% promo expires 2026-05-31 15:59 UTC and the post-promo per-token rate quadruples. if you don't already model what your cache-hit distribution does at the new prices, you'll find out on the july invoice.&lt;/li&gt;
&lt;/ol&gt;
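
&lt;p&gt;step 4 can be sketched as a small classifier. this assumes cost and sticker price are expressed over the same period (for example monthly inference cost against a monthly plan price); the &lt;code&gt;marginFlag&lt;/code&gt; name and the flag labels are illustrative:&lt;/p&gt;

```javascript
// costUsd and priceUsd must cover the same period (assumption of this sketch).
function marginFlag(costUsd, priceUsd) {
  const ratio = costUsd / priceUsd;
  if (ratio > 1) return { ratio, flag: "paying-to-keep" };
  if (ratio > 0.25) return { ratio, flag: "margin-red-flag" };
  return { ratio, flag: "ok" };
}

// tenants: [{ tenant_id, cost_usd, price_usd }]
function flagTenants(tenants) {
  return tenants
    .map((t) => ({ tenant_id: t.tenant_id, ...marginFlag(t.cost_usd, t.price_usd) }))
    .filter((t) => t.flag !== "ok");
}
```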

&lt;h2&gt;
  
  
  the point
&lt;/h2&gt;

&lt;p&gt;per-token billing was a fine abstraction in 2023 when context windows were 4-8k, cache wasn't a line item, and most products had one workload shape. it stopped describing reality somewhere around the time agent loops normalized 200k contexts and providers shipped 90% prompt-cache discounts. the unit cost spread between cache-warm and cache-cold tenants on the same model is now larger than the spread between different models, and nothing in the vendor's billing surface tells you which side any given tenant is on.&lt;/p&gt;

&lt;p&gt;axes 1-6 of the taxonomy say "your vendor will surprise you on the bill." axis 7 says "your vendor will surprise you on which &lt;em&gt;customers&lt;/em&gt; generated the bill" — and that one is worse, because customer p&amp;amp;l drives the product decisions you make from here.&lt;/p&gt;

&lt;p&gt;if your cogs report rolls cache and non-cache together, you don't have an attribution model. you have an average that lies about your distribution.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;i build &lt;a href="https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-cache-hit-axis-7" rel="noopener noreferrer"&gt;llmeter&lt;/a&gt; — open-source (agpl-3.0) attribution at the harness layer. per-tenant rollups, per-feature cogs, cache-hit ratio surfaced as a first-class metric. it's not a proxy and doesn't sit in your request path. genuinely curious what cache-hit-rate distribution looks like across your own tenant base — drop a number if you've measured it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>costs</category>
      <category>saas</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Your prompt is getting longer without you knowing it (and it's killing your margins)</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 29 May 2026 14:02:11 +0000</pubDate>
      <link>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-2293</link>
      <guid>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-2293</guid>
      <description>&lt;p&gt;I've been looking at LLM billing patterns lately, and there's a silent killer that creeps up on almost every team: prompt inflation.&lt;/p&gt;

&lt;p&gt;When you first build an AI feature, your prompt is tight. Maybe 500 tokens for the system instructions and 100 for the user query. The math looks great. "This will cost us fractions of a cent per call," you tell the team.&lt;/p&gt;

&lt;p&gt;Fast forward three months.&lt;/p&gt;

&lt;p&gt;Someone added conversation history to make the bot "smarter." Another dev added a massive RAG context block because the model hallucinated once. Product asked for formatting instructions, so now the system prompt is a 2,000-word essay. &lt;/p&gt;

&lt;p&gt;Suddenly, your baseline request is 8k tokens. &lt;/p&gt;

&lt;p&gt;The worst part is that user value doesn't scale linearly with prompt size. But your OpenAI bill sure does. If you're running at scale, you're suddenly paying $0.05+ per request for a feature you modeled at $0.005. &lt;/p&gt;

&lt;p&gt;If you only look at your monthly total on the provider dashboard, it looks like you're getting more usage. You think "growth is good" until the Stripe payout hits and you realize your margins are gone.&lt;/p&gt;

&lt;p&gt;You need to track cost &lt;em&gt;per user&lt;/em&gt; and cost &lt;em&gt;per feature&lt;/em&gt;, not just total spend. If you see specific users driving crazy costs, they're probably accumulating massive context windows that you need to truncate.&lt;/p&gt;
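
&lt;p&gt;One way to watch for this, sketched under the assumption that each logged call records a &lt;code&gt;feature&lt;/code&gt; label and its &lt;code&gt;input_tokens&lt;/code&gt;. The helper names and the 2x default are placeholders; the idea is to compare the current average prompt size per feature against a baseline period:&lt;/p&gt;

```javascript
// Average input tokens per feature for a batch of logged calls.
function avgInputTokensByFeature(rows) {
  const acc = new Map();
  for (const row of rows) {
    const a = acc.get(row.feature) || { sum: 0, n: 0 };
    a.sum += row.input_tokens;
    a.n += 1;
    acc.set(row.feature, a);
  }
  const averages = new Map();
  for (const [feature, a] of acc) averages.set(feature, a.sum / a.n);
  return averages;
}

// Features whose average prompt grew past `factor` times the baseline average.
function inflatedFeatures(baselineRows, currentRows, factor = 2) {
  const base = avgInputTokensByFeature(baselineRows);
  const now = avgInputTokensByFeature(currentRows);
  const out = [];
  for (const [feature, current] of now) {
    const baseline = base.get(feature);
    if (baseline === undefined) continue; // new feature, nothing to compare against
    if (current > baseline * factor) {
      out.push({ feature, baseline, current, growth: current / baseline });
    }
  }
  return out;
}
```

&lt;p&gt;The same grouping works with a user ID in place of the feature label.&lt;/p&gt;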

&lt;p&gt;fwiw, I ran into this exact issue, which is why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer&lt;/a&gt;). It's an open-source, proxy-free way to track this stuff. It attributes costs down to the user ID level so you can actually see who is dragging around a 10k token history.&lt;/p&gt;

&lt;p&gt;Stop assuming your prompt is the same size it was on day one. Track it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>You Don't Need Enterprise LLMOps, You Need a Better Dashboard</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 27 May 2026 14:02:57 +0000</pubDate>
      <link>https://dev.to/amedinat/you-dont-need-enterprise-llmops-you-need-a-better-dashboard-25bg</link>
      <guid>https://dev.to/amedinat/you-dont-need-enterprise-llmops-you-need-a-better-dashboard-25bg</guid>
      <description>&lt;p&gt;Token bills are getting out of hand. Everyone knows it. The default response has been to reach for massive, venture-backed "LLMOps" platforms that promise to solve everything. They offer observability, caching, prompt versioning, evaluation, and a dozen other features.&lt;/p&gt;

&lt;p&gt;tbh, for most of us, that's overkill. It's like buying a full-scale CI/CD platform when all you need is a simple cron job.&lt;/p&gt;

&lt;p&gt;The real problem for 90% of devs isn't complex prompt A/B testing or fine-tuning workflows. It's answering one basic question: "Who or what is costing me so much money?"&lt;/p&gt;

&lt;p&gt;Usually, the answer is buried in a CSV file from OpenAI or Anthropic. You end up writing custom scripts to parse it, attribute costs to users, and hope you catch the runaway agent that's stuck in a loop summarizing the same text 1,000 times.&lt;/p&gt;

&lt;p&gt;This isn't an "observability" problem. It's a dashboard problem.&lt;/p&gt;

&lt;p&gt;Before you invest in a complex system, you need a clear view of three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Cost per user:&lt;/strong&gt; Which tenant is burning through your credits?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost per model:&lt;/strong&gt; Is &lt;code&gt;claude-3-opus&lt;/code&gt; really worth 15x more than &lt;code&gt;haiku&lt;/code&gt; for that simple task?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Real-time alerts:&lt;/strong&gt; Can you get a Slack notification when a user's spend hits $100, &lt;em&gt;before&lt;/em&gt; it hits $1,000?&lt;/li&gt;
&lt;/ol&gt;
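
&lt;p&gt;The third item can be sketched as a small in-memory alerter that fires once per user per threshold. &lt;code&gt;createSpendAlerter&lt;/code&gt; and the &lt;code&gt;notify&lt;/code&gt; callback are hypothetical; in practice &lt;code&gt;notify&lt;/code&gt; would post to a Slack webhook, and the running totals would live in a database rather than in process memory:&lt;/p&gt;

```javascript
function createSpendAlerter(thresholdsUsd, notify) {
  const levels = [...thresholdsUsd].sort((a, b) => a - b);
  const spend = new Map(); // userId -> running total in USD
  const fired = new Map(); // userId -> Set of thresholds already notified
  return function record(userId, costUsd) {
    const total = (spend.get(userId) || 0) + costUsd;
    spend.set(userId, total);
    if (!fired.has(userId)) fired.set(userId, new Set());
    const seen = fired.get(userId);
    for (const level of levels) {
      if (total >= level) {
        if (!seen.has(level)) {
          seen.add(level); // each threshold notifies at most once per user
          notify({ userId, level, total });
        }
      }
    }
    return total;
  };
}
```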

&lt;p&gt;Most enterprise tools do this, but they bundle it with features you won't touch for months. And they aren't cheap.&lt;/p&gt;

&lt;p&gt;This is why we built LLMeter as an open-source tool. It's not a massive platform. It's a focused, self-hostable dashboard (Next.js, Supabase) that does one thing well: monitor costs across different providers (OpenAI, Anthropic, DeepSeek, OpenRouter).&lt;/p&gt;

&lt;p&gt;It gives you multi-tenant attribution and budget alerts without the enterprise complexity. You can see which user is calling which model and how much it's costing you, in real-time. AGPL-3.0, so you can host it yourself.&lt;/p&gt;

&lt;p&gt;fwiw, the next time your bill spikes, don't assume you need a revolutionary AI-powered solution. You might just need a better dashboard. Check out the project at llmeter.org.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>The Token Spiral: How One Runaway AI Agent Burned $2,847 in 4 Hours</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 25 May 2026 14:08:57 +0000</pubDate>
      <link>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-269n</link>
      <guid>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-269n</guid>
      <description>&lt;p&gt;traditional monitoring is completely broken when it comes to AI agents. &lt;/p&gt;

&lt;p&gt;we've all seen the dashboards. everything is green. HTTP 200s across the board. p99 latency looks fine. CPU is barely ticking. &lt;/p&gt;

&lt;p&gt;meanwhile, your agent is stuck in an infinite retry loop, burning $80 per iteration because it keeps hallucinating an invalid JSON payload and asking the LLM to fix it. &lt;/p&gt;

&lt;p&gt;this exact failure mode—the "token spiral"—recently burned $2,847 in just 4 hours for a dev team. and they only noticed because their card declined.&lt;/p&gt;

&lt;p&gt;here is why standard observability tools miss this:&lt;br&gt;
they track the container, the request, the database. they don't track the &lt;em&gt;tokens per customer task&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;when an agent starts spiraling, it's making valid API calls to OpenAI or Anthropic. the provider happily returns 200 OK. the latency might be slightly elevated, but not enough to trigger a generic PagerDuty alert. it just looks like heavy usage.&lt;/p&gt;

&lt;p&gt;to catch a token spiral before it bankrupts you, you need runtime cost enforcement. not just a daily digest, but active circuit breakers.&lt;/p&gt;
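
&lt;p&gt;a minimal sketch of such a breaker, assuming the agent loop has a task id and can compute the dollar cost of each call from its usage object. &lt;code&gt;createTaskBreaker&lt;/code&gt; and both limits are illustrative, not an API from LLMeter or any provider:&lt;/p&gt;

```javascript
class BudgetExceededError extends Error {}

function createTaskBreaker(maxUsdPerTask, maxCallsPerTask) {
  const tasks = new Map(); // taskId -> { usd, calls }
  return {
    // Call before each LLM request; throws once the task has used up its budget.
    check(taskId) {
      const t = tasks.get(taskId) || { usd: 0, calls: 0 };
      if (t.usd >= maxUsdPerTask || t.calls >= maxCallsPerTask) {
        throw new BudgetExceededError(
          `task ${taskId} stopped after ${t.calls} calls and $${t.usd.toFixed(2)}`
        );
      }
    },
    // Call after each response with the cost computed from its usage object.
    record(taskId, costUsd) {
      const t = tasks.get(taskId) || { usd: 0, calls: 0 };
      t.usd += costUsd;
      t.calls += 1;
      tasks.set(taskId, t);
      return t;
    },
  };
}
```

&lt;p&gt;call &lt;code&gt;check&lt;/code&gt; before every LLM request and &lt;code&gt;record&lt;/code&gt; after every response; the retry loop stops at the first &lt;code&gt;check&lt;/code&gt; that throws.&lt;/p&gt;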

&lt;p&gt;if you're at an enterprise, you buy Braintrust or Vantage. &lt;br&gt;
if you're building a startup or just vibing in your garage, you can't afford those.&lt;/p&gt;

&lt;p&gt;imo, you need open-source per-customer cost attribution. i built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=agent-token-spiral-silent-killer" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; to solve exactly this problem. it tracks costs by model, by user, by day. you can set budget alerts and actually see which specific tenant is spiraling out of control.&lt;/p&gt;

&lt;p&gt;ymmv, but don't deploy agents without cost circuit breakers. the API providers aren't going to refund you for bad prompts.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>The Token Spiral: How One Runaway AI Agent Burned $2,847 in 4 Hours</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 22 May 2026 14:02:11 +0000</pubDate>
      <link>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-12ok</link>
      <guid>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-12ok</guid>
      <description>&lt;p&gt;traditional monitoring is completely broken when it comes to AI agents. &lt;/p&gt;

&lt;p&gt;we've all seen the dashboards. everything is green. HTTP 200s across the board. p99 latency looks fine. CPU is barely ticking. &lt;/p&gt;

&lt;p&gt;meanwhile, your agent is stuck in an infinite retry loop, burning $80 per iteration because it keeps hallucinating an invalid JSON payload and asking the LLM to fix it. &lt;/p&gt;

&lt;p&gt;this exact failure mode—the "token spiral"—recently burned $2,847 in just 4 hours for a dev team. and they only noticed because their card declined.&lt;/p&gt;

&lt;p&gt;here is why standard observability tools miss this:&lt;br&gt;
they track the container, the request, the database. they don't track the &lt;em&gt;tokens per customer task&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;when an agent starts spiraling, it's making valid API calls to OpenAI or Anthropic. the provider happily returns 200 OK. the latency might be slightly elevated, but not enough to trigger a generic PagerDuty alert. it just looks like heavy usage.&lt;/p&gt;

&lt;p&gt;to catch a token spiral before it bankrupts you, you need runtime cost enforcement. not just a daily digest, but active circuit breakers.&lt;/p&gt;

&lt;p&gt;if you're at an enterprise, you buy Braintrust or Vantage. &lt;br&gt;
if you're building a startup or just vibing in your garage, you can't afford those.&lt;/p&gt;

&lt;p&gt;imo, you need open-source per-customer cost attribution. i built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=agent-token-spiral-silent-killer" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; to solve exactly this problem. it tracks costs by model, by user, by day. you can set budget alerts and actually see which specific tenant is spiraling out of control.&lt;/p&gt;

&lt;p&gt;ymmv, but don't deploy agents without cost circuit breakers. the API providers aren't going to refund you for bad prompts.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>The Overlooked Costs of Your LLM API Calls</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 20 May 2026 14:01:53 +0000</pubDate>
      <link>https://dev.to/amedinat/the-overlooked-costs-of-your-llm-api-calls-21gd</link>
      <guid>https://dev.to/amedinat/the-overlooked-costs-of-your-llm-api-calls-21gd</guid>
      <description>&lt;p&gt;Everyone tracks the cost per token. It's the obvious metric. But if that's all you're watching, you're missing the bigger picture. After spending way too much time sifting through invoices and logs, I've found the real cost sinks are often hidden elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Retries &amp;amp; Timeouts Tax
&lt;/h3&gt;

&lt;p&gt;Your code retries on a 503 from OpenAI. Standard practice, right? But are you tracking the cost of those retries? A temporary outage or a poorly optimized prompt can cause a spike in retries, doubling or tripling the cost of a single user action without you even noticing until the end of the month. It's not just the API cost, either. It's the extended function execution time, the user waiting, the potential for cascading failures.&lt;/p&gt;
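&lt;p&gt;A rough sketch of how you might put a number on that tax, in TypeScript. The &lt;code&gt;Attempt&lt;/code&gt; shape and the function name are assumptions for illustration. Whether a failed attempt is billed at all depends on the provider and the failure mode, so record whatever you were actually charged.&lt;/p&gt;

```typescript
// Hypothetical record of one attempt at a user action (assumed shape).
type Attempt = { actionId: string; costCents: number; succeeded: boolean };

type RetryReport = { totalCents: number; wastedCents: number; attempts: number };

// Sums, per action, what was spent in total and how much of that went to
// attempts that failed and had to be retried.
function retryTax(attempts: Attempt[]): { [actionId: string]: RetryReport } {
  const byAction: { [actionId: string]: RetryReport } = {};
  for (const a of attempts) {
    const entry = byAction[a.actionId] ?? { totalCents: 0, wastedCents: 0, attempts: 0 };
    entry.totalCents += a.costCents;
    if (!a.succeeded) entry.wastedCents += a.costCents;
    entry.attempts += 1;
    byAction[a.actionId] = entry;
  }
  return byAction;
}

// One user action that only succeeded on the third try.
const exampleReport = retryTax([
  { actionId: "summarize-42", costCents: 12, succeeded: false },
  { actionId: "summarize-42", costCents: 12, succeeded: false },
  { actionId: "summarize-42", costCents: 12, succeeded: true },
]);
```

&lt;p&gt;In this example the action cost 36 cents in total, and 24 of those went to attempts that produced nothing.&lt;/p&gt;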

&lt;h3&gt;
  
  
  2. The "Which User Was That?" Problem
&lt;/h3&gt;

&lt;p&gt;A huge bill comes in. You see a spike in &lt;code&gt;gpt-4-turbo&lt;/code&gt; usage last Tuesday. Who caused it? Was it a single power user, a misbehaving script, or a feature getting abused? If every request goes out under the same backend API key with no user tag, you have no idea. Per-user attribution isn't a vanity metric; it's essential. Without it, you can't tell who your most expensive users are, or if a specific user is hammering your service in a way you didn't anticipate.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The "Development vs. Production" Blind Spot
&lt;/h3&gt;

&lt;p&gt;You run tests, you experiment in a staging environment. All those calls are hitting your single API key. How much are you spending on non-production traffic? Is your CI/CD pipeline making a dozen LLM calls on every commit? These small, untracked costs from multiple developers and automated systems add up. They muddy the waters, making it impossible to see your actual production COGS.&lt;/p&gt;




&lt;p&gt;I built LLMeter to get a handle on this. It's an open-source dashboard that lets me see costs per user and per model, and set alerts before the bill gets out of hand. It connects directly to OpenAI, Anthropic, and others, giving me a real-time view of what's actually happening. Fwiw, it's helped me catch more than one runaway script. Check it out if you're tired of flying blind: &lt;a href="https://llmeter.org" rel="noopener noreferrer"&gt;https://llmeter.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Your AI costs are growing faster than your revenue</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 18 May 2026 14:09:10 +0000</pubDate>
      <link>https://dev.to/amedinat/your-ai-costs-are-growing-faster-than-your-revenue-3c6o</link>
      <guid>https://dev.to/amedinat/your-ai-costs-are-growing-faster-than-your-revenue-3c6o</guid>
      <description>&lt;p&gt;Most startups integrating LLMs run into the exact same wall around month 6.&lt;/p&gt;

&lt;p&gt;User growth looks great. ARR is going up. But your OpenAI/Anthropic bill is growing 3x faster than your MRR. Suddenly your gross margins are negative, and you have no idea why.&lt;/p&gt;

&lt;p&gt;I've talked to dozens of founders this year. Almost everyone starts the same way: one global API key, no caching, and a "we'll figure out costs later" mentality. Later usually arrives when the Stripe payouts no longer cover the API bill.&lt;/p&gt;

&lt;p&gt;The problem isn't the model pricing. GPT-4o mini and Claude 3.5 Haiku are cheap. The problem is lack of visibility.&lt;/p&gt;

&lt;p&gt;When a customer complains about an issue, your support team runs an agent loop. When a user uploads a PDF, your RAG pipeline chunks and embeds 50 pages. Who paid for that? Which customer is actually profitable? &lt;/p&gt;

&lt;p&gt;Usually, 20% of your users (the power users) burn 80% of your API budget, while the rest subsidize them. But without per-tenant cost attribution, you can't tell them apart, and you can't adjust your pricing tiers.&lt;/p&gt;

&lt;p&gt;If you don't track costs per user, you are flying blind. &lt;/p&gt;

&lt;p&gt;Start tracking per-tenant usage early. Even logging token counts to your database is better than nothing.&lt;br&gt;
fwiw, if you don't want to build it yourself, I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-15-devto-ai-costs-revenue" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-15-devto-ai-costs-revenue&lt;/a&gt;). It's an open-source dashboard that tracks LLM API costs by model, by user, and by day. Handles OpenAI, Anthropic, DeepSeek, and OpenRouter. &lt;/p&gt;

&lt;p&gt;Stop guessing where your margin went.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>How I track per-customer LLM costs in production</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 15 May 2026 19:27:55 +0000</pubDate>
      <link>https://dev.to/amedinat/how-i-track-per-customer-llm-costs-in-production-4l2l</link>
      <guid>https://dev.to/amedinat/how-i-track-per-customer-llm-costs-in-production-4l2l</guid>
      <description>&lt;p&gt;Tracking LLM costs across an entire app is easy. Finding out &lt;em&gt;which&lt;/em&gt; customer is actually burning through your OpenAI bill? That's a nightmare.&lt;/p&gt;

&lt;p&gt;For a while, we were just eating the cost. You look at the Stripe dashboard, look at the OpenAI invoice, and pray the margins make sense. But when a single power user decides to process 10k documents through Claude on a Saturday night, averages stop mattering real fast.&lt;/p&gt;

&lt;p&gt;tbh, most billing dashboards are useless for this. They tell you you spent $400 yesterday, but not &lt;em&gt;who&lt;/em&gt; spent it.&lt;/p&gt;

&lt;p&gt;Here is what actually works in production, without over-engineering your entire stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Metadata trick
&lt;/h3&gt;

&lt;p&gt;If you're using OpenAI or Anthropic, you can pass metadata with every request. Don't just log it in your db. Pass the &lt;code&gt;user_id&lt;/code&gt; or &lt;code&gt;tenant_id&lt;/code&gt; directly to the provider.&lt;/p&gt;

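&lt;p&gt;OpenAI supports this natively through the &lt;code&gt;user&lt;/code&gt; field. Anthropic takes a &lt;code&gt;metadata.user_id&lt;/code&gt; field in the request body. OpenRouter makes it trivial.&lt;br&gt;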
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`tenant_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;--- Do this. Always.&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple line changes everything. Now you can at least export CSVs from your provider and run a pivot table. But fwiw, doing that manually every week gets old fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building a real ingestion pipeline
&lt;/h3&gt;

&lt;p&gt;We needed real-time budget alerts. If a user on a $19/mo plan burns $5 in API costs, I want a Slack ping before they hit $20.&lt;/p&gt;

&lt;p&gt;We ended up building a pipeline for this: Next.js edge functions for ingestion, Supabase for fast time-series queries, and Inngest for async budget checking. &lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Proxy the request (or capture the token usage post-request).&lt;/li&gt;
&lt;li&gt;Fire an event to Inngest with &lt;code&gt;tenant_id&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, and &lt;code&gt;tokens&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Inngest writes to a Supabase table.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;sum(cost)&lt;/code&gt; &amp;gt; &lt;code&gt;budget_limit&lt;/code&gt; -&amp;gt; alert.&lt;/li&gt;
&lt;/ol&gt;
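&lt;p&gt;Step 4 is the only part with any logic in it. A minimal TypeScript sketch, where the event shape, the per-tenant limit table and the warning ratio are all assumptions for illustration. The real version would run the sum as a query against the usage table, not in memory.&lt;/p&gt;

```typescript
// Assumed shape of the usage event from step 2, with cost already computed.
type UsageEvent = { tenantId: string; model: string; costCents: number };

// Returns the tenants whose summed cost has reached warnRatio of their limit.
function tenantsOverBudget(
  events: UsageEvent[],
  limitCents: { [tenantId: string]: number },
  warnRatio: number,
): string[] {
  const spent: { [tenantId: string]: number } = {};
  for (const e of events) {
    spent[e.tenantId] = (spent[e.tenantId] ?? 0) + e.costCents;
  }
  return Object.keys(spent).filter((tenantId) => {
    const limit = limitCents[tenantId];
    // Tenants without a configured limit are never alerted on.
    if (limit === undefined) return false;
    return (spent[tenantId] ?? 0) >= limit * warnRatio;
  });
}

// A $20 hard limit with a warning at 25% pings once a tenant reaches $5.
const toAlert = tenantsOverBudget(
  [
    { tenantId: "acme", model: "gpt-4o", costCents: 300 },
    { tenantId: "acme", model: "gpt-4o", costCents: 250 },
    { tenantId: "globex", model: "gpt-4o", costCents: 100 },
  ],
  { acme: 2000, globex: 2000 },
  0.25,
);
```

&lt;p&gt;Warning at a fraction of the limit, instead of at the limit itself, is what gets you the Slack ping while there is still room to react.&lt;/p&gt;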

&lt;p&gt;No Stripe integration needed. Just raw usage limits. Ymmv, but decoupling billing from cost-tracking saved us a lot of headaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open sourcing the dashboard
&lt;/h3&gt;

&lt;p&gt;I got tired of rebuilding this for every project, so I built LLMeter. It's an open-source dashboard specifically for this: per-tenant LLM cost tracking.&lt;/p&gt;

&lt;p&gt;It tracks costs per model, per user, per day. Handles budget alerts. Supports OpenAI, Anthropic, DeepSeek, and OpenRouter out of the box. &lt;/p&gt;

&lt;p&gt;Stack is Next.js, Supabase, Inngest. It’s AGPL-3.0 licensed, completely free if you self-host.&lt;br&gt;
We use it internally to keep margins in check. &lt;/p&gt;

&lt;p&gt;Repo is up at &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-per-customer-llm-costs" rel="noopener noreferrer"&gt;llmeter.org&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;How are you all handling multi-tenant API costs right now? Anyone found a better way to enforce per-user limits?&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Agents need control flow because the loop pays the bill</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 13 May 2026 20:55:10 +0000</pubDate>
      <link>https://dev.to/amedinat/agents-need-control-flow-because-the-loop-pays-the-bill-414d</link>
      <guid>https://dev.to/amedinat/agents-need-control-flow-because-the-loop-pays-the-bill-414d</guid>
      <description>&lt;p&gt;last week a post called "agents need control flow, not more prompts" went around hn (thread &lt;a href="https://news.ycombinator.com/item?id=48051562" rel="noopener noreferrer"&gt;48051562&lt;/a&gt;, 588 points, 293 comments). the argument is an engineering one: open-ended prompt loops are unpredictable, deterministic harnesses aren't, so wrap the agent in a flowchart and feed it one step at a time. one commenter described doing exactly that — "wrapped the agent in a loop that kept feeding it the next step in the flowchart."&lt;/p&gt;

&lt;p&gt;all true. but there's a second axis the thread mostly stepped around, and one person said it out loud:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I used to assume they pushed people into the prompt-only workflows because you're paying them for the tokens" — DrewADesign, &lt;a href="https://news.ycombinator.com/item?id=48051562" rel="noopener noreferrer"&gt;same thread&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;an open-ended agent loop isn't just unreliable behavior. it's unbounded &lt;em&gt;spend&lt;/em&gt;. and the part almost nobody is instrumenting: when the invoice arrives, you have one number for a session that made 30 model calls, and no way to tell which of those 30 calls re-read the repo three times and cost $1.40 of the $1.83.&lt;/p&gt;

&lt;h2&gt;
  
  
  the loop got more expensive three times in the last 30 days
&lt;/h2&gt;

&lt;p&gt;the timing here isn't subtle. while the thread was arguing about flowcharts, the per-token price moved underneath everyone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;github copilot&lt;/strong&gt; shifted to a token-credit model where the same Opus turn bills at a 1x / 7.5x / 27x multiplier depending on plan and overage state (HN 47923357). same work, three prices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;anthropic&lt;/strong&gt; A/B-tested removing Claude Code from the Pro tier mid-cycle (HN 47854477). people found out when the harness they'd built around it stopped working — no notice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;openai&lt;/strong&gt; shipped GPT-5.5 on may 8 at roughly 2x GPT-5.4's per-token price (HN 48057209, 213 pts). OpenRouter measured a 49–92% net cost increase even after the 19–34% token-efficiency gain, because efficiency doesn't save you if the price moved further than the efficiency did. a commenter there: "it's also quite the cost lottery and i'm not sure i am comfortable with that."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and the workload itself is variance-heavy &lt;em&gt;before&lt;/em&gt; any repricing. reflex.dev benchmarked computer-use against a structured API on the &lt;em&gt;same&lt;/em&gt; admin-panel task (HN 48024859, 269 comments): 550,976 ± 178,849 input tokens for the agent loop, 12,151 ± 27 for the structured call. the standard deviation on the loop is ~32% of the mean. run the same task twice and you get a 400k–750k token swing — and a matching 750s–1257s wall-clock swing.&lt;/p&gt;

&lt;p&gt;stack those: a workload that already swings ~2x run-to-run, on a per-token price that moved three times in a month, inside a loop whose length is decided by the model and not by you. "average cost per task" is not a number you can budget against. it's a number that was true once, for one run.&lt;/p&gt;

&lt;h2&gt;
  
  
  what we actually measured
&lt;/h2&gt;

&lt;p&gt;i build llmeter — an open-source dashboard for llm api cost tracking — so i spend a lot of time staring at this data. the thing that broke for us first wasn't the price. it was attribution.&lt;/p&gt;

&lt;p&gt;here's the shape of an agentic task: one user request becomes N model calls, where N is the agent's choice, not the user's. without per-call attribution your invoice for that session is a single line. you can't point at the iteration that re-read the repo three times. you can't tell a retried tool-call branch — one that already succeeded — apart from real work. "agents are expensive" stays a feeling.&lt;/p&gt;

&lt;p&gt;once we started recording cost per API call and rolling it up per user / per model / per day, two things fell out fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cost distribution across "the same" task is &lt;strong&gt;bimodal, not normal&lt;/strong&gt;. most runs are cheap; a small tail of runs is 3–5x, and the tail is exactly where the loop decided to do something extra. the mean hides the tail. the p95 is the number that actually predicts your invoice.&lt;/li&gt;
&lt;li&gt;a handful of users — usually on the free tier, usually running something on a cron — accounted for a wildly disproportionate share of token spend. one cron job re-summarizing the same document every hour will quietly outspend your paying customers, and you won't see it in an aggregate provider bill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;none of that is a pricing problem. it's a visibility problem that pricing volatility makes expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  what to do this week (none of this needs a tool)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;find your single most expensive task in production this month.&lt;/strong&gt; not the average — the single worst run. if you can't query that in under five minutes, that's the gap, and closing it is usually a five-line change: log &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;input_tokens&lt;/code&gt;, &lt;code&gt;output_tokens&lt;/code&gt;, &lt;code&gt;cached_tokens&lt;/code&gt; and a &lt;code&gt;task_id&lt;/code&gt; next to every completion call, then &lt;code&gt;GROUP BY task_id ORDER BY cost DESC LIMIT 10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;break out the cached-token line.&lt;/strong&gt; OpenAI, Anthropic and DeepSeek each name cache-hit / cache-miss / cached-input differently, and the cached tier is the one that tends to move most on a repricing. if your cost rollup collapses everything into "input tokens," a price change on the cached tier is invisible until the invoice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;put a &lt;code&gt;task_id&lt;/code&gt; on agent loops and count the iterations.&lt;/strong&gt; the reflex numbers say iteration count is your variance source. if you're not logging "this user request fanned out to 14 model calls," you can't tell a healthy run from a runaway one — and you definitely can't alert on it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;alert on p95, not the mean.&lt;/strong&gt; a Slack ping at "you've spent your monthly average" fires after the damage. a ping at "this task is in the top 5% of cost we've ever recorded for this task type" fires while it's still running. (this is the one spot a tool earns its keep — per-model / per-user / per-day budget alerts — but the logic is simple enough to roll yourself.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;if you route to DeepSeek for cost, write down the date.&lt;/strong&gt; the V4-Pro 75% promo expires 2026-05-31 15:59 UTC and every line item goes 4x at that second. that's not a forecast, it's a calendar entry — model it before may 30, not on june 1.&lt;/li&gt;
&lt;/ol&gt;
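&lt;p&gt;a rough TypeScript version of items 1, 3 and 4, for anyone who'd rather see it than read it. the &lt;code&gt;CallLog&lt;/code&gt; shape and the function names are made up for illustration, costs are in cents, and p95 here is the simple nearest-rank percentile. in production this would more likely be a SQL query over your log table.&lt;/p&gt;

```typescript
// Assumed shape of one logged completion call.
type CallLog = { taskId: string; costCents: number };

type TaskCost = { taskId: string; costCents: number; calls: number };

// In-memory equivalent of GROUP BY task_id ORDER BY cost DESC.
function costPerTask(logs: CallLog[]): TaskCost[] {
  const byTask: { [taskId: string]: TaskCost } = {};
  for (const log of logs) {
    const entry = byTask[log.taskId] ?? { taskId: log.taskId, costCents: 0, calls: 0 };
    entry.costCents += log.costCents;
    entry.calls += 1; // iteration count for this task
    byTask[log.taskId] = entry;
  }
  return Object.values(byTask).sort((a, b) => b.costCents - a.costCents);
}

// Nearest-rank percentile over a list of costs (p = 95 for p95).
function percentile(costs: number[], p: number): number {
  if (costs.length === 0) return 0;
  const sorted = [...costs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1] ?? 0;
}

// True when a running task already costs more than the historical p95.
function isRunaway(currentCents: number, historyCents: number[]): boolean {
  return currentCents > percentile(historyCents, 95);
}
```

&lt;p&gt;the &lt;code&gt;calls&lt;/code&gt; field is the fan-out count from item 3, and &lt;code&gt;isRunaway&lt;/code&gt; is the check you'd run while the task is still going, not after the invoice.&lt;/p&gt;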

&lt;h2&gt;
  
  
  the point
&lt;/h2&gt;

&lt;p&gt;the control-flow argument and the cost argument are the same argument. a deterministic harness is predictable behavior &lt;em&gt;and&lt;/em&gt; a predictable bill. an open-ended loop is "trust me" on both. the harness people are right — but the reason to draw the flowchart isn't only that the agent behaves better. it's that you can finally point at the box that cost you the money.&lt;/p&gt;

&lt;p&gt;if you can't draw the cost shape of your agent's loop, your control flow is just hope.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;i build &lt;a href="https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-loop-pays-the-bill" rel="noopener noreferrer"&gt;llmeter&lt;/a&gt; — an open-source (AGPL-3.0) cost dashboard for OpenAI / Anthropic / DeepSeek / OpenRouter / Mistral / Azure OpenAI: per-model, per-user, per-day, with budget alerts. it's not a proxy and it doesn't sit in your request path — the SDK forwards usage metadata async. the per-call attribution stuff above is the part that made me build it. free tier is one provider / 7-day retention. genuinely want to hear how other people slice agentic cost — what does your rollup key on?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your prompt is getting longer without you knowing it (and it's killing your margins)</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Tue, 12 May 2026 21:39:59 +0000</pubDate>
      <link>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-1b71</link>
      <guid>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-1b71</guid>
      <description>&lt;p&gt;I've been looking at LLM billing patterns lately, and there's a silent killer that creeps up on almost every team: prompt inflation.&lt;/p&gt;

&lt;p&gt;When you first build an AI feature, your prompt is tight. Maybe 500 tokens for the system instructions and 100 for the user query. The math looks great. "This will cost us fractions of a cent per call," you tell the team.&lt;/p&gt;

&lt;p&gt;Fast forward three months.&lt;/p&gt;

&lt;p&gt;Someone added conversation history to make the bot "smarter." Another dev added a massive RAG context block because the model hallucinated once. Product asked for formatting instructions, so now the system prompt is a 2,000-word essay. &lt;/p&gt;

&lt;p&gt;Suddenly, your baseline request is 8k tokens. &lt;/p&gt;

&lt;p&gt;The worst part is that user value doesn't scale linearly with prompt size. But your OpenAI bill sure does. If you're running at scale, you're suddenly paying $0.05+ per request for a feature you modeled at $0.005. &lt;/p&gt;
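&lt;p&gt;To make the arithmetic concrete, here is a small TypeScript sketch. The $5 per million input tokens price is a placeholder assumption, not any provider's actual rate, and output tokens are ignored. The point is only that input cost scales in direct proportion to prompt size.&lt;/p&gt;

```typescript
// Placeholder price, assumed for illustration only.
const ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 5;

// Input-side cost of one request; output tokens are left out on purpose.
function inputCostUsd(promptTokens: number, usdPerMillion: number): number {
  return (promptTokens * usdPerMillion) / 1_000_000;
}

const dayOneTokens = 500 + 100; // system prompt + user query
const inflatedTokens = 8_000; // after history, RAG context and formatting rules

const dayOneCost = inputCostUsd(dayOneTokens, ASSUMED_USD_PER_MILLION_INPUT_TOKENS);
const inflatedCost = inputCostUsd(inflatedTokens, ASSUMED_USD_PER_MILLION_INPUT_TOKENS);

// How many times more each request costs now, whatever the actual price is.
const inflationFactor = inflatedTokens / dayOneTokens;

console.log(dayOneCost, inflatedCost, inflationFactor.toFixed(1));
```

&lt;p&gt;At the assumed price that is $0.003 versus $0.04 per request, roughly 13x, and the multiplier stays the same at any per-token price.&lt;/p&gt;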

&lt;p&gt;If you only look at your monthly total on the provider dashboard, it just looks like you're getting more usage. You think "growth is good" until the Stripe payout hits and you realize your margins are gone.&lt;/p&gt;

&lt;p&gt;You need to track cost &lt;em&gt;per user&lt;/em&gt; and cost &lt;em&gt;per feature&lt;/em&gt;, not just total spend. If you see specific users driving crazy costs, they're probably accumulating massive context windows that you need to truncate.&lt;/p&gt;

&lt;p&gt;fwiw, I ran into this exact issue, which is why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer&lt;/a&gt;). It's an open-source, proxy-free way to track this stuff. It attributes costs down to the user ID level so you can actually see who is dragging around a 10k token history.&lt;/p&gt;

&lt;p&gt;Stop assuming your prompt is the same size it was on day one. Track it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Hidden 43% — How Teams Are Wasting Almost Half Their LLM API Budget</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 08 May 2026 23:20:51 +0000</pubDate>
      <link>https://dev.to/amedinat/the-hidden-43-how-teams-are-wasting-almost-half-their-llm-api-budget-32b5</link>
      <guid>https://dev.to/amedinat/the-hidden-43-how-teams-are-wasting-almost-half-their-llm-api-budget-32b5</guid>
      <description>&lt;p&gt;You look at your provider dashboard and see one number: the total bill. It's like getting an electricity bill that just says "$5,000" with no breakdown of whether it was the AC, the fridge, or someone leaving the lights on all month.&lt;/p&gt;

&lt;p&gt;tbh, most AI startups are flying blind right now. We recently looked into the cost breakdown for several teams and found something crazy: almost 43% of LLM API spend is completely wasted. It’s not about paying for usage; it’s about paying for bad architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s where the leaks are actually happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Retry Storms (34% of waste)&lt;br&gt;
Your agent fails to parse a JSON response, so it retries. And retries. Sometimes 5-10 times in a loop. You aren't just paying for the failure, you are paying for the massive context window sent every single time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Duplicate Calls (85% of apps have this issue)&lt;br&gt;
Multiple users asking the exact same question, or internal systems running the same RAG pipeline on the same document. Without a response cache in front of the provider, you're paying OpenAI to generate identical tokens twice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context Bloat&lt;br&gt;
Sending the entire 50-page document history when the user just asked "what's the summary of page 2". RAG is great, but shoving everything into the prompt "just in case" is burning your runway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrong Model Selection&lt;br&gt;
Using GPT-4o or Claude 3 Opus for simple classification tasks when Haiku or GPT-3.5-turbo would do it for a fraction of the cost.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
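&lt;p&gt;For the duplicate-calls leak, a minimal sketch of an application-side response cache in TypeScript. The class, the key normalization and the synchronous &lt;code&gt;compute&lt;/code&gt; callback are all simplifying assumptions. A real one would be async, bounded in size, and would need an expiry policy.&lt;/p&gt;

```typescript
// Hypothetical in-memory response cache keyed on model + normalized prompt.
class ResponseCache {
  private store: { [key: string]: string } = {};
  hits = 0;
  misses = 0;

  // Trimming and lowercasing is an assumed normalization; exact-match keys
  // are the safer default when casing or whitespace can change the answer.
  private key(model: string, prompt: string): string {
    return model + "\n" + prompt.trim().toLowerCase();
  }

  // Returns the cached response, or calls compute() once and stores the result.
  getOrCompute(model: string, prompt: string, compute: () => string): string {
    const k = this.key(model, prompt);
    const cached = this.store[k];
    if (cached !== undefined) {
      this.hits += 1;
      return cached;
    }
    this.misses += 1;
    const fresh = compute();
    this.store[k] = fresh;
    return fresh;
  }
}

// Two users ask the same question; only the first one reaches the provider.
const dedupeCache = new ResponseCache();
let providerCalls = 0;
const fakeModelCall = (): string => {
  providerCalls += 1; // stand-in for a billed LLM request
  return "summary of page 2";
};
dedupeCache.getOrCompute("gpt-4o", "Summarize page 2", fakeModelCall);
dedupeCache.getOrCompute("gpt-4o", "  summarize page 2 ", fakeModelCall);
```

&lt;p&gt;The hit and miss counters double as a cheap way to measure how much duplicate traffic you actually have before deciding whether a cache is worth running.&lt;/p&gt;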

&lt;p&gt;You can't fix what you can't see. That's exactly why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hidden-43-percent-llm-waste" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hidden-43-percent-llm-waste&lt;/a&gt;). It's an open-source dashboard that gives you per-customer and per-model cost tracking. Stop guessing who or what is draining your API budget.&lt;/p&gt;

&lt;p&gt;Fwiw, just setting up basic budget alerts and seeing the breakdown by tenant usually drops a team's bill by 20% in the first week. Give it a try, it's open source (AGPL-3.0) and you can self-host or use the free tier.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
    </item>
  </channel>
</rss>
