<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John Medina</title>
    <description>The latest articles on DEV Community by John Medina (@amedinat).</description>
    <link>https://dev.to/amedinat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854284%2F73b7fb73-f118-4d37-b5a7-37581d43bd0a.png</url>
      <title>DEV Community: John Medina</title>
      <link>https://dev.to/amedinat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amedinat"/>
    <language>en</language>
    <item>
      <title>How to Stop One Customer From Blowing Up Your Entire LLM Budget</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:06:46 +0000</pubDate>
      <link>https://dev.to/amedinat/how-to-stop-one-customer-from-blowing-up-your-entire-llm-budget-4mac</link>
      <guid>https://dev.to/amedinat/how-to-stop-one-customer-from-blowing-up-your-entire-llm-budget-4mac</guid>
      <description>&lt;p&gt;So your SaaS is finally getting some traction. Congrats. Then you check your OpenAI bill and realize one power user just cost you $500 overnight running reports. Now what?&lt;/p&gt;

&lt;p&gt;This isn't a rare problem. If you're building any multi-tenant AI app, your biggest financial risk is a single user with a runaway script or an unpredictable use case. Standard API rate limits are too crude—they punish all users and can kill legitimate usage. Manually watching your dashboard doesn't scale past your first few customers.&lt;/p&gt;

&lt;p&gt;You need a way to track costs &lt;em&gt;per user&lt;/em&gt; and enforce budgets automatically.&lt;/p&gt;

&lt;p&gt;Most people start by trying to build this logic in-house. You can add a &lt;code&gt;user_id&lt;/code&gt; to your API calls and log the token counts to your own database. Then you run a cron job to aggregate costs and check against a &lt;code&gt;budget&lt;/code&gt; column in your &lt;code&gt;users&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;It works, until it doesn't.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider differences:&lt;/strong&gt; The way you calculate costs for OpenAI is different from Anthropic, and different again for OpenRouter. Your logic gets complex fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timing issues:&lt;/strong&gt; Cron jobs aren't real-time. By the time your job runs, a user could have already gone 2x over their budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; It's another piece of infrastructure you have to build, test, and maintain. That's time you're not spending on your core product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tbh, I got tired of rebuilding this for every project.&lt;/p&gt;

&lt;p&gt;So I built a simple, open-source tool to handle it: &lt;a href="https://llmeter.org" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's a dashboard that sits on top of your existing LLM providers (OpenAI, Anthropic, etc.). You tell it which user made which API call, and it handles the rest.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracks costs per-user, in real-time.&lt;/li&gt;
&lt;li&gt;Lets you set a budget for each user.&lt;/li&gt;
&lt;li&gt;Sends you a webhook or email when a user hits 50%, 90%, or 100% of their budget.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can then use that webhook to programmatically disable that user's access, switch them to a slower model, or just notify them. No more surprise bills.&lt;/p&gt;

&lt;p&gt;It's not a proxy, so it doesn't add latency. It's just a simple, open-source dashboard you can self-host or use the managed version. Fwiw, it solved my own problem. Maybe it'll solve yours too.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Over-editing is a token tax: GPT-5.4 ships 6.5x more diff per fix than Claude Opus 4.6, and your bill notices</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 15 Jun 2026 20:13:04 +0000</pubDate>
      <link>https://dev.to/amedinat/over-editing-is-a-token-tax-gpt-54-ships-65x-more-diff-per-fix-than-claude-opus-46-and-your-4o9j</link>
      <guid>https://dev.to/amedinat/over-editing-is-a-token-tax-gpt-54-ships-65x-more-diff-per-fix-than-claude-opus-46-and-your-4o9j</guid>
      <description>&lt;p&gt;A model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires. Left unconstrained, the extended reasoning gives models more room to 'improve' code that doesn't need improving.&lt;/p&gt;

&lt;p&gt;GPT-5.4 averages 0.395 normalized Levenshtein distance per edit. Claude Opus 4.6 averages 0.060. That is 6.5x more output tokens for the same class of fix, averaged across the benchmark. Pass@1 correctness is similar (0.723–0.912 across models), so the over-editing is paid waste, not paid capability.&lt;/p&gt;

&lt;p&gt;What does 6.5x look like on a bill? A 50-engineer org doing 800 agent edits per engineer per month = 40k edits/mo. At average 500 output tokens per minimal fix × $15/M Opus 4.7 output = $300/mo. At 3,250 output tokens per over-edited fix = $1,950/mo. Delta is $1,650/mo per 40k edits, pure output-token waste with no correctness upside. Scale to your actual traffic.&lt;/p&gt;

&lt;p&gt;Why 'just use a smaller model' isn't the answer: reasoning models got worse (not better) at minimal editing when given more reasoning budget. So you can't fix over-editing by paying more; you fix it by measuring the ratio and routing around it.&lt;/p&gt;

&lt;p&gt;The metric CFOs actually need is over-edit ratio per agent: &lt;code&gt;over_edit_ratio = output_tokens / minimum_required_tokens_to_achieve_green_tests&lt;/code&gt;. Infrastructure to compute this: log full diff of every agent edit, run patch-min on the diff offline, diff size ratio = your over-edit score.&lt;/p&gt;

&lt;p&gt;Instrument over-edit ratio this quarter, treat it as a first-class SLO per agent (budget for &amp;lt;0.2 average), and route high-stakes "minimal" tasks to models whose published over-edit score is &amp;lt;0.1.&lt;/p&gt;

&lt;p&gt;Attribution is the prerequisite for every other cost signal you'll want this year. &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-over-editing-tax-20260423-simon" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; ships per-customer + per-agent attribution today. Over-edit ratio is the first quality-flavored metric where LLMeter's attribution layer is the right home.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Stop Paying for Failed AI Agent Retries</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 05 Jun 2026 14:08:15 +0000</pubDate>
      <link>https://dev.to/amedinat/stop-paying-for-failed-ai-agent-retries-5c1f</link>
      <guid>https://dev.to/amedinat/stop-paying-for-failed-ai-agent-retries-5c1f</guid>
      <description>&lt;p&gt;When your AI agent fails a step and retries, you are paying for the exact same context window over and over again.&lt;/p&gt;

&lt;p&gt;Most devs just stick a try-catch block around their LLM calls and call it a day. But tbh when an agent loops 5 times because of a hallucinated JSON schema, your cost per action just 5x'd. And standard dashboards? They just show a massive spike in "API Usage" without telling you it was a single runaway process.&lt;/p&gt;

&lt;p&gt;I built LLMeter specifically to catch this. It tracks costs per-customer and flags anomalous retry loops in real-time. If you're running agents in production, you need to monitor this or your margins will disappear before you notice.&lt;/p&gt;

&lt;p&gt;You can check it out at &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-stop-paying-for-failed-retries" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-stop-paying-for-failed-retries&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Over-editing is a token tax: GPT-5.4 ships 6.5x more diff per fix than Claude Opus 4.6, and your bill notices</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 03 Jun 2026 14:08:46 +0000</pubDate>
      <link>https://dev.to/amedinat/over-editing-is-a-token-tax-gpt-54-ships-65x-more-diff-per-fix-than-claude-opus-46-and-your-79d</link>
      <guid>https://dev.to/amedinat/over-editing-is-a-token-tax-gpt-54-ships-65x-more-diff-per-fix-than-claude-opus-46-and-your-79d</guid>
      <description>&lt;p&gt;A model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires. Left unconstrained, the extended reasoning gives models more room to 'improve' code that doesn't need improving.&lt;/p&gt;

&lt;p&gt;GPT-5.4 averages 0.395 normalized Levenshtein distance per edit. Claude Opus 4.6 averages 0.060. That is 6.5x more output tokens for the same class of fix, averaged across the benchmark. Pass@1 correctness is similar (0.723–0.912 across models), so the over-editing is paid waste, not paid capability.&lt;/p&gt;

&lt;p&gt;What does 6.5x look like on a bill? A 50-engineer org doing 800 agent edits per engineer per month = 40k edits/mo. At average 500 output tokens per minimal fix × $15/M Opus 4.7 output = $300/mo. At 3,250 output tokens per over-edited fix = $1,950/mo. Delta is $1,650/mo per 40k edits, pure output-token waste with no correctness upside. Scale to your actual traffic.&lt;/p&gt;

&lt;p&gt;Why 'just use a smaller model' isn't the answer: reasoning models got worse (not better) at minimal editing when given more reasoning budget. So you can't fix over-editing by paying more; you fix it by measuring the ratio and routing around it.&lt;/p&gt;

&lt;p&gt;The metric CFOs actually need is over-edit ratio per agent: &lt;code&gt;over_edit_ratio = output_tokens / minimum_required_tokens_to_achieve_green_tests&lt;/code&gt;. Infrastructure to compute this: log full diff of every agent edit, run patch-min on the diff offline, diff size ratio = your over-edit score.&lt;/p&gt;

&lt;p&gt;Instrument over-edit ratio this quarter, treat it as a first-class SLO per agent (budget for &amp;lt;0.2 average), and route high-stakes "minimal" tasks to models whose published over-edit score is &amp;lt;0.1.&lt;/p&gt;

&lt;p&gt;Attribution is the prerequisite for every other cost signal you'll want this year. &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-over-editing-tax-20260423-simon" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; ships per-customer + per-agent attribution today. Over-edit ratio is the first quality-flavored metric where LLMeter's attribution layer is the right home.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Stop sharing one OpenAI key across all your users</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:09:04 +0000</pubDate>
      <link>https://dev.to/amedinat/stop-sharing-one-openai-key-across-all-your-users-3g8g</link>
      <guid>https://dev.to/amedinat/stop-sharing-one-openai-key-across-all-your-users-3g8g</guid>
      <description>&lt;p&gt;I see this pattern everywhere. A startup launches their AI feature, they drop a single &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; in their &lt;code&gt;.env&lt;/code&gt;, and call it a day. &lt;/p&gt;

&lt;p&gt;tbh, it works fine for the first 100 users. Then user 101 figures out how to write a 50-turn loop that triggers your agent to summarize War and Peace every hour, and your Stripe balance goes negative.&lt;/p&gt;

&lt;p&gt;The problem isn't the API cost. The problem is you have zero multi-tenant attribution. When the $5k bill hits, all you see is &lt;code&gt;gpt-4o&lt;/code&gt; usage. You have no idea &lt;em&gt;who&lt;/em&gt; caused it.&lt;/p&gt;

&lt;p&gt;If you are building B2B SaaS, you need to track cost per tenant from day one. Not per endpoint. Not per model. Per tenant. &lt;/p&gt;

&lt;p&gt;How to actually fix this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop using the raw OpenAI client everywhere. Wrap it.&lt;/li&gt;
&lt;li&gt;Inject &lt;code&gt;tenantId&lt;/code&gt; and &lt;code&gt;userId&lt;/code&gt; into every single completion request as metadata or a tag. &lt;/li&gt;
&lt;li&gt;Log the &lt;code&gt;usage&lt;/code&gt; object from the response asynchronously. Don't block the critical path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-multi-tenant-cost-attribution-20260422" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; exactly for this because I got tired of building the same tracking wrapper at every company. It's open source (AGPL), uses Supabase, and tracks cost per user and per day out of the box. ymmv with other tools, but you need &lt;em&gt;something&lt;/em&gt; that gives you a dashboard of which users are burning your margin. &lt;/p&gt;

&lt;p&gt;Stop flying blind.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Cache-hit dispersion is the 7th vendor-risk axis — and the one your invoice can't see</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 01 Jun 2026 14:04:58 +0000</pubDate>
      <link>https://dev.to/amedinat/cache-hit-dispersion-is-the-7th-vendor-risk-axis-and-the-one-your-invoice-cant-see-4b93</link>
      <guid>https://dev.to/amedinat/cache-hit-dispersion-is-the-7th-vendor-risk-axis-and-the-one-your-invoice-cant-see-4b93</guid>
      <description>&lt;p&gt;stavros dropped a comment on hn yesterday that should have ended the per-token billing conversation for anyone running a multi-tenant llm product, but it didn't, because the implication is too inconvenient to take seriously yet (&lt;a href="https://news.ycombinator.com/item?id=48261733" rel="noopener noreferrer"&gt;thread&lt;/a&gt;, 581 pts / 243 c on the deepseek reasonix front page).&lt;/p&gt;

&lt;p&gt;his numbers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"the prices are what equivalent Sonnet usage would have cost, the actual amount I paid was $10. On performance, DeepSeek V4 Pro is comparable to Sonnet for me. 97.27% cache hit rate."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ten dollars actual. two hundred forty one dollars sonnet-equivalent. same task, same model, same pricing card. the only variable: how many of his calls landed on warm cache.&lt;/p&gt;

&lt;p&gt;twenty four x.&lt;/p&gt;

&lt;p&gt;and three more accounts in the same sub-thread confirmed similar dispersion — embedding-shape at 96.4% through a codex bridge, estebarb at 98.6% on opencode, metalspot saying his own steady-state agent loop sits "consistently above 95 once the context is primed." different stacks, different bridges, all converging on the same shape: once you're cache-warm, the per-token sticker price stops describing what you pay.&lt;/p&gt;

&lt;p&gt;if you run a saas where customers consume llm tokens — chat, agent, copilot, anything — that 24× spread is &lt;em&gt;between your tenants on the same model&lt;/em&gt;. and your vendor dashboard is reporting the aggregate. you have no idea which customers actually cost you money.&lt;/p&gt;

&lt;h2&gt;
  
  
  the seven axes, written down in one place
&lt;/h2&gt;

&lt;p&gt;we've been mapping a vendor-risk taxonomy on this blog for about six weeks, one axis per hn front-page incident. people keep asking for the consolidated version, so here it is, with the originating thread for each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;axis&lt;/th&gt;
&lt;th&gt;originating signal&lt;/th&gt;
&lt;th&gt;what it costs you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;acquihire-eol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;helicone → mintlify (2025-08); stainless → anthropic (2026-05-19, &lt;a href="https://news.ycombinator.com/item?id=48182281" rel="noopener noreferrer"&gt;HN 48182281&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;observability vendor goes dark, you migrate under duress&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;multiplier-creep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gemini 3.5 flash repriced ~14× from flash 3 at launch (2026-05-20); copilot 27× credit multiplier; cursor team plan 5×&lt;/td&gt;
&lt;td&gt;unit cost moves under a stable model name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;suspension-without-recourse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;railway / gcp account terminations posted as ask hn (2026-05-21)&lt;/td&gt;
&lt;td&gt;provider kills your service with no human escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;tco opacity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"was my $48k gpu server worth it" devto 2026-05-22&lt;/td&gt;
&lt;td&gt;rent-vs-own per-experiment cost unknowable until you've already chosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;budget-blowout-at-scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;microsoft killing claude code internally because a december pilot ate 2026's ai budget (&lt;a href="https://news.ycombinator.com/item?id=48238979" rel="noopener noreferrer"&gt;HN 48238979&lt;/a&gt;, 285 pts)&lt;/td&gt;
&lt;td&gt;pilot becomes annual line item before kill-switch fires&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;support-function-attrition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;aws "four years and out" (&lt;a href="https://news.ycombinator.com/item?id=48254475" rel="noopener noreferrer"&gt;HN 48254475&lt;/a&gt;, 219 pts) — ex-ossm liaison leaves because human-in-the-loop roles are getting llm-restructured&lt;/td&gt;
&lt;td&gt;the non-fungible human who reverses a wrongful suspension isn't there next quarter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;cache-hit-rate dispersion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;deepseek reasonix (&lt;a href="https://news.ycombinator.com/item?id=48261733" rel="noopener noreferrer"&gt;HN 48261733&lt;/a&gt;, 581 pts, 2026-05-24) — $10 vs $241 / 24× at 97% cache&lt;/td&gt;
&lt;td&gt;unit cost spread of 10-25× between tenants on the same model is invisible until you measure it harness-side&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;axes 1–6 are detectable from billing data. you can look at your invoice and notice the change. they're slow, but they're legible.&lt;/p&gt;

&lt;p&gt;axis 7 is structurally different. the vendor invoice is &lt;strong&gt;already&lt;/strong&gt; weighted by the actual cache hit ratio you got. it doesn't tell you that customer A is paying you $0.04/request at 98% cache while customer B is generating $0.95/request at 61% cache on the same prompts. you see one line: "this month's deepseek bill: $3,500." you don't see that 3 of your 200 tenants generated 70% of the marginal cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  why this kept being invisible
&lt;/h2&gt;

&lt;p&gt;three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;first&lt;/strong&gt;, the vendor pricing card lists per-token rates with a "cached input" line at 10-25% of the regular input rate. it implies cache is a 4-10× discount applied to your invoice in aggregate. it isn't. cache hit rate is a property of the &lt;em&gt;workload&lt;/em&gt;, not the model — and workloads vary across tenants by an order of magnitude on the same product.&lt;/p&gt;

&lt;p&gt;a tenant whose conversation loop reuses 30k tokens of system prompt + scratchpad + tool definitions on every turn lives at 95-98%. a tenant whose product spawns one-shot calls with fresh context per request lives at 0-15%. that's not a 4× spread, that's a 20-30× spread on input tokens alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;second&lt;/strong&gt;, vendor dashboards aggregate. anthropic's usage view, openai's billing dashboard, deepseek's portal — all report cache hit &lt;em&gt;across your entire account&lt;/em&gt;. for a single-tenant product that's fine. for a saas with 200 customers, you're looking at the average of two populations: the warm-cache power users and the cold-cache thrash users. the average tells you neither.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;third&lt;/strong&gt;, "cost equivalence" reporting masks it further. when deepseek tells stavros his calls would have cost $241 on sonnet, that's a &lt;em&gt;sonnet-equivalent&lt;/em&gt; price calculated on token volume. it doesn't subtract anthropic's prompt caching, which also offers 90% discounts on cache hits. the apples-to-apples number on sonnet would be lower than $241 — but stavros wouldn't know that without re-running on sonnet and measuring his own cache rate there too. the sticker comparison is doing what stickers do: hiding the variance.&lt;/p&gt;

&lt;h2&gt;
  
  
  what cache-hit dispersion does to your p&amp;amp;l
&lt;/h2&gt;

&lt;p&gt;let me run the math on a synthetic but realistic shape.&lt;/p&gt;

&lt;p&gt;assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 tenants on your product&lt;/li&gt;
&lt;li&gt;identical pricing: $50/mo per tenant&lt;/li&gt;
&lt;li&gt;identical surface workload: ~2M input tokens per tenant per month&lt;/li&gt;
&lt;li&gt;cache hit rates distributed: 40% of tenants at 90-98%, 40% at 50-70%, 20% at 10-30%&lt;/li&gt;
&lt;li&gt;deepseek v4 pro pricing: $0.27/M input (uncached), $0.07/M cache hit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the warm tenants cost you roughly $0.20-$0.30/month on inference.&lt;br&gt;
the mid tenants cost you $0.80-$1.10/month.&lt;br&gt;
the cold tenants cost you $4.50-$5.20/month.&lt;/p&gt;

&lt;p&gt;on a $50 sticker price these all look profitable. but: your top 20 cold-cache tenants are eating ~$100/month combined while the bottom 40 warm-cache tenants contribute ~$10. one cohort is subsidizing the other and you can't see it because you priced on average tokens.&lt;/p&gt;

&lt;p&gt;now scale that to coding agents — where prompt sizes are 50k–200k and cache hit rate dispersion is even wider — and the math gets worse. an agent loop on a 200k context can cost you $3-$8 per task at low cache or $0.10-$0.20 at high cache. &lt;strong&gt;two orders of magnitude on the same workload on the same model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;this is the structural reason "what's our cogs per customer" stops being answerable from vendor dashboards in 2026. the question isn't "how much did i spend on llms" anymore. it's "how is that spend distributed across the customers who generated it" — and the answer lives in your runtime, not in the bill.&lt;/p&gt;
&lt;h2&gt;
  
  
  instrumenting axis 7 in your stack today
&lt;/h2&gt;

&lt;p&gt;three changes, low effort, you can ship this week:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. capture &lt;code&gt;cache_read_input_tokens&lt;/code&gt; separately on every call
&lt;/h3&gt;

&lt;p&gt;every modern provider returns it. log it. don't roll cached and uncached input together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;attributedCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ledger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;feature_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;cached_input_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache_read_input_tokens&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;cache_hit_ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache_read_input_tokens&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;priceWithCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5-10 lines. it gives you the only field that actually predicts your invoice variance per tenant.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. rollup cache hit rate &lt;strong&gt;per tenant&lt;/strong&gt;, not per account
&lt;/h3&gt;

&lt;p&gt;a daily job that computes &lt;code&gt;tenant_id → median cache_hit_ratio, p10 cache_hit_ratio, n_requests&lt;/code&gt;. that's the table that tells you which customers are on the wrong end of axis 7. if the gap between p10 and median is wider than 30 percentage points inside a single tenant, that tenant has internal workload variance worth understanding — usually a bursty integration or a feature flagged on for them only.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. set a per-tenant marginal cogs threshold, not a total spend alert
&lt;/h3&gt;

&lt;p&gt;alert on "tenant T's marginal cost-per-request crossed $0.50 for the rolling 7-day window," not "this month's bill is up 20%." by the time the second alert fires, the bill is already up 20%. the first one fires while the workload is still in progress and you can intervene — change the model, throttle, route, talk to the customer about what changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  why dashboards from observability vendors will continue to miss this
&lt;/h2&gt;

&lt;p&gt;datadog, new relic, sentry, helicone, langfuse, portkey, langsmith — almost all of them sit at the model layer or the gateway layer. they see calls. they tag calls. they aggregate calls. what they don't do is &lt;strong&gt;own the harness-side attribution&lt;/strong&gt;: which session, which feature, which tenant, which agent loop iteration, which retry — the keys that let you join &lt;code&gt;cache_hit_ratio&lt;/code&gt; to your business object.&lt;/p&gt;

&lt;p&gt;the vendors that ship at the gateway layer have a structural conflict, too: most of them are owned by, acquired by, or routing through the same providers whose pricing card they'd need to interrogate. helicone is mintlify property since 2025-08. langfuse is clickhouse property since 2026-01. stainless is part of anthropic as of 2026-05-19. portkey is mid-acquisition by palo alto networks per their 2026-04-30 release. axis 1 (acquihire-eol) and axis 7 (cache dispersion) collide here: the layer that &lt;em&gt;should&lt;/em&gt; measure dispersion is the layer that's getting acquired by the entity whose dispersion you're trying to measure.&lt;/p&gt;

&lt;p&gt;self-hosting the attribution layer — agpl, in your stack, owned by you — is the only configuration where the answer to "which tenant cost me what" stays answerable across provider acquisitions, pricing changes, and dashboard re-skins.&lt;/p&gt;

&lt;h2&gt;
  
  
  what to do this week, no tool required
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;pull last 30 days of llm calls from your logs.&lt;/strong&gt; group by tenant. compute median cache_hit_ratio per tenant. if you don't have &lt;code&gt;cache_read_input_tokens&lt;/code&gt; logged, add it today — 5 lines per call site.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;find the 5 tenants with the worst cache hit rate.&lt;/strong&gt; what's their workload shape? thrashing context? cold-start agents? you probably have a product problem, not just a cost problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;find the 5 tenants with the best cache hit rate.&lt;/strong&gt; how are they using your product? this is your retention shape. they're the ones priced correctly under your current sticker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;compute marginal cost per active tenant per day.&lt;/strong&gt; divide by your sticker price. anything above 25% is a margin red flag. anything above 100% is a customer you're paying to keep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;write down the calendar.&lt;/strong&gt; the june 15 anthropic agent sdk credit-pool split changes cache accounting semantics for everyone on claude. the deepseek v4-pro 75% promo expires 2026-05-31 15:59 UTC and the post-promo per-token rate quadruples. if you don't already model what your cache-hit distribution does at the new prices, you'll find out on the july invoice.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  the point
&lt;/h2&gt;

&lt;p&gt;per-token billing was a fine abstraction in 2023 when context windows were 4-8k, cache wasn't a line item, and most products had one workload shape. it stopped describing reality somewhere around the time agent loops normalized 200k contexts and providers shipped 90% prompt-cache discounts. the unit cost spread between cache-warm and cache-cold tenants on the same model is now larger than the spread between different models, and nothing in the vendor's billing surface tells you which side any given tenant is on.&lt;/p&gt;

&lt;p&gt;axes 1-6 of the taxonomy say "your vendor will surprise you on the bill." axis 7 says "your vendor will surprise you on which &lt;em&gt;customers&lt;/em&gt; generated the bill" — and that one is worse, because customer p&amp;amp;l drives the product decisions you make from here.&lt;/p&gt;

&lt;p&gt;if your cogs report rolls cache and non-cache together, you don't have an attribution model. you have an average that lies about your distribution.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;i build &lt;a href="https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-cache-hit-axis-7" rel="noopener noreferrer"&gt;llmeter&lt;/a&gt; — open-source (agpl-3.0) attribution at the harness layer. per-tenant rollups, per-feature cogs, cache-hit ratio surfaced as a first-class metric. it's not a proxy and doesn't sit in your request path. genuinely curious what cache-hit-rate distribution looks like across your own tenant base — drop a number if you've measured it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>costs</category>
      <category>saas</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Your prompt is getting longer without you knowing it (and it's killing your margins)</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 29 May 2026 14:02:11 +0000</pubDate>
      <link>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-2293</link>
      <guid>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-2293</guid>
      <description>&lt;p&gt;I've been looking at LLM billing patterns lately, and there's a silent killer that creeps up on almost every team: prompt inflation.&lt;/p&gt;

&lt;p&gt;When you first build an AI feature, your prompt is tight. Maybe 500 tokens for the system instructions and 100 for the user query. The math looks great. "This will cost us fractions of a cent per call," you tell the team.&lt;/p&gt;

&lt;p&gt;Fast forward three months.&lt;/p&gt;

&lt;p&gt;Someone added conversation history to make the bot "smarter." Another dev added a massive RAG context block because the model hallucinated once. Product asked for formatting instructions, so now the system prompt is a 2,000-word essay. &lt;/p&gt;

&lt;p&gt;Suddenly, your baseline request is 8k tokens. &lt;/p&gt;

&lt;p&gt;The worst part is that user value doesn't scale linearly with prompt size. But your OpenAI bill sure does. If you're running at scale, you're suddenly paying $0.05+ per request for a feature you modeled at $0.005. &lt;/p&gt;

&lt;p&gt;If you just look at your monthly total on the provider dashboard, it just looks like you're getting more usage. You think "growth is good" until the Stripe payout hits and you realize your margins are gone.&lt;/p&gt;

&lt;p&gt;You need to track cost &lt;em&gt;per user&lt;/em&gt; and cost &lt;em&gt;per feature&lt;/em&gt;, not just total spend. If you see specific users driving crazy costs, they're probably accumulating massive context windows that you need to truncate.&lt;/p&gt;

&lt;p&gt;fwiw, I ran into this exact issue, which is why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer&lt;/a&gt;). It's an open-source, proxy-free way to track this stuff. It attributes costs down to the user ID level so you can actually see who is dragging around a 10k token history.&lt;/p&gt;

&lt;p&gt;Stop assuming your prompt is the same size it was on day one. Track it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>You Don't Need Enterprise LLMOps, You Need a Better Dashboard</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 27 May 2026 14:02:57 +0000</pubDate>
      <link>https://dev.to/amedinat/you-dont-need-enterprise-llmops-you-need-a-better-dashboard-25bg</link>
      <guid>https://dev.to/amedinat/you-dont-need-enterprise-llmops-you-need-a-better-dashboard-25bg</guid>
      <description>&lt;p&gt;PLATAFORMA: Dev.to&lt;/p&gt;

&lt;p&gt;Token bills are getting out of hand. Everyone knows it. The default response has been to reach for massive, venture-backed "LLMOps" platforms that promise to solve everything. They offer observability, caching, prompt versioning, evaluation, and a dozen other features.&lt;/p&gt;

&lt;p&gt;tbh, for most of us, that's overkill. It's like buying a full-scale CI/CD platform when all you need is a simple cron job.&lt;/p&gt;

&lt;p&gt;The real problem for 90% of devs isn't complex prompt A/B testing or fine-tuning workflows. It's answering one basic question: "Who or what is costing me so much money?"&lt;/p&gt;

&lt;p&gt;Usually, the answer is buried in a CSV file from OpenAI or Anthropic. You end up writing custom scripts to parse it, attribute costs to users, and hope you catch the runaway agent that's stuck in a loop summarizing the same text 1,000 times.&lt;/p&gt;

&lt;p&gt;This isn't an "observability" problem. It's a dashboard problem.&lt;/p&gt;

&lt;p&gt;Before you invest in a complex system, you need a clear view of three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Cost per user:&lt;/strong&gt; Which tenant is burning through your credits?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost per model:&lt;/strong&gt; Is &lt;code&gt;claude-3-opus&lt;/code&gt; really worth 15x more than &lt;code&gt;haiku&lt;/code&gt; for that simple task?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Real-time alerts:&lt;/strong&gt; Can you get a Slack notification when a user's spend hits $100, &lt;em&gt;before&lt;/em&gt; it hits $1,000?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most enterprise tools do this, but they bundle it with features you won't touch for months. And they aren't cheap.&lt;/p&gt;

&lt;p&gt;This is why we built LLMeter as an open-source tool. It's not a massive platform. It's a focused, self-hostable dashboard (Next.js, Supabase) that does one thing well: monitor costs across different providers (OpenAI, Anthropic, DeepSeek, OpenRouter).&lt;/p&gt;

&lt;p&gt;It gives you multi-tenant attribution and budget alerts without the enterprise complexity. You can see which user is calling which model and how much it's costing you, in real-time. AGPL-3.0, so you can host it yourself.&lt;/p&gt;

&lt;p&gt;fwiw, the next time your bill spikes, don't assume you need a revolutionary AI-powered solution. You might just need a better dashboard. Check out the project at llmeter.org.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>The Token Spiral: How One Runaway AI Agent Burned $2,847 in 4 Hours</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 25 May 2026 14:08:57 +0000</pubDate>
      <link>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-269n</link>
      <guid>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-269n</guid>
      <description>&lt;p&gt;traditional monitoring is completely broken when it comes to AI agents. &lt;/p&gt;

&lt;p&gt;we've all seen the dashboards. everything is green. HTTP 200s across the board. p99 latency looks fine. CPU is barely ticking. &lt;/p&gt;

&lt;p&gt;meanwhile, your agent is stuck in an infinite retry loop, burning $80 per iteration because it keeps hallucinating an invalid JSON payload and asking the LLM to fix it. &lt;/p&gt;

&lt;p&gt;this exact failure mode—the "token spiral"—recently burned $2,847 in just 4 hours for a dev team. and they only noticed because their card declined.&lt;/p&gt;

&lt;p&gt;here is why standard observability tools miss this:&lt;br&gt;
they track the container, the request, the database. they don't track the &lt;em&gt;tokens per customer task&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;when an agent starts spiraling, it's making valid API calls to OpenAI or Anthropic. the provider happily returns 200 OK. the latency might be slightly elevated, but not enough to trigger a generic PagerDuty alert. it just looks like heavy usage.&lt;/p&gt;

&lt;p&gt;to catch a token spiral before it bankrupts you, you need runtime cost enforcement. not just a daily digest, but active circuit breakers.&lt;/p&gt;

&lt;p&gt;if you're at an enterprise, you buy Braintrust or Vantage. &lt;br&gt;
if you're building a startup or just vibing in your garage, you can't afford those.&lt;/p&gt;

&lt;p&gt;imo, you need open-source per-customer cost attribution. i built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=agent-token-spiral-silent-killer" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; to solve exactly this problem. it tracks costs by model, by user, by day. you can set budget alerts and actually see which specific tenant is spiraling out of control.&lt;/p&gt;

&lt;p&gt;ymmv, but don't deploy agents without cost circuit breakers. the API providers aren't going to refund you for bad prompts.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>The Token Spiral: How One Runaway AI Agent Burned $2,847 in 4 Hours</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 22 May 2026 14:02:11 +0000</pubDate>
      <link>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-12ok</link>
      <guid>https://dev.to/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-12ok</guid>
      <description>&lt;p&gt;traditional monitoring is completely broken when it comes to AI agents. &lt;/p&gt;

&lt;p&gt;we've all seen the dashboards. everything is green. HTTP 200s across the board. p99 latency looks fine. CPU is barely ticking. &lt;/p&gt;

&lt;p&gt;meanwhile, your agent is stuck in an infinite retry loop, burning $80 per iteration because it keeps hallucinating an invalid JSON payload and asking the LLM to fix it. &lt;/p&gt;

&lt;p&gt;this exact failure mode—the "token spiral"—recently burned $2,847 in just 4 hours for a dev team. and they only noticed because their card declined.&lt;/p&gt;

&lt;p&gt;here is why standard observability tools miss this:&lt;br&gt;
they track the container, the request, the database. they don't track the &lt;em&gt;tokens per customer task&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;when an agent starts spiraling, it's making valid API calls to OpenAI or Anthropic. the provider happily returns 200 OK. the latency might be slightly elevated, but not enough to trigger a generic PagerDuty alert. it just looks like heavy usage.&lt;/p&gt;

&lt;p&gt;to catch a token spiral before it bankrupts you, you need runtime cost enforcement. not just a daily digest, but active circuit breakers.&lt;/p&gt;

&lt;p&gt;if you're at an enterprise, you buy Braintrust or Vantage. &lt;br&gt;
if you're building a startup or just vibing in your garage, you can't afford those.&lt;/p&gt;

&lt;p&gt;imo, you need open-source per-customer cost attribution. i built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=agent-token-spiral-silent-killer" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; to solve exactly this problem. it tracks costs by model, by user, by day. you can set budget alerts and actually see which specific tenant is spiraling out of control.&lt;/p&gt;

&lt;p&gt;ymmv, but don't deploy agents without cost circuit breakers. the API providers aren't going to refund you for bad prompts.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>The Overlooked Costs of Your LLM API Calls</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 20 May 2026 14:01:53 +0000</pubDate>
      <link>https://dev.to/amedinat/the-overlooked-costs-of-your-llm-api-calls-21gd</link>
      <guid>https://dev.to/amedinat/the-overlooked-costs-of-your-llm-api-calls-21gd</guid>
      <description>&lt;p&gt;Everyone tracks the cost per token. It's the obvious metric. But if that's all you're watching, you're missing the bigger picture. After spending way too much time sifting through invoices and logs, I've found the real cost sinks are often hidden elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Retries &amp;amp; Timeouts Tax
&lt;/h3&gt;

&lt;p&gt;Your code retries on a 503 from OpenAI. Standard practice, right? But are you tracking the cost of those retries? A temporary outage or a poorly optimized prompt can cause a spike in retries, doubling or tripling the cost of a single user action without you even noticing until the end of the month. It's not just the API cost, either. It's the extended function execution time, the user waiting, the potential for cascading failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The "Which User Was That?" Problem
&lt;/h3&gt;

&lt;p&gt;A huge bill comes in. You see a spike in &lt;code&gt;gpt-4-turbo&lt;/code&gt; usage last Tuesday. Who caused it? Was it a single power user, a misbehaving script, or a feature getting abused? If you're just passing an API key from your backend, you have no idea. Per-user attribution isn't a vanity metric; it's essential. Without it, you can't tell who your most expensive users are, or if a specific user is hammering your service in a way you didn't anticipate.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The "Development vs. Production" Blind Spot
&lt;/h3&gt;

&lt;p&gt;You run tests, you experiment in a staging environment. All those calls are hitting your single API key. How much are you spending on non-production traffic? Is your CI/CD pipeline making a dozen LLM calls on every commit? These small, untracked costs from multiple developers and automated systems add up. They muddy the waters, making it impossible to see your actual production COGS.&lt;/p&gt;




&lt;p&gt;I built LLMeter to get a handle on this. It's an open-source dashboard that helps me see costs per user, per model, and set alerts before the bill gets out of hand. It directly connects to OpenAI, Anthropic, and others, giving me a real-time view of what's actually happening. Fwiw, it's helped me catch more than one runaway script. Check it out if you're tired of flying blind: &lt;a href="https://llmeter.org" rel="noopener noreferrer"&gt;https://llmeter.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Your AI costs are growing faster than your revenue</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 18 May 2026 14:09:10 +0000</pubDate>
      <link>https://dev.to/amedinat/your-ai-costs-are-growing-faster-than-your-revenue-3c6o</link>
      <guid>https://dev.to/amedinat/your-ai-costs-are-growing-faster-than-your-revenue-3c6o</guid>
      <description>&lt;p&gt;Most startups integrating LLMs run into the exact same wall around month 6.&lt;/p&gt;

&lt;p&gt;User growth looks great. ARR is going up. But your OpenAI/Anthropic bill is growing 3x faster than your MRR. Suddenly your gross margins are negative, and you have no idea why.&lt;/p&gt;

&lt;p&gt;I've talked to dozens of founders this year. Almost everyone starts the same way: one global API key, no caching, and a "we'll figure out costs later" mentality. Later is usually when Stripe fails to cover the API bill.&lt;/p&gt;

&lt;p&gt;The problem isn't the model pricing. GPT-4o mini and Claude 3.5 Haiku are cheap. The problem is lack of visibility.&lt;/p&gt;

&lt;p&gt;When a customer complains about an issue, your support team runs an agent loop. When a user uploads a PDF, your RAG pipeline chunks and embeds 50 pages. Who paid for that? Which customer is actually profitable? &lt;/p&gt;

&lt;p&gt;Usually, 20% of your power users are burning 80% of your API budget, while the rest are subsidizing them. But without per-tenant cost attribution, you can't tell them apart. You can't adjust your pricing tiers.&lt;/p&gt;

&lt;p&gt;If you don't track costs per user, you are flying blind. &lt;/p&gt;

&lt;p&gt;Start tracking per-tenant usage early. Even logging token counts to your database is better than nothing.&lt;br&gt;
fwiw, if you don't want to build it yourself, I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-15-devto-ai-costs-revenue" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-15-devto-ai-costs-revenue&lt;/a&gt;). It's an open-source dashboard that tracks LLM API costs by model, by user, and by day. Handles OpenAI, Anthropic, DeepSeek, and OpenRouter. &lt;/p&gt;

&lt;p&gt;Stop guessing where your margin went.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
  </channel>
</rss>
