<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John Medina</title>
    <description>The latest articles on DEV Community by John Medina (@amedinat).</description>
    <link>https://dev.to/amedinat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854284%2F73b7fb73-f118-4d37-b5a7-37581d43bd0a.png</url>
      <title>DEV Community: John Medina</title>
      <link>https://dev.to/amedinat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amedinat"/>
    <language>en</language>
    <item>
      <title>Your prompt is getting longer without you knowing it (and it's killing your margins)</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Tue, 12 May 2026 21:39:59 +0000</pubDate>
      <link>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-1b71</link>
      <guid>https://dev.to/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-1b71</guid>
      <description>&lt;p&gt;I've been looking at LLM billing patterns lately, and there's a silent killer that creeps up on almost every team: prompt inflation.&lt;/p&gt;

&lt;p&gt;When you first build an AI feature, your prompt is tight. Maybe 500 tokens for the system instructions and 100 for the user query. The math looks great. "This will cost us fractions of a cent per call," you tell the team.&lt;/p&gt;

&lt;p&gt;Fast forward three months.&lt;/p&gt;

&lt;p&gt;Someone added conversation history to make the bot "smarter." Another dev added a massive RAG context block because the model hallucinated once. Product asked for formatting instructions, so now the system prompt is a 2,000-word essay. &lt;/p&gt;

&lt;p&gt;Suddenly, your baseline request is 8k tokens. &lt;/p&gt;

&lt;p&gt;The worst part is that user value doesn't scale linearly with prompt size. But your OpenAI bill sure does. If you're running at scale, you're suddenly paying $0.05+ per request for a feature you modeled at $0.005. &lt;/p&gt;

&lt;p&gt;If you only look at your monthly total on the provider dashboard, it looks like you're getting more usage. You think "growth is good" until the Stripe payout hits and you realize your margins are gone.&lt;/p&gt;

&lt;p&gt;You need to track cost &lt;em&gt;per user&lt;/em&gt; and cost &lt;em&gt;per feature&lt;/em&gt;, not just total spend. If you see specific users driving crazy costs, they're probably accumulating massive context windows that you need to truncate.&lt;/p&gt;
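
&lt;p&gt;If you want to see this before buying any tooling, the core of per-user, per-feature attribution is small. Here's a minimal sketch; the rates, map, and helper name are mine for illustration (pull current pricing for whatever models you actually run):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal per-user / per-feature cost ledger.
// RATES values are illustrative placeholders, not current pricing.
const RATES: Record&lt;string, { input: number; output: number }&gt; = {
  'gpt-4o': { input: 2.5, output: 10 }, // USD per 1M tokens (example)
};

const ledger = new Map&lt;string, number&gt;(); // key: `${userId}:${feature}`

function recordCost(
  userId: string,
  feature: string,
  model: string,
  promptTokens: number,
  completionTokens: number,
): void {
  const rate = RATES[model];
  if (!rate) return; // unknown model: decide whether to fail open or closed
  const usd = (promptTokens * rate.input + completionTokens * rate.output) / 1_000_000;
  const key = `${userId}:${feature}`;
  ledger.set(key, (ledger.get(key) ?? 0) + usd);
}

// Feed it the usage object every response already includes, e.g.:
// recordCost(userId, 'chat', 'gpt-4o', resp.usage.prompt_tokens, resp.usage.completion_tokens);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once that ledger exists, prompt inflation shows up as a per-feature trend line instead of a mystery in the monthly total.&lt;/p&gt;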

&lt;p&gt;fwiw, I ran into this exact issue, which is why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer&lt;/a&gt;). It's an open-source, proxy-free way to track this stuff. It attributes costs down to the user ID level so you can actually see who is dragging around a 10k token history.&lt;/p&gt;

&lt;p&gt;Stop assuming your prompt is the same size it was on day one. Track it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Hidden 43% — How Teams Are Wasting Almost Half Their LLM API Budget</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 08 May 2026 23:20:51 +0000</pubDate>
      <link>https://dev.to/amedinat/the-hidden-43-how-teams-are-wasting-almost-half-their-llm-api-budget-32b5</link>
      <guid>https://dev.to/amedinat/the-hidden-43-how-teams-are-wasting-almost-half-their-llm-api-budget-32b5</guid>
      <description>&lt;p&gt;You look at your provider dashboard and see one number: the total bill. It's like getting an electricity bill that just says "$5,000" with no breakdown of whether it was the AC, the fridge, or someone leaving the lights on all month.&lt;/p&gt;

&lt;p&gt;tbh, most AI startups are flying blind right now. We recently looked into the cost breakdown for several teams and found something crazy: almost 43% of LLM API spend is completely wasted. It’s not about paying for usage; it’s about paying for bad architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s where the leaks are actually happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Retry Storms (34% of waste)&lt;br&gt;
Your agent fails to parse a JSON response, so it retries. And retries. Sometimes 5-10 times in a loop. You aren't just paying for the failure; you are paying for the massive context window sent every single time. (A minimal retry-cap sketch follows this list.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Duplicate Calls (85% of apps have this issue)&lt;br&gt;
Multiple users asking the exact same question, or internal systems running the same RAG pipeline on the same document. Without caching at the provider level, you're paying OpenAI to generate identical tokens twice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context Bloat&lt;br&gt;
Sending the entire 50-page document history when the user just asked "what's the summary of page 2". RAG is great, but shoving everything into the prompt "just in case" is burning your runway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrong Model Selection&lt;br&gt;
Using GPT-4o or Claude 3 Opus for simple classification tasks when Haiku or GPT-3.5-turbo would do it for a fraction of the cost.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
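
&lt;p&gt;For the retry-storm leak, the fix is a hard attempt ceiling rather than "retry until it parses." A minimal sketch of the pattern (names and values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Bounded retry with exponential backoff: a bad parse can fail at most
// maxAttempts times instead of looping 40 times. Values are examples.
async function callWithCap&lt;T&gt;(
  fn: () =&gt; Promise&lt;T&gt;,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise&lt;T&gt; {
  let lastErr: unknown;
  for (let attempt = 1; attempt &lt;= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Backoff so retries don't stampede the API with full context each time.
      await new Promise((r) =&gt; setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  throw lastErr; // surface the failure instead of silently burning tokens
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;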

&lt;p&gt;You can't fix what you can't see. That's exactly why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hidden-43-percent-llm-waste" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hidden-43-percent-llm-waste&lt;/a&gt;). It's an open-source dashboard that gives you per-customer and per-model cost tracking. Stop guessing who or what is draining your API budget.&lt;/p&gt;

&lt;p&gt;Fwiw, just setting up basic budget alerts and seeing the breakdown by tenant usually drops a team's bill by 20% in the first week. Give it a try, it's open source (AGPL-3.0) and you can self-host or use the free tier.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
    </item>
    <item>
      <title>The week your AI coding tier got smaller</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 06 May 2026 23:15:24 +0000</pubDate>
      <link>https://dev.to/amedinat/the-week-your-ai-coding-tier-got-smaller-1a2j</link>
      <guid>https://dev.to/amedinat/the-week-your-ai-coding-tier-got-smaller-1a2j</guid>
      <description>&lt;p&gt;In 48 hours this week, two of the biggest AI coding platforms confirmed the same thing: your unlimited subscription was never sustainable for how you actually use it. The provider will be the one who decides when to cut you off.&lt;/p&gt;

&lt;p&gt;Anthropic silently removed Claude Code from Pro in a "2% A/B test" (later reversed). Their Head of Growth justified it by saying "usage has changed a lot and our current plans weren't built for this." GitHub paused new Copilot Pro signups and dropped Opus from Pro entirely.&lt;/p&gt;

&lt;p&gt;One dev on HN said sending 3-4 messages to Opus 4.7 blew through their $20 plan limits and consumed $10 of extra usage.&lt;/p&gt;

&lt;p&gt;Simon Willison framed the trust break: "Should I be taking a bet on Claude Code if I know that they might 5x the minimum price of the product?"&lt;/p&gt;

&lt;p&gt;The structural takeaway for any team shipping AI features: the invoice is the governance boundary, not the plan page. The provider's unit economics are now public. Every heavy user becomes a small loss once they exceed the pricing assumption, and no vendor has found the pricing floor yet.&lt;/p&gt;

&lt;p&gt;Teams that cannot meter their own spend per-customer, per-agent, per-task are now one pricing memo away from being unprofitable overnight.&lt;/p&gt;

&lt;p&gt;The concrete fix: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;track your tokens (not the invoice's)&lt;/li&gt;
&lt;li&gt;use per-customer attribution (so you know whose usage is killing you)&lt;/li&gt;
&lt;li&gt;implement hard budget caps at the agent level. Alerts don't stop a runaway loop.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is exactly what LLM Budget Guard is being built for. &lt;/p&gt;

&lt;p&gt;Here is how a wrapper around the SDK produces per-customer token attribution without waiting for invoice day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;wrapOpenAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llmeter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrapOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prod-cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tenantId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cust_883&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Cost is now tracked per customer automatically&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Generate report&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track your costs early. Check out &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ai-coding-tier-collapse" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; to get started with attribution.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Bun Is Porting from Zig to Rust — Here's Why That Matters If You Run LLM Workloads</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 06 May 2026 17:11:01 +0000</pubDate>
      <link>https://dev.to/amedinat/bun-is-porting-from-zig-to-rust-heres-why-that-matters-if-you-run-llm-workloads-3kgo</link>
      <guid>https://dev.to/amedinat/bun-is-porting-from-zig-to-rust-heres-why-that-matters-if-you-run-llm-workloads-3kgo</guid>
      <description>&lt;p&gt;This week Bun published its internal &lt;a href="https://github.com/oven-sh/bun/commit/46d3bc29f270fa881dd5730ef1549e88407701a5" rel="noopener noreferrer"&gt;Zig→Rust porting guide&lt;/a&gt; — a signal that the runtime is migrating core components from Zig to Rust. The HN thread hit 700+ points in 24 hours.&lt;/p&gt;

&lt;p&gt;The Rust move is a reasonable technical bet. Rust's ecosystem maturity, tooling, and contributor onboarding advantages at Bun's scale are real. But for teams running LLM workloads in production, the migration surfaces a question worth thinking about: &lt;em&gt;what does it mean that your JavaScript runtime is now an Anthropic asset?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background: December 2025
&lt;/h2&gt;

&lt;p&gt;Anthropic acquired Bun in December 2025. For most developers this was a minor footnote. For teams running AI pipelines, it quietly changed the vendor dependency structure.&lt;/p&gt;

&lt;p&gt;If you're using Bun as your JS runtime, Claude Code as your AI CLI, and Anthropic as your LLM provider, all three now share the same balance sheet. The Rust port doesn't change that — it makes Bun faster and easier to contribute to, but it doesn't change who owns it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the shared balance sheet matters at the billing layer
&lt;/h2&gt;

&lt;p&gt;The last 90 days produced 6 separate LLM billing incidents across the major providers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Incident&lt;/th&gt;
&lt;th&gt;HN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Anthropic Pro A/B — silent tier reclassification&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47854477" rel="noopener noreferrer"&gt;47854477&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Cursor per-token surprise — $200→$500 overnight&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47847849" rel="noopener noreferrer"&gt;47847849&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;GitHub Copilot 7.5x billing multiplier&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=192911" rel="noopener noreferrer"&gt;#192911&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;GitHub Copilot 27x billing trap&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47923357" rel="noopener noreferrer"&gt;47923357&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;OpenClaw trigger-word charges&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47963204" rel="noopener noreferrer"&gt;47963204&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;HERMES.md rate reclassification&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47952722" rel="noopener noreferrer"&gt;47952722&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these were announced in advance. All of them hit teams that thought they had controls in place — dashboards, alerts, rate limits set through the vendor's own UI.&lt;/p&gt;

&lt;p&gt;The pattern: vendor-side controls fail at the worst moment because they live inside the system making the billing decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What out-of-band enforcement looks like
&lt;/h2&gt;

&lt;p&gt;The durable fix isn't switching runtimes or providers. It's moving cost enforcement &lt;em&gt;outside&lt;/em&gt; the vendor stack.&lt;/p&gt;

&lt;p&gt;Enforcement that runs before the API call goes out — synchronously, without a network round-trip — can't be overridden by a policy update on the vendor side. The call either goes out or it doesn't. No spend is committed until your own cap logic says it's safe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;wrapAnthropic&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@simplifai/budget-guard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;global_cap_per_day_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;per_customer_cap_per_day_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrapAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// BudgetCapError thrown before the call if cap exceeded&lt;/span&gt;
&lt;span class="c1"&gt;// The call never goes out. No spend incurred.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a cap is hit you get structured data — &lt;code&gt;scope&lt;/code&gt;, &lt;code&gt;spend_usd&lt;/code&gt;, &lt;code&gt;cap_usd&lt;/code&gt;, &lt;code&gt;retry_after&lt;/code&gt; — not a Slack alert 10 minutes after the damage is done.&lt;/p&gt;
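
&lt;p&gt;That makes the failure handleable at the call site. A sketch, continuing the wrapped client above and assuming &lt;code&gt;BudgetCapError&lt;/code&gt; is exported and carries the fields just named (the post doesn't show its exact shape; the model id is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { BudgetCapError } from '@simplifai/budget-guard';

try {
  await anthropic.messages.create({
    model: 'claude-3-5-sonnet-latest', // placeholder model id
    max_tokens: 1024,
    messages: [{ role: 'user', content: 'Generate report' }],
  });
} catch (err) {
  if (err instanceof BudgetCapError) {
    // Degrade deliberately: queue the job, serve a cached answer, or pause the tenant.
    console.warn(`${err.scope} cap hit: $${err.spend_usd} of $${err.cap_usd}, retry in ${err.retry_after}s`);
  } else {
    throw err;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;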

&lt;h2&gt;
  
  
  The Rust port is good news
&lt;/h2&gt;

&lt;p&gt;Genuinely. Faster startup, better memory safety, easier for external contributors. If you're building on Bun, the migration is a net positive for stability.&lt;/p&gt;

&lt;p&gt;The vendor dependency concern is a separate question from runtime quality. Bun being well-engineered in Rust doesn't change that it's now part of the same vendor stack as your LLM calls. For teams where that matters, the answer isn't a different runtime — it's enforcement that lives outside all of them.&lt;/p&gt;




&lt;p&gt;SDK (TypeScript, zero deps, 29 tests): &lt;code&gt;npm install @simplifai/budget-guard&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Managed version waitlist → &lt;a href="https://simplifai.tools/validate/budget-guard" rel="noopener noreferrer"&gt;simplifai.tools/validate/budget-guard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>27 days to the DeepSeek V4-Pro cliff: what a 4x price jump looks like in production</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Tue, 05 May 2026 13:48:06 +0000</pubDate>
      <link>https://dev.to/amedinat/27-days-to-the-deepseek-v4-pro-cliff-what-a-4x-price-jump-looks-like-in-production-4j4c</link>
      <guid>https://dev.to/amedinat/27-days-to-the-deepseek-v4-pro-cliff-what-a-4x-price-jump-looks-like-in-production-4j4c</guid>
      <description>&lt;p&gt;So here's the thing about the deepseek v4-pro pricing schedule that has been making the rounds on hn this week (thread 48002136, 494 pts / 198 comments) — the 75% promotional discount everyone has been routing to is &lt;strong&gt;calendar-deterministic&lt;/strong&gt;. it expires 2026-05-31 at 15:59 UTC. that's 27 days from today. on june 1 the same agent that costs you $87 in tokens will cost you $348.&lt;/p&gt;

&lt;p&gt;most teams i've talked to about this have a vague awareness ("yeah we've been saving with deepseek") and zero plan for the cliff. nobody has a runbook for what happens when a long-running agent session starts before the price flip and finishes after it. nobody has tested whether their billing dashboard refreshes mid-session or only on the next invoice. and nobody has checked if their fallback provider is actually configured to take over at 16:00 UTC on the 31st.&lt;/p&gt;

&lt;p&gt;this isn't a "scandal" the way openclaw or hermes.md or the copilot 27x pivot were. deepseek published the schedule openly. but the failure mode is identical to a stripe price-update event delivered with no buyer-side notification: you've been routing 60-80% of agent traffic to v4-pro since the 04-26 launch, the meter ticks up at 4x without an alert, and the line on june's bill is the first thing you see.&lt;/p&gt;

&lt;h2&gt;
  
  
  what's actually happening
&lt;/h2&gt;

&lt;p&gt;the published schedule (api-docs.deepseek.com/quick_start/pricing) is unambiguous:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;line item&lt;/th&gt;
&lt;th&gt;promo (now → may 31 15:59 UTC)&lt;/th&gt;
&lt;th&gt;post-cliff (june 1 onward)&lt;/th&gt;
&lt;th&gt;multiple&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;output tokens&lt;/td&gt;
&lt;td&gt;$0.87 / 1M&lt;/td&gt;
&lt;td&gt;$3.48 / 1M&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;input cache miss&lt;/td&gt;
&lt;td&gt;$0.435 / 1M&lt;/td&gt;
&lt;td&gt;$1.74 / 1M&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;input cache hit&lt;/td&gt;
&lt;td&gt;$0.03625 / 1M&lt;/td&gt;
&lt;td&gt;$0.145 / 1M&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;every line is exactly 4x. the cache-hit tier is the one that catches people — at $0.036/M it's basically free and most teams stopped instrumenting after the first week. at $0.145/M it's still cheap but cache-hit volume is usually 5-10x cache-miss volume, so the absolute dollar delta on cache-hit can dwarf the headline output-token delta.&lt;/p&gt;

&lt;p&gt;the deepclaw thread (claude code → deepseek wrapper, hit hn front page same week) is full of "$0.06 per task with v4-pro" anecdotes. every one of those becomes $0.24 on june 1. the $30/mo dev-side budget becomes $120. the $4k/mo agent fleet becomes $16k. nothing on the deepseek side of the wire changes — same model, same endpoint, same response — just a 4x multiplier on the meter.&lt;/p&gt;

&lt;h2&gt;
  
  
  what we learned doing the math on our own usage
&lt;/h2&gt;

&lt;p&gt;i ran our last 30 days of llm spend through the cliff scenario assuming we keep the same routing distribution. three observations worth sharing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. cache-hit dominance flips the headline number.&lt;/strong&gt; our v4-pro mix is ~70% cache-hit / 25% cache-miss / 5% output-heavy. on the promo schedule cache-hit is ~6% of total spend; post-cliff it's ~22% (total spend here spans all our providers: the deepseek lines quadruple while the anthropic/openai lines stay flat, which is why the shares move at all). the "output tokens went 4x" story misses where the dollars actually sit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. mid-session pricing flips are a real failure mode.&lt;/strong&gt; we have agent runs that span 4-6 hours. on may 31 a session that starts at 12:00 UTC and finishes at 18:00 UTC will see the promo rate for 3h59m of tokens and the post-cliff rate for the rest. nothing in the deepseek api response surfaces which tier the request was billed at — you find out from the invoice. for a long-running summarization or repo-analysis job that's a $0.40 → $1.60 swing on a single run, but the agent has no way to detect it and slow down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. multi-provider failover is necessary but not sufficient.&lt;/strong&gt; the obvious move is "fail over to anthropic / openai if deepseek crosses some price threshold." but the threshold isn't price (deepseek hasn't changed prices, the schedule has) — it's calendar. you have to encode "after 2026-05-31 15:59 UTC, if i'm still routing to v4-pro, stop" in the routing layer itself. and you have to test it before the day, because if your e2e test environment hits real apis on june 1 you've lost the dry run.&lt;/p&gt;
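
&lt;p&gt;to make observation 3 concrete, the routing guard is small enough to sketch. this is the shape of it, not any particular library's api; model ids are examples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// calendar-deterministic cutoff: the gate is a timestamp constant, not a price check.
const V4_PRO_CUTOFF_MS = Date.parse('2026-05-31T15:59:00Z');

function pickModel(now: number = Date.now()): string {
  // before the cliff: promo-priced v4-pro. after: the deliberate fallback.
  return now &lt; V4_PRO_CUTOFF_MS ? 'deepseek-v4-pro' : 'deepseek-chat';
}

// dry-runnable today with a fake clock, so june 1 isn't your first test:
console.assert(pickModel(V4_PRO_CUTOFF_MS - 1) === 'deepseek-v4-pro');
console.assert(pickModel(V4_PRO_CUTOFF_MS) === 'deepseek-chat');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;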

&lt;h2&gt;
  
  
  practical implications (things you can do this week)
&lt;/h2&gt;

&lt;p&gt;these are the actions, in priority order, that close the cliff exposure for a typical team running 60-80% deepseek-v4-pro traffic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;audit your last 30 days of v4-pro spend by tier.&lt;/strong&gt; group by cache-hit / cache-miss / output. multiply each line by 4x and look at the dollar delta, not the percent. the result is your post-cliff monthly run-rate floor. if it scares you, the rest of this list is mandatory; if it doesn't, you can stop here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;add a hard cutoff in your router by may 30, not may 31.&lt;/strong&gt; "switch off v4-pro at 15:59 UTC on the 31st" is the obvious answer and the wrong one — it puts a billing event on the same calendar minute as a routing event with no headroom. switch off at midnight UTC may 30, run on the post-cliff equivalent (deepseek-v4 base, anthropic haiku, gpt-4.1-mini) for 24-48 hours, validate that quality + latency hold, then make a deliberate "do we want v4-pro at the new price" call with the actual numbers in front of you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;instrument cache-hit-rate dollar impact, not just request count.&lt;/strong&gt; most observability stacks count cache hits as a vanity metric. on june 1 cache-hit becomes 4x its promo price and the volume doesn't change. plot dollar-cost-per-cache-hit over time so the cliff shows up as a step function and not a slow drift.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;kill any long-running session that starts after may 31 12:00 UTC.&lt;/strong&gt; for the 4-hour window before the cliff, force agent runs to stay under 30 minutes wall-clock. it's annoying for one shift but it eliminates the mid-session-pricing-flip class of bug entirely. cheaper than discovering on june 2 that your overnight run got billed at the post-cliff rate for 90% of its tokens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;set an out-of-band budget cap that survives provider misbehavior.&lt;/strong&gt; the cliff is the deterministic case. the worse case is a provider you've never used before issuing a similar schedule on a tuesday with no warning. the only durable defense is a hard $/period ceiling on a layer the provider doesn't control. that's the slot llmeter sits in — count tokens before they leave your network, refuse to forward requests once the cap is hit, fail closed when the meter is uncertain. fwiw it's the same architecture every team rolls themselves on the 2nd of the month after the first surprise bill. (a fail-closed sketch follows this list.)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
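
&lt;p&gt;for item 5, the core behavior fits in a few lines. a sketch of the shape, not llmeter's actual code; the cap is a placeholder and the rates come from the post-cliff table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// fail-closed budget gate: if the estimate is uncertain or over cap, the
// request never leaves the network. all numbers and names are placeholders.
let spentUsdToday = 0;
const DAILY_CAP_USD = 50;

function estimateUsd(promptTokens: number, maxOutputTokens: number): number {
  // post-cliff v4-pro cache-miss + output rates from the table above
  return (promptTokens * 1.74 + maxOutputTokens * 3.48) / 1_000_000;
}

function gate(promptTokens: number, maxOutputTokens: number): void {
  const est = estimateUsd(promptTokens, maxOutputTokens);
  if (!Number.isFinite(est) || spentUsdToday + est &gt; DAILY_CAP_USD) {
    throw new Error('budget cap: request refused'); // fail closed
  }
  spentUsdToday += est; // reserve up front; reconcile against real usage after
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;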

&lt;h2&gt;
  
  
  tldr
&lt;/h2&gt;

&lt;p&gt;Deepseek did the honest thing and published the schedule. the failure mode is still that 99% of teams won't read it until june 1. you have 27 days. the cheap version of preparing is item 1 (run the numbers); the durable version is item 5 (own the cap). the rest is somewhere in between.&lt;/p&gt;

&lt;p&gt;if you're running llm spend across deepseek + anthropic + openai and want a dashboard that flags the pre-cliff routing exposure today rather than after the invoice, that's what i've been building at &lt;a href="https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-deepseek-may-31-cliff" rel="noopener noreferrer"&gt;https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-deepseek-may-31-cliff&lt;/a&gt;. open-source, agpl-3.0, no proxy in the request path. drop the sdk in, point it at your providers, see the cliff exposure as a number.&lt;/p&gt;

&lt;p&gt;happy to compare notes if you've already done the audit on your stack — what was the cache-hit delta vs the headline?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>llm</category>
      <category>saas</category>
    </item>
    <item>
      <title>You Vibe-Coded Your SaaS Landing Page — Google Can't See It</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 04 May 2026 21:42:46 +0000</pubDate>
      <link>https://dev.to/amedinat/you-vibe-coded-your-saas-landing-page-google-cant-see-it-16cj</link>
      <guid>https://dev.to/amedinat/you-vibe-coded-your-saas-landing-page-google-cant-see-it-16cj</guid>
      <description>&lt;p&gt;Look, I get it. You used Lovable or Bolt, shipped a beautiful landing page in 3 hours, and felt like a god. The UI is slick, animations are butter. &lt;/p&gt;

&lt;p&gt;But it's been 3 weeks and you have 0 organic traffic. &lt;/p&gt;

&lt;p&gt;Here is why: you shipped a Client-Side Rendered (CSR) React app. Google hates those.&lt;/p&gt;

&lt;p&gt;I was digging into indexation stats recently. Google takes about 9x longer to index JS-heavy pages. If your site exceeds their rendering budget, indexation drops by up to 40%. &lt;/p&gt;

&lt;p&gt;When Googlebot hits your vibe-coded site, it sees a blank HTML file and a massive JS bundle. It has to queue it for rendering, parse the JS, execute it, and only &lt;em&gt;then&lt;/em&gt; sees the content. Most indie devs don't realize this until their Search Console stays flat for months.&lt;/p&gt;
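
&lt;p&gt;You can check your own page in one minute: fetch the raw HTML the way a crawler's first pass does (no JS execution) and look for your actual copy. A quick sketch; the URL and marker string are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Fetch the initial HTML only -- this is what Googlebot sees before the
// (possibly days-later) rendering queue runs your JS bundle.
const res = await fetch('https://yoursaas.com'); // placeholder URL
const html = await res.text();

const marker = 'Your actual headline copy'; // placeholder: a real sentence from your page
console.log(html.includes(marker)
  ? 'OK: content is server-rendered'
  : 'Warning: blank shell -- crawlers must execute JS to see your content');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;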

&lt;p&gt;Vibe-coding tools default to SPA (Single Page Application) architectures because they're easier to generate and feel faster to the user. But for a landing page? It's SEO suicide.&lt;/p&gt;

&lt;p&gt;If you want organic users, you need SSR (Server-Side Rendering) or SSG (Static Site Generation). When Google hits the URL, the HTML needs to be there instantly. &lt;/p&gt;

&lt;p&gt;I built LLMeter (open-source LLM cost tracking) using Next.js on Vercel specifically for this reason. SSR out of the box. No waiting for Googlebot to execute JS. It just indexes.&lt;/p&gt;

&lt;p&gt;If you're stuck with a CSR landing page right now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prerender.io is an option, though it's a band-aid.&lt;/li&gt;
&lt;li&gt;Rebuild the marketing pages in Next.js/Astro. Keep the CSR app for the actual dashboard on a subdomain (&lt;code&gt;app.yoursaas.com&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't let the AI tools trick you into bad architecture. Ship fast, but make sure Google can actually read it, tbh.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-17-vibe-coding-seo-invisible" rel="noopener noreferrer"&gt;Check out LLMeter if you're dealing with LLM API costs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>seo</category>
      <category>llm</category>
    </item>
    <item>
      <title>GitHub Copilot's 27x Billing Trap is Closing — The Budget Guard Deadline</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Sun, 03 May 2026 01:55:07 +0000</pubDate>
      <link>https://dev.to/amedinat/github-copilots-27x-billing-trap-is-closing-the-budget-guard-deadline-1ioc</link>
      <guid>https://dev.to/amedinat/github-copilots-27x-billing-trap-is-closing-the-budget-guard-deadline-1ioc</guid>
      <description>&lt;p&gt;GitHub Copilot is shifting to usage-based billing on June 1, 2026. If you've been relying on flat-rate pricing to shield you from the realities of AI coding costs, your runway just evaporated.&lt;/p&gt;

&lt;p&gt;We've been tracking a nasty trend in the wild: the "27x billing trap." When an AI coding assistant gets stuck in a recursive loop—hallucinating a fix, failing the test, and trying again—it burns tokens at maximum velocity. Under flat-rate billing, this was an invisible annoyance. Under usage-based billing, it's a catastrophic financial event.&lt;/p&gt;

&lt;p&gt;We saw one dev's $5 prepaid account driven to a $563 negative balance because provider-side limits don't enforce in real time. They are &lt;em&gt;eventually consistent&lt;/em&gt;, and by the time the provider shuts you down, the damage is done.&lt;/p&gt;

&lt;p&gt;The clock is ticking. You have less than 30 days to implement hard enforcement before the new billing model takes effect. &lt;/p&gt;

&lt;p&gt;You don't need another dashboard that sends you a Slack alert &lt;em&gt;after&lt;/em&gt; your budget is blown. You need an enforcement layer that kills the request &lt;em&gt;before&lt;/em&gt; the network call is made.&lt;/p&gt;
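
&lt;p&gt;Mechanically, "kill it before the network call" just means the check is local and synchronous. A minimal sketch of the pattern, not Budget Guard's actual API (names and the cap are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Local, synchronous check: no round-trip, no eventual-consistency window.
let spentUsd = 0;
const CAP_USD = 25;

function enforce(estimatedUsd: number): void {
  if (spentUsd + estimatedUsd &gt;= CAP_USD) {
    // Thrown before any HTTP request exists, so zero tokens are bought.
    throw new Error('hard cap reached: request blocked');
  }
  spentUsd += estimatedUsd;
}

// usage: call enforce(estimate) immediately before every SDK call,
// and reconcile spentUsd with the real usage object afterwards.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;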

&lt;p&gt;That's why we open-sourced LLM Budget Guard. It's a dead-simple, local SDK enforcement layer that cuts the cord the millisecond your hard cap is reached. No proxy routing latency, no cloud gateway single points of failure. Just deterministic budget enforcement.&lt;/p&gt;

&lt;p&gt;Self-host it and protect your runway: &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=copilot-27x-budget-guard-deadline" rel="noopener noreferrer"&gt;llmeter.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>github</category>
      <category>ai</category>
      <category>opensource</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>Your LLM budget alerts won't save you if you can't map costs to users</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 27 Apr 2026 19:41:33 +0000</pubDate>
      <link>https://dev.to/amedinat/your-llm-budget-alerts-wont-save-you-if-you-cant-map-costs-to-users-1k8n</link>
      <guid>https://dev.to/amedinat/your-llm-budget-alerts-wont-save-you-if-you-cant-map-costs-to-users-1k8n</guid>
      <description>&lt;p&gt;Most devs think they have AI costs under control because they set a $500 hard cap on OpenAI. &lt;/p&gt;

&lt;p&gt;tbh, that's not cost control. That's just setting a timer for when your app goes down.&lt;/p&gt;

&lt;p&gt;Here's the problem: when you hit $400 and get that warning email, what do you do? You look at the dashboard and see a massive spike in GPT-4o usage. But &lt;em&gt;who&lt;/em&gt; caused it?&lt;br&gt;
Was it your new enterprise client onboarding? (Good, increase the limit).&lt;br&gt;
Was it a free-tier user who figured out how to loop your agent? (Bad, ban them).&lt;br&gt;
Was it a bug in your own RAG pipeline retrying the same chunk? (Very bad, fix the code).&lt;/p&gt;

&lt;p&gt;The provider dashboard won't tell you. It just shows a giant wall of tokens.&lt;/p&gt;

&lt;p&gt;If you are building a multi-tenant SaaS, you need to map every single LLM call to a specific user ID. If you can't attribute the cost, you don't know your unit economics. You might be losing $2 on every $1 you make from your power users.&lt;/p&gt;
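
&lt;p&gt;The cheapest place to start is tagging the calls at the source. OpenAI's chat completions API accepts a &lt;code&gt;user&lt;/code&gt; field for exactly this; the aggregation side is the part you (or a tool) still have to build:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import OpenAI from 'openai';

const openai = new OpenAI();

const resp = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize my document' }],
  user: 'user_1942', // your internal tenant/user ID, not PII
});

// resp.usage has prompt_tokens / completion_tokens. Join that to 'user_1942'
// in your own store and "who caused the spike" becomes a query, not a guess.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;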

&lt;p&gt;We built LLMeter exactly for this. It's an open-source dashboard that tracks your calls to the provider without sitting in the request path. You pass the user ID in the metadata, and we track the cost per user, per day, across OpenAI, Anthropic, DeepSeek, and OpenRouter.&lt;/p&gt;

&lt;p&gt;No proxies, no Stripe lock-in. Just raw data.&lt;/p&gt;

&lt;p&gt;You can self-host it (AGPL-3.0) or check it out here: &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-25-devto-cost-to-user-mapping" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-25-devto-cost-to-user-mapping&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Hidden 43% — How Teams Waste Half Their LLM API Budget</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 24 Apr 2026 23:24:11 +0000</pubDate>
      <link>https://dev.to/amedinat/the-hidden-43-how-teams-waste-half-their-llm-api-budget-b8d</link>
      <guid>https://dev.to/amedinat/the-hidden-43-how-teams-waste-half-their-llm-api-budget-b8d</guid>
      <description>&lt;p&gt;The provider dashboards show you one number — your total bill. That's like getting an electricity bill with no breakdown. You just see the total and hope nobody left the AC on.&lt;/p&gt;

&lt;p&gt;Tbh, if you look closely at your API logs, you are probably wasting around 43% of your budget. I spent the last few weeks analyzing LLM usage across different teams, and the same leaks happen everywhere.&lt;/p&gt;

&lt;p&gt;Here is where your money is actually going:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Retry Storms (34% of waste)
&lt;/h3&gt;

&lt;p&gt;Your prompt fails to return valid JSON. The agent retries. It fails again. Next thing you know, your while-loop has fired 40 times. At 10k tokens a pop on Claude 3.5 Sonnet, that's 400k tokens: over a dollar in input cost alone for one failed interaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Duplicate Calls
&lt;/h3&gt;

&lt;p&gt;Users ask the same questions. Without semantic caching, you are paying OpenAI to generate the exact same answer 100 times a day.&lt;/p&gt;
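
&lt;p&gt;Even without semantic caching, exact-match deduplication catches a lot of this. A minimal in-memory sketch (the TTL is an example; use a shared store like Redis across instances):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Exact-match answer cache: identical prompts inside the TTL are served from
// memory instead of being bought again from the API.
const cache = new Map&lt;string, { answer: string; expires: number }&gt;();
const TTL_MS = 10 * 60 * 1000; // 10 minutes, illustrative

async function cachedCompletion(
  prompt: string,
  call: (p: string) =&gt; Promise&lt;string&gt;,
): Promise&lt;string&gt; {
  const hit = cache.get(prompt);
  if (hit &amp;&amp; hit.expires &gt; Date.now()) return hit.answer; // zero tokens spent
  const answer = await call(prompt);
  cache.set(prompt, { answer, expires: Date.now() + TTL_MS });
  return answer;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;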

&lt;h3&gt;
  
  
  3. Context Bloat
&lt;/h3&gt;

&lt;p&gt;Sending the entire chat history in every single request without truncation. You only need the last few turns, but your wrapper is sending 50k tokens "just in case."&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Wrong Model Selection
&lt;/h3&gt;

&lt;p&gt;Using GPT-4o for basic routing or classification tasks when a much smaller, cheaper model could do it 10x faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to stop the bleeding
&lt;/h3&gt;

&lt;p&gt;You can't fix what you can't see. If you don't have per-tenant cost attribution, you are flying blind. You need to know exactly which user, model, and feature is burning tokens.&lt;/p&gt;

&lt;p&gt;I built LLMeter (open-source AGPL-3.0) to solve this. It tracks costs per model, per user, per day. It connects directly to OpenAI, Anthropic, DeepSeek, and OpenRouter to give you the exact breakdown without needing to route your traffic through a proxy.&lt;/p&gt;

&lt;p&gt;Stop guessing. Track your per-user costs: &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-hidden-43-percent" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-hidden-43-percent&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why AI Agencies are flying blind (and how to fix your LLM margins)</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 22 Apr 2026 14:30:49 +0000</pubDate>
      <link>https://dev.to/amedinat/why-ai-agencies-are-flying-blind-and-how-to-fix-your-llm-margins-1cd7</link>
      <guid>https://dev.to/amedinat/why-ai-agencies-are-flying-blind-and-how-to-fix-your-llm-margins-1cd7</guid>
      <description>&lt;p&gt;If you're running an AI agency, you're probably building some&lt;br&gt;
variation of RAG or agentic workflows for your clients.&lt;/p&gt;

&lt;p&gt;You deliver the project, it works great, and then the first OpenAI bill hits.&lt;/p&gt;

&lt;p&gt;Most agencies I talk to are still in the "winging it" phase when it&lt;br&gt;
comes to API costs. They use one master key for dev, one for prod, and&lt;br&gt;
maybe—if they're feeling fancy—one key per client.&lt;/p&gt;

&lt;p&gt;But fwiw, per-client keys are a maintenance nightmare. And if you're&lt;br&gt;
using a single master key for multiple clients, you're flying blind.&lt;/p&gt;
&lt;h3&gt;
  
  
  The "Averages" Trap
&lt;/h3&gt;

&lt;p&gt;You might think: "I'll just charge a flat $100/mo for API usage."&lt;/p&gt;

&lt;p&gt;Then one client decides to run a bulk ingest of 5,000 PDFs. Your&lt;br&gt;
margin on that client just went negative, and you won't even know it&lt;br&gt;
until the end of the month when you see a spike in the dashboard that&lt;br&gt;
you can't explain.&lt;/p&gt;

&lt;p&gt;Averages don't work for LLMs. Usage is too spiky.&lt;/p&gt;
&lt;h3&gt;
  
  
  3 ways to handle client attribution
&lt;/h3&gt;
&lt;h4&gt;
  
  
  1. The Metadata Header (The "Minimum Viable" way)
&lt;/h4&gt;

&lt;p&gt;OpenAI, Anthropic, and OpenRouter all allow you to pass a &lt;code&gt;user&lt;/code&gt; or&lt;br&gt;
metadata header.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`client_acme_corp`&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the bare minimum. It lets you export a CSV at the end of the&lt;br&gt;
month and spend 4 hours in Excel trying to pivot-table your way to a&lt;br&gt;
client invoice. tbh, it's better than nothing, but it's not real-time.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. The Custom Proxy
&lt;/h4&gt;

&lt;p&gt;You build a middleman. Every request from your client's app goes to&lt;br&gt;
your proxy first, you log the tokens, then you forward it to OpenAI.&lt;br&gt;
Pros: Absolute control.&lt;br&gt;
Cons: You just added a single point of failure and 200ms of latency to&lt;br&gt;
every request. Unless you're a DevOps wizard, this is usually&lt;br&gt;
over-engineering for a boutique agency.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Real-time Attribution (The "Sanity" way)
&lt;/h4&gt;

&lt;p&gt;You keep your direct provider connection but fire an async event for&lt;br&gt;
every request.&lt;/p&gt;
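
&lt;p&gt;The mechanics of option 3 fit in a dozen lines. A sketch; the collector endpoint and payload shape are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Fire-and-forget usage event: the client response is never blocked on tracking,
// and a tracking outage can never take down the main request path.
function trackUsage(
  clientId: string,
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
): void {
  void fetch('https://collector.example.com/events', { // placeholder endpoint
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ clientId, model, ...usage, ts: Date.now() }),
  }).catch(() =&gt; { /* tracking failures must never break the request */ });
}

// after the provider call resolves:
// trackUsage('client_acme_corp', 'gpt-4o', response.usage);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;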

&lt;p&gt;This is why I built &lt;strong&gt;LLMeter&lt;/strong&gt;. I needed a way to see exactly which&lt;br&gt;
client was spending what, in real-time, without adding latency to the&lt;br&gt;
main request.&lt;/p&gt;

&lt;p&gt;We use it at Simplifai for our own tools. It tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per client (tenant)&lt;/li&gt;
&lt;li&gt;Cost per model&lt;/li&gt;
&lt;li&gt;Daily burn rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a client starts hitting the API harder than expected, I get an&lt;br&gt;
alert immediately. I can then decide to upsell them, cap their usage,&lt;br&gt;
or adjust the billing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters for your agency
&lt;/h3&gt;

&lt;p&gt;Clients don't like "surprise" bills. If you can show them a dashboard&lt;br&gt;
(or a report) with their exact usage and cost, the trust level goes up&lt;br&gt;
10x. It moves you from "freelancer with a script" to "professional AI&lt;br&gt;
partner."&lt;/p&gt;

&lt;p&gt;LLMeter is open-source (AGPL-3.0) and you can self-host it for free.&lt;br&gt;
Check it out: &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ai-agencies-attribution" rel="noopener noreferrer"&gt;llmeter.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How are you billing your clients for LLM usage right now? Flat fee?&lt;br&gt;
Pass-through? Or just eating the cost?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
      <category>startup</category>
    </item>
    <item>
      <title>Why LLM Cost Dashboards Are Not Enough — The Runtime Enforcement Gap</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Thu, 16 Apr 2026 16:20:56 +0000</pubDate>
      <link>https://dev.to/amedinat/why-llm-cost-dashboards-are-not-enough-the-runtime-enforcement-gap-3fea</link>
      <guid>https://dev.to/amedinat/why-llm-cost-dashboards-are-not-enough-the-runtime-enforcement-gap-3fea</guid>
      <description>&lt;p&gt;I've been looking at how teams handle LLM API costs in production, and there's a weird gap in the tooling right now. Everyone is building observability — logs, traces, dashboards. But almost no one is actually enforcing budgets at runtime. &lt;/p&gt;

&lt;p&gt;If you are running multi-step agents or letting users chat indefinitely, discovering a $4,000 OpenAI bill at the end of the month via a dashboard doesn't help. The money is already gone.&lt;/p&gt;

&lt;p&gt;The problem breaks down into three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Attribution (knowing which user/tenant caused the cost)&lt;/li&gt;
&lt;li&gt;Alerting (getting warned when a threshold is near)&lt;/li&gt;
&lt;li&gt;Enforcement (blocking requests at runtime)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most teams are stuck at layer 1. You can't enforce a per-customer budget if you don't even know what each customer is costing you. &lt;/p&gt;
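
&lt;p&gt;Layer 1 doesn't need heavy machinery to start. The smallest per-tenant ledger that makes layers 2 and 3 possible looks roughly like this (names and thresholds are mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Per-tenant spend ledger. Once this exists, alerting (layer 2) is a threshold
// check and enforcement (layer 3) is a refusal. Values are illustrative.
const spend = new Map&lt;string, number&gt;(); // tenantId -&gt; USD today

function addSpend(tenantId: string, usd: number): void {
  const total = (spend.get(tenantId) ?? 0) + usd;
  spend.set(tenantId, total);
  if (total &gt; 5) console.warn(`tenant ${tenantId} passed $5 today`); // layer 2
}

function allowRequest(tenantId: string, capUsd: number): boolean {
  return (spend.get(tenantId) ?? 0) &lt; capUsd; // layer 3
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;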

&lt;p&gt;I built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=llm-cost-enforcement-unsolved" rel="noopener noreferrer"&gt;LLMeter &lt;/a&gt; because I needed to solve that first layer. It's an open-source dashboard that tracks OpenAI, Anthropic, DeepSeek, and OpenRouter costs per user and per day. It also handles budget alerts.&lt;/p&gt;

&lt;p&gt;Until you have per-tenant attribution figured out, trying to build runtime enforcement with API gateways is just guessing. Get the data first, then block the requests.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
      <category>claude</category>
    </item>
    <item>
      <title>LLM prices dropped 80% — but are you actually saving money?</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Thu, 16 Apr 2026 16:17:17 +0000</pubDate>
      <link>https://dev.to/amedinat/llm-prices-dropped-80-but-are-you-actually-saving-money-2o0e</link>
      <guid>https://dev.to/amedinat/llm-prices-dropped-80-but-are-you-actually-saving-money-2o0e</guid>
      <description>&lt;p&gt;veryone is cheering about Anthropic and OpenAI dropping API prices by 80%.&lt;br&gt;
It sounds great on Twitter. But if you look at your actual billing dashboard, your costs probably haven't moved that much.&lt;/p&gt;

&lt;p&gt;Why? Because cheaper tokens usually just mean you start wasting more tokens.&lt;/p&gt;

&lt;p&gt;Here is the thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1- Context bloat&lt;/strong&gt;&lt;br&gt;
When GPT-4 was expensive, we carefully truncated histories and compressed prompts. Now that it's cheap, devs just throw the entire 128k context window at it on every single retry. The cost per token dropped, but you are sending 10x more tokens per request. (A truncation sketch follows this list.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2- Agent loops&lt;/strong&gt;&lt;br&gt;
Cheaper models make agentic workflows viable, but a poorly configured while loop can still burn through your budget in minutes. When an agent gets stuck and retries 40 times, cheaper tokens don't save you—you still bleed cash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3- Lack of per-customer attribution&lt;/strong&gt;&lt;br&gt;
It's easy to see your total OpenAI bill. But if you don't know which specific tenant or user is driving the cost, you can't optimize it. You just eat the cost.&lt;/p&gt;
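
&lt;p&gt;On the context-bloat point: the fix is about ten lines. A sketch of token-budgeted history truncation; the budget and the rough 4-chars-per-token heuristic are illustrative (use a real tokenizer in production):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

// Keep the system prompt plus as many of the most recent turns as fit the
// token budget. length/4 is a crude token estimate; swap in a real tokenizer.
function truncateHistory(history: Msg[], budgetTokens = 4000): Msg[] {
  const [system, ...turns] = history; // assumes history[0] is the system prompt
  const kept: Msg[] = [];
  let used = system.content.length / 4;
  for (const msg of [...turns].reverse()) { // walk newest-first
    used += msg.content.length / 4;
    if (used &gt; budgetTokens) break;
    kept.unshift(msg); // restore chronological order
  }
  return [system, ...kept];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;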

&lt;p&gt;tbh, the raw price per token is only half the story. If you can't attribute the cost per-user or per-model, you're still flying blind.&lt;/p&gt;

&lt;p&gt;fwiw I built LLMeter to fix this for my own projects. It tracks costs per model and per user, and sets budget alerts—without a proxy in the middle. It's open-source (AGPL).&lt;/p&gt;

&lt;p&gt;Check it out if you're tired of guessing your AI bills: &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-16-llm-prices-dropped-are-you-saving" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-16-llm-prices-dropped-are-you-saving&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>openai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
