<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zouhair Ait Oukhrib</title>
    <description>The latest articles on DEV Community by Zouhair Ait Oukhrib (@tokonomics).</description>
    <link>https://dev.to/tokonomics</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3978437%2F89bf5cbe-f1e6-413e-a4bd-540e7309bde3.png</url>
      <title>DEV Community: Zouhair Ait Oukhrib</title>
      <link>https://dev.to/tokonomics</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tokonomics"/>
    <language>en</language>
    <item>
      <title>OpenAI, Anthropic, Google — Which One Is Quietly Getting More Expensive?</title>
      <dc:creator>Zouhair Ait Oukhrib</dc:creator>
      <pubDate>Tue, 30 Jun 2026 00:42:51 +0000</pubDate>
      <link>https://dev.to/tokonomics/openai-anthropic-google-which-one-is-quietly-getting-more-expensive-3o7a</link>
      <guid>https://dev.to/tokonomics/openai-anthropic-google-which-one-is-quietly-getting-more-expensive-3o7a</guid>
      <description>&lt;p&gt;You checked your LLM API pricing last month. Maybe two months ago. You picked a model, budgeted around it, and moved on.&lt;/p&gt;

&lt;p&gt;Here's the problem: the price you budgeted for might not be the price you're paying anymore.&lt;/p&gt;

&lt;p&gt;Between January and June 2026, OpenAI, Anthropic, and Google made &lt;strong&gt;14 combined pricing changes&lt;/strong&gt; across their model lineups. Some prices dropped. Some crept up. A few disappeared entirely when models got deprecated and replaced by pricier successors.&lt;/p&gt;

&lt;p&gt;None of them sent you an email about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The changes nobody talks about
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt; retired GPT-4 Turbo in Q1 2026. If your code still pointed at &lt;code&gt;gpt-4-turbo&lt;/code&gt;, it silently rerouted to GPT-4o. Same name in your logs, different price. GPT-4o is cheaper per token than the old Turbo — but the output token rate shifted from $0.03/M to $0.01/M. Sounds like a win until you realize your prompts were optimized for Turbo's behavior, and GPT-4o generates 30-40% more output tokens on the same prompt. Your per-call cost went up while the per-token price went down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt; launched Claude Sonnet 4 in May 2026 at $3.00/M input. Claude Sonnet 3.5 was $3.00/M too — same price, right? Not quite. Sonnet 4 uses extended thinking by default on complex queries, and thinking tokens bill at the same output rate. A prompt that cost $0.04 on Sonnet 3.5 can cost $0.12 on Sonnet 4 because of the invisible thinking overhead. Three times more — and nothing changed in your code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google&lt;/strong&gt; kept Gemini 2.5 Flash at $0.15/M input. Great price. But they added a context length surcharge most teams missed: anything over 128K tokens doubles the rate to $0.30/M. If you're doing RAG with long documents, your actual cost is 2x what the pricing page headline says.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your bill doesn't match the pricing page
&lt;/h2&gt;

&lt;p&gt;Three things cause the gap:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model deprecation roulettes.&lt;/strong&gt; When a provider sunsets a model, your API calls don't fail. They silently redirect to the successor. The successor might cost more, generate more tokens, or behave differently enough that your prompts produce longer outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden token categories.&lt;/strong&gt; Thinking tokens, cached tokens, system prompt tokens — these didn't exist two years ago. Now they each have their own rate. Anthropic charges full output rate for thinking tokens. Google gives you 75% off cached tokens but charges 2x for long context. The headline price is just one number in a matrix of five or six.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quiet feature changes.&lt;/strong&gt; OpenAI's structured output mode, Anthropic's extended thinking, Google's code execution — these features alter how many tokens a response contains. When a provider enables a feature by default on a new model version, your token count changes without you doing anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who actually got more expensive
&lt;/h2&gt;

&lt;p&gt;If you froze your code in January 2026 and checked your June bill:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're paying more if&lt;/strong&gt; you use Claude for complex reasoning (thinking token overhead), send long documents to Gemini (context surcharge), or relied on a deprecated model that got rerouted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're paying less if&lt;/strong&gt; you switched to Gemini 2.5 Flash for simple tasks (genuinely cheap at $0.15/M), or you're using DeepSeek V3 which hasn't changed pricing since launch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have no idea if&lt;/strong&gt; you're not tracking cost per call. And that's most teams. A 2026 survey by a16z found that 71% of companies using LLM APIs don't track spending at the individual call level. They see one line item on a monthly invoice and hope it looks reasonable.&lt;/p&gt;

&lt;p&gt;The problem isn't that providers are being sneaky. They publish every price change. The problem is that nobody is watching — and by the time you check, three months of drift have already hit your budget.&lt;/p&gt;




&lt;p&gt;If your AI bill surprised you this month, you're not alone. &lt;a href="https://tokonomics.ca" rel="noopener noreferrer"&gt;Tokonomics&lt;/a&gt; tracks every API call by model, feature, and cost — with alerts before the invoice arrives, not after.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pricing data current as of June 28, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>saas</category>
    </item>
    <item>
      <title>DeepInfra Pricing 2026: Is It Really the Cheapest LLM API?</title>
      <dc:creator>Zouhair Ait Oukhrib</dc:creator>
      <pubDate>Sat, 27 Jun 2026 20:22:32 +0000</pubDate>
      <link>https://dev.to/tokonomics/deepinfra-pricing-2026-is-it-really-the-cheapest-llm-api-4ldl</link>
      <guid>https://dev.to/tokonomics/deepinfra-pricing-2026-is-it-really-the-cheapest-llm-api-4ldl</guid>
      <description>&lt;p&gt;DeepInfra offers open-source LLM inference at prices 5-50x lower than OpenAI and Anthropic. But is it actually cheaper once you factor in latency, reliability, and model availability?&lt;/p&gt;

&lt;p&gt;I spent a week benchmarking DeepInfra against direct API calls. Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Price Gap Is Real
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;DeepInfra&lt;/th&gt;
&lt;th&gt;OpenAI Equivalent&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;$0.05/M input&lt;/td&gt;
&lt;td&gt;GPT-4o-mini $0.15/M&lt;/td&gt;
&lt;td&gt;3x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B&lt;/td&gt;
&lt;td&gt;$0.35/M input&lt;/td&gt;
&lt;td&gt;GPT-4o $2.50/M&lt;/td&gt;
&lt;td&gt;7x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1&lt;/td&gt;
&lt;td&gt;$0.55/M input&lt;/td&gt;
&lt;td&gt;o1 $15.00/M&lt;/td&gt;
&lt;td&gt;27x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No minimum commitment. Pay-per-token with $5 free credit to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  When DeepInfra Makes Sense
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;High-volume, simple tasks.&lt;/strong&gt; Processing 10M+ tokens/day on classification or extraction? Switching from GPT-4o-mini to Llama 3.1 8B saves 67%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch processing.&lt;/strong&gt; If you don't need sub-100ms latency, DeepInfra's throughput-optimized endpoints push costs even lower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data privacy.&lt;/strong&gt; Open-source models don't train on your data. Simpler than negotiating enterprise DPAs.&lt;/p&gt;

&lt;h2&gt;
  
  
  When It Doesn't
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Need GPT-4o's structured output mode or function calling? Not available.&lt;/li&gt;
&lt;li&gt;Need Claude's 200K context analysis? DeepInfra doesn't host Claude.&lt;/li&gt;
&lt;li&gt;Need fine-tuning? Limited to Flash-tier models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hidden Costs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Rate limits.&lt;/strong&gt; Free tier caps at 30 req/min. Production needs the paid tier (300 req/min).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Model churn.&lt;/strong&gt; Llama updates frequently. Budget 2-5 engineering days per model migration for prompt re-tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No cost tracking.&lt;/strong&gt; DeepInfra's dashboard shows total credit consumed — no per-feature or per-customer breakdown. If you're running a SaaS, you won't know which feature is burning through your budget.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://tokonomics.ca" rel="noopener noreferrer"&gt;Tokonomics&lt;/a&gt; specifically for this: it sits as a proxy between your app and DeepInfra (or any provider), tracks spend per API key, per feature, per customer — with budget alerts and hard caps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Hosting vs DeepInfra
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Cost (Llama 70B, 100M tokens/month)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepInfra serverless&lt;/td&gt;
&lt;td&gt;~$35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS g5.12xlarge&lt;/td&gt;
&lt;td&gt;~$720 + engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RunPod A100&lt;/td&gt;
&lt;td&gt;~$540 + engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Break-even for self-hosting: ~1B+ tokens/month at 80%+ GPU utilization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;DeepInfra is the real deal for open-source model inference. The 5-27x savings vs OpenAI/Anthropic are genuine — if the models fit your use case. Start with the $5 free credit, benchmark quality against your current provider, then decide.&lt;/p&gt;

&lt;p&gt;Full pricing breakdown with all models: &lt;a href="https://tokonomics.ca/blog/deepinfra-pricing-guide-2026" rel="noopener noreferrer"&gt;tokonomics.ca/blog/deepinfra-pricing-guide-2026&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What LLM provider are you using? Have you tried DeepInfra? Drop your experience in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>startup</category>
    </item>
    <item>
      <title>We Tracked 1M LLM API Calls — 60% Were Wasting Money on the Wrong Model</title>
      <dc:creator>Zouhair Ait Oukhrib</dc:creator>
      <pubDate>Wed, 10 Jun 2026 22:59:19 +0000</pubDate>
      <link>https://dev.to/tokonomics/we-tracked-1m-llm-api-calls-60-were-wasting-money-on-the-wrong-model-h7p</link>
      <guid>https://dev.to/tokonomics/we-tracked-1m-llm-api-calls-60-were-wasting-money-on-the-wrong-model-h7p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;82% of developers default to OpenAI GPT models (Stack Overflow Developer Survey, 2025), but 60-70% of production API calls don't need a frontier model.&lt;/li&gt;
&lt;li&gt;Switching classification calls from GPT-4o to DeepSeek V3 saves 18x on input tokens ($2.50 → $0.14 per million).&lt;/li&gt;
&lt;li&gt;Combining model routing with prompt caching cuts total LLM spend by 80-95%.&lt;/li&gt;
&lt;li&gt;Average monthly AI spend hit $85,500 per company in 2025 — a 36% jump YoY (CloudZero, 2025).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's something that'll bother you if you're shipping AI features right now.&lt;/p&gt;

&lt;p&gt;We looked at the first million API calls that came through &lt;a href="https://tokonomics.ca" rel="noopener noreferrer"&gt;Tokonomics&lt;/a&gt; — across 47 tenants, 9 providers, dozens of models. The pattern was the same almost everywhere: teams default to GPT-4o for everything. Customer support chatbots? GPT-4o. JSON extraction? GPT-4o. Classification into 5 categories? GPT-4o.&lt;/p&gt;

&lt;p&gt;The waste isn't theoretical. It shows up in the billing dashboard every month, and most teams have no idea it's there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do 82% of Developers Default to GPT-4o?
&lt;/h2&gt;

&lt;p&gt;Stack Overflow's 2025 Developer Survey found that 82% of developers use OpenAI GPT models. That makes GPT-4o the de facto standard.&lt;/p&gt;

&lt;p&gt;It makes sense. OpenAI has the best docs. Every tutorial uses GPT-4o. When you're prototyping at midnight, you're not running benchmarks across 6 providers.&lt;/p&gt;

&lt;p&gt;But prototyping habits become production costs. That model you picked in February is still running in June, processing 50,000 calls a day, and nobody's asked whether a $0.14/M model would give the same result as a $2.50/M model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Our finding:&lt;/strong&gt; Our own internal chatbot ran on GPT-4o for three months before anyone checked. Switching the FAQ portion to GPT-4o-mini cut that component's cost by 94% with no quality difference.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Does Model Selection Actually Cost?
&lt;/h2&gt;

&lt;p&gt;Here's what 1 million requests cost (500 input + 200 output tokens per call):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$3,250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;$4,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 3.5&lt;/td&gt;
&lt;td&gt;$1,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$195&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3&lt;/td&gt;
&lt;td&gt;$126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1 Nano&lt;/td&gt;
&lt;td&gt;$130&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's a &lt;strong&gt;25x cost difference&lt;/strong&gt; between GPT-4o and GPT-4.1 Nano. For the same million requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Calls Don't Need a Frontier Model?
&lt;/h2&gt;

&lt;p&gt;60-70% of API calls in typical SaaS apps are simple enough for budget models (Prem AI, 2026):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Send to a budget model ($0.10-$0.80/M input):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intent classification&lt;/li&gt;
&lt;li&gt;JSON/structured data extraction&lt;/li&gt;
&lt;li&gt;Short summaries (under 200 words)&lt;/li&gt;
&lt;li&gt;Sentiment analysis&lt;/li&gt;
&lt;li&gt;Content moderation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Keep on a frontier model ($2.50-$3.00/M input):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step reasoning chains&lt;/li&gt;
&lt;li&gt;Complex code generation&lt;/li&gt;
&lt;li&gt;Long-form content where quality is critical&lt;/li&gt;
&lt;li&gt;Vision and multimodal tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Much Are Companies Spending?
&lt;/h2&gt;

&lt;p&gt;Average monthly AI spend jumped from $63,000 to $85,500 — a 36% increase YoY (CloudZero, 2025). And 45% of organizations plan to spend over $100,000/month. Only 51% can confidently evaluate their AI ROI.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Our finding:&lt;/strong&gt; The teams spending the most aren't the ones with the most sophisticated AI. They're the ones who shipped early, never revisited model selection, and let usage scale on autopilot. The $47,000 invoice that led us to build Tokonomics came from exactly this pattern.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Fix: Route, Cache, Cap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Route calls to the right model
&lt;/h3&gt;

&lt;p&gt;Tag every API call by task type, then route:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classification → GPT-4o-mini or DeepSeek V3&lt;/li&gt;
&lt;li&gt;Conversational support → Claude Haiku 3.5&lt;/li&gt;
&lt;li&gt;Complex reasoning → GPT-4o or Claude Sonnet 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If 60% of calls shift to a budget model, that's ~$1,950/month saved on a $3,250 bill.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Enable prompt caching
&lt;/h3&gt;

&lt;p&gt;Anthropic's prompt caching saves 90% on cached tokens. OpenAI's automatic caching saves 50% with zero code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Set hard spending caps
&lt;/h3&gt;

&lt;p&gt;A monthly budget cap that &lt;strong&gt;blocks&lt;/strong&gt; API calls when hit — not an alert you'll read at 9 AM, a hard block that stops bleeding at 3 AM.&lt;/p&gt;

&lt;h3&gt;
  
  
  The compounding effect
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Model routing alone: &lt;strong&gt;50-70% savings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add prompt caching: &lt;strong&gt;another 30-50%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add budget caps: &lt;strong&gt;prevents 100% overruns&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A team at $3,250/month can land at $300-$650/month with the same output quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
curl https://tokonomics.ca/proxy/openai/chat/completions \
  -H "Authorization: Bearer mk_your_metering_key_here" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hello!"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>saas</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
