<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bo Shen</title>
    <description>The latest articles on DEV Community by Bo Shen (@aplomb2).</description>
    <link>https://dev.to/aplomb2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3950497%2Fbd65fe5d-412b-4708-a30c-508513af3db7.png</url>
      <title>DEV Community: Bo Shen</title>
      <link>https://dev.to/aplomb2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aplomb2"/>
    <language>en</language>
    <item>
      <title>Claude Fable 5 Went from Free to Offline in 72 Hours — What I Learned About AI Coding Costs</title>
      <dc:creator>Bo Shen</dc:creator>
      <pubDate>Mon, 15 Jun 2026 20:30:01 +0000</pubDate>
      <link>https://dev.to/aplomb2/claude-fable-5-went-from-free-to-offline-in-72-hours-what-i-learned-about-ai-coding-costs-4faa</link>
      <guid>https://dev.to/aplomb2/claude-fable-5-went-from-free-to-offline-in-72-hours-what-i-learned-about-ai-coding-costs-4faa</guid>
      <description>&lt;p&gt;Last week, Anthropic launched Fable 5 — their most powerful model ever — free for all Pro/Max subscribers through June 22.&lt;/p&gt;

&lt;p&gt;Three days later, the US government issued an export control directive. Fable 5 went dark worldwide.&lt;/p&gt;

&lt;p&gt;Developers who hardcoded &lt;code&gt;claude-fable-5&lt;/code&gt; in their workflows woke up to broken pipelines. Anthropic received the directive at 5:21pm ET on June 12 and had to comply immediately.&lt;/p&gt;

&lt;p&gt;This isn't a post about geopolitics. It's about what this event reveals about the true cost of AI-assisted coding — and why &lt;strong&gt;model routing&lt;/strong&gt; is the most underrated skill in a developer's toolkit right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of AI Coding in June 2026
&lt;/h2&gt;

&lt;p&gt;Let's talk numbers that most people aren't tracking:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Typical coding session cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Fable 5&lt;/td&gt;
&lt;td&gt;$10&lt;/td&gt;
&lt;td&gt;$50&lt;/td&gt;
&lt;td&gt;$5-15 per task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.8&lt;/td&gt;
&lt;td&gt;$5&lt;/td&gt;
&lt;td&gt;$25&lt;/td&gt;
&lt;td&gt;$2-8 per task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;td&gt;$7.50&lt;/td&gt;
&lt;td&gt;$0.50-2 per task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;~$2.50&lt;/td&gt;
&lt;td&gt;~$10&lt;/td&gt;
&lt;td&gt;$1-3 per task&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One Reddit user reported burning &lt;strong&gt;$200 in under 60 minutes&lt;/strong&gt; with Fable 5. Another tracked 35 Claude Code subscriptions that would cost &lt;strong&gt;$80K/month&lt;/strong&gt; at API rates.&lt;/p&gt;
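&lt;p&gt;To make the table concrete, here is a minimal sketch of the per-task arithmetic. The prices are the per-million-token figures from the table above. The function name &lt;code&gt;task_cost&lt;/code&gt; and the example token counts are illustrative assumptions, not measurements.&lt;/p&gt;

```python
# USD per 1M tokens as (input, output), copied from the table above.
# The GPT-5.5 figures are the approximate values the table lists.
PRICES = {
    "Claude Fable 5": (10.00, 50.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4": (1.50, 7.50),
    "GPT-5.5": (2.50, 10.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at list API prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical agentic task: 400K input tokens (context re-sent across turns)
# and 60K output tokens.
for name in PRICES:
    print(f"{name}: ${task_cost(name, 400_000, 60_000):.2f}")
```

&lt;p&gt;With those assumed token counts, each model lands inside the "typical coding session" range the table gives for it, which is a useful sanity check before plugging in your own usage.&lt;/p&gt;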

&lt;h2&gt;
  
  
  The Insight: 80% of Your Coding Tasks Don't Need the Most Powerful Model
&lt;/h2&gt;

&lt;p&gt;I run multiple AI coding agents daily across a portfolio of 10+ apps. Six months ago, my monthly AI coding bill hit &lt;strong&gt;$10K&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Today it's around &lt;strong&gt;$3K&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The difference wasn't switching to cheaper models across the board. It was &lt;strong&gt;routing different task types to the right model&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Needs Frontier Models (Fable/Opus)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Complex architectural decisions&lt;/li&gt;
&lt;li&gt;Multi-file refactoring with subtle dependencies&lt;/li&gt;
&lt;li&gt;Novel algorithm implementation&lt;/li&gt;
&lt;li&gt;Debugging race conditions or memory leaks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Works Great with Mid-Tier Models (Sonnet/GPT-5.5)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Boilerplate generation and scaffolding&lt;/li&gt;
&lt;li&gt;Unit test writing&lt;/li&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;li&gt;Simple bug fixes&lt;/li&gt;
&lt;li&gt;Code formatting and linting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Smaller Models Handle Fine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Commit message generation&lt;/li&gt;
&lt;li&gt;Simple string transformations&lt;/li&gt;
&lt;li&gt;Template filling&lt;/li&gt;
&lt;li&gt;Configuration file updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I actually tracked which model was doing what, I found that &lt;strong&gt;roughly 60-70% of my tokens were going to tasks that a Sonnet-class model would handle equally well&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fable 5 Shutdown Proved Something Else
&lt;/h2&gt;

&lt;p&gt;Beyond cost, the overnight shutdown exposed a &lt;strong&gt;resilience problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your entire workflow depends on a single model from a single provider, you don't have a workflow — you have a single point of failure.&lt;/p&gt;

&lt;p&gt;My setup fell back to Opus 4.8 automatically when Fable went offline. No configuration changes, no manual intervention, no lost work. That's not because I predicted a government export control order. It's because I assumed &lt;strong&gt;any model can become unavailable at any time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This has happened before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI rate limits during peak hours&lt;/li&gt;
&lt;li&gt;Anthropic's extended outage in March&lt;/li&gt;
&lt;li&gt;Google's API deprecation cycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building model fallback chains isn't paranoia. It's good engineering.&lt;/p&gt;
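&lt;p&gt;A fallback chain can be sketched in a few lines. This is a minimal illustration, not my production setup: &lt;code&gt;ModelUnavailable&lt;/code&gt;, &lt;code&gt;complete_with_fallback&lt;/code&gt;, and the stand-in client are hypothetical names, and every model string except &lt;code&gt;claude-fable-5&lt;/code&gt; is a placeholder. In real code the client callable would wrap whatever SDK you use and translate its errors.&lt;/p&gt;

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by the caller-supplied client when a model cannot serve a request."""


def complete_with_fallback(
    prompt: str,
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order and return (model_used, response)."""
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models in the chain failed: " + "; ".join(errors))


# Stand-in client for demonstration: pretends Fable 5 is offline.
def fake_client(model: str, prompt: str) -> str:
    if model == "claude-fable-5":
        raise ModelUnavailable("model withdrawn")
    return f"[{model}] response to: {prompt}"


used, reply = complete_with_fallback(
    "Refactor the billing module",
    chain=["claude-fable-5", "opus-4.8", "sonnet-4"],
    call_model=fake_client,
)
print(used)
```

&lt;p&gt;The chain is plain data, so reordering or extending it is a config change rather than a code change.&lt;/p&gt;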

&lt;h2&gt;
  
  
  How to Start Routing Today
&lt;/h2&gt;

&lt;p&gt;You don't need fancy infrastructure. Here's a simple approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Classify your tasks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before sending a prompt, tag it: &lt;code&gt;planning&lt;/code&gt;, &lt;code&gt;implementation&lt;/code&gt;, &lt;code&gt;debugging&lt;/code&gt;, &lt;code&gt;testing&lt;/code&gt;, &lt;code&gt;documentation&lt;/code&gt;, &lt;code&gt;formatting&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create a routing table&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;planning       → opus/fable (complex reasoning matters)
implementation → sonnet (good enough, ~3x cheaper than opus)
debugging      → opus (needs deep understanding)
testing        → sonnet (formulaic, template-driven)
documentation  → sonnet (clarity over intelligence)
formatting     → haiku/small (trivial tasks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
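&lt;p&gt;The same table works as a plain dictionary lookup. This is a sketch under assumptions: the tier names are shorthand rather than real model IDs, and the default tier for untagged work is my own choice for the example.&lt;/p&gt;

```python
# Task tag -> model tier, mirroring the routing table above.
ROUTING_TABLE = {
    "planning": "opus",
    "implementation": "sonnet",
    "debugging": "opus",
    "testing": "sonnet",
    "documentation": "sonnet",
    "formatting": "haiku",
}

DEFAULT_TIER = "sonnet"  # assumption: untagged work goes to the mid tier


def route(task_tag: str) -> str:
    """Return the model tier for a task tag, falling back to the default."""
    return ROUTING_TABLE.get(task_tag.strip().lower(), DEFAULT_TIER)


print(route("debugging"))
print(route("something-untagged"))
```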



&lt;p&gt;&lt;strong&gt;3. Track and iterate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Log which model handled which task, then review: did the cheaper model produce acceptable results? Over time, you'll discover your personal routing table.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The AI coding landscape in June 2026 looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Models are getting more capable AND more expensive at the top end&lt;/li&gt;
&lt;li&gt;The gap between tiers is narrowing for common tasks&lt;/li&gt;
&lt;li&gt;Availability is no longer guaranteed (regulatory, rate limits, outages)&lt;/li&gt;
&lt;li&gt;Smart routing beats brute-force spending every time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The developers who'll thrive aren't the ones with unlimited API budgets. They're the ones who treat model selection as an engineering problem — matching the right tool to the right task, with fallbacks for when things go wrong.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Bo. I run 10+ AI-powered apps and spend too much time thinking about model costs. Previously cut our team's Claude Code bill from $10K/mo to $3K with task-level routing. Find me &lt;a href="https://x.com/aplomb2" rel="noopener noreferrer"&gt;@aplomb2&lt;/a&gt; on X.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>programming</category>
      <category>claude</category>
    </item>
    <item>
      <title>Claude Fable 5 Costs $50/M Output Tokens — Here's How I Cut My AI Coding Bill by 70%</title>
      <dc:creator>Bo Shen</dc:creator>
      <pubDate>Thu, 11 Jun 2026 20:28:16 +0000</pubDate>
      <link>https://dev.to/aplomb2/claude-fable-5-costs-50m-output-tokens-heres-how-i-cut-my-ai-coding-bill-by-70-39i7</link>
      <guid>https://dev.to/aplomb2/claude-fable-5-costs-50m-output-tokens-heres-how-i-cut-my-ai-coding-bill-by-70-39i7</guid>
      <description>&lt;p&gt;Anthropic just dropped Claude Fable 5, and the reactions are split right down the middle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developers who tried it&lt;/strong&gt;: "This is the best coding model I've ever used."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers who saw the bill&lt;/strong&gt;: "Wait, $50 per million output tokens?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both reactions are correct. Fable 5 is genuinely a leap forward for complex reasoning tasks. But at 2x the price of Opus 4.8, sending &lt;em&gt;everything&lt;/em&gt; through it is financial suicide — especially for teams running multiple concurrent coding agents.&lt;/p&gt;

&lt;p&gt;I've been running AI coding agents at scale for the past year. Here's how I keep my monthly bill under $3K while still using frontier models where they actually matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Not All Coding Tasks Are Equal
&lt;/h2&gt;

&lt;p&gt;Here's what most developers get wrong: they pick one model and use it for everything. Whether they're generating boilerplate, writing tests, debugging a gnarly race condition, or architecting a new service — same model, same cost.&lt;/p&gt;

&lt;p&gt;But when I audited my team's actual Claude Code usage over 3 months, the breakdown was shocking:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;% of Total Tokens&lt;/th&gt;
&lt;th&gt;Frontier Model Needed?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate &amp;amp; scaffolding&lt;/td&gt;
&lt;td&gt;~25%&lt;/td&gt;
&lt;td&gt;❌ Haiku handles it fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test generation&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;td&gt;❌ Sonnet is perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple refactors &amp;amp; linting&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;td&gt;❌ Any model works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature implementation&lt;/td&gt;
&lt;td&gt;~25%&lt;/td&gt;
&lt;td&gt;⚠️ Sonnet/Opus depending on complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex architecture &amp;amp; debugging&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;td&gt;✅ This is where Fable shines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Only ~15% of our token spend clearly benefited from frontier-tier models.&lt;/strong&gt; Most of the remaining 85% was burning money on tasks where cheaper models produce comparable results.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math That Changed Everything
&lt;/h2&gt;

&lt;p&gt;Let's do some napkin math with current pricing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before routing (everything on Opus 4.8):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My team: ~12M output tokens/day across all agents&lt;/li&gt;
&lt;li&gt;12M × $25/M = &lt;strong&gt;~$300/day = ~$9,000/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After task-level routing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;60% of tokens → Haiku ($0.25/M output): &lt;strong&gt;$1.80/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;25% of tokens → Sonnet ($3/M output): &lt;strong&gt;$9.00/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;15% of tokens → Opus/Fable ($25-50/M output, ~$37.50/M blended): &lt;strong&gt;$67.50/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: ~$78/day = ~$2,340/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
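&lt;p&gt;Here is the same napkin math as a short script, so you can swap in your own volumes. Two assumptions are baked in: a daily volume of about 12M output tokens, which is what ~$300/day at $25/M implies, and a blended $37.50/M for the Opus/Fable share (the midpoint of the $25-50 range).&lt;/p&gt;

```python
DAILY_OUTPUT_TOKENS = 12_000_000  # assumption: volume implied by ~$300/day at $25/M

# tier -> (share of output tokens, output price in USD per 1M tokens)
ROUTED_MIX = {
    "haiku": (0.60, 0.25),
    "sonnet": (0.25, 3.00),
    "opus_or_fable": (0.15, 37.50),  # midpoint of the $25-50 range
}


def daily_cost(tokens: float, price_per_million: float) -> float:
    """USD cost of `tokens` output tokens at a per-million price."""
    return tokens / 1_000_000 * price_per_million


before = daily_cost(DAILY_OUTPUT_TOKENS, 25.00)
after = sum(
    daily_cost(DAILY_OUTPUT_TOKENS * share, price)
    for share, price in ROUTED_MIX.values()
)
savings = 1 - after / before

print(f"before: ${before:.2f}/day, ${before * 30:,.0f}/month")
print(f"after:  ${after:.2f}/day, ${after * 30:,.0f}/month")
print(f"reduction: {savings:.0%}")
```

&lt;p&gt;Almost all of the remaining spend sits in the 15% frontier share, so that is the number worth auditing first.&lt;/p&gt;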

&lt;p&gt;That's a &lt;strong&gt;74% reduction&lt;/strong&gt; with zero quality loss on the complex work. In fact, the complex work got &lt;em&gt;better&lt;/em&gt; because now we can afford to use Fable 5 where it actually matters instead of rationing a mid-tier model across everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Task-Level Routing Actually Works
&lt;/h2&gt;

&lt;p&gt;The concept is simple: &lt;strong&gt;classify the task, then pick the cheapest model that can handle it well.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the decision tree I use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Is this a known pattern? (boilerplate, CRUD, test scaffolding)
   → Haiku. Fast, cheap, good enough.

2. Does it require understanding context across multiple files?
   → Sonnet. Great balance of capability and cost.

3. Does it involve:
   - Complex multi-step reasoning?
   - Subtle bug hunting across a large codebase?
   - Architecture decisions with tradeoffs?
   - Novel algorithm design?
   → Opus or Fable 5. Worth every penny.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
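&lt;p&gt;As code, the decision tree might look like the sketch below. The &lt;code&gt;Task&lt;/code&gt; class and its flags are hypothetical; in practice they would come from your own tagging or a classifier. One deliberate difference from the list above: the hardest condition is checked first, so a task that is both a familiar pattern and a subtle bug still goes to the top tier.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor; the flags come from your own tagging."""

    known_pattern: bool = False         # boilerplate, CRUD, test scaffolding
    multi_file_context: bool = False    # needs context across multiple files
    needs_deep_reasoning: bool = False  # architecture, subtle bugs, novel algorithms


def pick_tier(task: Task) -> str:
    # Check the most demanding condition first so a hard task is never
    # downgraded just because it also matches a familiar pattern.
    if task.needs_deep_reasoning:
        return "opus-or-fable"
    if task.multi_file_context:
        return "sonnet"
    if task.known_pattern:
        return "haiku"
    return "sonnet"  # assumption: unclassified work defaults to the mid tier


print(pick_tier(Task(known_pattern=True)))
print(pick_tier(Task(known_pattern=True, needs_deep_reasoning=True)))
```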



&lt;p&gt;The key insight: &lt;strong&gt;you don't need to be perfect at classification.&lt;/strong&gt; Even a rough 60/30/10 split saves massive money compared to running everything on a single tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned Running This for 6 Months
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cheaper models fail gracefully
&lt;/h3&gt;

&lt;p&gt;When Haiku gets a task that's slightly too complex, it doesn't produce garbage — it produces a reasonable attempt that Sonnet can refine. The cost of occasionally "upgrading" a task is way less than running everything on expensive models.&lt;/p&gt;
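&lt;p&gt;That "upgrade when needed" loop can be sketched like this. It is a simplification: &lt;code&gt;run_with_escalation&lt;/code&gt; and the &lt;code&gt;acceptable&lt;/code&gt; check are hypothetical, each tier retries from scratch instead of refining the previous attempt, and the acceptance check would typically be your test suite or a lint pass.&lt;/p&gt;

```python
from typing import Callable, Sequence


def run_with_escalation(
    task: str,
    tiers: Sequence[str],
    run: Callable[[str, str], str],
    acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try tiers from cheapest to most capable; stop at the first acceptable result."""
    result = ""
    for tier in tiers:
        result = run(tier, task)
        if acceptable(result):
            return tier, result
    # Nothing passed the check: return the most capable tier's attempt.
    return tiers[-1], result


# Stand-in runner for demonstration: the cheapest tier gives up on this task.
def fake_run(tier: str, task: str) -> str:
    return "TODO" if tier == "haiku" else f"{tier} solved: {task}"


tier_used, output = run_with_escalation(
    "Fix flaky retry logic",
    tiers=["haiku", "sonnet", "opus"],
    run=fake_run,
    acceptable=lambda text: text != "TODO",
)
print(tier_used)
```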

&lt;h3&gt;
  
  
  2. Token count is a poor proxy for complexity
&lt;/h3&gt;

&lt;p&gt;Generating a 200-line test file burns lots of tokens but needs zero frontier reasoning. A 5-line debugging insight might need Fable-tier understanding. Route on &lt;em&gt;task type&lt;/em&gt;, not token volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The model landscape is fragmenting fast
&lt;/h3&gt;

&lt;p&gt;This week alone: Microsoft dropped 7 MAI models at Build, MiMo Code is matching Sonnet at a fraction of the cost, and Apple revealed Siri AI routes between multiple models including Gemini.&lt;/p&gt;

&lt;p&gt;Even Apple — the world's biggest company — doesn't bet on a single model. They route dynamically based on the task.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The cost spread is only getting wider
&lt;/h3&gt;

&lt;p&gt;Fable 5 at $50/M output vs Haiku at $0.25/M means a &lt;strong&gt;200x price difference&lt;/strong&gt; between the cheapest and most expensive Claude models. That spread makes routing not just nice-to-have — it's table stakes for anyone spending more than a few hundred a month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;You don't need a fancy routing framework to start:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit your usage&lt;/strong&gt;: What percentage of your AI coding tasks actually need frontier reasoning? In my experience, teams overestimate this by 3-5x.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with two tiers&lt;/strong&gt;: Route "simple" tasks to Haiku, everything else to your current model. Even this basic split saves 40-50%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterate&lt;/strong&gt;: As you get comfortable, add a middle tier (Sonnet) and refine your classification rules based on actual results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure&lt;/strong&gt;: Track quality metrics alongside cost. You'll likely find that quality stays flat or improves (because you can afford frontier models for the hard stuff).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The AI model market is heading toward massive fragmentation. New models ship weekly. Prices vary 200x between tiers. Every major player is building their own models.&lt;/p&gt;

&lt;p&gt;The developers and teams who thrive won't be the ones using the most expensive model for everything. They'll be the ones who match the right model to the right task — automatically, at scale.&lt;/p&gt;

&lt;p&gt;Fable 5 is incredible. Use it where it matters. Use something cheaper everywhere else. Your wallet will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Bo — I've been building AI-powered apps and cutting costs through intelligent model routing. Follow me on &lt;a href="https://x.com/aplomb2" rel="noopener noreferrer"&gt;X (@aplomb2)&lt;/a&gt; for more on making AI coding affordable.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
    </item>
    <item>
      <title>Claude Fable 5: The 7.5x Cost Trap and How to Fix It with Task-Level Routing</title>
      <dc:creator>Bo Shen</dc:creator>
      <pubDate>Wed, 10 Jun 2026 09:26:25 +0000</pubDate>
      <link>https://dev.to/aplomb2/claude-fable-5-the-75x-cost-trap-and-how-to-fix-it-with-task-level-routing-56cd</link>
      <guid>https://dev.to/aplomb2/claude-fable-5-the-75x-cost-trap-and-how-to-fix-it-with-task-level-routing-56cd</guid>
      <description>&lt;p&gt;Anthropic dropped Claude Fable 5 yesterday — their most capable model ever. Everyone's talking about the benchmarks. But here's what actually matters for your bill: &lt;strong&gt;the same model can cost you 7.5x more depending on one setting.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me explain, and then show you exactly how we used this to cut our AI coding costs from $10K/mo to $3K/mo.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Cost Lever
&lt;/h2&gt;

&lt;p&gt;Fable 5 introduces 5 thinking effort levels: &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;medium-high&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, and &lt;code&gt;max&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Same model. Same intelligence. But wildly different costs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Thinking Level&lt;/th&gt;
&lt;th&gt;Cost per Query&lt;/th&gt;
&lt;th&gt;Relative Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;~$0.10&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;~$0.20&lt;/td&gt;
&lt;td&gt;2x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium-High&lt;/td&gt;
&lt;td&gt;~$0.35&lt;/td&gt;
&lt;td&gt;3.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;~$0.50&lt;/td&gt;
&lt;td&gt;5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max&lt;/td&gt;
&lt;td&gt;~$0.72&lt;/td&gt;
&lt;td&gt;7.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most developers will leave this on default (high/max) and never think about it. That's the trap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Fable 5 costs $10/M input and $50/M output — exactly &lt;strong&gt;double&lt;/strong&gt; Opus 4.8. Combined with the thinking effort multiplier, a heavy coding session can burn through budget shockingly fast.&lt;/p&gt;

&lt;p&gt;One user on r/ClaudeAI reported burning &lt;strong&gt;2% of their Max 20x plan per minute&lt;/strong&gt; during a heavy Fable 5 session. At that rate, you'd exhaust a $200/mo plan in under an hour of focused work.&lt;/p&gt;

&lt;p&gt;But here's the thing: &lt;strong&gt;most coding tasks don't need max thinking.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Renaming variables? Low thinking is fine.&lt;/li&gt;
&lt;li&gt;Writing unit tests from existing code? Medium at most.&lt;/li&gt;
&lt;li&gt;Fixing a typo or config change? Low.&lt;/li&gt;
&lt;li&gt;Complex architecture decisions? Now you want max.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The skill gap in 2026 isn't "which model do I use." It's &lt;strong&gt;"how much thinking does this task actually need."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Routing Approach
&lt;/h2&gt;

&lt;p&gt;Here's what actually worked for us across 10+ products:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Model Selection
&lt;/h3&gt;

&lt;p&gt;Not everything needs Fable 5. We route across three tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Routine tasks&lt;/strong&gt; (config changes, formatting, boilerplate) → Haiku-class models (~$0.01/query)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard reasoning&lt;/strong&gt; (code review, debugging, feature implementation) → Sonnet/Opus tier (~$0.05-0.15/query)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontier-required&lt;/strong&gt; (architecture decisions, complex multi-step reasoning) → Fable 5&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 2: Thinking Effort (NEW with Fable 5)
&lt;/h3&gt;

&lt;p&gt;When a task does need Fable 5, match the thinking effort:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode for thinking effort routing
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_thinking_effort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# ~$0.10
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debugging&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refactoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# ~$0.20
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;architecture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security_audit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;migration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# ~$0.72
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# safe default
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 3: Prompt Caching
&lt;/h3&gt;

&lt;p&gt;Fable 5 offers a &lt;strong&gt;90% discount on cached input tokens&lt;/strong&gt;. If your system prompt and tool definitions are consistent across calls, cached input drops from $10/M to $1/M.&lt;/p&gt;

&lt;p&gt;This is massive for agentic workflows where the same context gets sent repeatedly.&lt;/p&gt;
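&lt;p&gt;A rough sketch of what that discount does to the input side of an agent loop. The $10/M and $1/M prices are the ones quoted above. The call count and token sizes are made-up example values, and the model is deliberately simple: the first call pays full price for the shared prefix, later calls pay the cached rate, and any cache-write surcharge or cache expiry is ignored.&lt;/p&gt;

```python
INPUT_PRICE = 10.00        # USD per 1M uncached input tokens (from the text)
CACHED_INPUT_PRICE = 1.00  # USD per 1M cached input tokens (90% discount)
PER_MILLION = 1_000_000


def input_cost(calls: int, shared_prefix_tokens: int, unique_tokens: int,
               cached: bool) -> float:
    """Input-side USD cost of `calls` requests that share one prompt prefix."""
    unique = calls * unique_tokens * INPUT_PRICE / PER_MILLION
    if not cached:
        prefix = calls * shared_prefix_tokens * INPUT_PRICE / PER_MILLION
    else:
        # First call pays full price for the prefix; the rest hit the cache.
        first = shared_prefix_tokens * INPUT_PRICE / PER_MILLION
        rest = (calls - 1) * shared_prefix_tokens * CACHED_INPUT_PRICE / PER_MILLION
        prefix = first + rest
    return unique + prefix


# Hypothetical agent loop: 50 calls, a 20K-token system prompt plus tool
# definitions, and 2K tokens of new context per call.
without_cache = input_cost(50, 20_000, 2_000, cached=False)
with_cache = input_cost(50, 20_000, 2_000, cached=True)
print(f"without cache: ${without_cache:.2f}")
print(f"with cache:    ${with_cache:.2f}")
```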

&lt;h2&gt;
  
  
  Real Numbers: Our Before and After
&lt;/h2&gt;

&lt;p&gt;Before routing (everything on Claude Opus, max settings):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$10,200/mo&lt;/strong&gt; across 10 products&lt;/li&gt;
&lt;li&gt;Average cost per developer task: $0.85&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After three-layer routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$3,100/mo&lt;/strong&gt; — a 70% reduction&lt;/li&gt;
&lt;li&gt;Average cost per developer task: $0.26&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The breakdown of where tasks actually land:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;62%&lt;/strong&gt; of tasks → cheap models (Haiku/Sonnet class)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31%&lt;/strong&gt; of tasks → mid-tier (Opus 4.8 or Fable 5 low/medium thinking)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7%&lt;/strong&gt; of tasks → Fable 5 max thinking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That 7% is doing the heavy lifting. The other 93% was burning money for no quality improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Classification Trick
&lt;/h2&gt;

&lt;p&gt;"But how do you know which tier a task needs?"&lt;/p&gt;

&lt;p&gt;A lightweight classifier. We use a Haiku-class model to analyze each task before routing it. The classifier itself costs ~0.1% of what it saves. Here's the approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take the task description/prompt&lt;/li&gt;
&lt;li&gt;Ask a cheap model: "Rate this task's complexity: routine/standard/frontier"&lt;/li&gt;
&lt;li&gt;Route based on the answer&lt;/li&gt;
&lt;/ol&gt;
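&lt;p&gt;The three steps above can be sketched as follows. Everything here is illustrative: the prompt wording, the &lt;code&gt;classify&lt;/code&gt; function, and the stand-in model are assumptions, and a real version would pass in a function that calls your Haiku-class model. Unexpected answers fall back to the middle tier rather than failing.&lt;/p&gt;

```python
from typing import Callable

TIERS = ("routine", "standard", "frontier")

CLASSIFIER_PROMPT = (
    "Rate this task's complexity. Answer with exactly one word: "
    "routine, standard, or frontier.\n\nTask: {task}"
)


def classify(task: str, ask_cheap_model: Callable[[str], str]) -> str:
    """Ask a cheap model for a tier; fall back to 'standard' on odd answers."""
    answer = ask_cheap_model(CLASSIFIER_PROMPT.format(task=task))
    label = answer.strip().lower().rstrip(".")
    return label if label in TIERS else "standard"


# Stand-in for a real Haiku-class call, so the sketch runs offline.
def fake_cheap_model(prompt: str) -> str:
    return "Frontier." if "architecture" in prompt.lower() else "routine"


print(classify("Propose an architecture for multi-region failover", fake_cheap_model))
print(classify("Rename a config key", fake_cheap_model))
```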

&lt;p&gt;It's not perfect — maybe 85% accurate. But an 85%-accurate router that saves 70% is vastly better than a 100%-accurate "just send everything to the best model" approach that costs 3x more.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Doesn't Work
&lt;/h2&gt;

&lt;p&gt;Things we tried that failed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static rules per task type&lt;/strong&gt;: Too rigid. "All debugging goes to Opus" misses that most debugging is simple.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM judge picking the model&lt;/strong&gt;: Recursive cost problem — the judge itself is expensive if you use a good model for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just eating the cost&lt;/strong&gt;: Works until your CFO sees the bill, or until your company bans Claude Code outright (Microsoft did this for its engineers last week).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Industry Is Catching Up
&lt;/h2&gt;

&lt;p&gt;Within 12 hours of Fable 5 launching, every major technical guide led with cost control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TrueFoundry&lt;/strong&gt;: "cost control isn't optional"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spicy Advisory&lt;/strong&gt;: "reserve it for high-value work, keep cheaper models for routine"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt;: already listing it with routing support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When every guide about a new model starts with "here's how to NOT use it for everything" — the single-model era is officially over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you're running any AI coding workflow (Claude Code, Cursor, Aider, custom agents):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit your current usage&lt;/strong&gt;: What percentage of your API calls actually need frontier-level reasoning?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start simple&lt;/strong&gt;: Route "obviously easy" tasks to a cheaper model. Even a basic keyword-based router saves 30-40%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add thinking effort&lt;/strong&gt;: For tasks that do need Fable 5, default to &lt;code&gt;medium&lt;/code&gt; instead of &lt;code&gt;max&lt;/code&gt;. Upgrade to &lt;code&gt;max&lt;/code&gt; only for specific task types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure per-task cost&lt;/strong&gt;: Not per-API-call cost. A single "refactor this module" can fan out into 30+ sub-agent calls. Track cost per user intent.&lt;/li&gt;
&lt;/ol&gt;
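&lt;p&gt;For the last step, a small tracker that groups API-call costs by the user intent that triggered them is usually enough to start. This is a sketch: &lt;code&gt;TaskCostTracker&lt;/code&gt;, the task IDs, and the per-call costs are invented for the example.&lt;/p&gt;

```python
from collections import defaultdict


class TaskCostTracker:
    """Accumulates API-call costs under the user intent that triggered them."""

    def __init__(self) -> None:
        self._cost = defaultdict(float)
        self._calls = defaultdict(int)

    def record(self, task_id: str, cost_usd: float) -> None:
        """Add one API call's cost to the running total for its task."""
        self._cost[task_id] += cost_usd
        self._calls[task_id] += 1

    def summary(self) -> dict:
        """Return call count and total cost per task."""
        return {
            task_id: {"calls": self._calls[task_id], "cost_usd": round(total, 4)}
            for task_id, total in self._cost.items()
        }


tracker = TaskCostTracker()
# One "refactor this module" request fanning out into 30 sub-agent calls.
for _ in range(30):
    tracker.record("refactor-billing-module", 0.04)
tracker.record("fix-typo", 0.01)
print(tracker.summary())
```

&lt;p&gt;Each call looks cheap in isolation at $0.04, but the refactor as a whole costs well over a dollar. That total is the number to compare across routing strategies.&lt;/p&gt;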

&lt;p&gt;The model keeps getting better. The pricing keeps going up. The only sustainable strategy is routing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We've been building routing tools for AI coding workflows at &lt;a href="https://coderouter.io" rel="noopener noreferrer"&gt;CodeRouter&lt;/a&gt;. If you're spending more than $500/mo on AI coding APIs, task-level routing will pay for itself in the first week.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>What Apple's WWDC Just Taught Us About AI Model Routing</title>
      <dc:creator>Bo Shen</dc:creator>
      <pubDate>Mon, 08 Jun 2026 19:58:44 +0000</pubDate>
      <link>https://dev.to/aplomb2/what-apples-wwdc-just-taught-us-about-ai-model-routing-22p6</link>
      <guid>https://dev.to/aplomb2/what-apples-wwdc-just-taught-us-about-ai-model-routing-22p6</guid>
      <description>&lt;p&gt;Today at WWDC 2026, Apple dropped what might be the most important signal in AI strategy this year — and it wasn't Siri AI, macOS Golden Gate, or iOS 27.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apple chose NOT to build its own frontier model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Siri AI, the ground-up rebuild Apple has been working on for two years, runs on Google's Gemini models. The world's most valuable company decided that model &lt;em&gt;selection and integration&lt;/em&gt; matters more than model &lt;em&gt;ownership&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This should change how every developer thinks about AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "One Model" Trap
&lt;/h2&gt;

&lt;p&gt;Most teams today pick a single model and route everything through it. Claude for coding. GPT for chat. Gemini for multimodal. Then they wonder why their bills are astronomical or their output quality is inconsistent.&lt;/p&gt;

&lt;p&gt;Microsoft just banned engineers from using Claude Code after realizing it cost more than the humans it was supposed to assist. Multiple Reddit threads this week show developers racking up $120K+ in API tokens, or getting called into HR after 10 days of heavy usage.&lt;/p&gt;

&lt;p&gt;The problem isn't the model. It's the assumption that one model fits all tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Apple Got Right
&lt;/h2&gt;

&lt;p&gt;Apple's approach with Siri AI is essentially a routing layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-device models handle simple queries (privacy-preserving, instant)&lt;/li&gt;
&lt;li&gt;Apple's own server-side models handle mid-tier tasks&lt;/li&gt;
&lt;li&gt;Gemini handles the complex reasoning and generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each query goes to the &lt;em&gt;right-sized&lt;/em&gt; model for the job. A timer request doesn't need a trillion-parameter model. A complex research question does.&lt;/p&gt;

&lt;p&gt;This is the same architecture pattern that's emerging in AI coding tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;What You Need&lt;/th&gt;
&lt;th&gt;What Most People Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Planning/Architecture&lt;/td&gt;
&lt;td&gt;Strong reasoning (Opus/o3-level)&lt;/td&gt;
&lt;td&gt;Opus for everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation&lt;/td&gt;
&lt;td&gt;Fast, accurate code gen (Sonnet/Flash)&lt;/td&gt;
&lt;td&gt;Opus for everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging&lt;/td&gt;
&lt;td&gt;Pattern matching + context (Flash/Haiku)&lt;/td&gt;
&lt;td&gt;Opus for everything&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests/Boilerplate&lt;/td&gt;
&lt;td&gt;Speed over intelligence (Haiku/Flash)&lt;/td&gt;
&lt;td&gt;Opus for everything&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;See the pattern? Using a frontier model for boilerplate is like using a Ferrari to go grocery shopping. It works, but the economics don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math That Changed My Workflow
&lt;/h2&gt;

&lt;p&gt;I run 10+ apps and was spending about $10K/month on Claude Code before I started routing tasks by complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planning phase&lt;/strong&gt;: Opus (worth the cost for architecture decisions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Sonnet 4 (roughly 90% of Opus quality at about 20% of the cost, in my experience)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test writing&lt;/strong&gt;: Flash/Haiku (these are pattern-matching tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt;: Depends on complexity — start with Flash, escalate if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: &lt;strong&gt;$10K → ~$3K/month&lt;/strong&gt; with no measurable drop in output quality. The savings came entirely from &lt;em&gt;not using the most expensive model for tasks that didn't need it&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Implement This
&lt;/h2&gt;

&lt;p&gt;You don't need special tools. You need discipline:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Categorize Before You Prompt
&lt;/h3&gt;

&lt;p&gt;Before sending a task to your AI coding tool, ask: "Does this need frontier intelligence, or am I just generating predictable code?"&lt;/p&gt;

&lt;p&gt;Most tasks fall into the "predictable" bucket — CRUD endpoints, test scaffolding, config files, migration scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Use Model Tiers
&lt;/h3&gt;

&lt;p&gt;Set up your environment with at least two model tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env or tool config
&lt;/span&gt;&lt;span class="n"&gt;PLANNING_MODEL&lt;/span&gt;=&lt;span class="n"&gt;claude&lt;/span&gt;-&lt;span class="n"&gt;opus&lt;/span&gt;-&lt;span class="m"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;IMPLEMENTATION_MODEL&lt;/span&gt;=&lt;span class="n"&gt;claude&lt;/span&gt;-&lt;span class="n"&gt;sonnet&lt;/span&gt;-&lt;span class="m"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;FAST_MODEL&lt;/span&gt;=&lt;span class="n"&gt;gemini&lt;/span&gt;-&lt;span class="m"&gt;2&lt;/span&gt;.&lt;span class="m"&gt;5&lt;/span&gt;-&lt;span class="n"&gt;flash&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Route manually at first. The patterns become obvious quickly.&lt;/p&gt;
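
&lt;p&gt;As a minimal sketch of how those tier variables could be consumed, the helper below maps a task type to a model name. The &lt;code&gt;TASK_TO_TIER&lt;/code&gt; mapping and the &lt;code&gt;model_for&lt;/code&gt; function are hypothetical names introduced for illustration, and the defaults simply repeat the example values from the config above.&lt;/p&gt;

```python
import os

# Defaults repeat the example config above; adjust to whatever models you use.
DEFAULT_TIERS = {
    "PLANNING_MODEL": "claude-opus-4",
    "IMPLEMENTATION_MODEL": "claude-sonnet-4",
    "FAST_MODEL": "gemini-2.5-flash",
}

# Hypothetical mapping from task type to tier variable (an assumption, not a standard).
TASK_TO_TIER = {
    "planning": "PLANNING_MODEL",
    "implementation": "IMPLEMENTATION_MODEL",
    "tests": "FAST_MODEL",
    "debugging": "FAST_MODEL",
}


def model_for(task, env=None):
    """Return the model name for a task type, reading overrides from the environment."""
    env = os.environ if env is None else env
    # Unknown task types fall back to the middle tier.
    tier = TASK_TO_TIER.get(task, "IMPLEMENTATION_MODEL")
    return env.get(tier, DEFAULT_TIERS[tier])
```

&lt;p&gt;Calling &lt;code&gt;model_for("planning")&lt;/code&gt; returns whatever &lt;code&gt;PLANNING_MODEL&lt;/code&gt; is set to in the environment, or the default if it is unset.&lt;/p&gt;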

&lt;h3&gt;
  
  
  3. Measure Before You Optimize
&lt;/h3&gt;

&lt;p&gt;Track your usage by task type for one week. Most developers find that 60-70% of their tokens go to tasks that a cheaper model handles equally well.&lt;/p&gt;
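
&lt;p&gt;One rough way to do that tracking is to log a task type and token count per request and then compute each task type's share of the total. The log format and the &lt;code&gt;token_share_by_task&lt;/code&gt; name below are assumptions for illustration, not part of any tool.&lt;/p&gt;

```python
from collections import defaultdict


def token_share_by_task(log):
    """Given (task_type, tokens) pairs, return each task type's share of all tokens."""
    totals = defaultdict(int)
    for task, tokens in log:
        totals[task] += tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {task: count / grand_total for task, count in totals.items()}


# Hypothetical week of logged requests.
week = [("tests", 300), ("planning", 100), ("tests", 100)]
for task, share in token_share_by_task(week).items():
    print(f"{task}: {share:.0%}")
```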

&lt;h3&gt;
  
  
  4. Escalation, Not Default
&lt;/h3&gt;

&lt;p&gt;Start every task at the lowest reasonable tier. If Flash can't handle it, escalate to Sonnet. If Sonnet struggles, bring in Opus. This "escalation ladder" catches most tasks at the cheapest tier.&lt;/p&gt;
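
&lt;p&gt;The ladder can be sketched in a few lines. Here &lt;code&gt;call_model&lt;/code&gt; and &lt;code&gt;is_good_enough&lt;/code&gt; are hypothetical callables you would supply yourself (an API wrapper and a quality check such as "do the tests pass"), and the model names are the example tiers from earlier in this post.&lt;/p&gt;

```python
# Cheapest tier first; names are the example tiers used in this post.
LADDER = ("gemini-2.5-flash", "claude-sonnet-4", "claude-opus-4")


def run_with_escalation(task, call_model, is_good_enough, ladder=LADDER):
    """Try each tier from cheapest up and stop at the first acceptable output.

    If no tier passes the check, the top tier's output is returned anyway.
    """
    model = None
    output = None
    for model in ladder:
        output = call_model(model, task)
        if is_good_enough(output):
            break
    return model, output
```

&lt;p&gt;The quality check is the hard part in practice. A cheap, automatic signal (compilation, tests, a linter) keeps escalation from becoming a manual judgment call on every request.&lt;/p&gt;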

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Apple just validated at the largest scale possible what indie developers have been learning the hard way: &lt;strong&gt;the future of AI isn't picking the best model — it's building the routing layer that picks the right model for each task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every week brings new models. DeepSeek, MiMo, Gemini, Claude, GPT — the landscape shifts constantly. Betting on one model is a losing strategy. Building an orchestration layer that can swap models as the landscape evolves? That's the moat.&lt;/p&gt;

&lt;p&gt;If the world's most valuable company decided that model routing beats model ownership, maybe it's time to rethink your own AI stack.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with multi-model workflows? Are you routing tasks to different models, or still using one model for everything? Drop your approach in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>GitHub Copilot's New Credit Pricing: A Token-by-Token Breakdown (And How to Cut Your AI Coding Bill by 70%)</title>
      <dc:creator>Bo Shen</dc:creator>
      <pubDate>Mon, 01 Jun 2026 20:37:22 +0000</pubDate>
      <link>https://dev.to/aplomb2/github-copilots-new-credit-pricing-a-token-by-token-breakdown-and-how-to-cut-your-ai-coding-bill-18jb</link>
      <guid>https://dev.to/aplomb2/github-copilots-new-credit-pricing-a-token-by-token-breakdown-and-how-to-cut-your-ai-coding-bill-18jb</guid>
      <description>&lt;p&gt;GitHub just switched Copilot to credit-based pricing and the community is in shock. Users are reporting bills jumping from $38/month to $800+. Here's what's actually happening, why, and what you can do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Credit Math Nobody Did
&lt;/h2&gt;

&lt;p&gt;1 AI Credit = $0.01. Copilot Pro+ gives you 7,500 credits/month for $39. That's $75 worth of credits for $39 — sounds generous until you realize how fast agent mode burns through them.&lt;/p&gt;

&lt;p&gt;A typical agentic coding session uses Claude 4.6 Sonnet at 1,500 credits per million output tokens. A single complex refactoring prompt can easily consume 50K-100K output tokens, costing $0.75-$1.50 per request. Eight requests and you've burned $6-12, or roughly 8-16% of your 7,500-credit monthly allowance.&lt;/p&gt;

&lt;p&gt;The math gets worse with Opus: 7,500 credits per million output tokens. One architecture planning session with Opus can eat 20% of your monthly budget in 10 minutes.&lt;/p&gt;
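
&lt;p&gt;The per-request arithmetic is easy to reproduce. The sketch below uses only the figures quoted above (credit value, monthly allowance, credits per million output tokens) and, as a simplifying assumption, ignores input tokens. The function names are illustrative.&lt;/p&gt;

```python
CREDIT_USD = 0.01        # 1 AI Credit = $0.01
MONTHLY_CREDITS = 7500   # Copilot Pro+ monthly allowance


def request_cost_usd(output_tokens, credits_per_million):
    """Dollar cost of one request, counting output tokens only."""
    credits = output_tokens * credits_per_million / 1_000_000
    return credits * CREDIT_USD


def tokens_for_budget_share(share, credits_per_million):
    """Output tokens that consume the given share of the monthly allowance."""
    return share * MONTHLY_CREDITS / credits_per_million * 1_000_000


# Sonnet at 1,500 credits per million output tokens.
print(round(request_cost_usd(50_000, 1500), 2))
print(round(request_cost_usd(100_000, 1500), 2))

# Opus at 7,500 credits per million: tokens needed to use 20% of the allowance.
print(round(tokens_for_budget_share(0.20, 7500)))
```

&lt;p&gt;At the Opus rate, 20% of the allowance works out to about 200K output tokens, which a long planning session can plausibly produce.&lt;/p&gt;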

&lt;h2&gt;
  
  
  The Real Problem: One Pipeline for Every Task
&lt;/h2&gt;

&lt;p&gt;Here's what most people miss. The cost explosion isn't just about token pricing — it's about model selection.&lt;/p&gt;

&lt;p&gt;Copilot's agent mode fires the same heavy-reasoning model at every task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tab completion?&lt;/strong&gt; Full reasoning pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing a test?&lt;/strong&gt; Full context window loaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple rename refactor?&lt;/strong&gt; Same model as architecture planning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I tracked my actual AI coding usage for a month, the breakdown looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;% of Requests&lt;/th&gt;
&lt;th&gt;Ideal Model Tier&lt;/th&gt;
&lt;th&gt;Cost/1M tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Implementation/boilerplate&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;td&gt;Sonnet/4o&lt;/td&gt;
&lt;td&gt;$3-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests, linting, docs&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;td&gt;Flash/mini&lt;/td&gt;
&lt;td&gt;$0.15-0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture/complex debug&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;td&gt;Opus/GPT-5&lt;/td&gt;
&lt;td&gt;$15-75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tab completion&lt;/td&gt;
&lt;td&gt;~5%&lt;/td&gt;
&lt;td&gt;Flash&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;65% of requests don't need an expensive model.&lt;/strong&gt; But when everything runs through one pipeline, you pay Opus prices for autocomplete.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Pay: $60/month for $200 Worth of Work
&lt;/h2&gt;

&lt;p&gt;I ditched Copilot three months ago and switched to direct API keys. Here's my actual setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning &amp;amp; Architecture:&lt;/strong&gt; Claude Opus or GPT-5 via API&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used for: system design, complex debugging, multi-file refactors&lt;/li&gt;
&lt;li&gt;~15% of my requests, ~$25/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt; Claude Sonnet 4.5&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used for: writing new features, code generation, PR descriptions&lt;/li&gt;
&lt;li&gt;~60% of my requests, ~$20/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tests &amp;amp; Docs:&lt;/strong&gt; Gemini Flash or GPT-4o-mini&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used for: unit tests, documentation, linting suggestions&lt;/li&gt;
&lt;li&gt;~25% of my requests, ~$5/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: ~$50-60/month&lt;/strong&gt; for the same (often better) output that would cost $400+ on Copilot's new credit system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Insight: Task Complexity Should Drive Model Selection
&lt;/h2&gt;

&lt;p&gt;Think about it like this. You wouldn't use a chainsaw to cut bread, and you wouldn't use a butter knife to fell a tree. But that's exactly what single-model coding tools do — they give you one tool for everything.&lt;/p&gt;

&lt;p&gt;The 70% savings don't come from cheaper API rates. They come from &lt;strong&gt;matching model capability to task complexity&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify the task type&lt;/strong&gt; before sending a prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route to the appropriate model tier&lt;/strong&gt; based on reasoning requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track actual token usage&lt;/strong&gt; to validate your model selections&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't hypothetical. My team went from $10K/month in AI coding costs to $3K by implementing this approach. Same output quality, same development velocity. The only difference was being intentional about which model handles which task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Steps If You're Leaving Copilot
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start tracking your actual usage patterns&lt;/strong&gt;&lt;br&gt;
Before optimizing, you need data. Log your requests for a week and categorize them. You'll probably find that 60-70% of your interactions don't need the top-tier model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Set up direct API access&lt;/strong&gt;&lt;br&gt;
Anthropic, OpenAI, and Google all offer straightforward per-token API pricing. Claude Sonnet at $3/$15 per million input/output tokens and Gemini Flash at $0.075/$0.30 are your workhorses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Use different models for different tasks&lt;/strong&gt;&lt;br&gt;
Tools like Claude Code, Cursor, and Aider all support model switching. Set your default to Sonnet, and manually switch to Opus only for complex reasoning tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Monitor your costs weekly&lt;/strong&gt;&lt;br&gt;
Track spend per model tier. If your Opus usage exceeds 20%, you're probably over-using it. If Flash usage is under 15%, you're probably under-using it.&lt;/p&gt;
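
&lt;p&gt;A small check like the following can turn those two rules of thumb into a weekly report. It assumes you already have spend totals keyed by tier name, and it interprets "usage" as share of spend. The tier keys and the &lt;code&gt;tier_usage_warnings&lt;/code&gt; name are hypothetical.&lt;/p&gt;

```python
def tier_usage_warnings(spend_by_tier, opus_max=0.20, flash_min=0.15):
    """Return warnings when tier spend shares cross the rule-of-thumb thresholds."""
    total = sum(spend_by_tier.values())
    if total == 0:
        return []
    warnings = []
    opus_share = spend_by_tier.get("opus", 0) / total
    flash_share = spend_by_tier.get("flash", 0) / total
    if opus_share > opus_max:
        warnings.append(f"Opus at {opus_share:.0%}: probably over-used")
    if flash_min > flash_share:
        warnings.append(f"Flash at {flash_share:.0%}: probably under-used")
    return warnings


# Hypothetical weekly spend in dollars.
for line in tier_usage_warnings({"opus": 30, "sonnet": 60, "flash": 10}):
    print(line)
```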

&lt;h2&gt;
  
  
  The Copilot Credit System Might Actually Be a Good Thing
&lt;/h2&gt;

&lt;p&gt;Hot take: transparent token pricing is better than opaque subscriptions. The old flat-rate model hid the true cost of AI coding, which meant nobody optimized their usage. Now that the costs are visible, the developers who learn to match models to tasks will come out way ahead.&lt;/p&gt;

&lt;p&gt;The ones who keep blasting everything through one expensive model? They'll pay $800/month and wonder why.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I run a portfolio of AI-powered apps and spend way too much time thinking about model costs. If you want to compare notes on cutting AI coding bills, I'm &lt;a href="https://x.com/aplomb2" rel="noopener noreferrer"&gt;@aplomb2&lt;/a&gt; on X.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>githubcopilot</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How We Cut Our AI Coding Bill by 65% Without Sacrificing Quality</title>
      <dc:creator>Bo Shen</dc:creator>
      <pubDate>Sun, 31 May 2026 03:45:31 +0000</pubDate>
      <link>https://dev.to/aplomb2/how-we-cut-our-ai-coding-bill-by-65-without-sacrificing-quality-2api</link>
      <guid>https://dev.to/aplomb2/how-we-cut-our-ai-coding-bill-by-65-without-sacrificing-quality-2api</guid>
      <description>&lt;p&gt;Last month, a post on r/ExperiencedDevs went viral: a company spending &lt;strong&gt;$1 million per month&lt;/strong&gt; on AI API costs. Layoffs wouldn't even make a meaningful dent.&lt;/p&gt;

&lt;p&gt;The painful part? They couldn't force teams onto cheaper models because quality genuinely dropped on complex tasks. Sound familiar?&lt;/p&gt;

&lt;p&gt;We faced the same wall at $10K/month across our team. Here's how we solved it — and cut costs by 65% without a single developer complaint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: All-or-Nothing Model Selection
&lt;/h2&gt;

&lt;p&gt;Most teams pick one model and use it for everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Opus for every API call? Powerful but expensive.&lt;/li&gt;
&lt;li&gt;Switch everyone to Haiku? Cheap but quality suffers on hard problems.&lt;/li&gt;
&lt;li&gt;Let devs choose per-request? Nobody bothers. They default to the best model every time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the "blanket model" trap. You're either overpaying or underperforming.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight: Not All Tasks Are Equal
&lt;/h2&gt;

&lt;p&gt;We audited 30 days of our API usage and discovered something obvious in hindsight:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;% of Calls&lt;/th&gt;
&lt;th&gt;Needs Frontier Model?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linting &amp;amp; formatting&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boilerplate generation&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simple completions&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test generation&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Rarely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex debugging&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture decisions&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review (nuanced)&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;60-70% of calls didn't need a frontier model at all.&lt;/strong&gt; Haiku, Gemini Flash, or even smaller models handled them just as well.&lt;/p&gt;

&lt;p&gt;But the remaining 30%? Those genuinely needed Opus-tier reasoning.&lt;/p&gt;
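
&lt;p&gt;To make the 60-70% figure concrete, here is the table above summed by its last column. The percentages are copied from the audit; the &lt;code&gt;share_by_need&lt;/code&gt; helper is just an illustrative name.&lt;/p&gt;

```python
# Percent of calls and "needs frontier model?" answer, copied from the audit table.
AUDIT = {
    "Linting and formatting": (15, "No"),
    "Boilerplate generation": (20, "No"),
    "Simple completions": (25, "No"),
    "Test generation": (10, "Rarely"),
    "Complex debugging": (15, "Yes"),
    "Architecture decisions": (10, "Yes"),
    "Code review (nuanced)": (5, "Yes"),
}


def share_by_need(audit):
    """Sum the percentage of calls for each answer in the last column."""
    totals = {}
    for pct, need in audit.values():
        totals[need] = totals.get(need, 0) + pct
    return totals


print(share_by_need(AUDIT))
```

&lt;p&gt;The "No" rows alone add up to 60%, and counting the "Rarely" row brings the total to 70%, which is where the 60-70% range comes from.&lt;/p&gt;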

&lt;h2&gt;
  
  
  The Solution: Task-Level Routing
&lt;/h2&gt;

&lt;p&gt;Instead of forcing a model choice at the team level, we implemented &lt;strong&gt;automatic routing by task complexity&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify the request&lt;/strong&gt; — Is this a simple completion or a complex reasoning task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route to the appropriate model&lt;/strong&gt; — Simple → Haiku/Flash. Complex → Opus/GPT-4.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make it invisible&lt;/strong&gt; — Developers don't pick models. The system picks for them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: &lt;strong&gt;routing should be invisible to developers&lt;/strong&gt;. If they have to think about which model to use, they'll always pick the most powerful one (just in case). The system needs to make that decision automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After 30 days of task-level routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost reduction: ~65%&lt;/strong&gt; ($10K → $3.5K/month)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality complaints: zero&lt;/strong&gt; — Complex tasks still got Opus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer friction: zero&lt;/strong&gt; — They didn't even notice the switch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency improvement: 40%&lt;/strong&gt; — Smaller models respond faster for simple tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest surprise? &lt;strong&gt;Quality actually improved&lt;/strong&gt; on some tasks. Smaller models are less prone to over-thinking simple requests. Ask Opus to format an import statement and it might refactor your entire file. Ask Haiku and it just... formats the import.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Implement This
&lt;/h2&gt;

&lt;p&gt;You have a few options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Manual Rules (Free)
&lt;/h3&gt;

&lt;p&gt;Write a classifier that routes based on prompt length, keywords, or context. Crude but effective for simple cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Simple heuristic routing
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;debug&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;architecture&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;review&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;opus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# middle ground
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: LLM-Based Classification
&lt;/h3&gt;

&lt;p&gt;Use a tiny model to classify the task before routing. This adds ~50ms of latency but is much more accurate than keyword rules.&lt;/p&gt;
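
&lt;p&gt;A rough sketch of this option, assuming you have some function that sends a prompt to a small model and returns its text reply. That function is passed in as &lt;code&gt;ask_small_model&lt;/code&gt;, a hypothetical callable, so the example does not depend on any particular provider SDK. The prompt wording and label set are also assumptions.&lt;/p&gt;

```python
# Tier names match the return values of the heuristic router above.
ROUTES = {"simple": "haiku", "moderate": "sonnet", "complex": "opus"}

CLASSIFIER_PROMPT = (
    "Classify this coding request as simple, moderate, or complex. "
    "Reply with one word.\n\nRequest: {request}"
)


def route_with_classifier(request, ask_small_model):
    """Ask a small model for a complexity label, then map the label to a tier."""
    reply = ask_small_model(CLASSIFIER_PROMPT.format(request=request))
    label = reply.strip().lower()
    # Fall back to the middle tier if the reply is not a known label.
    return ROUTES.get(label, "sonnet")
```

&lt;p&gt;Falling back to the middle tier on an unexpected reply keeps a misbehaving classifier from silently sending everything to the most expensive model.&lt;/p&gt;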

&lt;h3&gt;
  
  
  Option 3: Specialized Routing Tools
&lt;/h3&gt;

&lt;p&gt;Tools like &lt;a href="https://coderouter.io" rel="noopener noreferrer"&gt;CodeRouter&lt;/a&gt; handle this automatically for coding workflows — they classify by development phase (planning, implementation, testing, debugging) and route accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with data.&lt;/strong&gt; Audit your actual API usage before optimizing. You'll be surprised how many calls are trivial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't trust developers to self-route.&lt;/strong&gt; They'll always pick the best model "just in case." Make it automatic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure quality, not just cost.&lt;/strong&gt; Some tasks genuinely need frontier models. Don't cheap out on the 30% that matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The biggest savings aren't from switching models — they're from not using expensive models when you don't need to.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;60-70% of AI coding calls don't need a frontier model&lt;/li&gt;
&lt;li&gt;Task-level routing cuts costs 50-70% with zero quality loss&lt;/li&gt;
&lt;li&gt;Make routing invisible to developers&lt;/li&gt;
&lt;li&gt;The fix for "$1M/month AI bills" isn't fewer models — it's smarter model selection&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm Bo, founder building tools for AI-powered development. If you're drowning in API costs, I've been there. Happy to chat in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
