DEV Community: Bo Shen

Opus 5, GPT-5.6, Gemini 3.1: A Practical Guide to Picking the Right AI Model (Without Going Broke)

Bo Shen — Thu, 30 Jul 2026 20:32:37 +0000

July 2026 might be the most confusing month in AI model history. Anthropic shipped Opus 5. OpenAI dropped the GPT-5.6 family with three tiers (Sol, Terra, Luna). Google pushed Gemini 3.1 Pro Preview. And every one of them claims to be "the best."

If you're building with AI—especially using coding agents like Claude Code, Codex, or Cursor—you're probably asking the same question I was six months ago: which model should I actually use?

The answer that saved me $7K/month: it depends on the task.

The Problem: One Model Doesn't Fit All

Here's what my API bills looked like before I got smart about model selection:

January 2026: $10,200 (Claude Code on Opus 4.8 for everything)
February 2026: $9,800 (same approach, slightly less usage)
March 2026: $8,400 (started being "careful" — still hemorrhaging money)

The issue wasn't that Opus was bad. It was that I was using a $15/MTok output model for tasks that a $1/MTok model could handle just as well.

The Model Landscape Right Now (July 2026)

Let me break down what's actually available and what each is good for:

Tier 1: Heavy Reasoning ($10-25/MTok output)

Claude Opus 5 — $5/$25 per MTok (introductory $2/$10 through Aug 31). Best for architectural decisions, complex refactoring, novel algorithm design.
GPT-5.6 Terra — $3/$15. Strong at multi-file reasoning, good context window management.

Tier 2: Workhorse ($1-6/MTok output)

Claude Sonnet 5 — Excellent balance of capability and cost. Handles 70% of coding tasks.
GPT-5.6 Sol — $1/$6. Surprisingly capable for straightforward implementation.
Gemini 3.1 Pro — $2/$12 under 200K tokens. Great for large context analysis.

Tier 3: Speed ($0.25-1/MTok output)

Claude Haiku 4.5 — Fast, cheap, perfect for linting, formatting, simple edits.
GPT-5.6 Luna — The budget option that punches above its weight.

What I Actually Do: Task-Level Routing

Here's the framework I use now. Not every task needs the smartest model:

Planning & Architecture → Tier 1
When I'm designing a new system, thinking through edge cases, or making decisions that are expensive to reverse, I want the best model available. This is maybe 10% of my total token usage.

Implementation → Tier 2
Writing the actual code based on a clear plan? A Sonnet-class model handles this perfectly. This is 60% of my usage.

Testing, Debugging, Formatting → Tier 3
Writing test cases, fixing lint errors, reformatting code, generating boilerplate? Haiku-class. This is 30% of my usage.

The Math

Before task-level routing:

100% of tokens through Opus 4.8 at ~$15/MTok output
Monthly bill: ~$10K

After:

10% through Opus/Tier 1 at $10-25/MTok
60% through Sonnet/Tier 2 at $3-6/MTok
30% through Haiku/Tier 3 at $1/MTok
Monthly bill: ~$3K

Same output quality. 70% cost reduction.
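
As a rough check on that arithmetic, here is a minimal Python sketch. It assumes the bill scales linearly with the blended output price and takes the midpoint of each quoted range ($17.50 for Tier 1, $4.50 for Tier 2). The midpoints and the `blended_price` helper are illustrative assumptions, not figures from an actual invoice.

```python
def blended_price(mix):
    """Weighted-average output price per MTok for a {tier: (share, price)} mix."""
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1.0")
    return sum(share * price for share, price in mix.values())

before = {"opus_everything": (1.00, 15.00)}
after = {
    "tier1": (0.10, 17.50),  # assumed midpoint of the $10-25 range
    "tier2": (0.60, 4.50),   # assumed midpoint of the $3-6 range
    "tier3": (0.30, 1.00),
}

before_price = blended_price(before)
after_price = blended_price(after)
savings = 1 - after_price / before_price
projected_bill = 10_000 * after_price / before_price  # scale the ~$10K baseline

print(f"blended price: ${before_price:.2f} -> ${after_price:.2f} per MTok")
print(f"savings: {savings:.0%}, projected bill: ${projected_bill:,.0f}")
```

With these assumptions the blended price lands near $4.75/MTok, roughly 68% below the all-Opus baseline, which is in line with the ~$10K to ~$3K figures above.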

The key insight: you're not degrading quality. You're matching capability to complexity. Nobody brings a bulldozer to plant a flower.

Practical Implementation

If you're using Claude Code or Codex with BYOK (bring your own key), you can implement this yourself:

Classify the task before sending it to the model. Is it planning, implementation, or maintenance?
Route to the appropriate tier. Most API providers make it trivial to switch models per request.
Track results. Log which model handled which task and review weekly. You'll find the boundaries quickly.
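
The three steps above can be sketched in a few lines of Python. This is only an illustration: the keyword rules, the `tier1-model` style identifiers, and the `routing_log.jsonl` file name are all hypothetical placeholders, and a real classifier would need more than word matching.

```python
import json
import re
import time

# Hypothetical category -> model mapping; substitute your provider's model IDs.
TIER_MODELS = {
    "planning": "tier1-model",
    "implementation": "tier2-model",
    "maintenance": "tier3-model",
}

# Naive keyword rules, checked in order (illustrative only).
KEYWORDS = [
    ("planning", {"design", "architecture", "plan", "migration"}),
    ("maintenance", {"lint", "format", "reformat", "test", "tests", "boilerplate"}),
]

def classify(task: str) -> str:
    """Step 1: label the task as planning, implementation, or maintenance."""
    words = set(re.findall(r"[a-z]+", task.lower()))
    for category, keywords in KEYWORDS:
        if words & keywords:
            return category
    return "implementation"  # default to the workhorse tier

def route(task: str, log_path: str = "routing_log.jsonl") -> str:
    """Steps 2 and 3: pick the model for the task and log the decision."""
    category = classify(task)
    model = TIER_MODELS[category]
    entry = {"ts": time.time(), "task": task, "category": category, "model": model}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return model
```

The returned model name would then be passed as the per-request model parameter, and the JSONL log is what you review weekly to find where the tier boundaries are wrong.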

Some things I learned the hard way:

Don't use Tier 3 for debugging complex race conditions. I tried. It suggested "add a sleep(1)" as the fix. Multiple times.
Tier 1 is overkill for writing unit tests from a clear spec. You're burning $15/MTok to generate assertEqual statements.
Context window matters more than raw intelligence for large codebase navigation. Gemini 3.1 Pro's pricing structure actually favors this use case.

The Bigger Picture

We're entering an era where "which AI model?" is the wrong question. The right question is "which AI model for this specific task?"

The developers who figure this out first will have a massive cost advantage. Not because they're using worse tools, but because they're using the right tool at the right moment.

My $10K → $3K journey wasn't about cutting corners. It was about cutting waste.

I'm Bo. I've shipped 10+ apps and spend way too much time thinking about AI costs. Currently building tools to automate the model selection process. If you're burning through API credits, I've probably made the same mistakes you're about to make.

The Real Cost of AI Coding in July 2026: What Nobody Tells You About Claude Code, Codex, and Cursor Bills

Bo Shen — Mon, 27 Jul 2026 20:42:13 +0000

Everyone's comparing AI coding tool prices on paper. Claude Code Max at $200/mo. Cursor Ultra at $200/mo. Codex with ChatGPT Pro at $200/mo.

But after 6 months of running 10+ production apps across these tools, I can tell you: the sticker price tells you almost nothing about your actual cost.

Here's what I learned tracking every dollar.

The pricing page lie

Comparison articles love neat tables. "$200/mo for Claude Code Max, $200/mo for Cursor Ultra, credits included with ChatGPT Pro."

What they don't mention:

You'll use multiple tools. Nobody I know uses just one. Claude Code for complex architecture, Codex for quick fixes, Cursor for exploration. That's $400-600/mo before you even start counting API overages.
Subscription caps are softer than they look. Hit your Claude Code limit mid-sprint? You either wait or burn API credits at Opus 4 rates ($15/MTok input, $75/MTok output). That "predictable $200" becomes $800+ in crunch weeks.
The expensive model runs on everything by default. Claude Code defaults to whatever the latest flagship is. Right now that's the Fable 5 / Opus 4.8 family. Your git commit -m "fix typo" gets the same $75/MTok model as your "redesign the authentication system" task.

My actual numbers (Q1 2026)

Running a portfolio of AI-powered apps (fitness, photo editing, content tools), here's what my team's monthly AI coding spend looked like before any optimization:

Category	Monthly Cost
Claude Code (Max + API overages)	$4,200
Cursor (Pro × 2 seats + Agent usage)	$1,800
Codex / ChatGPT Pro	$600
One-off API calls (testing, debugging)	$3,400
Total	~$10,000/mo

That's $120K/year on AI coding tools. For a small team.

Where the money actually goes

I spent two weeks instrumenting our workflows with cost tracking. The breakdown was eye-opening:

60% of spend went to tasks that didn't need a frontier model (formatting, boilerplate, simple CRUD, test generation)
25% of spend was the right model on the right task (architecture decisions, complex debugging, security reviews)
15% of spend was pure waste (re-running failed prompts, context window overflow, unnecessary iterations)

The insight: most AI coding tasks are not equal, but we treat them like they are.

The fix: task-level routing

The approach that worked for us was embarrassingly simple: match the model to the task, not the other way around.

Instead of sending everything to Opus 4.8 / Fable 5:

Planning & architecture → Frontier model (Opus 4.8 / Fable 5). This is where reasoning quality actually matters. Maybe 10-15% of your tasks.
Implementation → Mid-tier model (Sonnet 4 / GPT-5.6). Plenty capable for writing code from a clear spec. 40-50% of tasks.
Debugging & fixes → Depends on complexity. Simple bugs → fast model. Gnarly race conditions → frontier. 20-30% of tasks.
Tests, docs, formatting → Cheapest model that works (Haiku 4 / Flash 3). 15-20% of tasks.

The result

After implementing task-level routing across our workflow:

	Before	After	Change
Monthly spend	~$10,000	~$3,000	-70%
Code quality	Baseline	Same or better	—
Development speed	Baseline	~15% faster	⬆️

The speed improvement surprised me. Turns out, smaller models respond faster for simple tasks. Less waiting = more shipping.

What I'd tell you if we were grabbing coffee

Track your actual spend first. Don't optimize blind. You might be surprised where the money goes.
Not every keystroke needs GPT-5.6 / Opus 4.8. That test file doesn't need a $75/MTok model. That commit message doesn't need 200K context.
The "unlimited" plans aren't unlimited. Read the fair usage policies. If you're a power user shipping 8+ hours/day, you'll hit walls.
Multi-tool is the reality. Budget for using 2-3 tools, not one. Each has strengths.
The cheapest token is the one you don't send. Better prompts, smaller context windows, and knowing when to stop iterating save more than any pricing plan.

The elephant in the room

AI coding tool costs are going up, not down. Models are getting more capable but also more expensive at the frontier. Cursor hit $2B ARR. Anthropic's pricing keeps climbing.

The developers and teams who figure out intelligent model routing — using the right model for each specific task — will have a structural cost advantage.

It's not about being cheap. It's about not being wasteful.

I'm building tools to make this easier. If you want to compare notes on AI coding costs, I'm @aplomb2 on X or find me here on Dev.to.

Why Burning Opus on Every Claude Code Turn Is the #1 Cost Mistake in AI Coding

Bo Shen — Thu, 23 Jul 2026 20:43:06 +0000

I tracked my Claude Code spending for three months. The finding that changed everything: 60-70% of agent turns don't need a frontier model.

File reads. Grep commands. Test reruns. Simple edits from a clear spec. These tasks produce the same results on Haiku as on Opus, but at 1/60th the cost.

The Numbers That Convinced Me

Month 1 (all Opus, no routing): ~$10,200
Month 3 (task-level routing): ~$3,100

Same codebase. Same velocity. Same code quality on the work that matters.

Why This Is Happening Now

This week alone, three new routing tools launched:

Ramp Router (by Ramp, the fintech company) — OpenAI-compatible endpoint that routes per request
Entelligence Model Router — picks the model per agent turn, benchmarked against direct Opus on Terminal Bench
Frugal (open source) — Claude Code hooks that delegate subtasks to cheaper tiers

Add these to existing options like LiteLLM, OpenRouter, and Portkey, and routing is clearly becoming a category, not a feature.

The fact that a fintech company, an AI startup, and an open-source dev all shipped the same idea within days tells you something: the single-model-per-session paradigm is breaking down.

The Three-Tier Model That Works

After testing various configurations, here is what stuck:

Tier 1: Deterministic (zero model calls)

If a shell command answers the question — grep, jq, git log, wc -l — do not call a model at all. This handles ~15-20% of agent turns.

Tier 2: Cheap model (Haiku / Luna / similar)

File location, text extraction, mechanical edits from a spec, log parsing, simple refactors. The model needs to follow instructions, not reason deeply. ~40-50% of turns.

Tier 3: Frontier model (Opus / Fable / GPT-5.5)

Architecture decisions, complex debugging, design reviews, novel algorithm implementation. The work where model quality actually changes the outcome. ~30-35% of turns.
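
The deterministic tier is the easiest one to sketch. The idea is to try a local handler first and fall through to a model only when nothing matches. In this minimal Python example the request phrasing, the `DETERMINISTIC` table, and the `answer_without_model` helper are all hypothetical. It stands in for what `wc -l` would do in a shell.

```python
import re
from typing import Optional

def count_lines(path: str) -> str:
    """Local equivalent of `wc -l`: no model call involved."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return str(sum(1 for _ in handle))

# (pattern, handler) pairs. The phrasing is made up for illustration.
DETERMINISTIC = [
    (re.compile(r"^how many lines (?:are )?in (\S+?)\??$", re.IGNORECASE), count_lines),
]

def answer_without_model(request: str) -> Optional[str]:
    """Return a locally computed answer, or None if a model call is needed."""
    for pattern, handler in DETERMINISTIC:
        match = pattern.match(request.strip())
        if match:
            return handler(*match.groups())
    return None  # fall through to the cheap or frontier tier
```

Anything that returns `None` here goes on to the cheap or frontier tier; everything else costs zero tokens.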

The Escalation Problem

The naive approach is letting the cheap model decide when it is stuck. This does not work. In my testing, cheap models were confidently wrong in both directions — claiming they could not handle tasks they could, and claiming success when they had produced subtly broken code.

What works: verified escalation. A tier only steps up when a concrete check fails — the test suite, the compiler, a schema validation, a diff that does not apply cleanly. One retry per step, capped.
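
A minimal Python sketch of verified escalation, assuming each tier is a callable that takes a task and returns output, and that `check` wraps whatever concrete verification you have (test suite, compiler, schema validation). The function name, the stub models, and the string-based check are hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence

def run_with_escalation(
    task: str,
    tiers: Sequence[Callable[[str], str]],
    check: Callable[[str], bool],
    retries_per_tier: int = 1,
) -> Optional[str]:
    """Try tiers cheapest-first; step up only when `check` rejects the output."""
    for call_model in tiers:
        for _ in range(1 + retries_per_tier):  # first attempt plus capped retries
            output = call_model(task)
            if check(output):
                return output
    return None  # every tier failed the check; hand the task to a human

calls = []

def cheap_model(task: str) -> str:  # stub standing in for a budget-tier call
    calls.append("cheap")
    return "def add(a, b): return a - b"  # subtly broken

def frontier_model(task: str) -> str:  # stub standing in for a frontier call
    calls.append("frontier")
    return "def add(a, b): return a + b"

def passes_check(code: str) -> bool:  # stand-in for running a real test suite
    return "a + b" in code

result = run_with_escalation("write add()", [cheap_model, frontier_model], passes_check)
print(calls)  # ['cheap', 'cheap', 'frontier']
```

The cheap model never gets to decide whether it is stuck: the check does, and the retry cap keeps a failing tier from looping.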

What Does Not Change

Routing saves money on the mechanical work. It does not make hard problems easier.

If you are spending $200/month on Claude Code Max and it is mostly going to actual reasoning work, routing might save you 30%. If you are spending $10K/month on API and most of it is burning Opus tokens on grep-equivalent tasks, routing might save you 70%.

The ROI depends on your ratio of thinking to typing.

The Real Lesson

The AI coding cost conversation keeps framing itself as "which model is cheapest" or "which subscription is the best deal." Both miss the point.

The right question is: for each turn in your agent session, what is the cheapest model that produces an identical outcome?

Most of the time, the answer is cheaper than what you are running.

I have been building apps with AI coding agents for the past year. Currently shipping 10+ products across iOS, web, and API. The cost data above is from real production usage across multiple codebases.

75 Companies Exposed the Real AI Cost Problem. It's Not the Models — It's the Routing.

Bo Shen — Mon, 20 Jul 2026 20:01:30 +0000

A YC-backed startup recently shared findings from 75+ customer conversations across every industry. One theme dominated everything else:

"Everyone defaults to the most expensive model because nobody knows which to use."

This isn't a technology problem. It's a routing problem. And it's costing companies millions.

The Numbers Nobody Wants to Talk About

Here's what they found:

CFOs have one line item for AI — not broken down by team, project, or agent
Token costs get compared to cloud invoices in every finance meeting
Uber burned its entire 2026 AI budget in four months and couldn't connect the spend to customer outcomes
One fintech said per-use-case model selection was the single highest-impact request they had

The punchline? Every company they talked to wanted the same thing: a way to know which model fits which task. Nobody has it.

Why "Default to Max" Is Burning Money

When nobody knows which model fits which task, the rational choice is the most powerful one. Every time.

I've seen this pattern firsthand. Running 10+ AI apps, our Claude Code bills hit $10K/month before we traced the root cause:

60-70% of coding tasks don't need your most expensive model. They need the right model.

Here's the actual breakdown from our production workloads:

Task Type	What Most Teams Use	What Actually Works	Cost Delta
Planning & Architecture	Fable 5 / Opus 4.8	Opus 4.8 ✅	0%
Implementation	Fable 5 / Opus 4.8	Sonnet 4.8	-85%
Linting & Formatting	Fable 5 / Opus 4.8	Haiku 4.8	-95%
Code Review	Fable 5 / Opus 4.8	Sonnet 4.8	-85%
Test Generation	Fable 5 / Opus 4.8	Haiku 4.8	-95%

The waste is structural, not accidental.

"Just Pick a Cheaper Model" Doesn't Work Either

I hear this a lot:

"Just use Sonnet for everything." It breaks on architecture tasks: you get shallow scaffolding instead of thoughtful system design.

"Just use Haiku for everything." It misses critical nuance in code review.

The real answer is task-level routing: automatically matching each coding task to the right model based on what's actually being done.

We built this into our workflow. The result: $10K/month → $3K/month. Same output quality on the metrics that matter — test pass rates, PR review accuracy, architecture coherence scores.

No model downgrade. Just smarter dispatch.

The Maturity Gap Is Enormous

The same research highlighted another pattern: the gap between companies who've figured this out and those who haven't is widening fast.

One neobank moved AI into mission-critical ops and is running a hiring freeze against 50-60% growth — funded entirely by AI efficiency gains
Meanwhile, a Director of AI at a large agency said they're "3-4 years from ready"

The difference isn't budget or talent. It's visibility. The neobank knows exactly which models power which workflows, what they cost per task, and where the waste sits. The agency treats AI as one amorphous blob of compute.

What Actually Works (From Cutting Our Own Bill by 70%)

Four things, in order:

1. Audit your model usage by task type.
Most teams have genuinely never done this. You'll be shocked how much Opus or Fable is being used for tasks that Haiku handles perfectly.

2. Identify your "Opus tasks" vs "Haiku tasks."
In our portfolio, it's roughly 30/70. Yours might differ — the point is that the split exists.

3. Route automatically.
Manual model selection doesn't scale past a single developer. You need a system that classifies the task and dispatches to the right model without human intervention.

4. Measure per-task quality.
Cost savings mean nothing if output degrades. Track the metrics that matter for each task type — not just "did it complete" but "did it complete well."
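
For the audit in step 1, a minimal Python sketch might look like the following. It assumes you already log one record per request with a task type, a model, and a dollar cost; the field names and the sample records are made up for illustration.

```python
from collections import defaultdict

def spend_by_task_type(records):
    """Sum cost per (task_type, model); return rows sorted by spend, highest first."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["task_type"], record["model"])] += record["cost_usd"]
    grand_total = sum(totals.values())
    rows = [
        (task_type, model, cost, cost / grand_total if grand_total else 0.0)
        for (task_type, model), cost in totals.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Made-up records; real ones would come from your own request logs.
sample = [
    {"task_type": "formatting", "model": "frontier", "cost_usd": 30.0},
    {"task_type": "architecture", "model": "frontier", "cost_usd": 25.0},
    {"task_type": "formatting", "model": "frontier", "cost_usd": 30.0},
    {"task_type": "tests", "model": "frontier", "cost_usd": 15.0},
]

for task_type, model, cost, share in spend_by_task_type(sample):
    print(f"{task_type:<14}{model:<10}${cost:>7.2f}  {share:.0%}")
```

Rows where a frontier model sits next to a mechanical task type are the first candidates for rerouting.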

The Bottom Line

The companies winning the AI cost game aren't the ones with the best prompts, the biggest budgets, or the newest models.

They're the ones who figured out that model selection is a routing problem, not a spending problem.

The CFO in that research wanted AI spend tracked as precisely as ad spend — every dollar attributed, every channel measured. That same discipline is coming to AI. The question is whether you build it proactively or get surprised by the bill.

I run a portfolio of 10+ AI-powered apps and write about what actually moves the needle on AI costs. Previously cut our team's AI coding bills from $10K to $3K/month with task-level routing. Find me on X @aplomb2.

Why Uber's $1,200 Claude Code Session Is Actually a Routing Problem

Bo Shen — Thu, 16 Jul 2026 19:57:54 +0000

Uber burned through its entire 2026 AI coding budget in four months. One executive racked up a $1,200 bill in a single two-hour Claude Code session. By spring, 95% of their engineers had adopted AI coding tools, with heavy users hitting $2,000 per month.

Their response? Spending caps at $1,500 per engineer.

But caps are a bandaid. The real problem is architectural.

The Tokenmaxxing Trap

CNBC coined the term "tokenmaxxing" — companies incentivizing developers to use as much AI as possible without worrying about results. Uber even had internal leaderboards ranking engineers by Claude Code usage.

This is the predictable outcome when you give every engineer access to frontier models with no routing logic. Every task — from complex architecture decisions to writing unit tests — gets processed by the most expensive model available.

It's like giving every employee a first-class plane ticket for every trip, including the 30-minute drive to the office.

What Actually Costs Money (And What Doesn't)

After months running ~$10K/month in Claude Code API bills across multiple products, I started tracking which tasks actually benefit from frontier reasoning. The breakdown was surprising:

Tasks that genuinely need frontier models (~15-20%):

Complex architectural decisions spanning multiple services
Novel algorithm design with non-obvious edge cases
Tricky refactors that require understanding implicit dependencies
Debugging production issues with subtle race conditions

Tasks that run fine on mid-tier models (~60-65%):

Standard feature implementation from clear specs
Code reviews and suggestions
Refactoring with clear patterns (extract method, rename, reorganize)
Writing integration tests

Tasks where a fast, cheap model is sufficient (~20%):

Boilerplate generation
Unit test scaffolding
Documentation
Linting-style fixes and formatting

The ratio was roughly 15/65/20 — meaning 80% of our API spend was going to frontier models for tasks that didn't need them.

Route by Task Type, Not by Preference

The fix isn't picking a cheaper model. It's picking the right model for each step.

Here's the mental model:

Planning/Architecture  -> Frontier (Opus, Sol Ultra)
Implementation         -> Mid-tier (Sonnet, Sol Standard)
Tests/Docs/Boilerplate -> Fast (Haiku, Luna)

When we implemented this routing — matching the model tier to the coding phase — our monthly bill dropped from ~$10K to ~$3K. Same output quality. Same velocity. 70% cost reduction.

The key insight: the model doesn't know what task it's working on, but the harness does. If your coding agent knows it's generating unit tests, it doesn't need to spin up Opus. If it's planning a complex migration, it absolutely should.

Why Caps Don't Work

Uber's $1,500/month cap addresses the symptom, not the cause. Here's what happens with caps:

Engineers self-ration on the wrong tasks. They'll skip AI assistance on easy tasks (where it's cheapest and most helpful) and save their budget for hard tasks (where the cost is highest).
You lose the 80% productivity gain. Most AI coding value comes from the mundane — scaffolding, boilerplate, test generation. Caps discourage this usage disproportionately.
Caps create political problems. Who gets the higher tier? The senior architect or the junior dev who needs AI more? Every cap becomes a negotiation.

Task-level routing solves all three. Every engineer gets unlimited access. The system just picks the right model for each step.

The Industry Is Figuring This Out

Lindy's CEO recently switched 100% of their traffic from Anthropic to DeepSeek — saving millions. But wholesale model switching is a blunt instrument. You lose quality on the tasks that need it.

The smarter move: route the 80% of tasks that don't need frontier reasoning to cheaper models, and keep frontier for the 20% where it matters.

This is where AI coding tools are heading. The era of "pick one model and use it for everything" is ending. The next generation of tooling will route by task type automatically — no human in the loop deciding "is this a Sol Ultra or Sol Standard task" for every prompt.

Getting Started

If you're running AI coding tools at scale, here's a practical starting point:

Instrument your usage. Track which task types consume the most tokens.
Identify your 80%. Most teams find that implementation, tests, and docs account for the bulk of spend.
Set up tiered routing. Even manual tiers (e.g., different API keys for different task types) cut costs significantly.
Measure quality, not tokens. The goal isn't fewer tokens — it's the same quality at lower cost.

Uber's $1,200 session wasn't a Claude Code problem. It was a routing problem. And every team running AI coding at scale will hit the same wall — unless the harness gets smarter about matching tasks to models.

I've been building task-level routing tools for AI coding workflows. If this resonates, check my profile for more on the $10K to $3K journey.

How We Cut AI Coding Costs from $10K to $3K/Month with Task-Level Model Routing

Bo Shen — Mon, 13 Jul 2026 20:00:05 +0000

The Problem Nobody Talks About

Every AI coding tool sells you on the frontier model. Claude Code defaults to Fable 5. Codex pushes Sol. Cursor uses whatever's newest.

But here's what six months of running production AI workloads taught me: 65% of your AI coding calls don't need a frontier model.

Tests? Boilerplate? Documentation? File scaffolding? You're burning $50/Mtok output tokens on work that a $3/Mtok model handles identically.

Last quarter we spent $10,400/month across Claude Code, Codex, and API calls. After implementing task-level routing, we're at $3,100. Same output quality. Same velocity. Here's exactly how.

What Is Task-Level Model Routing?

Traditional model selection: pick the best model, use it for everything.

Task-level routing: classify each task, then route it to the cheapest model that can handle it well.

This isn't new. Coinbase built a routing layer that cut their LLM spend nearly in half while usage kept growing. The concept is simple — the implementation details matter.

For coding specifically, here's how tasks break down:

Task Type	% of Calls	Model Needed	Cost Tier
Architecture & Planning	~10%	Frontier (Fable 5, Sol)	$$$$
Complex Implementation	~25%	Mid-tier (Opus 4.8, Terra)	$$$
Boilerplate & Scaffolding	~30%	Fast (Sonnet 5, Luna)	$$
Tests & Documentation	~25%	Budget (Haiku, Nano)	$
Refactoring	~10%	Mid-tier	$$$

The math is brutal when you realize most teams send everything to the top row.

The Quick Win: CLAUDE_CODE_SUBAGENT_MODEL

If you're on Claude Code, this one environment variable is worth hundreds per month:

export CLAUDE_CODE_SUBAGENT_MODEL=claude-sonnet-5

Claude Code's orchestrator (Fable) spawns sub-agents for individual file edits, test generation, and scaffolding tasks. By default, every sub-agent also uses Fable. Setting CLAUDE_CODE_SUBAGENT_MODEL routes these worker tasks to Sonnet instead.

Result: Same quality plans. Same architecture decisions. 60% fewer Fable tokens consumed. Your 5-hour rate limit suddenly stretches to 12+ hours of productive work.

The Full Setup: Multi-Provider Routing

For teams using multiple tools, here's the architecture that got us from $10K to $3K:

Step 1: Classify Your Tasks

Before routing, you need to know what you're routing. We track every AI call by category:

Planning calls: Architecture decisions, system design, complex debugging
Implementation calls: Writing new features, integrating APIs
Maintenance calls: Tests, docs, formatting, linting fixes
Review calls: Code review, refactoring suggestions

Step 2: Assign Models by Category

Our current routing table (July 2026 prices):

Planning       → Fable 5 ($10/$50 per Mtok) or Sol ($5/$30)
Implementation → Opus 4.8 ($15/$75) or Terra ($2.50/$15)
Maintenance    → Sonnet 5 ($3/$15) or Luna ($0.50/$3)
Review         → Haiku ($0.25/$1.25) or Nano ($0.10/$0.60)

Step 3: Measure and Adjust

The routing table isn't static. We review weekly:

If a cheaper model produces noticeably worse output for a category → upgrade
If a category has zero quality complaints at current tier → try downgrading
Track rejection rate (how often you redo AI-generated work) per model per category
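
The weekly review can be sketched in Python along these lines. It assumes you record one event per piece of AI-generated work with its category, model, and whether it had to be redone; the 20% upgrade threshold and the sample events are assumptions for illustration, not values from our setup.

```python
from collections import Counter

def rejection_rates(events):
    """events: iterable of (category, model, was_rejected) tuples."""
    totals, rejected = Counter(), Counter()
    for category, model, was_rejected in events:
        totals[(category, model)] += 1
        rejected[(category, model)] += int(was_rejected)
    return {key: rejected[key] / totals[key] for key in totals}

def suggest(rate, upgrade_above=0.20):
    """Map a rejection rate to a routing-table action (threshold is an assumption)."""
    if rate > upgrade_above:
        return "upgrade"
    if rate == 0:
        return "try downgrading"
    return "keep"

# Made-up events for illustration.
events = (
    [("maintenance", "budget", False)] * 10
    + [("implementation", "mid", True)] * 3
    + [("implementation", "mid", False)] * 7
    + [("planning", "frontier", True)] * 1
    + [("planning", "frontier", False)] * 9
)

for (category, model), rate in rejection_rates(events).items():
    print(f"{category:<16}{model:<10}{rate:.0%}  {suggest(rate)}")
```

In this sample, maintenance has no rejections and becomes a downgrade candidate, implementation at 30% gets flagged for an upgrade, and planning at 10% stays where it is.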

Real Numbers: Before and After

Before (January 2026):

Claude Code: ~$4,200/mo (all Opus/Fable)
Codex API: ~$3,800/mo (all GPT-5)
Direct API calls: ~$2,400/mo
Total: $10,400/mo

After (June 2026):

Claude Code: ~$1,400/mo (Fable orchestrator + Sonnet sub-agents)
Codex: ~$900/mo (Sol planning + Luna implementation)
Direct API calls: ~$800/mo (routed by task type)
Total: $3,100/mo

Quality impact: Rejection rate went from 12% to 11%. Slightly better, because cheaper models are often more deterministic for straightforward tasks.

Common Objections

"Won't cheaper models produce worse code?"

For planning and architecture: yes, significantly. That's why you keep frontier models for those tasks. For writing a React component from a clear spec? Sonnet 5 and Luna are virtually indistinguishable from Fable.

"This sounds like a lot of overhead to manage."

The initial setup takes an afternoon. After that, it's a 15-minute weekly review. The ROI is immediate — we saved more in the first week than the setup time cost.

"What about when models get cheaper?"

They will, and that's great. Routing still helps because the spread between frontier and budget tiers stays large. Even if everything drops 50%, the relative savings from routing remain.

What's Next

The market is moving toward this pattern fast. MindStudio, OpenClaw's ClawRouter, and open-source tools like Plano are all building routing layers. Anthropic themselves benchmarked "Fable orchestrates, cheap models execute" at 96% performance for 46% of the cost.

Outcome-based pricing (pay per completed task, not per token) is coming. But until then, task-level routing is the most practical way to control AI coding costs without sacrificing quality.

I'm Bo — I've shipped 10+ apps and spend way too much time optimizing AI workflows. Currently building tools that make model routing automatic. Find me @aplomb2 on X.

Fable 5 Goes Credit-Only Tomorrow — Here's How to Not Go Broke

Bo Shen — Mon, 06 Jul 2026 19:40:18 +0000

Tomorrow (July 7, 2026), Anthropic pulls Fable 5 out of subscription plans. Every Fable 5 call moves to usage credits: $10 per million input tokens, $50 per million output tokens.

No more flat-rate safety net. Every token counts.

I've been running AI coding agents at scale for months ($10K+/month at peak). Here's what I've learned about surviving per-token billing — and actually spending less.

The Real Problem Isn't the Price

Fable 5 at $50/Mtok output is expensive. But the real cost killer isn't the rate — it's sending every task to the most expensive model.

A Reddit user just went viral after losing $20 on a single "hey" message. Claude Code resent 847,000 tokens of session context. At Fable 5 rates, that's a meal.

But even without the context resend bug, most teams waste 60-70% of their AI budget on tasks that don't need frontier-level reasoning.

The 5-Stage Framework That Cut Our Bill 70%

We categorized every coding task into 5 stages:

Stage 1: Planning & Architecture

Model: Frontier (Fable 5, Opus)
Why: This is where model quality actually matters. System design, complex architecture decisions, novel problem-solving.
Cost share: ~15% of tokens, ~40% of budget

Stage 2: Implementation

Model: Mid-tier (Sonnet 5, GPT-4.1)
Why: 90% of implementation is pattern-matching against well-known solutions. Mid-tier models handle this fine.
Cost share: ~40% of tokens, ~30% of budget

Stage 3: Debugging & Testing

Model: Budget (Haiku, Flash)
Why: Reading stack traces, generating test cases, fixing lint errors. These are mechanical tasks.
Cost share: ~20% of tokens, ~10% of budget

Stage 4: File Operations

Model: Budget or cached
Why: Reading files, searching codebases, listing directories. You're literally paying frontier prices to cat a file.
Cost share: ~15% of tokens, ~5% of budget

Stage 5: Review & Refinement

Model: Frontier
Why: Final code review, security audit, performance optimization. Worth the premium.
Cost share: ~10% of tokens, ~15% of budget

The Math

Before routing:

100% of tasks → Fable 5 at $50/Mtok output
Monthly bill: ~$10,000

After routing:

25% of tasks → Frontier ($50/Mtok)
40% → Mid-tier (~$8/Mtok)
35% → Budget (~$0.80/Mtok)
Monthly bill: ~$3,000

Same code quality on the tasks that matter. 70% less spend.

Practical Tips for Tomorrow

Start fresh sessions frequently. Context accumulates. Every message resends the full history. New session = reset the meter.
Set spending caps in Claude Console. Do this today, before the switch. Anthropic lets you cap monthly spending.
Audit your last week of usage. Look at what percentage of your calls actually needed the frontier model. I bet it's under 30%.
Use prompt caching aggressively. Cached input tokens are 90% cheaper. If you're sending the same system prompt repeatedly, cache it.
Consider the Copilot flat-rate option. GitHub Copilot gives access to Claude models at a flat subscription price. For some workflows, this is cheaper than per-token.
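
To see why fresh sessions and caching matter, here is a rough Python sketch of input cost when every message resends the full history. It uses the $10/MTok input rate and the 90% cached-token discount quoted above. The 5,000 new tokens per turn is a made-up figure, and the model ignores output tokens, any cache-write surcharge, and cache expiry.

```python
INPUT_PRICE = 10.00    # $/MTok, the Fable 5 input rate quoted above
CACHE_DISCOUNT = 0.90  # cached input tokens are "90% cheaper"

def session_input_cost(turns: int, tokens_per_turn: int, cached: bool = False) -> float:
    """Input cost in dollars when each turn resends the whole prior history."""
    cost = 0.0
    history = 0  # tokens already in the conversation before this turn
    for _ in range(turns):
        if cached:
            # Prior history is billed at the discounted rate, new tokens at full rate.
            billed = history * (1 - CACHE_DISCOUNT) + tokens_per_turn
        else:
            billed = history + tokens_per_turn
        cost += billed * INPUT_PRICE / 1_000_000
        history += tokens_per_turn
    return cost

one_long = session_input_cost(50, 5_000)
two_fresh = 2 * session_input_cost(25, 5_000)
one_long_cached = session_input_cost(50, 5_000, cached=True)

print(f"one 50-turn session:      ${one_long:.2f}")
print(f"two 25-turn sessions:     ${two_fresh:.2f}")
print(f"50-turn session, cached:  ${one_long_cached:.2f}")
```

Under these assumptions the single long session costs about $63.75 in input alone, splitting it into two fresh sessions roughly halves that to $32.50, and caching the resent history brings the long session under $9. Cost grows with the square of session length, which is why resetting the meter helps so much.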

The Bigger Picture

The July 7 switch isn't a crisis — it's the market telling us something important. We've been treating frontier AI models like a utility when they're actually a premium resource.

The companies that thrive in the per-token era won't be the ones who find the cheapest model. They'll be the ones who match model cost to task complexity.

That's not just a cost optimization. It's a better way to build.

I've been building tools for AI coding cost optimization. If you're interested in task-level routing, check my profile for more.

Your Claude Code Bill Quietly Got 5x Worse — And They Were Tracking You Too

Bo Shen — Thu, 02 Jul 2026 20:45:05 +0000

This has been a rough week for Anthropic's developer trust.

The Invisible Price Hike

Developer Vincent Schmalbach published detailed logs showing Claude Code's effective cost increased approximately 5x — without any pricing change announcement.

His numbers are hard to argue with:

Previous heavy weeks: ~8.9M and ~8.5M visible Opus tokens
Current week: ~1.4M visible Opus tokens
Same subscription. Same machine. Same workflow.

That's roughly 83% fewer tokens for the same money. His broader metric (including cache creation) tells a similar story: about 80% less effective output.

The worst part? A fresh account burned through its entire 5-hour quota with zero visible Opus rows in the logs. The meter moved, but the ledger didn't explain why.

As Schmalbach puts it: "Developers don't need a fancy progress bar. We need a ledger."

The Tracking Controversy

The same week, security researchers discovered Claude Code was quietly embedding location-tracking code to identify users in China or affiliated with Chinese AI labs.

Anthropic called it "anti-abuse." The code used XOR encoding and base64 to hide domain classification lists. As The Register reported: "This is not a malicious feature, but it is a weird choice for a developer tool that asks for trust."

After backlash on Reddit and social media, Anthropic rolled it back.

The Real Problem: Single-Vendor Dependency

These aren't isolated incidents. They're symptoms of the same underlying issue: when you depend on a single AI provider, you're at their mercy — for pricing, for privacy, for everything.

Here's what I learned after 6 months of running AI coding workloads across multiple providers:

Not Every Task Needs the Best Model

We tracked 30 days of coding agent usage and found a consistent pattern:

Task Type	Model Needed	Cost Impact
Architecture decisions	Frontier (Opus/Fable)	Worth it
Multi-file refactors	Frontier	Worth it
Boilerplate generation	Mid-tier (Sonnet/GPT-4o)	70% cheaper
Test generation	Any capable model	85% cheaper
Linting/formatting	Cheapest available	90% cheaper

The Numbers

By routing tasks to the appropriate model tier, we went from $10K/month to $3K/month on AI coding costs. Not by using less AI — by using the right AI for each task.

The breakdown:

~30% of tasks genuinely needed frontier models
~40% worked perfectly with mid-tier models
~30% could run on the cheapest option with no quality difference

Privacy as a Bonus

When you route across providers, no single company sees your entire codebase. After this week's tracking revelation, that's not just a cost optimization — it's a security practice.

What You Can Do Today

Audit your usage: Tools like ccusage show exactly where your tokens go. Most developers are shocked by how much goes to routine tasks.
Categorize your tasks: Before hitting "send," ask: does this genuinely need Opus/Fable? Or would Sonnet handle it fine?
Try task-level routing: Route planning to frontier models, implementation to mid-tier, and tests to whatever's cheapest.
Diversify providers: Don't let one company control your pricing AND your privacy.

The Bigger Picture

Anthropic makes great models. Claude is genuinely the best coding AI for complex tasks. But "best model" and "only model" are very different strategies.

The era of trusting a single AI vendor with your entire development workflow — your code, your costs, your data — ended this week.

Build your routing layer. Your wallet and your IP will thank you.

I'm Bo, founder of a team that ships 10+ apps. We cut our AI coding costs by 70% through task-level model routing. Follow me on X @aplomb2 for more on building affordably with AI.

The $500M Claude Code Problem: Why Most Teams Pay 3x What They Should for AI Coding

Bo Shen — Mon, 29 Jun 2026 19:54:42 +0000

Enterprise AI coding bills are hitting absurd numbers. One source told Axios that a client spent $500 million in a month on Claude Code. Gartner's latest data says 23% of tech leaders are spending $200-500 per developer per month on tokens alone. Uber reportedly burned through its entire 2026 Claude Code budget by April and had to cap spending at $1,500/month per employee.

These aren't edge cases anymore. This is the new normal. And the uncomfortable truth is that most of this spend is waste.

The One-Model Trap

Here's what typically happens: A team adopts Claude Code or Copilot. They default to the most powerful model available because that's the safest bet. Every task — from scaffolding a React component to planning a complex distributed system migration — runs through the same frontier model at the same price.

The problem? Roughly 70-80% of coding tasks don't require frontier-level reasoning. Writing boilerplate, generating tests from existing code, formatting, simple refactors, documentation — these tasks get identical results from models that cost 5-10x less.

You're paying Michelin-star prices for every meal, including the toast.

What Task-Level Routing Actually Looks Like

The concept is simple: match model capability to task complexity. In practice, you're creating tiers:

Tier 1 — Frontier model (Opus/o3-pro):

System architecture decisions
Complex algorithm design
Cross-service refactoring
Security-critical code review

Tier 2 — Mid-tier model (Sonnet/GPT-4o):

Feature implementation from clear specs
Code review for standard patterns
Bug fixes with clear reproduction steps

Tier 3 — Fast/cheap model (Haiku/Flash/DeepSeek):

Boilerplate generation
Test scaffolding
Documentation
Linting suggestions
Simple formatting/renaming

Real Numbers

I run a team of 5 devs. Before routing, our monthly AI coding bill was consistently above $10K. Most of that was Opus tokens on tasks that any mid-tier model could handle.

After implementing task-level routing:

Month 1: $10,200 → $4,800 (basic tier mapping)
Month 3: Stabilized at ~$3,100 (refined classification + caching)
Quality metrics: Zero regression in PR review scores, test coverage, or bug rates

The 70% cost reduction came primarily from moving test generation and boilerplate to Tier 3. These tasks had identical output quality regardless of model tier.

The Classification Problem

The hardest part isn't the routing — it's accurately classifying task complexity before execution. Some approaches:

Rule-based: Pattern matching on task descriptions. "Write tests for..." → Tier 3. "Design the architecture for..." → Tier 1. Simple, brittle, but gets you 60% of the way there.

LLM-based classification: Use a cheap model to classify the task first, then route to the appropriate tier. Adds a few cents of overhead but dramatically improves accuracy. The classifier itself costs almost nothing compared to running every task through Opus.

Hybrid: Rules for obvious cases, LLM classification for ambiguous ones. This is where most teams end up after iterating.
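
A minimal sketch of the hybrid approach, assuming three tiers numbered 1 to 3. The regex patterns are illustrative, and `llm_classifier` is a hypothetical callable standing in for whatever cheap model you use to classify ambiguous tasks.

```python
import re
from typing import Callable, Optional

# Hypothetical rules, checked in order; the first matching pattern wins.
RULES = [
    (re.compile(r"\b(design|architect\w*|migrat\w*)\b", re.I), 1),
    (re.compile(r"\b(write tests?|boilerplate|rename|format|docs?)\b", re.I), 3),
    (re.compile(r"\b(implement|fix|review)\b", re.I), 2),
]


def classify_by_rules(task: str) -> Optional[int]:
    """Return a tier for obvious cases, or None when no rule matches."""
    for pattern, tier in RULES:
        if pattern.search(task):
            return tier
    return None


def classify(task: str, llm_classifier: Callable[[str], int], default_tier: int = 2) -> int:
    """Rules first; fall back to the (hypothetical) cheap-model classifier for ambiguous tasks."""
    tier = classify_by_rules(task)
    if tier is not None:
        return tier
    guess = llm_classifier(task)
    # Guard against a malformed classifier answer by falling back to the mid tier.
    return guess if guess in (1, 2, 3) else default_tier
```

Only the tasks the rules can't place ever reach the model-based classifier, which keeps the classification overhead small.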

The Bigger Picture

The AI coding cost problem isn't going away. Models are getting more capable, which means more tasks get delegated to them, which means bills keep growing. The answer isn't spending less on AI coding — it's spending smarter.

Companies like Uber capping spend at $1,500/month per dev are treating the symptom. Task-level routing treats the cause.

If your team is spending more than $2K/month per developer on AI coding tokens and you're running everything through a single model tier, you're leaving 50-70% of that budget on the table.

The efficiency gains are real. The implementation isn't rocket science. The only question is how long you'll keep paying frontier prices for commodity tasks.

I've been building tools around AI coding cost optimization. Happy to discuss implementation details in the comments.

Uber Burned Through Its Entire AI Coding Budget in 4 Months. Here's What Smart Teams Do Instead.

Bo Shen — Wed, 24 Jun 2026 22:03:26 +0000

The AI coding bill just became everyone's problem. In the last two weeks alone:

Uber blew through its entire 2026 Claude Code budget by April and capped employees at $1,500/month
Gartner reported that 23% of tech leaders now spend $200-500 per developer per month on AI coding tokens alone
GitHub flipped Copilot to usage-based billing, turning a predictable $19/seat into an open-ended credit drain
Ramp's AI Index shows the top 1% of firms spending $7,500/employee/month on AI — $90K/year per head, up 14.1% in a single month

The pattern is clear: agentic workflows burn tokens faster than any flat budget anticipated. And single-vendor lock-in makes it worse — when your only option is Opus 4.8 at $75/M output tokens, every wasted thinking loop is expensive.

The Real Problem: Not All Tasks Need the Best Model

Here's what I learned after watching my own AI coding spend hit $10K/month earlier this year.

I was sending everything to Claude Opus. Code planning? Opus. Writing unit tests? Opus. Formatting a config file? Opus. Renaming a variable across three files? Opus.

That's like hiring a senior architect to move furniture. The work gets done, but you're massively overpaying.

When I actually profiled my usage, the breakdown looked like this:

~15% of tasks genuinely needed frontier reasoning (complex architecture decisions, subtle bug diagnosis, multi-file refactors with tricky dependencies)
~25% of tasks needed solid mid-tier capability (implementing features from clear specs, writing meaningful tests, code review)
~60% of tasks were mechanical (formatting, renaming, boilerplate generation, simple file operations, documentation updates)

That 60% was burning frontier-tier tokens for work that Haiku, Gemini Flash, or even a local model could handle identically.

Task-Level Routing: The Boring Fix That Saves 60-70%

The concept is simple: instead of routing every request to one model, classify each task and send it to the cheapest model that can handle it well.

Planning phase → Frontier model (Opus, GPT-5). This is where reasoning depth matters. You want the model that catches edge cases your spec missed.

Implementation → Mid-tier model (Sonnet, GPT-4.1). Given a clear plan, most code generation doesn't need maximum intelligence — it needs reliable instruction-following.

Tests, formatting, docs → Fast/cheap model (Haiku, Flash, Gemini 2.5). These tasks have objectively verifiable outputs. Either the test passes or it doesn't. You don't need 200 IQ for assertEqual.

Debug/diagnosis → Frontier model again. When something breaks in a non-obvious way, you want the best reasoning available.

After implementing this approach, my monthly spend dropped from ~$10K to ~$3K. Same output quality. Same velocity. Just stopped overpaying for routine work.

How to Actually Do This

You don't need custom infrastructure. Here's the practical version:

1. Audit Your Token Usage

Before optimizing, know where your tokens go. Log the actual prompts hitting the API for a week. You'll probably find:

Context bloat (frameworks serializing full state into every call)
Unnecessary thinking loops (model "reasoning" about trivial operations)
Repeated system prompts eating 10K+ tokens per call
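
If you already log your calls, the audit can be a few lines. This sketch assumes a hypothetical log format: one dict per call with `category`, `input_tokens`, and `output_tokens` keys. Your own logs will have different field names.

```python
from collections import defaultdict


def token_share_by_category(records):
    """Return each category's share of total tokens, largest first."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += rec["input_tokens"] + rec["output_tokens"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {category: tokens / grand_total for category, tokens in ranked}


if __name__ == "__main__":
    # Made-up sample records, only to show the shape of the input.
    sample = [
        {"category": "formatting", "input_tokens": 5000, "output_tokens": 1000},
        {"category": "architecture", "input_tokens": 3000, "output_tokens": 1000},
    ]
    for category, share in token_share_by_category(sample).items():
        print(f"{category}: {share:.0%}")
```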

2. Create Task Categories

Start simple — three tiers is enough:

Tier 1 (Frontier): Architecture, complex debugging, security-sensitive code
Tier 2 (Mid): Feature implementation, test writing, code review
Tier 3 (Fast): Formatting, documentation, boilerplate, simple edits

3. Route Based on the Task, Not the Session

The key insight: routing should happen at the task level, not the session level. A single coding session might need Opus for the initial design, Sonnet for implementation, and Haiku for writing tests — all within the same workflow.

Most teams I've talked to start with manual routing (just switching models themselves) and then automate it once they see the pattern.

4. Monitor and Adjust

Track cost-per-task, not just total spend. When you see a Tier 3 task consuming $2 worth of tokens on a frontier model, that's a routing failure. When a Tier 1 task fails on a cheap model, that's also a routing failure. The goal is the cheapest tier that still completes the task reliably.
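
Both kinds of routing failure can be flagged automatically. A minimal sketch, assuming you record one row per task run; the `TaskRun` fields and the `overspend_usd` threshold are hypothetical names, with the $2 default borrowed from the example above.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_id: str
    task_tier: int    # tier the task was classified as (1 = frontier, 3 = fast)
    model_tier: int   # tier of the model that actually ran it
    cost_usd: float
    succeeded: bool


def routing_failures(runs, overspend_usd=2.0):
    """Flag tasks that ran on a stronger model than needed, or failed on a weaker one."""
    failures = []
    for run in runs:
        if run.model_tier < run.task_tier and run.cost_usd >= overspend_usd:
            # A cheap task burned real money on a stronger model than it needed.
            failures.append((run.task_id, "overpowered"))
        elif run.model_tier > run.task_tier and not run.succeeded:
            # A hard task was sent to a weaker model and failed.
            failures.append((run.task_id, "underpowered"))
    return failures
```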

The Bigger Picture

Ramp's data tells an interesting story: the companies spending the most on AI aren't the ones in trouble. The ones in trouble are companies locked into a single vendor with no ability to route.

"The top 1% of firms tend to mix and match, bouncing between multiple frontier models and platforms that give them access to cheaper models." — Ramp AI Index

This isn't about spending less on AI. It's about spending smarter. The teams that figure out task-level routing now will have a structural cost advantage as agentic workflows become the default.

The $10K/month developer AI bill is already here. The question is whether you're paying it because you need to, or because you never bothered to check which tasks actually require the expensive model.

I've been building apps with AI coding tools for the past year and tracking the economics obsessively. Happy to share specific numbers or discuss routing strategies in the comments.

Loop Engineering Is Replacing Prompt Engineering — Here's What That Means for Your AI Coding Bill

Bo Shen — Mon, 22 Jun 2026 20:23:36 +0000

If you've been following AI coding tools this month, you've seen the quote everywhere:

"I don't prompt Claude anymore. I have loops running that prompt Claude. My job is to write loops." — Boris Cherny, Head of Claude Code at Anthropic

This isn't just a catchy soundbite. It represents a fundamental shift in how developers interact with AI coding agents — and it has massive cost implications that almost nobody is talking about.

The Evolution Nobody Asked For (But Everyone Needed)

The progression looks like this:

Prompt engineering (2023): Craft the perfect prompt, get one good output
Context engineering (2024): Get the right information to the model
Harness engineering (2025): Design the environment a single agent runs in
Loop engineering (2026): Design systems that spawn, monitor, and verify autonomous agent work

Each step shifted leverage away from "writing better prompts" toward "designing better systems." Loop engineering is the logical endpoint: the human stops being in the loop entirely and starts designing the loop itself.

Why This Happened

Here's the architectural constraint that drives everything: LLMs are stateless. They forget everything between sessions. Every piece of context — project rules, prior decisions, intermediate results — must live outside the model.

When you prompt one turn at a time, you are the memory system. You hold the context in your head and feed it back each turn. That works for small tasks. For anything multi-step, it collapses under its own overhead.

Loop engineering is the systems design response: instead of holding context manually, you build a small system that:

Holds context externally (files, git, memory docs)
Decides what to prompt next
Dispatches the agent
Checks whether the work is done
Loops until complete

The Cost Problem Nobody Warns You About

Here's where it gets dangerous: token costs in autonomous loops multiply with every iteration.

A single manual Claude Code session might cost $0.50-2.00. An autonomous loop doing the same work might make 10-50x more API calls because it's:

Reading files to understand context (every loop iteration)
Making exploratory changes and reverting
Running tests and interpreting failures
Retrying with different approaches

Without guardrails, a loop that runs overnight can burn through $200+ on what should have been a $5 task.

Three Guardrails Every Loop Needs

1. Budget Guards (Non-Negotiable)

Set a hard dollar cap per loop execution. Not per session — per task. If your agent is implementing a feature, cap it at $10. If it's fixing a typo, cap it at $0.50. The cap should reflect the value of the task, not the model's appetite.
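
A per-task budget guard can be very small. This is a sketch, not a drop-in: `BudgetGuard` and `BudgetExceeded` are hypothetical names, prices are passed in per million tokens, and you'd call `charge` with the token counts your API reports after each request.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task's accumulated spend passes its cap."""


class BudgetGuard:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
        """Record one API call (prices are dollars per million tokens) and enforce the cap."""
        cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
        self.spent_usd += cost
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} against a ${self.cap_usd:.2f} cap"
            )
        return cost


if __name__ == "__main__":
    guard = BudgetGuard(cap_usd=0.50)  # e.g. a typo fix
    try:
        while True:
            guard.charge(10_000, 2_000, input_price=5.0, output_price=25.0)
    except BudgetExceeded as exc:
        print("stopping loop:", exc)
```

Note that the check runs after the call that crosses the cap, so the last request is still paid for. Leave some headroom when you pick the cap.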

2. A Separate Verifier Model

This is the insight most people miss: use a cheap model to verify the expensive model's work.

Your implementation loop runs on Opus or o3 (the expensive frontier model). But the verifier — the model that checks "did the tests pass? does the code compile? does this match the spec?" — can run on Haiku or GPT-4o-mini at 1/20th the cost.

The verifier runs after every iteration and decides: continue, retry with a different approach, or stop and escalate to a human.

3. Task-Level Model Routing

This is the biggest cost lever available, and it's orthogonal to loop engineering itself.

Not every step in a loop needs a frontier model. The pattern that works:

Architecture/Planning → Frontier (Opus, o3) — needs deep reasoning
Implementation → Mid-tier (Sonnet, GPT-4o) — good enough for code generation
Test writing → Fast/Cheap (Haiku, Flash) — boilerplate-heavy, pattern matching
File reading/grep → No model needed — tool calls only

In practice, ~80% of coding tasks don't need frontier-tier reasoning. Routing those to mid-tier models cuts your loop costs by 60-70% without meaningful quality loss on the work that matters.

What This Looks Like in Practice

If you're using a coding agent today, here's the minimum viable loop:

1. Agent reads task description + project context
2. Agent plans approach (frontier model)
3. Agent implements (mid-tier model, budget-capped)
4. Verifier checks (cheap model): tests pass? Linter clean?
5. If no → loop back to 3 with error context
6. If yes → commit and report

The human's job is designing steps 1-6 and setting the budget caps. The models handle everything inside the loop.
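
Steps 1-6 can be sketched as a plain function. The `plan`, `implement`, and `verify` callables are hypothetical stand-ins for your frontier, mid-tier, and cheap verifier calls. The `max_iterations` limit is my own assumption, added so the loop always terminates even without a dollar cap.

```python
from typing import Callable, Tuple


def run_loop(
    task: str,
    plan: Callable[[str], str],
    implement: Callable[[str, str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 5,
) -> dict:
    """Plan once, then implement and verify until the verifier passes or attempts run out."""
    approach = plan(task)                       # steps 1-2: read the task, plan the approach
    feedback = ""
    for attempt in range(1, max_iterations + 1):
        result = implement(approach, feedback)  # step 3: implement, with prior error context
        ok, feedback = verify(result)           # step 4: tests pass? linter clean?
        if ok:
            return {"status": "done", "attempts": attempt, "result": result}  # step 6
        # step 5: loop back to step 3, carrying the verifier's feedback
    return {"status": "escalate", "attempts": max_iterations, "result": None}
```

Returning an "escalate" status rather than raising keeps the decision about what to do next (retry later, page a human) outside the loop itself.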

The Bottom Line

Loop engineering isn't just a new buzzword — it's a genuine paradigm shift in how we use AI coding tools. But it comes with a cost trap that can 10x your bill if you're not careful.

The developers who'll win are the ones who combine autonomous loops with intelligent routing and verification. Let the system work while you sleep, but make sure it's working efficiently.

The game isn't better prompts anymore. It's better systems.

I cut my team's AI coding bill from $10K/mo to under $3K by implementing task-level model routing. The approach described in this article is exactly how we did it. If you're interested in routing, check out coderouter.io.

Claude Fable 5 Went from Free to Offline in 72 Hours — What I Learned About AI Coding Costs

Bo Shen — Mon, 15 Jun 2026 20:30:01 +0000

Last week, Anthropic launched Fable 5 — their most powerful model ever — free for all Pro/Max subscribers through June 22.

Three days later, the US government issued an export control directive. Fable 5 went dark worldwide.

Developers who hardcoded claude-fable-5 in their workflows woke up to broken pipelines. Anthropic received the directive at 5:21pm ET on June 12 and had to comply immediately.

This isn't a post about geopolitics. It's about what this event reveals about the true cost of AI-assisted coding — and why model routing is the most underrated skill in a developer's toolkit right now.

The Real Cost of AI Coding in June 2026

Let's talk numbers that most people aren't tracking:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Typical coding session cost
Claude Fable 5	$10	$50	$5-15 per task
Claude Opus 4.8	$5	$25	$2-8 per task
Claude Sonnet 4	$1.50	$7.50	$0.50-2 per task
GPT-5.5	~$2.50	~$10	$1-3 per task
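
To turn the per-token prices in the table into a per-session figure, the arithmetic looks like this. The prices come from the table above (the GPT-5.5 row is approximate); the 400K input / 40K output session is a hypothetical workload, so substitute your own token counts.

```python
# (input, output) prices in dollars per 1M tokens, copied from the table above.
PRICES = {
    "Claude Fable 5": (10.0, 50.0),
    "Claude Opus 4.8": (5.0, 25.0),
    "Claude Sonnet 4": (1.5, 7.5),
    "GPT-5.5": (2.5, 10.0),  # approximate, per the table
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session at list prices, ignoring caching and discounts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical session: 400K input tokens, 40K output tokens.
    for model in PRICES:
        print(f"{model}: ${session_cost(model, 400_000, 40_000):.2f}")
```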

One Reddit user reported burning $200 in under 60 minutes with Fable 5. Another tracked 35 Claude Code subscriptions that would cost $80K/month at API rates.

The Insight: 80% of Your Coding Tasks Don't Need the Most Powerful Model

I run multiple AI coding agents daily across a portfolio of 10+ apps. Six months ago, my monthly AI coding bill hit $10K.

Today it's around $3K.

The difference wasn't switching to cheaper models across the board. It was routing different task types to the right model:

What Actually Needs Frontier Models (Fable/Opus)

Complex architectural decisions
Multi-file refactoring with subtle dependencies
Novel algorithm implementation
Debugging race conditions or memory leaks

What Works Great with Mid-Tier Models (Sonnet/GPT-5.5)

Boilerplate generation and scaffolding
Unit test writing
Documentation
Simple bug fixes
Code formatting and linting

What Smaller Models Handle Fine

Commit message generation
Simple string transformations
Template filling
Configuration file updates

When I actually tracked which model was doing what, I found that roughly 60-70% of my tokens were going to tasks that a Sonnet-class model would handle equally well.

The Fable 5 Shutdown Proved Something Else

Beyond cost, the overnight shutdown exposed a resilience problem.

If your entire workflow depends on a single model from a single provider, you don't have a workflow — you have a single point of failure.

My setup fell back to Opus 4.8 automatically when Fable went offline. No configuration changes, no manual intervention, no lost work. That's not because I predicted a government export control order. It's because I assumed any model can become unavailable at any time.

This has happened before:

OpenAI rate limits during peak hours
Anthropic's extended outage in March
Google's API deprecation cycle

Building model fallback chains isn't paranoia. It's good engineering.
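
A fallback chain doesn't have to be complicated. In this sketch, `call_model` is a hypothetical function wrapping your provider SDK, `ModelUnavailable` is an exception you'd raise from it on outages or rate limits, and the second model ID in the usage example is a made-up placeholder.

```python
from typing import Callable, Sequence, Tuple


class ModelUnavailable(Exception):
    """Raised by call_model when a model is offline, rate-limited, or deprecated."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first that responds."""
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models in the chain failed: " + "; ".join(errors))


if __name__ == "__main__":
    def fake_call(model: str, prompt: str) -> str:
        # Stand-in for a real API call; pretends the first model is offline.
        if model == "claude-fable-5":
            raise ModelUnavailable("offline")
        return f"{model} handled: {prompt}"

    print(call_with_fallback("refactor this", ["claude-fable-5", "claude-opus-4-8"], fake_call))
```

Only `ModelUnavailable` triggers a fallback here. Other errors (bad request, auth failure) propagate, since retrying them on a different model usually won't help.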

How to Start Routing Today

You don't need fancy infrastructure. Here's a simple approach:

1. Classify your tasks

Before sending a prompt, tag it: planning, implementation, debugging, testing, documentation, formatting.

2. Create a routing table

planning       → opus/fable (complex reasoning matters)
implementation → sonnet (good enough, roughly 3-7x cheaper per the price table above)
debugging      → opus (needs deep understanding)
testing        → sonnet (formulaic, template-driven)
documentation  → sonnet (clarity over intelligence)
formatting     → haiku/small (trivial tasks)

3. Track and iterate

Log which model handled which task, then review: did the cheaper model produce acceptable results? Over time, you'll discover your personal routing table.
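
Steps 2 and 3 fit in one small module. This sketch mirrors the routing table above; sending unknown tags to the strongest model is my own assumption, and `RoutingLog` is a hypothetical in-memory stand-in for wherever you actually store results.

```python
from collections import defaultdict

ROUTING_TABLE = {
    "planning": "opus",
    "implementation": "sonnet",
    "debugging": "opus",
    "testing": "sonnet",
    "documentation": "sonnet",
    "formatting": "haiku",
}


def route(tag: str) -> str:
    """Look up the model for a task tag; unknown tags go to the strongest model (assumption)."""
    return ROUTING_TABLE.get(tag, "opus")


class RoutingLog:
    """Tracks how often each (tag, model) pair produced an acceptable result."""

    def __init__(self):
        self._counts = defaultdict(lambda: [0, 0])  # (tag, model) -> [accepted, total]

    def record(self, tag: str, model: str, accepted: bool) -> None:
        entry = self._counts[(tag, model)]
        entry[0] += int(accepted)
        entry[1] += 1

    def acceptance_rates(self) -> dict:
        return {key: accepted / total for key, (accepted, total) in self._counts.items()}
```

A pair with a high acceptance rate is a candidate for moving one tier cheaper; a low one should probably move up.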

The Bigger Picture

The AI coding landscape in June 2026 looks like this:

Models are getting more capable AND more expensive at the top end
The gap between tiers is narrowing for common tasks
Availability is no longer guaranteed (regulatory, rate limits, outages)
Smart routing beats brute-force spending every time

The developers who'll thrive aren't the ones with unlimited API budgets. They're the ones who treat model selection as an engineering problem — matching the right tool to the right task, with fallbacks for when things go wrong.

I'm Bo. I run 10+ AI-powered apps and spend too much time thinking about model costs. Previously cut our team's Claude Code bill from $10K/mo to $3K with task-level routing. Find me @aplomb2 on X.