Suraj Khaitan

Posted on Jun 20

🚀 I Ran Claude Code on Every New Claude Model. Here's What Actually Ships.

#agents #ai #claude #llm

Fable, Mythos, Opus 4.8, Sonnet 4.6, Haiku — Anthropic's 2026 lineup is no longer "one model you talk to." It's a fleet you route between. I spent a month inside Claude Code orchestrating all of them across real codebases. Here's which model to reach for, when, and the routing playbook that quietly doubled my throughput.

Why I Went Down This Rabbit Hole (Again)

Last time I wrote about Claude Skills and called Claude Code the killer host for them. Since then, two things happened that changed how I work day to day.

First, the models got genuinely strange-good. In the span of a few months Anthropic shipped Sonnet 4.6, Opus 4.8, and then an entirely new tier above Opus — the Mythos class — released to the public as Claude Fable 5. We went from "the AI suggested a decent diff" to Stripe reporting that Fable 5 ran a codebase-wide migration on a 50-million-line Ruby codebase in a single day — work that would've taken a team over two months by hand.

Second, Claude Code stopped being a single-model tool. With a fleet of models at different price/speed/intelligence points, the highest-leverage skill in 2026 isn't prompting — it's routing. Knowing which model to put on which task is the difference between burning $200 of tokens on a typo fix and one-shotting a multi-service refactor.

So I did the obvious thing: I wired all of them into Claude Code and ran them against real work for a month — bug fixes, migrations, greenfield features, test suites, the boring stuff and the scary stuff. This is what I learned.

TL;DR

The lineup is now a ladder: Haiku → Sonnet 4.6 → Opus 4.8 → Fable 5 → Mythos 5. Each rung trades cost for capability and patience for long-horizon autonomy.
Sonnet 4.6 is your default. Frontier-ish coding at $3/$15 per million tokens with a 1M-token context window. Most of your work should live here.
Opus 4.8 is the reliable senior. Better judgment, roughly 4× less likely than its predecessor to let a flaw in its own code pass unremarked, and it powers dynamic workflows: hundreds of parallel subagents in one session.
Fable 5 is the frontier. A Mythos-class model made safe for general use. Best-in-class on long-horizon coding, vision, and reasoning, though it hands sensitive cyber, bio, and chemistry queries off to Opus 4.8.
Mythos 5 is the locked vault. Same underlying model as Fable, safeguards lifted, restricted to vetted cyber-defense and biology partners.
The real unlock is model routing inside Claude Code — plus Routines, Agent View, and computer use.
Six battle-tested use cases below — from a 50M-line migration (≈2 months → 1 day) to notebook→pipeline conversions saving 1–2 days each — with the results to back them up.
⚠️ Reality check: As of June 12, 2026, public access to Fable 5 and Mythos 5 is suspended under a US government export-control directive. The capabilities are real; availability is in flux. Plan accordingly.

The 2026 Claude Model Ladder

Forget "Claude" as one thing. In 2026 it's a graded ladder, and each rung exists for a reason.

Model	Class	Sweet spot	Price (in / out per M tokens)
Haiku	Fast tier	High-volume, latency-sensitive, cheap glue work	Lowest
Sonnet 4.6	Workhorse	Everyday coding, agents, 1M context	$3 / $15
Opus 4.8	Heavy lifter	Architecture, refactors, judgment-heavy work	$5 / $25 ($10 / $50 fast mode)
Fable 5	Mythos-class (safe)	Long-horizon, frontier coding, vision, research	$10 / $50
Mythos 5	Mythos-class (restricted)	Cyber defense, life sciences — vetted access only	$10 / $50

A few things worth knowing about how these actually relate:

Fable and Mythos are the same underlying model. The only difference is safeguards. Fable ships with classifiers that hand sensitive cyber/bio/chemistry queries off to Opus 4.8; Mythos has those guardrails lifted and is restricted to trusted partners. The names come from the same root — Latin fabula, Greek mythos, "that which is told."
"Mythos-class" sits above Opus in raw capability. It's the first tier Anthropic gated behind classifiers before a general release.
The longer the task, the bigger Fable's lead. On short tasks the gap between Sonnet and Fable is small. On multi-hour, multi-file, "live with your earlier decisions" work, it widens dramatically.
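
To make the price gap concrete, here is a minimal sketch that estimates the cost of one task at the list prices in the table above. The dictionary keys are my own labels rather than official API model IDs, the token counts are a made-up example, and Haiku is left out because the table gives no dollar figure for it.

```python
# List prices in USD per million tokens (input, output), copied from the table.
# The keys are illustrative labels, not official API model IDs.
PRICES = {
    "sonnet-4.6": (3.0, 15.0),
    "opus-4.8": (5.0, 25.0),
    "opus-4.8-fast": (10.0, 50.0),
    "fable-5": (10.0, 50.0),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one task at list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical task: 400k tokens read, 50k tokens written.
for name in PRICES:
    print(f"{name:>14}: ${estimate_cost(name, 400_000, 50_000):.2f}")
```

For that hypothetical task, Sonnet 4.6 comes to $1.95 and Fable 5 to $6.50, so the frontier tier costs roughly 3.3 times as much before any token-efficiency gains are counted.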

How I Route Work Inside Claude Code

Here's the mental model I settled on after a month. Think of it as a triage flow:

flowchart TD
    A[New task] --> B{How long-horizon<br/>and how risky?}
    B -->|Quick edit, glue,<br/>bulk text| H[Haiku]
    B -->|Everyday coding,<br/>most PRs| S[Sonnet 4.6]
    B -->|Architecture, refactor,<br/>needs judgment| O[Opus 4.8]
    B -->|Multi-hour migration,<br/>frontier reasoning| F[Fable 5]
    O -->|Scale it out| D[Dynamic workflows:<br/>100s of subagents]

1. Start at Sonnet 4.6. Always.
This is the single most important habit. Sonnet 4.6 now benchmarks near Opus-level on the coding tasks most teams actually care about, with a 1M-token context window and a price point that makes running multiple instances in parallel economically trivial. Several teams I trust have publicly moved the majority of their traffic here. Start here, and only climb the ladder when Sonnet visibly struggles.

2. Climb to Opus 4.8 when judgment matters.
The moment a task needs taste — a cross-service refactor, an API redesign, "should we even do it this way?" — Opus 4.8 earns its premium. The standout improvement isn't raw smarts, it's honesty: Opus 4.8 is roughly four times less likely than its predecessor to let a flaw in its own code pass unremarked. It flags uncertainty instead of confidently shipping a landmine. For unattended, long-running work, that's worth more than a benchmark point.

3. Reach for Fable 5 on the long-horizon stuff.
When the task is genuinely big — a migration across hundreds of thousands of lines, rebuilding an app's source from screenshots, reasoning that spans millions of tokens — Fable 5 is the one I reach for to get past a wall. It stays focused across enormous contexts and improves its own outputs using file-based memory. It's also more token-efficient than past models, which softens the higher per-token price.

4. Drop to Haiku for the boring glue.
Bulk renames, log parsing, commit-message generation, simple codegen. Don't pay Opus prices to reformat JSON.
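
The four rules above can be sketched as a small triage function. This is only my mental model written down, not anything Claude Code exposes: the `Task` fields, the `pick_model` name, and the tier labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Rough description of a unit of work. All fields are illustrative."""
    description: str
    mechanical: bool = False          # bulk renames, log parsing, reformatting
    needs_judgment: bool = False      # architecture, API design, cross-service refactors
    long_horizon: bool = False        # multi-hour, multi-file migrations
    frontier_available: bool = True   # whether the frontier tier can be used at all


def pick_model(task: Task) -> str:
    """Map a task to a tier label, following the four rules above."""
    if task.long_horizon:
        # Rule 3, with Opus as the stand-in when the frontier tier is unavailable.
        return "fable-5" if task.frontier_available else "opus-4.8"
    if task.needs_judgment:
        return "opus-4.8"    # Rule 2
    if task.mechanical:
        return "haiku"       # Rule 4
    return "sonnet-4.6"      # Rule 1: the default


print(pick_model(Task("fix a null check")))
print(pick_model(Task("redesign the billing API", needs_judgment=True)))
```

The ordering is a design choice: the long-horizon check comes first so that a big but mechanical migration still lands on the frontier tier, and everything unclassified falls through to Sonnet.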

The Claude Code Features That Make Routing Worth It

A model fleet only pays off if the host lets you orchestrate it. Four features did the heavy lifting for me:

1. Dynamic Workflows — the parallelism unlock

Launched alongside Opus 4.8, dynamic workflows let Claude plan a task and then fan out across tens to hundreds of parallel subagents in a single session — then verify its own outputs before reporting back. This is what turns "codebase-scale migration" from a slide into a Tuesday. Claude Code with Opus 4.8 can now take a six-figure-line migration from kickoff to merge, using your existing test suite as the bar. Available on Enterprise, Team, and Max plans.

2. Routines — set it once, let it run

Routines (shipped April 2026) let you configure a Claude Code workflow once and trigger it on a schedule, via API, or in response to an event. Nightly dependency upgrades, auto-triage of new GitHub issues, on-merge changelog generation. Pair a routine with the right model — Sonnet for triage, Opus for the actual fix — and you've replaced a pile of brittle CI scripts with one agent that improves over time.
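
One way I keep track of routines is as plain data: a name, a trigger type, a default model, and an escalation target. The sketch below is a planning aid under that assumption. It is not the Routines configuration format, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional

# The three trigger kinds named above: schedule, API call, or event.
Trigger = Literal["schedule", "api", "event"]


@dataclass(frozen=True)
class Routine:
    """Hypothetical record of one routine; not Claude Code's own schema."""
    name: str
    trigger: Trigger
    model: str                          # tier label, not an official model ID
    escalate_to: Optional[str] = None   # tier that takes over for the actual fix


ROUTINES = [
    Routine("nightly-dependency-upgrade", "schedule", "sonnet-4.6", escalate_to="opus-4.8"),
    Routine("issue-triage", "event", "sonnet-4.6", escalate_to="opus-4.8"),
    Routine("changelog-on-merge", "event", "sonnet-4.6"),
]


def routines_for(trigger: Trigger) -> List[str]:
    """Return the names of all routines that fire on the given trigger kind."""
    return [r.name for r in ROUTINES if r.trigger == trigger]


print(routines_for("event"))
```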

3. Agent View — mission control

When you're keeping "as many instances of Claude Code busy as possible" (Notion's co-founder isn't joking — that's literally the workflow now), you need a cockpit. Agent View gives you one place to manage every running session across surfaces. It's the unglamorous feature that makes parallel agent work sane.

4. Computer Use — beyond the terminal

Claude Code now opens your apps, drives your browser, and runs your dev tools to complete tasks end-to-end. Combined with Fable 5's state-of-the-art vision (in Anthropic's demo it beat Pokémon FireRed from raw screenshots using a vision-only harness), the "AI that can actually operate your machine" future is quietly here.

And it meets you everywhere: terminal, VS Code / Cursor / JetBrains extensions, desktop app, web, mobile, and Slack — same agent, same context, same models, wherever you happen to be working.

A Note on Effort (the dial most people miss)

The newer models expose an effort control — and it's the cheapest performance lever you have. Opus 4.8 defaults to high, but you can push it to extra (xhigh in Claude Code) or max for hard problems and long async runs. On lower effort it answers faster and sips your rate limits; on higher effort it thinks more and self-validates.

My rule: low/standard effort for interactive back-and-forth, high/extra for anything you're going to walk away from. The extra thinking pays for itself precisely when you're not watching.

There's also fast mode for Opus 4.8: 2.5× the speed at double the per-token price ($10 / $50 per million tokens instead of $5 / $25). Great for tight interactive loops where you're paying in wall-clock attention, not just dollars.
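
Both dials reduce to simple arithmetic. The sketch below encodes my effort rule of thumb and works out what fast mode costs per minute saved, using the 2.5× speedup quoted above and the 2× price ratio implied by the pricing table. The function names are mine, and it assumes a task uses the same number of tokens in either mode.

```python
def pick_effort(attended: bool, hard: bool = False) -> str:
    """Rule of thumb: low effort while watching, more effort when walking away."""
    if attended:
        return "low"
    return "xhigh" if hard else "high"


def fast_mode_tradeoff(standard_cost_usd: float, standard_minutes: float,
                       speedup: float = 2.5, price_multiplier: float = 2.0):
    """Return (extra_cost_usd, minutes_saved, usd_per_minute_saved).

    Assumes identical token usage in both modes, so cost scales only with
    the price multiplier and wall-clock time only with the speedup.
    """
    extra_cost = standard_cost_usd * price_multiplier - standard_cost_usd
    minutes_saved = standard_minutes - standard_minutes / speedup
    return extra_cost, minutes_saved, extra_cost / minutes_saved


# Hypothetical task: $3.25 and 10 minutes in standard mode.
extra, saved, rate = fast_mode_tradeoff(3.25, 10.0)
print(f"fast mode: +${extra:.2f} to save {saved:.0f} min (${rate:.2f} per minute saved)")
```

For that hypothetical task, fast mode adds $3.25 and saves 6 minutes, which is about $0.54 per minute of waiting removed.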

"Combine It With Other Good Models" — Yes, Do That

Routing doesn't have to stop at Claude's borders. A few honest observations from running mixed fleets:

Claude isn't operating in a vacuum. Anthropic's own benchmark tables put Fable 5 and Opus 4.8 head-to-head with GPT-5.5 and Gemini 3.5 — and the gaps are task-dependent, not absolute. On long-horizon agentic coding, Fable currently leads. On raw latency-per-dollar for simple tasks, the field is closer than the marketing suggests.
The pragmatic combo I've landed on: Claude (Sonnet/Opus) as the primary coding agent inside Claude Code, with a second-opinion model wired in via MCP for adversarial review. Having a different model critique a diff catches a class of "confidently wrong" mistakes that any single model's self-review misses.
MCP is the connective tissue. The Model Context Protocol means "best model for the job" can include non-Claude tools and models behind a uniform interface. Skills teach the workflow; MCP exposes the capability; Claude Code routes between models. That's the whole stack.

The takeaway isn't "Claude beats everyone." It's that multi-model routing is now a first-class engineering decision, and Claude Code is the most mature place to actually do it.

Real Use Cases & Results (the part devs actually want)

Benchmarks are fine. But what convinced me — and what I think convinces most engineers — is watching the thing land a PR you'd have spent a day on. Here are the use cases I ran (and the public results that back them up), organized by the kind of work you actually do.

Use case 1: The legacy migration nobody wanted

The task: Migrate a large service off a deprecated framework — the kind of ticket that sits in the backlog for two quarters because nobody has a free week.

The setup: Opus 4.8 (or Fable 5 where available) + dynamic workflows, with the existing test suite as the pass/fail bar. Claude plans the migration, fans out across hundreds of parallel subagents, each handling a slice, then verifies against the tests before reporting back.

The result: Stripe reported Fable 5 performing a codebase-wide migration on a 50-million-line Ruby codebase in a single day — work estimated at two-plus months for a team by hand. In my own (far smaller) runs, a multi-thousand-file framework bump that I'd scoped at three days came back green in an afternoon, with a clean diff and a summary of every non-trivial decision.

Takeaway: Long-horizon migrations are the single highest-ROI use case for the frontier tier. The longer and more mechanical the migration, the more absurd the time savings.
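
The plan, fan-out, verify shape of that setup can be sketched in a few lines. This is an illustration of the pattern, not how dynamic workflows are implemented: `migrate_slice` and `tests_pass` are hypothetical callbacks standing in for a subagent and your test suite, and a thread pool stands in for the parallel sessions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split the file list into slices, one per worker."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_migration(
    files: List[str],
    migrate_slice: Callable[[List[str]], List[str]],
    tests_pass: Callable[[List[str]], bool],
    slice_size: int = 50,
    max_workers: int = 8,
) -> Dict[str, List[str]]:
    """Fan out over slices in parallel, then sort results by verification outcome."""
    slices = chunk(files, slice_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves slice order, so the report is deterministic.
        results = list(pool.map(migrate_slice, slices))
    report: Dict[str, List[str]] = {"merged": [], "needs_review": []}
    for changed in results:
        bucket = "merged" if tests_pass(changed) else "needs_review"
        report[bucket].extend(changed)
    return report


# Stub run: 120 files, and any slice touching m7.rb "fails" its tests.
files = [f"app/models/m{i}.rb" for i in range(120)]
report = run_migration(
    files,
    migrate_slice=lambda s: s,
    tests_pass=lambda s: "app/models/m7.rb" not in s,
)
print(len(report["merged"]), "merged,", len(report["needs_review"]), "flagged for review")
```

In the stub run, the one slice containing the failing file is held back for review while the other 70 files pass, which is the property that matters: a failure in one slice never blocks the rest.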

Use case 2: EDA notebook → production pipeline

The task: Turn an exploratory notebook (pull data, train a model, eval with basic metrics) into a real, scheduled production pipeline.

The setup: Sonnet 4.6 as the driver — this is bread-and-butter work that doesn't need Opus. Point it at the notebook and your pipeline framework's conventions in CLAUDE.md.

The result: Ramp's staff engineer reported this exact workflow — notebook to Metaflow pipeline — saving 1–2 days of routine work per model. That's not a demo; that's a recurring tax on every ML engineer's week, quietly removed.

Takeaway: The boring-but-skilled translation work (notebook→pipeline, script→service, prototype→prod) is where Sonnet 4.6 pays for itself daily.

Use case 3: Issue → PR, end to end

The task: A GitHub issue comes in. Read it, reproduce, write the fix, add a test, open the PR.

The setup: Claude Code's GitHub/GitLab integration. Sonnet 4.6 for triage and the common case; escalate to Opus 4.8 when the bug touches architecture or the root cause is non-obvious.

The result: This is the loop teams at GitHub, Cognition, and CodeRabbit have publicly leaned into. Sonnet 4.6 "punches way above its weight class for the vast majority of real-world PRs," with double-digit-point gains on the hardest bug-finding problems over Sonnet 4.5. In practice: most issues never reach me as anything but a PR to review.

Takeaway: Wire the cheap model to the front door, reserve the expensive model for the hard 10%. Don't pay Opus to fix a null check.
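
"Cheap model at the front door, expensive model for the hard cases" is an escalation ladder. Here is a minimal sketch of it, assuming a hypothetical `attempt(model, issue)` callback that returns a patch only when the fix passes your tests. The tier labels are illustrative, not API model IDs.

```python
from typing import Callable, Optional, Sequence, Tuple

# Cheapest tier first; labels are illustrative.
LADDER = ("sonnet-4.6", "opus-4.8")


def solve_with_escalation(
    issue: str,
    attempt: Callable[[str, str], Optional[str]],
    ladder: Sequence[str] = LADDER,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each tier in order; return (model, patch) for the first success.

    `attempt(model, issue)` is a hypothetical callback that returns a patch
    when the fix passes the test suite, or None when it does not.
    """
    for model in ladder:
        patch = attempt(model, issue)
        if patch is not None:
            return model, patch
    return None, None  # no tier solved it: hand the issue to a human


def fake_attempt(model: str, issue: str) -> Optional[str]:
    """Stub: the cheap tier only handles null checks; the senior tier handles anything."""
    if model == "opus-4.8" or "null check" in issue:
        return f"patch by {model}"
    return None


print(solve_with_escalation("null check in parser", fake_attempt))
print(solve_with_escalation("race condition in scheduler", fake_attempt))
```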

Use case 4: Screenshot → working app

The task: "Here's a screenshot of the dashboard. Rebuild it." No source, no spec — just pixels.

The setup: Fable 5, the current state-of-the-art vision model. It can extract precise numbers from scientific figures and reconstruct a web app's source code from screenshots alone.

The result: Anthropic's own demo had Fable 5 beating Pokémon FireRed from raw game screenshots with a vision-only harness — something earlier Claude models couldn't do even with navigation aids. Translated to dev work: design-to-code from a Figma export or a competitor's UI screenshot, with far less hand-holding than anything before it.

Takeaway: Vision is no longer a party trick. "Rebuild this from a picture" is a real, reliable workflow now.

Use case 5: Nightly autonomous maintenance

The task: Dependency upgrades, flaky-test triage, changelog generation — the chores that rot a codebase when ignored.

The setup: Routines. Configure once, trigger on a schedule. Sonnet 4.6 does the nightly sweep; anything genuinely broken gets escalated to an Opus 4.8 fix with a draft PR waiting in the morning.

The result: I replaced a folder of brittle cron + bash scripts with a single agent that understands why a test failed instead of just reporting that it did. The win isn't speed. It's that the maintenance actually happens now, every night, without a human remembering to do it.

Takeaway: Skills + Routines + model routing is the combo that turns "we should automate that" into "it ran at 2am."

Use case 6: The adversarial code review

The task: Catch the confidently-wrong bug before it ships.

The setup: Primary model writes the diff; a different model (via MCP — could be another Claude tier, GPT-5.5, or Gemini 3.5) reviews it adversarially. Opus 4.8's honesty gains help here too: it's ~4× less likely than its predecessor to let a flaw in its own code pass unremarked.

The result: Cognition reported Sonnet 4.6 "meaningfully closed the gap with Opus on bug detection," letting them run more reviewers in parallel and catch a wider variety of bugs without increasing cost. A second, independent model catches the class of mistakes self-review structurally can't.

Takeaway: Two cheap reviewers beat one expensive author. Parallel, multi-model review is now economically obvious.

The results, at a glance

Use case	Model(s)	Reported / observed result
50M-line codebase migration	Fable 5 + dynamic workflows	~2 months → 1 day (Stripe)
Notebook → prod pipeline	Sonnet 4.6	1–2 days saved per model (Ramp)
Issue → PR	Sonnet 4.6 → Opus 4.8	Most issues arrive as review-ready PRs
Screenshot → app	Fable 5 (vision)	Source rebuilt from pixels alone
Nightly maintenance	Sonnet 4.6 + Routines	Chores that actually happen, unattended
Adversarial review	Multi-model via MCP	More bugs caught, parallel, no cost increase

The pattern across all six: match the model to the shape of the task, let Claude Code orchestrate, and verify with tests or a second model. That's the whole game.

A Dev-Community Playbook (steal these)

A few hard-won habits that separated my good weeks from my great ones:

Put your conventions in CLAUDE.md, once. Lint rules, directory layout, "we use pnpm not npm," "never touch legacy/." Every model in the fleet inherits it. This single file is the highest-leverage 20 minutes you'll spend.
Default to Sonnet. Earn your way up the ladder. Most engineers reflexively reach for the biggest model. Resist it. Start at Sonnet 4.6 and only climb when it visibly stalls — your bill and your latency will thank you.
Let the model write the failing test first. Tell it to reproduce the bug as a red test before fixing it. You get a regression guard for free and a much higher-quality fix.
Keep N agents busy. The mental shift that 10×'d Notion's team: you're not waiting on one agent, you're conducting several. Use Agent View, run parallel branches, review the fourth while three more cook.
Promote anything you do twice into a Routine. If you've manually asked Claude to do the same chore twice, that's a Routine waiting to be born.
Always wire a fallback. Frontier models get rate-limited, deprecated, or — as June 2026 proved — export-controlled overnight. Have an Opus 4.8 path ready so a policy change doesn't become an outage.
Review the diff, every time. The faster the agent, the lazier the human gets. The discipline that keeps this safe is unchanged: read the diff, run the tests, never merge what you can't roll back.
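
The "always wire a fallback" habit can be sketched as an ordered chain of clients. Everything here is hypothetical: `ModelUnavailable` stands in for whatever error your client raises on rate limits or suspended access, and the clients are plain functions rather than real SDK calls.

```python
from typing import Callable, Optional, Sequence, Tuple


class ModelUnavailable(Exception):
    """Hypothetical error for rate limits, deprecations, or suspended access."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (label, client) pair in order; return (label, response) from the first that works."""
    last_error: Optional[Exception] = None
    for label, client in chain:
        try:
            return label, client(prompt)
        except ModelUnavailable as err:
            last_error = err  # remember why this tier failed, then try the next one
    raise RuntimeError("every model in the chain was unavailable") from last_error


def suspended_client(prompt: str) -> str:
    """Stub for a frontier model whose access has been pulled."""
    raise ModelUnavailable("access suspended")


def working_client(prompt: str) -> str:
    """Stub for the fallback tier."""
    return f"handled: {prompt}"


print(call_with_fallback("migrate the billing service",
                         [("fable-5", suspended_client), ("opus-4.8", working_client)]))
```

Only the availability error triggers the fallback. Any other exception still surfaces, so a real bug in your prompt or tooling is not silently retried on a different model.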

The meta-lesson: agentic coding rewards engineers who think like tech leads. You decide what and why; the fleet handles how. The bottleneck moved from typing speed to judgment — which is exactly where you want it.

A Word on Safety (Read This Part)

The Mythos class crossed a capability threshold that made Anthropic genuinely nervous — and they were right to be. These models excel at discovering and exploiting software vulnerabilities and at agentic hacking (recon, lateral movement, the works). That's exactly why:

Fable 5 ships with classifiers that detect cyber/bio/chemistry/distillation misuse and fall back to Opus 4.8 rather than answering. More than 95% of sessions never trigger a fallback — but the guardrail is there.
Mythos 5 is deliberately gated behind trusted-access programs (cyber defense via Project Glasswing, select biology researchers), not handed to everyone.
As of June 12, 2026, public access to both Fable 5 and Mythos 5 is suspended under a US government export-control directive. This is the single most important caveat in this whole post: the capabilities are real and shipping, but availability is volatile and policy-driven. If you're building on Fable, have an Opus 4.8 fallback path wired in today, not later.

For your own work, the same discipline as ever applies: sandbox agent execution, restrict file-system and network egress, review diffs before they merge, and never let an autonomous agent push to anything you can't roll back. A more capable model raises the stakes of a bad instruction, not just a good one.

How to Try This Yourself

Install Claude Code (one-liner):

# Windows (PowerShell)
irm https://claude.ai/install.ps1 | iex
# macOS / Linux
curl -fsSL https://claude.ai/install.sh | bash

Pick your plan. Claude Code is bundled into Pro ($17–$20/mo), Max 5x ($100/mo), and Max 20x ($200/mo). For "keep three branches alive while I review the fourth," Max is the honest entry point.

Switch models per task. Inside a session, select the model that matches the job — Sonnet for the PR, Opus for the architecture call, Fable for the migration (where available). Use a CLAUDE.md file to encode your project's conventions once so every model inherits them.

Promote winners to Routines. Once a model-plus-workflow combo proves itself, schedule it. Nightly Sonnet-powered issue triage that escalates real bugs to an Opus fix is the kind of thing that runs while you sleep.

Wire in a second opinion via MCP. Let a different model adversarially review high-stakes diffs. Cheap insurance against confident-but-wrong.

Final Take: The Skill Is Routing Now

A year ago the question was "is the AI good enough to write this code?" In 2026 the answer is yes — across an entire ladder of models, each tuned for a different shape of problem. The new skill, the one that separates a 1.2× productivity bump from a 3× one, is knowing which model to put on which task and letting Claude Code orchestrate the fleet.

Start at Sonnet 4.6. Climb to Opus 4.8 when judgment matters. Reach for Fable 5 on the long-horizon work — when you can get it. Wire in a second model for adversarial review. Promote your wins to Routines. And keep a fallback path for the frontier models, because as June 2026 reminded everyone, the most capable model is also the one most likely to get pulled out from under you for a week.

Tools give agents capability. Skills give them competence. Models give them intelligence at the right price — and Claude Code, in 2026, is where you conduct the whole orchestra.

About the Author

Suraj Khaitan — Gen AI Architect | Building scalable platforms and secure cloud-native systems


Which Claude model has become your default — and what finally made you climb the ladder? Drop it in the comments. I'm always refining the routing playbook.

Sources & further reading: Anthropic's announcements for Claude Fable 5 & Mythos 5, Claude Opus 4.8, Claude Sonnet 4.6, the Claude Code product page, and the Fable/Mythos access statement. Benchmarks and pricing reflect Anthropic's published figures as of June 2026 and are subject to change.
