RAXXO Studios

Posted on Jun 10 • Originally published at raxxo.shop

Claude Fable 5 vs GPT-5.5 vs Gemini 3.1 Pro: Who Leads Now?

#ai #productivity #claudecode #automation

SWE-Bench Pro: Claude Fable 5 hits 80.3 percent, GPT-5.5 lands 58.6, Gemini 3.1 Pro 54.2
Gemini stays cheapest at 2 dollars per million input, Fable 5 costs 10 but undercuts GPT-5.5 Pro
Only Anthropic ships a two-tier safety design: risky prompts get Opus 4.8 answers, not refusals
My stack: Fable 5 for agentic coding, Gemini for cheap volume, GPT-5.5 where its ecosystem lives

The frontier has three flagships again, and one of them just moved the line. Claude Fable 5 arrived on June 9, 2026 as the first public Mythos-class model, and the obvious question is not whether it beats Opus but whether it beats GPT-5.5 and Gemini 3.1 Pro. I pulled the published numbers, the price sheets, and a day of my own production traffic to see how the three actually compare.

The Coding Benchmarks Are Not Close

SWE-Bench Pro is the cleanest cross-vendor measure right now because it scores real GitHub engineering tasks, not puzzles. The June numbers: Claude Fable 5 at 80.3 percent, GPT-5.5 at 58.6, Gemini 3.1 Pro at 54.2. For scale, Anthropic's own previous best, Opus 4.8, sits at 69.2. Fable 5's lead over GPT-5.5 is 21.7 points, nearly five times the 4.4-point gap between GPT-5.5 and Gemini.

Cognition's FrontierCode tells the same story at the hard end. The benchmark deliberately uses demanding production-standard tasks, and Fable 5 scores 29.3 percent against 5.7 for GPT-5.5. A roughly fivefold difference on the work that most resembles real senior engineering is the single most lopsided frontier result I have seen this year.

Two honest caveats before anyone cancels subscriptions. First, vendor-published benchmarks favor the vendor, always. Second, GPT-5.5 has real wins on record: it took Terminal-Bench from Opus 4.8 back in May, and no Fable 5 Terminal-Bench number has been published yet. Terminal-heavy agent workflows might still lean GPT until someone measures it. I keep score on these launches, and the pattern from "Claude Opus 4.8 Is Here: Everything That Changed" repeats: Anthropic publishes its losses less loudly than its wins, same as everyone.

Outside pure coding, Fable 5 claims state-of-the-art vision (it beat Pokemon FireRed on a vision-only harness and rebuilt app source from screenshots) and the top scores on document and finance reasoning suites. Google still owns the cheap long-context crown, and OpenAI still ships the broadest consumer surface. But on the benchmark axis that decides what writes code in production, June 2026 is not a three-way tie.

Pricing: Three Different Bets

Per million tokens, in dollars, as of the June 9 price sheets:

| Model | Input | Output |
|---|---|---|
| Claude Fable 5 | 10 | 50 |
| Claude Opus 4.8 | 5 | 25 |
| GPT-5.5 | 5 | 30 |
| GPT-5.5 Pro | 30 | 180 |
| Gemini 3.1 Pro (under 200K context) | 2 | 12 |
| Gemini 3.1 Pro (over 200K context) | 4 | 18 |

Three strategies in one table. Google is buying the volume market: 2 dollars input is unmatched, and batch mode halves it again. OpenAI holds the middle with GPT-5.5, then charges a steep premium for GPT-5.5 Pro. Anthropic priced Fable 5 at double its own Opus 4.8 but well under OpenAI's top tier, which costs 3 times as much on input and 3.6 times as much on output as Fable 5.

So "Fable 5 is expensive" is only half true. Against GPT-5.5 standard, yes, double the input price. Against the model OpenAI positions as its best, Fable 5 is the budget option, with better published coding numbers.
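The ratios quoted above can be rechecked with a few lines of Python. This is a minimal sketch: the prices are copied from the table in this section, while the dictionary layout and the `price_ratio` helper are illustrative names of my own, not anything a vendor ships.

```python
# Prices in USD per million tokens, copied from the table in this article.
# Each value is (input price, output price).
PRICES = {
    "Claude Fable 5": (10, 50),
    "Claude Opus 4.8": (5, 25),
    "GPT-5.5": (5, 30),
    "GPT-5.5 Pro": (30, 180),
    "Gemini 3.1 Pro (under 200K)": (2, 12),
    "Gemini 3.1 Pro (over 200K)": (4, 18),
}


def price_ratio(model: str, baseline: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of model's price over baseline's."""
    model_in, model_out = PRICES[model]
    base_in, base_out = PRICES[baseline]
    return model_in / base_in, model_out / base_out


if __name__ == "__main__":
    # Show every model's price relative to Fable 5.
    for name in PRICES:
        in_ratio, out_ratio = price_ratio(name, "Claude Fable 5")
        print(f"{name:28s} input x{in_ratio:.2f}  output x{out_ratio:.2f}")
```

Swapping the baseline argument makes it easy to see the same table from GPT-5.5's or Gemini's point of view.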

Caching shifts the picture further. All three discount heavily for repeated prefixes (GPT-5.5 cached input drops to 0.50 dollars per million, Gemini and Claude both serve cache reads at about a tenth of base input), so for agent loops with stable system prompts, the effective input cost converges. What does not converge is output, and output is what agentic work burns. That is where the per-task math from my Fable 5 vs Opus 4.8 comparison applies across vendors too: the model that solves a task in one attempt can beat the cheaper model that needs three, as long as the per-attempt price gap is smaller than the retry gap.
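The per-task math can be sketched as follows. The prices come from the table above and the 0.1 cache-read multiplier from the "about a tenth" figure. The token counts, the cached fraction, and the attempt counts are made-up assumptions for illustration, not measurements.

```python
def cost_per_attempt(
    input_price: float,
    output_price: float,
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float = 0.0,
    cache_read_multiplier: float = 0.1,
) -> float:
    """USD cost of one attempt; prices are per million tokens."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = fresh * input_price + cached * input_price * cache_read_multiplier
    output_cost = output_tokens * output_price
    return (input_cost + output_cost) / 1_000_000


def cost_per_solved_task(attempts: int, **attempt_kwargs) -> float:
    """Total USD spent if the task takes `attempts` equally sized tries."""
    return attempts * cost_per_attempt(**attempt_kwargs)


# Hypothetical agent workload: 50K input tokens (90% served from cache)
# and 20K output tokens per attempt.
WORKLOAD = dict(input_tokens=50_000, output_tokens=20_000, cached_fraction=0.9)

fable_one_try = cost_per_solved_task(1, input_price=10, output_price=50, **WORKLOAD)
gpt_three_tries = cost_per_solved_task(3, input_price=5, output_price=30, **WORKLOAD)
gemini_three_tries = cost_per_solved_task(3, input_price=2, output_price=12, **WORKLOAD)

if __name__ == "__main__":
    print(f"Fable 5, 1 attempt:         ${fable_one_try:.4f}")
    print(f"GPT-5.5, 3 attempts:        ${gpt_three_tries:.4f}")
    print(f"Gemini 3.1 Pro, 3 attempts: ${gemini_three_tries:.4f}")
```

Under these assumed token counts, one Fable 5 attempt costs about 1.10 dollars against about 1.94 for three GPT-5.5 attempts, so the one-versus-three argument holds there. Three Gemini 3.1 Pro attempts come to about 0.78 dollars, so against the cheapest model the break-even sits at a little over four attempts. Plug in your own token counts before drawing conclusions.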

Only One Ships a Fallback Instead of a Refusal

The architectural difference nobody else copied: Fable 5 does not just refuse dangerous requests, it swaps models. Classifier systems screen every conversation, and requests touching cybersecurity exploitation, dual-use biology and chemistry, or capability distillation get their answers generated by Opus 4.8 instead. Under 5 percent of sessions trigger it. GPT-5.5 and Gemini 3.1 Pro are binary: answer or refuse.

For developers this cuts both ways. The upside is fewer dead ends. A security-curious question that GPT might refuse outright still gets a competent Opus-grade answer through Fable. The downside is consistency: if your product depends on knowing exactly which model answered, you need to handle the silent handoff, and conservative tuning means some harmless requests get downgraded too. Anthropic says reducing those false positives is the current focus.
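One way to handle the handoff is to compare the model you asked for with whatever the response says answered. The sketch below makes a loud assumption: that the response payload reports the answering model in a `model` field. This article does not confirm that the fallback is surfaced that way, and the model identifiers used here are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandoffCheck:
    requested: str
    answered_by: Optional[str]  # None when the response did not say

    @property
    def handed_off(self) -> bool:
        """True only when the response names a different model."""
        return self.answered_by is not None and self.answered_by != self.requested


def check_handoff(requested_model: str, response: dict) -> HandoffCheck:
    # ASSUMPTION: the answering model appears under response["model"].
    return HandoffCheck(requested=requested_model, answered_by=response.get("model"))


if __name__ == "__main__":
    # Hypothetical model ids and a hand-built response dict, not real API output.
    result = check_handoff("claude-fable-5", {"model": "claude-opus-4-8"})
    if result.handed_off:
        print(f"answered by {result.answered_by}, not {result.requested}")
```

If the response carries no such field, `handed_off` stays false and `answered_by` is `None`, which is worth logging separately so an unknown is never mistaken for a confirmed match.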

The testing behind it is unusually public: over 1,000 hours of bug bounty with no universal jailbreak found, zero harmful compliance across 30 public jailbreak techniques on cyber tasks, and external red teams reporting the strongest cyber safeguards they have measured. Add the data terms (30-day retention on Mythos-class traffic, no training use, logged human access) and the privacy story is straightforwardly the strongest of the three vendors right now.

There is also a capability being deliberately withheld here that neither rival has shown. The unrestricted Mythos 5, which only vetted Project Glasswing partners get, scores 78 percent on ExploitBench against 40 for Opus 4.8. Anthropic built a frontier offensive-security capability, measured it, and then fenced it off to roughly 150 vetted organizations across 15-plus countries. OpenAI and Google publish nothing comparable, which means either they do not have it or they do not talk about it. Both possibilities are interesting.

Worth knowing the context: the release came days after Anthropic's own public warnings about AI capability risks, and this two-tier design is their answer to the obvious charge of hypocrisy. The full mechanism, including what the restricted Mythos 5 does differently, is in "Claude Fable 5 Is Here: The First Public Mythos-Class Model".

What I Run Where

My production split in launch week, as one data point from a solo studio that ships daily.

Agentic and long-horizon coding moved to Fable 5. Claude Code defaulted to it on my machine launch morning, and the pipeline that researched, drafted, and published this article ran on it end to end. The retry-rate drop on hard tasks is visible within a day, which matches the FrontierCode spread.

High-volume, price-bound work goes to Gemini 3.1 Pro or stays on smaller Claude models. Summarization, classification, bulk ingestion of long documents: 2 dollars per million input under 200K context is a price the others do not answer, and quality there is past the good-enough line anyway.

GPT-5.5 keeps two lanes in my book: teams already deep in OpenAI's tooling, where switching costs outweigh benchmark gaps, and terminal-style agent flows where its Terminal-Bench result still stands unchallenged. Ecosystem lock-in is a real input to this decision even when the benchmark table is lopsided, a dynamic I dug into in "Anthropic Now Owns 40% of Enterprise LLM Spend".
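The split above fits in a small routing table. This is a sketch of the idea rather than my actual configuration: the task categories, the `pick_model` function, and the `team_on_openai_tooling` flag are illustrative names.

```python
from enum import Enum


class Task(Enum):
    AGENTIC_CODING = "agentic_coding"   # long-horizon coding agents
    BULK_TEXT = "bulk_text"             # summarization, classification, ingestion
    TERMINAL_AGENT = "terminal_agent"   # terminal-style agent flows


# Default lane per task type, following the split described in this section.
DEFAULT_ROUTES = {
    Task.AGENTIC_CODING: "Claude Fable 5",
    Task.BULK_TEXT: "Gemini 3.1 Pro",
    Task.TERMINAL_AGENT: "GPT-5.5",
}


def pick_model(task: Task, team_on_openai_tooling: bool = False) -> str:
    """Return the model name for a task; ecosystem lock-in overrides the default."""
    if team_on_openai_tooling:
        return "GPT-5.5"
    return DEFAULT_ROUTES[task]


if __name__ == "__main__":
    for task in Task:
        print(f"{task.value:15s} -> {pick_model(task)}")
```

Keeping the lock-in override as an explicit flag makes the trade-off visible: the benchmark table picks the default, and switching costs are a deliberate exception.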

The cheap way to form your own opinion: paid Claude plans include Fable 5 at no extra cost until June 22. Run your hardest recurring task through all three with the same prompt and count attempts to done, not tokens per attempt.
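Counting attempts to done needs very little harness. In the sketch below, `run_once` stands for whatever function sends your prompt to one model and `is_done` for your own pass/fail check (a test suite, a linter, a human yes/no). Both are hypothetical placeholders, and the flaky stub exists only so the example runs without any API key.

```python
from typing import Callable, Optional


def attempts_to_done(
    run_once: Callable[[str], str],
    is_done: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 5,
) -> Optional[int]:
    """Return the 1-based attempt that first passed, or None if none did."""
    for attempt in range(1, max_attempts + 1):
        if is_done(run_once(prompt)):
            return attempt
    return None


def make_flaky_model(succeeds_on: int) -> Callable[[str], str]:
    """Stand-in for a real model call: fails until the given attempt."""
    calls = {"n": 0}

    def run_once(prompt: str) -> str:
        calls["n"] += 1
        return "PASS" if calls["n"] >= succeeds_on else "FAIL"

    return run_once


if __name__ == "__main__":
    passed = lambda output: output == "PASS"
    # Replace the stubs with real client calls, one per vendor.
    for label, model in [("model A", make_flaky_model(1)), ("model B", make_flaky_model(3))]:
        print(label, attempts_to_done(model, passed, "same prompt for everyone"))
```

Multiply the attempt count by your per-attempt cost and you have a per-task price that is comparable across vendors.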

Bottom Line

June 2026 splits the frontier into three clear bets: Anthropic holds the capability ceiling, Google holds the price floor, OpenAI holds the ecosystem middle. The benchmark gap on hard coding is wide enough that I moved real workload the same week, which I have not done on a launch since Opus 4.7.

Three things will decide whether this snapshot holds. First, a published Fable 5 Terminal-Bench number, in either direction. Second, Google's answer: Gemini 3.1 Pro is now two coding generations behind at a fifth to a quarter of Fable 5's price, and that price-capability spread cannot stay this wide. Third, the June 23 switch from included access to usage credits, which is when Fable 5 usage patterns get honest.

If you only act on one thing, make it the free window: until June 22, Fable 5 costs paid Claude subscribers nothing to test. Bring a task that hurt last month and see if it stops hurting. My Claude Code setup for exactly that kind of test, hooks, routing, and guardrails included, ships as Claude Blueprint.

Top comments (1)

Kevin • Jun 10

Is this source still up to date: “Anthropic Now Owns 40% of Enterprise LLM Spend”? The link no longer works, and I've read that most companies (including those in the U.S.) have now switched to Chinese alternatives. A comparison with Minimax M3 would be interesting at this point.