This article was originally published on aicoderscope.com
TL;DR: Claude Fable 5 leads Terminal-Bench 2.1 at 88.0%, the first model to break 85%, but it has been offline under a US export-control order since June 12 and had not been restored as of June 27. Among tools you can actually pay for and run today, Codex CLI on GPT-5.5 (83.4%) edges Claude Code on Opus 4.8 (82.7%). That 0.7-point gap is noise. Don't switch your stack over it.
| | Codex CLI + GPT-5.5 | Claude Code + Opus 4.8 | Claude Code + Fable 5 |
|---|---|---|---|
| Best for | Long-context, autonomous runs | Interactive, plan-first work | Nobody — it's suspended |
| Terminal-Bench 2.1 (native harness) | 83.4% | 82.7% | 88.0% |
| Price (per M tokens) | $5 in / $30 out | $5 in / $25 out | $10 in / $50 out |
| Availability today | Full | Full | Suspended (US export order) |
Honest take: The benchmark's real lesson in June 2026 isn't "which model wins." It's that the model winning by 4.6 points is one you legally cannot access. Pick your tool on harness fit, cost, and availability — not on a leaderboard whose top row is unbuyable.
The number everyone is quoting is unbuyable
The headline result from Terminal-Bench 2.1 this month reads great: Claude Fable 5 hit 88.0%, the first model ever to clear 85% on the benchmark, finishing 4.6 points ahead of GPT-5.5. Anthropic's Mythos-class model looked like the new ceiling for agentic coding.
Then on June 12, three days after Fable 5 launched, the US government issued an export-control directive ordering Anthropic to suspend all access to Fable 5 and Mythos 5. The models went dark globally, for every customer on every platform. By June 27, the government had allowed Mythos 5 back for a narrow set of US critical-infrastructure organizations. Fable 5 had not been restored. Anthropic says it is working to bring it back "as soon as possible," with no date.
So the model topping Terminal-Bench 2.1 is one you can't use to write code today. That single fact reframes the entire leaderboard. If you're choosing a coding agent in late June 2026, the 88.0% row is trivia. The decision is between the next two rows — both of which you can install and pay for right now.
The leaderboard that actually matters
Strip out the suspended models and Terminal-Bench 2.1's native-harness leaderboard (tbench.ai, which runs each agent in its own tooling) looks like this for tools you can buy today:
| Rank (usable tools) | Agent | Model | Score |
|---|---|---|---|
| 1 | Codex CLI | GPT-5.5 | 83.4% |
| 2 | Claude Code | Opus 4.8 | 82.7% |
| 3 | Gemini CLI | Gemini 3.1 Pro | ~70.7% |
GPT-5.5 in Codex CLI leads the usable field by 0.7 points over Opus 4.8 in Claude Code. On a benchmark of 31 models running real multi-step terminal tasks — package installs, git operations, build fixes, server config — a sub-one-point gap is measurement noise. Run the suite again next week and the order could flip. Neither tool is meaningfully "better at coding" than the other based on this.
The third-place drop to Gemini CLI is the more interesting signal: roughly 12 points behind Claude Code and nearly 13 behind Codex CLI. The top two are in a different class from everything else you can run.
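To put a rough number on "noise," here is a minimal sketch that treats each score as a pass rate over independent tasks and measures a gap in units of its binomial standard error. The task count `N_TASKS` is a placeholder assumption (this article doesn't state how many tasks the suite has), and the independence assumption is a simplification, since both agents run the same tasks.

```python
import math


def standard_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured over n_tasks tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def gap_in_standard_errors(rate_a: float, rate_b: float, n_tasks: int) -> float:
    """Gap between two pass rates, divided by the standard error of that gap."""
    se_gap = math.sqrt(
        standard_error(rate_a, n_tasks) ** 2 + standard_error(rate_b, n_tasks) ** 2
    )
    return (rate_a - rate_b) / se_gap


# ASSUMPTION: placeholder task count, not taken from the benchmark.
N_TASKS = 100

codex_vs_claude = gap_in_standard_errors(0.834, 0.827, N_TASKS)
claude_vs_gemini = gap_in_standard_errors(0.827, 0.707, N_TASKS)

print(f"Codex CLI vs Claude Code: {codex_vs_claude:.2f} standard errors")
print(f"Claude Code vs Gemini CLI: {claude_vs_gemini:.2f} standard errors")
```

With that placeholder, the 0.7-point gap comes out well under one standard error, while the 12-point gap to Gemini CLI is around two. Swap in the real task count if you have it; the conclusion only changes if the suite is far larger than a few hundred tasks.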
Why the same model scores two different numbers
Here's the part most leaderboard coverage skips, and it changes how you should read every score above.
Terminal-Bench publishes two views. The tbench.ai leaderboard runs each agent in its native harness — Codex CLI wraps GPT-5.5 the way OpenAI built it, Claude Code wraps its models the way Anthropic built them. That's an apples-to-apples tool comparison. It answers "which product, as shipped, completes the most tasks."
The vals.ai leaderboard runs every model through the same harness (Terminus 2). That's an apples-to-apples model comparison. It strips the tooling out and asks "which raw model is strongest."
Put them side by side and the gap is impossible to ignore:
| Model | tbench.ai (native harness) | vals.ai (Terminus 2, uniform) |
|---|---|---|
| Claude Fable 5 | 88.0% | 80.52% |
| GPT-5.5 | 83.4% (Codex CLI) | 76.40% |
| Gemini 3.5 Flash | — | 74.16% |
| Claude Opus 4.8 | 82.7% (Claude Code) | 71.91% |
GPT-5.5 scores 83.4% inside Codex CLI but only 76.40% inside Terminus 2 — a 7-point swing. The model didn't change. The agent loop wrapping it did. That gap is the harness: how the tool plans, retries failed commands, manages context, and decides when a task is done.
Seven points is larger than the entire margin between the top two tools. The practical takeaway is blunt: the harness can matter more than the model. A strong model in a mediocre agent loop loses to a slightly weaker model in a well-engineered one. When you pick a coding agent, you're not picking a model — you're picking a model plus the software that drives it, and the driving is doing a lot of the work.
This also explains why you should distrust any single score quoted without its harness. "GPT-5.5 gets 83.4% on Terminal-Bench 2.1" is true and "GPT-5.5 gets 76.40% on Terminal-Bench 2.1" is also true. Both describe the same model on the same benchmark version. Always ask which harness produced the number.
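The harness gap is easy to compute for every model that appears in both columns of the table above. This is a small sketch using only the scores quoted in this article; the names `SCORES` and `harness_gap` are illustrative, and Gemini 3.5 Flash is left out because it has no native-harness score here.

```python
# (native-harness score, uniform Terminus 2 score), in percent, from the table above.
SCORES = {
    "Claude Fable 5": (88.0, 80.52),
    "GPT-5.5": (83.4, 76.40),
    "Claude Opus 4.8": (82.7, 71.91),
}


def harness_gap(native: float, uniform: float) -> float:
    """Points a model gains when run in its vendor's own harness."""
    return round(native - uniform, 2)


gaps = {model: harness_gap(native, uniform) for model, (native, uniform) in SCORES.items()}

# Print the largest gaps first.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: +{gap:.2f} points from its native harness")
```

On these numbers Opus 4.8 gains the most from its own tooling (about 10.8 points), which is why it nearly ties GPT-5.5 in the native view despite trailing it by about 4.5 points on the uniform harness.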
What Terminal-Bench actually tests (and what it doesn't)
Terminal-Bench measures an agent driving a real terminal to finish a task: edit files, run shell commands, read the failures, fix them, repeat until the task passes. Version 2.1 is harder than 2.0 — the tasks are longer and more sequential — so scores are not comparable across versions. A model's 2.0 number tells you nothing about its 2.1 standing.
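The run, read, fix, repeat cycle can be sketched in a few lines. This is a toy illustration, not how Terminal-Bench or any shipping agent is implemented: `solve` and `propose_fix` are hypothetical names, and `propose_fix` stands in for the model reading the failure output and choosing the next command.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Command = List[str]


def run(cmd: Command) -> Tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def solve(
    check: Command,
    propose_fix: Callable[[str], Optional[Command]],
    max_steps: int = 5,
) -> Optional[int]:
    """Re-run the task's check until it passes.

    Returns the number of check runs it took, or None if the agent gave up
    or ran out of steps.
    """
    for attempt in range(1, max_steps + 1):
        exit_code, output = run(check)
        if exit_code == 0:
            return attempt  # task passes
        fix = propose_fix(output)  # hypothetical stand-in for the model
        if fix is None:
            return None  # nothing left to try
        run(fix)  # apply the fix, then loop back to the check
    return None
```

Everything that separates one harness from another lives in the parts this sketch hides: what `propose_fix` sees, how many steps it gets, and when the loop decides it is done.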
What it captures well: sequential, multi-step work where one wrong command derails the next three. That's closer to real agentic coding than SWE-bench's one-shot bug fixes, which is why Terminal-Bench has become the headline benchmark for CLI agents in 2026.
What it doesn't capture: tool latency, session memory across hours of work, IDE integration, how good the diffs are to review, or how often the agent quietly does the wrong thing confidently. A 0.7-point benchmark lead says nothing about whether you'll enjoy using the tool for eight hours. Those qualities decide daily satisfaction, and no leaderboard scores them.
Cost: where the real decision lives
With the top two tools tied on capability, price and fit decide it. Here's the verified API pricing as of June 28, 2026:
| Model | Input / M | Output / M | Notes |
|---|---|---|---|
| GPT-5.5 | $5 | $30 | Cached input $0.50/M; batch & flex 50% off → $2.50 / $15 |
| Claude Opus 4.8 | $5 | $25 | Standard tier; "fast" tier is $10 / $50 |
| Claude Fable 5 | $10 | $50 | Suspended — not purchasable |
| GPT-5.5 Pro | $30 | $180 | Heavy-reasoning variant |
Opus 4.8 is cheaper on output ($25 vs $30/M), which dominates the bill for agentic work that generates a lot of code and tool calls. GPT-5.5 claws some of that back with a low $0.50/M cached-input rate, which helps long sessions that re-send the same context repeatedly. For most real workloads the two land within a few dollars of each other per heavy session.
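To see how the output price and the cached-input rate trade off, here is a rough cost sketch built on the prices in the table above. The token counts are made-up placeholders, not measurements. The article lists no cached-input rate for Opus 4.8, so the sketch bills its input at full price, which may overstate its cost on cache-heavy sessions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    input_per_m: float  # USD per million input tokens
    output_per_m: float  # USD per million output tokens
    cached_input_per_m: Optional[float] = None  # None: no cached rate listed in the article


PRICES = {
    "GPT-5.5": Pricing(5.0, 30.0, cached_input_per_m=0.50),
    "Claude Opus 4.8": Pricing(5.0, 25.0),
}


def session_cost(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float = 0.0,
) -> float:
    """Estimated USD cost of one session at metered API rates."""
    cached_rate = (
        pricing.cached_input_per_m
        if pricing.cached_input_per_m is not None
        else pricing.input_per_m  # ASSUMPTION: no discount when no cached rate is known
    )
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * pricing.input_per_m
        + cached_tokens * cached_rate
        + output_tokens * pricing.output_per_m
    )
    return total / 1_000_000


# ASSUMPTION: a hypothetical heavy session of 2M input and 200k output tokens.
for name, pricing in PRICES.items():
    uncached = session_cost(pricing, 2_000_000, 200_000)
    cached = session_cost(pricing, 2_000_000, 200_000, cached_fraction=0.8)
    print(f"{name}: ${uncached:.2f} uncached, ${cached:.2f} with 80% cached input")
```

With those placeholder counts and no caching, the session prices out at $16 on GPT-5.5 and $15 on Opus 4.8. The input-to-output ratio and cache hit rate of your own sessions matter more than the list prices, so plug in your real numbers before drawing conclusions.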
If you'd rather pay a flat rate than metered usage, both tools have subscription paths: Codex CLI runs on ChatGPT Plus ($20/mo), and Claude Code starts with the $20/mo Pro plan. For the break-even math between flat and metered plans, see our Claude Code vs Codex CLI comparison and the 7-way agent comparison.
The lesson that outlasts this month's scores
The benchmark order will shuffle. Fable 5 may come back. GPT-5.6 is already shipping. Six months from now these exact numbers are history.
What won't change is the structural lesson Fable 5's suspension just taught: a cloud model can vanish overnight by government order, in this case just three days after launch. Developers who built a workflow around Fable 5 in its first days lost it before the week was out. If your stack has a single point of failure (one model, one vendor, one API), you're one directive away from a bad afternoon.
The defensive move is the s