tokenmixai

Posted on Jun 1 • Originally published at tokenmix.ai

Claude Mythos vs Opus 4.8: 90x More Firefox Exploits — But Stay on Opus Anyway

#claude #ai #llm #security

I spent a few hours digging into Anthropic's Mythos Preview disclosure and the BleepingComputer report that Mythos-class models are coming to all customers "in the coming weeks." The headline numbers are wild. The conclusion is boring. Let me explain.

TL;DR

Mythos beats Opus 4.6 by ~90x on offensive security benchmarks (181 Firefox exploits vs 2 in matched tests).
Opus 4.8 already matches Mythos on alignment scores — that's how Anthropic justified the public rollout.
Mythos will likely cost $25/$125 per million tokens (vs $5/$25 for Opus 4.8). 5x premium.
For most code you ship, Opus 4.8 is still the right default. Mythos pays off only on security audits and autonomous research.

Why I went down this rabbit hole

On May 28, 2026, Anthropic released Opus 4.8 and quietly announced that Mythos-class models would "roll out to all customers in the coming weeks." That's a reversal, in about seven weeks, of their April 7 statement: "We do not plan to make Mythos Preview generally available."

I wanted to know two things:

What changed?
Should I refactor anything before Mythos lands?

Short answer: Opus 4.8 happened. Long answer below.

The capability gap is real (and concentrated)

Here's the data Anthropic actually published, in matched-conditions tests against Opus 4.6:

Test	Opus 4.6	Mythos Preview	Gap
Firefox working exploits / matched try set	~2	181	90x
OSS-Fuzz tier-1/2 crashes	minimal	595	huge
OSS-Fuzz tier-5 control flow hijacks	0	10	n/a (baseline is 0)
Total flaws found across 1,000+ projects	—	23,019	—
High-or-critical severity (CVSS 7+)	—	6,202	—
Validity rate on processed findings	~30-40% (community LLM baseline)	90.6%	~2.3-3x

That 90.6% validity number is the one that should make you pay attention. Pre-Mythos, LLM-driven security findings ran at 30-40% valid — high enough that you couldn't pipeline them into your bug-tracker without a human triage layer eating analyst hours. At 90.6%, Mythos crosses into "send findings straight to maintainer" territory. That's a structural change in how vulnerability programs operate.
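To make the triage difference concrete, here is a rough sketch of how many analyst hours go to dismissing invalid reports at each validity rate. The batch size of 1,000 findings and the 30 minutes per invalid report are illustrative assumptions of mine, not figures from Anthropic's disclosure.

```python
# Rough triage-load estimate. The batch size and minutes-per-report values
# are illustrative assumptions, not numbers from the Mythos disclosure.

def wasted_triage_hours(findings: int, validity_rate: float,
                        minutes_per_invalid: float = 30.0) -> float:
    """Hours spent dismissing the invalid findings in a batch."""
    invalid = findings * (1.0 - validity_rate)
    return invalid * minutes_per_invalid / 60.0


if __name__ == "__main__":
    for label, rate in [("baseline low", 0.30),
                        ("baseline high", 0.40),
                        ("Mythos Preview", 0.906)]:
        hours = wasted_triage_hours(1000, rate)
        print(f"{label:>15}: {hours:6.1f} analyst-hours lost per 1,000 findings")
```

Under those assumptions the wasted load falls from 300-350 hours to about 47 hours per 1,000 findings. Change `minutes_per_invalid` to match your own triage process.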

But here's the part everyone misses

Pull up Opus 4.8's headline benchmarks for a second:

Benchmark	Opus 4.7	Opus 4.8	Delta
SWE-bench Verified	87.6%	88.6%	+1.0
SWE-bench Pro	64.3%	69.2%	+4.9
Terminal-Bench 2.1	66.1%	74.6%	+8.5
GDPval-AA Elo	1753	1890	+137
Code-flaw rate vs 4.7	1.0x	0.25x	4x fewer flaws

That 4x reduction in code-flaw rate is what enabled the Mythos public release. Anthropic literally wrote: Opus 4.8's misaligned behavior rates are "substantially lower than Opus 4.7" and "comparable to Claude Mythos Preview." The safeguard pipeline they needed to ship before letting Mythos out of Project Glasswing? They tested it in production on Opus 4.8 for ~7 weeks. The model swap is the easy part now.

So if your reason for waiting on Mythos was "I want better code quality," you already have it. It's called Opus 4.8 and it costs the same as Opus 4.7.

The pricing math

Anthropic hasn't published public Mythos pricing yet, but Glasswing partners reportedly pay $25 input / $125 output per million tokens. That's:

5x Opus 4.8's $5 / $25
8x GPT-5.5's $3 / $15
8x Sonnet 4.8's $3 / $15
roughly 93x (input) to 114x (output) DeepSeek V4's $0.27 / $1.10

Let me put that in code-review terms. Suppose you run an AI-assisted PR review pipeline averaging 8K input tokens + 2K output tokens per PR, and you ship 100 PRs/month:

Model	Cost per PR	Monthly	Annual
DeepSeek V4	$0.0044	$0.44	$5.30
Sonnet 4.8	$0.054	$5.40	$64.80
Opus 4.8	$0.090	$9.00	$108
Mythos (projected)	$0.450	$45.00	$540
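The table above follows directly from the per-million-token prices. Here is a minimal sketch that reproduces it, assuming the 8K input / 2K output token mix and 100 PRs a month from the scenario. The Mythos price is the reported Glasswing rate, not a published one, and small differences from the table come from rounding.

```python
# Reproduces the per-PR cost table from the per-million-token prices quoted
# in this post. Mythos pricing is the reported Glasswing rate, not published.

PRICES_PER_MTOK = {  # model: (input USD, output USD) per million tokens
    "DeepSeek V4": (0.27, 1.10),
    "Sonnet 4.8": (3.00, 15.00),
    "Opus 4.8": (5.00, 25.00),
    "Mythos (projected)": (25.00, 125.00),
}


def cost_per_pr(model: str, input_tokens: int = 8_000,
                output_tokens: int = 2_000) -> float:
    """USD cost of one review given token counts and per-Mtok prices."""
    in_price, out_price = PRICES_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    prs_per_month = 100
    for model in PRICES_PER_MTOK:
        per_pr = cost_per_pr(model)
        monthly = per_pr * prs_per_month
        print(f"{model:<20} ${per_pr:.4f}/PR  ${monthly:.2f}/mo  ${monthly * 12:.2f}/yr")
```

Swap in your own token counts and PR volume. The ratio between models stays the same as long as the input/output mix does.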

For routine code review, Mythos burns budget 5x faster than Opus 4.8 for almost no quality gain. The model is overpriced for that workload because that's not the workload it's priced for.

The workload it IS priced for: security audits where finding one missed critical vulnerability prevents a $40K-200K incident response. At that ratio, $0.45 per PR (about $45 a month at this volume) is a bargain. But that's a security-team workload, not a general dev-team workload.

When do you actually need Mythos?

Here's how I'm thinking about routing in my own stack:

# Rank severity labels so they compare by rank, not alphabetically.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def pick_model(task):
    """Return a model string for a task that has .type and .severity_floor."""
    if task.type in ("security_audit", "vuln_research"):
        if SEVERITY_RANK[task.severity_floor] >= SEVERITY_RANK["critical"]:
            return "mythos"  # when public
        return "opus-4-8"

    if task.type == "agentic_coding":
        # Long-horizon planning (Dynamic Workflows) and short tasks
        # both go to Opus 4.8 today.
        return "opus-4-8"

    if task.type in ("chat", "summarization", "classification"):
        return "sonnet-4-8"  # cheaper ($3/$15 vs $5/$25), plenty for these

    if task.type == "prototyping":
        return "deepseek-v4"  # free credits, frontier quality

    # default
    return "opus-4-8"

I don't have a single workload where Mythos is the default. My security-audit work goes through Opus 4.8 today and that's the only path where Mythos might show up in my router.

Where Opus 4.8 still beats Mythos

This is the part most launch coverage misses. Mythos is gated, premium-priced, and scoped to security. For everything below, Opus 4.8 wins:

General chat and customer copilots — Mythos pricing doesn't justify it; users won't notice the difference.
Math reasoning — Opus 4.8 hits 93.6-96.7% on GPQA/USAMO; no Mythos data suggests an edge.
Long-context document analysis — Opus 4.8 supports 1M tokens on the API; Mythos context window unknown.
Multi-modal (vision + code) — Opus 4.8 has the full tool surface; Mythos Preview was code-only.
Cost-sensitive production workloads — burns 5x faster on Mythos, kills margins.
No-gating-delay workloads — Mythos public release will probably have a Cyber Verification Program; Opus 4.8 ships today.

Architecture speculation (with caveat)

Anthropic hasn't disclosed Mythos's architecture. Per BuildFastWithAI's analysis, best estimates:

~10 trillion parameters (MoE)
~1-2T active per forward pass
Product tier name probably "Capybara" (above Opus)

This is consistent with where the industry is going: Qwen 3.6 Plus and, reportedly, GPT-5.5 use similar architectures. If these estimates hold, the 5x pricing premium reflects real GPU time, not margin extraction. If you've been wondering why Anthropic raised $65B at a $965B valuation (announced the same day as Opus 4.8), Mythos-class compute is part of the answer.

Cost-per-finding (the math that justifies Mythos)

Anthropic published two cost examples in the Mythos disclosure:

OpenBSD vulnerability discovery: under $50
Full FFmpeg vulnerability sweep: ~$10,000 across several hundred runs

That FFmpeg number sounds expensive until you cost the alternative. A senior security researcher running the same audit takes 3-6 weeks at $200-500/hour, billing $25K-90K. So $10K for the same outcome is a 60-90% cost reduction over human-only work — assuming the findings are good enough to ship without re-validation.

At 90.6% validity, they are. That's why Glasswing partners are paying premium rates.
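The 60-90% figure is simple arithmetic on the numbers quoted above. A minimal sketch, assuming the $10K sweep replaces the human audit outright with no re-validation cost:

```python
# Savings of a fixed-price model sweep versus a human-only audit, using the
# figures quoted in this post ($10K sweep, $25K-90K human audit).

def cost_reduction(model_cost: float, human_cost: float) -> float:
    """Fractional saving of the model sweep relative to the human audit."""
    if human_cost <= 0:
        raise ValueError("human_cost must be positive")
    return 1.0 - model_cost / human_cost


if __name__ == "__main__":
    sweep = 10_000
    for human in (25_000, 90_000):
        print(f"vs ${human:,}: {cost_reduction(sweep, human):.0%} cheaper")
```

If you do budget for human re-validation of the findings, add that cost to `model_cost` before comparing.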

What I'm doing this week

Concrete actions for builders on the Claude API today:

Migrate to Opus 4.8 if you're still on 4.7. Same per-token price, materially better at agentic coding. The only API-breaking change is extended thinking → adaptive thinking. Easy migration.
Audit effort defaults. Opus 4.8's default changed from medium to high. If you didn't set effort explicitly, your costs just went up. Set effort: "medium" to preserve 4.7's default behavior or run the new default if you want the upgrade.
If you do security work, get on the Mythos waitlist. Anthropic gates by use case, not by application. Document your defensive cybersecurity workload clearly. If you maintain internet-critical infrastructure, Project Glasswing-style outreach may already be coming.
Don't refactor for Mythos yet. Build everything on Opus 4.8 today. Mythos is a model-string swap when it lands publicly. Routers like TokenMix will surface it at the same OpenAI-compatible endpoint, so your wiring stays the same.
Budget for evaluation. Reserve $1-5K/month for Mythos eval when public access lands. Run it on your hardest security workloads, compare findings to Opus 4.8 output, decide whether to escalate the budget. Don't move all traffic — just the escalation tier.
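The evaluation in the last item can be sketched as a set comparison: run both models on the same target, deduplicate the findings into IDs, and price what the more expensive model found that the baseline missed. The finding IDs and dollar amounts below are made-up placeholders, and `compare_findings` is a hypothetical helper, not part of any SDK.

```python
# Compare two models' findings on the same target and price the difference.
# Finding IDs and costs here are made-up placeholders for illustration.

def compare_findings(baseline: set[str], candidate: set[str],
                     baseline_cost: float, candidate_cost: float) -> dict:
    """Summarize what the candidate model found that the baseline missed."""
    only_candidate = candidate - baseline
    extra_cost = candidate_cost - baseline_cost
    return {
        "shared": len(baseline & candidate),
        "only_baseline": len(baseline - candidate),
        "only_candidate": len(only_candidate),
        # Extra spend per finding the baseline missed; None if there are none.
        "extra_cost_per_new_finding": (
            extra_cost / len(only_candidate) if only_candidate else None
        ),
    }


if __name__ == "__main__":
    opus = {"VULN-A", "VULN-B", "VULN-C"}
    mythos = {"VULN-B", "VULN-C", "VULN-D", "VULN-E"}
    print(compare_findings(opus, mythos, baseline_cost=40.0, candidate_cost=200.0))
```

The number to watch is `extra_cost_per_new_finding`. If it is small next to what a missed vulnerability would cost you, the escalation tier is paying for itself.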

Bottom line

Mythos is a real capability leap on a narrow workload class. For builders running general coding agents, customer copilots, content pipelines, or production chat, the 5x pricing premium burns budget without proportional return. For security audit teams, vulnerability research, and defensive cybersecurity tooling, Mythos is the new escalation tier and you should plan for it.

Anthropic's "coming weeks" commitment points to a public release somewhere between mid-June and the end of July 2026. That window is my estimate, not a published date. Don't restructure architecture around it. Stay on Opus 4.8, keep your model strings flexible, and route Mythos in when the workload calls for it.
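One way to keep model strings flexible is to resolve them from configuration instead of hard-coding them at call sites. A minimal sketch follows. The tier names and the `MODEL_<TIER>` environment-variable convention are hypothetical choices of mine, and the model strings are the same placeholders used in the router earlier in this post.

```python
import os

# Default model per tier. Tier names and the MODEL_<TIER> env-var convention
# are hypothetical; the model strings are placeholders from this post.
DEFAULT_MODELS = {
    "default": "opus-4-8",
    "cheap": "sonnet-4-8",
    "escalation": "opus-4-8",  # point at the Mythos model string once public
}


def model_for(tier: str, env=None) -> str:
    """Resolve a tier to a model string, letting MODEL_<TIER> override it."""
    env = os.environ if env is None else env
    if tier not in DEFAULT_MODELS:
        raise KeyError(f"unknown tier: {tier}")
    return env.get(f"MODEL_{tier.upper()}", DEFAULT_MODELS[tier])


if __name__ == "__main__":
    print(model_for("escalation"))
    print(model_for("escalation", env={"MODEL_ESCALATION": "mythos"}))
```

With this in place, moving the escalation tier to Mythos is a deploy-time setting, not a code change.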

Full data tables, FAQ, source citations, and a breakdown of all 23,019 Project Glasswing findings are in the original article on TokenMix.

If you want to test multiple Claude tiers against each other without managing multiple API keys, TokenMix routes Opus 4.8, Sonnet 4.8, and (when public) Mythos through one OpenAI-compatible endpoint at Anthropic's published rates.

What's your take — are you waiting for Mythos, or sticking with Opus 4.8? Drop a comment.
