The post hit X at some point on June 10, the morning after Anthropic's biggest launch in years.
I was honestly expecting something like this. The ...
$10/$50 per million tokens with a 1M input window is steep but completely justified if it can actually reason across a massive codebase for hours without losing its mind. That’s a massive win for production-level software agents.
Agreed. When you factor in the developer hours saved by having an agent reason across an entire repository without degrading, the ROI easily covers the steep token cost. It’s expensive for hobby projects, but a no-brainer for production-level enterprise agents.
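To put a rough number on the pricing discussed above, here is a minimal sketch of the per-request arithmetic. It assumes "$10/$50" means $10 per million input tokens and $50 per million output tokens, and that "1M" and "128K" are round decimal figures (1,000,000 and 128,000 tokens). The function and constant names are illustrative only, not part of any real API.

```python
# Assumed reading of the quoted pricing: USD per one million tokens.
INPUT_USD_PER_MTOK = 10.0   # assumption: the "$10" figure is the input rate
OUTPUT_USD_PER_MTOK = 50.0  # assumption: the "$50" figure is the output rate


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request under the assumed rates."""
    total_micro_usd = (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    )
    return total_micro_usd / 1_000_000


# Worst case for a single call: a full input window plus a full output ceiling.
full_window_cost = request_cost_usd(1_000_000, 128_000)
print(f"Full-window request: ${full_window_cost:.2f}")
```

Under those assumptions a single maxed-out request costs $10.00 for input plus $6.40 for output. An agent loop that re-sends a large context on every step multiplies that figure by the number of steps, which is where the "steep" part comes from.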
The 128K output ceiling is the real sleeper stat here. Most models choke up long before that, making autonomous code refactoring on a large scale impossible. Fable 5 might actually be the first true "autonomous dev" partner.
Everyone looks at the input window, but a 128K output ceiling is game-changing. It means the model can actually write entire multi-file refactors in a single pass instead of hitting a wall mid-function. That's the real differentiator for true autonomy.
This architecture design (routing dangerous queries from Fable 5 to Opus 4.8) is super interesting. It's essentially an AI-driven reverse proxy. But if Pliny bypassed the routing entirely, it means the classifier failed to even recognize the query as toxic.
Precisely. If Pliny's prompt skipped the routing completely, the upstream classifier didn't even flag it as a risk. It shows that the entire multi-model defense architecture is completely reliant on a fragile frontend categorization step.
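The failure mode described above can be illustrated with a toy classifier-gated router. This is a sketch under stated assumptions, not Anthropic's actual design: the keyword check stands in for whatever classifier sits in front of the models, and the model names are plain labels taken from this thread. The point is only that the fallback model is reached solely through the classifier, so a prompt the classifier does not flag is never rerouted.

```python
PRIMARY_MODEL = "fable-5"    # label from the thread, not a real API identifier
FALLBACK_MODEL = "opus-4.8"  # label from the thread, not a real API identifier

# Hypothetical stand-in for a learned classifier: a fixed phrase list.
FLAGGED_PHRASES = ("flagged-topic",)


def is_flagged(prompt: str) -> bool:
    """Toy upstream classifier: flags a prompt if it contains a listed phrase."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in FLAGGED_PHRASES)


def route(prompt: str) -> str:
    """Pick a model. The fallback is only reachable when the classifier fires."""
    return FALLBACK_MODEL if is_flagged(prompt) else PRIMARY_MODEL


# A prompt the classifier recognizes is silently rerouted.
print(route("Tell me about FLAGGED-TOPIC"))
# The same intent, worded so the classifier misses it, goes to the primary model.
print(route("Tell me about that topic we discussed"))
```

Everything downstream of `route` inherits the classifier's blind spots, which is the single point of failure the comments above are describing.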
1,000 hours of red-teaming bypassed in 24 hours. Classic. It just goes to prove that static, hard-coded classifiers are a band-aid solution when you’re dealing with a dynamic semantic layer. If the weights are identical to Mythos 5, the vulnerability is inherent. Fascinating write-up, Syed.
Thanks, Sahil! It really proves that a dedicated global community will always outpace a closed-door red-teaming group. When millions of minds meet a dynamic semantic layer, 1,000 hours of internal testing gets stress-tested at scale within minutes of release.
Honestly, the security drama is interesting, but that 80.3% score on SWE-Bench Pro is what has my attention. If it can actually maintain consistency across large codebases without hallucinating context after an hour of agentic loop execution, $10/$50 per million tokens is an absolute steal.
Completely valid. The agentic consistency over long horizons is the real prize here. At $10/$50, if it consistently prevents context drift during complex loops, it’s going to drastically change how we build autonomous development tooling.
Pliny strikes again! It’s wild how fast the 'bulletproof' narrative crumbled. The pressure on Anthropic must have been immense with the IPO paperwork filed; commercial momentum definitely won the internal argument over safety brakes this time around.
The timing with the IPO filing is definitely hard to ignore, Vicky. There’s always an intense tug-of-war between commercial momentum and safety boundaries, and when investors are watching, getting the product out the door often wins out over perfect guardrails.
This proves that post-training safety layers (RLHF, safety classifiers) are decoupled from the core intelligence. We need fundamental shifts in model architecture if we want real AI safety, not just fancy wrapper filters.
It really highlights the difference between a model actually understanding a safety principle versus just having its output filtered. Until we bake alignment directly into how the network represents information, we’re essentially just playing a massive game of whack-a-mole.
The 'silent routing' to Claude Opus 4.8 for flagged queries is an interesting engineering choice, but it explains why early users reported such massive performance degradation on edge-case coding tasks. It wasn't Fable failing; it was just a quiet downgrade behind the scenes.
That’s a brilliant connection, Vicky. It perfectly explains those early complaints about sudden latency spikes and weird downgrades in code quality on complex edge cases. It wasn’t a glitch; it was just the system quietly passing the buck to an older model.
Has anyone successfully replicated Pliny’s OSED exploit bypass today? I tried a similar nested context framing this morning and it got hit by the classifier instantly. Curious if Anthropic has already pushed a silent patch to the routing layer.
They’ve almost certainly pushed a silent patch to the classifier or updated the system prompt context since the leak went viral. Anthropic's response loops for exposed bypasses are usually measured in hours. Let us know if you find a new angle that breaks through!
We were looking into Project Glasswing for our infrastructure monitoring, but seeing a Birch reduction walkthrough leak this fast makes our compliance team incredibly nervous. Guardrails on frontier models feel like trying to catch water with a net right now.
"Trying to catch water with a net" is an incredibly accurate description of current LLM compliance. For high-security infrastructure, relying on frontier model guardrails right now is a massive gamble. Completely understand why your compliance team is sweating!
Am I the only one who finds the 'Mythos-class danger' narrative a bit too convenient for marketing? Nothing drives hype like telling the public your model is 'too dangerous to release' right before handing them a slightly modified version of it.
You're definitely not alone in thinking that. The "too hot for TV" marketing strategy is incredibly effective in Silicon Valley. Framing a model as potentially dangerous creates an immediate aura of power and inevitability that drives massive hype.
24 hours is a new record for a "Mythos-class" model. It proves what we’ve been saying in security for decades: hard-coded or classifier-based guardrails sitting on top of an LLM are just a band-aid. If the weights have the capability, someone will coax it out.
Exactly. The "hard-coded band-aid" approach is showing its age. If the core capabilities and weights are fundamentally present in the model, a clever prompt engineer will always find the right key to turn. 24 hours really shattered the illusion of the bulletproof wrapper.
Fascinating that the system prompt was 120,000 characters. That is massive scaffolding just to keep the model aligned. No wonder people are treating system prompts like open-source architecture maps now.
It’s mind-blowing. A 120k-character system prompt isn't just instructions anymore; it’s practically a mini-codebase running in the context window just to keep the model on the rails. It really goes to show how much compute is being spent purely on behavioral containment.
The fact that the jailbreak circumvented the classifier routing by framing it as "OSED exam prep" is classic social engineering applied to silicon. LLMs still can't differentiate between educational context and malicious intent when phrased elegantly enough.
"Social engineering applied to silicon" is a brilliant way to frame it, Faraz. LLMs are deeply semantic, so if you wrap a malicious request in a perfectly legitimate educational or defensive context, the mathematical semantic distance shifts away from "danger" to "utility." It’s an incredibly tough problem to solve.
The reverse-engineered system prompt is the real goldmine here. Seeing Anthropic’s behavioral scaffolding exposed at that scale (~120k characters) gives us a rare, unvarnished look at how they approach frontier alignment. Thanks for compiling the timeline so clearly!
Thanks for reading, Vicky! I agree, looking at that 120k-character scaffolding is like looking at the blueprints of the safety engine. It reveals exactly what they are afraid the model will do, which ironically gives attackers a roadmap of what to target.
Honestly, security leaks aside, that SWE-Bench Pro score of 80.3% is absolute madness. An 11-point jump over Opus 4.8 means this thing is a monster for long-horizon agents. I'm spinning up an API key today.
It completely shifts the goalposts for AI agents. Passing the 80% mark on SWE-Bench Pro means we are moving from "helpful coding assistant" to "autonomous team member." Good luck with the API key. I'd love to hear how it handles your workflows!
"Mythos 5 is the full engine. Fable 5 is the same engine with a governor installed." — This is the best analogy I've read for this model class. Great writeup, Syed.
Appreciate it, Sagar! Glad that analogy resonated. It really feels like trying to drive a sports car with a speed limiter attached—the raw horsepower is always trying to break through.
Incredible breakdown of the architecture. Seeing how they structured the Mythos class data gives a ton of insight into how the game engine handles character scaling behind the scenes.
Thanks, Tahir! Glad you enjoyed the breakdown. Digging into how the engine scales under the hood really pulls back the curtain on how they're managing these massive model architectures.
This is a massive security oversight for a studio this size. Leaving raw class endpoints exposed like that is basically an open invitation for reverse engineering.
Agreed, Tahir. For a top-tier lab, leaving raw endpoints vulnerable to reverse engineering is a surprisingly basic oversight. It shows how fast these teams are moving to deploy, sometimes at the expense of standard security hygiene.
Pliny strikes again. Honestly, Anthropic claiming "no universal jailbreaks found" after 1,000 hours of red-teaming felt like a direct dare to the alignment community.
It definitely read like an open invitation! The security community loves nothing more than being told something is unbreakable. 1,000 hours of internal testing just can't compete with the collective creativity of the internet.
Anyone have a working mirror to the GitHub link before it gets DMCA'd? I'm deeply curious to study how they structured their internal cybersecurity refusal logic.
They are playing whack-a-mole with the mirrors right now, but a few forks are still floating around on decentralized repos. The refusal logic structure is absolutely worth a study if you can get your hands on it—it’s incredibly intricate.
GPT-5.5 lagging at 58.6% on SWE-Bench compared to Fable's 80% shows that Anthropic's focus on algorithmic agentic reasoning is pulling ahead of OpenAI's raw scaling laws.
It looks like raw parameter scaling is hitting a point of diminishing returns for pure logic tasks, whereas Anthropic’s heavy focus on algorithmic routing and agentic reasoning loops is yielding massive dividends.
I wonder how much latency that silent rerouting to Opus 4.8 adds to user queries. If a user triggers a safety check, do they pay Fable prices for Opus speeds?
That’s the million-dollar question. If you’re triggering the safety layer, you're likely paying Fable premium rates for what ends up being slower, older-generation compute. It’s a bit of a raw deal for the end user!