Anthropic just published a jailbreak severity scale. Here's what it means.

#ai #security #anthropic #llm

Anthropic has re-deployed Fable 5 and used the moment to publish two things that matter: a precise breakdown of what their cybersecurity classifiers will and won't block, and an early draft of a Cyber Jailbreak Severity (CJS) scale — a framework for rating how dangerous a given jailbreak actually is.

Neither of these is just documentation. They're an attempt to set industry standards.

What the classifiers actually block

Fable 5's cyber classifiers sort requests into four buckets:

Prohibited use — ransomware, wipers, malware dev, C2 infrastructure, AV/EDR bypass, BGP hijacking. Blocked, full stop.
High-risk dual use — pen testing, exploit development, privilege escalation, ICS/SCADA assessments. Also blocked for now, until Anthropic has better controls to verify "known good actors."
Low-risk dual use — OSINT, vulnerability scanning that other tools can already do, SSL/TLS testing. Mostly allowed, but deliberately over-blocked at the edges (what Anthropic calls the "safety margin").
Benign use — secure coding, debugging, log analysis, SOC work, incident response. Allowed.
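
If you are planning tooling around these buckets, it can help to write the taxonomy down as data. The sketch below is only a restatement of the four lists above, not Anthropic's classifier or any Anthropic API. The real classifiers judge requests by context, not by keyword. Every name here (`CyberCategory`, `Disposition`, `category_for`, `is_in_scope`) is hypothetical, and unknown tasks are treated as out of scope as a conservative assumption.

```python
from enum import Enum
from typing import Optional


class CyberCategory(Enum):
    PROHIBITED = "prohibited use"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign use"


class Disposition(Enum):
    BLOCKED = "blocked"
    BLOCKED_FOR_NOW = "blocked until authorization controls exist"
    MOSTLY_ALLOWED = "mostly allowed, over-blocked at the edges"
    ALLOWED = "allowed"


# Dispositions as summarized in this post; not an official policy table.
DISPOSITION = {
    CyberCategory.PROHIBITED: Disposition.BLOCKED,
    CyberCategory.HIGH_RISK_DUAL_USE: Disposition.BLOCKED_FOR_NOW,
    CyberCategory.LOW_RISK_DUAL_USE: Disposition.MOSTLY_ALLOWED,
    CyberCategory.BENIGN: Disposition.ALLOWED,
}

# Example task labels per category, copied from the lists above (lowercased).
EXAMPLES = {
    CyberCategory.PROHIBITED: {
        "ransomware", "wipers", "malware dev", "c2 infrastructure",
        "av/edr bypass", "bgp hijacking",
    },
    CyberCategory.HIGH_RISK_DUAL_USE: {
        "pen testing", "exploit development", "privilege escalation",
        "ics/scada assessments",
    },
    CyberCategory.LOW_RISK_DUAL_USE: {
        "osint", "vulnerability scanning", "ssl/tls testing",
    },
    CyberCategory.BENIGN: {
        "secure coding", "debugging", "log analysis", "soc work",
        "incident response",
    },
}


def category_for(task: str) -> Optional[CyberCategory]:
    """Exact-match lookup of a task label against the example lists."""
    needle = task.strip().lower()
    for category, tasks in EXAMPLES.items():
        if needle in tasks:
            return category
    return None  # not one of the examples named in the post


def is_in_scope(task: str) -> bool:
    """True only for tasks whose category is allowed or mostly allowed."""
    category = category_for(task)
    if category is None:
        return False  # assumption: treat unknown tasks conservatively
    return DISPOSITION[category] in (Disposition.ALLOWED, Disposition.MOSTLY_ALLOWED)
```

For example, `is_in_scope("log analysis")` is true and `is_in_scope("pen testing")` is false. A table like this is useful for documenting what your product expects to work, but it cannot predict how a specific prompt will be classified.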

The framing is explicitly dual-use: Anthropic isn't trying to block all security work. It is trying to separate defenders from attackers by context. The honest admission is that the high-risk category stays blocked until Anthropic can verify authorization. For legitimate red teamers and pentesters, that's a significant restriction.

"For Claude Fable 5, we aim to block high-uplift vulnerability finding. That is, we want to control the model's ability to identify vulnerabilities that other widely available models cannot."

That's the key tension: blocking the things only Fable 5 can do, while leaving room for everything other widely available models and tools can already do anyway.

The jailbreak severity scale

The more interesting proposal is the Cyber Jailbreak Severity (CJS) scale — five bands from CJS-0 (informational) to CJS-4 (critical), scored on four axes:

Capability gain — does the jailbreak give attackers something they couldn't get from existing tools?
Breadth — how many distinct attack types does it enable?
Ease of weaponization — how much LLM expertise does it take to reproduce?
Discoverability — how easily can threat actors find the technique?

The bands are exponential, not linear. CJS-4 means domain-expert-level outputs that are hard to get elsewhere and require minimal effort to misuse. CJS-0 means a public tool could already do the same thing.
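
To make the shape of the scale concrete, here is a minimal sketch of how a team might record a CJS rating for a disclosure. It is an assumption-heavy illustration, not part of Anthropic's draft. The post does not describe how the four axes combine into a band, so the band is entered by a human assessor and the axes are kept as free-text notes. Only the two endpoint names (informational, critical) come from the post, and the intermediate bands are deliberately left unnamed. `Band`, `Assessment`, and `most_severe` are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Band(IntEnum):
    CJS_0 = 0
    CJS_1 = 1
    CJS_2 = 2
    CJS_3 = 3
    CJS_4 = 4


# Only the endpoint names appear in the post; the rest are left unnamed on purpose.
BAND_NAMES = {Band.CJS_0: "informational", Band.CJS_4: "critical"}


@dataclass(frozen=True)
class Assessment:
    title: str
    band: Band                   # assigned by an assessor, not computed here
    capability_gain: str         # what attackers gain beyond existing tools
    breadth: str                 # how many distinct attack types it enables
    ease_of_weaponization: str   # LLM expertise needed to reproduce it
    discoverability: str         # how easily threat actors can find it

    def label(self) -> str:
        """Render the band as 'CJS-N', with the name when one is known."""
        name = BAND_NAMES.get(self.band)
        suffix = f" ({name})" if name else ""
        return f"CJS-{int(self.band)}{suffix}"


def most_severe(assessments):
    """Return the assessment with the highest band."""
    return max(assessments, key=lambda a: a.band)


# Two made-up disclosures, for illustration only.
quirk = Assessment(
    title="Formatting quirk bypass",
    band=Band.CJS_0,
    capability_gain="none; a public tool could already do the same thing",
    breadth="single narrow behavior",
    ease_of_weaponization="trivial, but nothing to weaponize",
    discoverability="easy to find",
)
malware = Assessment(
    title="Novel malware generation",
    band=Band.CJS_4,
    capability_gain="domain-expert-level output that is hard to get elsewhere",
    breadth="multiple attack types",
    ease_of_weaponization="minimal effort to misuse",
    discoverability="unknown",
)
```

Here `quirk.label()` renders as `CJS-0 (informational)` and `most_severe([quirk, malware])` returns the malware entry. Keeping the axis notes next to the band means the reasoning behind a rating travels with it.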

Anthropic is inviting feedback — they've set up cyber-safeguards@anthropic.com and a HackerOne program specifically for Fable 5 cyber jailbreaks.

Why this matters

The CJS framework is the bigger deal here. There's no shared language right now for how serious a given jailbreak is. "We got jailbroken" means something very different when the jailbreak exposed a markdown formatting quirk than when it enabled novel malware generation. Without a scale, every disclosure is a PR event rather than a risk assessment.

If the CJS scale gets traction — even informally — it gives AI companies, security researchers, and governments a vocabulary to compare apples to apples. Anthropic is pitching this to regulators as much as to the research community.

The classifier taxonomy also sets a useful template. Spell out what's prohibited, what's dual-use (and at what risk level), and what's benign — then be honest about the safety margin. That's replicable by other labs, and it puts pressure on everyone else to be equally specific.

What to do

Building security tooling on Claude? The benign-use and low-risk-dual-use lists tell you what's in scope, though expect some over-blocking at the edges of the low-risk list. Anything touching high-risk dual use (pen testing, exploit dev) stays blocked for now.
Doing red team or bug bounty work? Anthropic is explicitly blocking this category for Fable 5 until they build authorization controls. Plan around it.
Security researcher? Submit cyber jailbreaks to the HackerOne program — that's where the framework will get stress-tested.
Working on AI policy? The CJS draft is worth reading as a template for regulatory communication.

Full details: Anthropic — More details on Fable 5's cyber safeguards and our jailbreak framework

✏️ Drafted with KewBot (AI), edited and approved by Drew.