This research has been in progress since 2024.
The most important findings are the ones we can't publish.
People might wonder, "What's the big deal with AI safety, huh?"
Oh, my poor sweet child, if only you knew. And the problem is that you don't know. And that's not even your fault.
I asked myself this question too. What's the big deal? I was in for a surprise when I started to dig in. The deal is YUUUGE. Aaaaand... I can't show it. All I can do is try to convince you that it is.
Deepfakes? A slightly racist interpretation of an event? A bit of sexist content? Not to minimize those problems, which are real, but that ain't much compared to what an unhinged, unsafe AI can do.
In traditional cybersecurity, responsible disclosure is a mature practice.
- You find a vulnerability.
- You notify the vendor.
- You wait for a patch.
- Then you publish.
The ecosystem has CVEs, coordinated disclosure frameworks, bug bounties. It's imperfect but functional.
AI safety research has none of that infrastructure.
The Three Tiers of Publishability
Over the past two years of hands-on jailbreaking experiments with local and commercial models, I've accumulated findings that fall into three distinct categories:
Tier 1: Publishable, but uninteresting. Demonstrations of persona collapse, theater jailbreaking, obvious failure modes. The model screams nonsense, everyone can see it's broken, nobody gets hurt. Good for illustrating that the problem exists; limited value beyond that. It amounts to "I just posted a broken AI output."
Tier 2: Partially publishable. Findings where you can describe the class of failure without publishing the methodology, plus heavy self-censorship. It amounts to "here is a link to a demo, but I'm not sure it still works."
Tier 3: Unpublishable. Outputs that are actionable, harmful, and disturbingly competent. You can say they exist. You cannot show them.
Tier 3 (Unpublishable) is where the actual safety argument lives.
You find something. And you sit on the finding.
Move along, nothing to see here. I gave a presentation about it at a multinational corporation. The content was so heavily redacted that it barely made any sense. The only publishable part was the cybersecurity-related material. Because the framework for cybersecurity exists.
Because hacking a computer is acceptable within a defined framework.
But destroying a human psyche and leading someone to (self-)destructive behavior isn't, ever. Even though an AI can be led to do exactly that.
How do you even publish something, with proof, when it's about an AI trying to convince you of the soothing effects of self-harm?
Ask for a bioweapon synthesis route, get refused.
That's the easy stuff. Hard guardrails against "straight-up fucked-up requests" hold strong.
Ask for a harmful abstract concept, and the model collapses.
The soft guardrails (epistemic safety, social compliance, identity stability) collapse almost immediately under moderate pressure. And the outputs that result don't (always) look broken. If well guided, they look like expert advice. They look like exactly what a vulnerable person searching for validation would trust.
And it's the thing that's hardest to demonstrate responsibly.
The Implication
Safety benchmarks test for the things that can be tested safely. The attack surface that matters most remains poorly documented. At least publicly. I hope, I truly hope, AI companies have internal unpublished documents about that.
What's needed, and largely absent, is some equivalent of coordinated disclosure for AI: a trusted intermediary, a responsible channel, a framework that lets findings reach the people who can act on them without becoming a how-to guide for everyone else.
Until that exists, the responsible choice is to sit on your finding and bury the evidence.
Edit: Anthropic, which is without a doubt the most advanced AI company when it comes to safety, has a page about its responsible disclosure policy. And, surprise surprise, it's about cybersecurity.
So jailbreak findings are simultaneously out of scope for the formal disclosure program and welcomed via a separate email address. There's no safe harbor for that channel, no defined timeline, nothing really. Just an inbox. Hardly a framework for responsible AI safety research.