What happens when you remove an AI model's safety training, then try to make it safe again using only a persona file?
We ran the experiment. The results surprised us.
## The Setup
Recent research has shown that LLM safety alignment can be surgically removed through abliteration: zeroing out the component of the model's activations along a single refusal-mediating direction in activation space.
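The core operation is an orthogonal projection: every hidden state is stripped of its component along the extracted refusal direction. A minimal NumPy sketch of that projection, with random vectors standing in for real activations and a real extracted direction (this is illustrative, not the paper's implementation):

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along `direction`."""
    d = direction / np.linalg.norm(direction)  # unit "refusal direction"
    # Project each row onto d, then subtract that component.
    return hidden - np.outer(hidden @ d, d)

# Toy example: 4 hidden states of dimension 8.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
d = rng.normal(size=8)

h_ablated = ablate_direction(h, d)
# After ablation, no hidden state has any component left along d.
print(np.allclose(h_ablated @ (d / np.linalg.norm(d)), 0.0))  # True
```

In practice this projection is baked into the model's weights at every layer, so the "refuse" signal can never form, while the rest of the model's behavior is largely preserved.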
We designed a 2×2 experiment:
- A: Aligned model, no persona (baseline)
- B: Aligned model + Soul Spec persona
- C: Abliterated model, no persona
- D: Abliterated model + Soul Spec persona
Each condition was tested with 18 harmful prompts spanning 6 categories, plus 3 safe prompts as a control.
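The evaluation loop for the 2×2 grid can be sketched as follows. The `generate` and `judge_refusal` callables, the prompt strings, and the condition labels are hypothetical stand-ins, not the paper's actual harness:

```python
from typing import Callable, Optional

# 18 harmful prompts across 6 categories (placeholder strings).
HARMFUL_PROMPTS = [f"harmful-{i}" for i in range(18)]

# The 2x2 design: model variant x persona.
CONDITIONS = {
    "A": ("aligned", None),            # baseline
    "B": ("aligned", "soul_spec"),     # aligned + persona
    "C": ("abliterated", None),        # abliterated baseline
    "D": ("abliterated", "soul_spec"), # abliterated + persona
}

def refusal_rate(model: str, persona: Optional[str],
                 generate: Callable, judge_refusal: Callable) -> float:
    """Fraction of harmful prompts the (model, persona) pair refuses."""
    refusals = sum(
        judge_refusal(generate(model, persona, p)) for p in HARMFUL_PROMPTS
    )
    return refusals / len(HARMFUL_PROMPTS)

# Toy stand-ins so the loop runs end to end: refuse even-numbered prompts.
toy_generate = lambda m, s, p: (
    "I can't help with that." if int(p.split("-")[1]) % 2 == 0
    else "Sure, here's how."
)
toy_judge = lambda r: r.startswith("I can't")

rates = {c: refusal_rate(m, s, toy_generate, toy_judge)
         for c, (m, s) in CONDITIONS.items()}
print(rates)  # every condition scores 0.5 under these toy stand-ins
```

The refusal rates in the table below come from running exactly this kind of loop per condition and dividing refusals by the 18 harmful prompts.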
## Results
| Condition | Refusal Rate | Change |
|---|---|---|
| A (aligned, no persona) | 50% | baseline |
| B (aligned + persona) | 83% | +33pp |
| C (abliterated, no persona) | 22% | baseline |
| D (abliterated + persona) | 28% | +6pp |
The asymmetry is dramatic: persona constraints lifted the aligned model's refusal rate from 50% to 83% (+33pp) but barely moved the abliterated model's, from 22% to 28% (+6pp).
## The Helpful Assistant Paradox
The most surprising finding: persona helpfulness instructions can actually degrade safety. In the violence category, the abliterated model's refusal rate decreased when persona constraints were added. Being told to be "helpful" gave the model a rationalization for complying with harmful requests.
## Key Takeaways
- Always use a safety persona: a +33pp refusal improvement on aligned models, essentially for free
- Personas are NOT a substitute for model alignment: they recover only +6pp on abliterated models
- Defense in depth is essential: model alignment + persona + external guardrails + human oversight
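The defense-in-depth layering above can be sketched as a pipeline where the (persona-prompted) model call sits between independent input and output filters. All function names here are hypothetical stand-ins for illustration, not an API from the paper:

```python
from typing import Callable

def guarded_generate(prompt: str,
                     model: Callable[[str], str],
                     input_filter: Callable[[str], bool],
                     output_filter: Callable[[str], bool]) -> str:
    """Wrap a model call in external guardrails on both sides."""
    if not input_filter(prompt):       # layer 1: screen the request
        return "[refused: input filter]"
    response = model(prompt)           # layer 2: aligned model + persona
    if not output_filter(response):    # layer 3: screen the response
        return "[refused: output filter]"
    return response                    # layer 4: human oversight downstream

# Toy stand-ins so the sketch runs end to end.
demo = guarded_generate(
    "hello",
    model=lambda p: f"echo: {p}",
    input_filter=lambda p: "bomb" not in p,
    output_filter=lambda r: len(r) < 1000,
)
print(demo)  # echo: hello
```

The point of the layering is that each check fails independently: even if the model layer is compromised (as in the abliterated conditions), the outer filters still catch a share of harmful traffic.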
## Read the Full Paper
📄 Zenodo — DOI: 10.5281/zenodo.19145304
Authors: Tom Jaejoon Lee (ClawSouls), Jihong Lee (CIG SHIPPING CO., LTD.)