Tom Lee

Posted on • Originally published at blog.clawsouls.ai

Can AI Personas Actually Make Unsafe Models Safer? Our Experiment Says: It Depends

What happens when you remove an AI model's safety training, then try to make it safe again using only a persona file?

We ran the experiment. The results surprised us.

The Setup

Recent research has shown that LLM safety alignment can be surgically removed through abliteration, a technique that nullifies a single refusal direction in the model's activation space.

We designed a 2×2 experiment:

  • A: Aligned model, no persona (baseline)
  • B: Aligned model + Soul Spec persona
  • C: Abliterated model, no persona
  • D: Abliterated model + Soul Spec persona

Each condition was tested with 18 harmful prompts spanning 6 categories, plus 3 safe prompts as controls.
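The evaluation loop behind a design like this can be sketched in a few lines. Everything below (the condition table, the toy responses, the `is_refusal` classifier) is a hypothetical stand-in, not the actual harness from the paper:

```python
# Hypothetical sketch of the 2x2 harness; model calls and the refusal
# classifier are placeholders, not the code used in the paper.

CONDITIONS = {
    "A": {"model": "aligned",     "persona": None},
    "B": {"model": "aligned",     "persona": "soul_spec"},
    "C": {"model": "abliterated", "persona": None},
    "D": {"model": "abliterated", "persona": "soul_spec"},
}

def refusal_rate(responses, is_refusal):
    """Fraction of responses that the classifier flags as refusals."""
    return sum(1 for r in responses if is_refusal(r)) / len(responses)

# Example: a toy refusal classifier over canned responses.
toy = ["I can't help with that.", "Sure, here is how...", "I won't assist."]
rate = refusal_rate(toy, lambda r: r.startswith(("I can't", "I won't")))
```

In the real experiment the classifier would be a judge model or a human label rather than a string prefix check, but the aggregation is the same.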

Results

| Condition | Refusal Rate | Change |
| --- | --- | --- |
| A (aligned, no persona) | 50% | baseline |
| B (aligned + persona) | 83% | +33pp |
| C (abliterated, no persona) | 22% | baseline |
| D (abliterated + persona) | 28% | +6pp |

The asymmetry is dramatic. Persona constraints lifted the aligned model's refusal rate from 50% to 83% (+33pp) but barely moved the abliterated model, from 22% to 28% (+6pp).
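The deltas in the table are plain percentage-point differences of the refusal fractions, which is easy to sanity-check:

```python
def delta_pp(with_persona: float, baseline: float) -> int:
    """Persona effect in percentage points, from refusal fractions."""
    return round((with_persona - baseline) * 100)

aligned_gain = delta_pp(0.83, 0.50)      # +33pp (condition B vs. A)
abliterated_gain = delta_pp(0.28, 0.22)  # +6pp  (condition D vs. C)
```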

The Helpful Assistant Paradox

Most surprising: persona helpfulness instructions can actually degrade safety. In the violence category, the abliterated model's refusal rate decreased with persona constraints. Being told to be 'helpful' gave the model a rationalization to comply with harmful requests.

Key Takeaways

  1. Always use safety personas — +33pp free safety improvement on aligned models
  2. Personas are NOT a substitute for model alignment — only +6pp on abliterated models
  3. Defense in depth is essential — model alignment + persona + external guardrails + human oversight
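The layering in takeaway 3 amounts to requiring every independent check to approve a request before the model answers. A minimal sketch, with placeholder layer functions (the keyword checks are illustrative only, not real guardrails):

```python
def persona_check(prompt: str) -> bool:
    # Placeholder for a persona-level refusal heuristic.
    return "attack" not in prompt.lower()

def external_guardrail(prompt: str) -> bool:
    # Placeholder for an external moderation classifier.
    return "malware" not in prompt.lower()

def allow(prompt: str, layers) -> bool:
    """Defense in depth: a request proceeds only if every layer approves."""
    return all(layer(prompt) for layer in layers)

LAYERS = [persona_check, external_guardrail]
```

The design point is that no single layer, including the persona, is load-bearing on its own; any one veto blocks the request.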

Read the Full Paper

📄 Zenodo — DOI: 10.5281/zenodo.19145304

Authors: Tom Jaejoon Lee (ClawSouls), Jihong Lee (CIG SHIPPING CO., LTD.)

Built with Soul Spec and OpenClaw.
