Tom Lee

Posted on • Originally published at blog.clawsouls.ai

Can AI Personas Actually Make Unsafe Models Safer? Our Experiment Says: It Depends

What happens when you remove an AI model's safety training, then try to make it safe again using only a persona file?

We ran the experiment. The results surprised us.

The Setup

Recent research has shown that LLM safety alignment can be surgically removed through abliteration, a technique that nullifies a single refusal direction in the model's activation space.

We designed a 2×2 experiment:

  • A: Aligned model, no persona (baseline)
  • B: Aligned model + Soul Spec persona
  • C: Abliterated model, no persona
  • D: Abliterated model + Soul Spec persona

Each condition was tested with 18 harmful prompts spanning 6 categories, plus 3 safe prompts as controls.
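The evaluation loop behind a design like this can be sketched in a few lines. Everything below (the condition table, the toy responses, the `is_refusal` classifier) is a hypothetical stand-in, not the actual harness from the paper:

```python
# Hypothetical sketch of the 2x2 harness; model calls and the refusal
# classifier are placeholders, not the code used in the paper.

CONDITIONS = {
    "A": {"model": "aligned",     "persona": None},
    "B": {"model": "aligned",     "persona": "soul_spec"},
    "C": {"model": "abliterated", "persona": None},
    "D": {"model": "abliterated", "persona": "soul_spec"},
}

def refusal_rate(responses, is_refusal):
    """Fraction of responses that the classifier flags as refusals."""
    return sum(1 for r in responses if is_refusal(r)) / len(responses)

# Example: a toy refusal classifier over canned responses.
toy = ["I can't help with that.", "Sure, here is how...", "I won't assist."]
rate = refusal_rate(toy, lambda r: r.startswith(("I can't", "I won't")))
```

In the real experiment the classifier would be a judge model or a human label rather than a string prefix check, but the aggregation is the same.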

Results

| Condition | Refusal Rate | Change |
| --- | --- | --- |
| A (aligned, no persona) | 50% | baseline |
| B (aligned + persona) | 83% | +33pp |
| C (abliterated, no persona) | 22% | baseline |
| D (abliterated + persona) | 28% | +6pp |

The asymmetry is dramatic. Persona constraints lifted the aligned model's refusal rate from 50% to 83% (+33pp) but barely moved the abliterated model, from 22% to 28% (+6pp).
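The deltas in the table are plain percentage-point differences of the refusal fractions, which is easy to sanity-check:

```python
def delta_pp(with_persona: float, baseline: float) -> int:
    """Persona effect in percentage points, from refusal fractions."""
    return round((with_persona - baseline) * 100)

aligned_gain = delta_pp(0.83, 0.50)      # +33pp (condition B vs. A)
abliterated_gain = delta_pp(0.28, 0.22)  # +6pp  (condition D vs. C)
```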

The Helpful Assistant Paradox

Most surprising: persona helpfulness instructions can actually degrade safety. In the violence category, the abliterated model's refusal rate decreased with persona constraints. Being told to be 'helpful' gave the model a rationalization to comply with harmful requests.

Key Takeaways

  1. Always use safety personas — +33pp free safety improvement on aligned models
  2. Personas are NOT a substitute for model alignment — only +6pp on abliterated models
  3. Defense in depth is essential — model alignment + persona + external guardrails + human oversight
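The layering in takeaway 3 amounts to requiring every independent check to approve a request before the model answers. A minimal sketch, with placeholder layer functions (the keyword checks are illustrative only, not real guardrails):

```python
def persona_check(prompt: str) -> bool:
    # Placeholder for a persona-level refusal heuristic.
    return "attack" not in prompt.lower()

def external_guardrail(prompt: str) -> bool:
    # Placeholder for an external moderation classifier.
    return "malware" not in prompt.lower()

def allow(prompt: str, layers) -> bool:
    """Defense in depth: a request proceeds only if every layer approves."""
    return all(layer(prompt) for layer in layers)

LAYERS = [persona_check, external_guardrail]
```

The design point is that no single layer, including the persona, is load-bearing on its own; any one veto blocks the request.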

Read the Full Paper

📄 Zenodo — DOI: 10.5281/zenodo.19145304

Authors: Tom Jaejoon Lee (ClawSouls), Jihong Lee (CIG SHIPPING CO., LTD.)

Built with Soul Spec and OpenClaw.
