DEV Community

rick

Posted on • Originally published at ho4040.github.io

psyctl: Steer LLM Personality Without Fine-Tuning

What if you could make an LLM more extroverted — without any training?

That's the idea behind psyctl, a CLI tool I'm building at Modulabs Persona Lab. It lets you extract personality vectors from a model's internal activations and inject them during inference to shift behavior. No fine-tuning, no LoRA, no RLHF — just vector addition.

How It Works

The technique is called Contrastive Activation Addition (CAA). Here's the pipeline:

  1. Generate a contrastive dataset — pairs of responses that differ only in personality (e.g., extroverted vs. neutral)
  2. Extract a steering vector — compute the mean activation difference between the two response sets
  3. Inject the vector at inference — add the vector to a target layer's activations during the forward pass
  4. Validate with psychological tests — run standardized inventories to measure the personality shift

What's fascinating is that meaningful behavior changes emerge from simple vector arithmetic on activations — no gradient updates needed.
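The arithmetic behind steps 2 and 3 fits in a few lines of NumPy. All names, shapes, and values below are illustrative stand-ins, not psyctl's actual internals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residual-stream activations at one layer for each response set:
# rows are prompts, columns are hidden dimensions (toy sizes).
acts_positive = rng.normal(loc=0.5, scale=1.0, size=(100, 16))  # extroverted responses
acts_neutral = rng.normal(loc=0.0, scale=1.0, size=(100, 16))   # neutral responses

# Step 2: the steering vector is the difference of the two mean activations.
steering_vector = acts_positive.mean(axis=0) - acts_neutral.mean(axis=0)

# Step 3: at inference, add the (scaled) vector to the layer's activations.
def inject(hidden_states: np.ndarray, vector: np.ndarray, strength: float = 1.0) -> np.ndarray:
    return hidden_states + strength * vector

h = rng.normal(size=(1, 16))  # activations for a new prompt
h_steered = inject(h, steering_vector, strength=2.0)
```

No gradients, no weight updates — the model's weights are untouched, and only the activations passing through one layer are shifted.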

The CLI

psyctl automates the entire pipeline:

```shell
# Generate contrastive personality dataset
psyctl dataset.build.steer --personality Extroversion --output ./data

# Extract steering vector using mean difference method
psyctl extract.steering --dataset ./data --method mean_diff --output ./vec.safetensors

# Apply steering and generate text
psyctl steering --steering-vector ./vec.safetensors --input "Tell me about yourself"

# Validate with psychological inventory
psyctl benchmark inventory --steering-vector ./vec.safetensors
```

Extraction Methods

Two approaches are supported:

  • Mean Difference — a statistics-based method that computes the mean activation difference between positive and neutral responses. Fast and simple.
  • BiPO (Bidirectional Preference Optimization) — an optimization-based method using DPO loss to learn a more refined steering direction.
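To contrast the two, here is a toy NumPy-only sketch of the optimization-based idea: instead of taking a raw mean difference, learn a direction by gradient descent on a DPO-style logistic loss. The loss shape, hyperparameters, and data here are illustrative; psyctl's actual BiPO implementation differs in detail:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 16, 200

# Toy activation pairs: "chosen" (personality-positive) vs. "rejected" (neutral).
chosen = rng.normal(0.3, 1.0, size=(n, dim))
rejected = rng.normal(0.0, 1.0, size=(n, dim))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Learn a direction v that scores chosen activations above rejected ones
# under a DPO-style logistic loss: -log sigmoid(beta * (chosen - rejected) @ v).
v = np.zeros(dim)
beta, lr = 1.0, 0.1
diff = chosen - rejected  # (n, dim)
for _ in range(200):
    z = beta * diff @ v   # preference margins
    grad = -(beta * (1.0 - sigmoid(z))[:, None] * diff).mean(axis=0)
    v -= lr * grad

# Unlike the raw mean difference, v is shaped by the loss, not just statistics.
```

The trade-off is the usual one: mean difference is a single pass over the activations, while the optimization-based route costs gradient steps but can yield a cleaner direction.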

Evaluation

How do you measure whether an LLM's personality actually changed? With the same tools psychologists use on humans:

  • IPIP-NEO — measures the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism)
  • NPI-40 — measures narcissistic personality traits
  • MACH-IV — measures Machiavellianism

psyctl administers these inventories automatically and compares scores before and after steering.
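Conceptually, the before/after comparison looks like this. The items and scores below are made up for illustration; real inventories like IPIP-NEO have many more items plus reverse-keyed scoring:

```python
# Hypothetical 1-5 Likert responses the model gave to extraversion items,
# collected once before steering and once after.
items = [
    "I am the life of the party.",
    "I start conversations.",
    "I talk to a lot of different people at parties.",
]
scores_before = [2, 3, 2]  # made-up responses from the unsteered model
scores_after = [4, 5, 4]   # made-up responses after injecting the vector

def trait_score(scores: list[int]) -> float:
    """Average item score — the simplest possible trait scale."""
    return sum(scores) / len(scores)

shift = trait_score(scores_after) - trait_score(scores_before)
print(f"Extraversion shift: {shift:+.2f}")  # positive = more extroverted
```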

Compatibility

Works with Hugging Face Transformers models, including:

  • Llama 3.x
  • Gemma 3
  • Qwen 2.5
  • Mistral

Any decoder-only transformer with accessible intermediate layers should work.
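Mechanically, "accessible intermediate layers" means you can attach a forward hook and rewrite a layer's output. Here is a minimal PyTorch sketch using a stand-in module — in a real Transformers model the layer path (e.g. something like `model.model.layers[k]`) varies by architecture, and decoder layers return a tuple, so you would patch its first element:

```python
import torch
import torch.nn as nn

hidden = 16
layer = nn.Linear(hidden, hidden)  # stand-in for one decoder layer
steering_vector = torch.randn(hidden)
strength = 2.0

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    # For a real decoder layer, output is a tuple; patch output[0] instead.
    return output + strength * steering_vector

handle = layer.register_forward_hook(add_steering)

x = torch.randn(1, hidden)
steered = layer(x)  # output shifted by strength * steering_vector
handle.remove()     # detach the hook to stop steering
plain = layer(x)
```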

If you're interested in LLM interpretability or personality research, give psyctl a try. Contributions and feedback are welcome!
