What if you could make an LLM more extroverted — without any training?
That's the idea behind psyctl, a CLI tool I'm building at Modulabs Persona Lab. It lets you extract personality vectors from a model's internal activations and inject them during inference to shift behavior. No fine-tuning, no LoRA, no RLHF — just vector addition.
How It Works
The technique is called Contrastive Activation Addition (CAA). Here's the pipeline:
- Generate a contrastive dataset — pairs of responses that differ only in personality (e.g., extroverted vs. neutral)
- Extract a steering vector — compute the mean activation difference between the two response sets
- Inject the vector at inference — add the vector to a target layer's activations during the forward pass
- Validate with psychological tests — run standardized inventories to measure the personality shift
What's fascinating is that meaningful behavior changes emerge from simple vector arithmetic on activations — no gradient updates needed.
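The arithmetic really is that simple. Here's a toy sketch in plain Python — the dimensions and activation values are made up for illustration, and psyctl's real implementation operates on transformer hidden states:

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Hypothetical layer activations collected from the two response sets
positive_acts = [[1.0, 0.5, 0.0], [0.5, 0.5, 1.0]]   # e.g. extroverted
neutral_acts  = [[0.25, 0.5, 0.0], [0.25, 0.5, 0.5]] # e.g. neutral

# 1) Extract: mean activation difference between the two sets
steering_vec = [p - n for p, n in zip(mean_vector(positive_acts),
                                      mean_vector(neutral_acts))]

# 2) Inject: add the (scaled) vector to a hidden state at inference
alpha = 2.0
hidden = [0.0, 0.0, 0.0]
steered = [h + alpha * v for h, v in zip(hidden, steering_vec)]

print(steering_vec)  # [0.5, 0.0, 0.25]
print(steered)       # [1.0, 0.0, 0.5]
```

No gradients anywhere — extraction is a mean and a subtraction, injection is an addition.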
The CLI
psyctl automates the entire pipeline:
```shell
# Generate contrastive personality dataset
psyctl dataset.build.steer --personality Extroversion --output ./data

# Extract steering vector using mean difference method
psyctl extract.steering --dataset ./data --method mean_diff --output ./vec.safetensors

# Apply steering and generate text
psyctl steering --steering-vector ./vec.safetensors --input "Tell me about yourself"

# Validate with psychological inventory
psyctl benchmark inventory --steering-vector ./vec.safetensors
```
Extraction Methods
Two approaches are supported:
- Mean Difference — a statistics-based method that computes the mean activation difference between positive and neutral responses. Fast and simple.
- BiPO (Bidirectional Preference Optimization) — an optimization-based method using DPO loss to learn a more refined steering direction.
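To give a feel for the optimization-based route, here's a minimal sketch of a DPO-style preference loss on a single pair. This is the loss BiPO optimizes to learn the steering direction; the function name, beta value, and log-probabilities below are illustrative, not psyctl's API:

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: negative log-sigmoid of the
    beta-scaled log-probability margin relative to a reference model."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the steered model separates chosen from rejected more than the
# reference model does, the loss drops below -log(0.5) ~= 0.693
loss = dpo_pair_loss(-2.0, -5.0, -3.0, -4.0)
print(loss)
```

Minimizing this over many contrastive pairs, with the steering vector as the only trainable parameter, yields a direction that is typically sharper than the raw mean difference.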
Evaluation
How do you measure if an LLM's personality actually changed? With the same tools psychologists use on humans:
- IPIP-NEO — measures the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism)
- NPI-40 — measures narcissistic personality traits
- MACH-IV — measures Machiavellianism
psyctl administers these inventories automatically and compares scores before and after steering.
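The before/after comparison boils down to scoring the same inventory twice. A toy sketch with made-up 5-point Likert responses (real inventories have fixed item sets and reverse-keyed items, which this omits):

```python
def trait_score(answers):
    """Mean of 1-5 Likert responses, as a simple trait score."""
    return sum(answers) / len(answers)

# Hypothetical Extraversion item responses before and after steering
baseline = [2, 3, 2, 3]   # model without the steering vector
steered  = [4, 4, 5, 4]   # same items with the vector applied

shift = trait_score(steered) - trait_score(baseline)
print(f"Extraversion shift: {shift:+.2f}")  # +1.75
```

A consistent positive shift on the targeted trait — and little movement on the others — is the signal that the steering vector did what it claims.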
Compatibility
Works with HuggingFace Transformers models including:
- Llama 3.x
- Gemma 3
- Qwen 2.5
- Mistral
Any decoder-only transformer with accessible intermediate layers should work.
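One common way to get at those intermediate layers is a PyTorch forward hook, roughly like this. The module below is a toy stand-in for a transformer layer, and the attribute path mentioned in the comment is illustrative — psyctl's actual injection code may differ:

```python
import torch
import torch.nn as nn

def make_steering_hook(vec, alpha=1.0):
    """Forward hook that adds a scaled steering vector to a module's
    output activations (returning a value replaces the output)."""
    def hook(module, inputs, output):
        return output + alpha * vec
    return hook

# Toy stand-in for one transformer layer; with a real HF model you
# would hook something like model.model.layers[i] instead.
layer = nn.Identity()
vec = torch.tensor([0.5, 0.0, -0.5])
handle = layer.register_forward_hook(make_steering_hook(vec, alpha=2.0))

hidden = torch.zeros(3)
steered = layer(hidden)   # hidden + 2.0 * vec
handle.remove()           # detach the hook when done
print(steered)
```

Because hooks attach by module path, any decoder-only architecture whose layers are addressable this way can be steered without touching its weights — which is why the compatibility list above is more a matter of layer naming than of architecture.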
Key Papers
The implementation builds on these research papers:
- Steering Llama 2 via Contrastive Activation Addition (CAA)
- Personalized Steering via Bi-directional Preference Optimization (BiPO)
- Evaluating and Inducing Personality in Pre-trained Language Models (P2)
Links
- Documentation & Getting Started: modulabs-personalab.github.io/psyctl
- GitHub: modulabs-personalab/psyctl
- Original blog post: ho4040.github.io
If you're interested in LLM interpretability or personality research, give psyctl a try. Contributions and feedback are welcome!