What if you could make an LLM more extroverted — without any training?
That's the idea behind psyctl, a CLI tool I'm building at Modulabs Persona Lab. It lets you extract personality vectors from a model's internal activations and inject them during inference to shift behavior. No fine-tuning, no LoRA, no RLHF — just vector addition.
How It Works
The technique is called Contrastive Activation Addition (CAA). Here's the pipeline:
- Generate a contrastive dataset — pairs of responses that differ only in personality (e.g., extroverted vs. neutral)
- Extract a steering vector — compute the mean activation difference between the two response sets
- Inject the vector at inference — add the vector to a target layer's activations during the forward pass
- Validate with psychological tests — run standardized inventories to measure the personality shift
What's fascinating is that meaningful behavior changes emerge from simple vector arithmetic on activations — no gradient updates needed.
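The arithmetic really is that simple. Here's a toy sketch in plain Python — the dimensions and activation values are made up for illustration, and psyctl's real implementation operates on transformer hidden states:

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Hypothetical layer activations collected from the two response sets
positive_acts = [[1.0, 0.5, 0.0], [0.5, 0.5, 1.0]]   # e.g. extroverted
neutral_acts  = [[0.25, 0.5, 0.0], [0.25, 0.5, 0.5]] # e.g. neutral

# 1) Extract: mean activation difference between the two sets
steering_vec = [p - n for p, n in zip(mean_vector(positive_acts),
                                      mean_vector(neutral_acts))]

# 2) Inject: add the (scaled) vector to a hidden state at inference
alpha = 2.0
hidden = [0.0, 0.0, 0.0]
steered = [h + alpha * v for h, v in zip(hidden, steering_vec)]

print(steering_vec)  # [0.5, 0.0, 0.25]
print(steered)       # [1.0, 0.0, 0.5]
```

No gradients anywhere — extraction is a mean and a subtraction, injection is an addition.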
The CLI
psyctl automates the entire pipeline:
```shell
# Generate contrastive personality dataset
psyctl dataset.build.steer --personality Extroversion --output ./data

# Extract steering vector using mean difference method
psyctl extract.steering --dataset ./data --method mean_diff --output ./vec.safetensors

# Apply steering and generate text
psyctl steering --steering-vector ./vec.safetensors --input "Tell me about yourself"

# Validate with psychological inventory
psyctl benchmark inventory --steering-vector ./vec.safetensors
```
Extraction Methods
Two approaches are supported:
- Mean Difference — a statistics-based method that computes the mean activation difference between positive and neutral responses. Fast and simple.
- BiPO (Bidirectional Preference Optimization) — an optimization-based method using DPO loss to learn a more refined steering direction.
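To give a feel for the optimization-based route, here's a minimal sketch of a DPO-style preference loss on a single pair. This is the loss BiPO optimizes to learn the steering direction; the function name, beta value, and log-probabilities below are illustrative, not psyctl's API:

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: negative log-sigmoid of the
    beta-scaled log-probability margin relative to a reference model."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the steered model separates chosen from rejected more than the
# reference model does, the loss drops below -log(0.5) ~= 0.693
loss = dpo_pair_loss(-2.0, -5.0, -3.0, -4.0)
print(loss)
```

Minimizing this over many contrastive pairs, with the steering vector as the only trainable parameter, yields a direction that is typically sharper than the raw mean difference.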
Evaluation
How do you measure if an LLM's personality actually changed? With the same tools psychologists use on humans:
- IPIP-NEO — measures the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism)
- NPI-40 — measures narcissistic personality traits
- MACH-IV — measures Machiavellianism
psyctl administers these inventories automatically and compares scores before and after steering.
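The before/after comparison boils down to scoring the same inventory twice. A toy sketch with made-up 5-point Likert responses (real inventories have fixed item sets and reverse-keyed items, which this omits):

```python
def trait_score(answers):
    """Mean of 1-5 Likert responses, as a simple trait score."""
    return sum(answers) / len(answers)

# Hypothetical Extraversion item responses before and after steering
baseline = [2, 3, 2, 3]   # model without the steering vector
steered  = [4, 4, 5, 4]   # same items with the vector applied

shift = trait_score(steered) - trait_score(baseline)
print(f"Extraversion shift: {shift:+.2f}")  # +1.75
```

A consistent positive shift on the targeted trait — and little movement on the others — is the signal that the steering vector did what it claims.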
Compatibility
Works with HuggingFace Transformers models including:
- Llama 3.x
- Gemma 3
- Qwen 2.5
- Mistral
Any decoder-only transformer with accessible intermediate layers should work.
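One common way to get at those intermediate layers is a PyTorch forward hook, roughly like this. The module below is a toy stand-in for a transformer layer, and the attribute path mentioned in the comment is illustrative — psyctl's actual injection code may differ:

```python
import torch
import torch.nn as nn

def make_steering_hook(vec, alpha=1.0):
    """Forward hook that adds a scaled steering vector to a module's
    output activations (returning a value replaces the output)."""
    def hook(module, inputs, output):
        return output + alpha * vec
    return hook

# Toy stand-in for one transformer layer; with a real HF model you
# would hook something like model.model.layers[i] instead.
layer = nn.Identity()
vec = torch.tensor([0.5, 0.0, -0.5])
handle = layer.register_forward_hook(make_steering_hook(vec, alpha=2.0))

hidden = torch.zeros(3)
steered = layer(hidden)   # hidden + 2.0 * vec
handle.remove()           # detach the hook when done
print(steered)
```

Because hooks attach by module path, any decoder-only architecture whose layers are addressable this way can be steered without touching its weights — which is why the compatibility list above is more a matter of layer naming than of architecture.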
Key Papers
The implementation builds on these research papers:
- Steering Llama 2 via Contrastive Activation Addition (CAA)
- Personalized Steering via Bi-directional Preference Optimization (BiPO)
- Evaluating and Inducing Personality in Pre-trained Language Models (P2)
Links
- Documentation & Getting Started: modulabs-personalab.github.io/psyctl
- GitHub: modulabs-personalab/psyctl
- Original blog post: ho4040.github.io
If you're interested in LLM interpretability or personality research, give psyctl a try. Contributions and feedback are welcome!