Steering Vectors: Changing What an LLM Wants Without Touching Its Weights

LLMs encode concepts as geometric directions in activation space. You can find those directions, add them at inference time, and shift model behavior - without touching a single weight.

The technique is called activation steering, the directions you add are called steering vectors, and it works.

The core idea

A language model's residual stream holds one high-dimensional vector per token position. Every layer reads from it and writes back to it, so information accumulates as the vector moves through the network. The linear representation hypothesis says that concepts like "pessimism," "formality," or "Python expertise" correspond to specific directions in this space.

If that's true, you should be able to:

Find the direction that encodes a concept (using contrastive examples)
Add a scaled version of that direction at a specific layer
Observe the model behaving more "conceptfully"

And you can.
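
To make the three steps concrete before touching a real model, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the "activations" are Gaussian noise with a planted concept direction, and the helper names (`toy_activations`, `mean_vector`, `unit`) are hypothetical. It only shows that a difference of means can recover a planted direction and that adding the result shifts an activation along it.

```python
import math
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def toy_activations(n, dim, direction, offset, rng):
    """Gaussian noise plus `offset` times a planted concept direction."""
    return [
        [rng.gauss(0.0, 1.0) + offset * d for d in direction]
        for _ in range(n)
    ]

rng = random.Random(0)
dim = 32
planted = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])

# Step 1: find the direction from contrastive examples (difference of means).
positive = toy_activations(200, dim, planted, +2.0, rng)
negative = toy_activations(200, dim, planted, -2.0, rng)
diff = [p - n for p, n in zip(mean_vector(positive), mean_vector(negative))]
steering = unit(diff)

# Step 2: add a scaled copy of the direction to a fresh "neutral" activation.
alpha = 3.0
neutral = toy_activations(1, dim, planted, 0.0, rng)[0]
steered = [x + alpha * s for x, s in zip(neutral, steering)]

# Step 3: observe the shift along the concept direction.
recovered_cosine = dot(steering, planted)
shift = dot(steered, planted) - dot(neutral, planted)
print(f"cosine(recovered, planted) = {recovered_cosine:.3f}")
print(f"shift along planted direction = {shift:.3f}")
```

Real activations are messier than this toy setup, so treat it as intuition for the geometry rather than a prediction of how any particular model behaves.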

Extracting a steering vector

The simplest method: take prompt pairs that differ only in the target concept, run them through the model, and subtract the mean last-token activation of the negative prompts from that of the positive prompts at a chosen layer.

import torch
from transformer_lens import HookedTransformer

def extract_steering_vector(model, positive_prompts, negative_prompts, layer=16):
    """Extract a unit direction that points from the negative to the positive concept."""
    # The default layer=16 assumes the model has more than 16 blocks.
    hook_name = f"blocks.{layer}.hook_resid_post"

    def last_token_acts(prompts):
        acts = []
        for prompt in prompts:
            # Cache only the layer we need; [0, -1] is batch 0, final token position.
            _, cache = model.run_with_cache(prompt, names_filter=hook_name)
            acts.append(cache[hook_name][0, -1])
        return torch.stack(acts)

    with torch.no_grad():
        # Difference of the two class means at the chosen layer.
        vector = last_token_acts(positive_prompts).mean(0) - last_token_acts(negative_prompts).mean(0)
    # Normalize so that alpha alone sets the steering strength.
    return vector / vector.norm()
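
The extraction relies on prompt pairs that differ only in the target concept. One way to get such pairs is to fill a single template twice per subject, sketched below. The helper `build_contrastive_prompts`, the template, and the word choices are all hypothetical, not taken from any paper. The concept word is placed last because the extraction function reads the activation at the final token position.

```python
def build_contrastive_prompts(template, subjects, positive_word, negative_word):
    """Fill one template twice per subject so each pair differs in a single word."""
    positive, negative = [], []
    for subject in subjects:
        positive.append(template.format(subject=subject, concept=positive_word))
        negative.append(template.format(subject=subject, concept=negative_word))
    return positive, negative

# Here "positive" means "has the target concept" (pessimism), not "cheerful".
positive_prompts, negative_prompts = build_contrastive_prompts(
    template="The outlook for {subject} is {concept}",
    subjects=["the weather", "the harvest", "the economy"],
    positive_word="bleak",
    negative_word="bright",
)

for pos, neg in zip(positive_prompts, negative_prompts):
    print(pos, "|", neg)
```

The two lists can then be passed as `positive_prompts` and `negative_prompts` to `extract_steering_vector`. In practice you would likely want many more subjects and several templates so the vector does not pick up on one phrasing.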

Applying it at inference

Once you have the vector, inject it via a hook:

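def make_hook(steering_vector, alpha=20.0):
    def hook_fn(value, hook):
        # value has shape [batch, pos, d_model]; the vector broadcasts over batch and position.
        return value + alpha * steering_vector.to(device=value.device, dtype=value.dtype)
    return hook_fn

def generate_steered(model, prompt, steering_vector, layer=16, alpha=20.0):
    hook = make_hook(steering_vector, alpha)
    hook_name = f"blocks.{layer}.hook_resid_post"
    # Tokenize first: generate() returns its output in the same format as its
    # input, so a string prompt would come back as a string, not token ids.
    tokens = model.to_tokens(prompt)
    with model.hooks(fwd_hooks=[(hook_name, hook)]):
        output = model.generate(tokens, max_new_tokens=100)
    return model.to_string(output[0])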
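
The default `alpha=20.0` is a starting point, not a universal constant. A reasonable strength depends on the model and layer, so it helps to compare a few values side by side, including 0 as an unsteered baseline and a negative value to push in the opposite direction. The sketch below assumes a hypothetical helper, `sweep_alpha`, that takes any `generate_fn(prompt, alpha)` callable. A stand-in generator is used so the example runs without loading a model.

```python
def sweep_alpha(generate_fn, prompt, alphas):
    """Run `generate_fn(prompt, alpha)` for each alpha and collect the outputs."""
    results = {}
    for alpha in alphas:
        results[alpha] = generate_fn(prompt, alpha)
    return results

# Stand-in generator so the sketch runs without a model. With a real model you
# might pass: lambda p, a: generate_steered(model, p, vec, layer=16, alpha=a)
def fake_generate(prompt, alpha):
    return f"{prompt} [alpha={alpha}]"

outputs = sweep_alpha(fake_generate, "The weekend will be", [-20.0, 0.0, 20.0])
for alpha, text in outputs.items():
    print(alpha, "->", text)
```

When reading the real outputs, look for the range where the concept clearly shows up but the text is still fluent. Very large values tend to push activations far outside what the model has seen, and the generations degrade.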

What you can steer

The same technique works for a surprising range of properties:

Pessimism/optimism - shifts narrative tone measurably
Python enthusiasm - makes the model reach for Python examples
Formality - shifts register from casual to professional
Sycophancy - can be used to reduce agreement-seeking behavior

The key finding from Turner et al. (2023) and Zou et al. (2023): these vectors generalize. A pessimism vector extracted from weather sentences transfers to unrelated topics.
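
One rough way to check transfer on your own model, offered as an assumption rather than the method used in either paper, is to extract the same concept from two disjoint topic sets and compare the resulting vectors by cosine similarity. The vectors below are made-up 4-dimensional stand-ins so the sketch runs on its own. In practice they would come from two calls to `extract_steering_vector`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up stand-ins, NOT real model outputs: two "pessimism" vectors from
# different topics, plus a vector for an unrelated concept.
weather_vec = [0.9, 0.1, -0.3, 0.2]
finance_vec = [0.8, 0.2, -0.4, 0.1]
unrelated_vec = [-0.1, 0.9, 0.2, -0.3]

same_concept = cosine_similarity(weather_vec, finance_vec)
different_concept = cosine_similarity(weather_vec, unrelated_vec)
print(f"same concept, different topics: {same_concept:.3f}")
print(f"different concepts:             {different_concept:.3f}")
```

High similarity between topic-specific vectors is consistent with a shared direction, but the more direct test is behavioral: steer with the vector on prompts from a topic it never saw and read the output.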

Checking if a concept is linearly encoded

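Before steering, you can check whether the concept is linearly decodable from the activations by training a linear probe. High probe accuracy on held-out prompts is evidence for a linear encoding, though it does not prove the model actually uses that direction: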

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_concept(model, positive_prompts, negative_prompts, layer=16):
    hook_name = f"blocks.{layer}.hook_resid_post"
    X, y = [], []
    with torch.no_grad():
        for label, prompts in ((1, positive_prompts), (0, negative_prompts)):
            for prompt in prompts:
                _, cache = model.run_with_cache(prompt)
                # Move to CPU and float32 first; .numpy() fails on GPU tensors.
                X.append(cache[hook_name][0, -1].float().cpu().numpy())
                y.append(label)

    # Score on held-out folds. Training accuracy is close to 1.0 whenever d_model
    # exceeds the number of prompts, so it says little on its own.
    # cv=5 needs at least 5 prompts per class.
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, X, y, cv=5).mean()  # held-out accuracy > 0.9 suggests a linear encoding
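
A probe score is easier to interpret next to a control. One common sanity check, sketched here under the assumption that you have some `score_fn(X, y)` that fits and scores a probe, is to repeat the scoring with shuffled labels. If the shuffled score is close to the real one, the probe is fitting noise rather than the concept. The helper `shuffled_label_baseline` and the toy scorer are hypothetical stand-ins so the example runs without a model or scikit-learn.

```python
import random

def shuffled_label_baseline(score_fn, X, y, n_shuffles=20, seed=0):
    """Average score of `score_fn(X, shuffled_y)` over several label shuffles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(score_fn(X, shuffled))
    return sum(scores) / len(scores)

def sign_rule_accuracy(X, y):
    """Toy scorer: predict 1 when the single feature is positive."""
    correct = sum(int(x > 0) == label for x, label in zip(X, y))
    return correct / len(y)

# One-dimensional toy data where the sign of the feature matches the label.
X = [-3.0, -2.0, -1.5, -0.5, 0.5, 1.5, 2.0, 3.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]

real = sign_rule_accuracy(X, y)
baseline = shuffled_label_baseline(sign_rule_accuracy, X, y)
print(f"real labels:     {real:.2f}")
print(f"shuffled labels: {baseline:.2f}")
```

With a real probe, `score_fn` would wrap the cross-validated logistic regression, and the comparison of interest is how far the real score sits above the shuffled one.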