Machina Tools

Originally published at machina.chat

Steering Vectors: Changing What an LLM Wants Without Touching Its Weights

LLMs encode concepts as geometric directions in activation space. You can find those directions, add them at inference time, and shift model behavior - without touching a single weight.

These directions are called steering vectors, and the technique works.

The core idea

A language model's residual stream holds one high-dimensional vector per token position, and each layer adds information to it. The linear representation hypothesis says that concepts like "pessimism," "formality," or "Python expertise" correspond to specific directions in this space.

If that's true, you should be able to:

  1. Find the direction that encodes a concept (using contrastive examples)
  2. Add a scaled version of that direction at a specific layer
  3. Observe the model behaving more "conceptfully"

And you can.
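The three steps above can be sketched on toy vectors before bringing in a real model. This is a minimal illustration in plain Python, not the post's method: the 3-dimensional "activations" are made up, and the helper names (`mean_vector`, `unit`, `steer`) are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, alpha):
    """Add alpha times the direction to an activation vector."""
    return [h + alpha * d for h, d in zip(activation, direction)]

# Made-up activations: the "concept" varies along the first axis only.
positive = [[2.0, 0.5, -1.0], [3.0, -0.5, 1.0]]
negative = [[-2.0, 0.5, -1.0], [-3.0, -0.5, 1.0]]

# Step 1: the direction is the normalized difference of the class means.
mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
direction = unit([p - n for p, n in zip(mu_pos, mu_neg)])

# Step 2: add a scaled copy of the direction to some activation.
h = [0.0, 1.0, 1.0]
steered = steer(h, direction, alpha=4.0)

# Step 3: the projection onto the direction moves by exactly alpha.
shift = dot(steered, direction) - dot(h, direction)
```

Because the direction has unit length, `shift` equals `alpha`, which is why the scale factor can be read as "how far to push along the concept axis."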

Extracting a steering vector

The simplest method: take prompts that differ only in the target concept, run them through the model, and at a chosen layer subtract the mean last-token activation of the negative set from that of the positive set.

import torch
from transformer_lens import HookedTransformer

def extract_steering_vector(model, positive_prompts, negative_prompts, layer=16):
    """Extract the unit direction that points from the negative to the positive concept."""
    # layer must be below model.cfg.n_layers; the default of 16 assumes a model deeper than that.
    pos_acts, neg_acts = [], []

    with torch.no_grad():
        for prompt in positive_prompts:
            _, cache = model.run_with_cache(prompt)
            # [0, -1] selects batch item 0 and the final token position.
            pos_acts.append(cache[f"blocks.{layer}.hook_resid_post"][0, -1])

        for prompt in negative_prompts:
            _, cache = model.run_with_cache(prompt)
            neg_acts.append(cache[f"blocks.{layer}.hook_resid_post"][0, -1])

    # Difference of class means, normalized so that alpha alone sets the steering strength.
    vector = torch.stack(pos_acts).mean(0) - torch.stack(neg_acts).mean(0)
    return vector / vector.norm()
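In symbols, the function above computes a normalized difference of means. The notation is introduced here for illustration and is not from the original post: P and N are the positive and negative prompt sets, and h_ℓ(x) is the residual stream after block ℓ at the final token of prompt x.

```latex
\[
v \;=\; \frac{1}{|P|}\sum_{p \in P} h_\ell(p) \;-\; \frac{1}{|N|}\sum_{n \in N} h_\ell(n),
\qquad
\hat{v} \;=\; \frac{v}{\lVert v \rVert_2}
\]
```

When the two sets are paired and of equal size, this is the same as averaging the per-pair differences.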

Applying it at inference

Once you have the vector, inject it via a hook:

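def make_hook(steering_vector, alpha=20.0):
    def hook_fn(value, hook):
        # value has shape [batch, pos, d_model]; the vector broadcasts over batch and position.
        return value + alpha * steering_vector.to(value.device, value.dtype)
    return hook_fn

def generate_steered(model, prompt, steering_vector, layer=16, alpha=20.0):
    hook = make_hook(steering_vector, alpha)
    hook_name = f"blocks.{layer}.hook_resid_post"
    # Tokenize first: generate() returns a string for string input, but a token tensor for token input.
    tokens = model.to_tokens(prompt)
    with model.hooks(fwd_hooks=[(hook_name, hook)]):
        output = model.generate(tokens, max_new_tokens=100)
    return model.to_string(output[0])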
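Since the steering vector has unit length, `alpha` is measured in the same units as the activations, so a fixed value such as 20 can be too weak for one model and too strong for another. One heuristic, which is an assumption of this sketch rather than something the post prescribes, is to set `alpha` to a fraction of the average activation norm at the steering layer. The helper name `relative_alpha` and the sample numbers are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def relative_alpha(activations, fraction=0.5):
    """Return alpha as `fraction` of the mean L2 norm of the given activations."""
    mean_norm = sum(l2_norm(a) for a in activations) / len(activations)
    return fraction * mean_norm

# Made-up 2-d activations with norms 5 and 10, so the mean norm is 7.5.
sample_acts = [[3.0, 4.0], [6.0, 8.0]]
alpha = relative_alpha(sample_acts, fraction=0.2)
```

In practice the activations would be the last-token residuals collected during extraction, and the fraction would be tuned by reading the steered generations.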

What you can steer

The same technique works for a surprising range of properties:

  • Pessimism/optimism - shifts narrative tone measurably
  • Python enthusiasm - makes the model reach for Python examples
  • Formality - shifts register from casual to professional
  • Sycophancy - can be used to reduce agreement-seeking behavior

The key finding from Turner et al. (2023) and Zou et al. (2023): these vectors generalize. A pessimism vector extracted from weather sentences transfers to unrelated topics.

Checking if a concept is linearly encoded

Before steering, you can check whether the concept is linearly decodable at your chosen layer by training a linear probe:

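import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_concept(model, positive_prompts, negative_prompts, layer=16):
    X, y = [], []
    with torch.no_grad():
        for prompt in positive_prompts:
            _, cache = model.run_with_cache(prompt)
            # Move to CPU and float32 before converting, so GPU and half-precision models work too.
            X.append(cache[f"blocks.{layer}.hook_resid_post"][0, -1].float().cpu().numpy())
            y.append(1)
        for prompt in negative_prompts:
            _, cache = model.run_with_cache(prompt)
            X.append(cache[f"blocks.{layer}.hook_resid_post"][0, -1].float().cpu().numpy())
            y.append(0)

    # Score on held-out folds. With few prompts and thousands of dimensions, training
    # accuracy is near 1.0 for almost any labeling. cv=5 needs at least 5 prompts per class.
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, np.stack(X), np.array(y), cv=5)
    return scores.mean()  # held-out accuracy near 1.0 suggests the concept is linearly decodable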

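High probe accuracy, ideally measured on prompts the probe was not trained on, means the concept is linearly separable in activation space at that layer. That makes steering more likely to work, though a direction a probe can read is not guaranteed to change behavior when you add it.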
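A related check uses the steering direction itself as the classifier: project held-out activations onto the difference-of-means direction and count how many fall on the correct side of the midpoint. The sketch below swaps in this simpler difference-of-means classifier in place of the logistic regression probe, runs on made-up 2-dimensional data, and uses hypothetical helper names (`fit_mean_difference`, `heldout_accuracy`).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_mean_difference(pos_train, neg_train):
    """Return the class-mean difference and the threshold at the midpoint of the means."""
    mu_pos, mu_neg = mean_vector(pos_train), mean_vector(neg_train)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    return direction, dot(midpoint, direction)

def heldout_accuracy(direction, threshold, pos_test, neg_test):
    """Fraction of held-out points whose projection lands on the correct side."""
    correct = sum(dot(x, direction) > threshold for x in pos_test)
    correct += sum(dot(x, direction) <= threshold for x in neg_test)
    return correct / (len(pos_test) + len(neg_test))

# Made-up activations; the second positive test point sits on the wrong side.
pos_train = [[1.0, 0.2], [1.2, -0.1]]
neg_train = [[-1.0, 0.1], [-0.8, -0.2]]
pos_test = [[0.9, 0.3], [-0.1, 0.0]]
neg_test = [[-1.1, 0.0], [-0.7, 0.4]]

direction, threshold = fit_mean_difference(pos_train, neg_train)
accuracy = heldout_accuracy(direction, threshold, pos_test, neg_test)
```

If this accuracy is close to chance on held-out prompts, the extracted vector probably does not separate the concept, and steering with it is unlikely to be reliable.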


Full writeup with more experiments and analysis on the Machina blog: Steering Vectors: Changing What an LLM Wants Without Touching Its Weights
