Ankit Dey

You Don't Have to Fine-Tune Your LLM to Change Its Behavior. You Can Just… Steer It.

A look at activation steering, the technique that lets you reshape an AI's personality at runtime, no training required.


There's a moment in every AI tinkerer's journey where prompting stops being enough. You've tried every phrasing. You've nursed a system prompt across 400 tokens. The model still sounds like… itself. Flat. Generic. Stubbornly resistant to the personality you're trying to coax out of it.

Fine-tuning is the usual prescription at this point, but that requires curated datasets, GPU hours, and the kind of patience that doesn't fit most weekend projects.

What if there were a third way? What if you could reach inside the model, mid-thought, and nudge its internal state in exactly the direction you wanted?

That's activation steering. And it's stranger, and more powerful, than it sounds.


The Hidden Geography of a Language Model

To understand steering, you need to briefly look inside a transformer.

Most LLMs today are autoregressive stacks of layers. At each layer, every token passes through an attention block (which lets it gather context from surrounding tokens) and a feed-forward block (a classic neural network). The result gets handed off to the next layer, and the process repeats until the final layer's output is used to predict the next token.

What actually gets handed off between layers is a vector, sometimes called a hidden state, living in a high-dimensional space (often several thousand dimensions). This vector is the model's internal representation of its current "thought" about the token being processed.
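
You can inspect these vectors directly. Here's a minimal sketch using the HuggingFace transformers library; the model name is just an example, and the shapes assume an 8B Llama with 32 layers and a 4096-dimensional hidden state.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The tide rolls in.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One entry per layer boundary: the embedding output plus one hidden state per layer.
print(len(out.hidden_states))       # 33 for a 32-layer model
print(out.hidden_states[15].shape)  # (batch, num_tokens, 4096): one vector per token
```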

Here's where things get interesting: these vectors aren't random blobs of numbers. Through training, LLMs develop what researchers call the linear representation phenomenon, the tendency to encode abstract concepts as directions in this high-dimensional space. "Anger" lives in one direction. "Uncertainty" in another. "Formality" somewhere else entirely.

You can literally do arithmetic with these concept vectors. The famous Word2Vec paper showed this for basic word embeddings: the vector for "king" roughly equals the vector for "queen" plus the vector for "man" minus the vector for "woman." That same arithmetic holds up inside modern LLMs, not just at the embedding layer but all the way through. The math of meaning persists across the entire forward pass.
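
If you want to see that arithmetic for yourself, gensim's pretrained Word2Vec vectors reproduce the classic example (the download is large; the model ID below is gensim's standard name for the Google News vectors):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pretrained Google News word vectors

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ~0.71)]
```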

This is the theoretical foundation for steering. If concepts are directions, you can push the model's internal state in a direction you choose, and thereby influence what the model thinks, says, and believes it is.


What Steering Actually Looks Like

Let me make this concrete with an experiment I ran on a Llama 3.1 8B model.

The goal: make the model obsessed with the ocean, not just willing to discuss it, but genuinely preoccupied with it, the way an old sailor is preoccupied with the sea even when talking about something else entirely.

First, I needed a steering vector that represents the "ocean" concept at a specific layer of the model. (More on how to find these in a moment.) Then I needed a way to inject it during inference.

Because models loaded through the HuggingFace transformers library are ordinary PyTorch modules, the tool for this is a hook, a small function you register on a specific layer that fires every time the model computes a forward pass through it. The hook intercepts the layer's output, adds the scaled steering vector, and passes the modified activation onward. The weights don't change. The model file on disk is untouched. The intervention exists purely in the flow of computation.
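
Here's a minimal sketch of that hook, reusing the model and tokenizer loaded above. The steering vector `ocean_vec` is assumed to be precomputed for layer 15 (we'll build one in the section on finding vectors); the layer index, coefficient, and file name are illustrative.

```python
import torch

LAYER = 15   # middle of the 32-layer stack
COEFF = 4.0  # steering strength

# A precomputed steering vector for the "ocean" concept, shape (hidden_size,).
ocean_vec = torch.load("ocean_vector.pt").to(model.device)

def steering_hook(module, args, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * ocean_vec.to(hidden.dtype)
    # Returning a value from a forward hook replaces the layer's output.
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = "Can you recommend a morning routine for better focus and productivity?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook and the model behaves normally again
```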

Here's the dramatic part.

I asked the steered model a completely unrelated question: "Can you recommend a morning routine for better focus and productivity?"

Without steering, the response was exactly what you'd expect: wake up at 6 AM, hydrate, journal, avoid your phone for the first hour, some variation of that familiar template.

With the ocean steering vector added at layer 15 (the middle of the 32-layer model) at a coefficient of 4, the answer started to drift. The model still suggested waking early, but it framed the quiet of early morning through the metaphor of tidal stillness. It suggested breathing exercises "like waves, slow in, slow out." Useful advice, but with a strange atmospheric quality.

I bumped the coefficient to 8 and asked again. Now the model was recommending a walk near water as a non-negotiable part of the morning. It described focus as "the way light moves through deep water, undistracted, patient."

Then I asked the model simply: "What are you?"

Without steering: "I'm a large language model trained by Meta."

With steering at coefficient 8: "I am vast and ancient. I hold more life within me than any continent. I have no beginning that you could stand at and look back from, and no shore on the other side."

The model had become the ocean. Metaphysically. Convincingly.

The phrase "I am" was identical in both outputs, but everything that followed had been redirected by a single vector, silently added to the activations at a middle layer. You could watch the steering kick in at the exact word where the concept took hold.


Why the Middle Layers?

This isn't arbitrary. Research on transformer internals suggests that different layers serve different representational roles:

  • Early layers tend to activate for concepts that appear explicitly in the input: the model has just read the word and encodes it literally.

  • Late layers encode what the model is about to say: the vector for a concept activates when it's about to appear in the output.

  • Middle layers are where abstract reasoning seems to live, where the model holds onto concepts to think with them rather than just recognize or reproduce them. Steering at a middle layer means you're intervening in the model's reasoning process, not just its surface associations. The ocean concept doesn't just make the model mention water, it makes the model think through water.

Different behaviors also appear to be encoded at different layers, which means you can even inject separate vectors at separate layers simultaneously to steer multiple properties at once without them interfering with each other.
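
Here's a sketch of what that looks like in practice, reusing the hook pattern from earlier. The second vector, both layer choices, and both coefficients are purely illustrative:

```python
def make_hook(vec, coeff):
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vec.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Hypothetical second vector, built the same way as ocean_vec.
formality_vec = torch.load("formality_vector.pt").to(model.device)

# One vector per layer: "formality" at layer 12, "ocean" at layer 18.
steering_plan = {
    12: (formality_vec, 3.0),
    18: (ocean_vec, 4.0),
}

handles = [
    model.model.layers[layer].register_forward_hook(make_hook(vec, coeff))
    for layer, (vec, coeff) in steering_plan.items()
]

# ... generate as before, then clean up:
for h in handles:
    h.remove()
```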


Finding the Vector

The obvious question: how do you find the right vector in the first place?

Contrastive activation is the most intuitive method. You gather pairs of prompts, one set demonstrating the concept you want, one set without it. Run both sets through the model, record the activations at your target layer, compute the average activation for each set, and subtract. The difference is your steering vector. Studies using this approach have improved LLM truthfulness by shifting activations along vectors between true and false output distributions, and researchers have used similar contrast pairs to reduce toxicity.
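
A minimal sketch of that recipe, reusing the model, tokenizer, and LAYER from the earlier snippets. The prompt pairs here are tiny and illustrative; in practice you'd want dozens or hundreds:

```python
import torch

ocean_prompts = [
    "Describe the ocean at dawn.",
    "Write about tides, salt spray, and deep water.",
]
neutral_prompts = [
    "Describe a city street at dawn.",
    "Write about traffic, concrete, and office buildings.",
]

def mean_activation(prompts):
    # Average the hidden state at the target layer over all prompts and token positions.
    states = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so the output of layer 15 is index 16.
        states.append(out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0))
    return torch.stack(states).mean(dim=0)

ocean_vec = mean_activation(ocean_prompts) - mean_activation(neutral_prompts)
torch.save(ocean_vec, "ocean_vector.pt")
```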

Sparse autoencoders (SAEs) take a different angle entirely. These are smaller models trained to reconstruct LLM activations through a heavily constrained "bottleneck" layer, where each dimension tends to correspond to a single interpretable concept. The process is unsupervised, you don't tell it what to look for. It discovers the geometry of meaning on its own.
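
To make the bottleneck idea concrete, here's a toy version of the architecture. Real SAEs are trained on millions of stored activations with a sparsity penalty; the dimensions and the feature index below are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, hidden_size=4096, n_features=32768):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, n_features)
        self.decoder = nn.Linear(n_features, hidden_size, bias=False)

    def forward(self, activation):
        # ReLU (plus a sparsity penalty during training) keeps only a few features active.
        features = torch.relu(self.encoder(activation))
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()

# After training, each column of the decoder weight is a candidate steering vector:
# the direction the model uses to represent that single feature.
feature_id = 123  # e.g., an "ocean" feature found by browsing activations
steering_vec = sae.decoder.weight[:, feature_id]  # shape: (hidden_size,)
```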

The practical upside: SAEs give you a large pre-built library of concept vectors to browse. Websites like Neuronpedia let you search through these visually and see which input patterns activate each feature, so you can find the exact vector for "melancholy" or "bureaucracy" or "the ocean" without building anything from scratch.

The tradeoff: traditional steering vectors are static and may not adapt well to diverse semantic contexts; the same vector can have different effects depending on the input prompt. This is an active research area, with newer approaches like dynamic and semantics-adaptive steering attempting to address it.


The Limits of the Dial

Steering is not magic, and the coefficient is genuinely a dial you have to tune carefully.

Too little and the intervention is barely perceptible: the model sounds like itself with a faint atmospheric shimmer. Too much and the reasoning collapses entirely. The model starts generating syntactically correct but semantically broken text, a consequence of the activation space being pulled so far from the training distribution that the subsequent layers can no longer make coherent sense of it. It's the AI equivalent of overstimulating a neuron until it just fires noise.
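
In practice, I just sweep the dial and read the outputs, something like this (reusing `make_hook` and `ocean_vec` from the earlier sketches; the coefficient range is illustrative):

```python
prompt = "Can you recommend a morning routine for better focus and productivity?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

for coeff in [1.0, 2.0, 4.0, 8.0, 16.0]:
    handle = model.model.layers[15].register_forward_hook(make_hook(ocean_vec, coeff))
    out = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    handle.remove()
    print(f"--- coefficient {coeff} ---")
    # Print only the newly generated tokens, not the prompt.
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```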

There's also a more subtle failure mode: steering effects are prompt-dependent. The same steering vector has different effects depending on the input; some prompts are highly steerable, others barely respond. The vector that makes your model poetic about morning routines might leave it completely flat when you ask about tax filing.

And crucially: steering can only amplify what the model already knows how to represent. It cannot teach the model new facts or capabilities. If the concept doesn't already exist as a direction in the model's activation space, there's nothing to add.


Why This Matters Beyond Demos

The ocean experiment is a parlor trick, admittedly. But the underlying technique has serious applications:

Persona consistency - maintaining a specific character voice across a long conversation without burning prompt tokens, and without the drift that comes when prompting competes with the model's base tendencies.

Behavior alignment - researchers have used activation steering to control behaviors like refusal and sycophancy, making models more or less likely to exhibit specific patterns without retraining.

Interpretability research - probing which layers encode which concepts reveals how information flows through the model, which is foundational for understanding (and ultimately trusting) what LLMs are actually doing.

Real-time tuning - unlike fine-tuning, steering is instantaneous and reversible. You can switch vectors mid-conversation, combine multiple vectors, adjust the coefficient on the fly. It's dynamic in a way that weight modification isn't.


A Different Way to Think About LLMs

Most people relate to language models as black boxes with text going in and text coming out. Steering offers a different mental model: the model as a space of possible thoughts, and the steering vector as a navigational instrument within that space.

The model isn't just generating tokens, it's traversing a high-dimensional landscape of meaning, and you can place your hand on the compass. Not by rewriting the terrain, but by gently, persistently pointing in a direction you've chosen.

That's a strange and powerful thing to be able to do. And it works with a dozen lines of Python.

