Mechanistic interpretability: what we're actually finding inside transformers

#machinelearning #ai #deeplearning #neuralnetworks

For most of deep learning's history, the prevailing position was: we can't really know what's happening inside. The network is a black box. We can measure its inputs and outputs, and we can carefully instrument it to extract intermediate activations, but there's no real systematic way to understand what the network is computing.

That view held for a long time. Over the last several years, though, a research area has emerged that challenges it: mechanistic interpretability. The field is growing rapidly, and there's a clear sense that we're finding substantial, interpretable structure inside neural networks.

I want to walk through what this field is actually doing, what's been found, and why it matters.

What the field is actually trying to do

The simplest way to think about mechanistic interpretability: we're trying to reverse-engineer algorithms.

A neural network computes some function. In one sense, we understand this function: we can measure it empirically. If we feed in inputs and measure outputs, we know the input-output relationship. But we want more than this. We want to understand the algorithm - the specific computational steps - that the network is using to produce those outputs. We want to locate specific sub-circuits that compute meaningful intermediate quantities. We want to decompose the network's behavior into human-interpretable pieces.

This is a different situation from working with models that are interpretable by construction. If I have a decision tree, I can print it out and read it. If I have a linear model, I can look at the weights. Those models give you full transparency into the decision-making process for free. What mechanistic interpretability is trying to do is extract a similar kind of transparency from neural networks, where the underlying structure is much messier.

What's been found

Several concrete things have been discovered over the last few years:

Induction heads. Transformers contain a relatively simple but specific sub-circuit built around induction heads. These are attention heads that implement a recognizable algorithm: "find an earlier occurrence of the current token and copy what came after it." They were first described in "A Mathematical Framework for Transformer Circuits" (Elhage et al., 2021) and studied in depth in "In-context Learning and Induction Heads" (Olsson et al., 2022). Researchers were able to locate these heads, measure their behavior, and verify that they really do implement this algorithm.
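To make the rule concrete, here is a minimal sketch of the induction algorithm in plain Python. This is only the hard-coded rule, not how a transformer computes it: a real induction head does this softly through attention weights, typically in composition with a head in an earlier layer. The function name `induction_prediction` is my own, chosen for illustration.

```python
from typing import Optional, Sequence


def induction_prediction(tokens: Sequence[str]) -> Optional[str]:
    """Predict the next token with the induction rule.

    Find the most recent earlier occurrence of the current (last) token
    and return the token that followed it. Returns None when the current
    token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over every position before the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# "A" was last followed by "B", so the rule predicts "B".
print(induction_prediction(["A", "B", "C", "A"]))
```

The useful thing about stating the rule this plainly is that it gives you something to test a candidate head against: feed the model repeated random sequences and check whether the head attends to the position right after the earlier match.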

Curve detectors in vision models. In vision models, most thoroughly documented in convolutional networks such as InceptionV1, researchers have found that individual neurons and small groups of neurons reliably activate for specific visual features. Some neurons fire for curves at specific orientations, others for textures, others for object parts. This has been known for a while, but recent work has been more systematic about finding and characterizing these.

Superposition. One of the most interesting recent findings is that neural networks can represent far more features than they have neurons, through a phenomenon called superposition. When features are sparse (only a few are active at the same time), the network can assign them overlapping, non-orthogonal directions in a space with fewer dimensions than there are features. This is a form of lossy compression that the network learns. The catch: it means that many individual neurons don't correspond to clean, interpretable features. Rather, each neuron responds to a mixture of many features (such neurons are often called polysemantic). But the structure is still there, just in a more complex form.
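A small hand-built example shows the idea. The sketch below stores five features in a two-dimensional space by giving each feature its own direction on a circle. The geometry, the 0.5 threshold, and the function names are assumptions I picked for illustration; a trained network learns its own arrangement.

```python
import math

N_FEATURES = 5  # features we want to store
# Only 2 dimensions ("neurons") are available: each direction is an (x, y) pair.

# Assumption for illustration: one unit-length direction per feature,
# evenly spaced around a circle.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(active):
    """Sum the directions of the active features into one 2-D vector."""
    x = sum(DIRECTIONS[k][0] for k in active)
    y = sum(DIRECTIONS[k][1] for k in active)
    return (x, y)


def decode(vec, threshold=0.5):
    """Read every feature back with a dot product, then threshold."""
    return {
        k
        for k, (dx, dy) in enumerate(DIRECTIONS)
        if vec[0] * dx + vec[1] * dy > threshold
    }


# Sparse case: one feature active at a time.
for k in range(N_FEATURES):
    print(k, decode(encode({k})))

# Denser case: features 0 and 2 active together.
print(decode(encode({0, 2})))
```

When a single feature is active, it reads back cleanly, because the other four directions only pick up a small or negative overlap. When features 0 and 2 are active together, their interference corrupts the readout, which is why sparsity matters. Note also that neither of the two coordinates corresponds to a single feature: each one responds to several.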

Why superposition matters

Superposition is important because it challenges a common assumption in mechanistic interpretability: that we can understand networks by finding interpretable features at the neuron level. If superposition is ubiquitous, then neurons themselves might not be the right level of abstraction.

There's growing work on finding structure at other levels of abstraction. Some researchers look for features as directions or low-dimensional subspaces of activation space rather than as single neurons. Others are looking at the structure of how features interact. And still others are developing new mathematical frameworks for thinking about these compressed representations.

The circuit hypothesis

One of the organizing ideas in the field is the "circuit hypothesis." The basic claim: neural networks are made up of circuits - specific sub-structures that implement particular computations. These circuits may be small (a few heads in a transformer) or larger. The hypothesis is that if we can map out these circuits, we can explain the network's behavior.

This is appealing because it suggests a roadmap: we can work bottom-up, finding small circuits, characterizing their behavior, and then composing them to understand larger behaviors. It also suggests a method: if we can find and ablate (remove or disable) circuits, we can test whether our understanding is correct. If removing a circuit causes the network to fail at a particular behavior, that's evidence that the circuit is necessary for that behavior, though not yet proof that it is the whole mechanism.
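The ablation test described above can be sketched on a toy model. Everything here is hypothetical: the "model" is just two hand-written components that vote on the next token, and ablating a component means dropping its votes. In a real network you would instead zero out or replace the activations of specific heads or neurons, but the logic of the experiment is the same.

```python
from typing import AbstractSet, Callable, Dict, List, Optional, Tuple

Votes = Dict[str, float]
Component = Callable[[List[str]], Votes]


def induction(tokens: List[str]) -> Votes:
    """Vote for the token that followed the last earlier copy of the current token."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return {tokens[i + 1]: 2.0}
    return {}


def repeat_last(tokens: List[str]) -> Votes:
    """An unrelated component: weakly vote for repeating the current token."""
    return {tokens[-1]: 1.0}


# Hypothetical toy model: named components whose votes are summed.
COMPONENTS: Dict[str, Component] = {
    "induction": induction,
    "repeat_last": repeat_last,
}


def predict(tokens: List[str], ablated: AbstractSet[str] = frozenset()) -> Optional[str]:
    """Sum votes from all non-ablated components and return the top token."""
    scores: Votes = {}
    for name, component in COMPONENTS.items():
        if name in ablated:
            continue  # ablation: this component contributes nothing
        for token, score in component(tokens).items():
            scores[token] = scores.get(token, 0.0) + score
    return max(scores, key=scores.get) if scores else None


def accuracy(data: List[Tuple[List[str], str]],
             ablated: AbstractSet[str] = frozenset()) -> float:
    """Fraction of examples where the prediction matches the target."""
    correct = sum(predict(tokens, ablated) == target for tokens, target in data)
    return correct / len(data)


# A behavior to explain: continuing a repeated pattern.
DATA = [
    (["A", "B", "C", "A"], "B"),
    (["x", "y", "z", "x"], "y"),
    (["q", "r", "q"], "r"),
]

for ablated in (set(), {"induction"}, {"repeat_last"}):
    print(sorted(ablated), accuracy(DATA, ablated))
```

Ablating `induction` destroys the pattern-continuation behavior, while ablating `repeat_last` leaves it intact. That asymmetry is the kind of evidence an ablation study looks for when attributing a behavior to a circuit.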

The specimen angle

There's an interesting methodological point emerging in mechanistic interpretability: the "specimen approach." Rather than trying to build general theories that apply to all networks, some researchers are taking the approach of treating particular networks as specimens to be studied in detail. Pick a specific network. Pick a specific behavior. Spend significant effort trying to completely understand this one case. Document everything you find. Build up a detailed map of the circuits involved.

The hope is that by deeply understanding even one or two specimens, we can build intuitions that transfer to other settings. This is similar to how neuroscience progressed - by detailed study of simple organisms like C. elegans and fruit flies, we've learned principles that seem to apply more broadly.

This work is being collected and archived at https://overfits.ai, where there's a growing library of detailed circuit diagrams and mechanistic analyses.