Over the last few years, “attention” has become one of the most overloaded words in machine learning. We often talk about attention weights as if they were explanations, even though many researchers explicitly warn against that interpretation.
I recently tried to get a more concrete understanding of attention heads by poking at small language models and reading interpretability papers more carefully. This post is not a breakthrough, and it doesn’t present new results. Instead, it’s a short reflection on what didn’t work, what surprised me, and how my mental model of attention changed in the process.
I’m writing this partly to clarify my own thinking, and partly in case it’s useful to others who are trying to move from “I know the theory” to “I understand the mechanism.”
What I initially believed
Before digging in, I implicitly believed a few things:
- If an attention head consistently attends to a specific token, that token is probably “important.”
- Looking at attention heatmaps would quickly reveal what a model is doing.
- Individual heads should correspond to relatively clean, human-interpretable functions.
None of these beliefs survived contact with even small toy models.
First surprise: attention patterns are easy to see, hard to interpret
It’s trivially easy to generate attention visualisations. Many tools make this feel like progress: you can point to a head and say “look, it’s attending to commas” or “this head likes previous nouns.”
What’s harder is answering the question: “If this head disappeared, would the model’s behaviour meaningfully change?”
Without that causal step, attention patterns felt more like descriptions than explanations. They were suggestive, but not decisive.
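The cheapest version of that causal step is zero-ablation: knock out a single head's output and see whether anything downstream actually moves. Here is a minimal sketch of what I mean, assuming TransformerLens and GPT-2 small; the prompt and the layer/head indices are arbitrary choices for illustration, not claims from any particular experiment.

```python
# Minimal zero-ablation sketch (assumes TransformerLens; prompt and head choice are arbitrary).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The keys to the cabinet are on the table.")

LAYER, HEAD = 5, 1  # illustrative indices, nothing special about this head

def zero_head(z, hook):
    # z is the per-head attention output: [batch, position, head_index, d_head]
    z[:, :, HEAD, :] = 0.0
    return z

baseline_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_head)],
)
print(f"baseline: {baseline_loss.item():.4f}  with L{LAYER}H{HEAD} ablated: {ablated_loss.item():.4f}")
```

If the loss barely moves, the pretty heatmap for that head was describing something, but probably not something the model depends on for this input.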
Second surprise: heads don’t act alone
Another naive assumption I had was that heads are mostly independent. In practice, even small models distribute functionality across multiple components:
- Several heads may partially contribute to the same behaviour
- Removing one head often degrades performance gradually rather than catastrophically
- Some heads only “matter” in combination with specific MLP layers
This made me more sympathetic to why interpretability papers emphasise circuits rather than single components. The unit of explanation is often larger than one head but smaller than the entire model.
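A hedged sketch of the kind of sweep that pushed me towards this view: zero-ablate every head in turn and look at how the loss shifts. It again assumes TransformerLens and GPT-2 small, and with a single arbitrary prompt the numbers say nothing general; in my limited runs the distribution was long-tailed, with most heads barely mattering on their own.

```python
# Per-head ablation sweep (assumes TransformerLens; single arbitrary prompt, illustration only).
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)  # inference only, no gradients needed

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The keys to the cabinet are on the table.")
baseline = model(tokens, return_type="loss").item()

deltas = {}
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        def zero_head(z, hook, head=head):
            z[:, :, head, :] = 0.0  # zero this head's output at every position
            return z

        loss = model.run_with_hooks(
            tokens,
            return_type="loss",
            fwd_hooks=[(f"blocks.{layer}.attn.hook_z", zero_head)],
        ).item()
        deltas[(layer, head)] = loss - baseline

# Heads whose removal hurts most on this prompt, largest first.
for (layer, head), delta in sorted(deltas.items(), key=lambda kv: -kv[1])[:5]:
    print(f"L{layer}H{head}: Δloss = {delta:+.4f}")
```

Even the "top" heads here are only the top heads for one prompt; swap the input and the ranking reshuffles, which is part of what made single-head stories feel fragile to me.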
Third surprise: failure is informative
In a few cases, I expected to find a clear pattern (for example, a head that reliably copies the next token after a repeated sequence) and… didn’t. Either the effect was weaker than expected, or it appeared inconsistently across layers.
Initially, this felt like a dead end. But reading more carefully, I realised that many published results are:
- Highly conditional on architecture
- Easier to observe at certain depths
- Sensitive to training setup and data
A “failed reproduction” wasn’t a refutation, but it was evidence about where and when a mechanism appears.
What changed in my own mental model
After this experience, I now think about attention heads differently:
- Attention weights are hypotheses, not explanations
- Causal interventions (ablation, patching) matter more than visualisation; see the patching sketch after this list
- Clean mechanisms are the exception, not the rule
- Toy models are not simplified versions of large models; instead, they're different objects that expose certain behaviours more clearly
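To make the second bullet concrete, here is a minimal activation-patching sketch, once more assuming TransformerLens and GPT-2 small. The prompts and the layer/head choice are purely illustrative, not a claim about a real circuit.

```python
# Activation-patching sketch (assumes TransformerLens; prompts and head choice are illustrative).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

_, clean_cache = model.run_with_cache(clean)  # cache every activation on the clean prompt

LAYER, HEAD = 9, 6  # illustrative indices only

def patch_head(z, hook):
    # Overwrite this head's output with its value from the clean run.
    z[:, :, HEAD, :] = clean_cache[hook.name][:, :, HEAD, :]
    return z

corrupt_logits = model(corrupt)
patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", patch_head)],
)

mary = model.to_single_token(" Mary")
print("p(' Mary') on corrupted run:", corrupt_logits[0, -1].softmax(-1)[mary].item())
print("p(' Mary') after patching:  ", patched_logits[0, -1].softmax(-1)[mary].item())
```

The question this answers is not "what does this head look at?" but "how much of the clean behaviour does this one head carry?", which is the kind of evidence that actually shifted my mental model.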
The work feels more like doing biology: messy, partial, and incremental. Most importantly, I stopped expecting interpretability to feel like reverse-engineering a clean system.
What I still don’t understand
To be explicit about the gaps:
- When does a “distributed” explanation become too diffuse to be useful?
- How stable are identified circuits across random seeds?
- Which interpretability results genuinely scale, and which are artefacts of small models?
These questions feel more important to me now than finding another pretty attention plot.
Why does this matter?
I don’t think interpretability progress comes from declaring models “understood.” It comes from slowly shrinking the gap between what we can describe and what we can causally explain.
Even small, frustrating attempts to understand a model helped me appreciate why careful, modest claims are a feature, not a bug.
If nothing else, this experience made me more cautious about explanations I find convincing at first glance.
Closing
This post reflects a small slice of my learning process, not a polished conclusion. If you’ve had similar experiences — or think I’ve misunderstood something fundamental — I’d genuinely like to hear about it.
Understanding these systems feels hard because it is hard. That’s probably a good sign.