Today a friend of mine — let's leave him nameless — said the line I've been hearing since 2022: "It's still just matrices multiplying, guessing the most probable next word." For a long while I had no good rebuttal beyond intuition.
I've been using LLMs as a daily tool since the original ChatGPT shipped in November 2022. I've gone through the lot — GPT-3.5, GPT-4, every Claude, Gemini, DeepSeek, Qwen, Mistral, the local Gemma stack on LM Studio. I write Claude plugins, build automation pipelines, run agents in production. I'm a frontend developer by trade, but AI tooling has become roughly half of what I do these days.
So when someone tells me LLMs are just statistics, four years of practitioner intuition tell me something doesn't add up. But intuition is a rather poor argument. I went looking for what the research actually says in 2026. This piece is what I found.
The short version: the "just matrix multiplication" framing was reasonable in 2021. In 2026, it isn't. There's a parallel story about cognitive offloading and what happens to the user's mind when leaning on these tools — and there the sceptics are largely right. Both stories matter, and they're often muddled together.
Where "Stochastic Parrots" Came From
The phrase comes from Bender, Gebru, McMillan-Major and Shmitchell's 2021 paper, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? It's worth reading in full — foundational work, and the critique it raises about bias, environmental cost, and hype hasn't aged a day.
The technical claim was narrower than how it's usually quoted. The authors argued that language models, as understood at the time, were stitching together linguistic forms from training data without any underlying model of meaning or the world the language refers to. Given what was visible from outside in 2021 — GPT-3 had only just shipped, interpretability tooling was primitive — this was a fair description.
The phrase then became a kind of shibboleth. People who wanted to deflate AI hype would invoke "stochastic parrots" as shorthand. People who wanted to push back called the framing reductive. Both camps mostly stopped reading the paper itself and started using the phrase as a flag.
What's changed since 2021 isn't whether the original critique was correct. What's changed is that we now have direct empirical access to what's happening inside these models. And what's inside doesn't square with the "no world model, just surface statistics" picture.
The First Crack: Othello-GPT (2022)
Kenneth Li and colleagues at Harvard published Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (https://arxiv.org/abs/2210.13382). They trained a small GPT-style transformer on a single task: given a sequence of Othello moves, predict the next legal move. The model was given no rules, no board, no description of the game whatsoever. Just sequences.
After training, the model played legal moves with high accuracy. The researchers then probed its internal activations and found a representation of the current board state — not in the input, not in the output, but constructed inside the model's residual stream. They confirmed this causally: by editing the internal representation of a single square, they could change the model's predicted next move in exactly the way you'd expect if it were "looking at" a modified board.
Neel Nanda extended the work in 2023 (https://arxiv.org/abs/2309.00941), showing the board representation was actually linear — recoverable with simple probes and editable with vector arithmetic. Adam Karvonen reproduced the result on real chess games in 2024 (https://arxiv.org/abs/2403.15498).
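To make the probing idea concrete, here's a minimal sketch — not the Othello-GPT setup itself, just the shape of it. The activations below are synthetic stand-ins for a real model's residual stream, and the "square states" are invented for the demo; the point is that a linear probe is nothing more exotic than a linear classifier fitted to frozen hidden vectors, and that nudging an activation along the probe's direction moves the readout with it:

```python
# Minimal sketch of linear probing — not the Othello-GPT setup itself.
# Synthetic activations stand in for a real model's residual stream: each
# hidden vector linearly encodes one square's state (0=empty, 1=mine, 2=theirs)
# plus noise.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_samples = 128, 3000

state_directions = rng.normal(size=(3, d_model))   # one direction per square state
labels = rng.integers(0, 3, size=n_samples)
activations = state_directions[labels] + rng.normal(scale=2.0, size=(n_samples, d_model))

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)

# A linear probe is just a linear classifier fitted to frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# "Editable with vector arithmetic": push an activation along the probe's
# direction for state 2 and see whether the readout moves with it.
i = int(np.argmax(y_test != 2))                     # a sample not already in state 2
edited = X_test[i] + 10.0 * (probe.coef_[2] - probe.coef_[y_test[i]])
print("before edit:", probe.predict(X_test[i:i + 1])[0],
      "after edit:", probe.predict(edited[None])[0])
```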
This isn't surface statistics. A pure n-gram model trained on Othello move sequences would develop conditional probabilities over move tokens, full stop. That a transformer trained on the same data builds a board internally, and uses it causally, is a hard empirical fact about what next-token prediction can produce as a side effect of optimisation pressure.
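For contrast, this is what "surface statistics" literally is — a conditional count table. The toy games below are invented; the point is that there is nothing inside this object to probe or edit, because no board exists anywhere in it:

```python
# What "surface statistics" literally looks like: a bigram table over move
# tokens. The toy games are made up. There is no board anywhere in this
# object — just conditional counts.
from collections import Counter, defaultdict

games = [["d3", "c5", "d6", "e3"],
         ["d3", "e3", "f4", "c5"],
         ["f5", "d6", "c3", "d3"]]

bigram = defaultdict(Counter)
for game in games:
    for prev, nxt in zip(game, game[1:]):
        bigram[prev][nxt] += 1

def predict_next(prev_move):
    counts = bigram[prev_move]
    total = sum(counts.values())
    return {move: n / total for move, n in counts.items()}

print(predict_next("d3"))   # {'c5': 0.5, 'e3': 0.5} — probabilities over tokens, nothing else
```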
A small model on a toy domain, granted. But it's a proof of concept: "predict the next token" is a sufficient training signal for representations of the underlying generative process to emerge as a by-product.
Scaling Up: Anthropic's Interpretability Work (2024–2026)
Anthropic ran the same playbook on a production model. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (https://transformer-circuits.pub/2024/scaling-monosemanticity/) used sparse autoencoders to pull millions of interpretable features out of Claude 3 Sonnet itself. The features were abstract, multilingual, multimodal — concepts like "Golden Gate Bridge," "code with security vulnerabilities," "sycophantic praise," each represented as an identifiable direction in the model's activation space.
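The mechanism behind that feature extraction is a sparse autoencoder, and the core of it fits in a toy sketch. This is nowhere near Anthropic's scale or training recipe — random vectors stand in for captured activations — but it shows the trick: an overcomplete dictionary with a sparsity penalty, so each activation gets explained by a handful of directions:

```python
# Toy sparse autoencoder — the core mechanism, minus the scale. Random
# vectors stand in for residual-stream activations captured from a model.
import torch
import torch.nn as nn

d_model, d_features = 256, 2048                    # overcomplete: far more features than dims
acts = torch.randn(4096, d_model)                  # stand-in for captured activations

encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(200):
    batch = acts[torch.randint(0, len(acts), (256,))]
    features = torch.relu(encoder(batch))          # non-negative, pushed towards sparsity
    recon = decoder(features)
    loss = ((recon - batch) ** 2).mean() + 3e-3 * features.abs().mean()   # reconstruction + L1
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of decoder.weight is one learned feature direction in activation
# space — the kind of direction a label like "Golden Gate Bridge" gets attached
# to after inspection on a real model.
print(decoder.weight.shape)                        # torch.Size([256, 2048])
```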
The follow-up papers in 2025 went further. Circuit Tracing: Revealing Computational Graphs in Language Models and On the Biology of a Large Language Model (https://transformer-circuits.pub/2025/attribution-graphs/biology.html) didn't just identify features — the team traced how features connect into circuits performing computation across layers. The findings:
- When the model writes a poem with a rhyme scheme, it picks the rhyming target word before generating the line, then plans backward to that target. That's planning, not left-to-right generation.
- Asked "what's the capital of the state Dallas is in," the model first activates a Texas representation as an intermediate step, then retrieves Austin from there. Multi-hop reasoning through internal state, not direct lookup.
- In medical scenarios, the model forms an internal candidate diagnosis that influences its follow-up questions, even when the diagnosis is never stated aloud.
In January 2026, MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies of the year (https://www.technologyreview.com/2026/01/12/1130003/mechanistic-interpretability-ai-research-models-2026-breakthrough-technologies/). That isn't an Anthropic press release — it's the field reaching the point where working tools for looking inside have actually arrived.
And critically, those tools are open. Anthropic released the circuit-tracer library, which works on open-weight models like Gemma-2-2b and Llama-3.2-1b. The Neuronpedia community platform (https://www.neuronpedia.org) hosts features and circuits for open models you can browse by hand. You needn't take anyone's word for it — run it yourself and have a look. Subhadip Mitra's January 2026 write-up (https://subhadipmitra.com/blog/2026/circuit-tracing-production/) frames this as interpretability shifting from "interesting research direction" to "practical engineering discipline."
If you want to see this for yourself, the interactive notebooks at https://transformer-circuits.pub and Neuronpedia are the obvious starting points. Pick a feature, see what activates it, try to break the explanation. The most efficient cure for "it's just matrices" I know.
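And if you'd rather start from raw activations on your own machine before reaching for circuit-tracer or Neuronpedia, the Hugging Face transformers library hands them to you in a few lines. I use GPT-2 below purely because it downloads without any gating — swap in whichever small open model you actually care about:

```python
# Pull per-layer hidden states out of a small open-weight model.
# GPT-2 here only because it downloads with no gating — any small open model
# you have access to works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The Golden Gate Bridge is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One tensor per layer (plus the embedding layer), shape (1, seq_len, d_model).
# These are the raw material that probes, SAEs and circuit tracing operate on.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")
```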
So Is the "Just Prediction" Argument Dead?
Not quite. Intellectual honesty matters here, and I'd be writing a worse piece if I pretended the sceptic camp had nothing left.
Schaeffer, Miranda, and Koyejo's Are Emergent Abilities of Large Language Models a Mirage? (https://arxiv.org/abs/2304.15004) showed that some claimed "emergent" capabilities are partly artefacts of metric choice — swap a hard accuracy threshold for a smooth metric and the discontinuous jump disappears. The original Wei et al. emergence paper still holds for some capabilities, but the picture is more nuanced than "abilities suddenly appear at scale."
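The metric-choice point is easy to see in toy form. The numbers below are invented purely to show the shape of the effect: per-token accuracy improving smoothly with scale looks like a sudden "emergent" jump the moment you score it with an all-or-nothing exact-match metric:

```python
# Toy illustration of the metric-choice artefact. The "scale" axis and the
# accuracy curve are invented; only the shape of the effect matters.
import numpy as np

scale = np.linspace(1, 10, 10)                     # pretend model-size axis
per_token_acc = 1 - 0.9 * np.exp(-0.5 * scale)     # smooth, gradual improvement (assumed)
exact_match_20 = per_token_acc ** 20               # all-or-nothing: 20-token answer, every token right

for s, smooth, hard in zip(scale, per_token_acc, exact_match_20):
    print(f"scale {s:4.1f}   per-token acc {smooth:.2f}   exact-match {hard:.3f}")

# The per-token column creeps up steadily; the exact-match column hugs zero
# and then appears to take off — a property of the metric, not a new ability.
```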
Chain-of-thought rationales aren't always faithful to the actual computation that produced an answer. Models hallucinate. They have no online learning, no persistent memory between sessions outside of explicit memory systems. Much of Bender et al.'s critique about bias and hype landed hard and remains there.
What's gone is the strong version: "LLMs only learn surface statistics, no internal model of the generating process exists." That claim is empirically falsified — by Othello-GPT, by Anthropic's circuits work, by Karvonen's chess results, by the linear spatial representations Tehenan et al. (https://arxiv.org/abs/2506.02996) found in LLM activations.
The interesting questions have moved on. They're now about which aspects of the world models are accurate, when models reason faithfully versus rationalise, how to use interpretability tools to debug failures. "Are they really thinking?" was the 2021 question. It isn't the 2026 question.
The Other Side: Cognitive Offloading is Real
Here the sceptics are mostly right, and we AI enthusiasts — myself included — have to be honest about it.
Lee et al. at Microsoft Research and Carnegie Mellon (CHI 2025, https://www.microsoft.com/en-us/research/publication/the-impact-of-generative-ai-on-critical-thinking-self-reported-reductions-in-cognitive-effort-and-confidence-effects-from-a-survey-of-knowledge-workers/) surveyed 319 knowledge workers using GenAI weekly, with 936 first-hand task examples. Higher confidence in the AI was associated with less critical thinking. Higher confidence in one's own expertise was associated with more. The mechanism is cognitive offloading: when you trust the tool, you stop checking, and over time you stop developing the skill that would let you check in the first place.
Gerlich (2025, Societies, https://www.mdpi.com/2075-4698/15/1/6) found the same pattern in a separate 666-participant study. Lodge and Loble at the University of Technology Sydney published a March 2026 review of cognitive offloading in education (https://www.uts.edu.au/news/2026/03/experts-warn-unstructured-ai-use-in-schools-risks-cognitive-atrophy), warning that unstructured AI use in schools risks cognitive atrophy.
But — and the story gets rather more interesting here — there's a strong counter-observation from a recent BCG study.
Randazzo, Lifshitz, Kellogg, Dell'Acqua, Mollick, Candelon and Lakhani — Cyborgs, Centaurs and Self-Automators: The Three Modes of Human-GenAI Knowledge Work (HBS Working Paper 26-036, December 2025, https://ssrn.com/abstract=4921696) — tracked 244 BCG consultants through a seven-stage strategic problem-solving workflow. Three empirically distinct modes of AI use emerged:
- Cyborgs (~60%) weave their work deeply with the AI across the entire workflow. Iterative dialogue, AI personas, using it to shape both the problem and the solution. They develop new AI-related capabilities — what the authors call newskilling.
- Centaurs (14%) maintain a clear division of labour: humans decide what to do and how, AI is used selectively for specific support tasks. This group produced the highest accuracy in business recommendations, outperforming both other modes.
- Self-Automators (the remainder) copy data into AI, tweak slightly, paste it back. No skill gains at all. This is the offloading-into-atrophy pattern in its pure form.
So the data immediately tells you that "using AI" is too coarse a category. Cyborgs and centaurs both grow, just along different trajectories. Self-automators lose out.
There's a classic finding in cognitive load theory that cuts the other way and fills out the picture: Sweller's worked example effect (1985, replicated dozens of times). For novices in a domain, studying a worked-out solution is often more effective than struggling through the problem from scratch, because the reduced cognitive load frees working memory for schema acquisition. When I watch Claude work through an architectural problem in a domain I'm new to, then attempt a similar problem myself, I'm running the worked-example pattern. That's the opposite of offloading.
The difference between the two patterns comes down to one thing: whether you alternate observation with active retrieval. Watching Claude solve and nodding along is offloading. Watching, predicting the next step, attempting a variant yourself, comparing — that's learning.
A Small Concrete Example
Yesterday I was setting up internet at my dacha. Weak LTE, antenna pointing the wrong way, no obvious solution. I sent Claude the coordinates of my plot and the base station, plus a photo of the antenna location with a compass and a timestamp visible in the EXIF.
Claude calculated the bearing between the two coordinates — the great-circle forward-azimuth calculation, the close cousin of the haversine distance formula — and got 226°. Then it noticed something: the EXIF timestamp said 17:03, and at that hour in early May in Omsk, the sun should be at roughly 225–230° azimuth. The compass in the photo confirmed the direction matched. All three checks lined up.
There's no magic in this. It's three small inferential steps glued together: a geographic calculation, a known fact about solar position, a visual cross-check. I could have done each step myself, in three different tools, in twenty minutes, with a calculator and a sun-position website. Claude did it in one prompt as a coherent answer.
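For the curious, the geographic step really is a dozen lines. The coordinates below are placeholders, not my actual plot and base station — substitute your own pair of points:

```python
# The geographic step on its own. Coordinates are placeholders, not my actual
# plot and base station — substitute your own pair of points.
from math import radians, degrees, sin, cos, atan2

def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, degrees clockwise from north."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlon = radians(lon2 - lon1)
    x = sin(dlon) * cos(phi2)
    y = cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(dlon)
    return (degrees(atan2(x, y)) + 360) % 360

plot = (54.95, 73.20)            # placeholder, roughly Omsk-ish
base_station = (54.93, 73.17)    # placeholder
print(f"point the antenna at roughly {initial_bearing(*plot, *base_station):.0f}°")
```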
That's the concrete shape of what these tools are good at right now. Not "thinking" in any mystical sense. Integration of small inferential steps across domains, fast enough that the overhead of doing it yourself stops being worth paying. And it's exactly where the offloading risk lives — because the next time I have a similar problem, I shan't reach for the calculator. I'll reach for Claude.
What This Means for Practitioners
If you write code with LLM assistance every day, the "just matrices" framing isn't useful for thinking about what you're working with. It isn't wrong at the hardware level — it's simply not at the right level of description for the questions that actually matter to you. Your CPU is "just transistors switching," and that fact doesn't help you reason about your application either.
The framings that actually predict behaviour in 2026:
- LLMs build internal representations of the structures generating their training data. Those representations can be accurate, partial, or distorted, and you can sometimes inspect them with interpretability tools.
- Their reasoning is real but partial: it often runs through parallel internal paths, and the chain-of-thought a model reports is frequently, but not always, faithful to the computation underneath. Treat chain-of-thought as evidence, not proof.
- Cognitive offloading is real and measurable. The tools grow you when you contest their outputs and shrink you when you trust them.
- The question "use AI or don't" is the wrong one. The right question is which of the BCG modes you operate in, and whether you're choosing it consciously.
If you want to dig deeper, three concrete starting points: read On the Biology of a Large Language Model end to end (it's HTML with interactive diagrams). Spend an hour on Neuronpedia trying to break the feature explanations. Read the Lee et al. Microsoft paper, then watch yourself for a week and notice when you skip the verification step.
Closing
The 2021 debate about whether LLMs were "really thinking" was good for its time. The 2026 debate is about which patterns of human-AI interaction grow human capability and which atrophy it. The first debate had no decent empirical handles. The second one does, and they point in fairly clear directions.
When someone tells you "it's still just matrices," send them this article — or better, send them to Neuronpedia and ask them to explain what they're looking at. The conversation usually shifts from there.