I ran an fMRI on LLMs: a concept is a direction, not a region

#machinelearning #ai #neuroscience #research

TL;DR

I've been running an "fMRI for LLMs" — capturing the full internal activations of dense open models (Qwen2.5-7B, Gemma-2-9B, Gemma-4-12B) and applying neuroscience methods to map how meaning is organized. The headline result, confirmed causally and across all three models: a concept is not stored in a region of neurons — it is a single direction in activation space.

1. Meaning lives in a direction, not a region

In the brain, categories live in localized regions (faces → fusiform face area). In these LLMs I found the opposite.

Distributed, superposed code. A 10-way category linear probe decodes far above chance (Gemma-2 0.97, Qwen 0.80), yet the "most selective" units do not replicate across two random halves of the stimuli (overlap ≈ 0.00–0.05). There is no findable "animal neuron."
Causal proof. Ablating the 20 most selective units changes downstream category accuracy by ~0 (same as removing 20 random units). But ablating one distributed direction collapses it — mean ΔAUC up to +0.52 (Qwen). True in all 3 models.
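A minimal sketch of how a split-half reproducibility check like the one above can be computed. The d'-style selectivity index, the Jaccard overlap, and all function names are assumptions for illustration; the post does not say which selectivity measure or overlap statistic was used. The data here is synthetic.

```python
import numpy as np

def selectivity(X, is_target):
    """d'-style index per unit: target-vs-rest mean gap over pooled std (assumed metric)."""
    a, b = X[is_target], X[~is_target]
    pooled = np.sqrt(0.5 * (a.var(axis=0) + b.var(axis=0))) + 1e-8
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled

def top_k_units(X, is_target, k):
    """Indices of the k most selective units."""
    return set(np.argsort(selectivity(X, is_target))[-k:].tolist())

def split_half_overlap(X, is_target, k=20, seed=0):
    """Jaccard overlap of the top-k selective units found on two random stimulus halves."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    half_a, half_b = order[: len(X) // 2], order[len(X) // 2 :]
    top_a = top_k_units(X[half_a], is_target[half_a], k)
    top_b = top_k_units(X[half_b], is_target[half_b], k)
    return len(top_a & top_b) / len(top_a | top_b)

# Synthetic stand-in for activations: 400 stimuli x 1000 units, binary category label.
rng = np.random.default_rng(1)
X_noise = rng.normal(size=(400, 1000))
labels = rng.integers(0, 2, size=400).astype(bool)

# No unit is truly selective here, so the top-k sets should barely overlap.
noise_overlap = split_half_overlap(X_noise, labels, k=20)

# Contrast: a localized code, where 20 specific units carry the category.
X_local = X_noise.copy()
X_local[labels, :20] += 3.0
local_overlap = split_half_overlap(X_local, labels, k=20)
```

In this toy, a localized code reproduces across halves while pure noise does not, which is the distinction the overlap number is meant to capture.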

So the category is localized to one direction, but that direction is spread across ~2000 of 3584 neurons, and the set of neurons that looks most selective does not reproduce from one stimulus half to the other. The localization is in vector space, not in anatomy.
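The ablation comparison above can be illustrated on synthetic data. This is a toy under stated assumptions, not the pipeline behind the reported numbers: the class signal is placed along one dense direction by construction, the probe is a simple difference-of-means classifier fit once on clean data, and every name is hypothetical. It shows the procedure (zeroing units versus projecting out a direction), not the finding.

```python
import numpy as np

def ablate_units(X, idx):
    """Zero out the selected coordinates (individual 'neurons')."""
    X = X.copy()
    X[:, idx] = 0.0
    return X

def ablate_direction(X, d):
    """Remove the component of every row of X along the direction d."""
    d = d / np.linalg.norm(d)
    return X - np.outer(X @ d, d)

def fit_probe(X, y):
    """Difference-of-means probe with a midpoint threshold (assumed downstream reader)."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    return w, -0.5 * (mu1 + mu0) @ w

# Synthetic activations: unit-variance noise plus a class signal along one dense direction.
rng = np.random.default_rng(0)
n, dim, k = 2000, 512, 20
concept = rng.normal(size=dim)
concept /= np.linalg.norm(concept)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim)) + np.outer(4.0 * (2 * y - 1), concept)
X_tr, y_tr, X_te, y_te = X[: n // 2], y[: n // 2], X[n // 2 :], y[n // 2 :]

w, b = fit_probe(X_tr, y_tr)                      # fit once on clean training data
top_units = np.argsort(np.abs(w))[-k:]            # the k most selective single units
random_units = rng.choice(dim, size=k, replace=False)

def accuracy(Z):
    """Accuracy of the fixed probe on (possibly ablated) test activations."""
    return float((((Z @ w + b) > 0).astype(int) == y_te).mean())

acc_base = accuracy(X_te)
acc_top = accuracy(ablate_units(X_te, top_units))
acc_rand = accuracy(ablate_units(X_te, random_units))
acc_dir = accuracy(ablate_direction(X_te, w))     # project out the probe direction
```

In this toy, removing 20 units (selective or random) leaves the probe largely intact, while projecting out the single direction drops it to chance.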

2. The mechanism, nailed by intervention

The residual stream is a shared additive bus. Injecting a concept direction at N consecutive layers equals injecting N× the magnitude at one layer — ratio = 1.00 for every N. The stream literally sums contributions across layers.
Only relative magnitude carries information. Scaling the whole residual by 0.25×–4× produces zero output change (RMSNorm divides the scale out). Scaling only the component along the concept direction produces a clean monotonic concept shift. Meaning = the projection along a direction, not the vector's length.
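The scale argument can be checked in a few lines. This is a sketch of a single gain-free RMSNorm applied to one synthetic residual vector, assuming a hypothetical unit concept direction `d`; it is not a model forward pass. The invariance to global scale holds only up to the small `eps` term.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root-mean-square of x."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

def scale_component(h, d, k):
    """Multiply only the component of h along unit vector d by k."""
    return h + (k - 1.0) * (h @ d) * d

rng = np.random.default_rng(0)
dim = 256
d = rng.normal(size=dim)
d /= np.linalg.norm(d)                        # hypothetical concept direction
noise = rng.normal(size=dim)
h = (noise - (noise @ d) * d) + 2.0 * d       # residual whose projection on d is 2.0

# (a) Rescale the whole residual: the normalized readout along d is unchanged.
global_readouts = [float(rms_norm(s * h) @ d) for s in (0.25, 1.0, 4.0)]

# (b) Rescale only the component along d: the readout moves monotonically.
concept_readouts = [
    float(rms_norm(scale_component(h, d, k)) @ d) for k in (0.25, 0.5, 1.0, 2.0, 4.0)
]
```

Case (a) returns three nearly identical values and case (b) a strictly increasing sequence, matching the claim that the normalized stream keeps the projection onto a direction and discards overall length.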

3. How much of the network is one concept? (depth study)

Under strict controls (120 stimuli/category, an architecture-matched untrained twin, word-grouped splits so no frame leaks across train/test):

A concept is essentially rank-1 — one direction, present at every depth (decodable layer-span: trained 1.0 vs untrained 0.0). Narrow in width, broad in depth.
Concepts coexist additively. One shared probe reads each category as well as a dedicated probe (retention 1.00) — they're linearly superposed and read in parallel.
Direction is the whole code. A nonlinear MLP probe fails to beat a single linear direction (gap ≤ 0 in all models), even with 1200 stimuli. "Meaning = direction" isn't an approximation; it's the code.
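The word-grouped split mentioned in the controls can be sketched as follows. This assumes each stimulus is a target word placed in a sentence frame and that splitting is done by word, so every framing of a word lands on the same side; the function name and the example words are hypothetical.

```python
import numpy as np

def word_grouped_split(words, test_frac=0.25, seed=0):
    """Split stimulus indices so that no word appears in both train and test."""
    words = np.asarray(words)
    unique_words = np.unique(words)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_words)
    n_test = max(1, int(round(test_frac * len(unique_words))))
    test_mask = np.isin(words, shuffled[:n_test])
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical stimuli: each word appears in three different frames.
words = ["dog"] * 3 + ["cat"] * 3 + ["oak"] * 3 + ["elm"] * 3
train_idx, test_idx = word_grouped_split(words, test_frac=0.25)

train_words = {words[i] for i in train_idx}
test_words = {words[i] for i in test_idx}
```

A probe scored on `test_idx` then has to generalize to unseen words instead of recognizing a word it already saw in another frame.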

4. Where LLMs match the brain — and where they don't

Property	Brain	Dense LLM	Verdict
Small-worldness / rich-club hubs	yes	yes (σ up to 12.8)	match
Network modularity Q	0.30–0.50	0.09–0.23, rising each generation	partial
Category-selective regions	yes (FFA/PPA)	no (distributed direction)	differ
Topographic maps (retinotopy etc.)	yes	no (~20–40× below cortex)	differ
Cross-model universality (CKA)	—	0.69–0.77, cross-family	Platonic convergence
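For readers unfamiliar with the CKA row, here is a minimal sketch of linear CKA between two activation matrices recorded on the same stimuli. Whether the numbers in the table come from linear or kernel CKA is not stated, so the linear variant is an assumption, and the matrices below are synthetic.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between (n_stimuli x features) matrices; feature counts may differ."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 32))

# Same geometry, different basis and scale: CKA should treat these as identical.
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
rotated = 3.0 * A @ Q

# Independent noise with a different width: CKA should be much lower.
unrelated = rng.normal(size=(200, 48))

cka_same = linear_cka(A, rotated)
cka_unrelated = linear_cka(A, unrelated)
```

Because the score ignores rotations and uniform rescaling, it can compare models with different hidden sizes and arbitrary neuron orderings, which is what a cross-family comparison needs.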

Two bonus results worth flagging:

Steerability is predicted by encoding dimensionality (r ≈ −0.83): concepts packed into ~1 direction (numbers, colors) steer cleanly; high-dimensional concepts resist.
A wiring-cost penalty makes a small transformer more modular (ΔQ > 0 in 4/4 seeds, with a non-monotonic sweet spot) — direct evidence that the brain's modularity is partly a consequence of physical embedding constraints that transformers normally lack.
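The modularity Q used in the table and in the wiring-cost result is the standard Newman modularity of a graph partition. A minimal sketch, assuming a symmetric (optionally weighted) adjacency matrix and a given community assignment; how the connectome graph and the communities were actually derived is not specified in the post.

```python
import numpy as np

def modularity_q(adj, communities):
    """Newman modularity Q of a partition on an undirected, optionally weighted graph."""
    adj = np.asarray(adj, dtype=float)
    communities = np.asarray(communities)
    degree = adj.sum(axis=1)
    two_m = degree.sum()                                  # twice the total edge weight
    same = communities[:, None] == communities[None, :]   # pairs in the same community
    expected = np.outer(degree, degree) / two_m           # null-model edge weight
    return float(((adj - expected) * same).sum() / two_m)

# Toy graph: two disconnected 4-node cliques.
clique = np.ones((4, 4)) - np.eye(4)
zeros = np.zeros((4, 4))
adj = np.block([[clique, zeros], [zeros, clique]])

q_good = modularity_q(adj, [0, 0, 0, 0, 1, 1, 1, 1])   # partition matches the cliques
q_bad = modularity_q(adj, [0, 1, 0, 1, 0, 1, 0, 1])    # partition cuts across them
```

Here `q_good` is 0.5 and `q_bad` is negative, so a ΔQ > 0 means the network's connections have become more concentrated within communities relative to a degree-matched null.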

Honest nulls

The harness has an adversarial verification gate, and several appealing hypotheses died in it: "abstraction velocity predicts capability" was rejected on a clean 5-point Qwen ladder; the flashy "60× more localized in SAE features" shrank to a modest 2.4× under a gold-standard pretrained Gemma Scope SAE; cross-model feature-level universality is only partial. Reported as nulls, not spun.

Method: dense models scanned on Apple Silicon (MPS) and analyzed with a neuroscience-style pipeline (linear probes, RSA/CKA, functional connectome graphs, causal patching, SAEs, steering). Every number is traceable to a data file. Feedback welcome.