TL;DR
I've been running an "fMRI for LLMs" — capturing the full internal activations of dense open models (Qwen2.5-7B, Gemma-2-9B, Gemma-4-12B) and applying neuroscience methods to map how meaning is organized. The headline result, confirmed causally and across all three models: a concept is not stored in a region of neurons — it is a single direction in activation space.
1. Meaning lives in a direction, not a region
In the brain, categories live in localized regions (faces → fusiform face area). LLMs are the opposite.
- Distributed, superposed code. A 10-way category linear probe decodes far above chance (Gemma-2 0.97, Qwen 0.80), yet the "most selective" units do not replicate across two random halves of the stimuli (overlap ≈ 0.00–0.05). There is no findable "animal neuron."
- Causal proof. Ablating the 20 most selective units changes downstream category accuracy by ~0, the same as removing 20 random units. Ablating one distributed direction collapses it: mean ΔAUC up to +0.52 (Qwen). This holds in all three models.
So category is localized to one direction, but that direction is spread across ~2000 of 3584 neurons, and the set of neurons that looks most selective does not reproduce from one stimulus split to the next. Localization is in vector space, not anatomy.
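To make the units-versus-direction contrast concrete, here is a minimal sketch on synthetic data, not the actual harness. Everything in it is an assumption for illustration: the sizes `n`, `d`, `k`, the shift of 4.0, the difference-of-means probe, and the helper names `probe_accuracy`, `ablate_units`, and `ablate_direction`. The probe is refit after each ablation, so it measures how much linearly decodable signal survives, which is only a stand-in for downstream accuracy in a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 512, 20  # hypothetical sample count, width, units to ablate

# Dense unit-norm concept direction: every unit carries a small share of it.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

# Synthetic activations: isotropic noise plus a class-dependent shift along the direction.
labels = rng.integers(0, 2, size=n)
acts = rng.standard_normal((n, d)) + 4.0 * np.outer(2 * labels - 1, direction)

def probe_accuracy(x, y):
    """Fit a difference-of-means linear probe on the first half, score it on the second."""
    half = len(y) // 2
    x_tr, y_tr, x_te, y_te = x[:half], y[:half], x[half:], y[half:]
    mu1 = x_tr[y_tr == 1].mean(axis=0)
    mu0 = x_tr[y_tr == 0].mean(axis=0)
    w = mu1 - mu0
    threshold = 0.5 * (mu1 + mu0) @ w
    return float(np.mean((x_te @ w > threshold) == (y_te == 1)))

def ablate_units(x, idx):
    """Zero out the given coordinates (the 'neurons')."""
    out = x.copy()
    out[:, idx] = 0.0
    return out

def ablate_direction(x, u):
    """Remove the component of every activation along the unit vector u."""
    return x - np.outer(x @ u, u)

# Rank units by how strongly their mean activation differs between classes.
selectivity = np.abs(acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0))
top_units = np.argsort(selectivity)[-k:]
random_units = rng.choice(d, size=k, replace=False)

acc_base = probe_accuracy(acts, labels)
acc_top = probe_accuracy(ablate_units(acts, top_units), labels)
acc_rand = probe_accuracy(ablate_units(acts, random_units), labels)
acc_dir = probe_accuracy(ablate_direction(acts, direction), labels)

print(f"baseline {acc_base:.3f} | top-{k} units {acc_top:.3f} | "
      f"random units {acc_rand:.3f} | direction {acc_dir:.3f}")
```

In this toy setup the direction is known by construction, so it can be projected out exactly. On a real model it would have to be estimated first, and that estimate should come from held-out data so the ablation does not leak into the evaluation.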
2. The mechanism, nailed by intervention
- The residual stream is a shared additive bus. Injecting a concept direction at N consecutive layers equals injecting N× the magnitude at one layer — ratio = 1.00 for every N. The stream literally sums contributions across layers.
- Only relative magnitude codes. Scaling the whole residual 0.25×–4× → zero output change (RMSNorm divides it out). Scaling only the component along the concept direction → a clean monotonic concept shift. Meaning = the projection along a direction, not the vector's length.
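The magnitude result can be checked on a single vector. The sketch below is an illustration under assumptions, not the harness code: `rms_norm` is a plain RMSNorm without the learned gain, the width `d = 256` and the scale factors are arbitrary, and `scale_along` is a hypothetical helper. It covers only the second bullet. The additive-bus result depends on real layer computations and is not reproduced here.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without the learned gain: divide by the root-mean-square of the vector."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def scale_along(x, u, alpha):
    """Multiply only the component of x along the unit vector u by alpha."""
    return x + (alpha - 1.0) * (x @ u) * u

rng = np.random.default_rng(0)
d = 256  # hypothetical residual width

# Unit-norm concept direction.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

# Residual vector whose projection onto u is exactly 2.0.
noise = rng.standard_normal(d)
x = (noise - (noise @ u) * u) + 2.0 * u

scales = [0.25, 0.5, 1.0, 2.0, 4.0]

# 1) Scaling the whole vector: the normalized vector is unchanged (up to eps).
reference = rms_norm(x)
whole_scale_invariant = all(
    np.allclose(rms_norm(s * x), reference, atol=1e-4) for s in scales
)

# 2) Scaling only the component along u: the post-norm readout along u moves monotonically.
readouts = np.array([rms_norm(scale_along(x, u, a)) @ u for a in scales])

print("whole-vector scaling invariant:", whole_scale_invariant)
print("readout along u per scale:", np.round(readouts, 3))
```

The first check is exact up to the `eps` term, since a global factor cancels between numerator and denominator. The second shows why the projection survives normalization: changing one component changes the vector's direction, not just its length.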
3. How much of the network is one concept? (depth study)
Under strict controls (120 stimuli/category, an architecture-matched untrained twin, word-grouped splits so no frame leaks across train/test):
- A concept is essentially rank-1 — one direction, present at every depth (decodable layer-span: trained 1.0 vs untrained 0.0). Narrow in width, broad in depth.
- Concepts coexist additively. One shared probe reads each category as well as a dedicated probe (retention 1.00) — they're linearly superposed and read in parallel.
- Direction is the whole code. A nonlinear MLP probe fails to beat a single linear direction (gap ≤ 0 in all models), even with 1200 stimuli. "Meaning = direction" isn't an approximation; it's the code.
4. Where LLMs match the brain — and where they don't
| Property | Brain | Dense LLM | Verdict |
|---|---|---|---|
| Small-worldness / rich-club hubs | yes | yes (σ up to 12.8) | match |
| Network modularity Q | 0.30–0.50 | 0.09–0.23, rising each generation | partial |
| Category-selective regions | yes (FFA/PPA) | no (distributed direction) | differ |
| Topographic maps (retinotopy etc.) | yes | no (~20–40× below cortex) | differ |
| Cross-model universality (CKA) | — | 0.69–0.77, cross-family | Platonic convergence |
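
For readers unfamiliar with the CKA row, here is a minimal sketch of linear CKA, assuming the linear-kernel variant (the table does not say which variant was used). The function name `linear_cka` and the random test matrices are illustrative only. Rows are stimuli and columns are units, so two models with different widths can be compared as long as they saw the same stimuli in the same order.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_stimuli, n_units) activation matrices."""
    # Center each unit across stimuli.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
n = 500  # hypothetical number of shared stimuli

reps_a = rng.standard_normal((n, 64))

# Same geometry, rotated and rescaled: CKA should be 1.
q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
reps_b = 3.0 * reps_a @ q

# Unrelated representation with a different width: CKA should be low.
reps_c = rng.standard_normal((n, 48))

cka_same = linear_cka(reps_a, reps_b)
cka_unrelated = linear_cka(reps_a, reps_c)
print(f"rotated copy: {cka_same:.3f} | unrelated: {cka_unrelated:.3f}")
```

Linear CKA is invariant to rotations and uniform rescaling of either representation, which is what makes it usable across model families whose neuron bases have nothing in common.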
Two bonus results worth flagging:
- Steerability is predicted by encoding dimensionality (r ≈ −0.83): concepts packed into ~1 direction (numbers, colors) steer cleanly; high-dimensional concepts resist.
- A wiring-cost penalty makes a small transformer more modular (ΔQ > 0 in 4/4 seeds, with a non-monotonic sweet spot). This is consistent with the brain's modularity being partly a consequence of physical embedding constraints that transformers normally lack.
Honest nulls
The harness has an adversarial verification gate, and several appealing hypotheses died in it: "abstraction velocity predicts capability" was rejected on a clean 5-point Qwen ladder; the flashy "60× more localized in SAE features" shrank to a modest 2.4× under a gold-standard pretrained Gemma Scope SAE; cross-model feature-level universality is only partial. Reported as nulls, not spun.
Method: dense models scanned on Apple Silicon (MPS), neuroscience-style analysis pipeline (linear probes, RSA/CKA, functional connectome graphs, causal patching, SAEs, steering). Every number is traceable to a data file. Feedback welcome.