<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jskim</title>
    <description>The latest articles on DEV Community by jskim (@jskim_7a1f310dceb06a5ebb1).</description>
    <link>https://dev.to/jskim_7a1f310dceb06a5ebb1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972351%2F2a59aba5-7959-4502-8516-d5b873bbaf34.jpg</url>
      <title>DEV Community: jskim</title>
      <link>https://dev.to/jskim_7a1f310dceb06a5ebb1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jskim_7a1f310dceb06a5ebb1"/>
    <language>en</language>
    <item>
      <title>I ran an fMRI on LLMs: a concept is a direction, not a region</title>
      <dc:creator>jskim</dc:creator>
      <pubDate>Sun, 07 Jun 2026 10:20:57 +0000</pubDate>
      <link>https://dev.to/jskim_7a1f310dceb06a5ebb1/i-ran-an-fmri-on-llms-a-concept-is-a-direction-not-a-region-428b</link>
      <guid>https://dev.to/jskim_7a1f310dceb06a5ebb1/i-ran-an-fmri-on-llms-a-concept-is-a-direction-not-a-region-428b</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I've been running an "fMRI for LLMs" — capturing the full internal activations of dense open models (Qwen2.5-7B, Gemma-2-9B, Gemma-4-12B) and applying neuroscience methods to map how meaning is organized. The headline result, confirmed causally and across all three models: &lt;strong&gt;a concept is not stored in a region of neurons — it is a single direction in activation space.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Meaning lives in a &lt;em&gt;direction&lt;/em&gt;, not a &lt;em&gt;region&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;In the brain, categories live in localized regions (faces → fusiform face area). LLMs are the opposite.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed, superposed code.&lt;/strong&gt; A 10-way category linear probe decodes far above chance (Gemma-2 &lt;strong&gt;0.97&lt;/strong&gt;, Qwen &lt;strong&gt;0.80&lt;/strong&gt;), yet the "most selective" units do &lt;strong&gt;not&lt;/strong&gt; replicate across two random halves of the stimuli (overlap ≈ 0.00–0.05). There is no findable "animal neuron."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causal proof.&lt;/strong&gt; Ablating the 20 &lt;em&gt;most selective&lt;/em&gt; units changes downstream category accuracy by &lt;strong&gt;~0&lt;/strong&gt;, the same as removing 20 random units. Ablating &lt;strong&gt;one distributed direction&lt;/strong&gt;, by contrast, collapses category decoding: mean ΔAUC up to &lt;strong&gt;+0.52&lt;/strong&gt; (Qwen). This holds in all 3 models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So category is &lt;strong&gt;localized to one direction&lt;/strong&gt;, but that direction is &lt;strong&gt;spread across ~2000 of 3584 neurons&lt;/strong&gt;, and &lt;em&gt;which&lt;/em&gt; neurons carry it does not reproduce across stimulus splits. Localization is in vector space, not anatomy.&lt;/p&gt;
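
&lt;p&gt;To make the unit-versus-direction contrast concrete, here is a minimal synthetic sketch, not the harness used for the numbers above. It assumes Gaussian noise activations with a binary category signal planted along one dense random direction, and it uses a difference-of-means probe as a stand-in for the linear probe. The sizes &lt;code&gt;n&lt;/code&gt;, &lt;code&gt;d&lt;/code&gt;, &lt;code&gt;k&lt;/code&gt; and the signal strength are illustrative assumptions.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 256, 20            # stimuli, hidden units, units to ablate (all hypothetical)

# Synthetic "activations": unit-variance noise plus a class signal along one dense direction.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
y = np.tile([-1.0, 1.0], n // 2)   # balanced binary category labels
X = rng.standard_normal((n, d)) + 3.0 * y[:, None] * w

train, test = slice(0, n // 2), slice(n // 2, n)

def class_mean_diff(A, labels):
    return A[labels == 1].mean(axis=0) - A[labels == -1].mean(axis=0)

def probe_accuracy(A):
    """Refit a difference-of-means linear probe on the train half, score the test half."""
    diff = class_mean_diff(A[train], y[train])
    midpoint = A[train].mean(axis=0)
    pred = np.sign((A[test] - midpoint) @ diff)
    return float(np.mean(pred == y[test]))

diff = class_mean_diff(X[train], y[train])

# Ablation 1: zero the k units with the largest class-mean difference ("most selective").
top_units = np.argsort(-np.abs(diff))[:k]
X_units = X.copy()
X_units[:, top_units] = 0.0

# Ablation 2: project out the single estimated category direction from every activation.
u = diff / np.linalg.norm(diff)
X_dir = X - np.outer(X @ u, u)

acc_clean = probe_accuracy(X)
acc_units = probe_accuracy(X_units)
acc_dir = probe_accuracy(X_dir)
print(f"clean {acc_clean:.2f} | top-{k} units ablated {acc_units:.2f} | direction ablated {acc_dir:.2f}")
```

&lt;p&gt;In this toy setting, zeroing the top units leaves most of the signal in the remaining coordinates, while projecting out the one direction removes the class-mean difference the probe relies on. Real activations are not isotropic Gaussians, so treat this as an illustration of the logic only.&lt;/p&gt;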

&lt;h2&gt;
  
  
  2. The mechanism, nailed by intervention
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The residual stream is a shared additive bus.&lt;/strong&gt; Injecting a concept direction at N consecutive layers has the same effect as injecting N× the magnitude at one layer: ratio = &lt;strong&gt;1.00 for every N&lt;/strong&gt;. The stream literally sums contributions across layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only &lt;em&gt;relative&lt;/em&gt; magnitude codes.&lt;/strong&gt; Scaling the whole residual 0.25×–4× → &lt;strong&gt;zero&lt;/strong&gt; output change (RMSNorm divides it out). Scaling only the component &lt;em&gt;along the concept direction&lt;/em&gt; → a clean monotonic concept shift. Meaning = the projection along a direction, not the vector's length.&lt;/li&gt;
&lt;/ul&gt;
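
&lt;p&gt;The scaling result follows from how RMSNorm is defined, which a few lines of NumPy can illustrate. This is a sketch under stated assumptions: a gain-free RMSNorm with &lt;code&gt;eps=0&lt;/code&gt; (real models use a small nonzero epsilon and a learned gain), a random stand-in residual vector &lt;code&gt;x&lt;/code&gt;, and a random stand-in concept direction &lt;code&gt;u&lt;/code&gt;. The helper &lt;code&gt;scale_along&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
import numpy as np

def rms_norm(x, eps=0.0):
    """RMSNorm without a learned gain: divide by the root-mean-square of x."""
    return x / np.sqrt(np.mean(x * x) + eps)

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)             # stand-in residual vector
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                 # stand-in unit concept direction

# 1) Rescaling the whole vector is divided out by the norm.
for c in (0.25, 1.0, 4.0):
    same = np.allclose(rms_norm(c * x), rms_norm(x))
    print(f"global scale {c}: normalized output unchanged = {same}")

# 2) Rescaling only the component along u changes what the norm passes on.
def scale_along(x, u, alpha):
    proj = (x @ u) * u
    return (x - proj) + alpha * proj

readout = [abs(float(rms_norm(scale_along(x, u, a)) @ u)) for a in (0.0, 0.5, 1.0, 2.0, 4.0)]
print("projection on u after RMSNorm:", np.round(readout, 3))
```

&lt;p&gt;The first loop shows the invariance to global scale. The second shows that the post-norm projection on &lt;code&gt;u&lt;/code&gt; grows monotonically with the scale applied along &lt;code&gt;u&lt;/code&gt;, because only that component's share of the total norm changes.&lt;/p&gt;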

&lt;h2&gt;
  
  
  3. How much of the network is one concept? (depth study)
&lt;/h2&gt;

&lt;p&gt;Under strict controls (120 stimuli/category, an architecture-matched &lt;strong&gt;untrained twin&lt;/strong&gt;, word-grouped splits so no frame leaks across train/test):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A concept is essentially rank-1&lt;/strong&gt; — one direction, present at &lt;strong&gt;every depth&lt;/strong&gt; (decodable layer-span: trained 1.0 vs untrained 0.0). Narrow in width, broad in depth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concepts coexist additively.&lt;/strong&gt; One shared probe reads each category as well as a dedicated probe (retention &lt;strong&gt;1.00&lt;/strong&gt;) — they're linearly superposed and read in parallel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direction is the whole code.&lt;/strong&gt; A nonlinear MLP probe fails to beat a single linear direction (gap ≤ 0 in all models), even with 1200 stimuli. "Meaning = direction" isn't an approximation; it's the code.&lt;/li&gt;
&lt;/ul&gt;
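
&lt;p&gt;The word-grouped split named in the controls above can be sketched as follows. This is an assumed reading of that control, in which each stimulus is a target word placed in a carrier frame and every word must fall entirely on one side of the split. The function name &lt;code&gt;word_grouped_split&lt;/code&gt;, the frames, and the word list are all hypothetical.&lt;/p&gt;

```python
import random

def word_grouped_split(stimuli, test_fraction=0.25, seed=0):
    """Split (word, sentence) stimuli so every word lands entirely in train or entirely in test."""
    words = sorted({word for word, _ in stimuli})
    random.Random(seed).shuffle(words)
    n_test = max(1, round(len(words) * test_fraction))
    test_words = set(words[:n_test])
    train = [s for s in stimuli if s[0] not in test_words]
    test = [s for s in stimuli if s[0] in test_words]
    return train, test

# Hypothetical stimuli: each word embedded in several carrier frames.
frames = ["I saw a {}.", "The {} was nearby.", "Think about the {}."]
words = ["dog", "cat", "horse", "owl", "red", "blue", "green", "teal"]
stimuli = [(w, f.format(w)) for w in words for f in frames]

train, test = word_grouped_split(stimuli)
shared = {w for w, _ in train}.intersection(w for w, _ in test)
print(len(train), len(test), "shared words:", len(shared))
```

&lt;p&gt;Splitting by word rather than by stimulus means a probe cannot score well on the test set just by recognizing a word it already saw in another frame.&lt;/p&gt;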

&lt;h2&gt;
  
  
  4. Where LLMs match the brain — and where they don't
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Brain&lt;/th&gt;
&lt;th&gt;Dense LLM&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small-worldness / rich-club hubs&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes (σ up to 12.8)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;match&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network modularity Q&lt;/td&gt;
&lt;td&gt;0.30–0.50&lt;/td&gt;
&lt;td&gt;0.09–0.23, &lt;em&gt;rising each generation&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Category-selective &lt;em&gt;regions&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;yes (FFA/PPA)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;no&lt;/strong&gt; (distributed direction)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;differ&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topographic maps (retinotopy etc.)&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;no&lt;/strong&gt; (~20–40× below cortex)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;differ&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model universality (CKA)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.69–0.77, &lt;strong&gt;cross-family&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Platonic convergence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
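
&lt;p&gt;For readers unfamiliar with the CKA row, here is a minimal sketch of the standard linear CKA formula on synthetic data. It is not necessarily the exact variant behind the 0.69–0.77 figures. The matrices are random stand-ins for two models' activations on the same stimuli (rows), and the widths 32 and 48 are arbitrary.&lt;/p&gt;

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices that share rows (stimuli)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))                    # stand-in for model A activations
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))    # random orthogonal map
same_geometry = 3.0 * X @ Q                           # rotated and rescaled copy of X
unrelated = rng.standard_normal((200, 48))            # independent activations, different width

print("rotated copy:", round(linear_cka(X, same_geometry), 3))
print("unrelated:   ", round(linear_cka(X, unrelated), 3))
```

&lt;p&gt;Linear CKA is invariant to rotation and isotropic scaling, so the rotated copy scores 1, and it accepts models of different widths. Note that independent random activations do not score exactly 0 at finite sample sizes, which is worth keeping in mind as a baseline when reading cross-model values.&lt;/p&gt;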

&lt;p&gt;Two bonus results worth flagging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Steerability is predicted by encoding dimensionality&lt;/strong&gt; (r ≈ −0.83): concepts packed into ~1 direction (numbers, colors) steer cleanly; high-dimensional concepts resist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A wiring-cost penalty makes a small transformer more modular&lt;/strong&gt; (ΔQ &amp;gt; 0 in 4/4 seeds, with a non-monotonic sweet spot). This is consistent with the idea that the brain's modularity is partly a consequence of &lt;em&gt;physical&lt;/em&gt; embedding constraints that transformers normally lack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Honest nulls
&lt;/h2&gt;

&lt;p&gt;The harness has an adversarial verification gate, and several appealing hypotheses died in it: "abstraction velocity predicts capability" was &lt;strong&gt;rejected&lt;/strong&gt; on a clean 5-point Qwen ladder; the flashy "60× more localized in SAE features" shrank to a &lt;strong&gt;modest 2.4×&lt;/strong&gt; under a gold-standard pretrained Gemma Scope SAE; cross-model &lt;em&gt;feature&lt;/em&gt;-level universality is only partial. Reported as nulls, not spun.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Method: dense models scanned on Apple Silicon (MPS), neuroscience-style analysis pipeline (linear probes, RSA/CKA, functional connectome graphs, causal patching, SAEs, steering). Every number is traceable to a data file. Feedback welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>neuroscience</category>
      <category>research</category>
    </item>
  </channel>
</rss>
