Mathias Leonhardt

Posted on • Originally published at ki-mathias.de

Transformer Attention Is Hopfield's 1982 Update Rule (And What That Tells Us About LLM Memory)

Hopfield's associative-memory update from 1982, in its modern continuous form, and the scaled dot-product attention of Vaswani et al. (2017) are the same operation. One substitution turns one into the other. The 2024 Nobel Prize in Physics, awarded to Hopfield and Hinton, is the academic acknowledgement that the mathematics behind today's LLMs was already written four decades ago, in a different vocabulary.

This is a condensed write-up of the longer, interactive piece at ki-mathias.de/en/hopfield.html. Seven chapters there, five live MNIST demos. Here I focus on the four steps where the story has interesting empirical edges.


1. The identity

Modern Hopfield (Ramsauer et al., 2020) writes the update rule as

v ← X · softmax(β · Xᵀv)

where X ∈ ℝ^(N×p) is the matrix of stored patterns and β > 0 is an inverse-temperature parameter.

Scaled dot-product attention (Vaswani et al., 2017) writes

Attention(Q, K, V) = V · softmax(Kᵀ Q / √dₖ)

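Set Q = v, K = X, V = X, and β = 1/√dₖ. The two equations become identical. Not analogous. Identical. Same operation, written in two different notations. (Vaswani et al. write the row-vector form softmax(QKᵀ/√dₖ)V; the column form used here is its transpose.)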

In a Transformer, K and V are independent learned projections of the same input rather than the same matrix, and Q is yet another projection. Those are extra learnable transformations around the Hopfield core; the softmax-weighted lookup in the middle is unchanged.
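
The substitution can be checked numerically. What follows is a minimal NumPy sketch, not code from the original piece: the function names `hopfield_update` and `attention`, the random data and the sizes are illustrative assumptions, and patterns are stored as columns of X to match the equations above.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the pattern axis (axis 0)."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def hopfield_update(X, v, beta):
    """One Modern Hopfield step: v <- X softmax(beta * X^T v)."""
    return X @ softmax(beta * (X.T @ v))

def attention(Q, K, V):
    """Scaled dot-product attention, column convention: V softmax(K^T Q / sqrt(d_k))."""
    d_k = K.shape[0]
    return V @ softmax(K.T @ Q / np.sqrt(d_k))

rng = np.random.default_rng(0)
N, p = 16, 5                      # illustrative sizes, not from the article
X = rng.standard_normal((N, p))   # stored patterns as columns
v = rng.standard_normal(N)        # query / current state

out_hopfield = hopfield_update(X, v, beta=1.0 / np.sqrt(N))
out_attention = attention(Q=v, K=X, V=X)   # Q = v, K = V = X, so d_k = N
identity_holds = np.allclose(out_hopfield, out_attention)
```

Once Q, K and V come from three different learned projections, the two calls no longer receive the same matrices, but the softmax-weighted lookup inside `attention` is the same code path.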

Krotov & Hopfield (2016) had already worked out the dense associative memory generalisation that underlies this form's exponential storage capacity. Vaswani et al. (2017) reached the same equation by iterating on machine-translation benchmarks. Ramsauer et al. (2020) noticed they were the same. The independent rediscovery is itself diagnostic: it suggests the structure is less a design choice than a consequence of the requirements.


2. Why classical Hopfield breaks on MNIST (and why that's not a bug)

The original 1982 recall rule is

vᵢ ← sign(Σⱼ Wᵢⱼ · vⱼ)        # W = (1/N) Σₘ ξₘ ξₘᵀ,  Wᵢᵢ = 0   (index m runs over stored patterns)

This is the Hebb construction. Store ten MNIST digits, query each with 15 % pixel noise, observe what comes back.

Result: all ten queries collapse into the same end-state — an image that isn't visually any of the stored digits. Mean pairwise similarity between the ten "recalls": 0.99.

This is fully explained by the spectrum of W_Hebb. The eigenvalues are roughly

λ₁ ≈ 6.65,   λ₂ ≈ 0.65,   λ₃ ≈ 0.48,   ...

A factor-of-ten gap between λ₁ and the rest. The top eigenvector is essentially ξ̄ = (1/p) Σₘ ξₘ, the per-pixel mean — cosine 0.9999.
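
The same mechanism can be reproduced without downloading MNIST. The sketch below uses a synthetic stand-in, which is an assumption of this example and not the article's data: every pattern is a shared, mostly-background template with a few pixels flipped. It builds the Hebb matrix, compares the top eigenvector with the mean pattern, and then runs noisy recalls to see whether the end states coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 784, 10                                   # pixels, stored patterns

# Synthetic stand-in for MNIST (an assumption): each pattern is a shared
# +/-1 template with 10 % of its pixels flipped, so the patterns are
# strongly correlated and far from zero-mean.
template = np.where(rng.random(N) < 0.85, -1.0, 1.0)
flips = rng.random((N, p)) < 0.10
X = np.where(flips, -template[:, None], template[:, None])

# Hebb construction with zero diagonal.
W = X @ X.T / N
np.fill_diagonal(W, 0.0)

# Spectrum: compare the top eigenvector with the per-pixel mean pattern.
eigvals, eigvecs = np.linalg.eigh(W)             # eigenvalues in ascending order
top = eigvecs[:, -1]
mean_pattern = X.mean(axis=1)
cosine = abs(top @ mean_pattern) / np.linalg.norm(mean_pattern)

def recall(W, v, sweeps=20):
    """Synchronous sign updates until a fixed point or the sweep limit."""
    for _ in range(sweeps):
        v_new = np.where(W @ v >= 0, 1.0, -1.0)
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v

# Query each stored pattern with 15 % pixel noise and compare the end states.
noise = rng.random((N, p)) < 0.15
queries = np.where(noise, -X, X)
ends = np.stack([recall(W, queries[:, k]) for k in range(p)], axis=1)
similarity = ends.T @ ends / N                   # pairwise overlaps in [-1, 1]
mean_similarity = similarity[np.triu_indices(p, k=1)].mean()
```

Printing `eigvals[-3:]`, `cosine` and `mean_similarity` shows the dominant eigenvalue, its alignment with the mean, and how close the ten end states are to one another. The exact numbers depend on the synthetic data and will not match the MNIST figures quoted here.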

The Hebb rule is provably correct only under two conditions:

  1. Pairwise orthogonality of stored patterns.
  2. Zero-mean patterns.

MNIST digits violate both: pairwise inner products are 400–600 out of 784 (a normalised overlap of roughly 0.5–0.77, i.e. about 75–88 % of pixels agree), and mean pixel values are −0.63 to −0.90 (much more "background" than "ink"). The failure is therefore not an implementation bug; it's the construction operating outside its range of validity. Centring the patterns kills the bias sink but reveals the next defect: the v → −v symmetry of E(v) = −½vᵀWv causes recalls to land on negations of stored patterns.
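
Why correlated patterns collapse can be read off the local field at a stored pattern. The decomposition below is the standard signal-plus-crosstalk calculation, added here as a worked step. It assumes ±1 patterns and the zero-diagonal Hebb matrix defined above.

```latex
\begin{aligned}
W &= \frac{1}{N}\sum_{m=1}^{p} \xi_m \xi_m^{\top} - \frac{p}{N}\, I
  \qquad \text{(zero diagonal, since } \xi_{m,i}^{2} = 1\text{)} \\[4pt]
(W \xi_k)_i &=
  \underbrace{\Bigl(1 - \frac{p}{N}\Bigr)\, \xi_{k,i}}_{\text{signal}}
  \;+\;
  \underbrace{\sum_{m \neq k} \frac{\xi_m^{\top} \xi_k}{N}\, \xi_{m,i}}_{\text{crosstalk}}
\end{aligned}
```

For orthogonal patterns every crosstalk coefficient vanishes and the sign of the field is the stored pixel. With p = 10 and the overlaps quoted above (400/784 ≈ 0.51 to 600/784 ≈ 0.77), the nine crosstalk terms carry a combined weight of roughly 4.6 to 6.9 against a signal weight of about 0.99. The sign of the field is then set by a weighted majority vote of the other patterns, which is nearly the same for every k.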

The didactic point: a learning rule is correct or incorrect relative to a data geometry. "Hebb is broken" is not a sentence. "Hebb is broken on MNIST" is.


3. The pseudoinverse fix and its capacity cliff

The Personnaz–Guyon–Dreyfus construction (1985) keeps the same recall machinery but builds W differently:

W_PI = X (XᵀX)⁻¹ Xᵀ

The factor (XᵀX)⁻¹ is exactly what's missing in Hebb — the inverse of the pattern-pattern Gram matrix. It removes correlations between stored patterns before the matrix becomes the energy landscape. For orthogonal patterns the two rules coincide; for correlated ones, only W_PI carries the algebraic guarantee

W_PI · ξₘ = ξₘ              # every stored pattern ξₘ is a fixed point (eigenvector with eigenvalue 1)
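
A quick numerical check of this guarantee, as a sketch on synthetic correlated patterns. The template-plus-flips data and all sizes are assumptions chosen for illustration, not the MNIST setup.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 20                                    # illustrative sizes

# Correlated +/-1 patterns: a shared template with 20 % of pixels flipped.
template = rng.choice([-1.0, 1.0], size=N)
flips = rng.random((N, p)) < 0.20
X = np.where(flips, -template[:, None], template[:, None])

W_hebb = X @ X.T / N
# W_PI = X (X^T X)^{-1} X^T, computed with a linear solve instead of an
# explicit inverse of the Gram matrix.
W_pi = X @ np.linalg.solve(X.T @ X, X.T)

# Pseudoinverse: every stored pattern is reproduced exactly (eigenvalue 1).
pi_is_fixed = np.allclose(W_pi @ X, X)
# Hebb: count the stored patterns that survive one sign update unchanged.
hebb_fixed_count = int(np.all(np.sign(W_hebb @ X) == X, axis=0).sum())
```

The check only covers the algebraic fixed-point property; it says nothing about how much noise each fixed point tolerates.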

Empirical capacity on MNIST, p stored patterns, 10 % pixel noise, fraction of queries that recover the original:

p (stored patterns) | Hebb | Pseudoinverse
10                  | 0 %  | 100 %
100                 | 0 %  | 100 %
150                 | 0 %  | 97 %
200                 | 0 %  | 32 %
250                 | 0 %  | 1 %
300                 | 0 %  | 0 %

A sharp phase transition between p ≈ 150 and p ≈ 250, far below the algebraic ceiling p = N = 784, at which W_PI becomes the identity and beyond which the Gram matrix is singular. The identity W_PI ξₘ = ξₘ holds throughout, but the basin of attraction around each fixed point shrinks as the patterns crowd one another, and 10 % noise overshoots the basin once p exceeds ~150.
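
The shape of such a sweep can be sketched as follows. This is an illustrative harness, not the code behind the table: it assumes random uncorrelated ±1 patterns and N = 200 instead of MNIST, so the location of the cliff will differ, and `recovery_rate` is a name introduced here.

```python
import numpy as np

def recall(W, v, sweeps=20):
    """Synchronous sign updates until a fixed point or the sweep limit."""
    for _ in range(sweeps):
        v_new = np.where(W @ v >= 0, 1.0, -1.0)
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v

def recovery_rate(N, p, noise=0.10, seed=0):
    """Fraction of noisy queries that return exactly to their stored pattern."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(N, p))      # assumption: random patterns
    W = X @ np.linalg.solve(X.T @ X, X.T)         # pseudoinverse rule
    hits = 0
    for k in range(p):
        flip = rng.random(N) < noise
        query = np.where(flip, -X[:, k], X[:, k])
        hits += np.array_equal(recall(W, query), X[:, k])
    return hits / p

N = 200
rates = {p: recovery_rate(N, p) for p in (10, 50, 100, 150, 180)}
```

As p approaches N the projection approaches the identity, so the noise in the query is passed through instead of being removed. That is the basin shrinkage described above, seen from the linear-algebra side.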

Side note for readers who came in via the Eigenvalues post: the operator X(XᵀX)⁻¹Xᵀ is exactly the hat matrix of ridge regression with λ = 0, i.e. ordinary least squares. The Hopfield update with this W is therefore a non-linear filter built on top of an ordinary projection onto the span of stored patterns. The capacity cliff is the cliff of unregularised projection at near-singular Gram.
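
The hat-matrix reading is easy to verify. In this sketch `hat_matrix` and the value λ = 10 are hypothetical choices for illustration; the regularised variant is shown only to contrast it with the λ = 0 case used in this section.

```python
import numpy as np

def hat_matrix(X, lam=0.0):
    """Ridge hat matrix X (X^T X + lam I)^{-1} X^T; lam = 0 gives W_PI."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

rng = np.random.default_rng(2)
X = rng.choice([-1.0, 1.0], size=(100, 30))       # illustrative random patterns

H0 = hat_matrix(X)                                 # unregularised projection
is_projection = np.allclose(H0 @ H0, H0)           # idempotent: H0 @ H0 == H0
matches_pinv = np.allclose(H0, X @ np.linalg.pinv(X))

# With lam > 0 the non-zero eigenvalues become s^2 / (s^2 + lam) < 1, so
# stored patterns are shrunk slightly instead of being reproduced exactly.
H1 = hat_matrix(X, lam=10.0)
top_eig_ridge = np.linalg.eigvalsh(H1).max()
```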


4. Modern Hopfield = Attention (the move that fixes capacity)

Stop iterating sign(Wv). Replace it with the soft, input-dependent update

v ← X · softmax(β · Xᵀv)

Several structural changes happen at once:

Component   | Classical (1982/1985)   | Modern (Ramsauer 2020)
Operator    | fixed W ∈ ℝ^(N×N)       | none; direct softmax lookup on X
Update      | linear in v, then sign  | non-linear (softmax in v)
Energy      | quadratic, −½ vᵀWv      | −lse(β, Xᵀv) + ½‖v‖² (log-sum-exp; Lyapunov)
Convergence | iterative, many sweeps  | one step (for sufficiently large β)
Capacity    | dynamically ≪ N         | Ω(exp(N)), exponential in N
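
The Lyapunov property of the modern energy can be checked directly. The sketch below assumes the energy of Ramsauer et al. with additive constants dropped, in which the log-sum-exp term enters with a negative sign; the data, β and the number of steps are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def energy(X, v, beta):
    """E(v) = -(1/beta) * logsumexp(beta * X^T v) + 0.5 * ||v||^2, constants dropped."""
    z = beta * (X.T @ v)
    lse = (z.max() + np.log(np.exp(z - z.max()).sum())) / beta
    return -lse + 0.5 * (v @ v)

def update(X, v, beta):
    """v <- X softmax(beta * X^T v)."""
    return X @ softmax(beta * (X.T @ v))

rng = np.random.default_rng(3)
N, p, beta = 32, 12, 0.5
X = rng.standard_normal((N, p))
v = rng.standard_normal(N)

energies = [energy(X, v, beta)]
for _ in range(10):
    v = update(X, v, beta)
    energies.append(energy(X, v, beta))

# The energy should never increase along the trajectory (up to rounding).
monotone = all(b <= a + 1e-9 for a, b in zip(energies, energies[1:]))
```

The update is the concave-convex procedure applied to this energy, which is why each step can only lower it.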

The exponential capacity is the practical reason this works for LLMs at all: with N = 768 (a typical embedding dim), the theoretical capacity for well-separated patterns is far larger than any context window, so memory capacity is not the binding constraint. With N = 784 (MNIST), the classical pseudoinverse rule breaks down beyond p ≈ 150 on real data.
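
A small experiment makes the capacity contrast tangible. It is a sketch under stated assumptions: random ±1 patterns, N = 64, p = 1000 and β = 1 are illustrative values, and `retrieve` is a name introduced here. With p > N the pseudoinverse rule is not even defined, because the p × p Gram matrix has rank at most N.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, beta = 64, 1000, 1.0          # p is far larger than N; values are illustrative
X = rng.choice([-1.0, 1.0], size=(N, p))

def retrieve(X, v, beta):
    """One Modern Hopfield step followed by a sign readout."""
    z = beta * (X.T @ v)
    w = np.exp(z - z.max())
    w /= w.sum()
    return np.sign(X @ w)

hits = 0
queries = rng.choice(p, size=100, replace=False)
for k in queries:
    flip = rng.random(N) < 0.10                   # 10 % pixel noise
    noisy = np.where(flip, -X[:, k], X[:, k])
    hits += np.array_equal(retrieve(X, noisy, beta), X[:, k])
recovery = hits / len(queries)
```

Random patterns are close to the best case for separation; correlated real data needs a larger β or a learned embedding to reach the same behaviour.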

And the parameter β is interpretable. At small β, the softmax is near-uniform and the recall is a soft average of all stored patterns. At large β, it concentrates on the single best match, and Modern Hopfield converges to 1-nearest-neighbour. Ramsauer's analysis of Transformer heads shows early layers running at low β (global averaging) and deeper layers running at high β (sharp lookup on a single token). The familiar "attention is mysterious" complaint dissolves into a continuous interpolation between two known operations.
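
Both limits are easy to see in a toy setting. The sketch below uses random ±1 patterns and two hand-picked β values, all of which are assumptions for illustration.

```python
import numpy as np

def hopfield_update(X, v, beta):
    """Return the updated state and the softmax weights over stored patterns."""
    z = beta * (X.T @ v)
    w = np.exp(z - z.max())
    w /= w.sum()
    return X @ w, w

rng = np.random.default_rng(5)
N, p = 64, 8
X = rng.choice([-1.0, 1.0], size=(N, p))
target = 3
v = X[:, target].copy()
v[:6] *= -1                                        # corrupt six pixels of pattern 3

soft, w_soft = hopfield_update(X, v, beta=1e-4)    # near-uniform weights
sharp, w_sharp = hopfield_update(X, v, beta=5.0)   # near one-hot weights

close_to_mean = np.allclose(soft, X.mean(axis=1), atol=0.02)
close_to_nearest = np.allclose(sharp, X[:, target], atol=1e-6)
```

Intermediate β values give weights spread over a few similar patterns, which is the regime between averaging and lookup.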


5. The surprise: same learning rule, different geometry, totally different outcome

The interesting finding from Negri, Tudisco, Lucibello et al. (2024), "Random Features Hopfield Networks generalize retrieval to previously unseen examples", is not "we made Hopfield better." It's the opposite:

The exact same learning rule that scores 65 % accuracy on MNIST (i.e., barely matches 1-NN, no real generalisation) achieves perfect generalisation — magnetisation 1.0 on unseen test patterns — when the data is built as a sparse mixture of a small set of random features.

Setup: let F ∈ {-1,+1}^(N×D) be a random feature matrix. Each pattern is ξ = sign(F · c) with c an L-sparse binary coefficient vector. Three sets share the same F:

  • Train (p stored patterns)
  • Features (the D feature columns of F — never stored)
  • Test (new patterns from the same distribution, never stored)

Sweep α = p/N and measure the magnetisation of each set. Three phases appear in order:

  • Storage (α small): only train patterns are stable attractors.
  • Learning (α medium): train magnetisation drops, features magnetisation rises — the network has begun to recognise the components it was implicitly trained on.
  • Generalisation (α large): test patterns become attractors too — without ever having been stored.

With the pseudoinverse rule this last transition is a hard jump to magnetisation 1.0, and the math explains why: once the trained patterns span enough of the feature mixtures, every feature mixture becomes an eigenvector of W_PI with eigenvalue 1 — by the same identity that made stored patterns fixed points.
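
The setup can be sketched in a few lines. This is not the authors' code: N, D, L and p are illustrative, L is chosen odd so that sign(F · c) is never zero, the projector is built from an SVD so that duplicate training patterns do no harm, and magnetisation is taken here to mean the overlap between a pattern and the state reached when the dynamics start from it. Whether and where the three phases appear depends on these choices and is not asserted by the sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, L, p = 200, 40, 3, 120        # illustrative values; L odd avoids sign(0)

F = rng.choice([-1.0, 1.0], size=(N, D))          # random feature matrix

def sample_patterns(n):
    """n patterns xi = sign(F c), each c with exactly L ones."""
    C = np.zeros((D, n))
    for j in range(n):
        C[rng.choice(D, size=L, replace=False), j] = 1.0
    return np.sign(F @ C)

def projection_matrix(X, tol=1e-8):
    """Orthogonal projector onto span(X); equals W_PI when X has full column rank."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > tol * s.max()]
    return U @ U.T

def magnetisation(W, patterns, sweeps=20):
    """Mean overlap between each pattern and the state reached from it."""
    V = patterns.copy()
    for _ in range(sweeps):
        V = np.where(W @ V >= 0, 1.0, -1.0)
    return float(np.mean(np.sum(V * patterns, axis=0) / N))

train, test = sample_patterns(p), sample_patterns(200)
W = projection_matrix(train)
m_train = magnetisation(W, train)      # stored patterns
m_features = magnetisation(W, F)       # feature columns, never stored
m_test = magnetisation(W, test)        # fresh mixtures, never stored
```

Repeating the last five lines for a range of p values and plotting the three magnetisations against α = p/N is the sweep described above.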

The takeaway is not subtle: generalisation is a property of the data geometry, not of the learning rule. A textbook claim that "this learning rule generalises better" is well-typed only relative to a class of data. The reason language models generalise so well isn't that the attention mechanism has a special "ability" — it's that natural language already has the sparse compositional structure that makes Hopfield-style retrieval transfer beyond the training set. Words and constructions are a finite set of components; sentences are sparse mixtures. Hopfield-friendly by accident of biology.


6. What runs on this mathematics today

A non-exhaustive list, with the empirical claim each item is making:

  • Every Transformer attention layer. Modern Hopfield is what's there. Plausibly one of the most-executed mathematical operations in global compute, by raw volume.
  • MHNfs (Klambauer/Hochreiter, Linz) — few-shot drug discovery, 100k+ context molecules as memory, SOTA on FS-Mol.
  • DeepRC (Widrich et al., NeurIPS 2020) — multi-instance learning over ~10⁶ immune-repertoire sequences, used for SARS-CoV-2 classification.
  • Memristor Hopfield chips (HP Labs Nature Electronics 2020; Peking U. Nature Comms 2024) — analogue MAX-CUT solvers, ~4 orders of magnitude energy advantage over digital. The Peking paper proves mathematical equivalence to a Hopfield attractor network, not just analogy.
  • PyTorch drop-in: ml-jku/hopfield-layers, with Hopfield, HopfieldPooling, HopfieldLayer modules as swap-in replacements for LSTM / pooling / attention.

7. What this does not say

  • It does not claim that every modern ML algorithm is "secretly a Hopfield network." The identity is precise between Modern Hopfield and scaled dot-product attention. Diffusion models, state-space models, ConvNets have different mathematical structures.
  • The generalisation result in section 5 is for synthetic feature-mixture data, not for MNIST. To transfer it to real data you first need to extract a feature basis (PCA, dictionary learning, learned embeddings). In other words, you have to engineer the right kind of sparse architecture; the data won't give it to you for free.
  • Classical Hopfield is not "back" as a general-purpose ML tool. ConvNets/diffusion/Transformers are better for almost every benchmark task. The Hopfield reading earns its keep when memory is the actual problem — few-shot, multi-instance, episodic recall, combinatorial optimisation on dedicated hardware.

Further reading

If you spot a mistake or a sharper statement of any of the above, the source repo is open — corrections welcome.
