The update rule of the modern Hopfield network (Ramsauer et al., 2020), a direct descendant of Hopfield's 1982 associative memory, and the scaled dot-product attention of Vaswani et al. (2017) are the same operation. One substitution turns one into the other. The 2024 Nobel Prize in Physics, awarded to Hopfield and Hinton, is the academic acknowledgement that the mathematics behind today's LLMs was already written four decades ago, in a different vocabulary.
This is a condensed write-up of the longer, interactive piece at ki-mathias.de/en/hopfield.html. Seven chapters there, five live MNIST demos. Here I focus on the four steps where the story has interesting empirical edges.
1. The identity
Modern Hopfield (Ramsauer et al., 2020) writes the update rule as
v ← X · softmax(β · Xᵀv)
where X ∈ ℝ^(N×p) is the matrix of stored patterns and β > 0 is an inverse-temperature parameter.
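Scaled dot-product attention (Vaswani et al., 2017), transposed into the same column-vector convention (the paper itself writes softmax(Q Kᵀ / √d_k) · V with queries as rows), reads
Attention(Q, K, V) = V · softmax(Kᵀ Q / √d_k)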
Set Q = v, K = X, V = X, and β = 1/√d_k. The two equations become identical. Not analogous. Identical. Same operation, written in two different notations.
In a Transformer, K and V are independent learned projections of the same input rather than the same matrix, and Q is yet another projection. Those are extra learnable transformations around the Hopfield core; the softmax-weighted lookup in the middle is unchanged.
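The substitution can be checked numerically. The following is a minimal sketch, assuming a single query column and random Gaussian patterns; the function names `hopfield_update` and `attention` are my own labels for illustration, not taken from either paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_update(X, v, beta):
    """One Modern Hopfield step: v <- X softmax(beta * X^T v)."""
    return X @ softmax(beta * (X.T @ v))

def attention(Q, K, V):
    """Scaled dot-product attention for one query column Q,
    in the column convention: V softmax(K^T Q / sqrt(d_k))."""
    d_k = K.shape[0]                 # key dimension = length of each column
    return V @ softmax(K.T @ Q / np.sqrt(d_k))

rng = np.random.default_rng(0)
N, p = 16, 5                         # pattern dimension, number of stored patterns
X = rng.standard_normal((N, p))      # columns are stored patterns
v = rng.standard_normal(N)           # query / state vector

# Substitution from the text: Q = v, K = X, V = X, beta = 1/sqrt(d_k).
out_hopfield = hopfield_update(X, v, beta=1.0 / np.sqrt(N))
out_attention = attention(Q=v, K=X, V=X)
print(np.allclose(out_hopfield, out_attention))   # True
```

Both functions evaluate the same expression, so the outputs agree to floating-point precision. The learned projections of a real Transformer would sit in front of `Q`, `K` and `V`; they are left out here on purpose.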
Krotov & Hopfield (2016) had already worked out the dense associative memory generalisation, in which steeper interaction functions lift storage capacity far beyond the classical linear bound; with an exponential interaction this becomes the exponential capacity of the form above. Vaswani et al. (2017) reached the same equation by iterating on machine-translation benchmarks. Ramsauer et al. (2020) noticed they were the same. The independent rediscovery is itself diagnostic: the structure isn't a design choice, it's a forced consequence of the requirements.
2. Why classical Hopfield breaks on MNIST (and why that's not a bug)
The original 1982 recall rule is
vᵢ ← sign(Σⱼ Wᵢⱼ · vⱼ) # W = (1/N) Σₘ ξₘ ξₘᵀ, Wᵢᵢ = 0 (index m runs over stored patterns)
This is the Hebb construction. Store ten MNIST digits, query each with 15 % pixel noise, observe what comes back.
Result: all ten queries collapse into the same end-state — an image that isn't visually any of the stored digits. Mean pairwise similarity between the ten "recalls": 0.99.
This is fully explained by the spectrum of W_Hebb. The eigenvalues are roughly
λ₁ ≈ 6.65, λ₂ ≈ 0.65, λ₃ ≈ 0.48, ...
A factor-of-ten gap between λ₁ and the rest. The top eigenvector is essentially ξ̄ = (1/p) Σₘ ξₘ, the per-pixel mean — cosine 0.9999.
The Hebb rule is provably correct only under two conditions:
- Pairwise orthogonality of stored patterns.
- Zero-mean patterns.
MNIST digits violate both: pairwise inner products are 400–600 out of 784 (a normalised overlap of roughly 0.5 to 0.77, i.e. about 75–88 % of pixels agree), and mean pixel values are −0.63 to −0.90 (much more "background" than "ink"). The failure is therefore not an implementation bug; it's the construction operating outside its range of validity. Centring the patterns kills the bias sink but reveals the next defect: the v → −v symmetry of E(v) = -½vᵀWv causes recalls to land on negations of stored patterns.
The didactic point: a learning rule is correct or incorrect relative to a data geometry. "Hebb is broken" is not a sentence. "Hebb is broken on MNIST" is.
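The bias sink can be reproduced without MNIST. The sketch below is an illustration under stated assumptions: the patterns are synthetic ±1 vectors with about 15 % "ink" (a stand-in for MNIST, not the real data), and the recall loop updates all neurons synchronously, which is a simplification of the per-neuron rule above. The numbers it prints will therefore differ from the MNIST figures quoted in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 784, 10                               # pixels, stored patterns

# Assumption: biased +/-1 patterns (about 15 % "ink") instead of real digits.
X = np.where(rng.random((N, p)) < 0.15, 1.0, -1.0)   # columns are patterns

# Hebb construction: W = (1/N) sum_m xi_m xi_m^T with zero diagonal.
W = (X @ X.T) / N
np.fill_diagonal(W, 0.0)

# Spectrum: compare the top eigenvector with the per-pixel mean pattern.
eigvals, eigvecs = np.linalg.eigh(W)         # eigenvalues in ascending order
top_vec = eigvecs[:, -1]                     # unit-norm top eigenvector
mean_pattern = X.mean(axis=1)
cosine = abs(top_vec @ mean_pattern) / np.linalg.norm(mean_pattern)

def recall(W, v, sweeps=20):
    """Synchronous sign(W v) updates until a fixed point or `sweeps` steps."""
    for _ in range(sweeps):
        v_new = np.where(W @ v >= 0, 1.0, -1.0)
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v

# Query every stored pattern with roughly 15 % of its pixels flipped.
recalls = []
for m in range(p):
    flip = rng.random(N) < 0.15
    recalls.append(recall(W, np.where(flip, -X[:, m], X[:, m])))
R = np.stack(recalls, axis=1)                # N x p matrix of end states

overlap = (R.T @ R) / N                      # pairwise similarity of recalls
mean_similarity = overlap[~np.eye(p, dtype=bool)].mean()
print("top eigenvalues:", np.round(eigvals[::-1][:3], 2))
print("cosine(top eigenvector, mean pattern):", round(float(cosine), 4))
print("mean pairwise similarity of recalls:", round(float(mean_similarity), 3))
```

Even on this toy data the top eigenvector lines up with the mean pattern and the ten recalls end up nearly identical, which is the same mechanism as on MNIST: the shared bias dominates the energy landscape.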
3. The pseudoinverse fix and its capacity cliff
The Personnaz–Guyon–Dreyfus construction (1985) keeps the same recall machinery but builds W differently:
W_PI = X (XᵀX)⁻¹ Xᵀ
The factor (XᵀX)⁻¹ is exactly what's missing in Hebb — the inverse of the pattern-pattern Gram matrix. It removes correlations between stored patterns before the matrix becomes the energy landscape. For orthogonal patterns the two rules coincide; for correlated ones, only W_PI carries the algebraic guarantee
W_PI · ξₚ = ξₚ # every stored pattern is a fixed point with eigenvalue 1
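A quick numerical check of that guarantee, as a sketch: the patterns here are synthetic and correlated (each one flips 20 % of a shared random template, which is my assumption, not the MNIST setup), and the Hebb matrix keeps its diagonal so the two constructions are compared like for like.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 784, 50

# Assumption: correlated +/-1 patterns, each flipping 20 % of a shared template.
template = np.where(rng.random(N) < 0.5, 1.0, -1.0)
flips = rng.random((N, p)) < 0.2
X = np.where(flips, -template[:, None], template[:, None])   # N x p

G = X.T @ X                                  # p x p Gram matrix
W_pi = X @ np.linalg.solve(G, X.T)           # X (X^T X)^{-1} X^T, no explicit inverse
W_hebb = (X @ X.T) / N                       # Hebb, diagonal kept for comparison

# Largest deviation from the fixed-point identity W xi = xi over all patterns.
err_pi = np.abs(W_pi @ X - X).max()
err_hebb = np.abs(W_hebb @ X - X).max()
print(f"pseudoinverse: max |W xi - xi| = {err_pi:.2e}")
print(f"Hebb:          max |W xi - xi| = {err_hebb:.2e}")
```

Using `np.linalg.solve` on the Gram matrix instead of forming the inverse is a numerical-hygiene choice; it matters more as p approaches N and the Gram matrix becomes ill-conditioned.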
Empirical capacity on MNIST, p stored patterns, 10 % pixel noise, fraction of queries that recover the original:
| p | Hebb | Pseudoinverse |
|---|---|---|
| 10 | 0 % | 100 % |
| 100 | 0 % | 100 % |
| 150 | 0 % | 97 % |
| 200 | 0 % | 32 % |
| 250 | 0 % | 1 % |
| 300 | 0 % | 0 % |
A sharp phase transition between p ≈ 150 and p ≈ 250. Far below the algebraic ceiling p = N = 784, where the Gram matrix becomes singular. The identity W_PI ξₚ = ξₚ holds throughout — but the basin of attraction around each fixed point shrinks as the patterns crowd one another, and 10 % noise overshoots the basin once p exceeds ~150.
Side note for readers who came in via the Eigenvalues post: the operator X(XᵀX)⁻¹Xᵀ is exactly the hat matrix of ridge regression with λ = 0, i.e. of ordinary least squares. The Hopfield update with this W is therefore a non-linear filter built on top of an ordinary orthogonal projection onto the span of stored patterns. The capacity cliff is the cliff of unregularised projection at near-singular Gram.
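The projection reading is easy to verify. This is a small sketch with random ±1 patterns and a hypothetical helper `hat_matrix`; the sizes are arbitrary and chosen only to keep it fast.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64

def hat_matrix(X, lam=0.0):
    """Ridge hat matrix X (X^T X + lam I)^{-1} X^T; lam = 0 gives the projector."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

X = rng.choice([-1.0, 1.0], size=(N, 16))    # 16 random patterns as columns
P = hat_matrix(X)

is_symmetric = np.allclose(P, P.T)           # orthogonal projectors are symmetric
is_idempotent = np.allclose(P @ P, P)        # projecting twice changes nothing
matches_pinv = np.allclose(P, X @ np.linalg.pinv(X))
ridge_gap = np.abs(hat_matrix(X, lam=1e-6) - P).max()   # ridge -> projector as lam -> 0
print(is_symmetric, is_idempotent, matches_pinv, f"{ridge_gap:.1e}")

# Conditioning of the Gram matrix as p approaches N.
for p in (16, 32, 48, 60):
    Xp = rng.choice([-1.0, 1.0], size=(N, p))
    print(p, f"cond(X^T X) = {np.linalg.cond(Xp.T @ Xp):.1f}")
```

The condition number typically climbs steeply as p approaches N, which is the linear-algebra face of the shrinking basins: small perturbations of the query get amplified more and more by the inverse Gram factor.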
4. Modern Hopfield = Attention (the move that fixes capacity)
Stop iterating sign(Wv). Replace it with the soft, input-dependent
v ← X · softmax(β · Xᵀv)
Several structural changes happen at once:
| Component | Classical (1982/1985) | Modern (Ramsauer 2020) |
|---|---|---|
| Operator | fixed W ∈ ℝ^(N×N) | none: direct softmax lookup on X |
| Update | linear in v + sign | non-linear (softmax in v) |
| Energy | quadratic -½ vᵀWv | negative log-sum-exp + ½‖v‖² (Lyapunov) |
| Convergence | iterative, many sweeps | one step (for sufficiently large β) |
| Capacity | dynamically ≪ N | Ω(exp(N)), exponential in N |
The exponential capacity is the practical reason this works for LLMs at all: with N = 768 (a typical embedding dim), the capacity bound for well-separated patterns grows exponentially in N, far beyond any context length used in practice. With N = 784 (MNIST), the classical pseudoinverse rule breaks down beyond p ≈ 150 on real data.
And the parameter β is interpretable. At small β, the softmax is near-uniform and the recall is a soft average of all stored patterns. At large β, it concentrates on the single best match — Modern Hopfield converges to 1-nearest-neighbour. Ramsauer's analysis of Transformer heads shows early layers running at low β (global averaging) and deeper layers running at high β (sharp lookup on a single token). The classical "attention is mysterious" complaint dissolves into a continuous interpolation between two known operations.
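The two limits of β can be seen directly in a few lines. A minimal sketch, assuming random ±1 patterns and a query that is a lightly corrupted copy of one of them; the function name `modern_hopfield_step` and the specific β values are illustrative choices.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def modern_hopfield_step(X, v, beta):
    """One update v <- X softmax(beta * X^T v)."""
    return X @ softmax(beta * (X.T @ v))

rng = np.random.default_rng(3)
N, p = 64, 8
X = rng.choice([-1.0, 1.0], size=(N, p))     # columns are stored patterns

query = X[:, 2].copy()
query[:6] *= -1                              # corrupt 6 of the 64 entries

low = modern_hopfield_step(X, query, beta=1e-4)    # near-uniform softmax
high = modern_hopfield_step(X, query, beta=10.0)   # sharply peaked softmax
nearest = X[:, np.argmax(X.T @ query)]             # 1-nearest neighbour by dot product

print("low beta  ~ mean of patterns:", np.allclose(low, X.mean(axis=1), atol=1e-2))
print("high beta ~ nearest pattern: ", np.allclose(high, nearest))
```

At β = 1e-4 the output is within rounding of the plain average of all stored patterns; at β = 10 it is the stored pattern with the largest dot product, recovered in a single step.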
5. The surprise: same learning rule, different geometry, totally different outcome
The interesting finding from Negri, Tudisco, Lucibello et al. 2024 — Random Features Hopfield Networks generalize retrieval to previously unseen examples — is not "we made Hopfield better." It's the opposite:
The exact same learning rule that scores 65 % accuracy on MNIST (i.e., barely matches 1-NN, no real generalisation) achieves perfect generalisation — magnetisation 1.0 on unseen test patterns — when the data is built as a sparse mixture of a small set of random features.
Setup: let F ∈ {-1,+1}^(N×D) be a random feature matrix. Each pattern is ξ = sign(F · c) with c an L-sparse binary coefficient vector. Three sets share the same F:
- Train (p stored patterns)
- Features (the D feature columns of F — never stored)
- Test (new patterns from the same distribution, never stored)
Sweep α = p/N and measure the magnetisation of each set. Three phases appear in order:
- Storage (α small): only train patterns are stable attractors.
- Learning (α medium): train magnetisation drops, features magnetisation rises — the network has begun to recognise the components it was implicitly trained on.
- Generalisation (α large): test patterns become attractors too — without ever having been stored.
With the pseudoinverse rule this last transition is a hard jump to magnetisation 1.0, and the math explains why: once the trained patterns span enough of the feature mixtures, every feature mixture becomes an eigenvector of W_PI with eigenvalue 1 — by the same identity that made stored patterns fixed points.
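The setup can be sketched in a few lines. Everything concrete here is an assumption for illustration: the sizes N, D, L and p, the choice of an odd L (so that sign never sees a zero), and a one-step magnetisation instead of running the dynamics to convergence. The sketch guarantees only the train-set identity; whether the test magnetisation reaches 1.0 depends on α, D and L, and the code makes no claim about it.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, L = 200, 10, 3                 # neurons, features, active features per pattern

F = rng.choice([-1.0, 1.0], size=(N, D))     # random feature matrix

def sample_patterns(n):
    """Draw n patterns xi = sign(F c), c binary with exactly L ones (L odd)."""
    C = np.zeros((D, n))
    for k in range(n):
        C[rng.choice(D, size=L, replace=False), k] = 1.0
    return np.sign(F @ C)

def magnetisation(W, S):
    """Mean overlap between each column of S and sign(W S) after one update."""
    S_new = np.where(W @ S >= 0, 1.0, -1.0)
    return float((S_new * S).sum(axis=0).mean() / S.shape[0])

train = sample_patterns(40)          # stored patterns
test = sample_patterns(40)           # same distribution, never stored

# Pseudoinverse rule as a projector onto span(train); the SVD route stays
# well-defined even if the random draw repeats a pattern.
U, s, _ = np.linalg.svd(train, full_matrices=False)
U = U[:, s > 1e-8 * s[0]]
W = U @ U.T

m_train = magnetisation(W, train)
m_feat = magnetisation(W, F)
m_test = magnetisation(W, test)
print(f"train {m_train:.3f}  features {m_feat:.3f}  test {m_test:.3f}")
```

Sweeping the number of stored patterns (and hence α = p/N) in this script and plotting the three magnetisations is one way to look for the three phases on your own parameters.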
The takeaway is not subtle: generalisation is a property of the data geometry, not of the learning rule. A textbook claim that "this learning rule generalises better" is well-typed only relative to a class of data. The reason language models generalise so well isn't that the attention mechanism has a special "ability" — it's that natural language already has the sparse compositional structure that makes Hopfield-style retrieval transfer beyond the training set. Words and constructions are a finite set of components; sentences are sparse mixtures. Hopfield-friendly by accident of biology.
6. What runs on this mathematics today
A non-exhaustive list, with the empirical claim each item is making:
- Every Transformer attention layer. Modern Hopfield is what's there. Quite plausibly one of the most-executed mathematical operations in global compute, by raw volume.
- MHNfs (Klambauer/Hochreiter, Linz) — few-shot drug discovery, 100k+ context molecules as memory, SOTA on FS-Mol.
- DeepRC (Widrich et al., NeurIPS 2020) — multi-instance learning over ~10⁶ immune-repertoire sequences, used for SARS-CoV-2 classification.
- Memristor Hopfield chips (HP Labs Nature Electronics 2020; Peking U. Nature Comms 2024) — analogue MAX-CUT solvers, ~4 orders of magnitude energy advantage over digital. The Peking paper proves mathematical equivalence to a Hopfield attractor network, not just analogy.
- PyTorch drop-in: ml-jku/hopfield-layers, with Hopfield, HopfieldPooling and HopfieldLayer modules as swap-in replacements for LSTM / pooling / attention.
7. What this does not say
- It does not claim that every modern ML algorithm is "secretly a Hopfield network." The identity is precise between Modern Hopfield and scaled dot-product attention. Diffusion models, state-space models, ConvNets have different mathematical structures.
- The Chapter-6 generalisation result is for synthetic feature-mixture data, not for MNIST. To transfer it to real data you need to first extract a feature basis (PCA, dictionary learning, learned embeddings) — i.e. you have to engineer the right kind of sparse architecture, the data won't give it to you for free.
- Classical Hopfield is not "back" as a general-purpose ML tool. ConvNets/diffusion/Transformers are better for almost every benchmark task. The Hopfield reading earns its keep when memory is the actual problem — few-shot, multi-instance, episodic recall, combinatorial optimisation on dedicated hardware.
Further reading
- Full interactive post: Hopfield Networks — From Spin Glass to Attention — seven chapters, five interactive MNIST demos (live Hebb recall, bias-sink, Hebb↔PI spectrum slider, β-slider, three-phase diagram).
- The papers that close the loop:
  - Hopfield 1982: Neural networks and physical systems with emergent collective computational abilities (PNAS).
  - Krotov & Hopfield 2016: Dense Associative Memory for Pattern Recognition.
  - Vaswani et al. 2017: Attention Is All You Need.
  - Ramsauer et al. 2020: Hopfield Networks Is All You Need (the identity).
  - Negri et al. 2024: Random Features Hopfield Networks generalize retrieval to previously unseen examples (the three-phase diagram).
- Nobel Prize 2024: official citation — Hopfield & Hinton, for foundational discoveries and inventions that enable machine learning with artificial neural networks.
If you spot a mistake or a sharper statement of any of the above, the source repo is open — corrections welcome.