The ridge parameter λ in kernel regression and the damping factor d in PageRank are the same object. Both create a spectral gap. Both stabilize an otherwise-unstable iteration. Once you see it, four unrelated-looking algorithms collapse into one story.
This post is a condensed, math-first write-up of two longer pieces on ki-mathias.de, plus the paper and code they build on:
- Eigenvalues & AI — the unification story, with interactive eigendecomposition demos.
- KRR Chat: Under the Hood — a 3 MB browser chatbot trained with kernel ridge regression, no neural network, ~50 k parameters.
- Paper: Leonhardt 2026 — Kernel Ridge Regression and PageRank: A Unified Absorber-Stochastic Framework (Zenodo, v2, 10 pages).
- Try the chatbot: Kalle · Source: pmmathias/krr-chat.
1. The identity that does the work
Every algorithm below is an eigenvalue problem with a regularizer. The regularizer is spelled differently each time, but it plays the same role: it pushes eigenvalues away from the unit circle, which makes power iteration converge.
| Algorithm | Core equation | "Regularizer" |
|---|---|---|
| Ridge regression | (K + λI) α = y | λ |
| PageRank | v = d · M · v + (1 − d) · u | d (damping ≈ 0.85) |
| Early-stopped gradient descent | α_{t+1} = α_t − η (K α_t − y), stop at iteration t | 1/t (Theorem 1) |
| Schrödinger eigenproblem | Ĥ ψ = E ψ | mass/potential scaling |
Theorem 1 in the paper makes the second-last row precise:
Running gradient descent on ½ ‖K α − y‖² for exactly t iterations and stopping approximates the ridge solution with regularizer λ ≈ 1/t.
Early stopping is spectral filtering. You never wrote λ into your training script, but by choosing a step count you chose one. That's why "no regularization + early stopping" and "ridge regression" often train to the same test loss. They are the same filter applied two different ways.
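To make that concrete, here is a minimal numerical sketch (not the paper's setup: a toy diagonal K, and the mapping λ ≈ 1/(ηt), which reduces to λ ≈ 1/t at step size η = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.logspace(-3, 0, 12)            # eigenvalues spanning three decades
K = np.diag(mu)                        # toy diagonal "kernel" matrix
y = rng.normal(size=12)

eta, t = 0.5, 20                       # step size and stopping iteration
lam = 1.0 / (eta * t)                  # implied ridge parameter: 0.1

# Early-stopped gradient descent: alpha <- alpha - eta * (K alpha - y)
alpha_gd = np.zeros(12)
for _ in range(t):
    alpha_gd -= eta * (K @ alpha_gd - y)

# Closed-form ridge solution with the implied lambda
alpha_ridge = np.linalg.solve(K + lam * np.eye(12), y)

rel = np.linalg.norm(alpha_gd - alpha_ridge) / np.linalg.norm(alpha_ridge)
print(rel)                             # modest relative error: the two filters nearly coincide
```

The agreement is tightest in the small-eigenvalue directions, which is exactly where regularization matters; in the large-eigenvalue directions both filters are close to 1 anyway.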
2. Why the spectrum matters — with one concrete example
A row-stochastic matrix M (like the link matrix of a hyperlink graph) has a leading eigenvalue of exactly 1, and the rest can hug that 1 arbitrarily closely. Power iteration will converge to the leading eigenvector, but slowly — the convergence rate is set by the gap 1 − |λ₂|. If the gap is 10⁻⁶, expect a million iterations.
PageRank fixes this by mixing in a teleport term:
v ← d · M · v + (1 − d) · u # d = 0.85, u = uniform
That pins the spectral radius of the effective operator at d, not 1. Now every eigenvalue except the leading one is multiplied by 0.85 each iteration — the gap becomes ≥ 0.15, and the walk converges in tens of steps.
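A minimal sketch of the damped iteration on a random toy matrix (column-stochastic here; a real link graph is sparse, but the contraction argument is identical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 0.85
M = rng.random((n, n))
M /= M.sum(axis=0)                 # column-stochastic: each column sums to 1
u = np.ones(n) / n                 # uniform teleport vector

v = u.copy()
for it in range(1, 300):
    v_next = d * (M @ v) + (1 - d) * u
    if np.abs(v_next - v).sum() < 1e-10:
        break
    v = v_next

print(it)                          # geometric convergence at rate d
```

The iteration count depends only on d and the tolerance, not on the spectrum of M: the error shrinks by a factor of at least d ≈ 0.85 per step regardless of how close |λ₂| was to 1.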
Ridge regression is the linear-algebra twin:
α = (K + λI)⁻¹ y
Same trick. The matrix K has eigenvalues that can be tiny (noise directions). Inverting K alone amplifies noise. Adding λI floors the spectrum: every eigenvalue of K + λI is at least λ, so the inverse is bounded by 1/λ. You trade bias for variance, and the λ knob is exactly PageRank's damping factor d.
3. Absorber-Stochasticization (the paper's §6 trick)
Here's the subtle bit. Writing PageRank as an eigenvalue problem requires M to be properly stochastic (rows sum to 1). Real link matrices have dangling nodes — pages with no outgoing links — so they're sub-stochastic. Standard treatments patch this with a uniform "teleport from dead ends" rule, but that mixes two different regularizers (λ and teleport).
The paper's construction is cleaner: add a single extra row and column — the absorber — that catches all missing mass. Now the matrix is exactly stochastic. The absorber state is transient; the stationary distribution lives on the original nodes.
The identical construction works for kernel ridge regression. Given (K + λI) α = y, build the augmented system
```
[ K + λI   r ]   [ α ]   [ y ]
[ sᵀ       σ ] · [ β ] = [ 0 ]
```
where the extra row/column carries the residual. The system is now block-solvable via a Krylov iteration whose spectrum is controlled by λ alone — no auxiliary regularizers, no hand-tuned dangling-node rule.
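The paper derives r, s, and σ from the missing stochastic mass; the sketch below uses placeholder absorber entries (r = 0, arbitrary s, σ = 1, which is not the paper's construction) purely to show that the augmented system is block-solvable and leaves the ridge solution on the original block intact:

```python
import numpy as np

rng = np.random.default_rng(0)
D, lam = 6, 0.1
X = rng.normal(size=(20, D))
K = X.T @ X / 20
y = rng.normal(size=D)
A = K + lam * np.eye(D)

# Placeholder absorber entries (hypothetical; the paper derives these
# from the missing row mass of the sub-stochastic operator):
r = np.zeros((D, 1))
s = rng.normal(size=(D, 1))
sigma = 1.0

Aug = np.block([[A, r], [s.T, np.array([[sigma]])]])
sol = np.linalg.solve(Aug, np.concatenate([y, [0.0]]))
alpha, beta = sol[:D], sol[D]

# With r = 0 the first block row decouples: alpha is the plain ridge
# solution, and beta absorbs the residual term s^T alpha.
print(np.allclose(alpha, np.linalg.solve(A, y)))
```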
The practical payoff: you can train KRR at D ≈ 6k with a Block Preconditioned Conjugate Gradient and hit numerical convergence in ~11 iterations.
4. The benchmark that surprised me (BLAS > GPU)
Here is the v2 paper's Table 4 — training the Kalle chatbot (D = 6144, V = 2977):
| Backend | Precision | Time (ms) | Iterations |
|---|---|---|---|
| Direct solve (NumPy/LAPACK) | Float64 | 2097 | — |
| Block-PCG (NumPy) | Float64 | 6731 | 14 |
| Block-PCG (PyTorch CPU) | Float32 | 1704 | 11 |
| Block-PCG (PyTorch MPS / GPU) | Float32 | 1622 | 11 |
The surprise: PyTorch CPU is within 5% of GPU, and it beats NumPy's direct solve. Why?
- BLAS is why. PyTorch CPU dispatches matmuls to MKL/oneDNN with AVX-512, multi-threaded. NumPy's `linalg.solve` calls LAPACK's direct factorization, which is O(D³) and doesn't parallelize as cleanly as a blocked matmul.
- GPU transfer overhead eats the advantage. For D ≈ 6k, the operators fit in L2/L3 cache. Moving α and the preconditioner to GPU memory each iteration costs more than the matmul you save.
- Float32 is enough. The preconditioner is well-conditioned after the absorber step, so Float32 converges in the same iteration count as Float64 with a 2× speed gain.
If your D is in the 10³–10⁴ range, don't reach for a GPU. Reach for PyTorch CPU with Float32. You'll ship on commodity hardware.
The inflection point is around D ~ 2·10⁴ on an M2 Pro — above that, GPU wins decisively. Your mileage varies with cache size and memory bandwidth; the takeaway is to measure, not to assume "GPU always wins."
5. Corpus scaling — N grows, solve stays 3–7% of total
The other surprise from the benchmarks: when I grew the training corpus from ~3 k to ~21 k sentences (V = 2977 → V = 21,433), the KRR solve stayed between 3% and 7% of the end-to-end pipeline. The bulk of time was spent in tokenization and RFF feature construction, not the linear solve.
This matters because the common intuition — "bigger corpus → quadratically harder training" — comes from full-kernel methods where you materialize the N×N gram matrix. With Random Fourier Features, the bottleneck is the D×D normal equations, and D is decoupled from N. Doubling your corpus doubles feature-extraction cost (linear in N) but leaves the solve essentially unchanged.
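A sketch of the decoupling, assuming a standard RBF-kernel RFF map (the paper's feature map may differ in details):

```python
import numpy as np

def rff(X, D, rng, gamma=1.0):
    # Random Fourier Features for an RBF kernel: z(x) = sqrt(2/D) cos(xW + b)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
D, lam = 256, 1e-3
for N in (1_000, 8_000):
    X = rng.normal(size=(N, 4))
    y = rng.normal(size=N)
    Z = rff(X, D, rng)                     # O(N * D): scales with corpus size
    A = Z.T @ Z + lam * np.eye(D)          # D x D normal equations
    w = np.linalg.solve(A, Z.T @ y)        # O(D^3): independent of N
    print(N, Z.shape, A.shape)
```

Grow N by 8× and only the feature pass gets longer; the matrix handed to the solver stays D × D.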
The scaling lever is D, not N. Budget accordingly.
6. Kalle — a 3 MB chatbot you can read
The paper's toy system is Kalle: a kernel-ridge-regression chatbot that I've deployed as a static web page at pmmathias.github.io/krr-chat. It has about 50 k parameters (three D×V matrices), fits in 3 MB, runs in-browser on your phone's CPU, and speaks German at a level somewhere between a broken Markov chain and a very confused graduate student. It won't replace ChatGPT. It will let you read every equation in the paper and see the matrices line up.
The point is pedagogical: at this scale, the mathematics is still visible. You can print out K + λI, look at its eigenvalues, and nothing is hidden behind a billion-parameter transformer. That's why I built it.
```python
# Train step — the paper's Algorithm 1, 20 lines of Python
import torch

def train_krr_bpcg(K, y, lam, tol=1e-6, max_iter=50):
    # K: (D, D) kernel matrix; y: (D, V) targets, one column per output
    D = K.shape[0]
    A = K + lam * torch.eye(D, dtype=K.dtype, device=K.device)
    M_inv = 1.0 / A.diagonal()          # Jacobi preconditioner
    alpha = torch.zeros_like(y)
    r = y - A @ alpha                   # initial residual
    z = M_inv[:, None] * r              # preconditioned residual
    p = z.clone()                       # search direction
    for it in range(max_iter):
        Ap = A @ p
        rz = (r * z).sum(dim=0)         # per-column <r, z>
        a = rz / (p * Ap).sum(dim=0)    # per-column step length
        alpha = alpha + a * p
        r = r - a * Ap
        if r.norm() < tol:
            return alpha, it + 1
        z_new = M_inv[:, None] * r
        b = (r * z_new).sum(dim=0) / rz
        p = z_new + b * p
        z = z_new
    return alpha, max_iter
```
That's it. The absorber construction lives inside the preconditioner; the outer loop is textbook CG.
7. What this means if you're building
- If you're writing a blog post that says "PageRank is an iterative algorithm and ridge regression is a closed-form algorithm" — they're the same eigenvalue problem, just with different stopping rules.
- If you're choosing between GPU and CPU training for a moderate-D kernel method, measure first. For D < 10⁴, BLAS CPU is frequently the winner.
- If you're looking for an early-stopping criterion, you can read it off the paper's Table 1: running for t iterations applies a filter approximately equivalent to λ ≈ 1/t. Choose t such that your effective λ matches the cross-validated ridge λ.
- If you're trying to explain modern ML to someone else, the eigenvalue story is the shortest honest path. Every algorithm they'll encounter — PCA, PageRank, Schrödinger, ridge regression, neural network training — reduces to "find the eigenvectors of this operator, and regularize the spectrum so the iteration converges."
Further reading
- Full interactive post: Eigenvalues & AI — includes three React demos (spectrum visualizer, PageRank convergence, ridge parameter slider) and a 9-minute companion video.
- Kalle walkthrough: KRR Chat: Under the Hood.
- Paper (v2, open access): Zenodo 10.5281/zenodo.19595641.
- Code: github.com/pmmathias/krr-chat.
If you find a mistake, open an issue on the repo or @ me — every correction improves the next version of the paper.