Your Ridge Parameter Is PageRank's Damping Factor (and Why BLAS Beats GPU for Mid-Sized KRR)

#machinelearning #python #math #ai

The ridge parameter λ in kernel regression and the damping factor d in PageRank are the same object. Both create a spectral gap. Both stabilize an otherwise-unstable iteration. Once you see it, four unrelated-looking algorithms collapse into one story.

This post is a condensed, math-first write-up of two longer pieces on ki-mathias.de:

Eigenvalues & AI — the unification story, with interactive eigendecomposition demos.
KRR Chat: Under the Hood — a 3 MB browser chatbot trained with kernel ridge regression, no neural network, ~50 k parameters.
Paper: Leonhardt 2026 — Kernel Ridge Regression and PageRank: A Unified Absorber-Stochastic Framework (Zenodo, v2, 10 pages).
Try the chatbot: Kalle · Source: pmmathias/krr-chat.

1. The identity that does the work

Every algorithm below is an eigenvalue problem with a regularizer. The regularizer is spelled differently each time, but it plays the same role: it pushes eigenvalues away from the unit circle, which makes power iteration converge.

Algorithm	Core equation	"Regularizer"
Ridge regression	`(K + λI) α = y`	λ
PageRank	`v = d · M · v + (1 − d) · u`	d (damping ≈ 0.85)
Early-stopped gradient descent	`α_{t+1} = α_t − η (K α_t − y)`, stop at iteration t	1/t (Theorem 1)
Schrödinger eigenproblem	`Ĥ ψ = E ψ`	mass/potential scaling

Theorem 1 in the paper makes the second-last row precise:

Running gradient descent on ½ ‖K α − y‖² for exactly t iterations and stopping approximates the ridge solution with regularizer λ ≈ 1/t.

Early stopping is spectral filtering. You never wrote λ into your training script, but by choosing a step count you chose one. That's why "no regularization + early stopping" and "ridge regression" often train to the same test loss. They are the same filter applied two different ways.

2. Why the spectrum matters — with one concrete example

A row-stochastic matrix M (like the link matrix of a hyperlink graph) has a leading eigenvalue of exactly 1, and the rest can hug that 1 arbitrarily closely. Power iteration will converge to the leading eigenvector, but slowly — the convergence rate is set by the gap 1 − |λ₂|. If the gap is 10⁻⁶, expect a million iterations.

PageRank fixes this by mixing in a teleport term:

v ← d · M · v + (1 − d) · u      # d = 0.85, u = uniform

That pins the spectral radius of the effective operator at d, not 1. Now every eigenvalue except the leading one is multiplied by 0.85 each iteration — the gap becomes ≥ 0.15, and the walk converges in tens of steps.

Ridge regression is the linear-algebra twin:

α = (K + λI)⁻¹ y

Same trick. The matrix K has eigenvalues that can be tiny (noise directions). Inverting K alone amplifies noise. Adding λI floors the spectrum: every eigenvalue of K + λI is at least λ, so the inverse is bounded by 1/λ. You trade bias for variance, and the λ knob is exactly PageRank's damping factor d.

3. Absorber-Stochasticization (the paper's §6 trick)

Here's the subtle bit. Writing PageRank as an eigenvalue problem requires M to be properly stochastic (rows sum to 1). Real link matrices have dangling nodes — pages with no outgoing links — so they're sub-stochastic. Standard treatments patch this with a uniform "teleport from dead ends" rule, but that mixes two different regularizers (λ and teleport).

The paper's construction is cleaner: add a single extra row and column — the absorber — that catches all missing mass. Now the matrix is exactly stochastic. The absorber state is transient; the stationary distribution lives on the original nodes.

The identical construction works for kernel ridge regression. Given (K + λI) α = y, build the augmented system

[ K + λI   r ]   [ α ]     [ y ]
[  sᵀ      σ ] · [ β ]  =  [ 0 ]

where the extra row/column carries the residual. The system is now block-solvable via a Krylov iteration whose spectrum is controlled by λ alone — no auxiliary regularizers, no hand-tuned dangling-node rule.

The practical payoff: you can train KRR at D ≈ 6k with a Block Preconditioned Conjugate Gradient and hit numerical convergence in ~11 iterations.

4. The benchmark that surprised me (BLAS > GPU)

Here is the v2 paper's Table 4 — training the Kalle chatbot (D = 6144, V = 2977):

Backend	Precision	Time (ms)	Iterations
Direct solve (NumPy/LAPACK)	Float64	2097	—
Block-PCG (NumPy)	Float64	6731	14
Block-PCG (PyTorch CPU)	Float32	1704	11
Block-PCG (PyTorch MPS / GPU)	Float32	1622	11

The surprise: PyTorch CPU is within 5% of GPU, and it beats NumPy's direct solve. Why?

BLAS is why. PyTorch CPU dispatches matmuls to MKL/oneDNN with AVX-512, multi-threaded. NumPy's linalg.solve calls LAPACK's direct factorization, which is O(D³) and doesn't parallelize as cleanly as a blocked matmul.
GPU transfer overhead eats the advantage. For D ≈ 6k, the operators fit in L2/L3 cache. Moving α and the preconditioner to GPU memory each iteration costs more than the matmul you save.
Float32 is enough. The preconditioner is well-conditioned after the absorber step, so Float32 converges in the same iteration count as Float64 with a 2× speed gain.

If your D is in the 10³–10⁴ range, don't reach for a GPU. Reach for PyTorch CPU with Float32. You'll ship on commodity hardware.

The inflection point is around D ~ 2·10⁴ on an M2 Pro — above that, GPU wins decisively. Your mileage varies with cache size and memory bandwidth; the takeaway is to measure, not to assume "GPU always wins."

5. Corpus scaling — N grows, solve stays 3–7% of total

The other surprise from the benchmarks: when I grew the training corpus from ~3 k to ~21 k sentences (V = 2977 → V = 21,433), the KRR solve stayed between 3% and 7% of the end-to-end pipeline. The bulk of time was spent in tokenization and RFF feature construction, not the linear solve.

This matters because the common intuition — "bigger corpus → quadratically harder training" — comes from full-kernel methods where you materialize the N×N gram matrix. With Random Fourier Features, the bottleneck is the D×D normal equations, and D is decoupled from N. Doubling your corpus doubles feature-extraction cost (linear in N) but leaves the solve essentially unchanged.

The scaling lever is D, not N. Budget accordingly.

6. Kalle — a 3 MB chatbot you can read

The paper's toy system is Kalle: a kernel-ridge-regression chatbot that I've deployed as a static web page at pmmathias.github.io/krr-chat. It has about 50 k parameters (three D×V matrices), fits in 3 MB, runs in-browser on your phone's CPU, and speaks German at a level somewhere between a broken Markov chain and a very confused graduate student. It won't replace ChatGPT. It will let you read every equation in the paper and see the matrices line up.

The point is pedagogical: at this scale, the mathematics is still visible. You can print out K + λI, look at its eigenvalues, and nothing is hidden behind a billion-parameter transformer. That's why I built it.

# Train step — the paper's Algorithm 1, 20 lines of Python
import torch

def train_krr_bpcg(K, y, lam, tol=1e-6, max_iter=50):
    D = K.shape[0]
    A = K + lam * torch.eye(D, dtype=K.dtype, device=K.device)
    M_inv = 1.0 / A.diagonal()  # Jacobi preconditioner
    alpha = torch.zeros_like(y)
    r = y - A @ alpha
    z = M_inv[:, None] * r
    p = z.clone()
    for it in range(max_iter):
        Ap = A @ p
        rz = (r * z).sum(dim=0)
        a = rz / (p * Ap).sum(dim=0)
        alpha = alpha + a * p
        r = r - a * Ap
        if r.norm() < tol:
            return alpha, it + 1
        z_new = M_inv[:, None] * r
        b = (r * z_new).sum(dim=0) / rz
        p = z_new + b * p
        z = z_new
    return alpha, max_iter

That's it. The absorber construction lives inside the preconditioner; the outer loop is textbook CG.

7. What this means if you're building

If you're writing a blog post that says "PageRank is an iterative algorithm and ridge regression is a closed-form algorithm" — they're the same eigenvalue problem, just with different stopping rules.
If you're choosing between GPU and CPU training for a moderate-D kernel method, measure first. For D < 10⁴, BLAS CPU is frequently the winner.
If you're looking for an early-stopping criterion, you can read it off the paper's Table 1: running for t iterations applies a filter approximately equivalent to λ ≈ 1/t. Choose t such that your effective λ matches the cross-validated ridge λ.
If you're trying to explain modern ML to someone else, the eigenvalue story is the shortest honest path. Every algorithm they'll encounter — PCA, PageRank, Schrödinger, ridge regression, neural network training — reduces to "find the eigenvectors of this operator, and regularize the spectrum so the iteration converges."