DEV Community

Cover image for Your Ridge Parameter Is PageRank's Damping Factor (and Why BLAS Beats GPU for Mid-Sized KRR)
Mathias Leonhardt
Mathias Leonhardt

Posted on • Originally published at ki-mathias.de

Your Ridge Parameter Is PageRank's Damping Factor (and Why BLAS Beats GPU for Mid-Sized KRR)

The ridge parameter λ in kernel regression and the damping factor d in PageRank are the same object. Both create a spectral gap. Both stabilize an otherwise-unstable iteration. Once you see it, four unrelated-looking algorithms collapse into one story.

This post is a condensed, math-first write-up of two longer pieces on ki-mathias.de:


1. The identity that does the work

Every algorithm below is an eigenvalue problem with a regularizer. The regularizer is spelled differently each time, but it plays the same role: it pushes eigenvalues away from the unit circle, which makes power iteration converge.

Algorithm Core equation "Regularizer"
Ridge regression (K + λI) α = y λ
PageRank v = d · M · v + (1 − d) · u d (damping ≈ 0.85)
Early-stopped gradient descent α_{t+1} = α_t − η (K α_t − y), stop at iteration t 1/t (Theorem 1)
Schrödinger eigenproblem Ĥ ψ = E ψ mass/potential scaling

Theorem 1 in the paper makes the second-last row precise:

Running gradient descent on ½ ‖K α − y‖² for exactly t iterations and stopping approximates the ridge solution with regularizer λ ≈ 1/t.

Early stopping is spectral filtering. You never wrote λ into your training script, but by choosing a step count you chose one. That's why "no regularization + early stopping" and "ridge regression" often train to the same test loss. They are the same filter applied two different ways.


2. Why the spectrum matters — with one concrete example

A row-stochastic matrix M (like the link matrix of a hyperlink graph) has a leading eigenvalue of exactly 1, and the rest can hug that 1 arbitrarily closely. Power iteration will converge to the leading eigenvector, but slowly — the convergence rate is set by the gap 1 − |λ₂|. If the gap is 10⁻⁶, expect a million iterations.

PageRank fixes this by mixing in a teleport term:

v ← d · M · v + (1 − d) · u      # d = 0.85, u = uniform
Enter fullscreen mode Exit fullscreen mode

That pins the spectral radius of the effective operator at d, not 1. Now every eigenvalue except the leading one is multiplied by 0.85 each iteration — the gap becomes ≥ 0.15, and the walk converges in tens of steps.

Ridge regression is the linear-algebra twin:

α = (K + λI)⁻¹ y
Enter fullscreen mode Exit fullscreen mode

Same trick. The matrix K has eigenvalues that can be tiny (noise directions). Inverting K alone amplifies noise. Adding λI floors the spectrum: every eigenvalue of K + λI is at least λ, so the inverse is bounded by 1/λ. You trade bias for variance, and the λ knob is exactly PageRank's damping factor d.


3. Absorber-Stochasticization (the paper's §6 trick)

Here's the subtle bit. Writing PageRank as an eigenvalue problem requires M to be properly stochastic (rows sum to 1). Real link matrices have dangling nodes — pages with no outgoing links — so they're sub-stochastic. Standard treatments patch this with a uniform "teleport from dead ends" rule, but that mixes two different regularizers (λ and teleport).

The paper's construction is cleaner: add a single extra row and column — the absorber — that catches all missing mass. Now the matrix is exactly stochastic. The absorber state is transient; the stationary distribution lives on the original nodes.

The identical construction works for kernel ridge regression. Given (K + λI) α = y, build the augmented system

[ K + λI   r ]   [ α ]     [ y ]
[  sᵀ      σ ] · [ β ]  =  [ 0 ]
Enter fullscreen mode Exit fullscreen mode

where the extra row/column carries the residual. The system is now block-solvable via a Krylov iteration whose spectrum is controlled by λ alone — no auxiliary regularizers, no hand-tuned dangling-node rule.

The practical payoff: you can train KRR at D ≈ 6k with a Block Preconditioned Conjugate Gradient and hit numerical convergence in ~11 iterations.


4. The benchmark that surprised me (BLAS > GPU)

Here is the v2 paper's Table 4 — training the Kalle chatbot (D = 6144, V = 2977):

Backend Precision Time (ms) Iterations
Direct solve (NumPy/LAPACK) Float64 2097
Block-PCG (NumPy) Float64 6731 14
Block-PCG (PyTorch CPU) Float32 1704 11
Block-PCG (PyTorch MPS / GPU) Float32 1622 11

The surprise: PyTorch CPU is within 5% of GPU, and it beats NumPy's direct solve. Why?

  1. BLAS is why. PyTorch CPU dispatches matmuls to MKL/oneDNN with AVX-512, multi-threaded. NumPy's linalg.solve calls LAPACK's direct factorization, which is O(D³) and doesn't parallelize as cleanly as a blocked matmul.
  2. GPU transfer overhead eats the advantage. For D ≈ 6k, the operators fit in L2/L3 cache. Moving α and the preconditioner to GPU memory each iteration costs more than the matmul you save.
  3. Float32 is enough. The preconditioner is well-conditioned after the absorber step, so Float32 converges in the same iteration count as Float64 with a 2× speed gain.

If your D is in the 10³–10⁴ range, don't reach for a GPU. Reach for PyTorch CPU with Float32. You'll ship on commodity hardware.

The inflection point is around D ~ 2·10⁴ on an M2 Pro — above that, GPU wins decisively. Your mileage varies with cache size and memory bandwidth; the takeaway is to measure, not to assume "GPU always wins."


5. Corpus scaling — N grows, solve stays 3–7% of total

The other surprise from the benchmarks: when I grew the training corpus from ~3 k to ~21 k sentences (V = 2977 → V = 21,433), the KRR solve stayed between 3% and 7% of the end-to-end pipeline. The bulk of time was spent in tokenization and RFF feature construction, not the linear solve.

This matters because the common intuition — "bigger corpus → quadratically harder training" — comes from full-kernel methods where you materialize the N×N gram matrix. With Random Fourier Features, the bottleneck is the D×D normal equations, and D is decoupled from N. Doubling your corpus doubles feature-extraction cost (linear in N) but leaves the solve essentially unchanged.

The scaling lever is D, not N. Budget accordingly.


6. Kalle — a 3 MB chatbot you can read

The paper's toy system is Kalle: a kernel-ridge-regression chatbot that I've deployed as a static web page at pmmathias.github.io/krr-chat. It has about 50 k parameters (three D×V matrices), fits in 3 MB, runs in-browser on your phone's CPU, and speaks German at a level somewhere between a broken Markov chain and a very confused graduate student. It won't replace ChatGPT. It will let you read every equation in the paper and see the matrices line up.

The point is pedagogical: at this scale, the mathematics is still visible. You can print out K + λI, look at its eigenvalues, and nothing is hidden behind a billion-parameter transformer. That's why I built it.

# Train step — the paper's Algorithm 1, 20 lines of Python
import torch

def train_krr_bpcg(K, y, lam, tol=1e-6, max_iter=50):
    D = K.shape[0]
    A = K + lam * torch.eye(D, dtype=K.dtype, device=K.device)
    M_inv = 1.0 / A.diagonal()  # Jacobi preconditioner
    alpha = torch.zeros_like(y)
    r = y - A @ alpha
    z = M_inv[:, None] * r
    p = z.clone()
    for it in range(max_iter):
        Ap = A @ p
        rz = (r * z).sum(dim=0)
        a = rz / (p * Ap).sum(dim=0)
        alpha = alpha + a * p
        r = r - a * Ap
        if r.norm() < tol:
            return alpha, it + 1
        z_new = M_inv[:, None] * r
        b = (r * z_new).sum(dim=0) / rz
        p = z_new + b * p
        z = z_new
    return alpha, max_iter
Enter fullscreen mode Exit fullscreen mode

That's it. The absorber construction lives inside the preconditioner; the outer loop is textbook CG.


7. What this means if you're building

  • If you're writing a blog post that says "PageRank is an iterative algorithm and ridge regression is a closed-form algorithm" — they're the same eigenvalue problem, just with different stopping rules.
  • If you're choosing between GPU and CPU training for a moderate-D kernel method, measure first. For D < 10⁴, BLAS CPU is frequently the winner.
  • If you're looking for an early-stopping criterion, you can read it off the paper's Table 1: running for t iterations applies a filter approximately equivalent to λ ≈ 1/t. Choose t such that your effective λ matches the cross-validated ridge λ.
  • If you're trying to explain modern ML to someone else, the eigenvalue story is the shortest honest path. Every algorithm they'll encounter — PCA, PageRank, Schrödinger, ridge regression, neural network training — reduces to "find the eigenvectors of this operator, and regularize the spectrum so the iteration converges."

Further reading

If you find a mistake, open an issue on the repo or @ me — every correction improves the next version of the paper.

Top comments (0)