Berkan Sesen

Posted on • Originally published at sesen.ai

Five Easy Pieces of Linear Regression

If you care about statistics, machine learning, or what everyone now calls "AI" — learning linear regression well is the single most important investment you can make. Not transformers. Not diffusion models. Linear regression.

You can go as shallow as calling model.fit(X, y) and declaring you know regression. Or you can go deep — the linear algebra behind the closed-form solution, the calculus behind gradient descent, the numerical analysis behind SVD, the Gauss-Markov assumptions that tell you when least squares is optimal, Cook's distance that reveals when one rogue data point is hijacking your entire model, and the Bayesian interpretation that turns regularisation into prior beliefs. People still write PhD theses on this subject. It runs that deep.

This post takes you through five fundamentally different algorithms for solving the same regression problem — each from a different branch of mathematics — and they all produce the exact same answer. Five roads, one destination. Along the way, you'll build intuitions that transfer directly to neural networks, Gaussian processes, and Bayesian inference.

By the end, you'll implement all five from scratch and understand why they agree.

Let's Build It

Click the badge to open the interactive notebook:

Open In Colab

Watch gradient descent iteratively converge on the least-squares solution:

Gradient descent converging on the regression line over 1,000 iterations, starting from a flat line at y = 0.

Here's the setup — a synthetic dataset with a clear linear trend plus Gaussian noise:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)

# Design matrix: add a column of ones for the intercept
X = np.column_stack([np.ones(n), X_raw])  # shape (50, 2)

Our goal: find $\boldsymbol{\beta} = [\beta_0, \beta_1]$ that minimises the sum of squared residuals:

$$L(\boldsymbol{\beta}) = \sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^\top\boldsymbol{\beta}\bigr)^2 = \|\mathbf{y} - X\boldsymbol{\beta}\|^2$$

The true values are $\beta_0 = 7$ (intercept) and $\beta_1 = 2.5$ (slope). Let's recover them five ways.

Method 1: Normal Equation

The closed-form solution from linear algebra:

beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y
print(f'Normal equation: y = {beta_normal[0]:.2f} + {beta_normal[1]:.2f}x')

Method 2: Gradient Descent

Iteratively update $\boldsymbol{\beta}$ in the direction that reduces the error:

def gradient_descent(X, y, lr=0.001, n_iter=500):
    beta = np.zeros(X.shape[1])
    history = [beta.copy()]

    for _ in range(n_iter):
        residuals = X @ beta - y
        gradient = (2 / len(y)) * X.T @ residuals
        beta -= lr * gradient
        history.append(beta.copy())

    return beta, history

beta_gd, gd_history = gradient_descent(X, y, lr=0.02, n_iter=1000)
print(f'Gradient descent: y = {beta_gd[0]:.2f} + {beta_gd[1]:.2f}x')

Method 3: SVD (Pseudo-Inverse)

The numerically stable route via singular value decomposition:

U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ np.diag(1 / s) @ U.T @ y
print(f'SVD:             y = {beta_svd[0]:.2f} + {beta_svd[1]:.2f}x')

Method 4: scipy.optimize.minimize

Treat regression as a generic optimisation problem:

from scipy.optimize import minimize

def mse_loss(beta, X, y):
    return np.mean((y - X @ beta) ** 2)

result = minimize(mse_loss, x0=[0, 0], args=(X, y), method='Nelder-Mead')
beta_scipy = result.x
print(f'scipy.optimize:  y = {beta_scipy[0]:.2f} + {beta_scipy[1]:.2f}x')

Method 5: sklearn

The one-liner you'll use in practice:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_raw.reshape(-1, 1), y)
beta_sklearn = np.array([model.intercept_, model.coef_[0]])
print(f'sklearn:         y = {beta_sklearn[0]:.2f} + {beta_sklearn[1]:.2f}x')

All five produce the same answer:

All five methods overlaid on the scatter plot — they produce identical regression lines.

The lines overlap perfectly. Every method converges to the same $\boldsymbol{\beta}$ because they're all minimising the same objective — the sum of squared residuals.

What Just Happened?

Five algorithms, one answer. Let's understand why, and where they differ.

The Normal Equation: Where the Gradient Is Zero

The mean squared error is:

$$L(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - X\boldsymbol{\beta}\|^2 = \frac{1}{n}(\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta})$$

Setting the gradient to zero:

$$\nabla L = -\frac{2}{n}X^\top(\mathbf{y} - X\boldsymbol{\beta}) = \mathbf{0}$$

Solving for $\boldsymbol{\beta}$:

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\mathbf{y}$$

This is exact — no iteration, no approximation. But it requires $X^\top X$ to be invertible. If your features are collinear (or nearly so), the determinant is close to zero and the solution explodes numerically.

Gradient Descent: Follow the Slope Downhill

Instead of solving the equation analytically, gradient descent takes small steps in the direction of steepest descent:

$$\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} - \eta\,\nabla L$$

where $\eta$ is the learning rate. The gradient $\nabla L = -\frac{2}{n}X^\top(\mathbf{y} - X\boldsymbol{\beta})$ points uphill, so we subtract it.

This is the same algorithm that trains neural networks — just applied to a simpler problem. If you understand gradient descent here, you understand SGD everywhere.

The learning rate matters. Too large and the iterations overshoot and diverge. Too small and convergence is painfully slow. For linear regression, the optimal learning rate relates to the eigenvalues of $X^\top X$.
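To make that concrete, here's a sketch (regenerating the dataset from the setup above) that computes the stability threshold. For the MSE loss the Hessian is $\frac{2}{n}X^\top X$, and gradient descent converges only when $\eta < 2/\lambda_{\max}$ of that Hessian:

```python
import numpy as np

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)
X = np.column_stack([np.ones(n), X_raw])

# Hessian of the MSE loss is (2/n) X^T X; gradient descent is stable
# only for learning rates below 2 / lambda_max(Hessian)
hessian = (2 / n) * X.T @ X
lam_max = np.linalg.eigvalsh(hessian).max()
eta_max = 2 / lam_max
print(f'Largest stable learning rate: {eta_max:.4f}')
```

For this dataset the threshold lands just above the `lr=0.02` used earlier, which is why that rate converges while noticeably larger ones blow up.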

SVD: The Numerically Stable Path

The SVD decomposes $X = U \Sigma V^\top$. Substituting into the normal equation:

$$\hat{\boldsymbol{\beta}} = V\Sigma^{-1}U^\top\mathbf{y} = X^{+}\mathbf{y}$$

This avoids computing $X^\top X$ entirely. When $X^\top X$ is ill-conditioned (near-singular), the normal equation gives garbage, but SVD handles it gracefully by working with the singular values directly. This is what numpy's lstsq and sklearn use internally.
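As a quick sanity check — a sketch reusing the dataset from the setup — the hand-rolled pseudo-inverse agrees with numpy's SVD-backed `lstsq` solver:

```python
import numpy as np

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)
X = np.column_stack([np.ones(n), X_raw])

# Manual pseudo-inverse via the thin SVD...
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ (U.T @ y / s)

# ...matches numpy's SVD-based least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_svd, beta_lstsq))  # True
```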

scipy.optimize: The Black-Box Approach

The minimize function treats regression as a generic optimisation problem. We pass the loss function, initial guess, and algorithm — it handles the rest. Nelder-Mead (a derivative-free simplex method, like in our GP hyperparameter optimisation) doesn't even compute gradients. This approach generalises to any loss function — L1, Huber, custom losses — where closed-form solutions don't exist.

sklearn: What Happens Inside

sklearn's LinearRegression solves the least-squares problem with an SVD-based LAPACK solver (lstsq) under the hood. It handles the intercept fitting, multi-output targets, and edge cases for you. For production code, always use sklearn — it's tested, optimised, and handles corner cases you'd forget.

Going Deeper

OLS Is Maximum Likelihood Under Gaussian Noise

If we assume the noise is Gaussian — $y_i = \mathbf{x}_i^\top\boldsymbol{\beta} + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ — then the likelihood is:

$$p(\mathbf{y} \mid X, \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i - \mathbf{x}_i^\top\boldsymbol{\beta})^2}{2\sigma^2}\right)$$

Taking the log and maximising gives exactly the least-squares objective. So OLS is maximum likelihood estimation under Gaussian noise — the same MLE framework that appears throughout our tutorial series.
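You can verify the equivalence numerically. The sketch below (reusing the dataset above, with the noise scale $\sigma = 2.5$ assumed known) minimises the Gaussian negative log-likelihood with a generic optimiser and recovers the OLS coefficients:

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)
X = np.column_stack([np.ones(n), X_raw])

def neg_log_likelihood(beta, X, y, sigma=2.5):
    # Gaussian NLL; the beta-dependent part is just the sum of squared residuals
    resid = y - X @ beta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=[0, 0], args=(X, y), method='Nelder-Mead').x
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_mle, beta_ols, atol=1e-2))  # the two estimates coincide
```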

Ridge Regression: When the Normal Equation Breaks

If $X^\top X$ is singular or nearly so, the normal equation is unstable. Ridge regression adds a penalty $\lambda\|\boldsymbol{\beta}\|^2$ to the objective:

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}}\,\|\mathbf{y} - X\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2 = (X^\top X + \lambda I)^{-1}X^\top\mathbf{y}$$

The $\lambda I$ term makes the matrix invertible regardless of collinearity. Larger $\lambda$ shrinks coefficients toward zero — a bias-variance tradeoff you can explore hands-on in the Overfitting Explorer.

def ridge_regression(X, y, lam):
    """Ridge regression: closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(n_features)) @ X.T @ y

lambdas = [0, 1, 10, 100]
for lam in lambdas:
    beta_r = ridge_regression(X, y, lam)
    print(f'lambda={lam:>3d}: y = {beta_r[0]:.2f} + {beta_r[1]:.2f}x')

Ridge regression lines for different lambda values, showing how regularisation shrinks coefficients toward zero.

The Bayesian interpretation: Ridge regression is equivalent to placing a Gaussian prior $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \sigma^2/\lambda \cdot I)$ on the weights and computing the MAP estimate. The connection to Bayesian inference runs deep — Ridge is OLS with a prior, and GP regression generalises this to infinite-dimensional feature spaces.
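A quick numerical check of that equivalence — a sketch reusing the dataset above, with $\lambda = 10$ chosen arbitrarily — shows the MAP estimate under a Gaussian prior landing on the ridge closed-form solution:

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)
X = np.column_stack([np.ones(n), X_raw])
lam = 10.0

def neg_log_posterior(beta):
    # Gaussian likelihood term + Gaussian prior term on the weights
    return np.sum((y - X @ beta)**2) + lam * np.sum(beta**2)

beta_map = minimize(neg_log_posterior, x0=np.zeros(2), method='Nelder-Mead').x
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(np.allclose(beta_map, beta_ridge, atol=1e-2))  # MAP == ridge
```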

Influential Observations: When One Point Rules the Model

OLS minimises squared errors, which means a single outlier far from the data cloud can drag the entire regression line toward itself. This isn't a bug — it's a mathematical consequence of the $L^2$ loss.

Cook's distance (Cook, 1977) quantifies this: it measures how much all fitted values change when observation $i$ is removed. The formula combines the residual size with the observation's leverage — how extreme its $x$ value is:

$$D_i = \frac{\sum_{j=1}^{n}\bigl(\hat{y}_j - \hat{y}_{j(i)}\bigr)^2}{p\,\hat{\sigma}^2}$$

A common rule of thumb flags points with $D_i > 4/n$ as potentially influential. The danger is subtle: a high-leverage point (extreme $x$) doesn't always distort the fit — it might sit right on the true line. But when a high-leverage point is also an outlier in $y$, it can single-handedly determine the slope.
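Here's a sketch using the standard computational shortcut, $D_i = \frac{r_i^2}{p\,\hat{\sigma}^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}$, which avoids refitting $n$ times by reading the leverages $h_{ii}$ off the hat matrix. The planted outlier is my addition, not part of the original dataset:

```python
import numpy as np

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)
X_raw[0], y[0] = 15.0, 0.0  # plant one high-leverage outlier
X = np.column_stack([np.ones(n), X_raw])
p = X.shape[1]

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix: y_hat = H y
h = np.diag(H)                           # leverage of each observation
sigma2 = np.sum(resid**2) / (n - p)
cooks = resid**2 / (p * sigma2) * h / (1 - h)**2
flagged = np.where(cooks > 4 / n)[0]
print(f'Flagged points: {flagged}')     # the planted outlier is flagged
```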

This is why residual analysis is not optional. Plotting residuals against fitted values, checking for heteroscedasticity (fan-shaped patterns), and inspecting Q-Q plots for normality are how you verify that your model's assumptions actually hold on your data. A clean-looking $R^2$ can mask serious problems that only residual plots reveal.

When influential points are genuine (not data errors), consider robust alternatives: Huber loss, L1 regression (which minimises absolute errors instead of squared), or RANSAC.
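For example, here's a sketch of L1 regression via scipy.optimize (the loss function name is mine; the notebook calls its version min_abs_dev_error). With three planted outliers, the L1 fit should stay much closer to the true slope of 2.5 than OLS does:

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)
y[:3] += 40  # plant three large outliers
X = np.column_stack([np.ones(n), X_raw])

beta_l2 = np.linalg.solve(X.T @ X, X.T @ y)  # OLS, dragged by the outliers

def abs_loss(beta):
    # L1 objective: sum of absolute residuals (no closed-form minimiser)
    return np.sum(np.abs(y - X @ beta))

beta_l1 = minimize(abs_loss, x0=beta_l2, method='Nelder-Mead').x
print(f'L2 slope: {beta_l2[1]:.2f}, L1 slope: {beta_l1[1]:.2f}')
```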

When NOT to Use Linear Regression

  1. Non-linear relationships — if the true relationship is curved, linear regression gives a systematically wrong answer. Try polynomial features or non-parametric methods.
  2. Outliers — OLS is sensitive to outliers because it minimises squared errors. A single outlier can tilt the line dramatically. Use Cook's distance to diagnose, and consider Huber loss or L1 regression as robust alternatives.
  3. Collinear features — when features are highly correlated, the normal equation becomes unstable. Use Ridge, Lasso, or PCA first.

Computational Complexity Comparison

| Method | Time Complexity | When to Use |
| --- | --- | --- |
| Normal Equation | $O(p^2 n + p^3)$ | Small $p$ (< ~10k features) |
| Gradient Descent | $O(npT)$ | Large datasets, online learning |
| SVD | $O(np^2)$ | Medium datasets, ill-conditioned $X$ |
| scipy.optimize | Varies | Custom loss functions |
| sklearn | $O(np^2)$ (SVD internally) | Production code |

Where $n$ = samples, $p$ = features, $T$ = iterations. For most tabular ML, sklearn's SVD-based approach is the right choice.

Where This Comes From

Legendre (1805) and Gauss (1809)

The method of least squares has one of the most famous priority disputes in science. Adrien-Marie Legendre published it first in 1805, in Nouvelles méthodes pour la détermination des orbites des comètes. But Carl Friedrich Gauss claimed he had been using it since 1795 — and in his 1809 Theoria motus corporum coelestium, he provided the theoretical justification by connecting least squares to maximum likelihood under Gaussian errors.

The original application was astronomy: predicting the orbit of the asteroid Ceres from a handful of noisy observations. The same problem structure — fitting a model to noisy data — appears in every regression task today.

The Normal Equation

Gauss showed that the least-squares estimator satisfies the normal equations:

$$X^\top X\,\boldsymbol{\beta} = X^\top\mathbf{y}$$

The name "normal" comes from the geometric interpretation: the residual vector $\mathbf{y} - X\boldsymbol{\beta}$ is normal (perpendicular) to the column space of $X$. The OLS solution is the orthogonal projection of $\mathbf{y}$ onto the space spanned by the features.
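This orthogonality is easy to check numerically — a sketch reusing the dataset from the setup:

```python
import numpy as np

np.random.seed(42)
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 2.5 * X_raw + 7 + np.random.normal(0, 2.5, n)
X = np.column_stack([np.ones(n), X_raw])

beta = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ beta

# The residual is orthogonal to every column of X — hence "normal" equations
print(X.T @ residual)  # ~ [0, 0] up to floating-point error
```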

The Gauss-Markov Theorem

OLS is optimal — but only under specific assumptions. The classical linear model requires:

  1. Linearity: $y = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$ — the true relationship is linear in the parameters
  2. Exogeneity: $\mathbb{E}[\epsilon_i \mid X] = 0$ — errors are unrelated to the features
  3. Homoscedasticity: $\text{Var}(\epsilon_i) = \sigma^2$ — constant error variance across all observations
  4. No autocorrelation: $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$ — errors are independent
  5. Full rank: $X$ has full column rank — no feature is a perfect linear combination of others

When these hold, the Gauss-Markov theorem guarantees that OLS is the Best Linear Unbiased Estimator (BLUE):

Among all linear unbiased estimators, OLS has the smallest variance.

The key word is unbiased. The theorem doesn't say OLS is the best estimator period — biased estimators like Ridge can achieve lower mean squared error by trading a small bias for a large reduction in variance.

When the assumptions fail, OLS can mislead. Heteroscedastic errors inflate standard errors. Autocorrelated residuals (common in time series) make confidence intervals too narrow. And multicollinearity makes the normal equation numerically unstable — exactly the scenario Ridge regression was designed for.

Further Reading

  • Hastie, Tibshirani & Friedman (2009) The Elements of Statistical Learning — Chapter 3 covers linear regression comprehensively. Free online.
  • Bishop (2006) Pattern Recognition and Machine Learning — Chapter 3 connects linear regression to Bayesian inference
  • Strang (2019) Linear Algebra and Learning from Data — the linear algebra perspective on regression and SVD
  • Cook (1977) Detection of Influential Observation in Linear Regression — the original Cook's distance paper
  • Our GP regression post — see how the normal equation generalises to non-parametric regression

Try It Yourself

The interactive notebook includes exercises:

  1. Learning rate experiments — Try learning rates of 0.0001, 0.005, and 0.05. Plot the loss curve for each. At what rate does gradient descent diverge?
  2. L1 vs L2 regression — Generate data with 3 outliers. Compare the OLS line (L2) with the minimum absolute deviation line (L1, using scipy.optimize with the min_abs_dev_error loss from the original code). Which is more robust?
  3. Polynomial features — Generate data from $y = x^2 + \epsilon$ and fit linear regression with features $[1, x, x^2]$. How does the design matrix change?
  4. Condition number — Create two nearly collinear features ($x_2 = x_1 + \epsilon$). Compare the normal equation vs SVD solutions. What happens to np.linalg.cond(X.T @ X)?

Interactive Tools

  • Regression Playground — Fit linear, polynomial, and regularised models to your own data in the browser
  • Overfitting Explorer — See the bias-variance tradeoff in action as you increase polynomial degree


Frequently Asked Questions

What are the different ways to solve linear regression?

Linear regression can be solved with the normal equation (closed-form matrix algebra), gradient descent (iterative optimisation), QR decomposition (numerically stable matrix factorisation), singular value decomposition (SVD), and regularised variants like Ridge and Lasso. Each method has different trade-offs in speed, stability, and applicability.

When should I use gradient descent instead of the normal equation?

Use gradient descent when the number of features is very large (more than about 10,000), because the normal equation requires forming and inverting a $p \times p$ matrix — a cost that grows cubically with the number of features. Gradient descent scales linearly with the number of features and can also handle datasets too large to fit in memory via mini-batch processing.

What is the difference between Ridge and Lasso regression?

Both add a penalty to prevent overfitting, but Ridge (L2) shrinks coefficients towards zero without eliminating them, while Lasso (L1) can set coefficients exactly to zero, effectively performing feature selection. Use Lasso when you expect many features to be irrelevant; use Ridge when most features contribute to the prediction.

Why does the normal equation sometimes fail?

The normal equation requires inverting the matrix X^T X, which fails when features are perfectly collinear (the matrix is singular). Even near-collinearity causes numerical instability, amplifying rounding errors. QR or SVD-based methods handle these cases more gracefully. Ridge regression also solves this by regularising the matrix.

Is linear regression still useful in the age of deep learning?

Absolutely. Linear regression is interpretable, fast, requires no hyperparameter tuning (for the basic form), and works surprisingly well when the relationship is approximately linear. It serves as a strong baseline, and understanding it deeply is essential for understanding more complex models that build on the same principles.
