Berkan Sesen

Posted on Jun 12 • Originally published at sesen.ai

LDA vs PCA: Supervised Meets Unsupervised Dimensionality Reduction

#supervisedlearning #unsupervisedlearning #dimensionalityreduction #statistics

You have a high-dimensional dataset and you need to squeeze it down to two or three dimensions for visualisation or downstream modelling. The go-to move is PCA, and most of the time it works. But consider the Iris dataset: four petal and sepal measurements for 150 flowers across three species. Run PCA and two of the three species land right on top of each other. It looks like they are not separable. But PCA had no idea the species labels existed; it found the direction of maximum variance, not maximum separation. Switch to Linear Discriminant Analysis (LDA) and the same data fans out into three clean clusters.

That experience stuck with me. Dimensionality reduction is one of the most common preprocessing steps in machine learning, yet the choice between supervised and unsupervised reduction is rarely discussed. In a previous post on PCR vs PLS, we explored this contrast for regression. This post draws the same line for classification: PCA ignores labels and maximises variance; LDA uses labels and maximises class separation.

By the end, you will understand both eigenvalue problems, see exactly when PCA fails, and know how to choose between them for your next classification task.

Let's Build It


We will see LDA in action before diving into the maths. The animation below shows Iris data in its original feature space on the left, then smoothly transitioning into the LDA projection where the three species separate cleanly.

Now let's run both PCA and LDA on the classic Iris dataset (150 flowers, 4 features, 3 species).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

# PCA: log-transform and standardise (faithful to the original R analysis)
X_log = np.log(X)
scaler = StandardScaler()
X_log_scaled = scaler.fit_transform(X_log)

pca = PCA()
pca.fit(X_log_scaled)
X_pca = X_log_scaled @ pca.components_[:2].T  # manual projection onto first 2 PCs

# LDA: on raw features (faithful to the R code: lda(Species ~ ., data = iris))
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
X_lda = X @ lda.scalings_[:, :2]  # manual projection onto first 2 LDs

PCA: First 2 PCs explain 96.0% of total variance
LDA: LD1 explains 99.1% of between-class variance

The left panel (PCA) captures 96% of the total variance, yet versicolor and virginica overlap heavily. The right panel (LDA) finds directions that separate all three species. Same data, different objectives, very different outcomes.
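
The two percentages quoted above are not printed by the snippet itself. A minimal sketch of how they could be read off the fitted estimators, assuming scikit-learn's `explained_variance_ratio_` attribute on both `PCA` and `LinearDiscriminantAnalysis` (the variable names `pca_two` and `lda_first` are illustrative, not from the original notebook):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Same preprocessing as above: log-transform, standardise, then fit PCA
X_log_scaled = StandardScaler().fit_transform(np.log(X))
pca = PCA().fit(X_log_scaled)
pca_two = pca.explained_variance_ratio_[:2].sum()  # share of total variance in PC1 + PC2

# LDA on the raw features, as in the snippet above
lda = LinearDiscriminantAnalysis().fit(X, y)
lda_first = lda.explained_variance_ratio_[0]  # share of between-class variance in LD1

print(f"PCA: First 2 PCs explain {pca_two:.1%} of total variance")
print(f"LDA: LD1 explains {lda_first:.1%} of between-class variance")
```

Note that the two ratios are measured against different totals (total variance for PCA, between-class variance for LDA), so they are not directly comparable.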

What Just Happened?

PCA: Maximising Total Variance

PCA takes log-transformed, standardised features and computes the covariance matrix. It then finds the eigenvectors of that matrix, sorted by eigenvalue. The first eigenvector (PC1) points in the direction of maximum variance across all samples, regardless of class. The second (PC2) is the direction of maximum remaining variance, orthogonal to PC1.

The log transform and standardisation are faithful to the original R analysis: log(iris[,1:4]) with center=TRUE, scale.=TRUE. Log-transforming skewed measurements like petal length brings the distribution closer to normal and stabilises variance. Standardising ensures no single feature dominates simply because of its scale.

We project manually with X_log_scaled @ pca.components_[:2].T, exactly the matrix multiplication the R code performs (scale(log.ir) %*% ir.pca$rotation[,1:2]). The result is a 150 x 2 matrix of coordinates in the new PCA space.
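
To make the "eigenvectors of the covariance matrix" step concrete, here is a small NumPy-only sketch. It uses hypothetical synthetic data rather than Iris, and the names (`A`, `Z`, `scores`) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 200 samples, 3 features
A = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.2, 0.3]])
X = rng.standard_normal((200, 3)) @ A.T

# Standardise each column (zero mean, unit variance)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix and its eigendecomposition (eigh is for symmetric matrices)
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA wants descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first two principal components (one column per PC)
scores = Z @ eigvecs[:, :2]
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is why the eigenvalues can be read as "variance explained".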

LDA: Maximising Class Separation

LDA takes the raw features (no log, no scaling) and uses the class labels to define two scatter matrices:

Within-class scatter $S_W$ : how spread out the data is within each class
Between-class scatter $S_B$ : how far apart the class means are

It then solves the generalised eigenvalue problem $S_B \mathbf{w} = \lambda S_W \mathbf{w}$ (equivalently, it finds the eigenvectors of $S_W^{-1} S_B$), which yields the directions that maximise the ratio of between-class to within-class variance. For $K = 3$ classes, there are at most $K - 1 = 2$ discriminant directions.

The R code uses lda(Species ~ ., data = iris) on raw features, so we do the same. The projection X @ lda.scalings_[:, :2] mirrors the R code's as.matrix(iris[,1:4]) %*% r$scaling[,1:2].

Why PCA Shows Overlap on Iris

Look at the scree plot and discriminant importance side by side:

PC1 captures 73.3% of total variance, and the first two PCs together capture 96.0%. That sounds impressive, but "total variance" is not the same as "class-discriminating variance". Versicolor and virginica differ along a direction that is not the max-variance direction. Setosa separates because it is genuinely far from the other two in every direction, but the versicolor/virginica boundary falls along a lower-variance axis that PCA downweights.

LD1, by contrast, captures 99.1% of the between-class variance. It specifically seeks the direction along which the class means are maximally spread relative to within-class scatter.

The Loadings Tell the Story

The biplot below shows which features drive each principal component. Red arrows represent feature loadings: the direction and magnitude of each feature's contribution.

Petal length and petal width point strongly along PC1, which is also the direction that separates setosa from the other two. Sepal width points in a different direction (roughly along PC2). The versicolor/virginica separation requires a combination that PCA does not prioritise because it does not know the labels.

Going Deeper

The Two Eigenvalue Problems

Both PCA and LDA solve eigenvalue problems, but with different matrices.

PCA solves:

$$\Sigma \mathbf{v} = \lambda \mathbf{v}$$

where $\Sigma$ is the covariance matrix of the (standardised) data. The eigenvectors $\mathbf{v}$ are the principal components; the eigenvalues $\lambda$ are the variances along each direction.

LDA solves:

$$S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w}$$

where $S_W = \sum_{k=1}^{K} \sum_{i \in C_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T$ is the within-class scatter matrix and $S_B = \sum_{k=1}^{K} n_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$ is the between-class scatter matrix. The eigenvectors $\mathbf{w}$ are the discriminant directions; the eigenvalues $\lambda$ measure the ratio of between-class to within-class variance along each direction.
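
The scatter matrices defined above can be built in a few lines of NumPy. This is a sketch on hypothetical three-class synthetic data (not Iris), and it is not necessarily how scikit-learn computes its discriminants internally:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 3 classes, 4 features, 50 samples per class
means = np.array([[0, 0, 0, 0], [2, 1, 0, 0], [0, 3, 1, 0]], dtype=float)
X = np.vstack([rng.standard_normal((50, 4)) + m for m in means])
y = np.repeat([0, 1, 2], 50)

mu = X.mean(axis=0)
S_W = np.zeros((4, 4))
S_B = np.zeros((4, 4))
for k in np.unique(y):
    X_k = X[y == k]
    mu_k = X_k.mean(axis=0)
    centred = X_k - mu_k
    S_W += centred.T @ centred                 # within-class scatter
    diff = (mu_k - mu).reshape(-1, 1)
    S_B += len(X_k) * (diff @ diff.T)          # between-class scatter

# Eigenvectors of S_W^{-1} S_B; solve() avoids forming the inverse explicitly
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
eigvals, eigvecs = eigvals.real, eigvecs.real  # drop round-off imaginary parts
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :2]                             # at most K - 1 = 2 useful directions
X_lda = X @ W
ratio = eigvals[:2] / eigvals[:2].sum()
print(ratio)
```

Because $S_B$ is built from $K$ class means that share one overall mean, its rank is at most $K - 1$, so only the first two eigenvalues are non-zero here.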

The key difference: PCA's objective is purely geometric (find the axes of maximum spread), while LDA's objective is explicitly discriminative (find the axes that best separate known classes).

The Adversarial Example: When PCA Completely Fails

The "aha" moment comes from a simple 2D dataset designed to break PCA. Imagine two classes that differ only along a low-variance axis: both classes have wide horizontal spread (high x-variance) but are separated vertically (small y-gap).

PCA finds the horizontal direction (red arrow, PC1) because it has the most variance. But that direction is orthogonal to the class boundary. Projecting onto PC1 mixes the two classes completely.

LDA finds the vertical direction (green arrow, LD1) because that is where the class means differ. Projecting onto LD1 separates them cleanly.

# Adversarial dataset: max variance perpendicular to class boundary
np.random.seed(42)
n = 150
X_0 = np.column_stack([np.random.randn(n) * 3, np.random.randn(n) * 0.5 - 1])
X_1 = np.column_stack([np.random.randn(n) * 3, np.random.randn(n) * 0.5 + 1])
X_adv = np.vstack([X_0, X_1])
y_adv = np.array([0]*n + [1]*n)

# PCA projection: complete overlap
pca_adv = PCA(n_components=1).fit_transform(X_adv)

# LDA projection: clean separation
lda_adv = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_adv, y_adv)

This is the fundamental lesson: maximum variance and maximum class separation are different objectives. They coincide when class means lie along the direction of maximum variance (which is why PCA works well for setosa). They diverge when the class boundary runs along a low-variance axis.
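
One way to put a number on "complete overlap" versus "clean separation" is the gap between class means measured in pooled standard deviations. The sketch below is NumPy-only and rebuilds a dataset of the same shape with a different random generator, so individual values will not match the snippet above; `separation` is a hypothetical helper, and the LDA direction uses the standard two-class Fisher formula $S_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 150
# Same construction as above: wide in x, classes offset by +/-1 in y
X_0 = np.column_stack([rng.standard_normal(n) * 3, rng.standard_normal(n) * 0.5 - 1])
X_1 = np.column_stack([rng.standard_normal(n) * 3, rng.standard_normal(n) * 0.5 + 1])
X_adv = np.vstack([X_0, X_1])

def separation(z0, z1):
    """Gap between class means in units of pooled standard deviation."""
    pooled_sd = np.sqrt((z0.var(ddof=1) + z1.var(ddof=1)) / 2)
    return abs(z0.mean() - z1.mean()) / pooled_sd

# PC1: leading eigenvector of the overall covariance (labels ignored)
_, vecs = np.linalg.eigh(np.cov(X_adv, rowvar=False))
pc1 = vecs[:, -1]  # eigh sorts ascending, so the last column is PC1

# Two-class Fisher direction: S_W^{-1} (mu_1 - mu_0)
S_W = (n - 1) * (np.cov(X_0, rowvar=False) + np.cov(X_1, rowvar=False))
ld1 = np.linalg.solve(S_W, X_1.mean(axis=0) - X_0.mean(axis=0))

sep_pca = separation(X_0 @ pc1, X_1 @ pc1)
sep_lda = separation(X_0 @ ld1, X_1 @ ld1)
print(f"separation along PC1: {sep_pca:.2f}, along LD1: {sep_lda:.2f}")
```

A separation near zero means the projected classes sit on top of each other; a value of several standard deviations means they barely touch.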

Hyperparameters and Practical Considerations

n_components for PCA: Choose using the scree plot or cumulative variance threshold (e.g., retain 95% of variance). More components preserve more information but risk noise.

n_components for LDA: At most $\min(K - 1, p)$, where $K$ is the number of classes and $p$ the number of features. For Iris (3 classes, 4 features), the maximum is 2. This is both a strength (extreme compression) and a limitation (cannot capture more dimensions than classes minus one).
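
As an illustration of the cumulative-variance rule, the following NumPy-only sketch picks the smallest number of components that reaches a 95% threshold. The data are hypothetical (ten features driven by three latent factors), and `n_keep` and `max_lda_components` are illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 10 features driven by 3 latent factors plus small noise
latent = rng.standard_normal((300, 3)) * np.array([5.0, 3.0, 2.0])
X = latent @ rng.standard_normal((3, 10)) + 0.1 * rng.standard_normal((300, 10))

X_centred = X - X.mean(axis=0)
# Squared singular values of the centred data are proportional to component variances
singular_values = np.linalg.svd(X_centred, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

threshold = 0.95
# Smallest number of components whose cumulative ratio reaches the threshold
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[:n_keep])

# For comparison, the LDA cap depends only on class and feature counts
n_classes, n_features = 3, X.shape[1]
max_lda_components = min(n_classes - 1, n_features)
```

The threshold itself is a judgement call; 95% is the example used in the text, not a universal default.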

When NOT to use LDA:

Few labels or unreliable labels: LDA trusts the labels completely. Noisy labels corrupt $S_B$ and $S_W$
Highly non-linear boundaries: Fisher's LDA finds linear projections. If classes are separated by a curve, LDA may fail (consider kernel LDA)
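$p \gg n$ (many more features than samples): $S_W$ becomes singular. Use regularised LDA (in sklearn, shrinkage='auto' together with solver='lsqr' or solver='eigen'; the default 'svd' solver does not support shrinkage)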
Gaussian assumption violated: LDA assumes each class is roughly Gaussian with equal covariance. Heavy violations degrade performance
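
For the $p \gg n$ case in the list above, a minimal sketch of shrinkage LDA in scikit-learn might look like this. The data are hypothetical, and the 'eigen' solver is chosen here because it supports both shrinkage and `transform`:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical p >> n setting: 60 samples, 500 features, 3 classes
n_per_class, p = 20, 500
y = np.repeat([0, 1, 2], n_per_class)
X = rng.standard_normal((3 * n_per_class, p))
X[:, :5] += y[:, None] * 2.0  # only the first 5 features carry class signal

# The default 'svd' solver does not accept shrinkage;
# 'eigen' supports shrinkage and can still project the data
lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto")
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)
```

With `shrinkage="auto"` the amount of regularisation is estimated from the data (Ledoit-Wolf), so there is no extra hyperparameter to tune in this sketch.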

Accuracy on Reduced Features

How much does the choice of reduction method matter for downstream classification? We ran 5-fold cross-validated k-NN (k=5) on Iris features reduced by PCA vs LDA.

With just one component, LDA achieves 96.7% accuracy while PCA manages only 87.3%. The gap narrows with more components, but LDA reaches its ceiling (98.0%) at 2 components, which is its maximum for a 3-class problem. PCA needs 3 components to hit 92.7% and never quite catches up.

This mirrors the PCR vs PLS story: supervised reduction needs fewer components because it knows what matters.
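
A comparison of this kind could be set up roughly as follows. This is a sketch, not the exact experiment behind the figures quoted above: the preprocessing (plain standardisation) and fold assignment are assumptions, so the accuracies it prints may differ from those reported. `cv_accuracy` is a hypothetical helper.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

def cv_accuracy(reducer):
    """Mean 5-fold accuracy of 5-NN on features produced by `reducer`."""
    # The reducer sits inside the pipeline, so it is refit on each training fold
    model = make_pipeline(StandardScaler(), reducer, KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(model, X, y, cv=5).mean()

results = {
    "PCA, 1 component": cv_accuracy(PCA(n_components=1)),
    "PCA, 2 components": cv_accuracy(PCA(n_components=2)),
    "LDA, 1 component": cv_accuracy(LinearDiscriminantAnalysis(n_components=1)),
    "LDA, 2 components": cv_accuracy(LinearDiscriminantAnalysis(n_components=2)),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the reduction step inside the pipeline matters most for LDA: fitting it on the full dataset before cross-validation would leak label information from the test folds.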

Where This Comes From

Fisher's Original Paper (1936)

Fisher's LDA was introduced in one of the most cited papers in statistics: "The Use of Multiple Measurements in Taxonomic Problems" (Annals of Eugenics, 1936). The paper used the very same Iris dataset we analysed above, which was collected by Edgar Anderson and handed to Fisher for statistical analysis.

Fisher's insight was to find a linear combination of measurements that maximises the ratio of between-group to within-group variance. In the two-class case, this reduces to:

$$\mathbf{w} \propto S_W^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

This is a single vector that points from one class mean to the other, corrected for within-class spread. The multi-class generalisation (which we used) was developed by C.R. Rao in the 1940s, extending Fisher's two-class criterion to $K$ classes via the $S_W^{-1} S_B$ eigenvalue problem.

PCA's Parallel History

PCA traces to Karl Pearson (1901), who proposed fitting lines and planes to data by minimising orthogonal distances. Harold Hotelling (1933) formalised the method using the covariance matrix eigendecomposition in "Analysis of a Complex of Statistical Variables into Principal Components". Both approaches converge to the same solution: the eigenvectors of the covariance matrix.

The Connection

Both methods solve eigenvalue problems, and both produce linearly independent projection directions (orthogonal for PCA; for LDA, generally not orthogonal in the original feature space). The critical difference is which matrix they decompose. PCA decomposes the covariance matrix (total spread). LDA decomposes $S_W^{-1} S_B$, the between-class scatter relative to the within-class scatter (discriminative spread). The contrast parallels that between K-Means clustering (unsupervised) and classification (supervised).

Modern Variants

Since Fisher and Hotelling, both methods have been extended:

Kernel PCA / Kernel LDA: Apply the kernel trick for non-linear boundaries. Map data into a high-dimensional feature space where linear methods become non-linear in the original space
Regularised LDA: Add a ridge penalty to $S_W$ to handle singular or near-singular within-class scatter (essential when $p > n$)
Quadratic Discriminant Analysis (QDA): Drops the equal-covariance assumption, fitting a separate covariance matrix per class
Sparse PCA: Adds an $\ell_1$ penalty to produce interpretable components with few non-zero loadings

For a thorough treatment, see Bishop's Pattern Recognition and Machine Learning, Chapter 4.1 (Fisher's discriminant) and Chapter 12.1 (PCA), or Raschka's accessible 2014 tutorial which was the inspiration for the original R code we translated.

Frequently Asked Questions

What is the difference between LDA and PCA?

PCA finds directions of maximum variance in the data, ignoring class labels entirely (unsupervised). LDA finds directions that maximise the separation between classes (supervised). PCA is for general dimensionality reduction; LDA is specifically for improving class discrimination. They can produce very different projections on the same data.

Can I use LDA for more than two classes?

Yes. LDA generalises to multi-class problems, projecting data into at most (k-1) dimensions where k is the number of classes. For a 10-class problem, LDA can reduce to at most 9 dimensions. This is often enough to capture the class structure while dramatically reducing dimensionality.

When should I use PCA instead of LDA?

Use PCA when you have no class labels, when your goal is general-purpose dimensionality reduction (visualisation, compression, noise removal), or when you want to preprocess data before applying another algorithm. PCA is also more robust when class boundaries are not well-defined or when classes overlap heavily.

Does PCA always lose useful information for classification?

Not necessarily. PCA preserves the directions of highest variance, which often align with class-discriminative directions in practice. However, PCA can discard low-variance directions that happen to be highly discriminative. If you have labels and care about classification, LDA or supervised methods are safer choices.

Can I combine PCA and LDA?

Yes, and it is common practice. When the number of features is much larger than the number of samples, LDA's within-class scatter matrix $S_W$ becomes singular and cannot be inverted. Applying PCA first to reduce dimensionality (while retaining most variance) and then applying LDA on the reduced features is a standard and effective pipeline.
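
A minimal sketch of such a pipeline in scikit-learn, on hypothetical data with more features than samples. The choice of 20 principal components is an arbitrary assumption for illustration and would normally be tuned:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Hypothetical data: 90 samples, 300 features, 3 classes
n_per_class, p = 30, 300
y = np.repeat([0, 1, 2], n_per_class)
X = rng.standard_normal((3 * n_per_class, p))
X[:, :10] += y[:, None] * 1.5  # class signal lives in the first 10 features

# PCA first (unsupervised, 300 -> 20 dims), then LDA (supervised, 20 -> 2 dims)
pipeline = make_pipeline(
    PCA(n_components=20),
    LinearDiscriminantAnalysis(n_components=2),
)
X_2d = pipeline.fit_transform(X, y)
print(X_2d.shape)
```

One caveat of this design: the PCA step can still discard a low-variance but discriminative direction before LDA ever sees it, so the number of retained components should err on the generous side.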
