Abhijeet Pratap Singh

Posted on Jul 5

Principal Component Analysis (PCA)


1. The Problem It Solves

As datasets grow, they often collect dozens—or even hundreds—of features.

The problem is that many of these features carry almost the same information.

For example:

Page Views
Session Duration
Click Count
Active Minutes

These metrics are often highly correlated.

Feeding all of them into a machine learning model increases computation, introduces multicollinearity, and often adds very little new information.
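To see how redundant such metrics can be, here is a small sketch on synthetic data. The latent "engagement" signal, the column names, and the noise levels are assumptions made up for illustration, not real measurements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical latent "engagement" level that drives all four metrics
engagement = rng.normal(10, 2, 500)

metrics = pd.DataFrame({
    "page_views": engagement * 4 + rng.normal(0, 1, 500),
    "session_duration": engagement * 15 + rng.normal(0, 3, 500),
    "click_count": engagement * 6 + rng.normal(0, 2, 500),
    "active_minutes": engagement * 12 + rng.normal(0, 3, 500),
})

# Pairwise Pearson correlations between the four metrics
corr = metrics.corr()
print(corr.round(2))
```

When every off-diagonal value sits close to 1, the four columns are mostly repeating one signal, which is the situation PCA is designed for.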

Principal Component Analysis (PCA) solves this problem by compressing the dataset into a much smaller set of new variables while preserving as much information as possible.

Instead of removing features, PCA combines them into new synthetic features called Principal Components.

The goal isn't to lose information.

It's to remove redundancy.

2. Core Intuition

Imagine you're taking a photograph of a 3D airplane model.

If you photograph it from the front, you mostly see a thin vertical shape.

You lose almost all of the airplane's structure.

Now rotate the airplane.

Take another picture from above.

Suddenly you capture the wings, the body, and the overall shape.

One photograph contains much more useful information than the other.

PCA does exactly this mathematically.

Instead of changing the data,

it rotates the coordinate system.

It looks for the viewing angle that captures the largest spread in the data.

That becomes the First Principal Component (PC1).

Then it finds a second direction, perpendicular to the first, that captures the next largest amount of variation.

That becomes PC2.

It keeps repeating this, producing at most one direction per original feature, each capturing less variation than the last.
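The "rotating the camera" idea can be checked numerically. The sketch below uses made-up 2D data and scikit-learn's PCA with all components kept. It is only an illustration: the new axes are orthonormal, distances between points do not change, and the first new axis carries more variance than either original axis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic 2D cloud stretched along a diagonal (illustrative data)
x = rng.normal(0, 3, 1000)
y = 0.5 * x + rng.normal(0, 1, 1000)
points = np.column_stack([x, y])

pca = PCA(n_components=2).fit(points)
rotated = pca.transform(points)

# Rows of components_ are the new axes; together they form an orthonormal basis
axes = pca.components_
print(axes @ axes.T)

# Distances between points are unchanged: the data was re-expressed, not distorted
d_before = np.linalg.norm(points[0] - points[1])
d_after = np.linalg.norm(rotated[0] - rotated[1])

# Variance per axis before and after the change of coordinates
var_original = points.var(axis=0, ddof=1)
var_rotated = rotated.var(axis=0, ddof=1)
print(var_original, var_rotated)
```

The total variance is the same in both coordinate systems. PCA only redistributes it so that the first axis gets as much as possible.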

3. How the Algorithm Works

PCA transforms correlated variables into a new set of uncorrelated variables called Principal Components.

Step 1 — Standardize the Data

Since PCA measures variance, every feature must first be placed on the same scale.

Without scaling,

features with larger numerical values dominate the analysis.

This is why StandardScaler is almost always used before PCA.
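A quick way to see the effect of scale is to run PCA with and without standardization. The two features below (an income in dollars and an age in years) are hypothetical and generated independently, so neither should dominate.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Two unrelated, hypothetical features on very different scales
X = np.column_stack([
    rng.normal(50_000, 15_000, 500),  # income in dollars
    rng.normal(40, 10, 500),          # age in years
])

# Share of variance per component, without and with standardization
raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("Unscaled:", raw_ratio.round(4))
print("Scaled:  ", scaled_ratio.round(4))
```

On the raw data the first component is essentially just income, because its numbers are larger. After scaling, the two components share the variance roughly evenly, which matches how the data was generated.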

Step 2 — Center the Data

The mean of every feature is subtracted.

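This shifts the dataset so every feature has a mean of zero. StandardScaler already does this as part of Step 1, and scikit-learn's PCA also centers its input internally.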

Centering ensures PCA measures variation around the center of the data.
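As a minimal sketch of centering, the snippet below subtracts the column means by hand and compares the result with StandardScaler. The three synthetic features and their means are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Three synthetic features with different means and spreads
X = rng.normal(loc=[5.0, 100.0, -2.0], scale=[1.0, 20.0, 0.5], size=(200, 3))

# Manual centering: subtract each column's mean
X_centered = X - X.mean(axis=0)

# StandardScaler centers and also divides by each column's standard deviation
X_scaled = StandardScaler().fit_transform(X)

print(X_centered.mean(axis=0).round(10))
print(X_scaled.mean(axis=0).round(10), X_scaled.std(axis=0).round(10))
```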

Step 3 — Compute the Covariance Matrix

PCA now measures how every feature varies relative to every other feature.

This relationship is captured in the covariance matrix.

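Covariance values that are large in magnitude indicate features that move together (positive) or in opposite directions (negative). On standardized data, the covariance matrix is essentially the correlation matrix.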
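The covariance matrix can be computed by hand from centered data. The sketch below does that on synthetic data (two related columns and one unrelated column, all invented for illustration) and checks the result against NumPy's `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(4)
base = rng.normal(0, 1, 300)

# Columns 0 and 1 share a signal; column 2 is unrelated noise
X = np.column_stack([
    base + rng.normal(0, 0.3, 300),
    2 * base + rng.normal(0, 0.3, 300),
    rng.normal(0, 1, 300),
])

# Manual sample covariance: centered data, then X^T X / (n - 1)
X_centered = X - X.mean(axis=0)
n = X.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)

# NumPy equivalent; rowvar=False means each column is a feature
cov_numpy = np.cov(X, rowvar=False)

print(cov_manual.round(2))
```

The entry for columns 0 and 1 should be large, while the entries involving column 2 should stay near zero.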

Step 4 — Compute Eigenvectors and Eigenvalues

Next, PCA performs an eigendecomposition (or more commonly, Singular Value Decomposition).

The relationship is defined as:

$$\Sigma v = \lambda v$$

Where:

Σ = Covariance matrix
v = Eigenvector (direction)
λ = Eigenvalue (amount of variance captured)

Think of it this way:

Eigenvectors tell you where to look.
Eigenvalues tell you how much information exists in that direction.
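Here is a small sketch of this step using NumPy's `np.linalg.eigh`, which is meant for symmetric matrices such as a covariance matrix. The 2x2 matrix is a made-up example, not taken from any dataset.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])

# eigh returns eigenvalues in ascending order; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Each pair satisfies: cov @ v == eigenvalue * v
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, v, np.allclose(cov @ v, lam * v))
```

The eigenvalues add up to the sum of the diagonal of the matrix, which is the total variance. This is why each eigenvalue can be read as a share of the variance.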

Step 5 — Create Principal Components

The eigenvectors are sorted from highest to lowest eigenvalue.

The first component captures the largest amount of variation.

The second captures the next largest amount.

Each component is perpendicular (orthogonal) to all of the previous ones.

This guarantees that the Principal Components are linearly uncorrelated with one another.
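The steps above can be sketched from scratch in a few lines of NumPy. This is an illustration on synthetic data, not how scikit-learn implements PCA internally. The final comparison with `sklearn.decomposition.PCA` is only a sanity check.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
base = rng.normal(0, 1, 400)

# Synthetic data: two correlated columns and one unrelated column
X = np.column_stack([
    base + rng.normal(0, 0.2, 400),
    -base + rng.normal(0, 0.5, 400),
    rng.normal(0, 1, 400),
])

# Center, then build the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigendecomposition, sorted from highest to lowest eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the sorted eigenvectors to get the component scores
scores = X_centered @ eigenvectors

# Covariance of the scores: diagonal holds the eigenvalues, off-diagonals are ~0
score_cov = np.cov(scores, rowvar=False)
print(score_cov.round(6))

# Sanity check against scikit-learn
sk = PCA().fit(X)
print(sk.explained_variance_.round(6))
```

The off-diagonal entries of `score_cov` are zero up to floating-point error, which is the "uncorrelated components" property in numbers.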

4. What Is PCA Optimizing?

PCA searches for the direction that captures the greatest possible variance.

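Mathematically, it solves:

$$v_1 = \arg\max_{\|v\| = 1} v^\top \Sigma v$$

where Σ is the covariance matrix and v is a unit-length direction. The solution is the eigenvector of Σ with the largest eigenvalue.

The objective is simple: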

Capture the maximum information using the fewest dimensions.

Another way to think about it is that PCA minimizes the amount of information lost when projecting high-dimensional data into a lower-dimensional space.
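Why are "maximize variance" and "minimize information lost" the same thing? A short derivation helps. The notation here is an assumption for the sketch: x_i are the centered data points, n is the number of points, and v is a unit-length direction.

```latex
% Pythagoras for one point: its squared length splits into the part
% along v and the part left over after projecting onto v.
\|x_i\|^2 = (v^\top x_i)^2 + \|x_i - (v^\top x_i)\,v\|^2

% Averaging over all n points:
\frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2
  = \underbrace{\frac{1}{n}\sum_{i=1}^{n} (v^\top x_i)^2}_{\text{variance kept along } v}
  + \underbrace{\frac{1}{n}\sum_{i=1}^{n} \|x_i - (v^\top x_i)\,v\|^2}_{\text{reconstruction error}}
```

The left-hand side does not depend on v. So the direction that makes the first term as large as possible also makes the second term as small as possible.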

5. Explained Variance

Every Principal Component explains part of the dataset's total variance.

Example:

Component	Variance Explained
PC1	58%
PC2	24%
PC3	10%
PC4	5%
Remaining	3%

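If the first two components explain 82% of the variance, you can reduce dozens of original features down to just two components and still keep most of the variation.

Whether losing the remaining 18% is acceptable depends on the downstream task.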
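The arithmetic behind this decision is a running sum. The sketch below uses the percentages from the example table. The 90% threshold is an assumed target for illustration.

```python
import numpy as np

# Explained variance for PC1..PC4, taken from the example table
explained = np.array([0.58, 0.24, 0.10, 0.05])

# Running total of variance kept as components are added
cumulative = np.cumsum(explained)
print(cumulative)

# Smallest number of components whose cumulative variance reaches the target
threshold = 0.90  # assumed target
n_components = int(np.argmax(cumulative >= threshold)) + 1
print("Components needed:", n_components)
```

With these numbers, two components reach 82% and a third is needed to pass 90%.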

6. When Should You Use PCA?

PCA works well when:

There are many correlated numerical features.
Dimensionality reduction is needed.
Models train slowly because of many variables.
Visualization of high-dimensional data is required.
Multicollinearity is hurting downstream models.

Typical applications include:

Feature reduction
Data visualization
Image compression
Face recognition
Recommendation systems
Bioinformatics
Financial modeling

7. Advantages

PCA offers several important benefits.

Reduces dimensionality.
Removes multicollinearity.
Can speed up training and inference for downstream models.
Reduces storage requirements.
Improves visualization.
Eliminates redundant information.
Creates completely uncorrelated features.

8. When It Starts Breaking Down

Despite its usefulness, PCA has several limitations.

Difficult to Interpret

The new Principal Components are mathematical combinations of many features.

Instead of saying

"Session Duration caused the prediction,"

you now have

"Principal Component 2 contributed most."

This is much harder to explain to business stakeholders.
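One partial remedy is to inspect how much each original feature contributes to a component. The sketch below reads the weights from scikit-learn's `components_` attribute. The data and the `Account_Age` column are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
base = rng.normal(10, 2, 300)

df = pd.DataFrame({
    "Session_Time": base * 15 + rng.normal(0, 3, 300),
    "Page_Views": base * 4 + rng.normal(0, 1, 300),
    "Account_Age": rng.normal(24, 6, 300),  # hypothetical, unrelated to the others
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))

# Each row is one component; each column is the weight of an original feature
weights = pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2"])
print(weights.round(2))
```

A component that puts most of its weight on Session_Time and Page_Views can be described to stakeholders as something like "overall engagement", though that label is an interpretation rather than something PCA provides.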

Assumes Linear Relationships

PCA only discovers linear directions.

It cannot capture curved or highly non-linear structures.

Methods like Kernel PCA, t-SNE, or UMAP perform better on non-linear data.

Sensitive to Scaling

Without feature scaling,

variables with larger units dominate the variance calculations.

This produces misleading components.

Sensitive to Outliers

A few extreme observations can dramatically change the direction of the Principal Components because variance depends on squared distances.
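A single extreme point is enough to show this. In the sketch below, the data and the outlier at (0, 100) are made up. The clean data varies mostly along the first axis, yet one outlier on the second axis redirects PC1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Clean data: wide spread on axis 0, narrow spread on axis 1
clean = np.column_stack([rng.normal(0, 3, 200), rng.normal(0, 0.5, 200)])

# Same data plus one extreme point far out along axis 1
with_outlier = np.vstack([clean, [0.0, 100.0]])

pc1_clean = PCA(n_components=1).fit(clean).components_[0]
pc1_outlier = PCA(n_components=1).fit(with_outlier).components_[0]

print("PC1 without outlier:", pc1_clean.round(3))
print("PC1 with outlier:   ", pc1_outlier.round(3))
```

Inspecting or capping extreme values before PCA is a common precaution for this reason.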

9. Python Implementation

```python
import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate four correlated features from one shared base signal
np.random.seed(42)

base = np.random.normal(10, 2, 100)

df = pd.DataFrame({
    "Feature_A": base * 2.5 + np.random.normal(0, 0.5, 100),
    "Feature_B": base * 1.2 + np.random.normal(0, 0.2, 100),
    "Session_Time": base * 15 + np.random.normal(0, 3, 100),
    "Page_Views": base * 4 + np.random.normal(0, 1, 100)
})

# Standardize: zero mean and unit variance for every feature
scaler = StandardScaler()

X_scaled = scaler.fit_transform(df)

# Apply PCA, keeping the fewest components that retain at least 90% of the variance
pca = PCA(n_components=0.90)

X_pca = pca.fit_transform(X_scaled)

print("Original Features:", df.shape[1])
print("Principal Components:", X_pca.shape[1])

print("\nExplained Variance Ratio")
print(pca.explained_variance_ratio_)

print("\nTotal Variance Retained")
print(pca.explained_variance_ratio_.sum())
```

10. How to Evaluate PCA

Unlike supervised learning, PCA has no labels to predict, so there is no accuracy score.

Instead, we evaluate how much information is retained.

Explained Variance Ratio

Shows how much variance each Principal Component captures.

Higher values indicate more informative components.

Cumulative Explained Variance

Measures the total variance preserved after selecting multiple components.

Many practitioners retain 90–95% of the total variance.

Scree Plot

A Scree Plot graphs the explained variance of each component.

The "elbow" helps determine how many components should be kept.

Reconstruction Error

Measures how much information is lost when reconstructing the original dataset from the reduced components.

Lower reconstruction error indicates better compression.
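Reconstruction error can be measured with scikit-learn's `inverse_transform`, which maps the reduced data back to the original feature space. The sketch below uses synthetic data and mean squared error as the error measure. Both are choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
base = rng.normal(0, 1, 300)

# Four synthetic features driven by one shared signal plus noise
X = np.column_stack([base * s + rng.normal(0, 0.3, 300) for s in (1.0, 2.0, -1.5, 0.5)])
X_scaled = StandardScaler().fit_transform(X)

errors = {}    # mean squared reconstruction error for each k
retained = {}  # cumulative explained variance ratio for each k
for k in range(1, 5):
    pca = PCA(n_components=k).fit(X_scaled)
    X_restored = pca.inverse_transform(pca.transform(X_scaled))
    errors[k] = float(np.mean((X_scaled - X_restored) ** 2))
    retained[k] = float(pca.explained_variance_ratio_.sum())

for k in errors:
    print(k, round(errors[k], 4), round(retained[k], 4))
```

On standardized data the two metrics are two views of the same quantity: the mean squared error equals one minus the retained variance ratio. Keeping all four components brings the error to zero.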

11. Real-World Engineering Notes

Some practical lessons you'll quickly discover:

Standardize numerical features before PCA whenever they are measured in different units or ranges.
PCA is often used as a preprocessing step before Logistic Regression, SVMs, and Neural Networks.
Tree-based models (Decision Trees, Random Forests, XGBoost) usually don't benefit much from PCA because they naturally handle correlated features.
PCA is excellent for visualization—reducing hundreds of dimensions down to two or three makes complex datasets much easier to explore.
The first few components often capture the vast majority of useful information.
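In practice, scaling, PCA, and the model are often chained in one scikit-learn pipeline, so the scaler and PCA are fitted on the training data only. The sketch below is one possible setup. The synthetic dataset from `make_classification` and the 95% variance target are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, 10 of which are linear combinations of 5 informative ones
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale -> PCA (keep 95% of the variance) -> Logistic Regression
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
accuracy = model.score(X_test, y_test)
print("Components kept:", n_kept, "of", X.shape[1])
print("Test accuracy:", round(accuracy, 3))
```

Putting PCA inside the pipeline avoids leaking information from the test set into the fitted components.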

12. PCA vs Feature Selection

These are often confused, but they solve different problems.

Feature Selection	PCA
Keeps original features	Creates entirely new features
Easy to interpret	Hard to interpret
Removes unnecessary columns	Combines existing columns
Human-readable	Mathematical representation

13. Key Takeaways

PCA is an unsupervised dimensionality reduction algorithm.
It compresses many correlated features into a smaller set of uncorrelated Principal Components.
The algorithm works by finding the directions that capture the maximum variance in the data.
Principal Components are orthogonal, eliminating multicollinearity.
Feature scaling is essential before applying PCA.
PCA improves computational efficiency and reduces redundancy but sacrifices interpretability because the resulting components are mathematical combinations of the original features.

DEV Community