1. The Problem It Solves
As datasets grow, they often collect dozens—or even hundreds—of features.
The problem is that many of these features carry almost the same information.
For example:
- Page Views
- Session Duration
- Click Count
- Active Minutes
These metrics are often highly correlated.
Feeding all of them into a machine learning model increases computation, introduces multicollinearity, and often adds very little new information.
Principal Component Analysis (PCA) solves this problem by compressing the dataset into a much smaller set of new variables while preserving as much information as possible.
Instead of removing features, PCA combines them into new synthetic features called Principal Components.
The goal isn't to lose information.
It's to remove redundancy.
2. Core Intuition
Imagine you're taking a photograph of a 3D airplane model.
If you photograph it from the front, you mostly see a thin vertical shape.
You lose almost all of the airplane's structure.
Now rotate the airplane.
Take another picture from above.
Suddenly you capture the wings, the body, and the overall shape.
One photograph contains much more useful information than the other.
PCA does exactly this mathematically.
Instead of changing the data, it rotates the coordinate system.
It looks for the viewing angle that captures the largest spread in the data.
That becomes the First Principal Component (PC1).
Then it finds a second direction, perpendicular to the first, that captures the next largest amount of variation.
That becomes PC2.
It keeps repeating this until there are as many components as original features, ordered from most to least variance captured.
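To make the "best viewing angle" idea concrete, here is a minimal sketch on made-up 2D data. It scans many candidate directions by brute force, measures the spread along each one, and compares the winner with scikit-learn's first component. The data and variable names are illustrative assumptions, and real PCA does not search angles this way.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 2D data: the second feature mostly follows the first.
x = rng.normal(0, 1, 500)
X = np.column_stack([x, 0.5 * x + rng.normal(0, 0.2, 500)])
X = X - X.mean(axis=0)  # center so spread is measured around the origin

# Try many "viewing angles" and measure the variance of the data along each one.
angles = np.linspace(0, np.pi, 1800, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])  # unit vectors
spread = (X @ directions.T).var(axis=0)
best_direction = directions[np.argmax(spread)]

# PCA's first component should point the same way (up to sign).
pc1 = PCA(n_components=1).fit(X).components_[0]
alignment = abs(best_direction @ pc1)
print("Best scanned direction:", best_direction.round(4))
print("PC1 from scikit-learn: ", pc1.round(4))
print("Alignment (1.0 = identical):", round(float(alignment), 4))
```

The sign of a component is arbitrary, which is why the comparison uses the absolute value of the dot product.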
3. How the Algorithm Works
PCA transforms correlated variables into a new set of uncorrelated variables called Principal Components.
Step 1 — Standardize the Data
Since PCA measures variance, every feature must first be placed on the same scale.
Without scaling, features with larger numerical values dominate the analysis.
This is why StandardScaler is almost always used before PCA.
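A quick way to see the scaling problem is to run PCA twice on the same data, once raw and once standardized. The sketch below uses two invented, independent features (`income` and `rating` are hypothetical names) on very different scales.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two independent hypothetical features on very different scales.
income = rng.normal(50_000, 10_000, 300)   # large numbers
rating = rng.normal(3.5, 0.8, 300)         # small numbers
X = np.column_stack([income, rating])

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("Unscaled:", raw_ratio.round(4))     # PC1 is essentially just income
print("Scaled:  ", scaled_ratio.round(4))  # variance is shared far more evenly
```

On the raw data, the first component is little more than the large-scale feature, even though neither feature is more "important" than the other.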
Step 2 — Center the Data
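The mean of every feature is subtracted, which shifts the dataset so every feature has a mean of zero.
Centering ensures PCA measures variation around the center of the data rather than around the origin.
StandardScaler already subtracts the mean as part of Step 1, so in practice the two steps happen together.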
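As a small check of what centering does, the following sketch (on invented data) centers by hand and compares the result with StandardScaler, which centers and rescales in one call.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical data: two features with different means and spreads.
X = rng.normal(loc=[10.0, 200.0], scale=[2.0, 30.0], size=(100, 2))

X_centered = X - X.mean(axis=0)               # manual centering only
X_scaled = StandardScaler().fit_transform(X)  # centers and scales to unit variance

print("Means after centering:     ", X_centered.mean(axis=0).round(10))
print("Means after StandardScaler:", X_scaled.mean(axis=0).round(10))
print("Std after StandardScaler:  ", X_scaled.std(axis=0).round(10))
```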
Step 3 — Compute the Covariance Matrix
PCA now measures how every feature varies relative to every other feature.
This relationship is captured in the covariance matrix.
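Covariance values that are large in absolute terms indicate strong linear relationships between features; the sign shows whether they move together or in opposite directions.
On standardized data, the covariance matrix is essentially the correlation matrix, so its entries are directly comparable.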
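The covariance matrix can be computed directly from its definition. This sketch builds three made-up features (two related, one unrelated) and checks the manual result against `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(0, 1, 200)
# Hypothetical features: two follow `base`, one is unrelated noise.
X = np.column_stack([
    base + rng.normal(0, 0.1, 200),
    2 * base + rng.normal(0, 0.1, 200),
    rng.normal(0, 1, 200),
])
X_centered = X - X.mean(axis=0)

# Covariance matrix from its definition: (X^T X) / (n - 1) on centered data.
n_samples = X_centered.shape[0]
cov_manual = X_centered.T @ X_centered / (n_samples - 1)
cov_numpy = np.cov(X, rowvar=False)  # rowvar=False: columns are features

print(cov_numpy.round(2))
print("Manual and NumPy agree:", np.allclose(cov_manual, cov_numpy))
```

The entry linking the two related features should be much larger than the entries involving the noise column.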
Step 4 — Compute Eigenvectors and Eigenvalues
Next, PCA performs an eigendecomposition (or more commonly, Singular Value Decomposition).
The relationship is defined as:
$$\Sigma v = \lambda v$$
Where:
- Σ = Covariance matrix from Step 3
- v = Eigenvector (direction)
- λ = Eigenvalue (amount of variance captured)
Think of it this way:
- Eigenvectors tell you where to look.
- Eigenvalues tell you how much information exists in that direction.
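Here is a minimal sketch of the eigendecomposition on a small covariance matrix whose values are made up for illustration. It uses `np.linalg.eigh`, which is designed for symmetric matrices, and checks that multiplying the matrix by an eigenvector only rescales that vector by its eigenvalue.

```python
import numpy as np

# A small symmetric covariance matrix, made up for illustration.
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])

# eigh returns eigenvalues in ascending order and
# eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Check the defining relationship: cov @ v equals lam * v.
    holds = np.allclose(cov @ v, lam * v)
    print(f"lambda={lam:.4f}  v={v.round(4)}  holds={holds}")

# The eigenvalues add up to the total variance (the trace of the matrix).
print("Sum of eigenvalues:", eigenvalues.sum(), " trace:", np.trace(cov))
```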
Step 5 — Create Principal Components
The eigenvectors are sorted from highest to lowest eigenvalue.
The first component captures the largest amount of variation.
The second captures the next largest amount.
Each component is perpendicular (orthogonal) to all of the previous ones.
Because the directions are eigenvectors of the covariance matrix, the resulting Principal Components are also uncorrelated with one another.
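Putting the steps together, the following is a from-scratch sketch in NumPy on invented data, not how scikit-learn implements PCA internally (it relies on SVD). It sorts the eigenvectors, projects the data, and confirms that the resulting components have a diagonal covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
base = rng.normal(0, 1, 300)
# Hypothetical data: two correlated features plus one unrelated feature.
X = np.column_stack([base + rng.normal(0, 0.3, 300),
                     base + rng.normal(0, 0.3, 300),
                     rng.normal(0, 1, 300)])
X_centered = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]          # highest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

scores = X_centered @ eigenvectors             # data expressed in the new axes
score_cov = np.cov(scores, rowvar=False)       # off-diagonal entries should be ~0

print("Covariance of the components:\n", score_cov.round(6))
print("Matches scikit-learn:",
      np.allclose(eigenvalues, PCA().fit(X).explained_variance_))
```

The diagonal of `score_cov` holds the eigenvalues, which is the sense in which each eigenvalue is the variance captured along its direction.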
4. What Is PCA Optimizing?
PCA searches for the direction that captures the greatest possible variance.
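Mathematically, it solves:
$$\max_{w} \; w^\top \Sigma w \quad \text{subject to} \quad \lVert w \rVert = 1$$
Here Σ is the covariance matrix of the centered data and w is a candidate direction of unit length.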
The objective is simple:
Capture the maximum information using the fewest dimensions.
Another way to think about it is that PCA minimizes the amount of information lost when projecting high-dimensional data into a lower-dimensional space.
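Both views of the objective can be checked numerically. In this sketch (arbitrary synthetic data, illustrative names), no random unit direction beats the top eigenvector on variance, and the variance kept along PC1 plus the reconstruction error adds up to the total variance.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # arbitrary correlated data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov)
top_variance = eigenvalues[-1]   # eigh sorts ascending, so the last is the largest
pc1 = eigenvectors[:, -1]

# Variance along 10,000 random unit directions w, computed as w^T cov w.
W = rng.normal(size=(10_000, 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)
random_variances = np.einsum("ij,jk,ik->i", W, cov, W)

print("Best random direction:", random_variances.max())
print("Variance along PC1:   ", top_variance)

# Variance kept plus reconstruction error equals the total variance.
projected = np.outer(X_centered @ pc1, pc1)
lost = ((X_centered - projected) ** 2).sum() / (len(X) - 1)
print("Kept + lost:", top_variance + lost, " total:", np.trace(cov))
```

Since the total is fixed, maximizing the variance kept and minimizing the information lost are the same problem.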
5. Explained Variance
Every Principal Component explains part of the dataset's total variance.
Example:
| Component | Variance Explained |
|---|---|
| PC1 | 58% |
| PC2 | 24% |
| PC3 | 10% |
| PC4 | 5% |
| Remaining | 3% |
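If the first two components explain 82% of the variance, you can reduce dozens of original features down to just two components while keeping most of the variation.
Whether the remaining 18% matters depends on the task, so check downstream performance before committing to the cut.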
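Choosing how many components to keep usually comes down to a cumulative sum. The sketch below reuses the ratios from the example table; `components_needed` is a hypothetical helper written for this illustration, and the "Remaining" row is treated as a single entry for simplicity.

```python
import numpy as np

# Ratios copied from the example table; "Remaining" is lumped into one entry.
ratios = np.array([0.58, 0.24, 0.10, 0.05, 0.03])
cumulative = np.cumsum(ratios)

def components_needed(cumulative, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    # The small tolerance guards against floating-point rounding in the sum.
    return int(np.argmax(cumulative >= target - 1e-12)) + 1

print("Cumulative:", cumulative.round(2))
for target in (0.80, 0.90, 0.95):
    print(f"Target {target:.0%}: keep {components_needed(cumulative, target)} components")
```

With these numbers, an 80% target needs two components, 90% needs three, and 95% needs four.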
6. When Should You Use PCA?
PCA works well when:
- There are many correlated numerical features.
- Dimensionality reduction is needed.
- Models train slowly because of many variables.
- Visualization of high-dimensional data is required.
- Multicollinearity is hurting downstream models.
Typical applications include:
- Feature reduction
- Data visualization
- Image compression
- Face recognition
- Recommendation systems
- Bioinformatics
- Financial modeling
7. Advantages
PCA offers several important benefits.
- Reduces dimensionality.
- Removes multicollinearity.
- Speeds up machine learning models.
- Reduces storage requirements.
- Improves visualization.
- Eliminates redundant information.
- Creates completely uncorrelated features.
8. When It Starts Breaking Down
Despite its usefulness, PCA has several limitations.
Difficult to Interpret
The new Principal Components are mathematical combinations of many features.
Instead of saying
"Session Duration caused the prediction,"
you now have
"Principal Component 2 contributed most."
This is much harder to explain to business stakeholders.
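One partial remedy is to inspect the weights each component places on the original features. The sketch below uses invented engagement metrics and reads the weights from scikit-learn's `components_` attribute; the column names are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
base = rng.normal(10, 2, 200)
# Hypothetical engagement metrics that all follow the same activity level.
df = pd.DataFrame({
    "Page_Views": base * 4 + rng.normal(0, 1, 200),
    "Session_Duration": base * 15 + rng.normal(0, 3, 200),
    "Click_Count": base * 2 + rng.normal(0, 1, 200),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))

# Rows are components, columns are original features; entries are the weights.
loadings = pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2"])
print(loadings.round(3))
```

When every feature gets a similar weight on PC1, that component can be described as an overall "engagement level", which is often easier to communicate than the raw numbers.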
Assumes Linear Relationships
PCA only discovers linear directions.
It cannot capture curved or highly non-linear structures.
Methods like Kernel PCA, t-SNE, or UMAP often handle non-linear data better, although t-SNE and UMAP are used mainly for visualization.
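A simple illustration of the linear limitation, using made-up points on a noisy circle: the data has only one real degree of freedom (the angle), yet PCA finds no single direction that summarizes it.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Points on a noisy unit circle: one underlying degree of freedom (the angle).
theta = rng.uniform(0, 2 * np.pi, 1000)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.02, (1000, 2))

ratio = PCA().fit(X).explained_variance_ratio_
print("Explained variance ratio:", ratio.round(3))
```

The two components split the variance roughly evenly, so dropping either one would discard about half of the structure.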
Sensitive to Scaling
Without feature scaling, variables with larger units dominate the variance calculations.
This produces misleading components.
Sensitive to Outliers
A few extreme observations can dramatically change the direction of the Principal Components because variance depends on squared distances.
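The effect is easy to reproduce. In this sketch, 200 invented points vary mostly along x, and a single extreme observation along y is enough to swing the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
# 200 hypothetical points spread widely along x and narrowly along y.
X_clean = np.column_stack([rng.normal(0, 3, 200), rng.normal(0, 0.5, 200)])
# The same data plus a single extreme observation far out along y.
X_outlier = np.vstack([X_clean, [0.0, 80.0]])

pc1_clean = PCA(n_components=1).fit(X_clean).components_[0]
pc1_outlier = PCA(n_components=1).fit(X_outlier).components_[0]

print("PC1 without outlier:", pc1_clean.round(3))   # points along x
print("PC1 with outlier:   ", pc1_outlier.round(3)) # points along y
```

Inspecting or clipping extreme values before PCA, or using a robust variant, are common ways to reduce this sensitivity.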
9. Python Implementation
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate four correlated features that all derive from one shared signal
np.random.seed(42)
base = np.random.normal(10, 2, 100)
df = pd.DataFrame({
    "Feature_A": base * 2.5 + np.random.normal(0, 0.5, 100),
    "Feature_B": base * 1.2 + np.random.normal(0, 0.2, 100),
    "Session_Time": base * 15 + np.random.normal(0, 3, 100),
    "Page_Views": base * 4 + np.random.normal(0, 1, 100),
})

# Standardize so every feature has zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Apply PCA; a float n_components keeps enough components to reach 90% of the variance
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)

print("Original Features:", df.shape[1])
print("Principal Components:", X_pca.shape[1])
print("\nExplained Variance Ratio")
print(pca.explained_variance_ratio_)
print("\nTotal Variance Retained")
print(pca.explained_variance_ratio_.sum())
```
10. How to Evaluate PCA
Unlike supervised learning, PCA has no prediction accuracy.
Instead, we evaluate how much information is retained.
Explained Variance Ratio
Shows how much variance each Principal Component captures.
Higher values indicate more informative components.
Cumulative Explained Variance
Measures the total variance preserved after selecting multiple components.
Many practitioners retain 90–95% of the total variance.
Scree Plot
A Scree Plot graphs the explained variance of each component.
The "elbow" helps determine how many components should be kept.
Reconstruction Error
Measures how much information is lost when reconstructing the original dataset from the reduced components.
Lower reconstruction error indicates better compression.
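Reconstruction error can be measured with scikit-learn's `inverse_transform`, which maps the reduced data back to the original feature space. This sketch uses invented data (four noisy copies of one signal) and reports the mean squared error for each number of components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
base = rng.normal(0, 1, 300)
# Hypothetical data: four noisy copies of one signal at different scales.
X = np.column_stack([base * s + rng.normal(0, 0.2, 300) for s in (1.0, 2.0, 0.5, 3.0)])
X_scaled = StandardScaler().fit_transform(X)

errors = {}
for k in range(1, X_scaled.shape[1] + 1):
    pca = PCA(n_components=k).fit(X_scaled)
    X_restored = pca.inverse_transform(pca.transform(X_scaled))
    errors[k] = float(np.mean((X_scaled - X_restored) ** 2))  # mean squared error
    print(f"k={k}  reconstruction MSE={errors[k]:.6f}")
```

The error shrinks as components are added and reaches zero (up to floating-point precision) when all of them are kept.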
11. Real-World Engineering Notes
Some practical lessons you'll quickly discover:
- Standardize numerical features before PCA unless they already share the same units and scale.
- PCA is often used as a preprocessing step before Logistic Regression, SVMs, and Neural Networks.
- Tree-based models (Decision Trees, Random Forests, XGBoost) usually don't benefit much from PCA because they naturally handle correlated features.
- PCA is excellent for visualization—reducing hundreds of dimensions down to two or three makes complex datasets much easier to explore.
- The first few components often capture the vast majority of useful information.
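As one way to apply these notes, the sketch below chains StandardScaler, PCA, and Logistic Regression in a scikit-learn pipeline, using the breast cancer dataset bundled with the library. The 95% variance target and the choice of classifier are assumptions for the example, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled scikit-learn dataset with 30 correlated numerical features.
X, y = load_breast_cancer(return_X_y=True)

# Inside a pipeline, the scaler and PCA are refit on each training fold only,
# so no information from the held-out fold leaks into the components.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),   # keep enough components for 95% of the variance
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", round(float(scores.mean()), 3))

model.fit(X, y)
n_kept = model.named_steps["pca"].n_components_
print("Components kept:", n_kept, "of", X.shape[1])
```

Fitting the scaler and PCA on the full dataset before splitting is a common source of leakage; wrapping them in the pipeline avoids it.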
12. PCA vs Feature Selection
These are often confused, but they solve different problems.
| Feature Selection | PCA |
|---|---|
| Keeps original features | Creates entirely new features |
| Easy to interpret | Hard to interpret |
| Removes unnecessary columns | Combines existing columns |
| Human-readable | Mathematical representation |
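The difference shows up clearly in the output columns. In this sketch on invented data, `SelectKBest` (a supervised selector, chosen here only as an example) returns a subset of the original column names, while PCA returns weights that mix every column.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
# Hypothetical data: the target depends on two of the four columns.
df = pd.DataFrame(rng.normal(size=(300, 4)),
                  columns=["Page_Views", "Session_Duration", "Click_Count", "Noise"])
y = 3 * df["Page_Views"] - 2 * df["Click_Count"] + rng.normal(0, 0.5, 300)

# Feature selection: the output columns are a subset of the originals.
selector = SelectKBest(score_func=f_regression, k=2).fit(df, y)
selected = list(df.columns[selector.get_support()])
print("Selected columns:", selected)

# PCA: each output column is a weighted mix of every original column.
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
print("PC1 weights:", dict(zip(df.columns, pca.components_[0].round(2))))
```

Note that this selector uses the target `y`, whereas PCA never sees it, which is another practical difference between the two approaches.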
13. Key Takeaways
- PCA is an unsupervised dimensionality reduction algorithm.
- It compresses many correlated features into a smaller set of uncorrelated Principal Components.
- The algorithm works by finding the directions that capture the maximum variance in the data.
- Principal Components are orthogonal, eliminating multicollinearity.
- Feature scaling is essential before applying PCA.
- PCA improves computational efficiency and reduces redundancy but sacrifices interpretability because the resulting components are mathematical combinations of the original features.