The One-Line Summary: PCA finds the directions in your data where the most "stuff happens" (variance), then projects your data onto those directions. 100 features become 10 features, but those 10 capture 95% of what made the 100 interesting.
The Sculptor's Photography Problem
Maria is a sculptor. She's just finished her masterpiece — a complex abstract piece with curves, angles, textures, and shadows that shift as you walk around it.
A magazine wants ONE photograph.
One. Single. Photo.
Her sculpture exists in 3D space. A photograph is 2D. She's losing an entire dimension. But which angle captures the MOST?
Attempt 1: Front View
Camera → 🎥
          ┌─────┐
          │     │
          │  □  │
          │     │
          └─────┘
Result: Flat. Boring. Lost all the depth.
Information captured: 40%
Attempt 2: Side View
      🎥
      ↓
┌───────────┐
│  ╱╲   ╱╲  │
│ ╱  ╲ ╱  ╲ │
└───────────┘
Result: Shows curves but misses the front detail.
Information captured: 45%
Attempt 3: The "Magic Angle"
Maria walks around, studying shadows, analyzing how light plays across surfaces. She finds ONE angle where:
- The main curves are visible
- The depth creates compelling shadows
- The texture is apparent
- The proportions are clear
   🎥
     ↘
      ┌─────────┐
     ╱│  ╱╲ ╱╲  │
    ╱ │ ╱  ╲_╱ ╲│╲
      └─────────┘
Result: The sculpture's ESSENCE is captured.
Information captured: 92%!
This "magic angle" is what PCA finds for your data.
What PCA Actually Does
PCA (Principal Component Analysis) answers the question:
"If I HAD to describe this data with fewer dimensions, which directions capture the most variation?"
Original Data: 100 dimensions (features)
↓
[PCA Magic]
↓
New Data: 10 dimensions that capture 95% of the variance
You lost 90 dimensions but kept 95% of what mattered!
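Here's a minimal sketch of that claim on synthetic data. Everything in it (the shapes, the 10 latent factors, the random seed) is illustrative, and the variance you keep will depend entirely on your own dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 1,000 samples, 100 correlated features built from 10 hidden factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 10))
mixing = rng.normal(size=(10, 100))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 100))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "→", X_reduced.shape)                       # (1000, 100) → (1000, 10)
print(f"Variance kept: {pca.explained_variance_ratio_.sum():.1%}")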
The Shadow Intuition
Think of your data as a 3D cloud of points floating in space.
3D Data Cloud:
z ↑
│ ● ●
│ ● ● ●
│ ● ● ● ●
│ ● ● ●
│ ● ●
└──────────→ y
╱
↙ x
Now imagine shining a flashlight on this cloud and looking at the shadow on the wall.
Different flashlight angles create different shadows:
Angle 1: Shadow is a thin line (lost almost everything!)

Wall:
│ ●
│ ●
│ ●
│ ●
│ ●

Variance: LOW
Information: LOST

Angle 2: Shadow shows the spread (captured the shape!)

Wall:
│     ●   ●
│   ●   ●   ●
│ ●   ●   ●   ●
│   ●   ●   ●
│     ●   ●

Variance: HIGH
Information: PRESERVED
PCA finds the flashlight angle that creates the shadow with MAXIMUM SPREAD (variance).
Why spread? Because spread means different points cast different shadows. If everything collapses to a line, you can't tell points apart anymore!
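You can check this intuition numerically: project a 2D point cloud onto a few candidate directions, compare the variance of each "shadow", and see which direction PCA itself picks. A quick sketch on made-up data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Elongated 2D cloud: lots of spread along x, little along y
X = rng.normal(size=(500, 2)) * [3.0, 0.5]

for name, direction in [("x-axis", np.array([1.0, 0.0])),
                        ("y-axis", np.array([0.0, 1.0])),
                        ("diagonal", np.array([1.0, 1.0]) / np.sqrt(2))]:
    projection = X @ direction                     # the shadow of every point
    print(f"{name:8s} variance of shadow: {projection.var():.2f}")

pc1 = PCA(n_components=1).fit(X).components_[0]
print("Direction PCA picks (PC1):", pc1.round(2))  # roughly [±1, 0]: the max-spread axis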
Step-by-Step: How PCA Works
Step 1: Center the Data
Move everything so the center (mean) is at the origin.
Before centering (mean is somewhere off-center):

y ↑
  │            ●●●
  │           ●●●●●
  │          ●●●●●●●
  │           ●●●●●
  │            ●●●
  └──────────────────→ x

After centering (mean is at (0, 0)):

            y ↑
             ●●●
            ●●●●●
  ─────────●●●●●●●─────────→ x
            ●●●●●
             ●●●
              │
from sklearn.preprocessing import StandardScaler
import numpy as np

# StandardScaler both centers (subtracts the mean) and scales (divides by the
# standard deviation). PCA centers internally, but it does NOT scale, so scale
# yourself whenever your features live on different scales.
scaler = StandardScaler()
X_centered = scaler.fit_transform(X)
Step 2: Find the Direction of Maximum Variance
PCA asks: "If I drew a line through this cloud, which direction would capture the most spread?"
y ↑
│ ● ●
│ ● ●
│ ● ● ●
│ ● ● ● ●
────────┼───●───●───●───────→ x
│ ● ● ● ●
│ ● ● ●
│ ● ●
│ ● ●
│
Project onto X-axis: ●●●●●●●●●●●●●●●●●●● (lots of spread!)
Project onto Y-axis: ●●●●●●●● (less spread)
PC1 = X-axis direction (more variance = more information)
This first direction is called Principal Component 1 (PC1).
Step 3: Find the Next Best Direction (Perpendicular!)
PC2 must be perpendicular (orthogonal) to PC1. This ensures no redundancy.
PC2
↑
│ ● ●
│ ● ●
│ ● ● ●
──────────┼───●───●───●────────→ PC1
│ ● ● ● ●
│ ● ● ●
│ ● ●
│ ● ●
PC1: Direction of MAXIMUM variance
PC2: Direction of MAXIMUM remaining variance (perpendicular to PC1)
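Both properties are easy to verify on any fitted PCA: each component is a unit vector, and every pair is perpendicular (dot product of zero, up to floating-point noise). A quick check; the iris dataset here is just a convenient built-in stand-in:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

pc1, pc2 = pca.components_
print(f"PC1 length: {np.linalg.norm(pc1):.6f}")   # 1.0 (unit vector)
print(f"PC2 length: {np.linalg.norm(pc2):.6f}")   # 1.0
print(f"PC1 . PC2 : {np.dot(pc1, pc2):.6f}")      # 0.0 (perpendicular)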
Step 4: Project Data onto Principal Components
Transform every point from original coordinates to PC coordinates.
from sklearn.decomposition import PCA
# Original: 100 features
print(f"Original shape: {X.shape}") # (1000, 100)
# PCA: Keep 10 components
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_centered)
print(f"Transformed shape: {X_pca.shape}") # (1000, 10)
Complete Working Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
# Load handwritten digits (64 features = 8x8 pixel images)
digits = load_digits()
X, y = digits.data, digits.target
print(f"Original shape: {X.shape}") # (1797, 64)
print(f"Each digit is described by 64 pixel values")
# Scale the data (important for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA - keep all components first to see variance explained
pca_full = PCA()
pca_full.fit(X_scaled)
# How much variance does each component explain?
variance_explained = pca_full.explained_variance_ratio_
print("\nVariance explained by each component:")
print("-" * 45)
cumulative = 0
for i, var in enumerate(variance_explained[:15]):
    cumulative += var
    bar = "█" * int(var * 100)
    print(f"PC{i+1:2d}: {bar:<12} {var:5.1%} (cumulative: {cumulative:5.1%})")
# Find number of components for 95% variance
cumsum = np.cumsum(variance_explained)
n_95 = np.argmax(cumsum >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_95}")
print(f"Dimensionality reduction: 64 → {n_95} ({100*(64-n_95)/64:.0f}% reduction!)")
# Apply PCA with optimal components
pca = PCA(n_components=n_95)
X_pca = pca.fit_transform(X_scaled)
print(f"\nNew shape: {X_pca.shape}")
Output:
Original shape: (1797, 64)
Each digit is described by 64 pixel values
Variance explained by each component:
---------------------------------------------
PC 1: ████████████ 12.0% (cumulative: 12.0%)
PC 2: █████████ 9.5% (cumulative: 21.4%)
PC 3: ████████ 8.4% (cumulative: 29.9%)
PC 4: ██████ 6.5% (cumulative: 36.4%)
PC 5: █████ 5.5% (cumulative: 41.8%)
PC 6: █████ 5.0% (cumulative: 46.8%)
PC 7: ████ 4.3% (cumulative: 51.2%)
PC 8: ████ 3.9% (cumulative: 55.1%)
PC 9: ███ 3.6% (cumulative: 58.7%)
PC10: ███ 3.3% (cumulative: 62.0%)
PC11: ███ 3.0% (cumulative: 65.1%)
PC12: ██ 2.7% (cumulative: 67.8%)
PC13: ██ 2.5% (cumulative: 70.3%)
PC14: ██ 2.2% (cumulative: 72.5%)
PC15: ██ 2.0% (cumulative: 74.5%)
Components needed for 95% variance: 41
Dimensionality reduction: 64 → 41 (36% reduction!)
Visualizing What PCA Captures
import matplotlib.pyplot as plt
# Project digits to 2D for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Plot
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', alpha=0.7, s=20)
plt.colorbar(scatter, label='Digit')
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Handwritten Digits in 2D (PCA)\n64 dimensions → 2 dimensions')
plt.tight_layout()
plt.savefig('pca_digits.png', dpi=150)
plt.show()
print(f"64D → 2D, but we can STILL see digit clusters!")
print(f"PC1 + PC2 explain {sum(pca_2d.explained_variance_ratio_):.1%} of variance")
Even with just 2 dimensions (from 64!), you can see the digits clustering!
The Scree Plot: How Many Components?
# The "Elbow" method
plt.figure(figsize=(12, 5))
# Plot 1: Individual variance
plt.subplot(1, 2, 1)
plt.bar(range(1, 21), variance_explained[:20], alpha=0.7, color='steelblue')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Variance Explained by Each PC')
# Plot 2: Cumulative variance
plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumsum)+1), cumsum, 'b-o', markersize=4)
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.axvline(x=n_95, color='g', linestyle='--', label=f'{n_95} components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.title('Cumulative Variance Explained')
plt.legend()
plt.tight_layout()
plt.savefig('scree_plot.png', dpi=150)
plt.show()
Look for the "elbow" — where adding more components gives diminishing returns.
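If you'd rather pick the elbow programmatically than by eye, one simple heuristic (a rough sketch, reusing the variance_explained array from above; the 1% cutoff is an arbitrary choice you should tune) is to stop once each extra component adds less than a small amount of variance:

import numpy as np

# Keep components until the marginal gain drops below 1%
# (assumes at least one component falls below the threshold)
threshold = 0.01
n_elbow = int(np.argmax(variance_explained < threshold))
print(f"Elbow heuristic: ~{n_elbow} components "
      f"(explaining {variance_explained[:n_elbow].sum():.1%} of the variance)")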
What Do Principal Components Mean?
Each PC is a weighted combination of original features.
import pandas as pd

# Look at what PC1 "means"
print("PC1 is a combination of original features:")
print("-" * 50)
# Get the loadings (weights)
loadings = pca.components_[0] # First principal component
# Show top positive and negative contributors
feature_names = [f'pixel_{i}' for i in range(64)]
pc1_loadings = pd.Series(loadings, index=feature_names)
print("\nTop 5 POSITIVE contributors to PC1:")
print(pc1_loadings.nlargest(5))
print("\nTop 5 NEGATIVE contributors to PC1:")
print(pc1_loadings.nsmallest(5))
For images, you can visualize what each PC "looks for":
# Visualize first 6 principal components as images
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
    # Reshape the component weights back to an 8x8 image
    component_image = pca.components_[i].reshape(8, 8)
    ax.imshow(component_image, cmap='RdBu_r')
    ax.set_title(f'PC{i+1} ({pca.explained_variance_ratio_[i]:.1%})')
    ax.axis('off')
plt.suptitle('What Each Principal Component "Looks For"', fontsize=14)
plt.tight_layout()
plt.savefig('pc_components.png', dpi=150)
plt.show()
When to Use PCA
✅ Use Case 1: Too Many Features (Curse of Dimensionality)
# Problem: Model is slow and overfitting
X_original = load_high_dimensional_data() # 10,000 features!
model.fit(X_original, y) # Slow, overfits
# Solution: PCA first
pca = PCA(n_components=100) # 10,000 → 100
X_reduced = pca.fit_transform(X_original)
model.fit(X_reduced, y) # Fast, generalizes better!
✅ Use Case 2: Multicollinearity (Correlated Features)
When features are highly correlated, models struggle. PCA creates UNCORRELATED components.
# Check correlation
print("Original feature correlations:")
print(np.corrcoef(X.T)[:5, :5]) # Many high correlations!
# After PCA: Components are orthogonal (correlation = 0)
X_pca = PCA(n_components=5).fit_transform(X)
print("\nPCA component correlations:")
print(np.corrcoef(X_pca.T).round(10)) # All zeros except diagonal!
Output:
Original feature correlations:
[[ 1. 0.89 0.85 0.72 0.68]
[ 0.89 1. 0.91 0.78 0.71]
[ 0.85 0.91 1. 0.83 0.75]
[ 0.72 0.78 0.83 1. 0.88]
[ 0.68 0.71 0.75 0.88 1. ]]
PCA component correlations:
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
PCA eliminated all correlation!
✅ Use Case 3: Visualization (Any Dimensions → 2D/3D)
# Can't visualize 100 dimensions!
# But PCA can project to 2D while preserving structure
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_100d)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.title('100D Data Visualized in 2D')
plt.show()
✅ Use Case 4: Noise Reduction
Low-variance components often capture noise, not signal.
# Keep only components that explain real variance
pca = PCA(n_components=0.95) # Keep 95% variance
X_denoised = pca.fit_transform(X_noisy)
# The noisy components were discarded!
print(f"Removed {X_noisy.shape[1] - X_denoised.shape[1]} noisy dimensions")
✅ Use Case 5: Speeding Up Other Algorithms
from sklearn.neighbors import KNeighborsClassifier
import time
# KNN "training" just memorizes the data; the real cost shows up at prediction
# time, when every query is compared to every stored point in every dimension.

# Predictions on 10,000 features: SLOW
knn = KNeighborsClassifier().fit(X_10000, y)
start = time.time()
knn.predict(X_10000[:100])
print(f"Predictions with 10,000 features: {time.time()-start:.2f}s")

# Predictions on 100 PCA features: FAST
X_pca = PCA(n_components=100).fit_transform(X_10000)
knn = KNeighborsClassifier().fit(X_pca, y)
start = time.time()
knn.predict(X_pca[:100])
print(f"Predictions with 100 PCA features: {time.time()-start:.2f}s")
When NOT to Use PCA
❌ Don't Use When: Interpretability Matters
# After PCA, you can't say "income is important"
# You can only say "PC1 is important" — but what IS PC1?
# PC1 = 0.32×income + 0.28×age + 0.21×credit_score - 0.15×debt + ...
# Try explaining that to a business stakeholder!
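You can at least inspect which original features dominate each component, which helps a little even if it never fully restores interpretability. A sketch with made-up feature names (income, age, credit_score, debt are placeholders on random data, so the numbers themselves mean nothing):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: pretend these are real customer features
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["income", "age", "credit_score", "debt"])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
loadings = pd.DataFrame(pca.components_, columns=df.columns, index=["PC1", "PC2"])
print(loadings.round(2))   # which original features weigh most on each PC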
❌ Don't Use When: Features Have Different Scales (Without Scaling!)
# ❌ WRONG: PCA on unscaled data
X = pd.DataFrame({
    'age': [25, 30, 35, 40],                   # Range: 25-40
    'income': [50000, 75000, 100000, 125000]   # Range: 50,000-125,000
})
pca = PCA(n_components=1)
pca.fit(X)
print(pca.components_) # Income dominates because of scale!
# [[0.0001, 0.9999]] ← Age contributes almost nothing!
# ✅ RIGHT: Scale first!
X_scaled = StandardScaler().fit_transform(X)
pca.fit(X_scaled)
print(pca.components_) # Both features contribute fairly
# [[0.707, 0.707]]
❌ Don't Use When: Relationships Are Non-Linear
PCA finds LINEAR directions. If your data has curves, PCA misses them.
# Example: Data forms a curve (like a spiral)
# PCA will draw a straight line through it — useless!
# For non-linear dimensionality reduction, use:
# - t-SNE (visualization)
# - UMAP (visualization + some ML tasks)
# - Kernel PCA (PCA with kernel trick)
# - Autoencoders (deep learning)
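To make the Kernel PCA option concrete, here's a minimal sketch on scikit-learn's two-moons toy data. The RBF kernel and the gamma value are illustrative choices that usually need tuning:

from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# Plain PCA: still a linear projection, so the two moons stay tangled
X_linear = PCA(n_components=2).fit_transform(X)

# Kernel PCA: the kernel trick lets the projection bend with the data
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kernel = kpca.fit_transform(X)

print(X_linear.shape, X_kernel.shape)   # both (300, 2), but very different embeddings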
❌ Don't Use When: All Features Are Equally Important
If every feature explains the same variance, PCA can't help.
# Check first: Is there variance to compress?
pca = PCA()
pca.fit(X)
if pca.explained_variance_ratio_[0] < 0.1:
    print("Warning: No dominant direction found!")
    print("All features are roughly equally important.")
    print("PCA won't help much here.")
The Complete PCA Pipeline
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Load data (example: 100 features)
X, y = load_data()
# Split FIRST (prevent data leakage!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Method 1: Manual steps
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use train's mean/std!
pca = PCA(n_components=0.95) # Keep 95% variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled) # Use train's PCA!
print(f"Original: {X_train.shape[1]} features")
print(f"After PCA: {X_train_pca.shape[1]} features")
# Method 2: Pipeline (cleaner, handles CV correctly)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Cross-validation with pipeline handles everything correctly!
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
print(f"\nCV Accuracy: {scores.mean():.1%} ± {scores.std():.1%}")
# Final evaluation
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.1%}")
Common Mistakes
Mistake 1: Not Scaling Before PCA
# ❌ WRONG
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X) # Large-scale features dominate!
# ✅ RIGHT
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)
Mistake 2: Fitting PCA on Test Data
# ❌ WRONG: Data leakage!
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test) # NO! Different transformation!
# ✅ RIGHT
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test) # Use train's PCA transformation!
Mistake 3: Keeping Too Few/Many Components
# ❌ WRONG: Arbitrary number
pca = PCA(n_components=10) # Why 10?
# ✅ RIGHT: Based on variance explained
pca = PCA(n_components=0.95) # Keep 95% variance
# Or: Use cross-validation to find optimal number
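Here is one way to let cross-validation pick the number of components, as a sketch. The grid values and the LogisticRegression classifier are arbitrary placeholders, and X_train / y_train come from a split like the one in the pipeline section above:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"pca__n_components": [5, 10, 20, 30, 40]}   # candidate sizes to try

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best n_components:", search.best_params_["pca__n_components"])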
Mistake 4: Using PCA for Classification Directly
# ❌ WRONG: PCA doesn't know about labels!
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X) # Ignores y completely!
# ✅ RIGHT: Use LDA if you want supervised dimensionality reduction
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y) # Uses labels to maximize class separation!
Quick Reference: PCA Parameters
| Parameter | What It Does | Recommended |
|---|---|---|
| `n_components` (int) | Keep exactly N components | When you know the target size |
| `n_components` (float) | Keep N% of the variance | `0.95` (95% variance) |
| `n_components='mle'` | Auto-select using MLE | For automatic selection |
| `whiten=True` | Scale components to unit variance | For some algorithms |
| `svd_solver='full'` | Exact computation | For accuracy |
| `svd_solver='randomized'` | Approximate (faster) | For large datasets |
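For completeness, the same parameters in code. The specific values are just examples, not recommendations:

from sklearn.decomposition import PCA

pca_exact  = PCA(n_components=10)                    # keep exactly 10 components
pca_var    = PCA(n_components=0.95)                  # keep 95% of the variance
pca_mle    = PCA(n_components='mle')                 # let MLE pick the number
pca_whiten = PCA(n_components=10, whiten=True)       # unit-variance components
pca_fast   = PCA(n_components=50, svd_solver='randomized', random_state=42)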
Key Takeaways
PCA finds directions of maximum variance — Like finding the best camera angle for a sculpture
It's dimensionality reduction, not feature selection — Creates NEW features from combinations
Always scale before PCA — Or large-magnitude features dominate
Components are orthogonal — No correlation between them
Use explained variance to choose components — 95% is a common threshold
Fit on train, transform on test — Never fit PCA on test data!
PCA is unsupervised — It doesn't know about your target variable
Great for visualization, speed, and noise reduction — Not for interpretability
The One-Sentence Summary
PCA is Maria the sculptor finding the one camera angle that captures 95% of her 3D masterpiece's essence in a single 2D photograph — losing a dimension but keeping what matters most.
What's Next?
Now that you understand PCA, you're ready for:
- Kernel PCA — PCA for non-linear relationships
- t-SNE and UMAP — Non-linear visualization techniques
- LDA — Supervised dimensionality reduction
- Autoencoders — Deep learning approach to compression
Follow me for the next article in this series!
Let's Connect!
If PCA finally makes sense now, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the highest dimensionality reduction ratio you've achieved with PCA? I've seen 10,000 → 50 with 95% variance preserved!
The difference between a model drowning in 10,000 features and one that sails smoothly on 100? Finding the camera angle — the principal components — that captures what actually matters. That's PCA.
Share this with someone struggling to visualize their 100-dimensional data. There's a 2D photograph waiting to be taken.
Happy projecting! 📸