Sachin Kr. Rajput
PCA Explained: Finding the Perfect Angle to Photograph a Sculpture So You Capture Everything in One Shot

The One-Line Summary: PCA finds the directions in your data where the most "stuff happens" (variance), then projects your data onto those directions. 100 features become 10 features, but those 10 capture 95% of what made the 100 interesting.


The Sculptor's Photography Problem

Maria is a sculptor. She's just finished her masterpiece — a complex abstract piece with curves, angles, textures, and shadows that shift as you walk around it.

A magazine wants ONE photograph.

One. Single. Photo.

Her sculpture exists in 3D space. A photograph is 2D. She's losing an entire dimension. But which angle captures the MOST?


Attempt 1: Front View

        Camera → 🎥

                ┌─────┐
                │     │
                │  □  │
                │     │
                └─────┘

Result: Flat. Boring. Lost all the depth.
Information captured: 40%

Attempt 2: Side View

                        🎥
                         ↓
                ┌───────────┐
                │ ╱╲    ╱╲  │
                │╱  ╲  ╱  ╲ │
                └───────────┘

Result: Shows curves but misses the front detail.
Information captured: 45%

Attempt 3: The "Magic Angle"

Maria walks around, studying shadows, analyzing how light plays across surfaces. She finds ONE angle where:

  • The main curves are visible
  • The depth creates compelling shadows
  • The texture is apparent
  • The proportions are clear
                    🎥
                     ↘
                   ┌─────────┐
                  ╱│ ╱╲   ╱╲ │
                 ╱ │╱  ╲_╱  ╲│╲
                   └─────────┘ 

Result: The sculpture's ESSENCE is captured.
Information captured: 92%!

This "magic angle" is what PCA finds for your data.


What PCA Actually Does

PCA (Principal Component Analysis) answers the question:

"If I HAD to describe this data with fewer dimensions, which directions capture the most variation?"

Original Data: 100 dimensions (features)
                        ↓
                [PCA Magic]
                        ↓
New Data: 10 dimensions that capture 95% of the variance

You lost 90 dimensions but kept 95% of what mattered!
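
Here's a minimal sketch of that idea on synthetic data. Everything below (the shapes, the 10 latent factors, the variable names) is made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 1000 samples, 100 features generated from only 10 underlying factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 10))                       # the 10 "true" factors
mixing = rng.normal(size=(10, 100))                        # spread them across 100 columns
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 100))   # plus a little noise

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                      # (1000, 100) -> (1000, 10)
print(f"Variance kept: {pca.explained_variance_ratio_.sum():.1%}")
# Nearly all the variance survives, because only 10 factors generated the 100 columns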

The Shadow Intuition

Think of your data as a 3D cloud of points floating in space.

3D Data Cloud:

        z ↑
          │    ● ●
          │  ●  ●  ●
          │ ●  ●  ●  ●
          │  ●  ●  ●
          │    ● ●
          └──────────→ y
         ╱
       ↙ x

Now imagine shining a flashlight on this cloud and looking at the shadow on the wall.

Different flashlight angles create different shadows:

Angle 1: Shadow is a thin line      Angle 2: Shadow shows spread
(Lost almost everything!)           (Captured the shape!)

    Wall:                               Wall:
    │                                   │
    │      │                            │    ●  ●
    │      │                            │  ●  ●  ●
    │      │                            │ ●  ●  ●  ●
    │                                   │  ●  ●  ●
                                        │    ●  ●

    Variance: LOW                       Variance: HIGH
    Information: LOST                   Information: PRESERVED

PCA finds the flashlight angle that creates the shadow with MAXIMUM SPREAD (variance).

Why spread? Because spread means different points cast different shadows. If everything collapses to a line, you can't tell points apart anymore!
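
You can check the flashlight idea numerically: project a 2-D cloud onto different directions and compare the spread of each shadow. A small sketch on synthetic data (the angles and shapes are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Build an elongated cloud: wide along one axis, narrow along the other, then rotate it
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
angle_true = np.pi / 4
rotation = np.array([[np.cos(angle_true), -np.sin(angle_true)],
                     [np.sin(angle_true),  np.cos(angle_true)]])
X = X @ rotation.T   # long axis now points at roughly 45 degrees

def shadow_variance(X, angle):
    direction = np.array([np.cos(angle), np.sin(angle)])   # unit vector = flashlight angle
    return (X @ direction).var()                           # spread of the 1-D shadow

for angle in [3 * np.pi / 4, 0.0, np.pi / 4]:
    print(f"angle {angle:.2f} rad: shadow variance = {shadow_variance(X, angle):.2f}")
# The variance is largest for the direction the cloud is stretched along: that is PC1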


Step-by-Step: How PCA Works

Step 1: Center the Data

Move everything so the center (mean) is at the origin.

Before centering:                 After centering:

    y ↑                              y ↑
      │         ●●●                    │    ●●●
      │        ●●●●●                   │   ●●●●●
      │       ●●●●●●●      →           │  ●●●●●●●
      │        ●●●●●                   │   ●●●●●
      │         ●●●                    │    ●●●
      └──────────────→ x       ────────┼──────────→ x
                                       │
      Mean is somewhere               Mean is at (0,0)
from sklearn.preprocessing import StandardScaler
import numpy as np

# Center (and scale) the data
scaler = StandardScaler()
X_centered = scaler.fit_transform(X)

Step 2: Find the Direction of Maximum Variance

PCA asks: "If I drew a line through this cloud, which direction would capture the most spread?"

                y ↑
                  │       ● ●
                  │     ●     ●
                  │   ●    ●    ●
                  │  ●   ●   ●   ●
          ────────┼───●───●───●───────→ x
                  │  ●   ●   ●   ●
                  │   ●    ●    ●
                  │     ●     ●
                  │       ● ●
                  │

Project onto X-axis: ●●●●●●●●●●●●●●●●●●● (lots of spread!)
Project onto Y-axis: ●●●●●●●● (less spread)

PC1 = X-axis direction (more variance = more information)

This first direction is called Principal Component 1 (PC1).
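
Under the hood, this direction is the eigenvector of the data's covariance matrix with the largest eigenvalue. A quick sketch to check the by-hand version against scikit-learn (it assumes X_centered from Step 1):

import numpy as np
from sklearn.decomposition import PCA

# Covariance matrix of the centered data (features in columns)
cov = np.cov(X_centered, rowvar=False)

# Eigendecomposition: the eigenvector with the largest eigenvalue is PC1
eigenvalues, eigenvectors = np.linalg.eigh(cov)
pc1_manual = eigenvectors[:, np.argmax(eigenvalues)]

pc1_sklearn = PCA(n_components=1).fit(X_centered).components_[0]

# Same direction (possibly flipped in sign, which describes the same line)
print(np.allclose(np.abs(pc1_manual), np.abs(pc1_sklearn), atol=1e-6))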


Step 3: Find the Next Best Direction (Perpendicular!)

PC2 must be perpendicular (orthogonal) to PC1. This ensures no redundancy.

                    PC2
                     ↑
                     │       ● ●
                     │     ●     ●
                     │   ●    ●    ●
           ──────────┼───●───●───●────────→ PC1
                     │  ●   ●   ●   ●
                     │   ●    ●    ●
                     │     ●     ●
                     │       ● ●

PC1: Direction of MAXIMUM variance
PC2: Direction of MAXIMUM remaining variance (perpendicular to PC1)
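
You can verify that orthogonality directly: the dot product of two perpendicular unit vectors is zero. A tiny check (again assuming X_centered from Step 1):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X_centered)
pc1, pc2 = pca.components_

print(np.dot(pc1, pc2))                           # ~0: the two directions are perpendicular
print(np.linalg.norm(pc1), np.linalg.norm(pc2))   # each component is a unit vector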

Step 4: Project Data onto Principal Components

Transform every point from original coordinates to PC coordinates.

from sklearn.decomposition import PCA

# Original: 100 features
print(f"Original shape: {X.shape}")  # (1000, 100)

# PCA: Keep 10 components
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_centered)

print(f"Transformed shape: {X_pca.shape}")  # (1000, 10)

Complete Working Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load handwritten digits (64 features = 8x8 pixel images)
digits = load_digits()
X, y = digits.data, digits.target

print(f"Original shape: {X.shape}")  # (1797, 64)
print(f"Each digit is described by 64 pixel values")

# Scale the data (important for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA - keep all components first to see variance explained
pca_full = PCA()
pca_full.fit(X_scaled)

# How much variance does each component explain?
variance_explained = pca_full.explained_variance_ratio_

print("\nVariance explained by each component:")
print("-" * 45)
cumulative = 0
for i, var in enumerate(variance_explained[:15]):
    cumulative += var
    bar = "" * int(var * 100)
    print(f"PC{i+1:2d}: {bar:<12} {var:5.1%}  (cumulative: {cumulative:5.1%})")

# Find number of components for 95% variance
cumsum = np.cumsum(variance_explained)
n_95 = np.argmax(cumsum >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_95}")
print(f"Dimensionality reduction: 64 → {n_95} ({100*(64-n_95)/64:.0f}% reduction!)")

# Apply PCA with optimal components
pca = PCA(n_components=n_95)
X_pca = pca.fit_transform(X_scaled)

print(f"\nNew shape: {X_pca.shape}")

Output:

Original shape: (1797, 64)
Each digit is described by 64 pixel values

Variance explained by each component:
---------------------------------------------
PC 1: ████████████ 12.0%  (cumulative: 12.0%)
PC 2: █████████    9.5%  (cumulative: 21.4%)
PC 3: ████████     8.4%  (cumulative: 29.9%)
PC 4: ██████       6.5%  (cumulative: 36.4%)
PC 5: █████        5.5%  (cumulative: 41.8%)
PC 6: █████        5.0%  (cumulative: 46.8%)
PC 7: ████         4.3%  (cumulative: 51.2%)
PC 8: ████         3.9%  (cumulative: 55.1%)
PC 9: ███          3.6%  (cumulative: 58.7%)
PC10: ███          3.3%  (cumulative: 62.0%)
PC11: ███          3.0%  (cumulative: 65.1%)
PC12: ██           2.7%  (cumulative: 67.8%)
PC13: ██           2.5%  (cumulative: 70.3%)
PC14: ██           2.2%  (cumulative: 72.5%)
PC15: ██           2.0%  (cumulative: 74.5%)

Components needed for 95% variance: 41
Dimensionality reduction: 64 → 41 (36% reduction!)

Visualizing What PCA Captures

import matplotlib.pyplot as plt

# Project digits to 2D for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

# Plot
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', alpha=0.7, s=20)
plt.colorbar(scatter, label='Digit')
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Handwritten Digits in 2D (PCA)\n64 dimensions → 2 dimensions')
plt.tight_layout()
plt.savefig('pca_digits.png', dpi=150)
plt.show()

print(f"64D → 2D, but we can STILL see digit clusters!")
print(f"PC1 + PC2 explain {sum(pca_2d.explained_variance_ratio_):.1%} of variance")

Even with just 2 dimensions (from 64!), you can see the digits clustering!


The Scree Plot: How Many Components?

# The "Elbow" method
plt.figure(figsize=(12, 5))

# Plot 1: Individual variance
plt.subplot(1, 2, 1)
plt.bar(range(1, 21), variance_explained[:20], alpha=0.7, color='steelblue')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Variance Explained by Each PC')

# Plot 2: Cumulative variance
plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumsum)+1), cumsum, 'b-o', markersize=4)
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.axvline(x=n_95, color='g', linestyle='--', label=f'{n_95} components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.title('Cumulative Variance Explained')
plt.legend()

plt.tight_layout()
plt.savefig('scree_plot.png', dpi=150)
plt.show()

Look for the "elbow" — where adding more components gives diminishing returns.


What Do Principal Components Mean?

Each PC is a weighted combination of original features.

import pandas as pd

# Look at what PC1 "means"
print("PC1 is a combination of original features:")
print("-" * 50)

# Get the loadings (weights)
loadings = pca.components_[0]  # First principal component

# Show top positive and negative contributors
feature_names = [f'pixel_{i}' for i in range(64)]
pc1_loadings = pd.Series(loadings, index=feature_names)

print("\nTop 5 POSITIVE contributors to PC1:")
print(pc1_loadings.nlargest(5))

print("\nTop 5 NEGATIVE contributors to PC1:")
print(pc1_loadings.nsmallest(5))

For images, you can visualize what each PC "looks for":

# Visualize first 6 principal components as images
fig, axes = plt.subplots(2, 3, figsize=(12, 8))

for i, ax in enumerate(axes.flat):
    # Reshape the component weights back to 8x8 image
    component_image = pca.components_[i].reshape(8, 8)
    ax.imshow(component_image, cmap='RdBu_r')
    ax.set_title(f'PC{i+1} ({pca.explained_variance_ratio_[i]:.1%})')
    ax.axis('off')

plt.suptitle('What Each Principal Component "Looks For"', fontsize=14)
plt.tight_layout()
plt.savefig('pc_components.png', dpi=150)
plt.show()

When to Use PCA

✅ Use Case 1: Too Many Features (Curse of Dimensionality)

# Problem: Model is slow and overfitting
X_original = load_high_dimensional_data()  # 10,000 features!
model.fit(X_original, y)  # Slow, overfits

# Solution: PCA first
pca = PCA(n_components=100)  # 10,000 → 100
X_reduced = pca.fit_transform(X_original)
model.fit(X_reduced, y)  # Fast, generalizes better!

✅ Use Case 2: Multicollinearity (Correlated Features)

When features are highly correlated, models struggle. PCA creates UNCORRELATED components.

# Check correlation
print("Original feature correlations:")
print(np.corrcoef(X.T)[:5, :5])  # Many high correlations!

# After PCA: Components are orthogonal (correlation = 0)
X_pca = PCA(n_components=5).fit_transform(X)
print("\nPCA component correlations:")
print(np.corrcoef(X_pca.T).round(10))  # All zeros except diagonal!

Output:

Original feature correlations:
[[ 1.    0.89  0.85  0.72  0.68]
 [ 0.89  1.    0.91  0.78  0.71]
 [ 0.85  0.91  1.    0.83  0.75]
 [ 0.72  0.78  0.83  1.    0.88]
 [ 0.68  0.71  0.75  0.88  1.  ]]

PCA component correlations:
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]

PCA eliminated all correlation!


✅ Use Case 3: Visualization (Any Dimensions → 2D/3D)

# Can't visualize 100 dimensions!
# But PCA can project to 2D while preserving structure

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_100d)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.title('100D Data Visualized in 2D')
plt.show()

✅ Use Case 4: Noise Reduction

Low-variance components often capture noise, not signal.

# Keep only components that explain real variance
pca = PCA(n_components=0.95)  # Keep 95% variance
X_denoised = pca.fit_transform(X_noisy)

# The noisy components were discarded!
print(f"Removed {X_noisy.shape[1] - X_denoised.shape[1]} noisy dimensions")
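
If you need the denoised data back in the original feature space, you can map the kept components back with inverse_transform. A short sketch, reusing pca and X_denoised from above (X_noisy is a stand-in for your own data):

# Reconstruct in the original feature space using only the kept components
X_reconstructed = pca.inverse_transform(X_denoised)

# Same shape as X_noisy, but the low-variance (noisy) directions have been flattened out
print(X_reconstructed.shape)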

✅ Use Case 5: Speeding Up Other Algorithms

from sklearn.neighbors import KNeighborsClassifier
import time

# KNN on 10,000 features: SLOW
start = time.time()
knn = KNeighborsClassifier()
knn.fit(X_10000, y)
print(f"KNN on 10,000 features: {time.time()-start:.2f}s")

# KNN on 100 PCA features: FAST
X_pca = PCA(n_components=100).fit_transform(X_10000)
start = time.time()
knn = KNeighborsClassifier()
knn.fit(X_pca, y)
print(f"KNN on 100 PCA features: {time.time()-start:.2f}s")

When NOT to Use PCA

❌ Don't Use When: Interpretability Matters

# After PCA, you can't say "income is important"
# You can only say "PC1 is important" — but what IS PC1?

# PC1 = 0.32×income + 0.28×age + 0.21×credit_score - 0.15×debt + ...

# Try explaining that to a business stakeholder!

❌ Don't Use When: Features Have Different Scales (Without Scaling!)

# ❌ WRONG: PCA on unscaled data
X = pd.DataFrame({
    'age': [25, 30, 35, 40],           # Range: 25-40
    'income': [50000, 75000, 100000, 125000]  # Range: 50000-125000
})

pca = PCA(n_components=1)
pca.fit(X)
print(pca.components_)  # Income dominates because of scale!
# [[0.0001, 0.9999]]  ← Age contributes almost nothing!

# ✅ RIGHT: Scale first!
X_scaled = StandardScaler().fit_transform(X)
pca.fit(X_scaled)
print(pca.components_)  # Both features contribute fairly
# [[0.707, 0.707]]

❌ Don't Use When: Relationships Are Non-Linear

PCA finds LINEAR directions. If your data has curves, PCA misses them.

# Example: Data forms a curve (like a spiral)
# PCA will draw a straight line through it — useless!

# For non-linear dimensionality reduction, use:
# - t-SNE (visualization)
# - UMAP (visualization + some ML tasks)
# - Kernel PCA (PCA with kernel trick)
# - Autoencoders (deep learning)
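
For a concrete comparison, here's a sketch using scikit-learn's KernelPCA on data that plain PCA can't untangle. The dataset and parameters (two concentric circles, an RBF kernel with gamma=10) are just illustrative:

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: the structure is a curve, not a line
X_circles, y_circles = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)

X_linear = PCA(n_components=2).fit_transform(X_circles)          # just rotates the circles
X_kernel = KernelPCA(n_components=2, kernel='rbf',
                     gamma=10).fit_transform(X_circles)          # "unrolls" them

# After kernel PCA the two rings become roughly separable along the first component;
# after plain PCA they are still concentric circles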

❌ Don't Use When: All Features Are Equally Important

If the variance is spread roughly evenly across all directions, there is no dominant direction to keep, so PCA can't compress much.

# Check first: Is there variance to compress?
pca = PCA()
pca.fit(X)

if pca.explained_variance_ratio_[0] < 0.1:
    print("Warning: No dominant direction found!")
    print("All features are roughly equally important.")
    print("PCA won't help much here.")

The Complete PCA Pipeline

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Load data (example: 100 features)
X, y = load_data()

# Split FIRST (prevent data leakage!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Method 1: Manual steps
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use train's mean/std!

pca = PCA(n_components=0.95)  # Keep 95% variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)  # Use train's PCA!

print(f"Original: {X_train.shape[1]} features")
print(f"After PCA: {X_train_pca.shape[1]} features")

# Method 2: Pipeline (cleaner, handles CV correctly)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validation with pipeline handles everything correctly!
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
print(f"\nCV Accuracy: {scores.mean():.1%} ± {scores.std():.1%}")

# Final evaluation
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.1%}")

Common Mistakes

Mistake 1: Not Scaling Before PCA

# ❌ WRONG
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)  # Large-scale features dominate!

# ✅ RIGHT
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)

Mistake 2: Fitting PCA on Test Data

# ❌ WRONG: Data leakage!
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)  # NO! Different transformation!

# ✅ RIGHT
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)  # Use train's PCA transformation!

Mistake 3: Keeping Too Few/Many Components

# ❌ WRONG: Arbitrary number
pca = PCA(n_components=10)  # Why 10?

# ✅ RIGHT: Based on variance explained
pca = PCA(n_components=0.95)  # Keep 95% variance
# Or: Use cross-validation to find optimal number
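
One way to let cross-validation pick the number of components is to grid-search pca__n_components inside a pipeline. A sketch (the classifier and the candidate values are arbitrary, and X_train/y_train come from your own train/test split):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Try several component counts and keep the one with the best CV score
search = GridSearchCV(pipeline, {'pca__n_components': [5, 10, 20, 40]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)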

Mistake 4: Using PCA for Classification Directly

# ❌ WRONG: PCA doesn't know about labels!
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)  # Ignores y completely!

# ✅ RIGHT: Use LDA if you want supervised dimensionality reduction
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # Uses labels to maximize class separation!

Quick Reference: PCA Parameters

Parameter               | What It Does                       | Recommended
------------------------|------------------------------------|------------------------------
n_components (int)      | Keep exactly N components          | When you know target size
n_components (float)    | Keep N% of variance                | 0.95 (95% variance)
n_components='mle'      | Auto-select using MLE              | For automatic selection
whiten=True             | Scale components to unit variance  | For some algorithms
svd_solver='full'       | Exact computation                  | For accuracy
svd_solver='randomized' | Approximate (faster)               | For large datasets
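
These options are all just constructor arguments. A quick sketch of how they're passed (the values are illustrative, and X_scaled is assumed to be your standardized data):

from sklearn.decomposition import PCA

# Keep 95% of variance, rescale components to unit variance, use the exact SVD solver
pca = PCA(n_components=0.95, whiten=True, svd_solver='full')
X_pca = pca.fit_transform(X_scaled)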

Key Takeaways

  1. PCA finds directions of maximum variance — Like finding the best camera angle for a sculpture

  2. It's dimensionality reduction, not feature selection — Creates NEW features from combinations

  3. Always scale before PCA — Or large-magnitude features dominate

  4. Components are orthogonal — No correlation between them

  5. Use explained variance to choose components — 95% is a common threshold

  6. Fit on train, transform on test — Never fit PCA on test data!

  7. PCA is unsupervised — It doesn't know about your target variable

  8. Great for visualization, speed, and noise reduction — Not for interpretability


The One-Sentence Summary

PCA is Maria the sculptor finding the one camera angle that captures 95% of her 3D masterpiece's essence in a single 2D photograph — losing a dimension but keeping what matters most.


What's Next?

Now that you understand PCA, you're ready for:

  • Kernel PCA — PCA for non-linear relationships
  • t-SNE and UMAP — Non-linear visualization techniques
  • LDA — Supervised dimensionality reduction
  • Autoencoders — Deep learning approach to compression

Follow me for the next article in this series!


Let's Connect!

If PCA finally makes sense now, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the highest dimensionality reduction ratio you've achieved with PCA? I've seen 10,000 → 50 with 95% variance preserved!


The difference between a model drowning in 10,000 features and one that sails smoothly on 100? Finding the camera angle — the principal components — that captures what actually matters. That's PCA.


Share this with someone struggling to visualize their 100-dimensional data. There's a 2D photograph waiting to be taken.

Happy projecting! 📸
