jacobjerryarackal

Posted on May 18

PCA (Principal Component Analysis): Finding the Hidden Structure in High‑Dimensional Data

#machinelearning #ai #programming #datascience

1. The Problem It Solves

Imagine you have a dataset with 100 features: pixel values from 10×10 grayscale images, say, or 100 different sensor readings from a factory machine. You cannot plot 100 dimensions directly, many features might be redundant (e.g., two sensors that always move together), and training a model on all 100 features can be slow and prone to overfitting. PCA addresses this through dimensionality reduction: it finds a small number of new features (called principal components) that capture the most important patterns in the data. Real‑world applications include compressing images, visualising high‑dimensional data in 2D/3D, speeding up other algorithms, and removing noise.

2. The Core Idea (Intuition First)

Think of a flat, elongated cloud of points on a piece of paper. If you had to draw a single line through that cloud that “explains” as much of its shape as possible, you would draw it along the longest direction of the cloud. That line is the first principal component. If you then had to draw a second line, perpendicular to the first, that explains as much of the remaining variation as possible, that would be the second principal component. PCA finds these “lines of maximum variance” in your data. By keeping only the top few such directions, you can represent each data point with just two or three numbers instead of hundreds, while losing as little information as possible.

Technically, PCA looks for orthogonal directions (principal components) that maximise the variance of the projected data. It turns out these directions are the eigenvectors of the data’s covariance matrix, sorted by their eigenvalues.
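To make the “longest direction of the cloud” idea concrete, here is a minimal sketch on a made‑up 2D cloud (the data, the 30° rotation, and all variable names are assumptions for illustration). It checks that the top eigenvector of the covariance matrix has at least as much projected variance as any other direction we try.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elongated cloud: wide along x, narrow along y, then rotated by 30 degrees
points = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
cloud = points @ rotation.T

centered = cloud - cloud.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
first_pc = eigenvectors[:, -1]                   # eigenvector with the largest eigenvalue

# Variance of the data projected onto the first principal component
var_along_pc = np.var(centered @ first_pc, ddof=1)

# Variance along 180 other unit directions, one per degree
angles = np.linspace(0, np.pi, 180, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
var_along_others = np.var(centered @ directions.T, axis=0, ddof=1)

pc_angle = np.degrees(np.arctan2(first_pc[1], first_pc[0])) % 180

print("Variance along PC1:", var_along_pc)
print("Largest variance among sampled directions:", var_along_others.max())
print("Angle of PC1 in degrees:", pc_angle)
```

The angle printed at the end should land close to the 30° we rotated the cloud by, since that is where its long axis points.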

3. How It Works (The Math + Logic)

Let’s walk through PCA step by step.

Step 1 – Standardise the data

PCA is sensitive to scale. We first centre (subtract mean) and often scale (divide by standard deviation) each feature.

For a data matrix $X$ of shape $(n_{\text{samples}}, n_{\text{features}})$, we compute the mean of each feature and subtract it from every row:

$$X_{\text{centered}} = X - \bar{X}$$
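In NumPy, centring and scaling come down to a couple of broadcast operations. The sketch below uses a small made‑up matrix (the values and variable names are assumptions, not data from this post).

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features on very different scales
X = np.array([[1.0, 200.0, 0.01],
              [2.0, 180.0, 0.03],
              [3.0, 260.0, 0.02],
              [4.0, 240.0, 0.05],
              [5.0, 300.0, 0.04]])

feature_means = X.mean(axis=0)           # one mean per column
X_centered = X - feature_means           # broadcasting subtracts it from every row

feature_stds = X.std(axis=0, ddof=1)     # sample standard deviation per column
X_standardised = X_centered / feature_stds

print(X_centered.mean(axis=0))               # approximately zero for every feature
print(X_standardised.std(axis=0, ddof=1))    # approximately one for every feature
```

Note that scikit‑learn's `StandardScaler` divides by the population standard deviation (`ddof=0`) rather than the sample one used here. The two differ only by a constant factor, so the principal directions are the same.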

Step 2 – Compute the covariance matrix

The covariance matrix $C$ (size $n_{\text{features}} \times n_{\text{features}}$) tells us how each pair of features varies together. Entry $C_{ij}$ is the covariance between feature $i$ and feature $j$. With $n = n_{\text{samples}}$:

$$C = \frac{1}{n-1} X_{\text{centered}}^T X_{\text{centered}}$$
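As a quick sanity check, the formula can be written out directly and compared against `np.cov`. The random data below is only a stand‑in for a real dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))             # hypothetical data: 50 samples, 4 features

X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]

# Covariance matrix from the formula: C = X_centered^T X_centered / (n - 1)
C = X_centered.T @ X_centered / (n - 1)

print(C.shape)                                    # (n_features, n_features)
print(np.allclose(C, np.cov(X, rowvar=False)))    # agrees with NumPy's own estimate
```

`rowvar=False` tells `np.cov` that columns are features and rows are samples, which matches the layout used throughout this post.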

Step 3 – Compute eigenvalues and eigenvectors

We find eigenvectors $\mathbf{v}$ and eigenvalues $\lambda$ that satisfy:

$$C \mathbf{v} = \lambda \mathbf{v}$$

Each eigenvector $\mathbf{v}$ is a principal component (a direction in the original feature space).
Its corresponding eigenvalue $\lambda$ is the variance of the data along that direction.

We sort the eigenvectors by decreasing eigenvalues. The first eigenvector points in the direction of highest variance, the second is orthogonal and points in the next highest, and so on.
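A minimal sketch of this step with NumPy, again on made‑up data. Because a covariance matrix is symmetric, `np.linalg.eigh` is a natural fit; it returns eigenvalues in ascending order, so we reverse them ourselves.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 100 samples, 3 features with unequal spread
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(100, 3)) @ mixing

X_centered = X - X.mean(axis=0)
C = np.cov(X_centered, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(C)    # columns of eigenvectors are the v's

# Sort by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Check C v = lambda v for every pair, and that the directions are orthonormal
residual = C @ eigenvectors - eigenvectors * eigenvalues
gram = eigenvectors.T @ eigenvectors

print(eigenvalues)
print(np.allclose(residual, 0.0), np.allclose(gram, np.eye(3)))
```

The sign of each eigenvector is arbitrary: $\mathbf{v}$ and $-\mathbf{v}$ describe the same direction, so different libraries may return flipped components.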

Step 4 – Choose the top k components

We keep only the first k eigenvectors. The value k is often chosen by looking at the “explained variance ratio”: sum of the top k eigenvalues divided by the sum of all eigenvalues. A common rule is to keep enough components to explain 95% of the total variance.
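Picking k from the explained variance ratio can be sketched in a few lines. The eigenvalues below are invented for illustration, and the 95% threshold is just the rule of thumb mentioned above.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order
eigenvalues = np.array([4.0, 2.5, 1.5, 1.0, 0.6, 0.4])

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

threshold = 0.95
# Index of the first cumulative value that reaches the threshold, converted to a count
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 2))   # [0.4  0.65 0.8  0.9  0.96 1.  ]
print(k)                         # 5
```

Here four components explain 90% of the variance and five explain 96%, so the 95% rule keeps five.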

Step 5 – Project the data

Finally, we transform the original data into the new lower‑dimensional space:

$$X_{\text{reduced}} = X_{\text{centered}} \cdot W_k$$

where $W_k$ is a matrix whose columns are the top $k$ eigenvectors. The new data has shape $(n_{\text{samples}}, k)$.
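The projection itself is a single matrix product. The sketch below runs it on random correlated data (an assumption for illustration) and confirms that each new column has variance equal to its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical data: 200 samples, 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

k = 2
X_centered = X - X.mean(axis=0)
C = np.cov(X_centered, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
W_k = eigenvectors[:, order[:k]]          # shape (n_features, k)

X_reduced = X_centered @ W_k              # shape (n_samples, k)

print(X_reduced.shape)
print(np.var(X_reduced, axis=0, ddof=1))  # variance of each new column
print(eigenvalues[order[:k]])             # matches the top k eigenvalues
```

Each column of `X_reduced` is one principal component score, and its variance equals the eigenvalue of the direction it was projected onto.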

4. When to Use It

Use PCA when:

You need to visualise high‑dimensional data in 2D or 3D (e.g., scatter plot of customers after PCA).
You want to speed up another algorithm (e.g., reduce 1000 features to 50 before training a k‑NN or SVM).
You suspect multicollinearity – many features are highly correlated. PCA decorrelates them.
You want to remove noise – small variance components often capture noise.
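The decorrelation point is easy to verify. In this sketch three made‑up “sensor” features are near copies of each other; after PCA, the covariance between the new components is zero up to floating‑point error. The data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Hypothetical sensors: the second and third are noisy copies of the first
a = rng.normal(size=300)
X = np.column_stack([a,
                     a + 0.1 * rng.normal(size=300),
                     -a + 0.1 * rng.normal(size=300)])

corr_before = np.corrcoef(X, rowvar=False)

scores = PCA(n_components=3).fit_transform(X)
cov_after = np.cov(scores, rowvar=False)
off_diagonal = cov_after - np.diag(np.diag(cov_after))

print(np.round(corr_before, 3))    # strong correlations between the original features
print(np.round(cov_after, 6))      # off-diagonal entries are numerically zero
```

Almost all of the variance ends up in the first component here, which is what you would expect when three features carry essentially one signal.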

Assumptions:

PCA is linear: it only finds linear combinations of the original features. If the underlying structure is highly non‑linear, methods such as autoencoders or t‑SNE (the latter mainly for visualisation) may work better.
It assumes that directions of maximum variance are “important” for your downstream task. This is often true but not guaranteed (e.g., the smallest variance direction might separate classes better).
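The second assumption is worth seeing fail at least once. In this constructed example (all data and names are hypothetical), the high‑variance feature is pure noise and the low‑variance feature carries the class label, so the first principal component is the wrong one to keep.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
labels = np.repeat([0, 1], n)

# Feature 1: large variance, unrelated to the class
x1 = rng.normal(scale=10.0, size=2 * n)
# Feature 2: small variance, but its mean depends on the class
x2 = np.where(labels == 0, -1.0, 1.0) + rng.normal(scale=0.2, size=2 * n)

X = np.column_stack([x1, x2])
X_centered = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
pc1 = eigenvectors[:, -1]   # largest variance
pc2 = eigenvectors[:, 0]    # smallest variance


def class_gap(direction):
    """Distance between class means along a direction, in units of its standard deviation."""
    proj = X_centered @ direction
    return abs(proj[labels == 0].mean() - proj[labels == 1].mean()) / proj.std()


gap_pc1 = class_gap(pc1)
gap_pc2 = class_gap(pc2)

print("Class gap along PC1:", gap_pc1)
print("Class gap along PC2:", gap_pc2)
```

Keeping only PC1 here would throw away nearly everything a classifier needs, even though PC1 explains almost all of the variance.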

When it fails:

When features are not linearly related to the underlying structure (e.g., a spiral or circle pattern).
When you need interpretability. Each principal component is a linear mix of all the original features, so it is harder to interpret than the original features themselves.
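The circle case can be sketched with two concentric rings generated by hand (the radii, noise level, and names are assumptions). A 1D PCA projection mixes the rings together, while a simple non‑linear feature, the distance from the origin, separates them completely.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
labels = np.repeat([0, 1], n)                      # 0 = inner ring, 1 = outer ring

angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.where(labels == 0, 1.0, 3.0) + rng.normal(scale=0.1, size=2 * n)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
projection = X_centered @ eigenvectors[:, -1]      # 1D PCA projection

inner = projection[labels == 0]
outer = projection[labels == 1]

# The inner ring's projected range sits entirely inside the outer ring's range
rings_overlap = bool(inner.min() > outer.min() and inner.max() < outer.max())

# A non-linear feature (distance from the origin) separates the rings cleanly
radius = np.linalg.norm(X, axis=1)
separable_by_radius = bool(radius[labels == 0].max() < radius[labels == 1].min())

print("Rings overlap on PC1:", rings_overlap)
print("Rings separable by radius:", separable_by_radius)
```

No straight line through this data captures “which ring am I on”, which is exactly the kind of structure a linear method cannot see.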

5. My Implementation

We’ll use the classic Iris dataset (4 features) and reduce it to 2D for visualisation. Then we’ll see how much variance is preserved.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load data
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Standardise (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)  # reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance preserved: {sum(pca.explained_variance_ratio_):.2%}")
print("Principal components (directions in original feature space):")
for i, comp in enumerate(pca.components_):
    print(f"  PC{i+1}: {', '.join([f'{name}: {val:.2f}' for name, val in zip(feature_names, comp)])}")

# Visualise
plt.figure(figsize=(8,6))
colors = ['red', 'green', 'blue']
for target, color, label in zip(np.unique(y), colors, target_names):
    plt.scatter(X_pca[y == target, 0], X_pca[y == target, 1], 
                c=color, label=label, alpha=0.7)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Iris dataset after PCA (2D projection)')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

Output (numerical values may vary slightly):

Explained variance ratio: [0.72962445 0.22850762]
Total variance preserved: 95.81%
Principal components (directions in original feature space):
  PC1: sepal length (cm): 0.52, sepal width (cm): -0.27, petal length (cm): 0.58, petal width (cm): 0.56
  PC2: sepal length (cm): 0.38, sepal width (cm): 0.92, petal length (cm): 0.02, petal width (cm): 0.07

The first two principal components capture 95.8% of the total variance, meaning we lose very little information by going from 4 dimensions down to 2. In the scatter plot, setosa forms its own well‑separated cluster, while versicolor and virginica sit next to each other with some overlap, so PCA preserves most of the meaningful structure.
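If you would rather not pick the number of components by hand, scikit‑learn's `PCA` also accepts a float between 0 and 1 for `n_components` and keeps just enough components to reach that share of the variance. A short sketch on the same standardised Iris data (the variable names are mine):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Ask for enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)

print(pca_95.n_components_)                    # number of components actually kept
print(pca_95.explained_variance_ratio_.sum())  # share of variance they explain
```

Since the first two components already explain 95.81% of the variance, this should keep 2 components, the same choice we made manually.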

6. Key Takeaways

PCA is a linear dimensionality reduction technique that finds directions of maximum variance using eigenvectors of the covariance matrix. It is fast, deterministic, and works well for many real‑world datasets.
Standardise your data before PCA whenever features are measured on different scales; otherwise the features with the largest scales will dominate the principal components, and the results will be misleading.
Use the explained variance ratio to decide how many components to keep. A common choice is to keep enough to explain 90–95% of the variance. For visualisation, use 2 or 3 components and accept some information loss.
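To see why scale matters, here is a small sketch with two made‑up features that carry the same underlying signal, one of them multiplied by 1000 (as if it were recorded in different units). The data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical features sharing one signal, recorded on very different scales
signal = rng.normal(size=500)
feature_small = signal + 0.3 * rng.normal(size=500)
feature_large = 1000 * (signal + 0.3 * rng.normal(size=500))
X = np.column_stack([feature_small, feature_large])

# First principal component without and with standardisation
pc1_raw = PCA(n_components=1).fit(X).components_[0]
X_scaled = StandardScaler().fit_transform(X)
pc1_scaled = PCA(n_components=1).fit(X_scaled).components_[0]

print("PC1 loadings, raw data:         ", np.round(pc1_raw, 3))
print("PC1 loadings, standardised data:", np.round(pc1_scaled, 3))
```

On the raw data the first component is almost entirely the large‑scale feature. After standardising, both features get equal weight, which better reflects that they carry the same information.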

DEV Community