Dimensionality reduction is a fundamental technique in machine learning that reduces the number of input features (dimensions) in a dataset while preserving as much important information as possible.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms data into a new coordinate system.
Instead of using the original features, PCA creates new variables called principal components, which are:
- Linear combinations of the original features
- Ordered by importance (variance explained)
PCA works by identifying directions (called principal axes) where the data varies the most.
The first principal component captures the maximum variance while the second principal component captures the next highest variance.
This allows us to:
- Keep only the most informative components
- Discard less important ones
How PCA Works
- Standardize the data - Features must be scaled (very important for PCA)
- Compute the covariance matrix - Shows relationships between features.
- Compute eigenvalues and eigenvectors
- Eigenvectors - directions (principal components)
- Eigenvalues - importance (variance explained)
- Sort components by eigenvalues - Highest variance first.
- Select top K components - Reduce dimensions.
- Transform the data - Project data onto new axes.
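As a rough illustration of these steps, here is a minimal from-scratch sketch in NumPy. The toy data and variable names (X_toy, X_std, components) are illustrative assumptions only, not how scikit-learn implements PCA internally.
import numpy as np
# toy data: 200 samples, 5 features (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# 1. standardize the data
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
# 2. compute the covariance matrix
cov = np.cov(X_std, rowvar=False)
# 3. eigenvalues (variance explained) and eigenvectors (directions)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. sort components by eigenvalue, highest variance first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. select the top K components
k = 2
components = eigvecs[:, :k]
# 6. project the data onto the new axes
X_reduced = X_std @ components
print(X_reduced.shape)  # (200, 2)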
Workflow
Splitting data.
#splitting data
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# load the 64-dimensional digits dataset used throughout this workflow
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
# scale using statistics computed on the training split only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Scaling data.
#scaling
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
Training
#PCA Principal Components
from sklearn.decomposition import PCA
pca_full = PCA()
pca_full.fit(X_scaled)
The following lines show how much of the total information the PCA model captures as you add more components (the cumulative explained variance).
import numpy as np
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
Plotting the cumulative explained variance against the 95% and 99% thresholds.
You look for the point where the curve starts flattening (diminishing returns).
import matplotlib.pyplot as plt
plt.figure(figsize=(9, 4))
plt.plot(cumvar, linewidth=2)
plt.axhline(0.95, c='red', linestyle='--', label='95% threshold')
plt.axhline(0.99, c='orange', linestyle='--', label='99% threshold')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('How many components to explain 95% of the variance')
plt.grid(alpha=0.3)
plt.legend()
plt.show()
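If you prefer an exact number instead of reading it off the plot, a minimal way to get it from cumvar (using the arrays defined above) is:
import numpy as np
# index of the first component at which cumulative variance reaches 95%, plus one
n_components_95 = int(np.argmax(cumvar >= 0.95)) + 1
print(f'{n_components_95} components explain at least 95% of the variance')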
Training
# fit pca on training only
pca_train = PCA(n_components=0.95)
X_train_r = pca_train.fit_transform(X_train)
X_test_r = pca_train.transform(X_test)
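It is worth confirming how many components n_components=0.95 actually kept and how much variance they retain; a quick check on the fitted object:
# number of components selected and the variance they preserve
print(f'Components kept: {pca_train.n_components_}')
print(f'Variance retained: {pca_train.explained_variance_ratio_.sum():.3f}')
print(f'Shape before: {X_train.shape}, after: {X_train_r.shape}')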
Visualization
Projecting the data onto the first two principal components and plotting the result.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# project the scaled digit data onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.figure(figsize=(6, 3))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', alpha=0.7, s=20)
plt.colorbar(scatter, label='digit class')
plt.title('64-dimensional digit data projected to 2D via PCA')
plt.xlabel('PC1 (highest variance direction)')
plt.ylabel('PC2 (second highest variance direction)')
plt.show()
t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique specifically designed for visualizing high-dimensional data in 2D or 3D spaces. It works by modelling pairwise similarities between data points in the high-dimensional space and optimizing their representation in a lower-dimensional space to preserve these similarities.
Unlike PCA, t-SNE focuses on maintaining local relationships by minimizing the Kullback–Leibler (KL) divergence between the high-dimensional and low-dimensional distributions of data points, making it highly effective at finding clusters and patterns in complex datasets.
However, it is slow to run and does not scale well to very large datasets.
t-SNE is primarily used for exploratory data analysis and visualization rather than feature reduction or preprocessing.
Importing load_digits from sklearn.datasets.
from sklearn.datasets import load_digits
data = load_digits()
X = data.data
y = data.target
Scaling data
# scaling data
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
Randomly sampling 1,000 digits from the scaled data so t-SNE runs quickly.
import numpy as np
ids = np.random.choice(len(X_scaled), 1000, replace=False)
X_sub, y_sub = X_scaled[ids], y[ids]
Running t-SNE.
from sklearn.manifold import TSNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    max_iter=1000,
    random_state=42,
)
X_tsne = tsne.fit_transform(X_sub)
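Because t-SNE optimizes the KL divergence mentioned above, one quick sanity check on a finished run is to inspect the final divergence on the fitted object; lower values mean the 2D map preserves the pairwise similarities better:
# final KL divergence reached by the optimization
print(f'KL divergence: {tsne.kl_divergence_:.3f}')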
Visualization.
#plotting
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sub, cmap='tab10', alpha=0.7, s=25)
cbar = plt.colorbar(scatter)
cbar.set_ticks(range(10))
cbar.set_ticklabels([str(i) for i in range(10)])
cbar.set_label('Digit class')
plt.title('Digit data projected to 2D via t-SNE')
plt.show()


