Principal Component Analysis (PCA) is an approach to dimensionality reduction that improves data visualization while preserving as much information from the original data as possible. The information loss is measured by the variance retained in the compressed (projected) data relative to the original data; the goal is to maximize this retained variance.
In machine learning, PCA can speed up a learning algorithm by reducing a high-dimensional set of features (e.g. 10,000) to a lower-dimensional one (e.g. 1,000), enabling the algorithm to run faster.
PCA requires a preprocessing step before the algorithm itself. The preprocessing normalizes the data so that each feature in the dataset has zero mean and unit variance. The code is shown below:
import numpy as np

def featureNormalize(X):
    mu = np.mean(X, axis=0)      # Mean of each feature
    sigma = np.std(X, axis=0)    # Standard deviation of each feature
    X_norm = (X - mu) / sigma    # Normalized data (zero mean and unit standard deviation)
    return X_norm
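As a quick sanity check of the normalization step, the snippet below (using a small made-up array) verifies that each feature ends up with zero mean and unit standard deviation; the inline computation mirrors what featureNormalize does:

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_norm = (X - np.mean(X, axis=0)) / np.std(X, axis=0)  # same as featureNormalize(X)

print(np.allclose(np.mean(X_norm, axis=0), 0.0))  # True: zero mean per feature
print(np.allclose(np.std(X_norm, axis=0), 1.0))   # True: unit std per feature
```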
To reduce the data from n dimensions to k dimensions (k being the number of components), the covariance matrix, denoted by Σ, is first computed as follows:

Σ = (1/m) XᵀX

where X is the m x n matrix of n-dimensional data, m is the number of examples, and Σ is the covariance matrix of size n x n.
Next, the eigenvectors of the covariance matrix are computed through Singular Value Decomposition (SVD), which yields the unitary matrix U and the vector of singular values S. Taking the first k columns of U gives the reduced matrix Ur, which is used to obtain the k-dimensional reduced data Z, as the equation below shows:

Z = X Ur
The Python code to perform the PCA is shown below:
def pca(X, k):
    m = np.size(X, axis=0)           # Number of examples
    sigma = (1/m) * X.T.dot(X)       # Covariance matrix
    U, S, V = np.linalg.svd(sigma)   # Singular Value Decomposition
    Ur = U[:, 0:k]                   # Reduced matrix U (first k columns)
    Z = X.dot(Ur)                    # Projected data of k dimensions
    return Z, S
The inputs of the function are the (normalized) data X and the number of components k; the outputs are the projected data Z and the vector of singular values S.
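To illustrate these steps, here is a minimal sketch applied to a small randomly generated dataset (the data and its size are assumptions for illustration only):

```python
import numpy as np

# Hypothetical dataset: 100 examples, 3 features, normalized as in featureNormalize
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)

m = X.shape[0]
sigma = (1/m) * X.T.dot(X)        # Covariance matrix (3 x 3)
U, S, V = np.linalg.svd(sigma)    # SVD of the covariance matrix
k = 2
Ur = U[:, 0:k]                    # First k columns of U
Z = X.dot(Ur)                     # Projected data

print(Z.shape)   # (100, 2): m examples reduced to k dimensions
print(S.shape)   # (3,): one singular value per original feature
```

Note that np.linalg.svd returns the singular values in descending order, which is what allows the first k columns of U to capture the directions of largest variance.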
The core of PCA is the choice of the number of components k, which is guided by the cumulative explained variance ratio: by evaluating it, we can check how much of the information of the original data is preserved. The variance ratio is obtained from the vector of singular values S as shown in the equation below:

variance ratio = (S₁ + S₂ + … + Sₖ) / (S₁ + S₂ + … + Sₙ)

where k is the number of components (k dimensions) and n is the number of features. k is chosen as the smallest value whose variance ratio is higher than a specific threshold, 99% for example. The code to calculate the cumulative explained variance ratio is shown below:
def cumulativeExplainedVariance(S, k_range):
    variance_ratio = np.zeros(k_range)  # Cumulative explained variance ratio
    for i in range(k_range):
        variance_ratio[i] = np.sum(S[0:i+1]) / np.sum(S)
    return variance_ratio
The function inputs are the vector of singular values S and the range of components k_range to investigate. The output is a vector of cumulative explained variance ratios.
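As a sketch of how the ratios behave, the loop below applies the same computation to a made-up vector of singular values; note that the last entry is always 1, since keeping all components retains all the variance:

```python
import numpy as np

S = np.array([4.0, 3.0, 2.0, 1.0])  # hypothetical singular values (sum = 10)

k_range = 4
variance_ratio = np.zeros(k_range)
for i in range(k_range):
    variance_ratio[i] = np.sum(S[0:i+1]) / np.sum(S)

print(variance_ratio)  # [0.4 0.7 0.9 1. ]
```

With a threshold of 0.85, for example, k = 3 would be chosen here, since 0.9 is the first ratio above it.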
The iris dataset is used to analyze the choice of the number of components in PCA. A brief description of the dataset and its features is given in the previous blog post on Fisher's discriminant.
Investigating the iris dataset for k values in the range 1–3, we obtain the plot of the cumulative explained variance ratio against the number of components below:
It is observed that k = 3 is the smallest value of k with a cumulative explained variance ratio of 0.9948, which is higher than the threshold of 0.99 (red dashed line). So, k = 3 is the smallest value of k that preserves at least 99% of the variance of the original data in the projected data. The figure below shows the 3D scatter plot of the iris data projected with k = 3.
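The whole pipeline on the iris data can be sketched as below, assuming scikit-learn's bundled copy of the dataset (load_iris) as the data source; np.cumsum condenses the explained-variance loop into a single line:

```python
import numpy as np
from sklearn.datasets import load_iris  # assumption: scikit-learn is available

X = load_iris().data                             # 150 examples, 4 features
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)    # featureNormalize step

m = X_norm.shape[0]
sigma = (1/m) * X_norm.T.dot(X_norm)             # covariance matrix
U, S, V = np.linalg.svd(sigma)

ratios = np.cumsum(S) / np.sum(S)                # cumulative explained variance ratios
k = int(np.argmax(ratios >= 0.99)) + 1           # smallest k above the 0.99 threshold
Z = X_norm.dot(U[:, 0:k])                        # projected data

print(k, Z.shape)  # 3 (150, 3)
```

This reproduces the result above: k = 3 is the first value whose cumulative ratio (0.9948) crosses the 0.99 threshold.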
The nutrient analysis of pizza dataset is investigated next. The dataset is available on Kaggle and has 300 examples and 7 features distributed across 10 classes. Among the features are the amounts of water and protein per 100 grams of sample.
Inspecting the nutrients of pizza dataset for k values in the range 1–6, we obtain the plot of the cumulative explained variance ratio against the number of components below:
Checking the plot above, it can be noted that k = 4 is the smallest value of k with a cumulative explained variance ratio of 0.9960, which is higher than the threshold of 0.99 (red dashed line). Thus, k = 4 is the smallest value of k that preserves at least 99% of the variance of the original data in the projected data. The figure below shows the plot of feature 1 against feature 2 of the projected pizza nutrients data with k = 4.
In this way, evaluating the cumulative explained variance ratio is a reliable method to choose the number of components in PCA, which performs dimensionality reduction while preserving most of the variance of the original data in the projected data.
If you are interested in dimensionality reduction with Fisher's linear discriminant, I wrote a blog post about it. If you want to check it out: blog post.