Ahmed Mujtaba Butt

Posted on • Originally published at Medium

Using PCA for Data Reduction and Face Recognition on LFW Dataset


Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction and feature extraction. It can be used to reduce the size of high-dimensional data, such as images, by projecting them onto a lower-dimensional subspace that captures most of the variance in the data. PCA can also be used to perform face recognition, by comparing the projections of different faces on the same subspace.

In this blog post, we will show you how to apply PCA for data reduction and face recognition on the LFW dataset, which contains images of famous people's faces. We will use Python and scikit-learn to implement PCA and SVM for classification. We will also plot some graphs to visualize the results and compare the performance of different levels of data reduction.

The LFW Dataset

The LFW dataset contains more than 13,000 images of faces gathered from the web. Each face has been labeled with the name of the person pictured. The dataset has been widely used for face recognition research and benchmarking.

We will use a cropped version of the LFW dataset, which contains 13,233 greyscale images of size 64x64 pixels. To split the data into train and test sets, see README.txt and the folder titled 'lists' included in the download. You can download the dataset from LFWcrop (cropped Labeled Faces in the Wild).

Implementing PCA

To implement PCA, we will use the scikit-learn library, which provides a convenient class called PCA that can fit and transform the data. We will first load the data and flatten each image into a one-dimensional vector of length 4,096 (64x64). Then we will create a PCA object and fit it on the training data. We can specify either the number of components we want to keep or the percentage of variance we want to preserve, using the n_components parameter.
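
Before fitting PCA, the images need to be loaded and flattened. Here is a rough sketch of what that might look like; the folder path, the .pgm extension, and the filename-to-label rule are assumptions based on the LFWcrop layout, and a simple random split is used here as a stand-in for the official lists:

import os
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

FACES_DIR = 'lfwcrop_grey/faces'  # assumed path to the extracted LFWcrop images

images, labels = [], []
for fname in sorted(os.listdir(FACES_DIR)):
    if not fname.endswith('.pgm'):
        continue
    img = np.asarray(Image.open(os.path.join(FACES_DIR, fname)), dtype=np.float64)
    images.append(img.ravel())               # flatten 64x64 -> 4,096
    labels.append(fname.rsplit('_', 1)[0])   # 'George_W_Bush_0001.pgm' -> 'George_W_Bush'

X = np.vstack(images)
y = np.array(labels)

# Simple random split standing in for the official train/test lists
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)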

For example, if we want to keep 97% of the variance in the data, we can write:

from sklearn.decomposition import PCA
pca = PCA(n_components=0.97)
pca.fit(X_train)

This will find the optimal number of components that can explain 97% of the variance in the data. We can check how many components are kept by using the n_components_ attribute:

print(pca.n_components_)

This will print:

276

This means that we can reduce the dimensionality of each image from 4,096 to 276 without losing much information.
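
As an optional sanity check, the total variance captured by the kept components can be read off the fitted PCA object directly:

# Sum of the variance ratios of the kept components; should be at least 0.97
print(pca.explained_variance_ratio_.sum())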

To transform the data into the lower-dimensional subspace, we can use the transform method:

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

We can also plot some of the principal components as images, to see what they look like. The principal components are stored as rows in the components_ attribute of the PCA object. We can reshape them back into images and use matplotlib to display them:

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 12))
for i in range(16):
    plt.subplot(4, 4, i+1)
    plt.imshow(pca.components_[i].reshape(64, 64), cmap='gray')
    plt.title(f'Component {i+1}')
plt.show()

This will produce a plot like this:

[Image: the first 16 principal components, displayed as 64x64 greyscale images]

Classification with SVM

To perform face recognition on the LFW dataset, we will use a Support Vector Machine (SVM) classifier. SVM is a popular machine learning algorithm that finds a hyperplane separating the different classes of data with maximum margin. We will use scikit-learn's SVC class to create an SVM classifier with a linear kernel.

We will first train and test the classifier on the original data (without PCA), to get a baseline accuracy. We will use accuracy as our evaluation metric, which is simply the percentage of correctly classified images.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy on original data: {acc:.4f}')

This will print:

Accuracy on original data: 0.8385

This means that our classifier can correctly recognize about 84% of the faces in the test set.

Next, we will train and test the classifier on the reduced data (with PCA), and see how the accuracy changes. We will use the same SVC function, but with the transformed data as input.

svm_pca = SVC(kernel='linear')
svm_pca.fit(X_train_pca, y_train)
y_pred_pca = svm_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)
print(f'Accuracy on reduced data (97% variance): {acc_pca:.4f}')

This will print:

Accuracy on reduced data (97% variance): 0.8416

We can see that the accuracy has slightly improved, even though we have reduced the dimensionality of the data by approximately 15 times. This shows that PCA can help remove some noise and redundancy in the data, and make the classifier more efficient and effective.
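
To make the efficiency point concrete, you could time the two fits. This is only an illustrative sketch reusing the objects defined above; the actual numbers will depend on your machine:

import time

start = time.perf_counter()
SVC(kernel='linear').fit(X_train, y_train)        # 4,096 features per image
print(f'Fit time on original data: {time.perf_counter() - start:.2f} s')

start = time.perf_counter()
SVC(kernel='linear').fit(X_train_pca, y_train)    # 276 features per image
print(f'Fit time on reduced data:  {time.perf_counter() - start:.2f} s')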

Comparing Different Levels of Data Reduction

To see how the accuracy changes with different levels of data reduction, we will repeat the same process with different values of n_components, ranging from 0.9 to 0.99. We will store the accuracy and the number of components for each value in two lists, and plot them as graphs.

n_components_list = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99]
accuracy_list = []
n_components_kept_list = []
for n_components in n_components_list:
    pca = PCA(n_components=n_components)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    svm_pca = SVC(kernel='linear')
    svm_pca.fit(X_train_pca, y_train)
    y_pred_pca = svm_pca.predict(X_test_pca)
    acc_pca = accuracy_score(y_test, y_pred_pca)
    accuracy_list.append(acc_pca)
    n_components_kept_list.append(pca.n_components_)
    print(f'Variance kept: {n_components}, Accuracy: {acc_pca:.4f}, Components kept: {pca.n_components_}')

This will print:

Variance kept: 0.90, Accuracy: 0.7733, Components kept: 85
Variance kept: 0.91, Accuracy: 0.7826, Components kept: 95
Variance kept: 0.92, Accuracy: 0.7795, Components kept: 107
Variance kept: 0.93, Accuracy: 0.7795, Components kept: 121
Variance kept: 0.94, Accuracy: 0.7981, Components kept: 138
Variance kept: 0.95, Accuracy: 0.8230, Components kept: 160
Variance kept: 0.96, Accuracy: 0.8447, Components kept: 188
Variance kept: 0.97, Accuracy: 0.8416, Components kept: 226
Variance kept: 0.98, Accuracy: 0.8199, Components kept: 282
Variance kept: 0.99, Accuracy: 0.8354, Components kept: 385

We can see that the accuracy generally increases as we keep more variance in the data, but it peaks at around 96% of the variance and fluctuates slightly beyond that point.

To visualize the results better, we can plot two graphs:

  • Variance kept vs accuracy
  • Variance kept vs number of components
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.plot(n_components_list, accuracy_list)
plt.xlabel('Variance Kept')
plt.ylabel('Accuracy')
plt.title('Variance Kept vs Accuracy')
plt.subplot(1,2,2)
plt.plot(n_components_list,n_components_kept_list)
plt.xlabel('Variance Kept')
plt.ylabel('Number of Components')
plt.title('Variance Kept vs Number of Components')
plt.show()

This will produce a plot like this:

[Image: variance kept vs accuracy (left) and variance kept vs number of components (right)]

We can see that the accuracy curve rises noticeably as we keep more variance in the data but levels off once more than about 96% of the variance is kept, while the number of components grows sharply as the kept variance increases.

This suggests that there is a trade-off between data reduction and accuracy when using PCA for face recognition. If we want to reduce the size of the data as much as possible without losing much accuracy, we can choose to keep around 96% of the variance in the data.
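
If you want to bundle that choice into a single reusable model, one option (a sketch using scikit-learn's Pipeline, not something built earlier in this post) is to chain the two steps:

from sklearn.pipeline import make_pipeline

# PCA keeping 96% of the variance, followed by a linear SVM
model = make_pipeline(PCA(n_components=0.96), SVC(kernel='linear'))
model.fit(X_train, y_train)
print(f'Pipeline accuracy: {accuracy_score(y_test, model.predict(X_test)):.4f}')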

Conclusion

In this blog post, we have demonstrated how to use PCA for data reduction and face recognition on the LFW dataset. We used scikit-learn to implement PCA and an SVM classifier, and plotted some graphs to compare the results. We found that PCA can reduce the size of the data by approximately 15 times without losing much accuracy, and that keeping around 96% of the variance is a good trade-off between data reduction and accuracy. I hope this blog post has been helpful and that you have learned something new about PCA and face recognition. Thank you for reading!
