aksh aggarwal

Rescuing the Signal: How PCA Salvages Accuracy from Catastrophic Data Poisoning

The "Garbage In, Garbage Out" Reality

In the controlled environment of a classroom or a Kaggle competition, we are often handed pristine, pre-cleaned data. But in the real world—where sensors degrade, transmission lines suffer from electromagnetic interference, and environments are unpredictable—data is rarely "clean".

My latest project at Purdue focused on a fundamental truth of Machine Learning: Garbage In, Garbage Out. By intentionally "poisoning" a classic dataset with massive amounts of noise, my team and I explored the brittleness of standard classifiers and, more importantly, how mathematical remediation like Principal Component Analysis (PCA) can be used to rescue a failing system.

The Challenge: The Digital Canvas

We utilized the Scikit-Learn Digits Dataset, a classic collection of 1,797 handwritten digits. Unlike modern high-resolution images, these are tiny: just 8x8 pixels. With only 64 total pixels per image, the margin for error is razor-thin. Distinguishing a '3' from an '8' at this resolution is already a challenge; adding noise makes it nearly impossible for the human eye and traditional algorithms alike.
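
For reference, here is a minimal sketch of loading that dataset with scikit-learn's standard loader; the variable names are my own, not from the project code:

```python
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target   # X has shape (1797, 64): 8x8 images flattened to 64 features
print(X.shape, X.min(), X.max())    # pixel intensities run from 0 to 16
```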

The Arsenal: Comparing Three Architectural Philosophies

To understand how different "mathematical brains" handle data, we selected three distinct models (a quick setup sketch follows the list):

  1. Gaussian Naive Bayes (GNB): A probabilistic baseline that treats every pixel as if it exists in total isolation.

  2. K-Nearest Neighbors (KNN): A distance-based model that classifies digits based on their "visual similarity" in 64-dimensional Euclidean space.

  3. Multi-Layer Perceptron (MLP): A feedforward neural network designed to learn complex, non-linear interactions and abstract features like curves and loops.
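
Here is the setup sketch mentioned above. The hyperparameters are illustrative defaults I chose for this post, not necessarily the exact configuration we used:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "GNB": GaussianNB(),                         # probabilistic; treats each pixel independently
    "KNN": KNeighborsClassifier(n_neighbors=5),  # distance-based voting in 64-dimensional space
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),  # feedforward net
}
```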

The Attack: Simulating a Low-SNR Environment

To stress-test these models, we introduced an adversarial simulation by adding random Gaussian noise with a scale of 10.0. Considering our pixel values only range from 0 to 16, a standard deviation of 10.0 is catastrophic. This effectively buried the digit signal under a mountain of static, creating a "low Signal-to-Noise Ratio" (SNR) environment.
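
The poisoning step itself is essentially a one-liner. The sketch below assumes NumPy's default random generator and a seed of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, purely for reproducibility of the sketch
# Gaussian noise with standard deviation 10.0 on pixels that only span 0 to 16
X_noisy = X + rng.normal(loc=0.0, scale=10.0, size=X.shape)
```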

The result? Total failure. Accuracy for every classifier plummeted from over 95% to roughly 10-20%, the statistical equivalent of random guessing.

The Rescue: Mathematical Denoising via PCA

Our remediation strategy centered on Principal Component Analysis (PCA). The intuition is elegant: in a dataset of digits, the "digit" itself represents structured, high-variance signal, while the "poison" is random, high-frequency noise.

We configured PCA to retain only the components explaining 80% of the variance. By doing this, we essentially told the computer: "Ignore the trailing 20% of the variance—that's just random noise—and reconstruct the image using only the most significant structural elements". After the reconstruction, we used np.clip to force the resulting pixel values back into the valid [0, 16] range, preventing outliers from skewing our results.
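
In code, the denoising step looks roughly like this; the variable names are mine, but the 80% variance threshold and the clipping step follow the description above:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.80)                    # keep components explaining 80% of the variance
X_reduced = pca.fit_transform(X_noisy)          # project noisy images onto those components
X_denoised = pca.inverse_transform(X_reduced)   # reconstruct back into 64-pixel space
X_denoised = np.clip(X_denoised, 0, 16)         # clamp reconstructed pixels to the valid range
```

Passing a float between 0 and 1 to n_components tells scikit-learn to keep just enough components to explain that fraction of the variance, which is exactly the behavior we wanted.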

The Bounce Back: Analyzing the Recovery

The application of PCA yielded a dramatic recovery:

KNN (The Comeback King): Bounced back to ~94% accuracy. Because PCA restored the local clustering structure, the distance-based logic could once again "see" which digits were neighbors.

MLP (The Robust Performer): Reached ~95% accuracy. The neural network effectively learned to classify the "smoothed" versions of the digits generated by the PCA filters.

GNB (The Partial Recovery): Only reached ~82%. While a huge improvement, GNB struggled because the PCA process introduced pixel correlations (smoothing) that violated the model’s core "independence" assumption.

[Figure: Model accuracy comparison]

Final Thoughts and the Road Ahead

This project proved that PCA is an incredibly effective tool for "salvaging" usable information from heavily corrupted datasets. However, there is always room to grow. Future iterations of this work could explore Convolutional Neural Networks (CNNs), which use pooling layers to naturally filter noise, potentially removing the need for a separate denoising step altogether.

For those interested in the code, the full implementation is available on my GitHub: https://github.com/akshagg/-Adversarial-Robustness-in-Digit-Classification.
