Sivananda Panda

Posted on Jun 22

I Compressed 784 Dimensions Into 2. Here's What 70,000 Handwritten Digits Actually Look Like

#machinelearning #ai #computerscience #community

PCA Didn't Improve My Model. It Changed How I Think About Data Instead.

When I ran PCA on a dataset I was exploring, I expected a fairly straightforward outcome.

Reduce dimensionality.

Remove noise.

Train the model again.

Get better performance.

That's the story dimensionality reduction is often associated with.

The reality was much less exciting.

The accuracy barely moved.

At first, I treated PCA as a failed experiment.

Looking back, the failed experiment was actually my mental model.

I had been focused on improving the model without understanding something more fundamental:

What did the data actually look like?

That question eventually led me into one of the most interesting machine learning rabbit holes I've explored so far.

The Problem I Didn't Realize I Had

Like most practitioners, I started with exploratory data analysis.

I checked distributions.

Looked for missing values.

Analyzed correlations.

Built baseline models.

Reviewed performance metrics.

All useful activities.

But none of them answered a question that suddenly felt important:

If every sample is represented as hundreds of features, what shape does this data actually have?

Machine learning models operate in spaces that humans can't visualize.

A dataset with 500 features exists in a 500-dimensional space.

A dataset with 1000 features exists in a 1000-dimensional space.

We can calculate distances in those spaces.

We can train models in those spaces.

But we can't intuitively understand them.

And yet many of the questions we care about are geometric in nature.

Are classes naturally separable?

Are there meaningful clusters?

Are there outliers?

Do the samples lie on some hidden structure?

The information already exists in the data.

The challenge is making it visible.

Discovering a Better Playground

While reading about dimensionality reduction, I came across Christopher Olah's brilliant article on visualizing the MNIST dataset.

For anyone unfamiliar with it, MNIST contains 70,000 grayscale images of handwritten digits from 0 to 9.

Each image is only 28×28 pixels.

That sounds tiny.

But once flattened, every image becomes a point in a 784-dimensional space.

Each handwritten digit is represented by 784 numbers.
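To make the flattening step concrete, here is a minimal sketch. It assumes the images are held in a NumPy array of shape (n, 28, 28); the random pixel values are only a stand-in so the snippet runs without downloading MNIST.

```python
import numpy as np

# Stand-in for a batch of MNIST images: 5 grayscale images of 28x28 pixels.
# Random values are used only so the snippet runs without a download.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each image row by row into one vector of 28 * 28 = 784 values,
# then scale pixel intensities to the range [0, 1].
X = images.reshape(len(images), -1).astype(np.float32) / 255.0

print(X.shape)  # (5, 784)
```

Every row of X is now one point in the 784-dimensional space that the rest of the methods operate on.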

Humans can't visualize 784 dimensions.

Dimensionality reduction algorithms can help us project that space into something we can see.

What fascinated me wasn't the mathematics.

It was the possibility that different algorithms might reveal different aspects of the same dataset.

So I downloaded MNIST and started experimenting.

What began as curiosity quickly turned into a project.

I implemented and compared:

PCA
LDA
t-SNE
UMAP
Sammon Mapping
KNN Graph Visualizations
Autoencoders

My expectation was simple.

Different algorithms would produce slightly different versions of the same visualization.

I was completely wrong.

The Most Interesting Question

After generating the first set of visualizations, I found myself staring at the outputs, wondering:

Which one of these is actually correct?

PCA showed one picture.

t-SNE showed another.

UMAP showed something different again.

The Autoencoder latent space looked completely different from everything else.

They couldn't all be right.

Except they were.

The mistake was assuming they were trying to answer the same question.

Each algorithm was optimizing for a different definition of "important structure."

Once I understood that, dimensionality reduction became much more interesting.

What PCA Actually Taught Me

PCA was the first method I explored because it's usually the default recommendation.

The intuition is elegant.

Find the directions where the data varies the most and project onto those directions.

Simple.

Fast.

Interpretable.

What surprised me was that PCA didn't produce the clean digit separation I expected.

Some digits remained heavily mixed together.

Initially, this felt disappointing.

Then I realized PCA had taught me something important.

Variance is not the same thing as class separation.

The largest source of variation in a dataset isn't necessarily the information that distinguishes one class from another.

A thick handwritten "2" and a thin handwritten "2" can contribute substantial variance while still belonging to the same class.

PCA isn't trying to separate classes.

It's trying to preserve variance.

That distinction seems obvious in hindsight, but seeing it visually made the lesson stick.
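The gap between variance and class separation can be reproduced on a toy example. The sketch below is not the project's code: it implements PCA with a plain SVD, and the two features ("thickness" and "shape") are hypothetical stand-ins chosen so that the high-variance direction carries no class information.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its top principal components using an SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    variance_ratio = (S**2 / np.sum(S**2))[:n_components]
    return X_centered @ components.T, components, variance_ratio

# Toy data (hypothetical, not MNIST): two features, two classes.
rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)
thickness = rng.normal(0.0, 10.0, size=n)            # high variance, unrelated to class
shape = 2.0 * labels + rng.normal(0.0, 0.3, size=n)  # low variance, determines class
X = np.column_stack([thickness, shape])

Z, components, variance_ratio = pca_project(X, n_components=1)
print(components[0])      # dominated by the "thickness" axis
print(variance_ratio[0])  # close to 1, even though this axis barely reflects the class
```

Keeping one component preserves almost all of the variance and throws away almost everything that distinguishes the two classes.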

What LDA Taught Me About Labels

The jump from PCA to LDA was dramatic.

Suddenly, the classes looked much cleaner.

The clusters became easier to distinguish.

The reason isn't that LDA is universally superior.

The reason is that LDA has access to information PCA never sees.

Labels.

PCA asks:

Where is the variance?

LDA asks:

How can I maximize separation between known classes?

Those are fundamentally different objectives.

Running both methods side by side highlighted something important about machine learning in general.

The information contained in labels is incredibly valuable.

Once an algorithm knows what you want to separate, it can optimize directly for that objective.

Without labels, it has to infer structure on its own.
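The difference in what each method is allowed to see shows up directly in the API. This is a minimal sketch using scikit-learn, assuming it is installed; it uses the small 8×8 `load_digits` dataset bundled with the library as a stand-in for MNIST, so the numbers will not match the full experiment.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# load_digits is a small 8x8 digit dataset bundled with scikit-learn,
# used here as a stand-in for MNIST so nothing has to be downloaded.
X, y = load_digits(return_X_y=True)

# PCA never sees y; it only looks at the variance of X.
Z_pca = PCA(n_components=2).fit_transform(X)

# LDA uses y to find directions that separate the labeled classes.
# With 10 classes it can produce at most 9 components.
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(Z_pca.shape, Z_lda.shape)
```

Note that `fit_transform` takes only `X` for PCA but both `X` and `y` for LDA. That one extra argument is the whole difference in objective.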

What t-SNE Taught Me About Beautiful Visualizations

The first visualization that made me stop and stare was t-SNE.

The clusters looked incredible.

Digits formed tight, well-separated groups.

The output was visually satisfying in a way PCA never was.

It almost looked as though the dataset had organized itself.

Then I started reading more about how t-SNE works.

That's when I learned an important lesson.

t-SNE prioritizes preserving local neighborhoods.

Points that are close together remain close together.

Global geometry becomes much less important.

This means something subtle but important.

The clusters themselves are often meaningful.

The distances between clusters are often not.

Humans naturally assume that if Cluster A is closer to Cluster B than to Cluster C, then A and B must be more similar.

With t-SNE, that assumption can easily be wrong.

The experience taught me something I now apply beyond dimensionality reduction.

The most visually impressive result isn't always the most informative one.
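One way to check the local-neighborhood claim rather than trust the picture is to score it. The sketch below assumes scikit-learn is installed and again uses a subset of the bundled `load_digits` data as a stand-in for MNIST. It runs t-SNE and then computes scikit-learn's `trustworthiness` score, which lies between 0 and 1 and is high when points that are neighbors in the embedding were also neighbors in the original space.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

# Small subset of the bundled 8x8 digits (a stand-in for MNIST) to keep this fast.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# perplexity roughly controls how many neighbours each point pays attention to.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Z = tsne.fit_transform(X)

# Score in [0, 1]: how well the 5 nearest neighbours in the 2D plot
# reflect true neighbours in the original 64-dimensional space.
score = trustworthiness(X, Z, n_neighbors=5)
print(Z.shape, round(score, 3))
```

The score says nothing about distances between clusters, which is exactly the part of a t-SNE plot that should not be read literally.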

Why UMAP Felt Different

After seeing the extremes of PCA and t-SNE, UMAP felt like the first algorithm that was trying to strike a balance between global and local structure rather than committing to one of them.

PCA focuses on preserving variance.

t-SNE focuses heavily on preserving local neighborhoods.

UMAP sits somewhere in between.

The underlying assumption behind UMAP is that high-dimensional data often lies on a lower-dimensional manifold. Instead of asking where the variance is largest, UMAP asks a different question:

Which points genuinely belong together, and what hidden structure could explain those relationships?

For my experiments, I used 15 neighbors, a cosine distance metric, and projected the data into three dimensions. These choices turned out to be important.

Using 15 neighbors meant that each digit considered a reasonably sized local neighborhood when constructing the manifold. If I had chosen a much smaller value, the visualization would have focused almost entirely on local structure, producing tighter but potentially misleading clusters. A much larger value would have emphasized global relationships at the expense of local detail. Fifteen felt like a practical middle ground.

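The cosine distance metric was equally interesting. Instead of measuring absolute pixel differences, cosine distance compares the direction of two pixel vectors and ignores their overall magnitude, so it focuses on whether two images share a similar pattern. For handwritten digits, that matters. Two people can write the same digit with fainter or darker strokes, yet humans immediately recognize them as the same shape. Cosine distance captures that intuition surprisingly well for intensity: scaling every pixel by the same factor leaves the distance unchanged. Differences in stroke thickness still change the vector's direction, so they are not discounted in the same way.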
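A small sketch of what cosine distance ignores. The vector here is a random stand-in for a flattened image, and `cosine_distance` is a helper written for this illustration, not a function from the project.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
digit = rng.random(784)   # stand-in for a flattened 28x28 image
fainter = 0.5 * digit     # same pattern at half the intensity

cos = cosine_distance(digit, fainter)
euclid = np.linalg.norm(digit - fainter)
print(cos)     # essentially zero: the direction is unchanged
print(euclid)  # clearly non-zero: the raw pixel values differ
```

Under Euclidean distance the faint copy looks like a different image. Under cosine distance it is the same point.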

What stood out most in the visualization was that the clusters remained well-defined without feeling artificially separated. Unlike t-SNE, where some clusters appeared as isolated islands floating in space, UMAP preserved more of the broader organization of the dataset. Digits with similar visual characteristics often occupied nearby regions, and the overall arrangement felt more coherent.

The decision to use three dimensions also revealed something I would have missed in a standard 2D plot. Some groups that appeared partially overlapping in two dimensions unfolded more naturally when given an additional degree of freedom. The manifold had more room to express its structure, making the relationships between digits easier to interpret.

What I appreciated most about UMAP was that it felt less interested in creating the prettiest visualization and more interested in preserving a useful representation of the data. The clusters were slightly less dramatic than t-SNE, but they felt more trustworthy.

If PCA taught me that variance is not the same as separability, and t-SNE taught me to be cautious of beautiful plots, UMAP taught me that understanding data often requires balancing local detail with global structure. That balance is probably why UMAP has become the default visualization tool for many machine learning practitioners today.
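For reference, the settings described in this section map onto the `umap-learn` package roughly as follows. This is a sketch under the assumption that `umap-learn` is the library in use; the helper name `embed_with_umap` and the `random_state` value are mine, not taken from the project.

```python
# Settings from the text: 15 neighbours, cosine metric, 3 output dimensions.
UMAP_PARAMS = {"n_neighbors": 15, "metric": "cosine", "n_components": 3}

def embed_with_umap(X, random_state=42):
    """Return a 3D UMAP embedding of X using the settings above.

    Hypothetical helper: assumes the third-party umap-learn package.
    """
    # Imported lazily so this file can be loaded without umap-learn installed.
    import umap

    reducer = umap.UMAP(random_state=random_state, **UMAP_PARAMS)
    return reducer.fit_transform(X)
```

Calling `embed_with_umap(X)` on the flattened images would return an array with one 3D coordinate per digit. Changing `n_neighbors` is the quickest way to see the local-versus-global trade-off for yourself.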

The Surprisingly Interesting Sammon Mapping

Before this project, I had barely encountered Sammon Mapping.

Compared to PCA or t-SNE, it's rarely discussed.

After experimenting with it, I think that's unfortunate.

Sammon Mapping attempts to preserve pairwise distances during projection, with errors on small original distances weighted more heavily than errors on large ones.

In other words, it's trying to stay faithful to the geometry of the original space.

The trade-off becomes obvious immediately.

It's computationally expensive.

The visualizations aren't as dramatic.

The clusters don't explode apart like they do with t-SNE.

But that's exactly the point.

Sammon Mapping is optimizing for honesty rather than visual appeal.

That made it one of the most interesting methods in the project.
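The quantity Sammon Mapping minimizes is usually called Sammon's stress. The sketch below only evaluates that stress for a given embedding; it does not perform the optimization, and the random data is a stand-in rather than MNIST.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

def sammon_stress(X_high, X_low, eps=1e-12):
    """Sammon's stress: distance errors, each divided by the original distance."""
    iu = np.triu_indices(len(X_high), k=1)  # count each pair once
    d_high = pairwise_distances(X_high)[iu]
    d_low = pairwise_distances(X_low)[iu]
    return np.sum((d_high - d_low) ** 2 / (d_high + eps)) / np.sum(d_high)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

print(sammon_stress(X, X))         # a perfect copy has zero stress
print(sammon_stress(X, X[:, :2]))  # dropping 8 of 10 coordinates distorts distances
```

The pairwise distance matrix grows quadratically with the number of samples, which is where the computational cost comes from.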

When the KNN Graph Changed My Perspective

Most dimensionality reduction techniques represent data as points.

KNN Graphs represent data as relationships.

That sounds like a small difference.

It isn't.

Instead of asking:

Which cluster does this point belong to?

I found myself asking:

Which points connect different regions of the dataset?

The graph exposed bridge points, ambiguous digits, and unusual samples that were much harder to notice in traditional scatter plots.

It shifted my focus away from clusters and toward connectivity.

For exploratory analysis, that perspective can be incredibly valuable.
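A rough sketch of how bridge points can be surfaced from a KNN graph. The data is a synthetic toy set (two mirrored blobs plus one point placed between them), and the "fraction of neighbors with a different label" score is one simple heuristic I am assuming for illustration, not necessarily what the project uses.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbours of every row, excluding itself."""
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def cross_label_fraction(neighbors, y):
    """For each point, the fraction of its neighbours with a different label."""
    return (y[neighbors] != y[:, None]).mean(axis=1)

# Toy data: blob a around (-5, 0), its mirror image b around (+5, 0),
# and one extra point at the origin that sits exactly between them.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.5, size=(50, 2)) + np.array([-5.0, 0.0])
b = a * np.array([-1.0, 1.0])
X = np.vstack([a, b, [[0.0, 0.0]]])
y = np.array([0] * 50 + [1] * 50 + [0])

neighbors = knn_indices(X, k=10)
frac = cross_label_fraction(neighbors, y)
print(frac[:100].max())  # points inside a blob only link to their own class
print(frac[-1])          # the point at the origin links to both blobs
```

On real digits, sorting samples by this fraction is a quick way to pull out the ambiguous ones for inspection.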

What Autoencoders Revealed

The Autoencoder was where the project started feeling less like classical machine learning and more like modern AI.

Unlike PCA or t-SNE, the Autoencoder isn't applying a predefined projection rule.

It's learning a representation.

The network compresses the input into a latent space and then attempts to reconstruct the original image.

To succeed, it must learn which information matters.

The resulting latent space felt fundamentally different from the classical methods.

It wasn't simply a compressed version of the original data.

It was a learned representation of the data.

The structure felt smooth.

Continuous.

Almost as though the digits existed on an underlying manifold rather than as isolated clusters.

For the first time, I could see why latent representations became such an important idea in deep learning.
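To show the encode, decode, and reconstruct loop without a deep learning framework, here is a deliberately tiny autoencoder in plain NumPy. It is a sketch, not the architecture from the project: one tanh layer down to a 2D latent space, a linear decoder, full-batch gradient descent, and synthetic 16-dimensional data that secretly depends on two underlying factors.

```python
import numpy as np

def train_autoencoder(X, latent_dim=2, lr=0.02, epochs=1000, seed=0):
    """Train a one-hidden-layer autoencoder with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, latent_dim))  # encoder weights
    b1 = np.zeros(latent_dim)
    W2 = rng.normal(0.0, 0.1, size=(latent_dim, d))  # decoder weights
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W1 + b1)   # encode into the latent space
        X_hat = Z @ W2 + b2        # decode back to the input space
        err = X_hat - X
        losses.append(float(np.mean(err**2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ W2.T) * (1.0 - Z**2)  # through the tanh
        W2 -= lr * (Z.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_pre)
        b1 -= lr * g_pre.sum(axis=0)

    def encode(X_new):
        return np.tanh(X_new @ W1 + b1)

    return encode, losses

# Synthetic data: 16 observed features generated from 2 hidden factors.
rng = np.random.default_rng(1)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(0.0, 0.5, size=(2, 16))
X = np.tanh(factors @ mixing)

encode, losses = train_autoencoder(X)
latent = encode(X)
print(latent.shape)           # (200, 2)
print(losses[0], losses[-1])  # reconstruction error before and after training
```

Nothing tells the network what the two hidden factors are. It only gets rewarded for reconstructing the input, and a useful 2D representation is what it has to learn to do that.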

The Lesson I Didn't Expect

I started this journey because PCA didn't improve a model.

I finished it with a completely different appreciation for exploratory data analysis.

The most important insight wasn't that one dimensionality reduction technique is better than another.

It was that every technique reveals a different aspect of the data.

PCA reveals variance.

LDA reveals separability.

t-SNE reveals neighborhoods.

UMAP balances local and global structure.

Sammon Mapping reveals geometry.

KNN Graphs reveal connectivity.

Autoencoders reveal learned representations.

The algorithms weren't competing.

They were answering different questions.

And that's probably the biggest lesson I took away from the project.

Before spending weeks tuning hyperparameters or experimenting with new models, it's worth asking a simpler question:

Do I actually understand the shape of my data?

Sometimes the fastest way to improve a model isn't another optimization trick.

It's developing a better intuition for the space your data lives in.

Explore the Project

I open-sourced the entire project so anyone can experiment with the visualizations themselves.

GitHub:
https://github.com/siva-rgb/Dim_Reduction

If you run it, I'd encourage you to spend less time looking for the "best" dimensionality reduction technique and more time asking:

What is each technique trying to tell me about the data?

That's where the interesting insights usually begin.
