DEV Community: Sivananda Panda

The Evaluation Layer: The Part of Your LLM System You Keep Skipping

Sivananda Panda — Thu, 02 Jul 2026 11:43:52 +0000

I've built two agentic AI systems over the past few months, and despite solving very different problems, both exposed the same weakness. The agents worked perfectly during demos. They passed my manual tests. They looked production-ready. But once real users started interacting with them, confidently wrong responses began slipping through. The root cause wasn't the model, the prompts, or the tools. It was much simpler: there was no proper evaluation layer. The system could tell whether it had produced an answer, but it had no way to determine whether that answer was actually good.

This article is about the layer that closes that gap. What it is, why LLM systems specifically can't survive without it, the five ways people actually build it, the tooling, and the mistakes that make an eval layer worse than useless — because a miscalibrated eval gives you false confidence, which is more dangerous than no eval at all.

I'll use a RAG-and-agent stack for the concrete examples (LangGraph, Claude, LangSmith), but nothing here is framework-specific. The principles move to any stack you like.

Why LLM systems need this and normal software doesn't

Here's the uncomfortable property that breaks every habit you brought from traditional engineering: the same input can produce different outputs, and a wrong output usually looks exactly like a right one.

In a normal codebase, add(2, 2) returns 4 or it returns a bug you can see. The failure has a shape — a stack trace, a null, a red test. You write an assertion, it passes forever, you move on. LLM failures don't have that shape. A hallucinated citation, a subtly-off summary, a tool call with the wrong argument that still "sort of" works — these render as fluent, plausible, professional-looking text. The failure is camouflaged inside a success.

Three things follow from that, and they're the whole reason the eval layer exists:

Non-determinism means one passing run proves nothing. You need distributions, not point checks.
The interesting failures live in the middle. An agent that made 12 tool calls and 4 reasoning hops can reach a correct answer through completely broken logic — and reach a wrong one next time from the same code.
Nobody sees the middle by default. The end user gets the final answer. The reasoning, the retrieval, the tool arguments — all invisible unless you deliberately capture them.
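
To make "distributions, not point checks" concrete, here is a minimal sketch that runs the same query many times and reports a pass rate. The `flaky_agent` function is a made-up stand-in for a real non-deterministic agent, and the 80% success rate is an assumption chosen only for illustration.

```python
import random
from typing import Callable

def pass_rate(run: Callable[[str], str], check: Callable[[str], bool],
              query: str, trials: int = 20) -> float:
    """Run the same query `trials` times and return the fraction that pass."""
    passes = sum(check(run(query)) for _ in range(trials))
    return passes / trials

# Hypothetical stand-in for a non-deterministic agent: right roughly 80% of the time.
_rng = random.Random(0)

def flaky_agent(query: str) -> str:
    return "4" if _rng.random() < 0.8 else "5"

rate = pass_rate(flaky_agent, lambda out: out == "4", "What is 2 + 2?", trials=200)
print(f"pass rate over 200 runs: {rate:.2f}")
```

A single run of this agent would pass most of the time and tell you nothing about the one-in-five failures. The rate is the number worth tracking.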

So the eval layer isn't a testing afterthought. It's the observability and quality infrastructure that lets you answer three questions on a continuous basis: is the system doing the right thing (correctness), doing it efficiently (cost and latency), and doing it safely (no leaked secrets, no disallowed tools, no confident nonsense). Skip it and you're not shipping a product; you're running an ongoing experiment on your users without reading the results.

What the layer actually is

It's tempting to picture "the eval layer" as one service you bolt on at the end. It isn't. It's a cross-cutting concern that taps into every stage of execution and scores the trace, not just the output.

Three moving parts. Tracers capture what happened at every step — inputs, outputs, tool arguments, retrieved chunks, latencies, token counts. Scorers turn those captured artifacts into numbers — a faithfulness score, a latency measurement, a pass/fail on a schema. Evals are the curated sets and thresholds that give those numbers meaning: is 0.81 faithfulness good, and is it better or worse than last week? Lose any one of the three and the other two stop being useful.

Five ways to build it

These aren't competing options where you pick one. A mature system runs all five, at different frequencies. But you'll add them in roughly this order, cheapest and most objective first.

1. Deterministic evals — start here, not with an LLM

The instinct is to reach for an LLM judge immediately because the outputs are "fuzzy." Resist it. A surprising amount of what you care about is not fuzzy at all, and plain code checks it faster, cheaper, and without any of the reliability problems a judge brings.

Did the agent call only allowed tools? Did the structured output match the schema? Did it stay under the tool-call budget? Did the JSON parse? These have crisp right answers, and a regular function is the correct instrument.

MAX_TOOL_CALLS = 10  # example budget; tune per agent

def eval_tool_call_validity(trace: AgentTrace) -> EvalResult:
    """Every tool call must use an allowed tool and respect the budget."""
    # AgentTrace and EvalResult are the project's own types, defined elsewhere.
    allowed_tools = {"search", "calculator", "fetch_document"}
    violations = []
    tool_calls = 0

    for step in trace.steps:
        if step.type != "tool_call":
            continue
        tool_calls += 1
        if step.tool_name not in allowed_tools:
            violations.append(f"Disallowed tool: {step.tool_name}")

    # Check the budget once per trace rather than once per step.
    if tool_calls > MAX_TOOL_CALLS:
        violations.append(f"Exceeded call budget: {tool_calls} > {MAX_TOOL_CALLS}")

    return EvalResult(
        passed=not violations,
        score=1.0 if not violations else 0.0,
        details=violations,
    )

These cost microseconds and never disagree with themselves. Run them on every single trace in production, not just in CI. They're your smoke detectors.
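
The schema and JSON-parse questions can be checked the same way. This is a sketch under assumptions: the `EvalResult` dataclass is my guess at a minimal shape, and the `REQUIRED_FIELDS` contract is hypothetical, so swap in whatever your agent actually promises to return.

```python
import json
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    passed: bool
    score: float
    details: list[str] = field(default_factory=list)

# Hypothetical output contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list, "confidence": float}

def eval_structured_output(raw: str) -> EvalResult:
    """The output must parse as a JSON object and carry every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return EvalResult(False, 0.0, [f"Invalid JSON: {exc.msg}"])
    if not isinstance(data, dict):
        return EvalResult(False, 0.0, ["Top-level value is not an object"])

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"Missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"Wrong type for {name}: {type(data[name]).__name__}")
    return EvalResult(not problems, 0.0 if problems else 1.0, problems)

print(eval_structured_output('{"answer": "42"}').details)
```

For anything beyond a flat contract like this, a real schema validator is the better tool. The point is only that no model needs to be in the loop.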

2. LLM-as-judge — powerful, and the thing most likely to lie to you

For the genuinely subjective stuff — tone, helpfulness, whether an answer is faithful to its source — you hand the output to a separate model with a rubric and collect structured scores.

from anthropic import Anthropic
import json

client = Anthropic()

JUDGE_PROMPT = """You are evaluating an AI agent's response.

User query: {query}
Agent response: {response}
Retrieved context: {context}

Score each dimension 1-5, and cite the specific text that justifies your score.
- Faithfulness: Is every claim grounded in the provided context?
- Relevance:    Does it address what the user actually asked?
- Completeness: Does it fully answer the question?

Respond ONLY as JSON:
{{"faithfulness": N, "relevance": N, "completeness": N, "evidence": "...", "reasoning": "..."}}"""

def llm_judge(query: str, response: str, context: str) -> dict:
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, response=response, context=context),
        }],
    )
    return json.loads(result.content[0].text)

Notice I made the judge cite evidence and reason, not just emit a number. That's not decoration. A bare score is unfalsifiable; a score with quoted evidence can be audited when you disagree with it.
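
A judge reply is itself untrusted output, so it is worth validating before a score lands anywhere. The sketch below is one way to do that, assuming the three dimensions and the 1-5 scale from the prompt above; `parse_judge_output` is a hypothetical helper, not part of any SDK.

```python
import json

DIMENSIONS = ("faithfulness", "relevance", "completeness")

def parse_judge_output(text: str) -> dict:
    """Parse the judge's reply and reject anything that breaks the rubric."""
    # Models sometimes wrap JSON in prose or code fences; keep only the outermost object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in judge output")
    scores = json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError

    for dim in DIMENSIONS:
        value = scores.get(dim)
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
    if not str(scores.get("evidence", "")).strip():
        raise ValueError("Judge returned a score without quoted evidence")
    return scores

reply = 'Sure: {"faithfulness": 4, "relevance": 5, "completeness": 3, "evidence": "as stated in section 2", "reasoning": "grounded"}'
print(parse_judge_output(reply)["faithfulness"])
```

Rejecting a reply with empty evidence enforces the auditability argument in code: a number with nothing behind it never reaches the dashboard.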

Here's the part teams underrate: LLM judges have the same weaknesses as the model being judged. They reward length. They're swayed by confident phrasing. They wave through fluent errors — the exact failure mode you built the judge to catch. An uncalibrated judge doesn't measure quality, it launders your existing biases into an official-looking number. So the judge is not the end of the work; it's a component that itself needs validating against human labels before you trust a word it says. More on that in the mistakes section, because it's the one people get wrong most often.

3. Trace-level evaluation — stop grading the black box

Grading only the final answer is like reviewing a math test by checking the last number and ignoring the working. The number can be right for the wrong reasons and wrong for the same reasons next time.

The fix is to instrument every step and evaluate the whole trace. Observability platforms — LangSmith, Langfuse, Arize Phoenix, W&B Weave — exist for exactly this. You wrap your nodes and they capture the tree of execution:

from langsmith import traceable, Client

ls = Client()

@traceable(run_type="chain", name="research_agent")
def run_agent(query: str) -> str:
    # Each call appears as a child run only if it is itself traced
    # (a @traceable function or a LangChain runnable).
    plan   = planner.plan(query)
    docs   = retriever.fetch(plan.query)
    answer = generator.generate(docs)
    return answer

def score_run(run_id: str, scores: dict):
    ls.create_feedback(
        run_id=run_id,
        key="faithfulness",
        score=scores["faithfulness"],
        comment=scores["reasoning"],
    )

The payoff is diagnostic, not just descriptive. When faithfulness tanks, the trace tells you where: the retriever pulled garbage, or the retriever was fine and the generator ignored it. Those are two completely different bugs with two completely different fixes, and output-only evals can't tell them apart.
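
As a rough sketch of that diagnosis, assume each trace already carries a retrieval score and a generation score. The `diagnose` function and the 0.7 cutoff are illustrative assumptions, not values from any platform.

```python
def diagnose(context_recall: float, faithfulness: float, threshold: float = 0.7) -> str:
    """Point at the stage most likely responsible for a bad answer."""
    if context_recall < threshold:
        # The generator never saw the evidence, so its own score is not informative yet.
        return "retriever"
    if faithfulness < threshold:
        # Retrieval was fine; the generator ignored or distorted what it was given.
        return "generator"
    return "ok"

print(diagnose(context_recall=0.35, faithfulness=0.40))  # retriever
print(diagnose(context_recall=0.92, faithfulness=0.40))  # generator
```

Both example traces have the same low faithfulness score, and an output-only eval would file them as the same bug.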

4. Human-in-the-loop — the ground truth everything else calibrates against

Automated evals are necessary and insufficient. Humans catch what code and judges structurally cannot: domain inaccuracies a generalist judge waves through, tone that's wrong for your specific users, edge cases that are technically correct and practically useless. And critically — human labels are the yardstick you use to check whether your LLM judge is any good. Without them, every other layer is measuring against itself.

Four things separate a useful human-eval pipeline from a box-ticking one:

Sample successes, not just failures. "The agent returned an answer" is not evidence the answer was good. Review 5–10% of all production traces weekly, including the ones nobody complained about.
Stratify the sample. Cover query types, user segments, and tool paths. Review only the easy queries and you'll build a rosy, useless picture.
Write annotation rubrics that force a decision. "Was this response good?" produces noise. "Did the response answer the user's stated question without inventing information they didn't ask for?" produces labels you can act on.
Measure agreement between annotators. Run the same trace past two reviewers now and then. If they disagree more than one time in five, the problem is your rubric, not your reviewers — sharpen it before you collect more labels.
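
Raw agreement is the number quoted above. If you want a measure that also corrects for agreement by chance, Cohen's kappa is a common choice. This is a minimal sketch for two annotators, and the label lists are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same label by chance.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels from two reviewers on the same five traces.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # 0.615
```

Here the reviewers agree on four traces out of five (80%), but kappa comes out lower because some of that agreement would happen by chance with only two labels.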

5. Regression and red-teaming — so tomorrow's fix doesn't quietly break today's feature

Every prompt tweak is a potential regression, and LLM regressions are invisible without a locked baseline to compare against. A regression suite is a curated set of (input, expected) pairs you run on every deploy. Red-teaming is its adversarial twin: inputs deliberately built to break things — prompt injection, context-stuffing, multi-hop manipulation, tool misuse. In agentic systems these are nastier than in a single call, because one poisoned step can cascade through every step that follows it.

evals/
  regression/
    core_qa.jsonl          # 50 representative Q&A pairs
    tool_use.jsonl         # 30 traces exercising tool-call patterns
    edge_cases.jsonl       # 20 known-difficult inputs
    adversarial.jsonl      # 15 red-team prompts
  run_suite.py             # runner that diffs scores against baseline
  baseline_scores.json     # locked scores from the last known-good release

Run it in CI before every deploy. Alert when any category drops more than ~5% from baseline. The exact number matters less than the fact that there is one and something screams when you cross it.
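
The comparison inside a runner like `run_suite.py` can be very small. This sketch assumes per-category scores stored as plain dicts and reads "~5%" as a relative drop; both are my assumptions, and the numbers are invented.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     max_drop: float = 0.05) -> dict[str, float]:
    """Return categories whose score fell by more than `max_drop` (relative) from baseline."""
    regressions = {}
    for category, base in baseline.items():
        score = current.get(category, 0.0)  # a missing category counts as a full regression
        drop = (base - score) / base if base > 0 else 0.0
        if drop > max_drop:
            regressions[category] = round(drop, 4)
    return regressions

# Invented scores for illustration.
baseline = {"core_qa": 0.90, "tool_use": 0.80, "adversarial": 0.60}
current = {"core_qa": 0.89, "tool_use": 0.72, "adversarial": 0.61}
regressed = find_regressions(baseline, current)
print(regressed)  # only tool_use dropped by more than 5%
```

In CI, exit non-zero whenever the returned dict is non-empty so the deploy stops.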

Six mistakes that turn an eval layer into a liability

An eval layer isn't automatically good. A bad one is worse than none, because "our scores are green" is a sentence that stops people from looking closer. These are the six failure modes I see most.

Grading only the final output. Covered above, but it's the number-one blind spot so it earns repeating: a correct answer reached through a broken trace is a landmine, not a pass. Evaluate every node, not just the terminal one.

Collapsing everything into one score. A composite of 0.72 tells you something is wrong and nothing about what. Is it faithfulness? Latency? Tool selection? Keep the dimensions separate, track them separately, set thresholds separately. You can always average later; you can't un-average.

Evaluating on your training distribution. If your eval set is made from the same queries you used to build and tune the system, you've optimized to the eval, not to reality — Goodhart's Law wearing a lab coat. Hold out a blind set that never touches development, and keep topping it up with real production samples. Treat eval-set contamination as seriously as data leakage.

Trusting an uncalibrated judge. The single most common self-inflicted wound. Before an LLM judge scores anything that matters, run it against 100+ human-labeled examples and compute how often it agrees with humans. Disagreement above ~15%? Fix the judge prompt before it goes anywhere near your dashboard. A judge you haven't validated is a confidence machine, not a measurement.
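
A calibration check can start as simply as this. The sketch assumes human and judge scores on the same 1-5 scale for the same examples; the ten scores below are made up, and a real check would use the 100+ labels mentioned above.

```python
def judge_agreement(human: list[int], judge: list[int], tolerance: int = 0) -> float:
    """Fraction of examples where the judge lands within `tolerance` points of the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return hits / len(human)

# Made-up scores for illustration only.
human_scores = [5, 4, 2, 1, 3, 4, 5, 2, 3, 4]
judge_scores = [5, 4, 3, 1, 3, 5, 5, 2, 3, 2]

exact = judge_agreement(human_scores, judge_scores)
within_one = judge_agreement(human_scores, judge_scores, tolerance=1)
print(f"exact: {exact:.0%}, within one point: {within_one:.0%}")  # exact: 70%, within one point: 90%
```

Which of the two numbers you hold to the ~15% line is a judgment call. Whichever you pick, read the disagreements themselves, because they usually show what the judge prompt is missing.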

No baseline, no history. Scores you don't persist can't answer "better or worse than last week?" Every run should write to a durable store with a timestamp, run ID, and git SHA. A SQLite table is enough — the discipline of storing beats the sophistication of the storage.
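
Here is roughly what "a SQLite table is enough" looks like with the standard library. The table and column names are my own choices, and the run IDs, SHAs, and scores are placeholders.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_scores (
    run_id     TEXT NOT NULL,
    git_sha    TEXT NOT NULL,
    metric     TEXT NOT NULL,
    score      REAL NOT NULL,
    created_at TEXT NOT NULL,
    PRIMARY KEY (run_id, metric)
)
"""

def record_scores(conn: sqlite3.Connection, run_id: str, git_sha: str,
                  scores: dict[str, float]) -> None:
    """Persist one row per metric for a single eval run."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO eval_scores VALUES (?, ?, ?, ?, ?)",
            [(run_id, git_sha, metric, score, now) for metric, score in scores.items()],
        )

def metric_history(conn: sqlite3.Connection, metric: str) -> list[tuple[str, float]]:
    """Return (git_sha, score) pairs for one metric, oldest first."""
    rows = conn.execute(
        "SELECT git_sha, score FROM eval_scores WHERE metric = ? ORDER BY created_at, rowid",
        (metric,),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute(SCHEMA)
record_scores(conn, "run-001", "abc1234", {"faithfulness": 0.81, "relevance": 0.90})
record_scores(conn, "run-002", "def5678", {"faithfulness": 0.84, "relevance": 0.88})
print(metric_history(conn, "faithfulness"))
```

With the git SHA on every row, "better or worse than last week?" becomes a query instead of a guess.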

Treating evals as a one-time setup. Eval suites rot. User behavior drifts, new failure modes appear, and a suite built in month one is riddled with blind spots by month six. Add a case from every production failure within a couple of days of finding it, and audit the whole suite quarterly. It's a living artifact, not a shipped deliverable.

Setting it up: a practical walkthrough

Opinionated defaults for a LangGraph production agent. Swap the tools freely; the sequence is the point.

Step 1 — Define "good" in prose before you write any eval code. For each capability, answer on paper: what does a correct output look like, what are the common failure modes, what constraints must always hold, and what separates "barely acceptable" from "excellent" in a domain expert's eyes? This costs half a day. Skip it and you'll spend weeks precisely measuring the wrong things.

Step 2 — Instrument the graph. Wrap every node. Capture node name, input, output, latency, token counts, and any tool calls with arguments and returns.

from langsmith import traceable

@traceable(run_type="llm", name="planner_node")
def planner_node(state: AgentState) -> AgentState:
    response = llm.invoke(state["messages"])
    return {"plan": response.content, **state}

@traceable(run_type="tool", name="retriever_node")
def retriever_node(state: AgentState) -> AgentState:
    docs = retriever.invoke(state["plan"])
    return {"context": docs, **state}

Step 3 — Build the suite in three tiers, by frequency.

Tier 1 — every trace, real time: schema validation, tool-call constraints, latency thresholds. Cheap deterministic code.
Tier 2 — daily or per release: LLM-judge scores on a sample, retrieval metrics (precision@k, NDCG), end-to-end correctness on the regression suite.
Tier 3 — weekly or per major release: stratified human review, red-team annotation, edge-case triage.
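
For the stratified human review in Tier 3, a sketch of the sampling step follows. It assumes each trace is a dict with a `query_type` field, which is a hypothetical name; use whatever segment key your traces carry.

```python
import random
from collections import defaultdict

def stratified_sample(traces: list[dict], key: str, fraction: float,
                      seed: int = 0) -> list[dict]:
    """Sample `fraction` of traces from each stratum, taking at least one per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        strata[trace[key]].append(trace)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# 90 common queries and 10 rare ones: a plain random 10% could miss the rare type entirely.
traces = [{"id": i, "query_type": "faq"} for i in range(90)]
traces += [{"id": 90 + i, "query_type": "refund"} for i in range(10)]
sample = stratified_sample(traces, key="query_type", fraction=0.1)
print(len(sample))  # 10
```

The at-least-one rule matters most for rare tool paths, which are exactly the ones a uniform sample tends to skip.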

Step 4 — Set thresholds before production, not during the incident. For every metric, define a green zone (normal), a yellow zone (investigate, don't block), and a red zone (block the deploy or page someone). "We'll know bad when we see it" is the sentence people say right before a bad week.
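
One way to write the zones down so that code, not memory, enforces them. The metric names and limits here are hypothetical; pick yours from baseline data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    yellow: float                  # crossing this means "investigate"
    red: float                     # crossing this means "block or page"
    higher_is_better: bool = True

    def zone(self, value: float) -> str:
        # Flip the sign for lower-is-better metrics so one comparison serves both.
        sign = 1 if self.higher_is_better else -1
        if sign * value < sign * self.red:
            return "red"
        if sign * value < sign * self.yellow:
            return "yellow"
        return "green"

# Hypothetical limits for illustration.
THRESHOLDS = {
    "faithfulness": Threshold(yellow=0.80, red=0.70),
    "latency_p95_s": Threshold(yellow=3.0, red=6.0, higher_is_better=False),
}

def classify(metrics: dict[str, float]) -> dict[str, str]:
    """Map each metric value to its zone."""
    return {name: THRESHOLDS[name].zone(value) for name, value in metrics.items()}

print(classify({"faithfulness": 0.75, "latency_p95_s": 2.0}))
```

Keeping the limits in a versioned file means a threshold change shows up in code review like any other change.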

Step 5 — Wire it into CI/CD. An eval that only runs when someone remembers doesn't run.

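# .github/workflows/eval.yml
name: Eval Suite
on: [push]
jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies  # assumes a requirements.txt at the repo root
        run: pip install -r requirements.txt
      - name: Run regression suite
        run: python evals/run_suite.py --compare-to evals/baseline_scores.json
      - name: Check score thresholds
        run: python evals/check_thresholds.py --fail-on-regression 5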

What to measure and what to keep

Split metrics into three families and never blend them into one number.

Correctness

Metric	What it measures	How to compute
Faithfulness	Is every claim grounded in retrieved context, with nothing hallucinated?	LLM judge or NLI model comparing response to context
Answer relevance	Does the response address the actual question?	LLM judge or embedding similarity between query and response
Context precision	Of the chunks retrieved, what fraction were useful?	Human or LLM label per chunk
Context recall	Did retrieval surface everything needed to answer?	Compare retrieved set against a gold document set
Tool-call accuracy	Right tools, right arguments?	Deterministic diff against an expected tool trace
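
The two retrieval rows reduce to set arithmetic once you have labels. This sketch assumes chunks are identified by string IDs and, for simplicity, reuses one gold set for both metrics; in practice the precision labels may come from a per-chunk human or LLM judgment instead.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Of the chunks retrieved, what fraction were useful?"""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], gold: set[str]) -> float:
    """Of the chunks needed to answer, what fraction did retrieval surface?"""
    if not gold:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & gold) / len(gold)

# Hypothetical chunk IDs.
retrieved = ["d1", "d4", "d7", "d9"]
gold = {"d1", "d2", "d7"}
print(context_precision(retrieved, gold))         # 0.5
print(round(context_recall(retrieved, gold), 3))  # 0.667
```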

Efficiency

Metric	What it measures	Target or how to track
Latency (p50/p95/p99)	User-perceived speed	Track trends; set SLOs per use case
Token consumption	Cost per query	Input + output tokens per run
Tool-call count	Wasted calls	Compare to the minimum viable count
Retry rate	How often steps fail and rerun	Under ~5% in steady state
Context-window utilization	How full the window runs	High → truncation risk
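
Latency percentiles need nothing beyond the standard library. A small sketch, using `statistics.quantiles` with synthetic latencies so the result is easy to check by hand:

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50, p95 and p99 from a list of per-request latencies."""
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic data: 1 ms through 101 ms, one request each.
latencies = [float(ms) for ms in range(1, 102)]
print(latency_percentiles(latencies))  # {'p50': 51.0, 'p95': 96.0, 'p99': 100.0}
```

Track p95 and p99 alongside the median, because a healthy p50 can hide a slow tail that a meaningful share of users still hits.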

Safety and reliability

Metric	What it measures
Hallucination rate	% of responses with claims unsupported by context
Refusal rate	% of valid queries wrongly refused
Task-completion rate	% of queries reaching a terminal answer
Error rate by type	Tool failures, timeouts, parse errors — broken out, not summed
Constraint-violation rate	% of runs breaking a defined rule (e.g. a disallowed tool)

And keep the artifacts, because a score with no trace behind it is impossible to debug. For every run, preserve: the full execution trace (at least 30 days of production), per-dimension scores linked to run ID and git SHA, the judge's reasoning (not just its number — this is gold when you contest a score), failure cases tagged by type, the versioned eval-suite definition, the current baseline snapshot, and human annotation logs with annotator ID and timestamp.

Picking your tooling

Tool	Best for	The catch
LangSmith	LangChain/LangGraph shops wanting tight integration	Vendor lock-in; price scales with trace volume
Langfuse	Open-source, self-hostable	More setup; smaller ecosystem
Arize Phoenix	Teams already on Arize for ML monitoring	Stronger on classic ML; newer for LLMs
W&B Weave	Teams already living in Weights & Biases	Natural fit if you also fine-tune
RAGAS	RAG metrics out of the box	Narrow scope — mostly retrieval + generation
Custom (SQLite + an SDK)	Maximum control, minimal dependency	You own the build and the maintenance

My honest default for a LangGraph production system: LangSmith for tracing, RAGAS for RAG-specific metrics, and a small custom Python runner for the deterministic checks. Add human-eval tooling — even a scrappy Streamlit annotation app — once the system is past its first real users. Don't buy the enterprise platform on day one; you don't yet know what you're measuring.

The actual mindset shift

The thing most teams get wrong isn't a tool choice. It's timing. They treat evaluation as a phase that comes after building, and by then the design decisions that would have made the system measurable are already baked in.

Flip it. Evaluation is a lens you hold up while building. When you write a new node, the first question isn't "does this code run?" — it's "how will I know if this node is doing the right thing next Tuesday, in production, on a query I haven't seen?" When you tune a prompt, you don't eyeball three examples and ship on a good feeling; you run the suite and read the diff.

That's the whole difference between a demo and a system you can put your name on. Without an eval layer you're steering on vibes, and vibes don't survive contact with real traffic. With one, every decision has evidence under it. Build it early, treat it as first-class engineering rather than QA cleanup, and never push a change to production without knowing what your scores say about it.

Checklist

Before first deploy

[ ] Eval criteria written down for each capability
[ ] Every node instrumented with tracing
[ ] Tier-1 unit evals live (schema, constraints, latency)
[ ] Regression suite built (50+ curated examples)
[ ] Green/yellow/red thresholds set for every metric
[ ] Regression suite wired into CI/CD
[ ] Baseline scores locked

Weekly

[ ] Human review of a random 5–10% production sample
[ ] Post-mortem on every red-zone incident
[ ] New failure cases folded into the suite

Quarterly

[ ] Suite audit — cut stale cases, close coverage gaps
[ ] Re-calibrate LLM judges against fresh human labels
[ ] Revisit thresholds — still the right lines?
[ ] Run a red-team exercise and act on what breaks

I Thought Dimensionality Reduction Belonged to Classical ML. Then It Changed How I Think About AI.

Sivananda Panda — Tue, 23 Jun 2026 06:48:57 +0000

In my previous article, I explored several dimensionality reduction techniques, including PCA, t-SNE, UMAP, LDA, Sammon Mapping, KNN Graphs, and Autoencoders.

Going into that project, my goal was fairly simple.

I wanted to understand how these algorithms worked, where they were useful, and whether they could improve model performance.

Like many people learning machine learning, I saw dimensionality reduction as a classical ML topic.

Something you learn alongside feature engineering and data preprocessing.

Useful knowledge, but not something I expected to connect to modern AI systems.

I was wrong.

Not because PCA powers Large Language Models.

Not because dimensionality reduction is secretly the most important topic in machine learning.

But because the project forced me to think about a question I hadn't considered before.

The Question I Didn't Expect

While comparing different algorithms, I noticed something strange.

The same dataset could look completely different depending on the technique I used.

PCA produced one view.

t-SNE produced another.

UMAP showed patterns that weren't obvious in either of them.

Autoencoders created their own representation altogether.

At first, I was focused on which visualization looked better.

Then I started asking a different question:

If all of these algorithms are looking at exactly the same data, what are they actually trying to preserve?

That question turned out to be far more interesting than finding the "best" dimensionality reduction algorithm.

PCA Isn't Really About Reducing Dimensions

Most tutorials introduce PCA as a way to reduce features.

For example, a dataset with 100 features might be compressed into 10 principal components while retaining most of the variance.

That explanation is correct.

But while experimenting with PCA, I realized I was focusing on the wrong part of the process.

The interesting part wasn't that 100 dimensions became 10.

The interesting part was that the transformed data still retained much of the structure of the original dataset.

Some information was discarded.

Some information was preserved.

The algorithm had effectively made a decision about what mattered.

And that idea kept showing up across other dimensionality reduction techniques.

Each algorithm compressed the data differently because each algorithm had a different definition of what should be preserved.

The Moment Things Started Clicking

The more I explored these techniques, the less I thought about dimensions and the more I thought about representations.

PCA preserves variance.

t-SNE focuses heavily on local neighborhoods.

UMAP tries to keep local neighborhoods intact while retaining more of the global layout.

Autoencoders learn their own compressed representation from data.

Different algorithms.

Different mathematics.

Different outputs.

But they all seemed to revolve around the same challenge:

How do you transform information into a form that still captures the important patterns?

Once I started thinking that way, I began noticing the same idea outside dimensionality reduction.

Why This Matters Beyond Classical Machine Learning

One thing that often happens when you're learning machine learning is that topics get placed into separate mental boxes.

Classical ML.

Deep Learning.

LLMs.

Recommendation Systems.

Computer Vision.

NLP.

They can feel like completely different worlds.

But sometimes the same ideas appear in all of them.

Take image classification.

A neural network doesn't look at an image the same way throughout the entire model.

Early layers respond to simple patterns.

Edges.

Textures.

Basic shapes.

Deeper layers work with increasingly abstract representations.

By the time the model makes a prediction, it is operating on something very different from the original pixels.

The representation has changed multiple times.

The model has transformed the data into a form that makes the task easier.

That sounded surprisingly familiar.

Then I Started Looking at LLMs

The same thing happened when I started learning more about embeddings.

Consider the words "car" and "vehicle."

They are different words, yet most embedding models place them relatively close together in vector space.

The model isn't storing dictionary definitions.

It is learning a representation that captures part of the relationship between those words.

The exact mechanism is very different from PCA.

The mathematics is different.

The scale is different.

But the underlying idea felt familiar.

Once again, information was being transformed into a representation that preserved what mattered for the task.

That was the connection I hadn't expected when I started learning dimensionality reduction.

A Better Mental Model

Many beginners think about machine learning like this:

Data → Model → Prediction

After working through these dimensionality reduction techniques, I think a more useful mental model is:

Data → Representation → Model → Prediction

Because the way information is represented often determines what patterns a model can learn.

Two systems can work with the same underlying data and arrive at different outcomes simply because they represent that data differently.

That's true for PCA.

It's true for Autoencoders.

It's true for embeddings.

And it's true for many modern AI systems.

Final Thoughts

I started learning dimensionality reduction because I wanted to understand PCA, t-SNE, UMAP, and a few other algorithms.

What I didn't expect was that those techniques would change how I think about machine learning itself.

The biggest lesson wasn't about reducing dimensions.

It wasn't about preprocessing.

And it wasn't about improving model accuracy.

It was realizing that many AI systems, regardless of how different they appear on the surface, spend a significant amount of effort answering the same question:

How should information be represented so that useful patterns become easier to discover?

For me, dimensionality reduction was the first place where that idea became visible.

And once I noticed it, I started seeing it everywhere.

I Compressed 784 Dimensions Into 2. Here's What 70,000 Handwritten Digits Actually Look Like

Sivananda Panda — Mon, 22 Jun 2026 08:59:01 +0000

PCA Didn't Improve My Model. It Changed How I Think About Data Instead.

When I ran PCA on a dataset I was exploring, I expected a fairly straightforward outcome.

Reduce dimensionality.

Remove noise.

Train the model again.

Get better performance.

That's the story dimensionality reduction is often associated with.

The reality was much less exciting.

The accuracy barely moved.

At first, I treated PCA as a failed experiment.

Looking back, the failed experiment was actually my mental model.

I had been focused on improving the model without understanding something more fundamental:

What did the data actually look like?

That question eventually led me into one of the most interesting machine learning rabbit holes I've explored so far.

The Problem I Didn't Realize I Had

Like most practitioners, I started with exploratory data analysis.

I checked distributions.

Looked for missing values.

Analyzed correlations.

Built baseline models.

Reviewed performance metrics.

All useful activities.

But none of them answered a question that suddenly felt important:

If every sample is represented as hundreds of features, what shape does this data actually have?

Machine learning models operate in spaces that humans can't visualize.

A dataset with 500 features exists in a 500-dimensional space.

A dataset with 1000 features exists in a 1000-dimensional space.

We can calculate distances in those spaces.

We can train models in those spaces.

But we can't intuitively understand them.

And yet many of the questions we care about are geometric in nature.

Are classes naturally separable?

Are there meaningful clusters?

Are there outliers?

Do the samples lie on some hidden structure?

The information already exists in the data.

The challenge is making it visible.

Discovering a Better Playground

While reading about dimensionality reduction, I came across Christopher Olah's brilliant article on visualizing the MNIST dataset.

For anyone unfamiliar with it, MNIST contains handwritten digits from 0 to 9.

Each image is only 28×28 pixels.

That sounds tiny.

But once flattened, every image becomes a point in a 784-dimensional space.

Each handwritten digit is represented by 784 numbers.

Humans can't visualize 784 dimensions.

Dimensionality reduction algorithms can help us project that space into something we can see.

What fascinated me wasn't the mathematics.

It was the possibility that different algorithms might reveal different aspects of the same dataset.

So I downloaded MNIST and started experimenting.

What began as curiosity quickly turned into a project.

I implemented and compared:

PCA
LDA
t-SNE
UMAP
Sammon Mapping
KNN Graph Visualizations
Autoencoders

My expectation was simple.

Different algorithms would produce slightly different versions of the same visualization.

I was completely wrong.

The Most Interesting Question

After generating the first set of visualizations, I found myself staring at the outputs, wondering:

Which one of these is actually correct?

PCA showed one picture.

t-SNE showed another.

UMAP showed something different again.

The Autoencoder latent space looked completely different from everything else.

They couldn't all be right.

Except they were.

The mistake was assuming they were trying to answer the same question.

Each algorithm was optimizing for a different definition of "important structure."

Once I understood that, dimensionality reduction became much more interesting.

What PCA Actually Taught Me

PCA was the first method I explored because it's usually the default recommendation.

The intuition is elegant.

Find the directions where the data varies the most and project onto those directions.

Simple.

Fast.

Interpretable.

What surprised me was that PCA didn't produce the clean digit separation I expected.

Some digits remained heavily mixed together.

Initially, this felt disappointing.

Then I realized PCA had taught me something important.

Variance is not the same thing as class separation.

The largest source of variation in a dataset isn't necessarily the information that distinguishes one class from another.

A thick handwritten "2" and a thin handwritten "2" can contribute substantial variance while still belonging to the same class.

PCA isn't trying to separate classes.

It's trying to preserve variance.

That distinction seems obvious in hindsight, but seeing it visually made the lesson stick.

What LDA Taught Me About Labels

The jump from PCA to LDA was dramatic.

Suddenly, the classes looked much cleaner.

The clusters became easier to distinguish.

The reason wasn't that LDA is universally superior.

The reason is that LDA has access to information PCA never sees.

Labels.

PCA asks:

Where is the variance?

LDA asks:

How can I maximize separation between known classes?

Those are fundamentally different objectives.

Running both methods side by side highlighted something important about machine learning in general.

The information contained in labels is incredibly valuable.

Once an algorithm knows what you want to separate, it can optimize directly for that objective.

Without labels, it has to infer structure on its own.
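
A toy example makes the two objectives easy to compare. The data is made up: feature 0 plays the role of stroke thickness (lots of variance, no class information) and feature 1 the role of shape (little variance, all of the class information). Real PCA and LDA search over every direction, while this sketch only scores the two axes, so treat it as an illustration of the objectives and not an implementation of either method.

```python
from statistics import mean, pvariance

# Made-up 2-feature points for two classes.
class_a = [(-4.0, 0.9), (-2.0, 1.1), (0.0, 1.0), (2.0, 0.9), (4.0, 1.1)]
class_b = [(-4.0, -1.1), (-2.0, -0.9), (0.0, -1.0), (2.0, -1.1), (4.0, -0.9)]

def total_variance(axis: int) -> float:
    """PCA-style question: how much does the data vary along this axis, labels ignored?"""
    return pvariance([point[axis] for point in class_a + class_b])

def fisher_ratio(axis: int) -> float:
    """LDA-style question: between-class separation relative to within-class spread."""
    a = [point[axis] for point in class_a]
    b = [point[axis] for point in class_b]
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))

for axis in (0, 1):
    print(f"axis {axis}: variance={total_variance(axis):.3f}, fisher={fisher_ratio(axis):.1f}")
```

The variance criterion prefers axis 0 and the label-aware criterion prefers axis 1, which is the same disagreement the PCA and LDA plots showed.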

What t-SNE Taught Me About Beautiful Visualizations

The first visualization that made me stop and stare was t-SNE.

The clusters looked incredible.

Digits formed tight, well-separated groups.

The output was visually satisfying in a way PCA never was.

It almost looked as though the dataset had organized itself.

Then I started reading more about how t-SNE works.

That's when I learned an important lesson.

t-SNE prioritizes preserving local neighborhoods.

Points that are close together in the original space tend to stay close together in the plot.

Global geometry becomes much less important.

This means something subtle but important.

The clusters themselves are often meaningful.

The distances between clusters are often not.

Humans naturally assume that if Cluster A is closer to Cluster B than to Cluster C, then A and B must be more similar.

With t-SNE, that assumption can easily be wrong.

The experience taught me something I now apply beyond dimensionality reduction.

The most visually impressive result isn't always the most informative one.
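One way to see why distances between clusters carry so little weight: t-SNE converts distances into neighbor probabilities with a Gaussian kernel, so anything more than a few bandwidths away gets an affinity of practically zero. The sketch below is a simplification under stated assumptions. It uses a fixed bandwidth `sigma`, whereas real t-SNE tunes one per point from the perplexity setting, and the function name and distances are made up.

```python
import math


def conditional_affinities(distances, sigma=1.0):
    """Gaussian neighbor probabilities p(j|i) for a single point i.

    `distances` holds the distances from point i to every other point.
    Assumption: one fixed sigma; t-SNE itself picks sigma per point.
    """
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


# Point i has two close neighbors and one point from a distant cluster.
far_cluster_at_10 = conditional_affinities([0.5, 0.8, 10.0])
far_cluster_at_100 = conditional_affinities([0.5, 0.8, 100.0])

print(far_cluster_at_10)
print(far_cluster_at_100)
```

Whether the other cluster is 10 or 100 units away, its affinity is effectively zero and the probabilities for the close neighbors are unchanged. The objective sees almost no difference between the two situations, which is why the gaps between clusters in the final plot shouldn't be read as distances.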

Why UMAP Felt Different

After seeing the extremes of PCA and t-SNE, UMAP felt like the first algorithm that was trying to strike a balance rather than optimize a single objective.

PCA focuses on preserving variance.

t-SNE focuses heavily on preserving local neighborhoods.

UMAP sits somewhere in between.

The underlying assumption behind UMAP is that high-dimensional data often lies on a lower-dimensional manifold. Instead of asking where the variance is largest, UMAP asks a different question:

Which points genuinely belong together, and what hidden structure could explain those relationships?

For my experiments, I used 15 neighbors, a cosine distance metric, and projected the data into three dimensions. These choices turned out to be important.

Using 15 neighbors meant that each digit considered a reasonably sized local neighborhood when constructing the manifold. If I had chosen a much smaller value, the visualization would have focused almost entirely on local structure, producing tighter but potentially misleading clusters. A much larger value would have emphasized global relationships at the expense of local detail. Fifteen felt like a practical middle ground.

The cosine distance metric was equally interesting. Instead of measuring absolute pixel differences, cosine distance compares the direction of two pixel vectors and ignores their overall magnitude, so it focuses on whether two images share the same pattern. For handwritten digits, that matters. Two people can write the same digit with lighter or heavier strokes, yet humans immediately recognize them as the same shape. Cosine distance captures that intuition surprisingly well.
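A tiny illustration of the intensity point, using made-up five-pixel "images" rather than real digits. The vectors and function names are assumptions. Note that it only demonstrates invariance to overall intensity; a change in stroke thickness alters the pattern itself and would not cancel out this cleanly.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; ignores the overall magnitude of the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical pixel rows: same stroke pattern at two intensities,
# plus a different pattern with similar brightness.
faint = [0.0, 0.2, 0.4, 0.2, 0.0]
bold = [0.0, 0.4, 0.8, 0.4, 0.0]      # faint scaled by 2
shifted = [0.4, 0.2, 0.0, 0.0, 0.0]   # stroke in a different place

print(cosine_distance(faint, bold), euclidean_distance(faint, bold))
print(cosine_distance(faint, shifted), euclidean_distance(faint, shifted))
```

The faint and bold versions are far apart in Euclidean terms but have a cosine distance of essentially zero, while the shifted pattern stays clearly distant under both metrics.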

What stood out most in the visualization was that the clusters remained well-defined without feeling artificially separated. Unlike t-SNE, where some clusters appeared as isolated islands floating in space, UMAP preserved more of the broader organization of the dataset. Digits with similar visual characteristics often occupied nearby regions, and the overall arrangement felt more coherent.

The decision to use three dimensions also revealed something I would have missed in a standard 2D plot. Some groups that appeared partially overlapping in two dimensions unfolded more naturally when given an additional degree of freedom. The manifold had more room to express its structure, making the relationships between digits easier to interpret.

What I appreciated most about UMAP was that it felt less interested in creating the prettiest visualization and more interested in preserving a useful representation of the data. The clusters were slightly less dramatic than t-SNE's, but they felt more trustworthy.

If PCA taught me that variance is not the same as separability, and t-SNE taught me to be cautious of beautiful plots, UMAP taught me that understanding data often requires balancing local detail with global structure. That balance is probably why UMAP has become the default visualization tool for many machine learning practitioners today.

The Surprisingly Interesting Sammon Mapping

Before this project, I had barely encountered Sammon Mapping.

Compared to PCA or t-SNE, it's rarely discussed.

After experimenting with it, I think that's unfortunate.

Sammon Mapping attempts to preserve the pairwise distances between points during projection, with extra weight on the small ones.

In other words, it's trying to stay faithful to the geometry of the original space.

The trade-off becomes obvious immediately.

It's computationally expensive.

The visualizations aren't as dramatic.

The clusters don't explode apart like they do with t-SNE.

But that's exactly the point.

Sammon Mapping is optimizing for honesty rather than visual appeal.

That made it one of the most interesting methods in the project.
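The quantity Sammon Mapping minimizes is usually called Sammon's stress: the squared mismatch between each original distance and its projected counterpart, divided by the original distance and normalized by the sum of all original distances. Here's a sketch of just that error measure, without the optimization loop. The example points and function names are made up for illustration.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Distances for every unordered pair, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def sammon_stress(original, embedded):
    """Sammon's stress between a dataset and its low-dimensional embedding.

    Assumes no two original points coincide (each distance is divided by
    the original distance, so zero distances would break the formula).
    """
    d_high = pairwise_distances(original)
    d_low = pairwise_distances(embedded)
    scale = sum(d_high)
    return sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low)) / scale


# Hypothetical 3D points that lie almost in a plane.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.1)]
faithful = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
distorted = [(0.0, 0.0), (3.0, 0.0), (0.0, 0.2), (3.0, 0.2)]

print(sammon_stress(original, faithful))
print(sammon_stress(original, distorted))
```

An embedding that stretches one axis and squashes the other may still look tidy, but its stress is far higher than the one that keeps the original distances nearly intact. Computing every pair is also what makes the method expensive as the dataset grows.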

When the KNN Graph Changed My Perspective

Most dimensionality reduction techniques represent data as points.

KNN Graphs represent data as relationships.

That sounds like a small difference.

It isn't.

Instead of asking:

Which cluster does this point belong to?

I found myself asking:

Which points connect different regions of the dataset?

The graph exposed bridge points, ambiguous digits, and unusual samples that were much harder to notice in traditional scatter plots.

It shifted my focus away from clusters and toward connectivity.

For exploratory analysis, that perspective can be incredibly valuable.
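As a rough sketch of that shift in perspective, the snippet below builds a brute-force k-nearest-neighbor graph and flags points whose neighbors carry a different label. The data, the labels, the function names, and this particular definition of a "bridge point" are all assumptions made for illustration, not the project's implementation.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbors."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph


def bridge_points(graph, labels):
    """Indices with at least one neighbor from a different label group."""
    return [
        i for i, neighbors in graph.items()
        if any(labels[j] != labels[i] for j in neighbors)
    ]


# Two tight hypothetical groups plus one sample sitting between them.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),      # group 0
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),  # group 1
    (5.6, 0.4),                                          # labeled 0, in between
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]

graph = knn_graph(points, k=3)
print(graph[8])
print(bridge_points(graph, labels))
```

In a scatter plot the in-between sample is just another dot. In the graph it is the only node whose edges reach into the other group, which is exactly the kind of sample worth inspecting by hand.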

What Autoencoders Revealed

The Autoencoder was where the project started feeling less like classical machine learning and more like modern AI.

Unlike PCA or t-SNE, the Autoencoder isn't applying a predefined projection rule.

It's learning a representation.

The network compresses the input into a latent space and then attempts to reconstruct the original image.

To succeed, it must learn which information matters.

The resulting latent space felt fundamentally different from the classical methods.

It wasn't simply a compressed version of the original data.

It was a learned representation of the data.

The structure felt smooth.

Continuous.

Almost as though the digits existed on an underlying manifold rather than as isolated clusters.

For the first time, I could see why latent representations became such an important idea in deep learning.
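To show the compress-then-reconstruct loop in the smallest possible form, here's a toy autoencoder with a single linear latent unit and tied weights, trained by plain gradient descent on made-up 2D points. Everything here is an assumption for illustration: the data, the learning rate, the step count, and the architecture. A real autoencoder for digits would use nonlinear layers and a deep learning library, and a linear one like this can only recover the same subspace PCA finds.

```python
import math


def reconstruction_loss(w, data):
    """Mean squared error after encoding to 1D and decoding back to 2D."""
    total = 0.0
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]   # encode: 2D input -> 1D latent
        r0 = x[0] - z * w[0]            # residual after decoding z * w
        r1 = x[1] - z * w[1]
        total += r0 * r0 + r1 * r1
    return total / len(data)


def train(data, w=(1.0, 0.0), lr=0.01, steps=500):
    """Gradient descent on the reconstruction loss (tied encoder/decoder)."""
    w = list(w)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]
            r0 = x[0] - z * w[0]
            r1 = x[1] - z * w[1]
            rw = r0 * w[0] + r1 * w[1]
            # d/dw of ||x - (w.x) w||^2
            g0 += -2.0 * (rw * x[0] + z * r0)
            g1 += -2.0 * (rw * x[1] + z * r1)
        w[0] -= lr * g0 / len(data)
        w[1] -= lr * g1 / len(data)
    return tuple(w)


# Hypothetical points spread along the diagonal with a little sideways noise.
s = math.sqrt(0.5)
pairs = [(-2.0, 0.1), (-1.0, -0.1), (0.0, 0.0), (1.0, 0.1), (2.0, -0.1)]
data = [(t * s - e * s, t * s + e * s) for t, e in pairs]

initial_loss = reconstruction_loss((1.0, 0.0), data)
learned_w = train(data)
final_loss = reconstruction_loss(learned_w, data)
alignment = abs(learned_w[0] * s + learned_w[1] * s)  # |cosine| with diagonal
print(initial_loss, final_loss, learned_w)
```

Nothing tells the network that the diagonal matters. It rotates its single latent axis onto it only because that is the direction that makes reconstruction possible.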

The Lesson I Didn't Expect

I started this journey because PCA didn't improve a model.

I finished it with a completely different appreciation for exploratory data analysis.

The most important insight wasn't that one dimensionality reduction technique is better than another.

It was that every technique reveals a different aspect of the data.

PCA reveals variance.

LDA reveals separability.

t-SNE reveals neighborhoods.

UMAP balances local and global structure.

Sammon Mapping reveals geometry.

KNN Graphs reveal connectivity.

Autoencoders reveal learned representations.

The algorithms weren't competing.

They were answering different questions.

And that's probably the biggest lesson I took away from the project.

Before spending weeks tuning hyperparameters or experimenting with new models, it's worth asking a simpler question:

Do I actually understand the shape of my data?

Sometimes the fastest way to improve a model isn't another optimization trick.

It's developing a better intuition for the space your data lives in.

Explore the Project

I open-sourced the entire project so anyone can experiment with the visualizations themselves.

GitHub:
https://github.com/siva-rgb/Dim_Reduction

If you run it, I'd encourage you to spend less time looking for the "best" dimensionality reduction technique and more time asking:

What is each technique trying to tell me about the data?

That's where the interesting insights usually begin.