This work was carried out as part of an intensive Applied Materials & Extra-Tech bootcamp, where the challenge went far beyond choosing the “right” denoising model.
I would like to thank my mentors Roman Kris and Mor Baram from Applied Materials for their technical guidance, critical questions, and constant push toward practical, production-level thinking, as well as Shmuel Fine and Sara Shimon from Extra-Tech for their support and teaching throughout the process.
In classical image processing, "clean" is a compliment. In semiconductor SEM denoising, "clean" is often a lie.
The obvious goal of a denoiser is to remove noise. But in scientific and industrial imaging, the actual objective is evidence preservation. Microscopic edges of a conductor, the subtle texture of a silicon surface, or a tiny defect—these signals carry critical meaning.
A denoiser can easily make an image look pleasant to the human eye while silently scrubbing away the very details that change the entire analysis.
Building SEMNR taught me a hard lesson: standard evaluation methods were a trap. I didn't need a leaderboard to brag about; I needed engineering guardrails. Here is how I moved from chasing high scores to building a trust profile for my data.

High Score vs. High Trust: The middle image has a better PSNR score but blurred the critical edges of the wafer lines.
The right image (SEMNR) preserves the sharp structure and original texture, even if it looks less smoothly "clean".
Defining What I Refuse to Lose
Before training a single model, I defined exactly what I refused to lose. Metric selection became an active engineering decision, not just a passive acceptance of default tools.
I found that aggressive noise reduction often fights directly against preserving structure:
- Metrics that reward smoothness (like standard PSNR in many cases) actively encourage over-smoothing. The model learns to blur textures just to get a better score by minimizing pixel error.
- Metrics that ignore texture basically give the model permission to "hallucinate" details that aren't there, or worse, wipe out real defects critical for quality control.
To validate this, I ran "stress tests"—applying artificial blur, over-sharpening, and artifacts to SEM samples—to see which metrics flagged issues and which stayed silent. The results were wildly inconsistent. Often, PSNR improved while the image actually became less analytically useful.
I saw PSNR go up while utility went down. That instantly killed the "single hero number" idea for me.
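To make the stress-test idea concrete, here is a minimal sketch of that kind of harness (illustrative, not the original SEMNR code): it applies blur, over-sharpening, and a crude striping artifact to a reference crop and reports PSNR, SSIM, and a simple gradient-correlation proxy for edge fidelity. The file path, degradation strengths, and the edge proxy are all assumptions made for the example.

```python
# Stress-test sketch: degrade a reference SEM crop in known ways and see
# which metrics react to which degradation. Values assume float images in [0, 1].
import numpy as np
from skimage import img_as_float, io
from skimage.filters import gaussian, sobel, unsharp_mask
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def edge_fidelity(ref, test):
    """Correlation of gradient-magnitude maps: a cheap proxy for edge preservation."""
    g_ref, g_test = sobel(ref).ravel(), sobel(test).ravel()
    return float(np.corrcoef(g_ref, g_test)[0, 1])


ref = img_as_float(io.imread("sem_crop.png", as_gray=True))  # placeholder path

stress_cases = {
    "blur": gaussian(ref, sigma=1.5),
    "oversharpen": np.clip(unsharp_mask(ref, radius=3, amount=2.0), 0, 1),
    # a crude striping pattern, standing in for scan/charging artifacts
    "artifact": np.clip(ref + 0.05 * np.sin(np.linspace(0, 40 * np.pi, ref.shape[1])), 0, 1),
}

for name, degraded in stress_cases.items():
    print(f"{name:12s}"
          f" PSNR={peak_signal_noise_ratio(ref, degraded, data_range=1.0):6.2f}"
          f" SSIM={structural_similarity(ref, degraded, data_range=1.0):.3f}"
          f" edge={edge_fidelity(ref, degraded):.3f}")
```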
The Stack: Profiles Over Scores
Instead of chasing one perfect number, I built a metric profile. Think of it as a QA toolkit where each metric has a specific job description (a minimal code sketch follows the list):

A delicate balance: The goal is to maximize the total area of the chart, not just one spike. Notice how boosting PSNR (Fidelity) often comes at the direct expense of Texture Realism (DISTS).
- PSNR (The Anchor): Measures pixel-level fidelity (how close raw pixel values are to the original). It is my baseline, but I never trust it alone.
- SSIM (The Structural Engineer): Ensures the "skeleton" of the image remains intact (checking macroscopic structures like contact holes or vias).
- FSIM (The Edge Guardian): Critical in SEM. It monitors sharp transitions between materials, flagging if edges are being blurred out.
- DISTS (The Texture Specialist): Captures realism using deep learning features. This is the metric that prevents the "plastic" look and preserves natural grain.
- CNR (The Pragmatist): Reflects practical Contrast-to-Noise detectability. It asks: Can a computer vision algorithm actually spot a defect easier now against the background?
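As a rough illustration of how such a profile can be assembled (a sketch, not the SEMNR implementation): PSNR and SSIM come straight from scikit-image, CNR uses one common definition based on feature and background ROI masks that a real pipeline would have to supply, and FSIM/DISTS are left as placeholders because they typically come from dedicated perceptual-metric packages.

```python
# Metric-profile sketch: one dict per image pair, no single entry trusted alone.
# Assumes float images in [0, 1] and boolean ROI masks for the CNR computation.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def cnr(img, feature_mask, background_mask):
    """Contrast-to-noise ratio: feature vs. background separation, scaled by background noise."""
    fg, bg = img[feature_mask], img[background_mask]
    return float(abs(fg.mean() - bg.mean()) / (bg.std() + 1e-8))


def metric_profile(reference, denoised, feature_mask, background_mask):
    """Return the per-image profile used to compare model outputs."""
    return {
        "psnr": peak_signal_noise_ratio(reference, denoised, data_range=1.0),  # pixel fidelity
        "ssim": structural_similarity(reference, denoised, data_range=1.0),    # structural skeleton
        "cnr": cnr(denoised, feature_mask, background_mask),                   # defect detectability
        # "fsim": ...,  # edge fidelity, via a perceptual-metrics package
        # "dists": ..., # deep-feature texture realism, via a perceptual-metrics package
    }
```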
Where Metrics Disagree – Finding the Debug Signals
The most valuable engineering insights didn't arrive when all metrics went up together. They came when metrics disagreed. I learned to read these conflicts as distinct debugging signals for model behavior (sketched in code after the flowchart below):
- PSNR ⬆️ / FSIM ⬇️: A clear sign of over-smoothing. The model is aggressively cleaning noise but erasing high-frequency edge information.
- SSIM Stable / DISTS ⬇️: The general structure is fine, but I am experiencing texture drift. The surface is losing its authentic material character.
- PSNR ⬆️ / CNR ⬇️: I am technically closer to the ground truth pixels, but I have lost local contrast, making features harder to interpret.

The logic behind the scenes: The flowchart I used to flag failures that the human eye (or PSNR alone) might initially miss.
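Expressed as code, that flag logic might look roughly like the sketch below. The thresholds are illustrative rather than the values used in SEMNR, and every metric is treated as a higher-is-better score (a DISTS implemented as a distance would need its sign flipped).

```python
# Rule-based disagreement flags (thresholds invented for this sketch).
# `before` and `after` are metric dicts for the same test image under two
# model versions, e.g. as returned by a metric_profile() helper.
def disagreement_flags(before, after, eps=0.01):
    flags = []
    # PSNR up, FSIM down: noise removed by erasing high-frequency edges.
    if after["psnr"] > before["psnr"] and after["fsim"] < before["fsim"] - eps:
        flags.append("over-smoothing: PSNR improved while FSIM dropped")
    # SSIM stable, DISTS down: structure intact but texture drifting.
    if abs(after["ssim"] - before["ssim"]) < eps and after["dists"] < before["dists"] - eps:
        flags.append("texture drift: SSIM stable while DISTS dropped")
    # PSNR up, CNR down: closer to ground-truth pixels, but features less detectable.
    if after["psnr"] > before["psnr"] and after["cnr"] < before["cnr"] - eps:
        flags.append("contrast loss: PSNR improved while CNR dropped")
    return flags
```

Run as a gate on every checkpoint, a check like this means a "better" PSNR alone can never promote a model.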
Closing: Shifting from Beauty to Trust
In SEMNR, this process changed my guiding question from "Is this image clean?" to "Is this image trustworthy?"
By building an evaluation stack that uses specific metrics as guardrails against specific failures (like edge blurring), I turned model evaluation from a beauty contest into an engineering safety system.
In the world of scientific and industrial data, my job isn't to beautify reality, but to reveal it with minimal interference. Sometimes, that means leaving a little bit of natural "noise" behind—just to make sure the truth stays in the picture.

The difference is in the micro-details: A zoom-in on a defect at the edge of a structure. Left: A standard model erased the defect along with the noise. Right (SEMNR): The noise is cleared, but the critical defect is preserved sharply.