ML Hit 99% Accuracy on Yield Prediction — The Factory Floor Ignored It
The push to bring ML into semiconductor FAB (fabrication facility) yield prediction has exploded over the past two years. Dig through ArXiv and you'll find N-BEATS+GNN for anomaly prediction, Transformer-based SPC precursor detection, semi-supervised defect segmentation, and statistical difference scores for tool matching: no shortage of methods.
Every paper reports high accuracy on test data. Some claim F1 > 0.9, AUC 0.99, classification accuracy in the 99% range. By the numbers, this looks like a solved problem.
But the factory floor won't use them.
Not because accuracy is insufficient. Because how accuracy is achieved doesn't match how production decisions are made. This article dissects 5 ArXiv papers to identify 3 structural reasons ML fails to penetrate semiconductor yield improvement — and the viable breach points despite those walls.
The 5 Papers: What's Being Proposed
1. N-BEATS + GNN for Anomaly Prediction
Unsupervised Anomaly Prediction with N-BEATS and Graph Neural Network in Multi-variate Semiconductor Process Time Series (Sorensen et al., 2025)
N-BEATS is a deep learning architecture for time series forecasting. GNN represents inter-sensor dependencies as graph structures. The system detects anomaly precursors from thousands of multivariate time series parameters. Unsupervised — no labels needed.
Approach A: N-BEATS (each sensor independent)
Sensors → N-BEATS learns normal patterns → residual > threshold → alert
Approach B: GNN (sensor interdependency)
Sensors → GNN builds causal graph → graph anomaly score → alert
The key is predicting anomalies before they happen, not detecting them after. FABs have a multi-hour lag between equipment anomaly and wafer scrap. Catching precursors saves wafers.
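The residual-thresholding loop of Approach A can be sketched in a few lines, with a naive rolling-mean forecaster standing in for N-BEATS (the window size, 3-sigma threshold, and sensor values below are arbitrary choices, not from the paper):

```python
import numpy as np

def residual_alerts(series, window=20, k=3.0):
    """Flag points whose forecast residual exceeds k sigma.

    Stand-in forecaster: rolling mean of the previous `window` points.
    The paper uses N-BEATS; only the residual-thresholding logic is shown.
    """
    series = np.asarray(series, dtype=float)
    alerts = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        forecast = hist.mean()                 # N-BEATS prediction would go here
        residual = abs(series[t] - forecast)
        if residual > k * hist.std():          # threshold on the residual
            alerts.append(t)
    return alerts

# A stable sensor trace with one injected excursion
rng = np.random.default_rng(0)
trace = rng.normal(100.0, 0.5, 200)
trace[150] += 10.0                             # anomaly precursor
print(residual_alerts(trace))                  # → includes 150
```

The threshold `k` is exactly the false-positive dial discussed below: raise it and alerts get rarer but misses grow.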
2. AI-Driven Proactive SPC
Proactive Statistical Process Control Using AI (Seeam, 2025)
Traditional SPC is reactive — responds after control limits are breached. This paper uses Facebook Prophet for time series prediction, forecasting whether the next measurement will exceed limits. (Note: this paper was withdrawn from ArXiv, but the approach analysis remains valuable.)
| | Traditional SPC | AI-SPC |
|---|---|---|
| Detection | After excursion | Before excursion |
| Response | Stop → investigate | Preventive maintenance → continue |
| Data | Univariate | Multivariate capable |
| Updates | Fixed control limits | Model retraining adapts |
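The decision logic this enables is worth making concrete: alert when the *forecast* crosses a control limit, not the current point. The paper uses Facebook Prophet for the forecasting step; a least-squares line is a minimal dependency-free stand-in here (the drift values and UCL are invented):

```python
import numpy as np

def forecast_breach(measurements, ucl, horizon=5):
    """Fit a linear trend and check whether the extrapolated value
    crosses the upper control limit within `horizon` steps.

    The paper uses Prophet; a least-squares line stands in for the
    forecasting step so only the decision logic is shown.
    """
    y = np.asarray(measurements, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)     # linear trend fit
    forecast = slope * (len(y) - 1 + horizon) + intercept
    return forecast > ucl, forecast

# Drifting CD measurement: still inside limits, but trending out
drifting = [10.0, 10.1, 10.3, 10.4, 10.6, 10.7, 10.9]
breach, value = forecast_breach(drifting, ucl=11.5, horizon=5)
print(breach)   # → True: the forecast crosses the UCL before the chart does
```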
3. CdZnTe Semiconductor Defect Segmentation
Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation (Li et al., 2025)
Segments defects from CdZnTe (Cadmium Zinc Telluride) wafer images. Low-contrast defect boundaries are the challenge — annotators cross-reference multiple views to judge.
The paper exploits this many-to-one structure through semi-supervised learning. Unlabeled images contribute to training, reducing annotation cost.
4. Tool-to-Tool Matching Analysis
Tool-to-Tool Matching Analysis Based Difference Score Computation (Bharadwaja et al., 2025)
Quantifies processing result differences between tools performing the same process step. Tool A produces good yield, Tool B doesn't — tool-to-tool variation is a permanent reality in FABs.
Traditional approaches compare against a golden reference (ideal tool state baseline), but maintaining golden references in production is impractical. The paper proposes difference scores for relative comparison, applicable even across heterogeneous tools (different manufacturers).
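The paper's exact scoring method isn't reproduced here, but the idea of a reference-free relative comparison can be sketched with an empirical 1-D Wasserstein distance between two tools' sensor traces (tool names and temperature values are invented):

```python
import numpy as np

def tool_difference_score(trace_a, trace_b):
    """Empirical 1-D Wasserstein distance between two equal-length
    sensor traces: mean absolute gap between sorted samples.

    An illustrative relative score, not the paper's exact method.
    No golden reference needed; any two tools can be compared directly.
    """
    a = np.sort(np.asarray(trace_a, dtype=float))
    b = np.sort(np.asarray(trace_b, dtype=float))
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(1)
tool_a = rng.normal(350.0, 1.0, 500)   # chamber temperature, tool A
tool_b = rng.normal(350.3, 1.0, 500)   # tool B runs ~0.3°C hot
tool_c = rng.normal(350.0, 1.0, 500)   # tool C matches tool A
print(tool_difference_score(tool_a, tool_b))  # ~0.3
print(tool_difference_score(tool_a, tool_c))  # near zero
```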
5. LSTM + Sentiment Analysis for Industry Trends
Semiconductor Industry Trend Prediction with Event Intervention Based on LSTM Model (Yen & Chen, 2025)
Combines text data (sentiment analysis from TSMC quarterly reports) with numerical data (financial indicators) via LSTM to predict industry trends. Not direct yield prediction, but it demonstrates multimodal text+numerical input — a concept applicable to process engineer logs and maintenance records.
Wall 1: The Accuracy Paradox — Why 99% Is Useless
Reading all five papers, the first impression is: accuracy is there. The question is whether that accuracy changes decisions on the factory floor.
What does 99% accuracy mean for semiconductor yield prediction? A wafer carries thousands of die; take 1,000 of them at 95% yield, so 50 are defective and 950 are good. Assuming 99% sensitivity and specificity, the model catches 49.5 of the 50 defectives but falsely flags 9.5 of the 950 good die.
Those 9.5 false positives are the problem.
```python
# 99%-sensitivity/specificity model on a 95% yield line
total_die = 1000
defective = 50    # actual defects
normal = 950      # actual good

sensitivity = specificity = 0.99
true_positive = defective * sensitivity         # 49.5 (defects correctly caught)
false_positive = normal * (1 - specificity)     # 9.5 (good die flagged as bad)
false_negative = defective * (1 - sensitivity)  # 0.5 (missed defects)

precision = true_positive / (true_positive + false_positive)
# 49.5 / 59.0 = 83.9%
# Factory floor reaction: "1 in 6 alerts is wrong? Then none are trustworthy."
```
Precision 83.9%. Impressive in ML. Unacceptable in a FAB. When roughly one alert in six is a false alarm, re-inspecting flagged die costs nearly as much as screening everything. The real objection is cost asymmetry: a missed defect (false negative) means a few hundred thousand dollars in scrapped wafers, while a false alarm that stops the entire line costs millions per hour in opportunity loss.
Paper 1's N-BEATS+GNN uses unsupervised methods to sidestep this — threshold-based anomaly scores let you control false positive rates. But raising thresholds increases misses. Those misses become yield loss. Same tradeoff, different packaging.
The core issue: FABs don't want "defect prediction." They want "identify and eliminate the root cause of yield loss." Prediction is a means, not the goal. Most papers blur this distinction.
Wall 2: The Data Wall — The Preprocessing Papers Gloss Over Is the Heaviest Lift
All five papers share one assumption: structured data exists.
N-BEATS+GNN expects thousands of time series parameters. Proactive SPC assumes measurement time series. Tool matching assumes standardized process data per tool.
Reality in a FAB:
- Scattered data: Equipment logs in CSV, inspection results in Oracle DB, lot info in MES, physical analysis in Excel and PDF. Nothing integrated
- Uneven sampling: Some steps do 100% inspection, others sample. The "all parameters for all wafers" that models expect doesn't exist
- Ambiguous labels: Root cause attribution is human judgment. Different engineers classify the same failure mode differently
- Misaligned timestamps: Tool A logs at 1-second precision, Tool B at 1-minute, MES records process completion time only. Joining is non-trivial
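The timestamp problem at least has a well-worn tool: `pandas.merge_asof` does a nearest-key join with a tolerance, instead of the exact-timestamp join that would match nothing. A sketch with invented lot IDs and readings:

```python
import pandas as pd

# Tool A: 1-second equipment log (values illustrative)
tool_log = pd.DataFrame({
    "ts": pd.to_datetime([
        "2025-01-01 09:00:01", "2025-01-01 09:00:02",
        "2025-01-01 09:00:59", "2025-01-01 09:01:30",
    ]),
    "chamber_temp": [350.1, 350.2, 350.4, 350.9],
})

# MES: process completion times only, 1-minute precision
mes = pd.DataFrame({
    "ts": pd.to_datetime(["2025-01-01 09:01:00", "2025-01-01 09:02:00"]),
    "lot_id": ["LOT-A", "LOT-B"],
})

# For each lot, take the latest sensor reading at or before completion,
# within a 2-minute tolerance (both frames must be sorted on the key)
joined = pd.merge_asof(
    mes, tool_log, on="ts", direction="backward",
    tolerance=pd.Timedelta("2min"),
)
print(joined[["lot_id", "chamber_temp"]])
# LOT-A → 350.4 (the 09:00:59 reading), LOT-B → 350.9 (the 09:01:30 reading)
```

This is one join of the dozens an ETL pipeline needs, which is why the "months" in the diagram below are not an exaggeration.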
Paper 3's CdZnTe work is interesting because it honestly acknowledges this. "Annotators cross-reference multiple views" — meaning ground truth generation itself requires domain expertise. Semi-supervised learning isn't a clever optimization; it's an admission that labeled data is prohibitively expensive.
Paper assumption:
Clean dataset → Model training → High accuracy
Factory reality:
Scattered data → Build ETL pipeline (months) → Data quality improvement (months)
→ Finally train model → Accuracy poor → Collect more data (more months)
→ Retrain → Finally good accuracy → Process change happens
→ Model retraining needed → Back to square one
Paper 4 honestly states that golden references are impractical in production. But its proposed difference score method still assumes standardized data formats across tools. It claims heterogeneous tool support while ignoring the messy reality of data integration.
Wall 3: The Organizational Wall — ML Engineers and Process Engineers Speak Different Languages
The biggest wall that technical papers completely ignore.
In a semiconductor FAB, process engineers own yield. They don't care about F1 scores. They want to know which tool, which parameter, through which physical mechanism, is causing defects.
When N-BEATS+GNN says "anomaly precursor detected," the process engineer's next question is "So what's the root cause?" If GNN builds an inter-sensor dependency graph, can the graph edges be given physical meaning? If not, every alert requires manual root cause investigation — same as today.
The Proactive SPC paper sells "respond before limits are breached," but process engineers have a different concern. SPC charts are audit targets for quality assurance, governed by standardized statistical methods. "We stopped the tool because AI predicted an excursion" doesn't survive an audit.
Process engineer thinking:
"Why is this ML saying it's abnormal?"
→ Can't explain → Can't trust → Won't use
ML engineer thinking:
"AUC is 0.99, why won't they use it?"
→ Add explainability → Show SHAP values → "What do these SHAP numbers mean?"
→ Can't explain again → Back to square one
Paper 3's CdZnTe work does better because its output is a segmentation map — a visualization of where defects are on the wafer image. Process engineers can look at images and apply physical interpretation. This output format, closer to human judgment processes than abstract graph structures, lowers the adoption barrier.
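That output-format argument is easy to make concrete: blending a predicted defect mask onto the wafer image takes a few lines of NumPy (a sketch; the array shapes, alpha, and red tint are arbitrary choices, not from the paper):

```python
import numpy as np

def overlay_mask(image, mask, alpha=0.5):
    """Blend a binary defect mask (red) onto a grayscale wafer image.

    image: (H, W) floats in [0, 1]; mask: (H, W) bools.
    Returns an (H, W, 3) RGB array a process engineer can eyeball.
    """
    rgb = np.stack([image] * 3, axis=-1)
    red = np.zeros_like(rgb)
    red[..., 0] = 1.0
    blended = (1 - alpha) * rgb + alpha * red
    return np.where(mask[..., None], blended, rgb)

image = np.full((4, 4), 0.8)           # flat gray wafer patch
mask = np.zeros((4, 4), dtype=bool)
mask[1, 2] = True                      # one predicted defect pixel
out = overlay_mask(image, mask)
print(out[1, 2])   # → [0.9 0.4 0.4]: red-tinted defect
print(out[0, 0])   # → [0.8 0.8 0.8]: untouched background
```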
Breach Points: 3 Conditions Where ML Actually Improves Yield
Three walls, but I'm not rejecting ML entirely. Under these conditions, ML genuinely contributes to yield improvement.
Condition 1: Use for Prioritization, Not Prediction
Don't use the 99% classifier as final defective/good judgment. Use it for inspection priority ranking.
```python
# Wrong: use as a classifier
if model.predict(wafer) == "defective":
    scrap(wafer)  # high false-positive risk

# Right: use as a ranker
risk_scores = model.predict_proba(wafers)[:, 1]  # P(defective) per wafer
high_risk = wafers[risk_scores > threshold]
# Priority-inspect high_risk only → cuts inspection cost vs. 100% inspection
```
Paper 1's approach is close to this direction. Prioritizing maintenance on tools with the highest anomaly scores, without full automation, already delivers significant downtime reduction.
Condition 2: Visualizable Output Format
Paper 3's segmentation works because the output is an image. Same logic: visualize anomaly detection results on dashboards that process engineers can intuitively understand.
Paper 4's difference scores also pair well with visualization. Display tool-to-tool differences as heatmaps — Tool A's temperature control is 0.3°C off from Tool B — readable at a glance.
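A minimal sketch of the heatmap-ready data shape, using invented per-run summaries: a tool × parameter matrix of offsets from the fleet median, which drops straight into `seaborn.heatmap` or similar:

```python
import pandas as pd

# Per-run summary statistics per tool (values illustrative)
runs = pd.DataFrame({
    "tool":  ["A", "A", "B", "B", "C", "C"],
    "temp":  [350.0, 350.1, 350.3, 350.4, 350.0, 349.9],
    "press": [12.0, 12.1, 12.0, 12.1, 12.6, 12.5],
})

# Tool × parameter matrix of offsets from the fleet median,
# exactly the 2-D shape a heatmap expects
per_tool = runs.groupby("tool").mean()
offsets = per_tool - per_tool.median()
print(offsets.round(2))
# Tool B's temperature runs ~0.3 high; tool C's pressure runs ~0.5 high
```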
Condition 3: Build MLOps Infrastructure That Tracks Process Changes First
The "process change resets everything" problem from Wall 2 is addressed by continuous retraining pipelines.
Equipment data → Real-time ETL → Feature store
↓
Model serving ← Periodic retraining (weekly)
↓
Anomaly score API → Dashboard
→ SPC integration
Build this MLOps foundation first, then make models swappable on top. N-BEATS works? Use N-BEATS. Transformer beats it? Swap in Transformer. Pipeline robustness has higher ROI than model accuracy.
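Swappability is mostly an interface question. A sketch, with a trivial baseline satisfying the interface (the `AnomalyScorer` protocol is my naming, not an established API; an N-BEATS or Transformer wrapper would plug in the same way):

```python
from typing import Protocol

import numpy as np

class AnomalyScorer(Protocol):
    """What the serving layer depends on, not any one architecture."""
    def fit(self, X: np.ndarray) -> "AnomalyScorer": ...
    def score(self, X: np.ndarray) -> np.ndarray: ...

class MeanDistanceScorer:
    """Trivial baseline: distance from the training-data centroid."""
    def fit(self, X):
        self.center_ = X.mean(axis=0)
        return self

    def score(self, X):
        return np.linalg.norm(X - self.center_, axis=1)

def serve(scorer: AnomalyScorer, X_train, X_new):
    # The pipeline only sees the interface, so models are swappable
    return scorer.fit(X_train).score(X_new)

X_train = np.zeros((10, 3))
X_new = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
print(serve(MeanDistanceScorer(), X_train, X_new))  # → [0. 5.]
```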
Trying Yield ML at Individual Scale
For readers who don't have a FAB.
Start with the SECOM Dataset
The SECOM Dataset on the UCI repository contains real semiconductor manufacturing sensor data. 590 features, 1,567 samples, binary classification (pass/fail).
```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Load SECOM data (whitespace-separated, no header row)
data = pd.read_csv("secom.data", sep=r"\s+", header=None)
labels = pd.read_csv("secom_labels.data", sep=r"\s+", header=None)[0]  # -1 = pass, 1 = fail

# High missing rate (just like real FAB data)
print(f"Missing rate: {data.isnull().sum().sum() / data.size * 100:.1f}%")
# → ~4.5%, which is actually clean by FAB standards

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("clf", GradientBoostingClassifier(n_estimators=200)),
])

# f1 scores the minority "fail" class (label 1), the one that matters
scores = cross_val_score(pipe, data, labels, cv=5, scoring="f1")
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
# → Accuracy looks decent. Whether it's usable in production is another story
```
This dataset lets you experience Wall 1's accuracy paradox firsthand. F1 comes out reasonable, but how to set the precision/recall tradeoff can't be decided without manufacturing process knowledge.
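One way to feel that tradeoff: sweep the threshold and read off the precision/recall menu. A sketch on synthetic scores standing in for a SECOM model's output (the beta-distribution parameters are arbitrary):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic risk scores standing in for predict_proba output:
# good die score low, defective die score high, with overlap
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(950), np.ones(50)])        # 5% fail rate
scores = np.concatenate([rng.beta(2, 5, 950), rng.beta(5, 2, 50)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# The sweep shows the menu; picking the operating point needs process
# knowledge: each threshold trades scrap cost against line-stop cost
for target in (0.99, 0.90, 0.80):
    best_p = precision[recall >= target].max()
    print(f"recall ≥ {target:.2f} → best achievable precision {best_p:.2f}")
```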
Running GNN Anomaly Detection on RTX 4060 8GB
To reproduce Paper 1's N-BEATS+GNN approach locally, PyTorch Geometric works. SECOM-scale data runs comfortably on an RTX 4060 8GB.
```python
import torch
from torch_geometric.nn import GCNConv

class SensorGNN(torch.nn.Module):
    def __init__(self, num_features, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, 32)
        self.classifier = torch.nn.Linear(32, 2)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        return self.classifier(x)

# Build a graph from the sensor correlation matrix
# (`data` here is the SECOM DataFrame loaded above)
correlation = data.corr().abs()
threshold = 0.7
edges = (correlation > threshold).values
edge_index = torch.tensor(
    [[i, j] for i in range(edges.shape[0])
     for j in range(edges.shape[1]) if edges[i][j] and i != j],
    dtype=torch.long,
).t().contiguous()

# VRAM usage: ~200MB (590 nodes, correlation graph)
# Trainable on an RTX 4060: a 590-node graph is small
```
590 sensors as nodes, edges between highly correlated sensors. Whether this graph structure reflects the physical dependencies of the process is a good exercise to explore at individual scale.
What This Round of Papers Revealed
ML adoption in semiconductor manufacturing is not a model problem — it's an integration problem.
N-BEATS, GNN, Transformer — they all work technically. Accuracy checks out. But between "works" and "gets used" lies an enormous gap: data pipelines, visualization, explainability, organizational decision-making processes.
Bridging this gap requires neither ML engineers nor process engineers alone. It requires people who speak both languages — who understand semiconductor physics and ML architectures, and can design how model outputs connect to which factory floor decisions.
Papers propose models. The factory floor demands answers. The translator between them is what's most scarce.
References
- Sorensen, Dey, Hwang, Halder. "Unsupervised Anomaly Prediction with N-BEATS and GNN in Multi-variate Semiconductor Process Time Series" (2025)
- Seeam. "Proactive Statistical Process Control Using AI" (2025) [Withdrawn from ArXiv]
- Li, Fang, Liu et al. "Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors" (2025)
- Bharadwaja, Jandial, Agashe et al. "Tool-to-Tool Matching Analysis Based Difference Score Computation Methods for Semiconductor Manufacturing" (2025)
- Yen & Chen. "Semiconductor Industry Trend Prediction with Event Intervention Based on LSTM Model" (2025)