I Benchmarked 4 Lightweight Transformers for Fault Detection. Here's What Survived.

#datascience #deeplearning #machinelearning #performance

Everyone talks about deploying ML on edge devices. Very few people show what happens when you actually try.

I ran a full benchmark of four lightweight transformer models (DistilBERT, MobileBERT, TinyBERT-6L, and TinyBERT-4L) against traditional ML baselines on three real-world fault detection datasets.

The Setup

NASA C-MAPSS: Turbofan engine degradation (20,631 samples, 15% failure rate)
SECOM: Semiconductor manufacturing (1,567 samples, 6.6% failure rate)
UCI Predictive Maintenance: Industrial machine failure (10,000 samples, 3.4% failure rate)

All experiments ran on a T4 GPU with consistent hyperparameters.

The Results

Model	F1	Size	CPU Latency
XGBoost	87.9%	0.5 MB	0.002 ms
TinyBERT-4L	87.8%	55 MB	18 ms
DistilBERT	87.6%	255 MB	138 ms
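
To put the table in perspective, here is a small sketch that turns its rows into ratios against the XGBoost baseline. The numbers are copied from the table above; the `relative_cost` helper is a name made up for this illustration and is not part of the benchmark code.

```python
# Figures copied from the results table: (F1 in %, size in MB, CPU latency in ms).
results = {
    "XGBoost": (87.9, 0.5, 0.002),
    "TinyBERT-4L": (87.8, 55.0, 18.0),
    "DistilBERT": (87.6, 255.0, 138.0),
}


def relative_cost(model, baseline="XGBoost"):
    """Return (F1 difference in points, size ratio, latency ratio) vs. the baseline.

    Hypothetical helper for illustration only.
    """
    f1, size, latency = results[model]
    base_f1, base_size, base_latency = results[baseline]
    return f1 - base_f1, size / base_size, latency / base_latency


for name in ("TinyBERT-4L", "DistilBERT"):
    d_f1, size_x, latency_x = relative_cost(name)
    print(f"{name}: {d_f1:+.1f} F1 points, {size_x:.0f}x size, {latency_x:.0f}x latency")
```

On these figures, TinyBERT-4L is 110 times larger and 9,000 times slower than XGBoost for 0.1 points less F1, and DistilBERT is 510 times larger for 0.3 points less.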

MobileBERT: The Surprise Failure

MobileBERT — specifically designed for mobile deployment — scored 0% F1 on every dataset. It predicted the majority class for every sample across all configurations.

“Designed for mobile” does not mean “works for your use case.”
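
Why does always predicting the majority class give 0% F1 rather than just a low score? With no predicted failures there are no true positives, so precision and recall for the failure class are both zero. A minimal sketch, using synthetic labels at roughly the UCI failure rate (34 failures in 1,000 samples, made up for illustration and not taken from the benchmark data):

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive (failure) class, computed from scratch."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic labels: 3.4% failures (label 1), the rest healthy (label 0).
y_true = [1] * 34 + [0] * 966
# A collapsed model that predicts "healthy" for every sample.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, F1 = {f1_score(y_true, y_pred):.3f}")
```

The same collapsed model scores 96.6% accuracy on these labels, which is why accuracy alone would have hidden the failure.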

The Adaptive Pipeline

The most promising result came from combining models:

Quantized TinyBERT-4L handles confident predictions
DistilBERT steps in only for uncertain cases
87.6% F1 with 97.9% of samples handled by the lightweight model
19.5 ms average latency instead of 138 ms
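
The routing idea above can be sketched as a confidence-threshold cascade. This is a generic illustration, not the code from the repository: the `route` and `expected_latency_ms` helpers, the 0.9 threshold, the use of the top class probability as the confidence score, and the stub models are all assumptions.

```python
def route(sample, small_model, large_model, threshold=0.9):
    """Return (predicted_class, model_used) for one sample.

    Both models are assumed to be callables returning a list of class
    probabilities. The threshold value is a placeholder, not a tuned setting.
    """
    probs = small_model(sample)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "small"
    # The small model is unsure, so escalate to the larger model.
    probs = large_model(sample)
    return probs.index(max(probs)), "large"


def expected_latency_ms(small_ms, large_ms, escalation_rate):
    """Average latency when every sample pays for the small model and
    escalated samples additionally pay for the large one."""
    return small_ms + escalation_rate * large_ms


# Stub models standing in for TinyBERT-4L and DistilBERT (hypothetical).
def small_stub(x):
    return [0.97, 0.03] if x < 0.5 else [0.55, 0.45]


def large_stub(x):
    return [0.2, 0.8]


print(route(0.1, small_stub, large_stub))
print(route(0.9, small_stub, large_stub))
print(expected_latency_ms(18, 138, 0.021))
```

Plugging in the table's unquantized 18 ms and a 2.1% escalation rate gives about 20.9 ms, a little above the reported 19.5 ms. The quantized model's own latency is not listed here, so the two numbers are not expected to match exactly.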

Key Takeaways

Start with XGBoost for tabular data: a 0.5 MB model beating 255 MB transformers is hard to ignore.
TinyBERT-4L is the edge sweet spot: the smallest transformer tested, with near-best accuracy.
Quantize aggressively: INT8 cuts size significantly with minimal loss.
Use adaptive pipelines: route easy predictions through small models, escalate only when needed.
Class imbalance is still unsolved: SECOM remained extremely difficult across all models.
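
For the quantization takeaway, here is a toy sketch of what INT8 does to a weight tensor. It uses a generic symmetric per-tensor scheme in pure Python; the post does not say which quantization method or library was used, so treat the function names and example weights as assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the INT8 values."""
    return [q * scale for q in quantized]


# Made-up example weights, not taken from any of the benchmarked models.
weights = [0.42, -1.27, 0.05, 0.9, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale = {scale:.4f}, max round-trip error = {max_error:.6f}")
```

Each weight is stored in one byte instead of the four bytes of a float32, and the round-trip error per weight is bounded by half the scale, which is the intuition behind "smaller with minimal loss".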

Code

All code and results:
https://github.com/disha8611/edge-fault-detection-benchmark

Previous research on LLM-based anomaly detection:
https://arxiv.org/abs/2604.12218

Disha Patel — Software Engineer & ML Researcher. I write about engineering, on-device ML, and building systems that work in the real world.


Top comments (1)

Harjot Singh • May 31

"Everyone talks about deploying ML on edge, very few show what happens when you try" is the honest gap, and benchmarking the small transformers against traditional ML baselines is the part most people skip, they reach for the fanciest model and never check whether a boring baseline would've won. That comparison is the whole value: on a lot of real fault-detection problems a well-tuned classical model is competitive with a distilled transformer at a fraction of the footprint, and the only way to know is to run it, which you did. The what-survived framing is right because edge deployment adds constraints that don't show on a leaderboard, latency, memory, the model actually fitting on the device, so the winner is rarely the highest-accuracy one, it's the best accuracy-per-resource that meets the constraint. That's the same routing logic I apply everywhere: the right model is the cheapest one that clears the bar for this specific task and target, not the most capable in the abstract. Measure on your data and your hardware, don't trust the general benchmark. That match-the-model-to-the-constraint instinct is core to how I think in Moonshift. Did a transformer actually beat the classical baseline enough to justify its footprint, or did the boring model win on accuracy-per-byte?