Post-Mortem: INT8 Quantization Failure on a 3.8 kB Smoke Detector Model

#cpp #embeddedsystems #machinelearning #ai

In edge AI, every byte counts. But compressing weights without a proper validation loop can silently break a model that was working fine.

This is a short post-mortem on exactly that.

What I was trying to do

The smoke detector model I wrote about recently — 12 inputs, 3 layers, binary output — weighs 3.8 kB in float32. I wanted to see if INT8 quantization would bring it down further without hurting accuracy.

The answer was no. Here's why that matters, and how the tooling caught it before anything got deployed.

The problem

Mapping float32 weights to INT8 introduces quantization noise. In large models with millions of parameters, that noise tends to average out across layers. In a compact model where every weight carries significant influence over the output, it doesn't.
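
To make the noise concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. It is not Hasaki's implementation: the names `QuantizedTensor`, `quantize_int8` and `dequantize` are made up for illustration, and the single scale per tensor is an assumption about the scheme.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric per-tensor quantization: w is approximated by scale * q,
// with q an integer in [-127, 127]. Illustrative names, not a real API.
struct QuantizedTensor {
    std::vector<std::int8_t> values;
    float scale;  // float value represented by one integer step
};

QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    // The largest magnitude in the tensor defines the step size.
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));

    QuantizedTensor out;
    out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    out.values.reserve(weights.size());
    for (float w : weights) {
        // Round to the nearest step, then clamp into the INT8 range.
        long q = std::lround(w / out.scale);
        q = std::max(-127L, std::min(127L, q));
        out.values.push_back(static_cast<std::int8_t>(q));
    }
    return out;
}

// Recover the float value that the stored integer actually represents.
float dequantize(const QuantizedTensor& t, std::size_t i) {
    assert(i < t.values.size());
    return t.scale * static_cast<float>(t.values[i]);
}
```

With made-up weights {0.82, -0.031, 0.004, -1.27}, the step is about 0.01, so 0.004 rounds to 0 and disappears from the model entirely. In a network with only a handful of weights per neuron, there is little else to compensate for a loss like that.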

The last layer uses sigmoid — binary classification, 0 or 1. With a fully quantized INT8 model, the output distribution shifted enough to fail the validation threshold.

Running quantize_test in the Hasaki Workbench made it clear:

Float  mean err: 0.000493
INT8   mean err: 0.140836

Result: FAILED

That's roughly a 286-fold increase in mean error (0.000493 to 0.140836), which is unacceptable for a smoke detector, where a false negative is a catastrophe, not a metric.

The test prevented a corrupted binary from reaching the hardware. That's the point of having the test.
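
The internals of quantize_test aren't shown here, so the following is only a sketch of what a gate of this kind can look like. Everything in it is an assumption: the names `mean_abs_error`, `GateResult` and `quantization_gate` are hypothetical, "mean err" is taken to be mean absolute error against the targets, and the threshold is a caller-supplied parameter rather than the value Hasaki uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute error between model outputs and reference targets.
double mean_abs_error(const std::vector<float>& predictions,
                      const std::vector<float>& targets) {
    assert(!predictions.empty() && predictions.size() == targets.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < predictions.size(); ++i) {
        sum += std::fabs(static_cast<double>(predictions[i]) -
                         static_cast<double>(targets[i]));
    }
    return sum / static_cast<double>(predictions.size());
}

struct GateResult {
    double float_err;  // error of the float32 model
    double int8_err;   // error of the quantized model
    bool passed;       // true only if the quantized model is within budget
};

// Hypothetical gate: the export is allowed only when the quantized
// model's mean error stays at or below max_int8_err.
GateResult quantization_gate(const std::vector<float>& float_outputs,
                             const std::vector<float>& int8_outputs,
                             const std::vector<float>& targets,
                             double max_int8_err) {
    GateResult r;
    r.float_err = mean_abs_error(float_outputs, targets);
    r.int8_err = mean_abs_error(int8_outputs, targets);
    r.passed = r.int8_err <= max_int8_err;
    return r;
}
```

The useful property is that the gate returns a hard pass/fail alongside both error figures, so an export script can refuse to write the binary when `passed` is false.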

What I did about it

Two things.

First, reverted to float32. The 100-byte savings INT8 offered weren't worth the reliability cost.

Second, used the failure as a reason to tighten up the training pipeline. Added --seed for reproducibility — same seed, identical model weights, verified with md5. Added --batch-size for explicit control over mini-batch size without relying on auto-tuning. Removed the experimental OpenMP parallelism that was adding overhead without measurable benefit on this dataset size.
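
The seed check can be sketched like this. It is an illustration under assumptions, not the training code: `init_weights` and `fnv1a` are hypothetical names, only weight initialization is shown (a real run also has to keep data order and arithmetic deterministic), and an FNV-1a hash stands in for md5 because md5 is not part of the C++ standard library.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draw `count` initial weights, roughly in [-0.5, 0.5], from a generator
// seeded with `seed`. std::mt19937 produces the same sequence for the
// same seed on every conforming implementation.
std::vector<float> init_weights(std::size_t count, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<float> weights(count);
    for (float& w : weights) {
        // rng() yields a 32-bit value; scale it to [0, 1) and center it.
        w = static_cast<float>(rng() / 4294967296.0 - 0.5);
    }
    return weights;
}

// 64-bit FNV-1a over the raw bytes of the weights, used here as a
// stand-in for an md5 checksum of the exported model.
std::uint64_t fnv1a(const std::vector<float>& weights) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(weights.data());
    for (std::size_t i = 0; i < weights.size() * sizeof(float); ++i) {
        hash ^= bytes[i];
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}
```

Two calls with the same seed should produce byte-identical weights and therefore the same hash; changing the seed should change both.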

The result: a cleaner pipeline, a stable 3.8 kB float32 model, and a validation gate that works.

The real tradeoff

There's a real tradeoff here worth naming. Float32 costs more CPU cycles than INT8, and on hardware without an FPU — like the ATtiny85 — that overhead is significant. On an ESP32 it's barely noticeable.
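
The cycle difference comes from the inner loop of each layer. The sketch below contrasts the two dot products on the host side; `dot_f32` and `dot_i8` are hypothetical names, and the code uses `std::vector` for brevity where firmware would more likely use fixed-size arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Float path: one floating-point multiply-add per weight. Without an
// FPU, each of these operations is emulated in software.
float dot_f32(const std::vector<float>& w, const std::vector<float>& x) {
    assert(w.size() == x.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) acc += w[i] * x[i];
    return acc;
}

// INT8 path: integer multiply-adds into a 32-bit accumulator, wide
// enough that products of two 8-bit values cannot overflow it for
// layers of this size. The real-valued result is approximately
// w_scale * x_scale * acc, applied once per output.
std::int32_t dot_i8(const std::vector<std::int8_t>& w,
                    const std::vector<std::int8_t>& x) {
    assert(w.size() == x.size());
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < w.size(); ++i) {
        acc += static_cast<std::int32_t>(w[i]) * static_cast<std::int32_t>(x[i]);
    }
    return acc;
}
```

The integer version keeps floating-point work out of the loop, which is where the savings on FPU-less parts come from.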

But for this model, on this problem, the tradeoff is clear: a float32 model that detects smoke reliably is better than an INT8 model that introduces enough noise to miss a fire. The cycle cost is acceptable. A missed alarm is not.

What this means in practice

Post-training quantization without quantization-aware training is a gamble on small models. The math that makes it work on ResNet doesn't translate to a 12-input, 96-weight network.

Disk size is a secondary metric. Validation loss and real-world accuracy are what matter.

And quantize_test should run before every export. Not occasionally. Every time.

Hasaki 刃先 is available on GitHub and Codeberg.
Smoke detector project: GitHub, Codeberg