QAT vs PTQ on our edge vision model: 6 months of A/B data

#machinelearning #mlops #computervision #pytorch

TL;DR: We ran post-training quantisation (PTQ) and quantisation-aware training (QAT) side by side on the same defect-classification model deployed on a Jetson Orin Nano. After six months in production, QAT recovered 3.1 mAP points over PTQ overall (91.2 vs 88.1) and 3.3 points on rare defect classes against our best-tuned PTQ variant (89.4 vs 86.1), but cost us roughly two extra engineer-weeks of pipeline work and a 4x slower training cycle.

Every time someone shows me a quantisation benchmark on ImageNet, I want to ask what their actual deployment looks like. ImageNet validation accuracy at INT8 tells you almost nothing about whether your model will still detect the 0.4% of defect samples that pay for the whole project. We learned this the hard way at the end of last year, when the first quarter of production data came back from one of our partner sites and our PTQ model was missing scratches that the FP16 baseline caught fine.

This post is the writeup. Six months, one model architecture (ResNet-18 trunk with a custom anchor-free head), two quantisation paths, two hardware targets. No synthetic benchmarks, no toy datasets.

The model and the constraint

The model is a defect classifier on a steel rolling line. Inference runs on a Jetson Orin Nano, 8GB version, sharing the SoC with a stereo depth pipeline. Latency budget for the classification path is 14ms. Memory budget after the depth pipeline takes its share is around 180MB. Five classes including background, with class imbalance roughly 92/3/2/2/1 percent.

FP16 baseline: 17.3ms, 92.4 mAP, 71MB of activations.
INT8 PTQ: 8.9ms, 88.1 mAP, 38MB.
INT8 QAT: 9.2ms, 91.2 mAP, 38MB.

The FP16 baseline misses the 14ms latency budget, so INT8 was never optional. Both INT8 variants look fine in aggregate, but the aggregate hides the problem.

Where PTQ falls apart

Post-training quantisation with TensorRT's entropy calibrator does a reasonable job when your dataset is balanced. Ours is not. The calibration set we initially used was drawn proportionally from the training distribution, which meant 92% of the calibration data was clean background. The activation histograms ended up dominated by the background distribution, and the quantisation scales were tuned for that.

The result was that the rare defect classes (cracks at 1% prevalence, embedded particles at 2%) lost between 5 and 9 mAP points each. High-magnitude activations in the defect feature maps were clipped to the edge of a range calibrated for background. We caught this in a customer review meeting in early March. Not my favourite Tuesday.
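
To make the clipping concrete, here is a minimal sketch of symmetric INT8 quantisation with a fixed clipping threshold. The threshold and activation values are made-up illustrative numbers, not measurements from the model, and `quantize_int8` / `dequantize_int8` are hypothetical helper names rather than TensorRT APIs.

```python
def quantize_int8(x: float, threshold: float) -> int:
    """Symmetric INT8 quantisation: map [-threshold, threshold] onto [-127, 127]."""
    scale = threshold / 127.0
    q = round(x / scale)
    # Anything beyond the calibrated threshold saturates at the INT8 limit.
    return max(-127, min(127, q))


def dequantize_int8(q: int, threshold: float) -> float:
    """Map an INT8 code back to a real value using the same scale."""
    return q * (threshold / 127.0)


# Illustrative numbers only: a threshold calibrated on mostly-background data,
# one activation typical of background and one large defect response.
background_threshold = 4.0
for activation in (1.0, 9.0):
    q = quantize_int8(activation, background_threshold)
    restored = dequantize_int8(q, background_threshold)
    print(f"{activation:4.1f} -> code {q:4d} -> {restored:.3f}")
```

With a threshold of 4.0, the 9.0 activation saturates at code 127 and comes back as 4.0, so everything above the threshold becomes indistinguishable after quantisation.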

What we tried

Approach	Calibration / training	mAP (rare classes)	Engineer-weeks	Notes
PTQ default	5000 random samples	79.2	0.5	Baseline pain
PTQ rebalanced	2000 samples, defects oversampled 10x	84.6	1.0	Better, still gaps
PTQ percentile	99.99 percentile + rebalanced	86.1	1.5	Marginal gain
QAT (10 epochs)	Real training loop, fake-quant ops	89.4	2.5	The keeper
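
The rebalanced PTQ rows come down to changing the sampling weights when drawing the calibration set. Below is a sketch of one way to do that, assuming a label is available per sample and reading "oversampled 10x" as a 10x sampling weight on every non-background sample, which is one plausible interpretation rather than the exact scheme. `draw_calibration_set`, the label names, and the synthetic pool are all hypothetical.

```python
import random
from collections import Counter


def draw_calibration_set(samples, n, defect_weight=10.0, seed=0):
    """Draw n (sample_id, label) pairs, weighting non-background samples higher.

    Sampling is with replacement, so a rare defect image can appear more than once.
    """
    rng = random.Random(seed)
    weights = [1.0 if label == "background" else defect_weight
               for _, label in samples]
    return rng.choices(samples, weights=weights, k=n)


# Synthetic pool mirroring the 92/3/2/2/1 class split; defect names are placeholders.
pool = (
    [(f"bg_{i}", "background") for i in range(9200)]
    + [(f"a_{i}", "defect_a") for i in range(300)]
    + [(f"b_{i}", "defect_b") for i in range(200)]
    + [(f"c_{i}", "defect_c") for i in range(200)]
    + [(f"d_{i}", "defect_d") for i in range(100)]
)

calibration = draw_calibration_set(pool, n=2000)
counts = Counter(label for _, label in calibration)
defect_share = 1.0 - counts["background"] / len(calibration)
print(counts, f"defect share: {defect_share:.2f}")
```

With these weights the expected defect share of the draw is 8000 / 17200, about 46%, against 8% under proportional sampling. Whether that ratio is the right one for a given model is an empirical question.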

The QAT path used torch.ao.quantization with custom fake-quant observers per layer, exported through ONNX to TensorRT. We had to write a small shim because the default ONNX → TRT path stripped some of our QDQ nodes silently in TensorRT 10.4. The fix was forcing explicit precision on the affected layers.

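# Snippet from our QAT training step
import torch
from torch.ao.quantization import (
    FakeQuantize,
    MovingAverageMinMaxObserver,
    MovingAveragePerChannelMinMaxObserver,
    QConfig,
    prepare_qat,
)

per_channel_qconfig = QConfig(
    # Activations: one symmetric INT8 scale per tensor.
    activation=FakeQuantize.with_args(
        observer=MovingAverageMinMaxObserver,
        quant_min=-128, quant_max=127,
        dtype=torch.qint8,
        qscheme=torch.per_tensor_symmetric,
    ),
    # Weights: one symmetric INT8 scale per output channel. This needs the
    # per-channel observer; MovingAverageMinMaxObserver only accepts per-tensor qschemes.
    weight=FakeQuantize.with_args(
        observer=MovingAveragePerChannelMinMaxObserver,
        quant_min=-128, quant_max=127,
        dtype=torch.qint8,
        qscheme=torch.per_channel_symmetric,
        ch_axis=0,
    ),
)
model.qconfig = per_channel_qconfig
model.train()  # prepare_qat expects the model in training mode
prepare_qat(model, inplace=True)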
The interesting bit was per-channel weight quantisation. Per-tensor lost us another 1.5 mAP on rare classes. Per-channel costs roughly nothing at inference on the Orin Nano's GPU (the Nano has no DLA, so INT8 runs on the Tensor Cores). You almost always want per-channel for vision models with significant filter diversity.
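
Why filter diversity matters for per-channel weight quantisation can be shown with a few lines of arithmetic. This is a toy sketch with two made-up filters, not weights from the actual model; `quantize_dequantize` and `mean_abs_error` are hypothetical helpers.

```python
def quantize_dequantize(values, scale):
    """Symmetric INT8 round trip for a list of floats with a given scale."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two made-up filters with very different weight ranges.
filters = [
    [0.010, -0.004, 0.007, -0.009],   # small-magnitude filter
    [1.000, -0.500, 0.250, -0.750],   # large-magnitude filter
]

# Per-tensor: one scale from the largest absolute weight across all filters.
tensor_scale = max(abs(w) for f in filters for w in f) / 127.0
per_tensor_err = [mean_abs_error(f, quantize_dequantize(f, tensor_scale))
                  for f in filters]

# Per-channel: each filter gets its own scale.
per_channel_err = []
for f in filters:
    channel_scale = max(abs(w) for w in f) / 127.0
    per_channel_err.append(mean_abs_error(f, quantize_dequantize(f, channel_scale)))

print("per-tensor :", per_tensor_err)
print("per-channel:", per_channel_err)
```

Under a shared scale, the small filter's weights collapse onto the codes -1 and 1, while its own scale lets it use the full INT8 range. The large filter sets the shared scale, so its error is the same either way.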

What QAT does not fix

QAT is not free and it's not magic. A few things bit us.

Turning on fake-quant took the training cycle from 2 hours to 8. We clawed some of that back by enabling fake-quant only from epoch 4 onwards and training in FP16 before that, an approach usually described as warm-start QAT. That recovered most of the wall-clock cost.
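
A delayed fake-quant schedule like this can be sketched with the toggles that `torch.ao.quantization` already provides. This is a minimal illustration and not the actual training loop: the toy model, the default `fbgemm` QAT qconfig, the 1-indexed epochs, and the helper name `set_fake_quant_for_epoch` are all assumptions.

```python
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat
from torch.ao.quantization.fake_quantize import (
    FakeQuantizeBase,
    disable_fake_quant,
    enable_fake_quant,
)


def set_fake_quant_for_epoch(model: nn.Module, epoch: int, start_epoch: int = 4) -> bool:
    """Enable fake-quant from start_epoch onwards, keep it off before that.

    Observers are left untouched, so they keep collecting ranges during warm-up.
    """
    active = epoch >= start_epoch
    model.apply(enable_fake_quant if active else disable_fake_quant)
    return active


# Tiny stand-in model; the real network and training loop are not shown.
toy = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.ReLU())
toy.qconfig = get_default_qat_qconfig("fbgemm")
toy.train()  # prepare_qat expects training mode
prepare_qat(toy, inplace=True)

fake_quants = [m for m in toy.modules() if isinstance(m, FakeQuantizeBase)]
for epoch in range(1, 11):
    active = set_fake_quant_for_epoch(toy, epoch)
    # ... run one training epoch here ...
    print(epoch, active, [int(m.fake_quant_enabled.item()) for m in fake_quants])
```

Disabling fake-quant alone skips the rounding in the forward pass but does not remove the inserted modules, so how much wall-clock time this saves depends on the model and is worth measuring rather than assuming.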

Operator coverage in TensorRT for QDQ nodes is decent but not complete. Our custom group-norm replacement broke quantisation entirely and we had to fall back to batch-norm for the deployment branch. Annoying. Worth checking before you commit to an architecture.

The eval pipeline itself needed work too. We use an LLM-driven workflow to triage failure modes from our weekly inspection batches (sorting false positives by visual similarity, basically), routed through Bifrost as our internal gateway so the rest of the org can share quota. It surfaced a whole class of failures (specular highlights mistaken for cracks) that we were not tracking before.

Trade-offs and limitations

PTQ is still the right call for prototyping. If you're not sure your architecture is final, do not pay the QAT tax. Iterate with PTQ until the design freezes.
QAT amplifies dataset noise. Mislabeled samples hurt more under QAT than under FP training, because the fake-quant adds noise on top. We re-curated about 800 ambiguous labels before our final run.
The mAP gap shrinks with mixed precision. We tried mixed INT8/FP16 per layer and recovered 1 mAP for a 1.4ms latency hit. For our 14ms budget, full INT8 won. Yours may differ.
Different hardware, different story. The Orin's INT8 throughput is excellent. On a Coral TPU or a Cortex-M7, the analysis changes completely.