Post-training quantization destroyed my ResNet-50 deployment last year — not because INT8 is broken, but because I reached for it in exactly the wrong situation. A 3.1% accuracy drop on a medical imaging classifier isn't a rounding error; it's a project cancellation. The question isn't whether to quantize. It's which quantization path to take, and that depends on factors most tutorials skip entirely.
When PTQ Wins (and When It Quietly Loses)
PTQ is the obvious first move. Load your trained FP32 model, run a calibration dataset through it, collect activation statistics, and emit an INT8 model in under an hour. With PyTorch 2.x, the happy path looks like this:
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# torch 2.2.0 — using FX graph mode (the modern approach)
model = load_your_model()  # FP32
model.eval()

example_input = torch.randn(1, 3, 224, 224)
qconfig_mapping = get_default_qconfig_mapping("x86")  # or "qnnpack" for ARM
# prepare_fx expects example inputs as a tuple of positional args
prepared_model = prepare_fx(model, qconfig_mapping, (example_input,))

# Calibration — run ~500-1000 samples, NOT your full training set
with torch.no_grad():
    for images, _ in calibration_loader:  # batch_size=32, ~500 samples
        prepared_model(images)

quantized_model = convert_fx(prepared_model)
print(quantized_model)  # observe QuantizedLinear, QuantizedConv2d nodes
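Before shipping the converted model, measure the accuracy gap yourself instead of trusting the conversion step. A minimal sketch of that check, assuming a held-out validation DataLoader (here a hypothetical `val_loader`, not part of the snippet above) yielding `(images, labels)` batches:

```python
import torch

def top1_accuracy(model, loader):
    """Fraction of samples whose argmax prediction matches the label."""
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Compare FP32 vs INT8 on the SAME validation split:
# fp32_acc = top1_accuracy(model, val_loader)
# int8_acc = top1_accuracy(quantized_model, val_loader)
# print(f"FP32 {fp32_acc:.4f}  INT8 {int8_acc:.4f}  drop {fp32_acc - int8_acc:.4f}")
```

A drop under ~1% on your real validation set is the usual green light for PTQ; anything like the 3.1% regression described above is the signal to move to quantization-aware training instead.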