Post-training quantization destroyed my ResNet-50 deployment last year — not because INT8 is broken, but because I reached for it in exactly the wrong situation. A 3.1% accuracy drop on a medical imaging classifier isn't a rounding error; it's a project cancellation. The question isn't whether to quantize. It's which quantization path to take, and that depends on factors most tutorials skip entirely.
When PTQ Wins (and When It Quietly Loses)
PTQ is the obvious first move. Load your trained FP32 model, run a calibration dataset through it, collect activation statistics, and emit an INT8 model in under an hour. With PyTorch 2.x, the happy path looks like this:
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# torch 2.2.0 — using FX graph mode (the modern approach)
model = load_your_model()  # FP32
model.eval()

example_input = torch.randn(1, 3, 224, 224)
qconfig_mapping = get_default_qconfig_mapping("x86")  # or "qnnpack" for ARM
# prepare_fx expects example inputs as a tuple of positional args
prepared_model = prepare_fx(model, qconfig_mapping, (example_input,))

# Calibration — run ~500-1000 samples, NOT your full training set
with torch.no_grad():
    for images, _ in calibration_loader:  # batch_size=32, ~500 samples
        prepared_model(images)

quantized_model = convert_fx(prepared_model)
print(quantized_model)  # observe QuantizedLinear, QuantizedConv2d nodes
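Before shipping the converted model, measure the accuracy gap yourself instead of trusting the conversion step. A minimal sketch of that check, assuming a held-out validation DataLoader (here a hypothetical `val_loader`, not part of the snippet above) yielding `(images, labels)` batches:

```python
import torch

def top1_accuracy(model, loader):
    """Fraction of samples whose argmax prediction matches the label."""
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Compare FP32 vs INT8 on the SAME validation split:
# fp32_acc = top1_accuracy(model, val_loader)
# int8_acc = top1_accuracy(quantized_model, val_loader)
# print(f"FP32 {fp32_acc:.4f}  INT8 {int8_acc:.4f}  drop {fp32_acc - int8_acc:.4f}")
```

A drop under ~1% on your real validation set is the usual green light for PTQ; anything like the 3.1% regression described above is the signal to move to quantization-aware training instead.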