The Problem Nobody Warns You About
You've just spent two days pruning your YOLOv8 model to 60% sparsity, the accuracy is still solid at 0.89 mAP, and you're ready to deploy it as INT8 ONNX for edge inference. You run the quantization script, export to ONNX, fire up ONNX Runtime, and...
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL :
Node (QLinearConv) has input with type float32 but expected int8
The unpruned model quantizes fine. The pruned FP32 model runs fine. But the moment you combine pruning + INT8 quantization, ONNX Runtime chokes.
This isn't a weird edge case. Structural pruning changes layer shapes in ways that confuse quantization-aware training (QAT) and post-training quantization (PTQ) tools. The calibration step expects certain tensor dimensions, pruning violates those assumptions, and the resulting graph has mismatched dtypes that ONNX Runtime refuses to execute.
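To make the shape-assumption problem concrete, here is a minimal toy sketch in plain Python. It does not call ONNX Runtime or any quantization tool, and it is not one of the fixes from the full article. The names `ConvSpec` and `find_stale_quant_params`, the layer names, and the channel counts are all hypothetical, chosen only for illustration. The assumption is that calibration recorded one scale per output channel before pruning, and that structural pruning later removed filters from one layer.

```python
from dataclasses import dataclass


@dataclass
class ConvSpec:
    """Hypothetical description of a conv layer: its name and current filter count."""
    name: str
    out_channels: int


def find_stale_quant_params(layers, per_channel_scales):
    """Return names of layers whose recorded per-channel scales no longer
    match the layer's current number of output channels."""
    stale = []
    for layer in layers:
        scales = per_channel_scales.get(layer.name)
        # Missing scales, or a count that differs from the channel count,
        # means the calibration data predates the current layer shape.
        if scales is None or len(scales) != layer.out_channels:
            stale.append(layer.name)
    return stale


# Hypothetical calibration results recorded BEFORE pruning:
# one scale per output channel.
scales = {"conv1": [0.02] * 64, "conv2": [0.01] * 128}

# The same layers AFTER structural pruning removed filters from conv2.
pruned = [ConvSpec("conv1", 64), ConvSpec("conv2", 51)]

print(find_stale_quant_params(pruned, scales))  # ['conv2']
```

A check like this, run against real layer metadata before export, would at least tell you which layers need recalibrating; how the mismatch actually surfaces depends on the quantization tool you use.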
Here are three fixes that actually work, with code you can run today.
Continue reading the full article on TildAlice
