DEV Community

TildAlice

Posted on • Originally published at tildalice.io

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

The Problem Nobody Warns You About

You've just spent two days pruning your YOLOv8 model down to 60% sparsity, the accuracy is still solid at 0.89 mAP, and you're ready to deploy it as INT8 ONNX for edge inference. You run the quantization script, export to ONNX, fire up ONNX Runtime, and...

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : 
Node (QLinearConv) has input with type float32 but expected int8

The unpruned model quantizes fine. The pruned FP32 model runs fine. But the moment you combine pruning + INT8 quantization, ONNX Runtime chokes.

This isn't a weird edge case. Structural pruning removes whole channels and filters, so layer shapes change in ways that quantization-aware training (QAT) and post-training quantization (PTQ) tools don't anticipate. The calibration step expects certain tensor dimensions, pruning violates those assumptions, and the resulting graph has mismatched dtypes that ONNX Runtime refuses to execute.
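To make the dtype mismatch concrete, here is a minimal sketch of the kind of check that would flag the failing node before ONNX Runtime does. It assumes a simplified, hypothetical graph representation (plain dicts and made-up tensor names), not the real `onnx` API; the only fact taken from the ONNX operator spec is that `QLinearConv` expects its `x` (input 0) and `w` (input 3) tensors to be int8 or uint8.

```python
# Hypothetical, simplified stand-in for an ONNX graph. This is NOT the onnx API:
# nodes are plain dicts and tensor dtypes are plain strings.
QUANTIZED_DTYPES = {"int8", "uint8"}

# QLinearConv input order per the ONNX operator spec:
# (x, x_scale, x_zero_point, w, w_scale, w_zero_point, y_scale, y_zero_point[, B])
# Only x (index 0) and w (index 3) are checked here.
QLINEARCONV_QUANTIZED_INPUTS = {0: "x", 3: "w"}


def find_dtype_mismatches(nodes, tensor_dtypes):
    """Return (node, role, tensor, dtype) for every QLinearConv input
    that should be int8/uint8 but is not."""
    mismatches = []
    for node in nodes:
        if node["op_type"] != "QLinearConv":
            continue
        for index, role in QLINEARCONV_QUANTIZED_INPUTS.items():
            tensor_name = node["inputs"][index]
            dtype = tensor_dtypes.get(tensor_name, "unknown")
            if dtype not in QUANTIZED_DTYPES:
                mismatches.append((node["name"], role, tensor_name, dtype))
    return mismatches


# Toy graph (all names are illustrative): conv_b is fed a float32 activation,
# which is the situation the error message above describes.
nodes = [
    {"name": "conv_a", "op_type": "QLinearConv",
     "inputs": ["x_q", "x_s", "x_zp", "w_a", "w_a_s", "w_a_zp", "y_a_s", "y_a_zp"]},
    {"name": "conv_b", "op_type": "QLinearConv",
     "inputs": ["concat_out", "c_s", "c_zp", "w_b", "w_b_s", "w_b_zp", "y_b_s", "y_b_zp"]},
]
tensor_dtypes = {
    "x_q": "uint8", "w_a": "int8",
    "concat_out": "float32",  # the missing quantize step
    "w_b": "int8",
}

mismatches = find_dtype_mismatches(nodes, tensor_dtypes)
for name, role, tensor, dtype in mismatches:
    print(f"{name}: input '{role}' ({tensor}) is {dtype}, expected int8/uint8")
```

On a real model you would pull the node list and tensor element types from the exported ONNX file instead of hand-written dicts, but the check itself stays the same: walk the `QLinearConv` nodes and confirm the tensors feeding `x` and `w` were actually quantized.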

Here are three fixes that actually work, with code you can run today.


Continue reading the full article on TildAlice
