
TildAlice

Originally published at tildalice.io

INT8 vs INT4 Quantization: 2x Latency Drop on ARM Cortex-M

INT4 Cuts Inference Time in Half — But There's a Catch

Running a quantized model on a Cortex-M7 at 216MHz, I measured 42ms for INT8 inference and 21ms for INT4 on the same 128×128 MobileNetV2 backbone. That's a clean 2× speedup with virtually no code changes. But here's what the benchmarks don't tell you: INT4 eats an extra 18KB of flash for lookup tables, fails catastrophically on models with batch normalization folded incorrectly, and gives you maybe 1.2% accuracy drop if you're lucky — often closer to 4-6% on edge cases.
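The lookup-table cost deserves a closer look. With only 16 possible codes per 4-bit weight, a kernel can precompute the dequantized value for every code once per channel instead of multiplying at every MAC, and those per-channel tables live in flash. A minimal sketch of that trade-off, assuming a per-channel affine scheme (the names `ch_lut_t` and `build_lut` and the fixed-point layout are illustrative, not any particular library's API):

```c
#include <stdint.h>

// Per-channel dequant LUT: 16 entries mapping each 4-bit code to a
// rescaled fixed-point value. At 16 entries x 2 bytes per channel,
// a few hundred channels across the network is where the extra
// flash for INT4 goes in a scheme like this.
typedef struct {
    int16_t lut[16];
} ch_lut_t;

// Build the table for one channel:
//   code -> (signed(code) - zero_point) * scale_q
// where scale_q is the channel's scale in fixed point.
static void build_lut(ch_lut_t *t, int zero_point, int scale_q) {
    for (int code = 0; code < 16; code++) {
        int v = (code ^ 0x8) - 0x8;  // sign-extend the 4-bit code to -8..7
        t->lut[code] = (int16_t)((v - zero_point) * scale_q);
    }
}
```

The inner loop of a conv kernel then becomes a table index instead of a multiply, which is exactly why the table has to exist for every channel that has its own scale.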

Most quantization guides stop at "fewer bits = faster." True, but incomplete. On ARM Cortex-M, the performance gap comes from SIMD packing (you can fit eight INT4 weights in a 32-bit register vs. four INT8 weights) and reduced memory bandwidth. The M7's AHB bus runs at 216MHz, but actual SRAM access is often bottlenecked by cache misses. Smaller weights mean fewer cache evictions, which means fewer stalls. The math checks out until you hit the edge cases.
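The packing half of that argument is easy to see in plain C. A sketch of the nibble layout, assuming little-endian nibble order with weight i in bits [4i, 4i+3] (the functions `pack_int4` and `unpack_int4` are illustrative helpers, not CMSIS-NN's actual kernels):

```c
#include <stdint.h>

// Pack eight signed 4-bit weights (range -8..7) into one 32-bit word,
// weight i occupying bits [4*i, 4*i+3]. This is why one 32-bit fetch
// brings in 8 weights instead of the 4 you get with INT8.
static uint32_t pack_int4(const int8_t w[8]) {
    uint32_t packed = 0;
    for (int i = 0; i < 8; i++) {
        packed |= ((uint32_t)(w[i] & 0xF)) << (4 * i);
    }
    return packed;
}

// Recover weight i: extract the nibble and sign-extend it
// (two's complement: codes 0x8..0xF map back to -8..-1).
static int8_t unpack_int4(uint32_t packed, int i) {
    int8_t nib = (int8_t)((packed >> (4 * i)) & 0xF);
    return (int8_t)((nib ^ 0x8) - 0x8);
}
```

A real kernel would never unpack weights one at a time like this; it would keep them packed and feed nibble pairs into SMLAD-style multiply-accumulates. The point here is only the density: one 32-bit load, eight weights.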

Photo: an electronic music sequencer with buttons and dials, by Egor Komarov on Pexels.

Why Cortex-M Loves INT4 (When It Works)


Continue reading the full article on TildAlice
