TorchAO Just Beat ONNX Runtime on My M1 MacBook (And I Didn't Expect It)
I ran the same 8-bit quantized Llama 3.2 1B model through TorchAO and ONNX Runtime, expecting ONNX to dominate like it usually does for mobile inference. TorchAO finished 512-token generation in 4.2 seconds. ONNX Runtime took 6.8 seconds.
That's TorchAO running 38% faster on identical hardware with the same quantization scheme. Here's what actually happened when I tried to replicate the "ONNX is always faster" wisdom from half the blog posts out there.
The Setup Nobody Talks About: Why Quantization Method Matters More Than Framework
Most benchmarks compare frameworks but ignore that quantization calibration is where you win or lose. I used W8A8 (8-bit weights, 8-bit activations) on Llama 3.2 1B because it's small enough to profile thoroughly but large enough to show real inference patterns.
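For reference, here's roughly how the TorchAO side of this setup looks. Treat it as a sketch rather than my exact benchmark script: the `quantize_` and `int8_dynamic_activation_int8_weight` names match recent torchao releases but have moved between versions, so check your install. Note that this path quantizes activations dynamically at runtime; a static W8A8 recipe is where the calibration pass comes in.

```python
# Sketch: W8A8 (int8 weights, int8 dynamic activations) via TorchAO.
# Assumes a recent torchao release and a Hugging Face checkpoint
# (meta-llama/Llama-3.2-1B is a gated repo; requires HF access).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Swap eligible nn.Linear weights to int8 in place; activations are
# quantized dynamically per forward pass, so no calibration data is needed.
quantize_(model, int8_dynamic_activation_int8_weight())

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```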
Here's the quantization formula both frameworks implement:
$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$
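In code, that formula plus the clamp to the int8 range that implementations apply after rounding looks like this. It's a minimal sketch: the scale and zero-point derivation below is one common per-tensor min/max recipe, not necessarily what either framework defaults to (per-channel variants differ only in how `s` and `z` are shaped).

```python
# Minimal per-tensor asymmetric int8 quantization round-trip.
import torch

def quantize_int8(x: torch.Tensor):
    qmin, qmax = -128, 127
    s = (x.max() - x.min()) / (qmax - qmin)  # scale from observed range
    z = qmin - torch.round(x.min() / s)      # zero-point aligns x.min() to qmin
    x_q = torch.clamp(torch.round(x / s) + z, qmin, qmax).to(torch.int8)
    return x_q, s, z

def dequantize(x_q: torch.Tensor, s, z):
    return s * (x_q.float() - z)  # inverse map: x ≈ s * (x_q - z)

x = torch.randn(4)
x_q, s, z = quantize_int8(x)
print(x)
print(dequantize(x_q, s, z))  # round-trip error bounded by ~s/2 per element
```

The quantization error per element is at most half the scale `s`, which is why a tight calibration of the observed range matters: outliers inflate `s` and degrade every other value in the tensor.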