TorchAO Just Beat ONNX Runtime on My M1 MacBook (And I Didn't Expect It)
I ran the same 8-bit quantized Llama 3.2 1B model through TorchAO and ONNX Runtime, expecting ONNX to dominate like it usually does for mobile inference. TorchAO finished 512-token generation in 4.2 seconds. ONNX Runtime took 6.8 seconds.
That's TorchAO running 38% faster on identical hardware with the same quantization scheme. Here's what actually happened when I tried to replicate the "ONNX is always faster" wisdom from half the blog posts out there.
The Setup Nobody Talks About: Why Quantization Method Matters More Than Framework
Most benchmarks compare frameworks but ignore that quantization calibration is where you win or lose. I used W8A8 (8-bit weights, 8-bit activations) on Llama 3.2 1B because it's small enough to profile thoroughly but large enough to show real inference patterns.
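For reference, here's roughly how the TorchAO side of this setup looks. Treat it as a sketch rather than my exact benchmark script: the `quantize_` and `int8_dynamic_activation_int8_weight` names match recent torchao releases but have moved between versions, so check your install. Note that this path quantizes activations dynamically at runtime; a static W8A8 recipe is where the calibration pass comes in.

```python
# Sketch: W8A8 (int8 weights, int8 dynamic activations) via TorchAO.
# Assumes a recent torchao release and a Hugging Face checkpoint
# (meta-llama/Llama-3.2-1B is a gated repo; requires HF access).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Swap eligible nn.Linear weights to int8 in place; activations are
# quantized dynamically per forward pass, so no calibration data is needed.
quantize_(model, int8_dynamic_activation_int8_weight())

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```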
Here's the quantization formula both frameworks implement:
$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$
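In code, that formula plus the clamp to the int8 range that implementations apply after rounding looks like this. It's a minimal sketch: the scale and zero-point derivation below is one common per-tensor min/max recipe, not necessarily what either framework defaults to (per-channel variants differ only in how `s` and `z` are shaped).

```python
# Minimal per-tensor asymmetric int8 quantization round-trip.
import torch

def quantize_int8(x: torch.Tensor):
    qmin, qmax = -128, 127
    s = (x.max() - x.min()) / (qmax - qmin)  # scale from observed range
    z = qmin - torch.round(x.min() / s)      # zero-point aligns x.min() to qmin
    x_q = torch.clamp(torch.round(x / s) + z, qmin, qmax).to(torch.int8)
    return x_q, s, z

def dequantize(x_q: torch.Tensor, s, z):
    return s * (x_q.float() - z)  # inverse map: x ≈ s * (x_q - z)

x = torch.randn(4)
x_q, s, z = quantize_int8(x)
print(x)
print(dequantize(x_q, s, z))  # round-trip error bounded by ~s/2 per element
```

The quantization error per element is at most half the scale `s`, which is why a tight calibration of the observed range matters: outliers inflate `s` and degrade every other value in the tensor.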