DEV Community

Jeff Geiser

Q8_0 isn't slow because of swap

A complete quantization benchmark for Llama 3.1 8B on Apple M4 16GB — speed and perplexity

I’ve been building an account intelligence model: a fine-tuned system that pulls from Salesforce, Confluence, Slack, and some internal systems and condenses everything worth knowing about a customer account into structured JSON. It’s the kind of summary that normally takes a while to put together by hand and can’t be generated from within any one of those tools. I’m hoping to make that more efficient. It’s a very isolated use case, but an interesting one. I think highly specialized local models like this will likely dominate enterprise architectures. I think.

Part of that project is deciding which local models are viable in production. Specifically, which quantization level makes sense for the 7B distilled model I’m eventually releasing? I decided to use a basic Mac Mini as the reference architecture.

I built an automated benchmark harness, ran Llama 3.1 8B through all 11 quantization levels on an Apple M4 Mac Mini with 16GB unified memory, and measured three things per quant: token generation speed, perplexity on Wikitext-2, and swap behavior.

I expected a smooth quality/speed tradeoff curve. I got a cliff, a plateau, and a finding I had to re-run twice to believe, because I thought it was an error.

Q8_0 — the highest-quality quantization I tested — produced 0.13 tokens per second.

Not because it was swapping. The re-run shows swap_any: false. No swap at all. The model fits cleanly in 16GB unified memory. It just doesn't run fast.

The explanation: 8.5GB of 8-bit weights saturates the M4's unified memory bandwidth during inference. The GPU shows 80% utilization — not because it's computing, but because it's moving weights from memory to compute units. At 8-bit precision, the weight transfer bottleneck dominates everything else. There's no swap to blame. The constraint is the memory bus.

This is an architectural property of the hardware, not a configuration problem. You can't fix it by closing other apps or adding more swap. Q8_0 on a 16GB M4 is slow by design.
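The post reports a swap_any flag but does not describe how the harness computes it. A minimal sketch of one way to derive such a flag on macOS, assuming the harness samples `sysctl vm.swapusage` before and during a run; the function names and the 1 MB tolerance are hypothetical, not taken from the actual harness:

```python
import re
import subprocess


def parse_swap_used_mb(sysctl_output: str) -> float:
    """Extract the 'used' figure, in MB, from `sysctl vm.swapusage` output."""
    match = re.search(r"used\s*=\s*([\d.]+)([KMG])", sysctl_output)
    if match is None:
        raise ValueError(f"unrecognized swapusage output: {sysctl_output!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * {"K": 1 / 1024, "M": 1.0, "G": 1024.0}[unit]


def read_swap_used_mb() -> float:
    """Ask macOS for current swap usage (only works on macOS)."""
    result = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    )
    return parse_swap_used_mb(result.stdout)


def swap_any(samples_mb: list[float], baseline_mb: float,
             tolerance_mb: float = 1.0) -> bool:
    """True if any sample taken during the run exceeds the pre-run baseline.

    tolerance_mb is an assumed noise margin, not a value from the post.
    """
    return any(s > baseline_mb + tolerance_mb for s in samples_mb)
```

In use, the harness would call read_swap_used_mb() once before loading the model, poll it during generation, and pass the samples to swap_any(). Comparing against a baseline matters because swap left over from earlier workloads would otherwise be blamed on the quant under test.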

What swaps and what doesn’t

Five of eleven quants hit disk during the benchmark. The clean ones — no swap, usable speed:

IQ3_XS — 3.5 GB — 13.0 tok/s
Q3_K_M — 4.0 GB — 10.1 tok/s
Q4_K_M — 4.9 GB — 19.7 tok/s ← sweet spot
Q6_K — 6.6 GB — 16.3 tok/s ← quality ceiling

Everything above Q6_K either swaps or hits the bandwidth wall. The jump from Q6_K to Q8_0 (6.6 GB → 8.5 GB) doesn’t just add size — it crosses the threshold where the memory bus can’t keep up.
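The selection logic behind "sweet spot" and "quality ceiling" can be sketched as a filter over the results. This uses only the rows the post gives numbers for (the four clean quants plus Q8_0); the 5 tok/s usability floor is an assumption for illustration, and file size is used as a stand-in for quality:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantResult:
    name: str
    size_gb: float
    tok_per_s: float
    swapped: bool


# Figures quoted in the post. Swapping quants are omitted because
# the post does not list their individual numbers.
RESULTS = [
    QuantResult("IQ3_XS", 3.5, 13.0, False),
    QuantResult("Q3_K_M", 4.0, 10.1, False),
    QuantResult("Q4_K_M", 4.9, 19.7, False),
    QuantResult("Q6_K", 6.6, 16.3, False),
    QuantResult("Q8_0", 8.5, 0.13, False),  # no swap, but far too slow
]

MIN_USABLE_TOK_PER_S = 5.0  # assumed threshold, not from the post


def usable(results, min_speed=MIN_USABLE_TOK_PER_S):
    """Keep quants that neither swapped nor fell below the speed floor."""
    return [r for r in results if not r.swapped and r.tok_per_s >= min_speed]


clean = usable(RESULTS)
sweet_spot = max(clean, key=lambda r: r.tok_per_s)      # fastest clean quant
quality_ceiling = max(clean, key=lambda r: r.size_gb)   # largest clean quant

print(f"sweet spot: {sweet_spot.name}, quality ceiling: {quality_ceiling.name}")
```

Note that Q8_0 passes the swap check and is only excluded by the speed floor, which is why filtering on swap alone is not enough.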

One weird thing worth calling out: Q5_K_S hit 6.96 tok/s per watt, the best efficiency number in the entire set. But it’s swapping. Low GPU power (2.4W average) plus active swap means the work is happening on CPU and memory bandwidth, not GPU. That makes it a misleading number if you only look at efficiency without checking swap.

Now add perplexity

Speed tells you how fast the model runs. Perplexity tells you how much quality you’ve traded away to get there. Lower perplexity = better quality. Q8_0 is the baseline.

Q2_K — PPL 11.15 — +29% worse than Q8_0
Q3_K_M — PPL 9.21 — +6.4% worse
Q4_K_M — PPL 8.80 — +1.7% worse
Q5_K_M — PPL 8.73 — +0.9% worse
Q6_K — PPL 8.68 — +0.4% worse
Q8_0 — PPL 8.66 — baseline

The quality cliff is at Q3, not Q4. Q4_K_M to Q8_0 is essentially flat — a 1.7% perplexity difference you would not notice on any real task. Q3_K_M jumps to +6.4%, which is detectable. Q2_K at +29% is a last resort.
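The percentages above are relative perplexity increases over the Q8_0 baseline. A small sketch of that calculation from the rounded PPL values listed in the post; because the inputs are rounded to two decimals, the results land within a few tenths of a percentage point of the quoted figures rather than matching them exactly (the assumption here is that the post's percentages came from unrounded perplexities):

```python
# Perplexity values as listed in the post (rounded to two decimals).
PPL = {
    "Q2_K": 11.15,
    "Q3_K_M": 9.21,
    "Q4_K_M": 8.80,
    "Q5_K_M": 8.73,
    "Q6_K": 8.68,
    "Q8_0": 8.66,
}


def ppl_delta_pct(ppl: float, baseline: float) -> float:
    """Relative perplexity increase over the baseline, in percent."""
    return (ppl / baseline - 1.0) * 100.0


baseline = PPL["Q8_0"]
deltas = {name: ppl_delta_pct(value, baseline) for name, value in PPL.items()}

for name, delta in deltas.items():
    print(f"{name:8s} PPL {PPL[name]:5.2f}  {delta:+.1f}% vs Q8_0")
```

The step from Q4_K_M to Q3_K_M is several times larger than any step between Q4_K_M and Q8_0, which is the cliff described above.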

The practical decision range is Q4_K_M to Q6_K. The quants in that band that stayed out of swap deliver 98–100% of Q8_0 quality at usable speed.

Q4_K_M gives you 98.3% of Q8_0 quality at 152× the speed.
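For reference, the arithmetic behind that headline, using the numbers quoted earlier. Treating "quality retained" as 100 minus the perplexity penalty is a shorthand, not a formal metric:

```python
Q4_K_M_TOK_PER_S = 19.7
Q8_0_TOK_PER_S = 0.13
Q4_K_M_PPL_PENALTY_PCT = 1.7  # perplexity increase vs Q8_0, from the post

speedup = Q4_K_M_TOK_PER_S / Q8_0_TOK_PER_S
quality_retained_pct = 100.0 - Q4_K_M_PPL_PENALTY_PCT

print(f"{speedup:.0f}x the speed at {quality_retained_pct:.1f}% of baseline quality")
```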

The account intelligence model I'm building runs on two very different pieces of hardware depending on what it's doing.

The fine-tuning and eval work runs on a DGX Spark — Qwen2.5-32B in FP8, 64GB VRAM. That's not the model I'm releasing.

The distilled 7B model — the one anyone can run locally — needs to work on a Mac Mini, a single-GPU workstation, a team server. For that audience, shipping Q8_0 would be a mistake. Someone tries it on 16GB, gets 0.13 tok/s, and concludes the model is broken. That's a distribution failure I can prevent by choosing the right quantization before release.

Target: Q4_K_M GGUF. Fits in 16GB with room to spare. 19+ tok/s. 1.7% quality loss. No surprises at deployment.

What’s next

This benchmark covers one model on one hardware configuration. The next round is Qwen3.6 — a 27B dense model — on the same M4 setup. The questions are different: at 27B, the “safe” quant range on 16GB is much narrower. And the Qwen3 family has a thinking-mode variable (enable_thinking=false reliability) that adds a dimension Llama doesn’t have.
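A back-of-envelope sketch of why the range narrows at 27B. It scales the 8B file sizes from this post linearly by parameter count, which assumes nominal parameter counts (8B and 27B) and that each quant type keeps roughly the same bits per weight across models. Neither is exact, and the estimate ignores KV cache and OS overhead, so "under 16 GB" here is necessary rather than sufficient:

```python
SOURCE_PARAMS_B = 8.0    # nominal size of Llama 3.1 8B (assumption)
TARGET_PARAMS_B = 27.0   # nominal size of the 27B model (assumption)
RAM_GB = 16.0

# File sizes for the 8B model, as quoted in this post.
SIZES_8B_GB = {
    "IQ3_XS": 3.5,
    "Q3_K_M": 4.0,
    "Q4_K_M": 4.9,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def bits_per_weight(size_gb: float, params_b: float) -> float:
    """Effective bits per weight implied by file size and parameter count."""
    return size_gb * 8.0 / params_b


def scaled_size_gb(size_gb: float, from_params_b: float, to_params_b: float) -> float:
    """Linear estimate of file size at a different parameter count."""
    return size_gb * to_params_b / from_params_b


estimates = {
    name: scaled_size_gb(size, SOURCE_PARAMS_B, TARGET_PARAMS_B)
    for name, size in SIZES_8B_GB.items()
}

for name, est in estimates.items():
    bpw = bits_per_weight(SIZES_8B_GB[name], SOURCE_PARAMS_B)
    verdict = "under" if est < RAM_GB else "over"
    print(f"{name:8s} ~{bpw:.1f} bpw -> ~{est:.1f} GB at 27B ({verdict} {RAM_GB:.0f} GB)")
```

Under these assumptions, only the 3-bit quants come in below 16 GB at 27B, and Q4_K_M already lands above it before any runtime overhead is counted.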

I’ll also (maybe) run the same harness on the DGX Spark for a direct comparison: what does a Mac Mini get you vs. enterprise inference hardware on the same model family? That would be nice, but it’s a shared machine, so I need to be careful with the workloads. I might just spin up a VPS/BMC instead.

Next post: the synthetic data generator for the account intelligence model — what broke in the first smoke test (62 schema validation errors on run 1), how the retry loop works, and what we learned from building 8 gold examples by hand from real customer data.

Had Claude pump out some graphs:

[Figure: speed and perplexity graphs]
