
Christopher Maher


TurboQuant on a MacBook Pro: two findings the upstream discussion missed

Originally published at llmkube.com/blog/turboquant-m5-max-long-context. Cross-posted here for the dev.to audience.

An overnight bench on an M5 Max, two findings I haven't seen in the upstream community thread, and two PRs back to the LLMKube operator to make TurboQuant a first-class citizen of the InferenceService CRD.


TL;DR

A TurboQuant-enabled llama-server on Apple Silicon runs Qwen3.6-35B-A3B Q8 at up to 1M-token context on a 128 GB MacBook Pro M5 Max. Standard f16 KV cache OOMs at 256K. Two findings worth quoting:

  1. At 128K+ context, the 3-bit KV cache (turbo3) matches or beats the 8-bit cache (q8_0) on prompt processing. Smaller cache means less memory bandwidth pressure during attention, and the throughput gap that exists at short context flips by ~128K depth.
  2. turbo3 and turbo4 split by workload phase. Long-context prefill favors turbo3 (~27% faster than turbo4 at 256K). Long-context decode favors turbo4 (~11% faster than turbo3 at 256K). They are not interchangeable — different attention bottlenecks dominate during prefill and decode.

We built TheTom's feature/turboquant-kv-cache fork of llama.cpp for Metal, validated on M5 Max, and took two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.


Why KV cache, why now

If you're running coding agents locally — single-model or architect+editor combos — the binding constraint isn't model weights. It's KV cache.

Weights you can quantize once, store on disk, and forget. KV cache is generated per token of context at inference time, sized by the model's depth and head dimensions, and held in working memory the entire session. A 35B-class model with flash-attn on uses roughly 256 KB of fp16 KV per token. That sounds small until you do the multiplication:

| Context | fp16 KV |
|---------|---------|
| 32K     | ~8 GB   |
| 64K     | ~16 GB  |
| 128K    | ~32 GB  |
| 256K    | ~64 GB  |
| 512K    | ~128 GB |
| 1M      | ~256 GB |
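The table above is just multiplication. A tiny helper (a sketch, assuming the ~256 KB/token fp16 figure quoted above) reproduces it and scales to any bits-per-value:

```python
# KV cache sizing, assuming the ~256 KB/token fp16 figure quoted above
# for a 35B-class model with flash-attn on.
FP16_KV_BYTES_PER_TOKEN = 256 * 1024

def kv_gib(context_tokens: int, bits_per_value: float = 16.0) -> float:
    """KV cache footprint in GiB at a given context length and quantization."""
    scale = bits_per_value / 16.0
    return context_tokens * FP16_KV_BYTES_PER_TOKEN * scale / 2**30

for ctx in (32_768, 131_072, 262_144, 1_048_576):
    print(f"{ctx // 1024}K: fp16 {kv_gib(ctx):.0f} GiB, "
          f"turbo3 {kv_gib(ctx, 3.25):.1f} GiB")
```

Swap in 3.25 or 4.25 bits/value to get the turbo3/turbo4 columns of the compression table below.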

A 128 GB MacBook with flash-attn and mlock on can fit one 35B model at 128K with f16 KV, just barely. 256K doesn't fit. Co-resident two-model setups (architect + editor) don't fit at all past 64K.

Standard q8_0 quantization halves the KV footprint with sub-1% perplexity penalty. That gets you to 256K with a single model on the Mac.

TurboQuant (Google Research, ICLR 2026, arxiv:2504.19874) compresses further. Randomized Walsh-Hadamard transforms decorrelate KV blocks before scalar quantization, hitting ~3.25 bits per value (turbo3) or ~4.25 bits per value (turbo4) with attention-fidelity loss inside the noise floor of normal sampling variance.
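To make the idea concrete, here is a minimal pure-Python sketch of rotate-then-quantize. This is illustrative only, not the fork's Metal kernel: random sign flips plus an orthonormal Walsh-Hadamard transform decorrelate a block, then plain uniform scalar quantization runs in the rotated space. (The fractional 3.25/4.25 bits-per-value figures come from amortizing per-block metadata, which this sketch doesn't model.)

```python
# Illustrative sketch of the TurboQuant idea: randomized Walsh-Hadamard
# rotation, then plain scalar quantization in the rotated space.
import random

def fwht(vec):
    """In-place orthonormal fast Walsh-Hadamard transform; len must be a power of two."""
    n, h = len(vec), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = vec[j], vec[j + h]
                vec[j], vec[j + h] = x + y, x - y
        h *= 2
    scale = n ** -0.5            # orthonormal scaling makes FWHT self-inverse
    for i in range(n):
        vec[i] *= scale

def quantize_block(block, bits, seed=0):
    """Random sign flips + FWHT, then uniform scalar quantization."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in block]
    rotated = [s * v for s, v in zip(signs, block)]
    fwht(rotated)
    amax = max(abs(v) for v in rotated) or 1.0
    levels = 2 ** (bits - 1) - 1
    q = [round(v / amax * levels) for v in rotated]
    return q, amax, signs

def dequantize_block(q, amax, signs, bits):
    levels = 2 ** (bits - 1) - 1
    rotated = [v * amax / levels for v in q]
    fwht(rotated)                # self-inverse, undoes the rotation
    return [s * v for s, v in zip(signs, rotated)]
```

Because the rotation spreads energy evenly across the block, outliers stop dominating the per-block scale and a crude uniform quantizer gets much closer to the fidelity of a far more elaborate one.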

| Cache type | bits/value | Compression vs fp16 | KV at 256K |
|------------|------------|---------------------|------------|
| f16        | 16.0       | 1.0×                | ~64 GB     |
| q8_0       | 8.0        | 2.0×                | ~32 GB     |
| turbo4     | 4.25       | 3.8×                | ~17 GB     |
| turbo3     | 3.25       | 4.9×                | ~13 GB     |

Upstream discussion at ggml-org/llama.cpp#20969. Not yet in main, landing in forks per backend. TheTom's fork is the Metal-supporting variant.


The bench

llama-bench from TheTom's fork build, single Qwen3.6-35B-A3B Q8 model, sweep across cache types and KV-depths.

```bash
./build/bin/llama-bench \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 8192 -d 32768 -d 131072 -d 262144 -d 524288 -d 1048576 \
  -p 512 -n 128 -ngl 99 -fa 1 \
  --threads 6 --batch-size 2048 \
  -r 3 -o md
```

`-d N` pre-allocates N tokens of KV cache before measuring throughput. Numbers are the mean of 3 reps. The metal-agent was stopped during the run for a clean memory budget. The 1M cell on turbo3 alone took several hours of wall-clock; the full sweep ran overnight.


The numbers

Generation throughput (tok/s)

| Depth | f16  | q8_0 | turbo3 | turbo4 |
|-------|------|------|--------|--------|
| 0     | 89.4 | 87.4 | 79.5   | 79.7   |
| 8K    | 84.2 | 79.2 | 72.2   | 71.2   |
| 32K   | 72.6 | 67.8 | 61.5   | 61.8   |
| 64K   | 60.7 |      |        |        |
| 128K  | 44.4 | 40.7 | 36.0   | 37.7   |
| 256K  | OOM  | 26.6 | 22.9   | 25.5   |
| 512K  | OOM  | OOM  | 13.3   | 16.0   |
| 1M    | OOM  | OOM  | 6.51   | OOM    |

Prompt processing throughput (tok/s)

| Depth | f16  | q8_0 | turbo3  | turbo4 |
|-------|------|------|---------|--------|
| 0     | 2962 | 2948 | 2904    | 2854   |
| 8K    | 2098 | 1623 | 1653    | 1439   |
| 32K   | 1063 | 802  | 784     | 678    |
| 128K  | 321  | 245  | **253** | 206    |
| 256K  | OOM  | 124  | **128** | 101    |
| 512K  | OOM  | OOM  | 66      | 56     |
| 1M    | OOM  | OOM  | 30.1    | OOM    |

Bold marks the depths where turbo3 matches or beats q8_0.

The grid above is the complete, final run; total bench wall-clock was 8h 20m.


Finding 1: turbo3 beats q8_0 at long context

The framing in the upstream discussion is approximately "turbo3 trades a small (~10%) generation throughput hit for ~2.5× more KV memory headroom." That's true at short context. At long context, the trade flips.

At 128K depth, f16 wins prefill at 321 tok/s, but turbo3 at 253 tok/s edges out q8_0 at 245 tok/s. At 256K (where f16 OOMs), turbo3 at 128 tok/s beats q8_0 at 124 tok/s.

What's happening: at 35B-class model size with deep contexts, the GPU spends most of its time during attention reading KV cache from memory rather than computing on it. Smaller cache → less bandwidth pressure → throughput recovers, even though there's more dequantization work per access. The break-even is somewhere between 32K and 128K on M5 Max.

For coding-agent workloads where context grows monotonically across a session, this is the regime that matters. You're spending most of your tokens at 32K+ depth, not at depth 0.


Finding 2: turbo3 and turbo4 split by workload phase

The 25% extra bits per value in turbo4 (4.25 vs 3.25 bits) buys you something specific, and what it buys depends on the phase.

Prefill (prompt processing) at long context:

| Depth | turbo3 pp | turbo4 pp | turbo3 advantage |
|-------|-----------|-----------|------------------|
| 8K    | 1653      | 1439      | +15%             |
| 32K   | 784       | 678       | +16%             |
| 128K  | 253       | 206       | +23%             |
| 256K  | 128       | 101       | +27%             |
| 512K  | 66        | 56        | +18%             |

Smaller cache means less data to read per attention step; during prefill the GPU pulls huge contiguous batches through attention, and the bandwidth-bound regime favors turbo3 cleanly.

Decode (generation) at long context:

| Depth | turbo3 tg | turbo4 tg | turbo4 advantage |
|-------|-----------|-----------|------------------|
| 128K  | 36.0      | 37.7      | +5%              |
| 256K  | 22.9      | 25.5      | +11%             |
| 512K  | 13.3      | 16.0      | +20%             |

During decode, dequantization overhead per access matters more than total bytes read. turbo4's representation dequantizes more cheaply (4.25-bit values have simpler quantization geometry than 3.25-bit values), so it wins the per-token attention pass, and the gap widens with depth.

Practical implications by workload:

| Workload shape | Cache type | Why |
|----------------|------------|-----|
| Aider/OpenCode coding agents (deep context, lots of generated tokens) | turbo4 | Wins decode at depth |
| RAG-heavy / batch question answering (heavy prefill, short answers) | turbo3 | Wins prefill at depth |
| Pure context-window maximization (1M context) | turbo3 | Only type that fits at 1M |
| Short-context interactive (≤32K) | f16 if it fits, else q8_0 | Both turbos are ~10% slower |

This isn't a framing the upstream community discussion has surfaced clearly. Different bottleneck regimes for different phases, and the right cache type depends on which phase dominates your workload.
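One way to encode that decision is a small chooser. Names and thresholds here are illustrative, pulled from this post's numbers, not an LLMKube or llama.cpp API:

```python
# Pick a KV cache type from workload shape, encoding this post's results:
# short context favors f16/q8_0, only turbo3 fits at 1M, and between
# those the prefill/decode split decides turbo3 vs turbo4.
def pick_kv_cache(max_context: int, decode_heavy: bool, f16_fits: bool) -> str:
    if max_context <= 32_768:          # short context: both turbos are ~10% slower
        return "f16" if f16_fits else "q8_0"
    if max_context > 524_288:          # 1M-class context: only turbo3 fits
        return "turbo3"
    return "turbo4" if decode_heavy else "turbo3"

print(pick_kv_cache(1_048_576, decode_heavy=True, f16_fits=False))   # turbo3
print(pick_kv_cache(262_144, decode_heavy=True, f16_fits=False))     # turbo4
```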


What this enables on a MacBook

Three concrete capabilities:

  1. 256K context for two co-resident coding models. turbo3 KV at 256K (~13 GB) plus 37 GB of Qwen3.6 weights, alongside Devstral-Small-2-24B at the same context with a comparable footprint, totals ~88 GB, under the 100 GB practical budget.

  2. 1M context for batch / agentic workloads. turbo3 KV at 1M is ~52 GB. We measured 30 tok/s prefill, 6.5 tok/s decode at 1M on Qwen3.6-35B-A3B Q8. Slow — a 4K-token agent response at 1M context is ~10 minutes wall-clock — but it works. Overnight agentic batches that need the full context window are feasible. As far as we can tell, nobody else has demonstrated this on Apple Silicon yet.

  3. More headroom for non-attention buffers. Cutting KV by 5× makes batch buffers, prefix cache, and draft models for speculative decoding actually composable.
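The capability-1 arithmetic can be checked with a quick budget script. The Qwen figures are from this post; the Devstral numbers are the "comparable footprint" assumption, not measurements:

```python
# Co-residency budget check. Qwen figures are from this post; the
# Devstral entries are assumed ("comparable footprint"), not measured.
def fits(components: dict, budget_gb: float = 100.0):
    """Return (total GB, whether it fits under the practical budget)."""
    total = sum(components.values())
    return total, total <= budget_gb

setup = {
    "qwen3.6-35b-q8 weights": 37,
    "qwen3.6 turbo3 KV @256K": 13,
    "devstral-small-2-24b weights": 25,   # assumed
    "devstral turbo3 KV @256K": 13,       # assumed comparable
}
total, ok = fits(setup)
print(f"{total} GB, fits under 100 GB: {ok}")
```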


Caveats

  • TheTom's fork is research-grade. Pinned to commit 11a241d0d; rebases needed as upstream moves.
  • LLMKube's metal-runtime can't drive turbo3/turbo4 yet because of #349 and #350. PR #353 closes #350; #349 is next.
  • No perplexity numbers in this run. Throughput and memory ceilings only. The +1% perplexity penalty for turbo3 in the upstream discussion is on Qwen 3.5 — we'll re-run on Qwen 3.6 in a follow-up.
  • Single hardware sample. M5 Max only. Crossover point and prefill/decode split likely shift with memory bandwidth (614 GB/s on M5 Max) and GPU core count.

What we contributed back

  • LLMKube PR #351 (merged): cacheTypeCustomK/cacheTypeCustomV on InferenceServiceSpec. Closes #282.
  • LLMKube PR #353 (open): metal-agent respawns on ISVC spec drift; honors replicas: 0. Closes #350.
  • Issues filed: #349, #350.
  • Comment going to llama.cpp discussion #20969 with the M5 Max numbers and the prefill/decode split.
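With PR #351 merged, an InferenceService manifest could set the cache type directly. Only the cacheTypeCustomK/cacheTypeCustomV field names come from the PR; the apiVersion and surrounding fields in this sketch are assumptions, so check the LLMKube CRD reference before copying:

```yaml
# Hypothetical manifest: only cacheTypeCustomK/cacheTypeCustomV are
# from PR #351; apiVersion and the other fields are assumed.
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: qwen-turbo3
spec:
  model: Qwen3.6-35B-A3B-Q8_0.gguf
  cacheTypeCustomK: turbo3
  cacheTypeCustomV: turbo3
```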

How to try it yourself

```bash
# 1. Build TheTom's fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# 2. Run the bench (turbo3 and turbo4 separately to see the split)
./build/bin/llama-bench \
  -m /path/to/your/model.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 32768 -d 131072 -d 262144 \
  -p 512 -n 128 -ngl 99 -fa 1 -r 3 -o md
```

Memory ceiling depends on your unified-memory budget; sub-64 GB Macs probably can't reach 256K with a 35B-class model at any cache type. M3 Pro/Max territory is more realistic for 13B models at 128K with turbo3.
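For a ballpark on your own machine, a rough feasibility helper (assuming the ~256 KB/token fp16 KV figure from earlier and that roughly three quarters of unified memory is usable after the OS takes its share) estimates the largest context that fits:

```python
# Rough ceiling estimate: largest context whose turbo3 KV fits next to
# the model weights. Assumes ~256 KB/token fp16 KV and ~78% of unified
# memory usable; both are ballpark assumptions, not measured limits.
def max_context_tokens(ram_gb: float, weights_gb: float,
                       bits: float = 3.25, usable_frac: float = 0.78) -> int:
    free_bytes = (ram_gb * usable_frac - weights_gb) * 2**30
    per_token = 256 * 1024 * bits / 16
    return max(0, int(free_bytes / per_token))

# e.g. a 64 GB Mac with 37 GB of 35B Q8 weights:
print(f"~{max_context_tokens(64, 37) // 1024}K tokens of turbo3 KV")
```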

For NVIDIA: @spiritbuun's CUDA fork is the equivalent path.


Open invitation

If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same bench, we want your numbers. The crossover point and the prefill/decode split likely shift with memory bandwidth.

Drop results in llama.cpp discussion #20969 or open an issue on defilantech/llmkube.
