Originally published at llmkube.com/blog/turboquant-m5-max-long-context. Cross-posted here for the dev.to audience.
An 8-hour overnight bench on an M5 Max, two findings I haven't seen in the upstream community thread, and two PRs back to the LLMKube operator to make TurboQuant a first-class citizen of the InferenceService CRD.
TL;DR
A TurboQuant-enabled llama-server on Apple Silicon runs Qwen3.6-35B-A3B Q8 at up to 1M-token context on a 128 GB MacBook Pro M5 Max. Standard f16 KV cache OOMs at 256K. Two findings worth quoting:
- At 128K+ context, the 3-bit KV cache (`turbo3`) matches or beats the 8-bit cache (`q8_0`) on prompt processing. Smaller cache means less memory bandwidth pressure during attention, and the throughput gap that exists at short context flips by ~128K depth.
- `turbo3` and `turbo4` split by workload phase. Long-context prefill favors `turbo3` (~27% faster than `turbo4` at 256K). Long-context decode favors `turbo4` (~11% faster than `turbo3` at 256K). They are not interchangeable: different attention bottlenecks dominate during prefill and decode.
We built TheTom's `feature/turboquant-kv-cache` fork of llama.cpp with Metal support, validated it on the M5 Max, and sent two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.
Why KV cache, why now
If you're running coding agents locally — single-model or architect+editor combos — the binding constraint isn't model weights. It's KV cache.
Weights you can quantize once, store on disk, and forget. KV cache is generated per token of context at inference time, sized by the model's depth and head dimensions, and held in working memory the entire session. A 35B-class model with flash-attn on uses roughly 256 KB of fp16 KV per token. That sounds small until you do the multiplication:
| Context | fp16 KV |
|---|---|
| 32K | ~8 GB |
| 64K | ~16 GB |
| 128K | ~32 GB |
| 256K | ~64 GB |
| 512K | ~128 GB |
| 1M | ~256 GB |
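The multiplication behind that table is straightforward; here is a minimal sketch using the article's ~256 KB/token figure (the constant is taken from the text, not derived independently):

```python
# Back-of-envelope fp16 KV footprint for a 35B-class model with
# flash-attn on, using the article's ~256 KB/token figure.
FP16_KV_BYTES_PER_TOKEN = 256 * 1024

def kv_gib(context_tokens: int) -> float:
    """Total fp16 KV cache size in GiB at a given context depth."""
    return context_tokens * FP16_KV_BYTES_PER_TOKEN / 2**30

for label, depth in [("32K", 32_768), ("128K", 131_072), ("1M", 1_048_576)]:
    print(f"{label}: ~{kv_gib(depth):.0f} GiB")  # 8, 32, 256
```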
A 128 GB MacBook with flash-attn and mlock on can fit one 35B model at 128K with f16 KV, just barely. 256K doesn't fit. Co-resident two-model setups (architect + editor) don't fit at all past 64K.
Standard q8_0 quantization halves the KV footprint with sub-1% perplexity penalty. That gets you to 256K with a single model on the Mac.
TurboQuant (Google Research, ICLR 2026, arxiv:2504.19874) compresses further. Randomized Walsh-Hadamard transforms decorrelate KV blocks before scalar quantization, hitting ~3.25 bits per value (turbo3) or ~4.25 bits per value (turbo4) with attention-fidelity loss inside the noise floor of normal sampling variance.
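To make the rotate-then-quantize idea concrete, here is a toy NumPy sketch of a randomized Hadamard rotation followed by crude uniform scalar quantization. This is an illustration of the general technique only, not TurboQuant's actual scheme (which uses fractional bit rates and block layouts this sketch does not attempt):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction, n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_quantize(x: np.ndarray, bits: int, rng: np.random.Generator) -> np.ndarray:
    """Random sign flips + Hadamard rotation, then uniform scalar quantization.

    The rotation spreads outliers across the block, so a simple uniform
    quantizer loses far less than it would on the raw values.
    """
    n = len(x)  # must be a power of two for the Sylvester construction
    signs = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n)
    y = H @ (signs * x)                            # decorrelated block
    scale = np.abs(y).max() / (2 ** (bits - 1) - 1)
    q = np.round(y / scale)                        # integer codes, |q| <= 2^(bits-1)-1
    # Dequantize, then invert the (orthogonal) rotation and the sign flips
    return signs * (H.T @ (q * scale))
```

Because the rotation is orthogonal, quantization error in the rotated domain maps back to the original domain without amplification.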
| Cache type | bits/value | Compression vs fp16 | KV at 256K |
|---|---|---|---|
| f16 | 16.0 | 1.0× | ~64 GB |
| q8_0 | 8.0 | 2.0× | ~32 GB |
| turbo4 | 4.25 | 3.8× | ~17 GB |
| turbo3 | 3.25 | 4.9× | ~13 GB |
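The right-hand columns of that table follow directly from the bits-per-value figures; a quick sketch of the arithmetic:

```python
# Deriving the "KV at 256K" column: a scalar-quantized cache shrinks
# linearly with bits per value relative to fp16's 16 bits.
FP16_KV_AT_256K_GIB = 64.0  # fp16 row of the table above

def kv_at_256k(bits_per_value: float) -> float:
    return FP16_KV_AT_256K_GIB * bits_per_value / 16.0

for name, bits in [("q8_0", 8.0), ("turbo4", 4.25), ("turbo3", 3.25)]:
    print(f"{name}: {16 / bits:.1f}x vs fp16, ~{kv_at_256k(bits):.0f} GiB")
```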
Upstream discussion at ggml-org/llama.cpp#20969. Not yet in main, landing in forks per backend. TheTom's fork is the Metal-supporting variant.
The bench
llama-bench from TheTom's fork build, a single Qwen3.6-35B-A3B Q8 model, swept across cache types and KV depths.
```shell
./build/bin/llama-bench \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 8192 -d 32768 -d 131072 -d 262144 -d 524288 -d 1048576 \
  -p 512 -n 128 -ngl 99 -fa 1 \
  --threads 6 --batch-size 2048 \
  -r 3 -o md
```
`-d N` pre-allocates N tokens of KV cache before measuring throughput. Numbers are means of 3 reps. The metal-agent was stopped during the run to keep the memory budget clean. The 1M cell on turbo3 alone took several hours of wall-clock time; the full sweep ran overnight.
The numbers
Generation throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 89.4 | 87.4 | 79.5 | 79.7 |
| 8K | 84.2 | 79.2 | 72.2 | 71.2 |
| 32K | 72.6 | 67.8 | 61.5 | 61.8 |
| 64K | 60.7 | — | — | — |
| 128K | 44.4 | 40.7 | 36.0 | 37.7 |
| 256K | OOM | 26.6 | 22.9 | 25.5 |
| 512K | OOM | OOM | 13.3 | 16.0 |
| 1M | OOM | OOM | 6.51 | OOM |
Prompt processing throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 2962 | 2948 | 2904 | 2854 |
| 8K | 2098 | 1623 | 1653 | 1439 |
| 32K | 1063 | 802 | 784 | 678 |
| 128K | 321 | 245 | 253 ← turbo3 ≥ q8_0 | 206 |
| 256K | OOM | 124 | 128 ← turbo3 > q8_0 | 101 |
| 512K | OOM | OOM | 66 | 56 |
| 1M | OOM | OOM | 30.1 | OOM |
That's the complete grid. The bench ran 8h 20m wall-clock.
Finding 1: turbo3 beats q8_0 at long context
The framing in the upstream discussion is approximately "turbo3 trades a small (~10%) generation throughput hit for ~2.5× more KV memory headroom." That's true at short context. At long context, the trade flips.
At 128K depth, f16 wins prefill at 321 tok/s, but turbo3 at 253 tok/s edges out q8_0 at 245 tok/s. At 256K (where f16 OOMs), turbo3 at 128 tok/s beats q8_0 at 124 tok/s.
What's happening: at 35B-class model size with deep contexts, the GPU spends most of its time during attention reading KV cache from memory rather than computing on it. Smaller cache → less bandwidth pressure → throughput recovers, even though there's more dequantization work per access. The break-even is somewhere between 32K and 128K on M5 Max.
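A back-of-envelope model makes the crossover mechanism concrete: each prefill step pays a roughly fixed overhead (weights, FFN, extra (de)quant kernels) plus a KV-streaming cost that grows with depth. All constants below are illustrative assumptions except the 614 GB/s M5 Max bandwidth figure cited later in the article; this is a sketch of the mechanism, not a fit to the measured data:

```python
# Toy crossover model. Per prefill step:
#   time ≈ fixed overhead + KV bytes streamed / memory bandwidth
# turbo3 is assumed to pay a larger fixed overhead than q8_0 but streams
# ~2.5x fewer bytes per cached token, so it wins once depth is large
# enough. Fixed-overhead values are made up for illustration.
BANDWIDTH_BPS = 614e9               # M5 Max unified memory (from the article)
FP16_KV_BYTES_PER_TOKEN = 256 * 1024

def kv_stream_ms_per_token(bits: float) -> float:
    """Milliseconds to stream one cached token's KV at a given bit width."""
    return FP16_KV_BYTES_PER_TOKEN * (bits / 16) / BANDWIDTH_BPS * 1e3

def crossover_depth(fixed_q8_ms: float, fixed_t3_ms: float) -> float:
    """Depth where turbo3's smaller KV stream pays off its extra overhead."""
    return (fixed_t3_ms - fixed_q8_ms) / (
        kv_stream_ms_per_token(8.0) - kv_stream_ms_per_token(3.25)
    )

# With ~8 ms of assumed extra per-step overhead for turbo3, the modeled
# crossover lands between 32K and 128K, consistent with the measured flip.
print(f"crossover ≈ {crossover_depth(5.0, 13.0) / 1024:.0f}K tokens")
```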
For coding-agent workloads where context grows monotonically across a session, this is the regime that matters. You're spending most of your tokens at 32K+ depth, not at depth 0.
Finding 2: turbo3 and turbo4 split by workload phase
The extra bit per value in turbo4 (4.25 vs 3.25 bits, ~31% more) buys you something specific, and what it buys depends on the phase.
Prefill (prompt processing) at long context:
| Depth | turbo3 pp | turbo4 pp | turbo3 advantage |
|---|---|---|---|
| 8K | 1653 | 1439 | +15% |
| 32K | 784 | 678 | +16% |
| 128K | 253 | 206 | +23% |
| 256K | 128 | 101 | +27% |
| 512K | 66 | 56 | +18% |
Smaller cache means less data to read per attention step; during prefill the GPU pulls huge contiguous batches through attention, and the bandwidth-bound regime favors turbo3 cleanly.
Decode (generation) at long context:
| Depth | turbo3 tg | turbo4 tg | turbo4 advantage |
|---|---|---|---|
| 128K | 36.0 | 37.7 | +5% |
| 256K | 22.9 | 25.5 | +11% |
| 512K | 13.3 | 16.0 | +20% |
During decode the dequantization overhead per access matters more than total bytes read. turbo4's simpler representation (4.25 bits has less complex quantization geometry than 3.25 bits) wins at the per-token attention pass — and the gap widens with depth.
Practical implications by workload:

| Workload shape | Cache type | Why |
|---|---|---|
| Aider/OpenCode coding agents (deep context, lots of generated tokens) | turbo4 | Wins decode at depth |
| RAG-heavy / batch question answering (heavy prefill, short answers) | turbo3 | Wins prefill at depth |
| Pure context-window maximization (1M context) | turbo3 | Only it fits at 1M |
| Short-context interactive (≤32K) | f16 if it fits, else q8_0 | Both turbos are ~10% slower |
This isn't a framing the upstream community discussion has surfaced clearly. Different bottleneck regimes for different phases, and the right cache type depends on which phase dominates your workload.
What this enables on a MacBook
Three concrete capabilities:
256K context for two co-resident coding models. turbo3 KV at 256K (~13 GB) plus 37 GB Qwen3.6 weights, alongside Devstral-Small-2-24B at the same context with comparable footprint, totals ~88 GB. Under the 100 GB practical budget.
1M context for batch / agentic workloads. turbo3 KV at 1M is ~52 GB. We measured 30 tok/s prefill, 6.5 tok/s decode at 1M on Qwen3.6-35B-A3B Q8. Slow — a 4K-token agent response at 1M context is ~10 minutes wall-clock — but it works. Overnight agentic batches that need the full context window are feasible. As far as we can tell, nobody else has demonstrated this on Apple Silicon yet.
More headroom for non-attention buffers. Cutting KV by 5× makes batch buffers, prefix cache, and draft models for speculative decoding actually composable.
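The co-residency arithmetic from the first scenario can be sanity-checked in a few lines. The Qwen figures come from the article; the Devstral line is inferred from the stated ~88 GB total, not independently measured:

```python
# Sanity-checking the co-resident 256K two-model setup against the
# article's 100 GB practical budget. The Devstral figure is inferred
# from the stated ~88 GB total.
BUDGET_GIB = 100
resident_gib = {
    "Qwen3.6-35B-A3B Q8 weights": 37,
    "Qwen turbo3 KV @ 256K": 13,
    "Devstral-Small-2-24B weights + turbo3 KV @ 256K": 38,
}
total = sum(resident_gib.values())
print(f"total ~{total} GiB; fits in {BUDGET_GIB} GiB: {total <= BUDGET_GIB}")
```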
Caveats
- TheTom's fork is research-grade. Pinned to commit `11a241d0d`; rebases will be needed as upstream moves.
- LLMKube's metal-runtime can't drive turbo3/turbo4 yet because of #349 and #350. PR #353 closes #350; #349 is next.
- No perplexity numbers in this run. Throughput and memory ceilings only. The +1% perplexity penalty for turbo3 in the upstream discussion is on Qwen 3.5 — we'll re-run on Qwen 3.6 in a follow-up.
- Single hardware sample. M5 Max only. Crossover point and prefill/decode split likely shift with memory bandwidth (614 GB/s on M5 Max) and GPU core count.
What we contributed back
- LLMKube PR #351 (merged): `cacheTypeCustomK`/`cacheTypeCustomV` on `InferenceServiceSpec`. Closes #282.
- LLMKube PR #353 (open): metal-agent respawns on ISVC spec drift; honors `replicas: 0`. Closes #350.
- Issues filed: #349, #350.
- Comment going to llama.cpp discussion #20969 with the M5 Max numbers and the prefill/decode split.
How to try it yourself
```shell
# 1. Build TheTom's fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# 2. Run the bench (turbo3 and turbo4 separately to see the split)
./build/bin/llama-bench \
  -m /path/to/your/model.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 32768 -d 131072 -d 262144 \
  -p 512 -n 128 -ngl 99 -fa 1 -r 3 -o md
```
Memory ceiling depends on your unified-memory budget; sub-64 GB Macs probably can't reach 256K with a 35B-class model at any cache type. M3 Pro/Max territory is more realistic for 13B models at 128K with turbo3.
For NVIDIA: @spiritbuun's CUDA fork is the equivalent path.
Open invitation
If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same bench, we want your numbers. The crossover point and the prefill/decode split likely shift with memory bandwidth.
Drop results in llama.cpp discussion #20969 or open an issue on defilantech/llmkube.