Originally published at llmkube.com/blog/turboquant-m5-max-long-context. Cross-posted here for the dev.to audience.
An 8-hour overnight bench on an M5 Max, two findings I haven't seen in the upstream community thread, and two PRs back to the LLMKube operator to make TurboQuant a first-class citizen of the InferenceService CRD.
TL;DR
A TurboQuant-enabled llama-server on Apple Silicon runs Qwen3.6-35B-A3B Q8 at up to 1M-token context on a 128 GB MacBook Pro M5 Max. Standard f16 KV cache OOMs at 256K. Two findings worth quoting:
- At 128K+ context, the 3-bit KV cache (`turbo3`) matches or beats the 8-bit cache (`q8_0`) on prompt processing. Smaller cache means less memory bandwidth pressure during attention, and the throughput gap that exists at short context flips by ~128K depth.
- `turbo3` and `turbo4` split by workload phase. Long-context prefill favors `turbo3` (~27% faster than `turbo4` at 256K). Long-context decode favors `turbo4` (~11% faster than `turbo3` at 256K). They are not interchangeable: different attention bottlenecks dominate during prefill and decode.
We built TheTom's `feature/turboquant-kv-cache` fork of llama.cpp with Metal support, validated it on the M5 Max, and sent two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.
Why KV cache, why now
If you're running coding agents locally — single-model or architect+editor combos — the binding constraint isn't model weights. It's KV cache.
Weights you can quantize once, store on disk, and forget. KV cache is generated per token of context at inference time, sized by the model's depth and head dimensions, and held in working memory the entire session. A 35B-class model with flash-attn on uses roughly 256 KB of fp16 KV per token. That sounds small until you do the multiplication:
| Context | fp16 KV |
|---|---|
| 32K | ~8 GB |
| 64K | ~16 GB |
| 128K | ~32 GB |
| 256K | ~64 GB |
| 512K | ~128 GB |
| 1M | ~256 GB |
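The multiplication behind that table is straightforward; here is a minimal sketch using the article's ~256 KB/token figure (the constant is taken from the text, not derived independently):

```python
# Back-of-envelope fp16 KV footprint for a 35B-class model with
# flash-attn on, using the article's ~256 KB/token figure.
FP16_KV_BYTES_PER_TOKEN = 256 * 1024

def kv_gib(context_tokens: int) -> float:
    """Total fp16 KV cache size in GiB at a given context depth."""
    return context_tokens * FP16_KV_BYTES_PER_TOKEN / 2**30

for label, depth in [("32K", 32_768), ("128K", 131_072), ("1M", 1_048_576)]:
    print(f"{label}: ~{kv_gib(depth):.0f} GiB")  # 8, 32, 256
```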
A 128 GB MacBook with flash-attn and mlock on can fit one 35B model at 128K with f16 KV, just barely. 256K doesn't fit. Co-resident two-model setups (architect + editor) don't fit at all past 64K.
Standard q8_0 quantization halves the KV footprint with sub-1% perplexity penalty. That gets you to 256K with a single model on the Mac.
TurboQuant (Google Research, ICLR 2026, arxiv:2504.19874) compresses further. Randomized Walsh-Hadamard transforms decorrelate KV blocks before scalar quantization, hitting ~3.25 bits per value (turbo3) or ~4.25 bits per value (turbo4) with attention-fidelity loss inside the noise floor of normal sampling variance.
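To make the rotate-then-quantize idea concrete, here is a toy NumPy sketch of a randomized Hadamard rotation followed by crude uniform scalar quantization. This is an illustration of the general technique only, not TurboQuant's actual scheme (which uses fractional bit rates and block layouts this sketch does not attempt):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction, n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_quantize(x: np.ndarray, bits: int, rng: np.random.Generator) -> np.ndarray:
    """Random sign flips + Hadamard rotation, then uniform scalar quantization.

    The rotation spreads outliers across the block, so a simple uniform
    quantizer loses far less than it would on the raw values.
    """
    n = len(x)  # must be a power of two for the Sylvester construction
    signs = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n)
    y = H @ (signs * x)                            # decorrelated block
    scale = np.abs(y).max() / (2 ** (bits - 1) - 1)
    q = np.round(y / scale)                        # integer codes, |q| <= 2^(bits-1)-1
    # Dequantize, then invert the (orthogonal) rotation and the sign flips
    return signs * (H.T @ (q * scale))
```

Because the rotation is orthogonal, quantization error in the rotated domain maps back to the original domain without amplification.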
| Cache type | bits/value | Compression vs fp16 | KV at 256K |
|---|---|---|---|
| f16 | 16.0 | 1.0× | ~64 GB |
| q8_0 | 8.0 | 2.0× | ~32 GB |
| turbo4 | 4.25 | 3.8× | ~17 GB |
| turbo3 | 3.25 | 4.9× | ~13 GB |
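The right-hand columns of that table follow directly from the bits-per-value figures; a quick sketch of the arithmetic:

```python
# Deriving the "KV at 256K" column: a scalar-quantized cache shrinks
# linearly with bits per value relative to fp16's 16 bits.
FP16_KV_AT_256K_GIB = 64.0  # fp16 row of the table above

def kv_at_256k(bits_per_value: float) -> float:
    return FP16_KV_AT_256K_GIB * bits_per_value / 16.0

for name, bits in [("q8_0", 8.0), ("turbo4", 4.25), ("turbo3", 3.25)]:
    print(f"{name}: {16 / bits:.1f}x vs fp16, ~{kv_at_256k(bits):.0f} GiB")
```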
Upstream discussion at ggml-org/llama.cpp#20969. Not yet in main, landing in forks per backend. TheTom's fork is the Metal-supporting variant.
The bench
llama-bench from TheTom's fork build, a single Qwen3.6-35B-A3B Q8 model, swept across cache types and KV depths.
```shell
./build/bin/llama-bench \
  -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 8192 -d 32768 -d 131072 -d 262144 -d 524288 -d 1048576 \
  -p 512 -n 128 -ngl 99 -fa 1 \
  --threads 6 --batch-size 2048 \
  -r 3 -o md
```
`-d N` pre-allocates N tokens of KV cache before measuring throughput. Numbers are means of 3 reps. The metal-agent was stopped during the run to keep the memory budget clean. The 1M cell on turbo3 alone took several hours of wall-clock time; the full sweep ran overnight.
The numbers
Generation throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 89.4 | 87.4 | 79.5 | 79.7 |
| 8K | 84.2 | 79.2 | 72.2 | 71.2 |
| 32K | 72.6 | 67.8 | 61.5 | 61.8 |
| 64K | 60.7 | — | — | — |
| 128K | 44.4 | 40.7 | 36.0 | 37.7 |
| 256K | OOM | 26.6 | 22.9 | 25.5 |
| 512K | OOM | OOM | 13.3 | 16.0 |
| 1M | OOM | OOM | 6.51 | OOM |
Prompt processing throughput (tok/s)
| Depth | f16 | q8_0 | turbo3 | turbo4 |
|---|---|---|---|---|
| 0 | 2962 | 2948 | 2904 | 2854 |
| 8K | 2098 | 1623 | 1653 | 1439 |
| 32K | 1063 | 802 | 784 | 678 |
| 128K | 321 | 245 | 253 ← turbo3 ≥ q8_0 | 206 |
| 256K | OOM | 124 | 128 ← turbo3 > q8_0 | 101 |
| 512K | OOM | OOM | 66 | 56 |
| 1M | OOM | OOM | 30.1 | OOM |
That's the complete grid. The bench ran 8h 20m wall-clock.
Finding 1: turbo3 beats q8_0 at long context
The framing in the upstream discussion is approximately "turbo3 trades a small (~10%) generation throughput hit for ~2.5× more KV memory headroom." That's true at short context. At long context, the trade flips.
At 128K depth, f16 wins prefill at 321 tok/s, but turbo3 at 253 tok/s edges out q8_0 at 245 tok/s. At 256K (where f16 OOMs), turbo3 at 128 tok/s beats q8_0 at 124 tok/s.
What's happening: at 35B-class model size with deep contexts, the GPU spends most of its time during attention reading KV cache from memory rather than computing on it. Smaller cache → less bandwidth pressure → throughput recovers, even though there's more dequantization work per access. The break-even is somewhere between 32K and 128K on M5 Max.
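A back-of-envelope model makes the crossover mechanism concrete: each prefill step pays a roughly fixed overhead (weights, FFN, extra (de)quant kernels) plus a KV-streaming cost that grows with depth. All constants below are illustrative assumptions except the 614 GB/s M5 Max bandwidth figure cited later in the article; this is a sketch of the mechanism, not a fit to the measured data:

```python
# Toy crossover model. Per prefill step:
#   time ≈ fixed overhead + KV bytes streamed / memory bandwidth
# turbo3 is assumed to pay a larger fixed overhead than q8_0 but streams
# ~2.5x fewer bytes per cached token, so it wins once depth is large
# enough. Fixed-overhead values are made up for illustration.
BANDWIDTH_BPS = 614e9               # M5 Max unified memory (from the article)
FP16_KV_BYTES_PER_TOKEN = 256 * 1024

def kv_stream_ms_per_token(bits: float) -> float:
    """Milliseconds to stream one cached token's KV at a given bit width."""
    return FP16_KV_BYTES_PER_TOKEN * (bits / 16) / BANDWIDTH_BPS * 1e3

def crossover_depth(fixed_q8_ms: float, fixed_t3_ms: float) -> float:
    """Depth where turbo3's smaller KV stream pays off its extra overhead."""
    return (fixed_t3_ms - fixed_q8_ms) / (
        kv_stream_ms_per_token(8.0) - kv_stream_ms_per_token(3.25)
    )

# With ~8 ms of assumed extra per-step overhead for turbo3, the modeled
# crossover lands between 32K and 128K, consistent with the measured flip.
print(f"crossover ≈ {crossover_depth(5.0, 13.0) / 1024:.0f}K tokens")
```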
For coding-agent workloads where context grows monotonically across a session, this is the regime that matters. You're spending most of your tokens at 32K+ depth, not at depth 0.
Finding 2: turbo3 and turbo4 split by workload phase
The extra bit per value in turbo4 (4.25 vs 3.25 bits, ~31% more) buys you something specific, and what it buys depends on the phase.
Prefill (prompt processing) at long context:
| Depth | turbo3 pp | turbo4 pp | turbo3 advantage |
|---|---|---|---|
| 8K | 1653 | 1439 | +15% |
| 32K | 784 | 678 | +16% |
| 128K | 253 | 206 | +23% |
| 256K | 128 | 101 | +27% |
| 512K | 66 | 56 | +18% |
Smaller cache means less data to read per attention step; during prefill the GPU pulls huge contiguous batches through attention, and the bandwidth-bound regime favors turbo3 cleanly.
Decode (generation) at long context:
| Depth | turbo3 tg | turbo4 tg | turbo4 advantage |
|---|---|---|---|
| 128K | 36.0 | 37.7 | +5% |
| 256K | 22.9 | 25.5 | +11% |
| 512K | 13.3 | 16.0 | +20% |
During decode the dequantization overhead per access matters more than total bytes read. turbo4's simpler representation (4.25 bits has less complex quantization geometry than 3.25 bits) wins at the per-token attention pass — and the gap widens with depth.
Practical implications by workload:

| Workload shape | Cache type | Why |
|---|---|---|
| Aider/OpenCode coding agents (deep context, lots of generated tokens) | turbo4 | Wins decode at depth |
| RAG-heavy / batch question answering (heavy prefill, short answers) | turbo3 | Wins prefill at depth |
| Pure context-window maximization (1M context) | turbo3 | Only it fits at 1M |
| Short-context interactive (≤32K) | f16 if it fits, else q8_0 | Both turbos are ~10% slower |
This isn't a framing the upstream community discussion has surfaced clearly. Different bottleneck regimes for different phases, and the right cache type depends on which phase dominates your workload.
What this enables on a MacBook
Three concrete capabilities:
256K context for two co-resident coding models. turbo3 KV at 256K (~13 GB) plus 37 GB Qwen3.6 weights, alongside Devstral-Small-2-24B at the same context with comparable footprint, totals ~88 GB. Under the 100 GB practical budget.
1M context for batch / agentic workloads. turbo3 KV at 1M is ~52 GB. We measured 30 tok/s prefill, 6.5 tok/s decode at 1M on Qwen3.6-35B-A3B Q8. Slow — a 4K-token agent response at 1M context is ~10 minutes wall-clock — but it works. Overnight agentic batches that need the full context window are feasible. As far as we can tell, nobody else has demonstrated this on Apple Silicon yet.
More headroom for non-attention buffers. Cutting KV by 5× makes batch buffers, prefix cache, and draft models for speculative decoding actually composable.
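The co-residency arithmetic from the first scenario can be sanity-checked in a few lines. The Qwen figures come from the article; the Devstral line is inferred from the stated ~88 GB total, not independently measured:

```python
# Sanity-checking the co-resident 256K two-model setup against the
# article's 100 GB practical budget. The Devstral figure is inferred
# from the stated ~88 GB total.
BUDGET_GIB = 100
resident_gib = {
    "Qwen3.6-35B-A3B Q8 weights": 37,
    "Qwen turbo3 KV @ 256K": 13,
    "Devstral-Small-2-24B weights + turbo3 KV @ 256K": 38,
}
total = sum(resident_gib.values())
print(f"total ~{total} GiB; fits in {BUDGET_GIB} GiB: {total <= BUDGET_GIB}")
```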
Caveats
- TheTom's fork is research-grade. Pinned to commit `11a241d0d`; rebases will be needed as upstream moves.
- LLMKube's metal-runtime can't drive turbo3/turbo4 yet because of #349 and #350. PR #353 closes #350; #349 is next.
- No perplexity numbers in this run. Throughput and memory ceilings only. The +1% perplexity penalty for turbo3 in the upstream discussion is on Qwen 3.5 — we'll re-run on Qwen 3.6 in a follow-up.
- Single hardware sample. M5 Max only. Crossover point and prefill/decode split likely shift with memory bandwidth (614 GB/s on M5 Max) and GPU core count.
What we contributed back
- LLMKube PR #351 (merged): `cacheTypeCustomK`/`cacheTypeCustomV` on `InferenceServiceSpec`. Closes #282.
- LLMKube PR #353 (open): metal-agent respawns on ISVC spec drift; honors `replicas: 0`. Closes #350.
- Issues filed: #349, #350.
- Comment going to llama.cpp discussion #20969 with the M5 Max numbers and the prefill/decode split.
How to try it yourself
```shell
# 1. Build TheTom's fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# 2. Run the bench (turbo3 and turbo4 separately to see the split)
./build/bin/llama-bench \
  -m /path/to/your/model.gguf \
  -ctk turbo3 -ctv turbo3 \
  -d 0 -d 32768 -d 131072 -d 262144 \
  -p 512 -n 128 -ngl 99 -fa 1 -r 3 -o md
```
Memory ceiling depends on your unified-memory budget; sub-64 GB Macs probably can't reach 256K with a 35B-class model at any cache type. M3 Pro/Max territory is more realistic for 13B models at 128K with turbo3.
For NVIDIA: @spiritbuun's CUDA fork is the equivalent path.
Open invitation
If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same bench, we want your numbers. The crossover point and the prefill/decode split likely shift with memory bandwidth.
Drop results in llama.cpp discussion #20969 or open an issue on defilantech/llmkube.