I spent this weekend testing TurboQuant KV cache compression on my home lab Kubernetes cluster. The paper (ICLR 2026, Google Research) promises up to 4.57x compression of the KV cache with minimal quality loss. That sounded like exactly what I needed. I'm always bumping up against VRAM limits trying to run larger models or longer contexts on consumer hardware.
Here's what I found: it works, but there are real tradeoffs nobody's talking about yet.
The Problem: KV Cache Eats Your VRAM
If you've run LLMs locally, you know the drill. You load a 32B model that fits in 20GB of VRAM, set the context to 32K, and suddenly you're at 28GB. The model weights didn't change. It's the KV cache growing linearly with context length.
For every token in the context, the model stores key and value vectors for every attention head at every layer. In FP16, that adds up fast. A 32B model at 32K context can burn through 8+ GB of VRAM just for the KV cache.
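As a sanity check on that "8+ GB" figure, here's a back-of-the-envelope calculator. The layer/head counts below are assumptions for a Qwen-2.5-32B-style architecture with grouped-query attention; check your model's config.json for the real values:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Size of the KV cache: K and V vectors per token, per layer, per KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K + V
    return per_token * ctx_len

# Assumed Qwen-2.5-32B-ish shape: 64 layers, 8 KV heads (GQA), head_dim 128
total = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_len=32 * 1024)
print(f"{total / 2**30:.1f} GiB")  # → 8.0 GiB for the FP16 cache at 32K context
```

Dense-attention models without GQA store one KV pair per query head, so the same arithmetic lands much higher for them.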
TurboQuant's approach is to apply a Walsh-Hadamard Transform (WHT) rotation to KV cache vectors before quantizing them to 3 bits. The rotation "gaussianizes" the distribution, making scalar quantization much more effective. The result is TQ3_0: roughly 3 bits per element instead of 16, for a theoretical 4.57x compression.
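To make the rotate-then-quantize idea concrete, here's a toy sketch in pure Python. This is nothing like the fused CUDA kernel; the symmetric 8-level grid and the per-vector scale are my assumptions for illustration:

```python
def wht(v):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse, O(n log n)).
    Input length must be a power of two."""
    v = list(v)
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    s = n ** -0.5
    return [x * s for x in v]

def quant3(v):
    """Toy symmetric 3-bit scalar quantization: 8 levels in [-4, 3], one scale."""
    scale = max(abs(x) for x in v) / 4 or 1.0
    q = [max(-4, min(3, round(x / scale))) for x in v]
    return q, scale

def dequant3(q, scale):
    return [x * scale for x in q]

# Rotate, quantize, dequantize, rotate back (the orthonormal WHT is its own inverse)
vec = [0.9, -2.0, 0.1, 0.4, -0.3, 1.5, 0.0, -0.7]
q, s = quant3(wht(vec))
approx = wht(dequant3(q, s))
```

The point of the rotation is that scalar quantization has worst-case error on spiky, outlier-heavy distributions; after the WHT mixes every input dimension into every output dimension, the coefficients look roughly Gaussian and a single scale per vector wastes far fewer levels.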
My Setup
Hardware: ShadowStack, my home inference server
- 2x NVIDIA RTX 5060 Ti (16GB GDDR7 each, 32GB total)
- AMD Ryzen 9 7900X, 64GB DDR5
- Ubuntu 24.04, MicroK8s
Software: LLMKube, an open-source Kubernetes operator I built for managing llama.cpp inference workloads. It handles model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics through Kubernetes CRDs.
TurboQuant build: I used the animehacker/llama-turboquant fork, which has working CUDA kernels for the WHT-based TQ3_0 type. This is a Stage 1 implementation (no QJL residual correction from the full paper). I built it with Kaniko directly on my cluster targeting SM 86 (Ampere) and SM 120 (Blackwell).
The Wrapper Entrypoint Pattern
LLMKube's InferenceService CRD doesn't have a --cache-type flag yet, so I built a custom Docker image with a wrapper entrypoint that injects the TurboQuant flags transparently:
```bash
#!/bin/bash
# entrypoint.sh - passes through all LLMKube args, appends TQ flags
TQ_CACHE_TYPE="${TQ_CACHE_TYPE:-tq3_0}"
TQ_ENABLED="${TQ_ENABLED:-true}"

if [ "${TQ_ENABLED}" = "true" ]; then
  exec llama-server "$@" --cache-type-k "${TQ_CACHE_TYPE}" --cache-type-v "${TQ_CACHE_TYPE}"
else
  exec llama-server "$@"
fi
```
Using exec is important. It makes llama-server PID 1 so Kubernetes health probes and signal handling work correctly.
Benchmark Methodology
Apples-to-apples. Same model weights, same context size, same concurrency. The only variable was the KV cache type (FP16 vs TQ3_0). Flash attention was enabled for all tests.
Throughput test: 5 minutes of sustained load at 4 concurrent requests, 8K context.
Context sweep: Deploy at each context size (4K through 131K), run a 2-minute stress test, record VRAM via nvidia-smi.
Models tested:
- Llama 3.1 8B (Q5_K_M), small model with lots of headroom
- Qwen 2.5 14B (Q5_K_M), medium model that fills one GPU
- Qwen 2.5 32B (Q4_K_M), large model that requires both GPUs
Results: Throughput
This is where TurboQuant hurts.
| Model | Variant | Gen tok/s | Prompt tok/s | Requests (5min) |
|---|---|---|---|---|
| Llama 8B | FP16 cache | 50.0 | 565.5 | 771 |
| Llama 8B | TQ3_0 cache | 8.4 | 93.4 | 74 |
| Qwen 14B | FP16 cache | 28.1 | 122.0 | 128 |
| Qwen 14B | TQ3_0 cache | 5.3 | 63.4 | 53 |
| Qwen 32B | FP16 cache | 14.3 | 133.3 | 108 |
| Qwen 32B | TQ3_0 cache | 5.5 | 85.5 | 53 |
Generation throughput dropped hard: roughly 6x for Llama 8B, 5x for Qwen 14B, and 2.6x for Qwen 32B. Prompt processing dropped between 1.6x and 6x, with the smallest model hit hardest. This is consistent with what the PR benchmarks showed on CPU, but I expected Blackwell's tensor cores to help more than they did. The animehacker CUDA kernels were optimized for Ampere (SM 86), not Blackwell (SM 120), so there's likely performance left on the table.
Results: VRAM Usage
This is where it gets interesting.
Llama 3.1 8B, Context Sweep
| Context | FP16 VRAM (total) | TQ3_0 VRAM (total) | Savings |
|---|---|---|---|
| 4K | 6.4 GB | 10.1 GB | -58% (worse) |
| 8K | 6.9 GB | 14.3 GB | -107% (worse) |
| 16K | 8.0 GB | 22.8 GB | -185% (worse) |
| 32K | 10.1 GB | 6.9 GB | 31% better |
| 65K | 14.3 GB | 8.4 GB | 41% better |
| 98K | 18.5 GB | 9.8 GB | 47% better |
| 131K | 22.7 GB | 11.2 GB | 51% better |
Qwen 2.5 14B, Context Sweep
| Context | FP16 VRAM (total) | TQ3_0 VRAM (total) | Savings |
|---|---|---|---|
| 4K | 11.1 GB | 16.7 GB | -50% (worse) |
| 8K | 11.9 GB | 23.0 GB | -93% (worse) |
| 16K | 13.4 GB | 11.0 GB | 18% better |
| 32K | 16.6 GB | 11.8 GB | 29% better |
| 65K | 22.8 GB | 13.7 GB | 40% better |
Qwen 2.5 32B, Context Sweep
| Context | FP16 VRAM (total) | TQ3_0 VRAM (total) | Savings |
|---|---|---|---|
| 2K | 19.9 GB | 23.7 GB | -19% (worse) |
| 4K | 20.5 GB | 27.9 GB | -36% (worse) |
| 8K | 21.6 GB | 19.8 GB | 8% better |
| 16K | 23.7 GB | 20.3 GB | 14% better |
| 32K | 28.0 GB | 21.4 GB | 24% better |
The Surprise: TQ Uses MORE VRAM at Small Contexts
I wasn't expecting this. At 4K-16K context, TQ3_0 consistently used more VRAM than the FP16 baseline. Sometimes dramatically more. Llama 8B at 16K context used 22.8 GB with TQ vs 8.0 GB with FP16.
My theory: the WHT rotation machinery has a fixed overhead (lookup tables, rotation matrices, codebooks) that gets allocated regardless of context size. When the KV cache is small, this overhead dwarfs the compression savings. The crossover point where TQ starts winning varies by model:
- Llama 8B: around 32K context
- Qwen 14B: around 16K context
- Qwen 32B: around 8K context
Larger models cross over earlier because their per-token KV cache is larger (more layers, more attention heads), so the compression pays off sooner.
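That theory implies a simple crossover model: if TQ adds a fixed VRAM overhead but saves a fixed amount per cached token, break-even is just overhead divided by per-token savings. A sketch with illustrative numbers (the 3.4 GiB overhead and the 128 KiB/token figure are assumptions for an 8B-class GQA model, not measured values):

```python
def crossover_ctx(fixed_overhead_bytes, fp16_bytes_per_token, tq_bytes_per_token):
    """Context length at which TQ's fixed overhead is paid back by per-token savings."""
    savings = fp16_bytes_per_token - tq_bytes_per_token
    return fixed_overhead_bytes / savings

# Illustrative: ~128 KiB/token of FP16 KV cache, ~3/16 of that for 3-bit TQ
fp16 = 128 * 1024
tq = fp16 * 3 / 16
print(f"{crossover_ctx(3.4 * 2**30, fp16, tq):,.0f} tokens")
```

With those made-up inputs the break-even lands in the low-30K range, which at least matches the shape of what I measured for Llama 8B: per-token savings grow with model size, so the fixed overhead is amortized sooner on bigger models.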
When Is TurboQuant Worth It?
Use TQ3_0 when:
- You need 32K+ context on consumer GPUs
- You're hitting VRAM limits and can't afford more hardware
- Throughput isn't critical (batch processing, RAG with long documents, analysis tasks)
- You're running a large model (32B+) where the crossover point is lower
Don't use TQ3_0 when:
- Context is under 16K (you'll actually use more VRAM)
- You need interactive throughput (the 5x penalty makes chat unusable)
- You're on Blackwell and want optimal performance (wait for SM 120-optimized kernels)
The sweet spot in my testing was Qwen 32B at 32K context. Baseline uses 28 GB, which is dangerously close to my 32 GB ceiling. One concurrent request could OOM. TQ drops it to 21.4 GB, leaving over 10 GB of headroom for parallel slots or longer contexts.
What's Next
The throughput penalty is the main blocker. The animehacker CUDA kernels use a fused MMVQ approach that avoids dequantization during attention, but the WHT butterfly transform still runs 160 integer ops per element in registers. On Blackwell with its new SM architecture, these kernels likely aren't hitting optimal occupancy.
Things I'm watching:
- PR #21089 on ggml-org/llama.cpp, the only open upstream PR for TurboQuant (CPU-only for now)
- Whether ggerganov engages with it. If he requests changes rather than closing, it'll eventually land.
- SM 120-optimized CUDA kernels. Blackwell has new instructions that could close the throughput gap.
For LLMKube, I'm planning to add cacheTypeK and cacheTypeV fields to the InferenceService CRD so users can configure this without the wrapper entrypoint hack. Also an extraArgs escape hatch for any llama.cpp flag we don't have a typed field for yet.
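Once those land, the spec could look something like this. The field names come from the plan above, but the schema as a whole is hypothetical (the API group, version, and model reference are placeholders):

```yaml
apiVersion: llmkube.dev/v1alpha1   # hypothetical API group/version
kind: InferenceService
metadata:
  name: qwen-32b-tq
spec:
  model: qwen2.5-32b-q4_k_m        # placeholder model reference
  contextSize: 32768
  cacheTypeK: tq3_0                # planned field, not yet in the CRD
  cacheTypeV: tq3_0
  extraArgs: ["--flash-attn"]      # proposed escape hatch for other llama.cpp flags
```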
Try It Yourself
All the benchmarking infrastructure is in the LLMKube repo. The operator is open source (Apache 2.0) and handles the full lifecycle: model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics. If you have a GPU cluster and want to test TurboQuant:
- Build the custom image from `animehacker/llama-turboquant` with `-DGGML_CUDA=ON`
- Set `spec.image` on your InferenceService to point at it
- The wrapper entrypoint handles the rest
If you run these benchmarks on different hardware (A100, RTX 3090, etc.), I'd love to see the numbers. Drop a comment or find me on the LLMKube Discord.
Benchmarks run on 2026-03-30 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.0).