I spent this weekend testing TurboQuant KV cache compression on my home lab Kubernetes cluster. The paper (ICLR 2026, Google Research) promises up to 4.57x compression of the KV cache with minimal quality loss. That sounded like exactly what I needed. I'm always bumping up against VRAM limits trying to run larger models or longer contexts on consumer hardware.
Here's what I found: it works, but there are real tradeoffs nobody's talking about yet.
The Problem: KV Cache Eats Your VRAM
If you've run LLMs locally, you know the drill. You load a 32B model that fits in 20GB of VRAM, set the context to 32K, and suddenly you're at 28GB. The model weights didn't change. It's the KV cache growing linearly with context length.
For every token in the context, the model stores key and value vectors for every attention head at every layer. In FP16, that adds up fast. A 32B model at 32K context can burn through 8+ GB of VRAM just for the KV cache.
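As a sanity check on that "8+ GB" figure, here's a back-of-the-envelope calculator. The layer/head counts below are assumptions for a Qwen-2.5-32B-style architecture with grouped-query attention; check your model's config.json for the real values:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Size of the KV cache: K and V vectors per token, per layer, per KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K + V
    return per_token * ctx_len

# Assumed Qwen-2.5-32B-ish shape: 64 layers, 8 KV heads (GQA), head_dim 128
total = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_len=32 * 1024)
print(f"{total / 2**30:.1f} GiB")  # → 8.0 GiB for the FP16 cache at 32K context
```

Dense-attention models without GQA store one KV pair per query head, so the same arithmetic lands much higher for them.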
TurboQuant's approach is to apply a Walsh-Hadamard Transform (WHT) rotation to KV cache vectors before quantizing them to 3 bits. The rotation "gaussianizes" the distribution, making scalar quantization much more effective. The result is TQ3_0: roughly 3 bits per element instead of 16, for a theoretical 4.57x compression.
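To make the rotate-then-quantize idea concrete, here's a toy sketch in pure Python. This is nothing like the fused CUDA kernel; the symmetric 8-level grid and the per-vector scale are my assumptions for illustration:

```python
def wht(v):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse, O(n log n)).
    Input length must be a power of two."""
    v = list(v)
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    s = n ** -0.5
    return [x * s for x in v]

def quant3(v):
    """Toy symmetric 3-bit scalar quantization: 8 levels in [-4, 3], one scale."""
    scale = max(abs(x) for x in v) / 4 or 1.0
    q = [max(-4, min(3, round(x / scale))) for x in v]
    return q, scale

def dequant3(q, scale):
    return [x * scale for x in q]

# Rotate, quantize, dequantize, rotate back (the orthonormal WHT is its own inverse)
vec = [0.9, -2.0, 0.1, 0.4, -0.3, 1.5, 0.0, -0.7]
q, s = quant3(wht(vec))
approx = wht(dequant3(q, s))
```

The point of the rotation is that scalar quantization has worst-case error on spiky, outlier-heavy distributions; after the WHT mixes every input dimension into every output dimension, the coefficients look roughly Gaussian and a single scale per vector wastes far fewer levels.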
My Setup
Hardware: ShadowStack, my home inference server
- 2x NVIDIA RTX 5060 Ti (16GB GDDR7 each, 32GB total)
- AMD Ryzen 9 7900X, 64GB DDR5
- Ubuntu 24.04, MicroK8s
Software: LLMKube, an open-source Kubernetes operator I built for managing llama.cpp inference workloads. It handles model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics through Kubernetes CRDs.
TurboQuant build: I used the animehacker/llama-turboquant fork, which has working CUDA kernels for the WHT-based TQ3_0 type. This is a Stage 1 implementation (no QJL residual correction from the full paper). I built it with Kaniko directly on my cluster targeting SM 86 (Ampere) and SM 120 (Blackwell).
The Wrapper Entrypoint Pattern
LLMKube's InferenceService CRD doesn't have a --cache-type flag yet, so I built a custom Docker image with a wrapper entrypoint that injects the TurboQuant flags transparently:
```bash
#!/bin/bash
# entrypoint.sh - passes through all LLMKube args, appends TQ flags
TQ_CACHE_TYPE="${TQ_CACHE_TYPE:-tq3_0}"
TQ_ENABLED="${TQ_ENABLED:-true}"

if [ "${TQ_ENABLED}" = "true" ]; then
  exec llama-server "$@" --cache-type-k "${TQ_CACHE_TYPE}" --cache-type-v "${TQ_CACHE_TYPE}"
else
  exec llama-server "$@"
fi
```
Using exec is important. It makes llama-server PID 1 so Kubernetes health probes and signal handling work correctly.
Benchmark Methodology
Apples-to-apples. Same model weights, same context size, same concurrency. The only variable was the KV cache type (FP16 vs TQ3_0). Flash attention was enabled for all tests.
Throughput test: 5 minutes of sustained load at 4 concurrent requests, 8K context.
Context sweep: Deploy at each context size (4K through 131K), run a 2-minute stress test, record VRAM via nvidia-smi.
Models tested:
- Llama 3.1 8B (Q5_K_M), small model with lots of headroom
- Qwen 2.5 14B (Q5_K_M), medium model that fills one GPU
- Qwen 2.5 32B (Q4_K_M), large model that requires both GPUs
Results: Throughput
This is where TurboQuant hurts.
| Model | Variant | Gen tok/s | Prompt tok/s | Requests (5min) |
|---|---|---|---|---|
| Llama 8B | FP16 cache | 50.0 | 565.5 | 771 |
| Llama 8B | TQ3_0 cache | 8.4 | 93.4 | 74 |
| Qwen 14B | FP16 cache | 28.1 | 122.0 | 128 |
| Qwen 14B | TQ3_0 cache | 5.3 | 63.4 | 53 |
| Qwen 32B | FP16 cache | 14.3 | 133.3 | 108 |
| Qwen 32B | TQ3_0 cache | 5.5 | 85.5 | 53 |
Generation throughput dropped hard: roughly 6x for Llama 8B, 5x for Qwen 14B, and 2.6x for Qwen 32B. Prompt processing dropped between 1.6x and 6x, with the smallest model hit hardest. This is consistent with what the PR benchmarks showed on CPU, but I expected Blackwell's tensor cores to help more than they did. The animehacker CUDA kernels were optimized for Ampere (SM 86), not Blackwell (SM 120), so there's likely performance left on the table.
Results: VRAM Usage
This is where it gets interesting.
Llama 3.1 8B, Context Sweep
| Context | FP16 VRAM (total) | TQ3_0 VRAM (total) | Savings |
|---|---|---|---|
| 4K | 6.4 GB | 10.1 GB | -58% (worse) |
| 8K | 6.9 GB | 14.3 GB | -107% (worse) |
| 16K | 8.0 GB | 22.8 GB | -185% (worse) |
| 32K | 10.1 GB | 6.9 GB | 31% better |
| 65K | 14.3 GB | 8.4 GB | 41% better |
| 98K | 18.5 GB | 9.8 GB | 47% better |
| 131K | 22.7 GB | 11.2 GB | 51% better |
Qwen 2.5 14B, Context Sweep
| Context | FP16 VRAM (total) | TQ3_0 VRAM (total) | Savings |
|---|---|---|---|
| 4K | 11.1 GB | 16.7 GB | -50% (worse) |
| 8K | 11.9 GB | 23.0 GB | -93% (worse) |
| 16K | 13.4 GB | 11.0 GB | 18% better |
| 32K | 16.6 GB | 11.8 GB | 29% better |
| 65K | 22.8 GB | 13.7 GB | 40% better |
Qwen 2.5 32B, Context Sweep
| Context | FP16 VRAM (total) | TQ3_0 VRAM (total) | Savings |
|---|---|---|---|
| 2K | 19.9 GB | 23.7 GB | -19% (worse) |
| 4K | 20.5 GB | 27.9 GB | -36% (worse) |
| 8K | 21.6 GB | 19.8 GB | 8% better |
| 16K | 23.7 GB | 20.3 GB | 14% better |
| 32K | 28.0 GB | 21.4 GB | 24% better |
The Surprise: TQ Uses MORE VRAM at Small Contexts
I wasn't expecting this. At 4K-16K context, TQ3_0 consistently used more VRAM than the FP16 baseline. Sometimes dramatically more. Llama 8B at 16K context used 22.8 GB with TQ vs 8.0 GB with FP16.
My theory: the WHT rotation machinery has a fixed overhead (lookup tables, rotation matrices, codebooks) that gets allocated regardless of context size. When the KV cache is small, this overhead dwarfs the compression savings. The crossover point where TQ starts winning varies by model:
- Llama 8B: around 32K context
- Qwen 14B: around 16K context
- Qwen 32B: around 8K context
Larger models cross over earlier because their per-token KV cache is larger (more layers, more attention heads), so the compression pays off sooner.
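That theory implies a simple crossover model: if TQ adds a fixed VRAM overhead but saves a fixed amount per cached token, break-even is just overhead divided by per-token savings. A sketch with illustrative numbers (the 3.4 GiB overhead and the 128 KiB/token figure are assumptions for an 8B-class GQA model, not measured values):

```python
def crossover_ctx(fixed_overhead_bytes, fp16_bytes_per_token, tq_bytes_per_token):
    """Context length at which TQ's fixed overhead is paid back by per-token savings."""
    savings = fp16_bytes_per_token - tq_bytes_per_token
    return fixed_overhead_bytes / savings

# Illustrative: ~128 KiB/token of FP16 KV cache, ~3/16 of that for 3-bit TQ
fp16 = 128 * 1024
tq = fp16 * 3 / 16
print(f"{crossover_ctx(3.4 * 2**30, fp16, tq):,.0f} tokens")
```

With those made-up inputs the break-even lands in the low-30K range, which at least matches the shape of what I measured for Llama 8B: per-token savings grow with model size, so the fixed overhead is amortized sooner on bigger models.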
When Is TurboQuant Worth It?
Use TQ3_0 when:
- You need 32K+ context on consumer GPUs
- You're hitting VRAM limits and can't afford more hardware
- Throughput isn't critical (batch processing, RAG with long documents, analysis tasks)
- You're running a large model (32B+) where the crossover point is lower
Don't use TQ3_0 when:
- Context is under 16K (you'll actually use more VRAM)
- You need interactive throughput (the 5x penalty makes chat unusable)
- You're on Blackwell and want optimal performance (wait for SM 120-optimized kernels)
The sweet spot in my testing was Qwen 32B at 32K context. Baseline uses 28 GB, which is dangerously close to my 32 GB ceiling. One concurrent request could OOM. TQ drops it to 21.4 GB, leaving over 10 GB of headroom for parallel slots or longer contexts.
What's Next
The throughput penalty is the main blocker. The animehacker CUDA kernels use a fused MMVQ approach that avoids dequantization during attention, but the WHT butterfly transform still runs 160 integer ops per element in registers. On Blackwell with its new SM architecture, these kernels likely aren't hitting optimal occupancy.
Things I'm watching:
- PR #21089 on ggml-org/llama.cpp, the only open upstream PR for TurboQuant (CPU-only for now)
- Whether ggerganov engages with it. If he requests changes rather than closing, it'll eventually land.
- SM 120-optimized CUDA kernels. Blackwell has new instructions that could close the throughput gap.
For LLMKube, I'm planning to add cacheTypeK and cacheTypeV fields to the InferenceService CRD so users can configure this without the wrapper entrypoint hack. Also an extraArgs escape hatch for any llama.cpp flag we don't have a typed field for yet.
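Once those land, the spec could look something like this. The field names come from the plan above, but the schema as a whole is hypothetical (the API group, version, and model reference are placeholders):

```yaml
apiVersion: llmkube.dev/v1alpha1   # hypothetical API group/version
kind: InferenceService
metadata:
  name: qwen-32b-tq
spec:
  model: qwen2.5-32b-q4_k_m        # placeholder model reference
  contextSize: 32768
  cacheTypeK: tq3_0                # planned field, not yet in the CRD
  cacheTypeV: tq3_0
  extraArgs: ["--flash-attn"]      # proposed escape hatch for other llama.cpp flags
```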
Try It Yourself
All the benchmarking infrastructure is in the LLMKube repo. The operator is open source (Apache 2.0) and handles the full lifecycle: model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics. If you have a GPU cluster and want to test TurboQuant:
- Build the custom image from `animehacker/llama-turboquant` with `-DGGML_CUDA=ON`
- Set `spec.image` on your InferenceService to point at it
- The wrapper entrypoint handles the rest
If you run these benchmarks on different hardware (A100, RTX 3090, etc.), I'd love to see the numbers. Drop a comment or find me on the LLMKube Discord.
Benchmarks run on 2026-03-30 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.0).