ArshTechPro
TurboQuant: What Developers Need to Know About Google's KV Cache Compression

If you've ever run a large language model on your own hardware and watched your GPU memory vanish as the context window grows, TurboQuant is built for exactly that problem.

Published by Google Research on March 24, 2026, and headed to ICLR 2026, TurboQuant is a compression algorithm that shrinks the KV cache -- the biggest memory bottleneck during LLM inference -- down to 3-4 bits per element without any retraining or fine-tuning. The result is roughly a 4-6x reduction in KV cache memory with negligible quality loss.

This article breaks down what TurboQuant actually does, why it matters for anyone deploying or experimenting with LLMs, and how to start using community implementations right now.


The Problem: KV Cache Is Eating Your VRAM

When a transformer model generates text, it computes key and value vectors for every token in the context and stores them so it doesn't have to recompute them on subsequent steps. This is the key-value (KV) cache.

The issue is simple: it grows linearly with context length, and it stores everything in full precision (typically FP16). For an 8B parameter model at 32K context, the KV cache alone can consume around 4.6 GB of VRAM. Scale that to multiple concurrent users or longer contexts, and you're out of memory before the model weights themselves become the bottleneck.
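The footprint is easy to estimate from a model's config. A back-of-the-envelope sketch, assuming a hypothetical 8B-class config with grouped-query attention (32 layers, 8 KV heads, head dimension 128 -- check your model's actual config, as these numbers vary by architecture):

```python
# Assumed config: hypothetical 8B-class model with GQA. Adjust to your model.
layers, kv_heads, head_dim = 32, 8, 128
context_len = 32_768
bytes_per_elem = 2                      # FP16

# Leading factor of 2: one tensor for keys, one for values.
kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # -> KV cache: 4.0 GiB
```

A model using full multi-head attention (32 KV heads instead of 8) would be 4x larger at the same context length, which is why quoted figures for "the KV cache of an 8B model" vary so much between sources.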

Existing approaches to this problem -- like FP8 quantization in vLLM or the q4_0/q8_0 cache types in Ollama -- either don't compress aggressively enough or introduce quality trade-offs that are hard to predict. TurboQuant aims to do better on both fronts.


How TurboQuant Works (The Short Version)

TurboQuant is a two-stage compression pipeline. It doesn't need any training data, calibration, or model-specific tuning. It works on any vector, which means it slots into any transformer architecture.

Stage 1: PolarQuant (b-1 bits)

The first step is a random orthogonal rotation applied to each KV vector. This rotation spreads the energy of the vector uniformly across all coordinates, which transforms the problem: after rotation, each coordinate follows a predictable statistical distribution (approximately Beta or Gaussian depending on the head dimension). Because the distribution is known in advance, you can compute a mathematically optimal set of quantization buckets (using the Lloyd-Max algorithm) once, ahead of time. No per-model or per-dataset calibration needed. PolarQuant then converts coordinates into polar form -- radius and angle rather than Cartesian x/y/z -- which eliminates the costly per-block normalization constants that traditional quantizers need.
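A heavily simplified NumPy sketch of the rotate-then-quantize idea. This is not the paper's PolarQuant: the codebook below is a quantile-based stand-in rather than the true Lloyd-Max solution, and the polar re-parameterization is omitted entirely. It only illustrates why a fixed, precomputed codebook works after a random rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                   # head dimension

# Random orthogonal rotation via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

v = rng.standard_normal(d)
v /= np.linalg.norm(v)                    # a unit-norm key/value vector
rv = Q @ v                                # rotation preserves the norm exactly

# After rotation, each coordinate is distributed like a coordinate of a
# random point on the unit sphere (approximately N(0, 1/d) for large d),
# so one codebook can be fixed ahead of time for ALL vectors. Here:
# 16 levels (4 bits) chosen as quantiles -- a crude stand-in for the
# optimal Lloyd-Max codebook used in the paper.
samples = rng.standard_normal(100_000) / np.sqrt(d)
levels = np.quantile(samples, (np.arange(16) + 0.5) / 16)

idx = np.abs(rv[:, None] - levels[None, :]).argmin(axis=1)   # encode: 4 bits
rv_hat = levels[idx]                                         # decode

rel_err = np.linalg.norm(rv - rv_hat)     # v is unit-norm, so this is relative
```

The key point the sketch demonstrates: `levels` is computed once from a known distribution, with no access to the model or any calibration data.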

Stage 2: QJL Residual Correction (1 bit)

The second stage takes the tiny quantization error left over from Stage 1, projects it through a random Gaussian matrix using the Johnson-Lindenstrauss transform, and stores only the sign bit (+1 or -1) of each resulting value. This single-bit sketch acts as a bias correction that makes the inner product estimates (i.e., attention scores) mathematically unbiased. The overhead is just 1 extra bit per coordinate.
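The sign-sketch idea rests on a standard identity for Gaussian projections: E[sign(<s, x>) * <s, q>] = sqrt(2/pi) * <x, q> / ||x||. A minimal illustration follows -- note the sketch size m is deliberately exaggerated here to make the convergence visible; TurboQuant itself stores only 1 bit per coordinate (m = d):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128
m = 50_000                      # exaggerated sketch size; TurboQuant uses m = d

x = rng.standard_normal(d)      # stand-in for the Stage-1 quantization residual
q = rng.standard_normal(d)      # stand-in for a query vector

S = rng.standard_normal((m, d)) # random Gaussian (Johnson-Lindenstrauss) matrix
bits = np.sign(S @ x)           # all that is stored: one sign bit per row

# Unbiased inner-product estimate recovered from the sign bits alone,
# using E[sign(<s,x>) * <s,q>] = sqrt(2/pi) * <x,q> / ||x||:
est = np.linalg.norm(x) * np.sqrt(np.pi / 2) * np.mean(bits * (S @ q))
true = float(x @ q)
```

Because the estimate is unbiased, the Stage-2 correction removes the systematic error that Stage 1's lossy codes would otherwise inject into attention scores.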

The combined result: b bits total per coordinate, with provably near-optimal distortion bounds and zero memory overhead from normalization constants.


Why This Matters for Developers

It's training-free and model-agnostic. TurboQuant doesn't require fine-tuning, calibration datasets, or model-specific configuration. The rotation matrix and codebook are derived from math, not data. Point it at any transformer's KV cache and it works.

Compression scales with context length. The benefit is proportional to how much KV cache you have. At 512 tokens the savings are modest (tens of megabytes). At 4K tokens you start saving over 1 GB. At 8K+ tokens the savings reach 2 GB or more on a single model -- and that's when it starts changing what you can actually run on your hardware.
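Quick arithmetic shows where the crossover happens. Assuming a hypothetical model with full multi-head attention (32 layers, 32 KV heads, head dimension 128 -- GQA models will be proportionally smaller) and a nominal 4x reduction from FP16 to 4 bits:

```python
# Per-token KV bytes: keys + values, across all layers and KV heads.
layers, kv_heads, head_dim = 32, 32, 128    # assumed full-MHA config
fp16_per_token = 2 * layers * kv_heads * head_dim * 2   # bytes at FP16

for ctx in (512, 4096, 8192):
    fp16 = ctx * fp16_per_token
    q4 = fp16 // 4                          # 16 bits -> 4 bits per element
    print(f"{ctx:>5} tokens: FP16 {fp16 / 2**20:5.0f} MiB, "
          f"4-bit {q4 / 2**20:5.0f} MiB, saved {(fp16 - q4) / 2**20:5.0f} MiB")
```

Under these assumptions the savings go from under 200 MiB at 512 tokens to roughly 1.5 GiB at 4K and 3 GiB at 8K, consistent with the scaling described above.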

It enables longer contexts on existing hardware. If you're hitting OOM at 16K context on a 16 GB GPU, TurboQuant can push that boundary significantly without buying new hardware.

Speed gains under memory pressure. When FP16 KV cache pushes your GPU into swap territory, inference speed collapses. Community benchmarks show TurboQuant maintaining 2-3x higher token throughput in these regimes because the compressed cache stays in fast GPU memory.

It applies beyond LLMs. The same algorithm works for vector search / nearest-neighbor retrieval, compressing high-dimensional embedding indices with state-of-the-art recall.


Getting Started: The pip-Installable Path

The fastest way to try TurboQuant today is the turboquant Python package, a community implementation that provides a drop-in replacement for HuggingFace's KV cache:

pip install turboquant

A few lines to compress your model's KV cache:

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Create compressed cache -- that's it
cache = TurboQuantCache(bits=4)

inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
outputs = model(**inputs, past_key_values=cache, use_cache=True)

There's also a built-in OpenAI-compatible inference server:

turboquant-server --model Qwen/Qwen2.5-3B-Instruct --bits 4 --port 8000

And you can use the core quantizer directly on any vectors:

from turboquant import TurboQuantMSE

tq = TurboQuantMSE(dim=128, bits=4, device='cuda')
indices, norms = tq.quantize(vectors)       # vectors: (N, 128)
vectors_hat = tq.dequantize(indices, norms)  # reconstruct

The llama.cpp Path

If you're running models locally through llama.cpp, there are active community implementations integrating TurboQuant as a KV cache type. One notable fork (turboquant_plus) already works end-to-end on Apple Silicon with Metal GPU kernels:

# Server mode
./build/bin/llama-server \
  -m models/your-model.gguf \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -ngl 99 -c 262144 -fa on \
  --host 0.0.0.0 --port 8080

# CLI mode
./build/bin/llama-cli \
  -m models/your-model.gguf \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  -ngl 99 -c 2048 -fa on \
  -n 100 -p "Hello world"

There's also an open feature request on the vLLM project to integrate TurboQuant as a native KV cache quantization option. Google's official implementation is expected around Q2 2026.


Practical Notes and Gotchas

A few things the benchmarks and community experiments have surfaced that the paper doesn't emphasize:

4-bit is the sweet spot for most use cases. At 4 bits, quality is essentially indistinguishable from FP16 on 3B+ parameter models. At 3 bits, you get more compression but quality starts degrading noticeably on models smaller than 8B.

Small models are more sensitive. On 0.5B-1.6B parameter models, quantization noise from TurboQuant can produce repetitive or degraded output, especially at 3-bit. If you're running something under 3B parameters, test carefully.

Keys and values have different sensitivities. Community experiments have found that value quantization tends to be the bottleneck -- 2-bit values cause significant cosine similarity degradation (around 0.94), while 4-bit values maintain 0.997. If you're tuning bit allocation, give values more bits than keys.
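The qualitative pattern is easy to reproduce with a toy quantizer. The snippet below uses a plain uniform quantizer rather than TurboQuant's codebook, so the absolute numbers differ from the figures above, but the 2-bit vs. 4-bit cosine-similarity gap shows the same effect:

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_quantize(x, bits):
    """Crude uniform quantizer over the observed range (illustration only)."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / step) * step + lo

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = rng.standard_normal(4096)       # stand-in for flattened value vectors
cos2 = cosine(v, uniform_quantize(v, 2))
cos4 = cosine(v, uniform_quantize(v, 4))
print(f"2-bit cosine: {cos2:.4f}, 4-bit cosine: {cos4:.4f}")
```

The same experiment run per-head on real key and value tensors is how the community numbers quoted above were obtained.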

Short contexts don't benefit much. Below 1K tokens, the KV cache is small enough that compression savings are negligible and the overhead of rotation + quantization can even be a net negative. TurboQuant really shines at 4K+ tokens.

The residual window matters. Most implementations keep the most recent 128-256 tokens in full FP16 precision and only compress older tokens. This is important for output quality since attention focuses heavily on recent context.
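The policy itself is simple to express. A minimal sketch -- the class and parameter names are made up for illustration, and `_quantize` is a crude uniform stand-in for TurboQuant's actual encoder:

```python
import numpy as np

class ResidualWindowKV:
    """Keep the newest `window` tokens in full precision; quantize the rest."""

    def __init__(self, window=128, bits=4):
        self.window, self.bits = window, bits
        self.recent = []                 # full-precision vectors (hot window)
        self.compressed = []             # (codes, lo, hi) for older tokens

    def _quantize(self, x):
        # Crude uniform stand-in for TurboQuant's encoder.
        lo, hi = float(x.min()), float(x.max())
        n = 2 ** self.bits - 1
        codes = np.round((x - lo) / (hi - lo) * n).astype(np.uint8)
        return codes, lo, hi

    def append(self, kv_vec):
        self.recent.append(kv_vec)
        if len(self.recent) > self.window:
            # Oldest token falls out of the hot window: compress it.
            self.compressed.append(self._quantize(self.recent.pop(0)))

rng = np.random.default_rng(3)
cache = ResidualWindowKV(window=128)
for _ in range(200):
    cache.append(rng.standard_normal(128))
```

After 200 appended tokens with a 128-token window, the newest 128 vectors stay in full precision and the oldest 72 are held only in compressed form.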


Community Implementations at a Glance

| Project | Language | Integration | Notes |
|---|---|---|---|
| back2matching/turboquant | Python | HuggingFace drop-in | pip install turboquant, includes OpenAI-compatible server |
| tonbistudio/turboquant-pytorch | Python/PyTorch | Standalone | From-scratch implementation with detailed validation |
| 0xSero/turboquant | Python | vLLM adapter | Triton kernels, vLLM monkey-patch |
| TheTom/turboquant_plus | C/Python | llama.cpp + Metal | Apple Silicon optimized, 500+ tests |
| RecursiveIntell/turbo-quant | Rust | Standalone lib | Embedding + KV cache, no runtime dependencies |
| ggml-org/llama.cpp#20969 | C | llama.cpp discussion | Multiple community PRs in progress |

The Bigger Picture

TurboQuant is one piece of a larger shift happening in LLM deployment: making inference cheaper and more accessible without sacrificing quality. It pairs well with weight quantization (GPTQ, AWQ, GGUF formats), speculative decoding, and other serving optimizations. The combination of a 4-bit quantized model with a 4-bit TurboQuant KV cache means you can run meaningfully large models on consumer GPUs with long contexts -- something that was impractical a year ago.

Top comments (1)

klement Gunndu:

The residual window approach -- keeping recent tokens in FP16 while compressing older ones -- is the detail that makes this production-viable. That alone saved me from going down a uniform quantization path.