Google published TurboQuant at ICLR 2026 — a technique that compresses transformer KV caches to 3-4 bits per coordinate with zero accuracy loss. Their paper reports 5-6x memory reduction and 8x attention speedup on H100 GPUs, tested on text-only models like Gemma and Mistral.
I wanted to know: does it work on a vision-language model processing video? On a consumer GPU?
So I implemented it from the paper and ran it on Molmo2-4B analyzing Seinfeld clips (~11,000 visual tokens) on an RTX 4090.
## TL;DR Results
| Metric | Baseline | TQ4 Compressed |
|---|---|---|
| KV cache | 1,639 MiB | 435 MiB (3.76x) |
| Output quality | Detailed scene description | Near-identical |
| Decode overhead | — | 1.78x |
The model describes the same Seinfeld scene with the same visual details. Different phrasing on minor points, both equally valid.
## Three Things the Paper Doesn't Tell You
### 1. 4-bit nibble packing beats 3-bit unpacked
The paper focuses on 3-bit quantization (8 centroids). But 3-bit values cross byte boundaries, and no existing Python/Triton implementation I could find actually packs them. They all store 3-bit indices in full 8-bit bytes, wasting 62.5% of the storage.
4-bit indices pack trivially into nibbles:
```python
# Pack two 4-bit indices into one byte
packed = (indices[..., 0::2] << 4) | indices[..., 1::2]

# Unpack
high = packed >> 4
low = packed & 0x0F
```
And 16 centroids (4-bit) give ~97% cosine similarity vs ~95% for 8 centroids (3-bit). Better quality AND nearly double the compression of unpacked 3-bit, with two lines of code.
| Mode | Storage | Compression | Quality |
|---|---|---|---|
| 3-bit unpacked (uint8) | 132 B/block | 1.94x | ~95% cosine |
| 4-bit nibble-packed | 68 B/block | 3.76x | ~97% cosine |
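The packing snippet above can be made a runnable round-trip in NumPy (a sketch; `pack_nibbles` and `unpack_nibbles` are illustrative names, not the repo's API). The only subtle part is restoring the interleaved even/odd order on unpack:

```python
import numpy as np

def pack_nibbles(indices):
    # indices: uint8 array of 4-bit values (0..15), even-length last dim
    return (indices[..., 0::2] << 4) | indices[..., 1::2]

def unpack_nibbles(packed):
    high = packed >> 4
    low = packed & 0x0F
    # Re-interleave so element order matches the original index array
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = high
    out[..., 1::2] = low
    return out

idx = np.random.default_rng(0).integers(0, 16, size=(4, 128), dtype=np.uint8)
packed = pack_nibbles(idx)          # half the bytes of idx
assert np.array_equal(unpack_nibbles(packed), idx)  # lossless round-trip
```

The storage win is exactly the last-dimension halving: 128 one-byte indices become 64 packed bytes per block.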
### 2. FP16 norms fail silently at scale
I initially stored vector norms as float16 to save 2 bytes per vector. It worked fine on short sequences.
Then I tested with an 11,397-token video clip and got:

```
Output: "In the video,1.0 0 0 0 0 0 0 0 0 0 0 0..."
```
The model produced 4 correct tokens ("In the video,") then completely degenerated. The same prompt with 11,385 tokens (12 fewer) worked perfectly.
Root cause: FP16 precision loss (~0.01% per vector) accumulated across 36 transformer layers, shifting attention logits at low-confidence decision boundaries. Token-by-token analysis showed the divergence at step 5 where the logit margin was < 0.5.
Fix: Float32 norms. The 2 extra bytes per vector barely affect the compression ratio (1.97x → 1.94x). No other TurboQuant implementation I've found documents this failure mode.
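The magnitude of that per-vector error is easy to reproduce (a NumPy sketch; the dangerous part, compounding across 36 layers of attention, is not modeled here):

```python
import numpy as np

rng = np.random.default_rng(0)
# 128-dim key vectors, one per token, as in a single attention head
keys = rng.standard_normal((10000, 128)).astype(np.float32)

norms32 = np.linalg.norm(keys, axis=1)                   # stored as float32 (the fix)
norms16 = norms32.astype(np.float16).astype(np.float32)  # stored as float16 (the bug)

rel_err = np.abs(norms16 - norms32) / norms32
print(f"mean relative error: {rel_err.mean():.2e}")  # on the order of 1e-4, i.e. ~0.01%
```

A ~1e-4 relative error looks harmless in isolation; it only matters because every layer's attention logits inherit it, and a single low-margin token decision is enough to derail decoding.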
### 3. The boring optimization wins
I built a fused Triton kernel that computes `Q @ compressed_K^T` directly from nibble-packed indices. It achieves 17.8x speedup on the micro-benchmark by:
- Pre-rotating queries once (`q_rot = Q @ rotation_matrix^T`)
- Having the kernel do simple centroid lookups and dot products
- Eliminating the 128x128 rotation matmul from the inner loop
Sounds great. But when I wired it into all 36 layers of Molmo2, the output degenerated into "pizza pizza pizza pizza."
Root cause: The fused kernel computes in fp32, but the model expects bf16 attention behavior (via SDPA/FlashAttention). The 0.023 per-layer cosine gap between fp32 kernel output and bf16 SDPA output compounds catastrophically across 36 layers.
The fix that actually worked: incremental dequantization. Instead of decompressing the entire 11K-token cache at every layer at every decode step (the 3.36x overhead), decompress only the 1 new token and append it to a running buffer. Standard SDPA handles the attention.
Overhead went from 3.36x to 1.78x. No custom kernels needed.
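The incremental scheme is simple enough to sketch (illustrative only, assuming a per-coordinate scalar codebook; `dequantize_token` and `IncrementalKBuffer` are hypothetical names, not the repo's API):

```python
import numpy as np

def dequantize_token(packed, codebook, norm):
    """Unpack one token's nibble indices and gather centroids.

    packed:   (head_dim // 2,) uint8, two 4-bit indices per byte
    codebook: (16,) float32 centroids (scalar codebook assumed)
    norm:     float32 norm stored alongside the indices
    """
    idx = np.empty(packed.size * 2, dtype=np.int64)
    idx[0::2] = packed >> 4
    idx[1::2] = packed & 0x0F
    return codebook[idx] * norm  # (head_dim,) reconstructed key

class IncrementalKBuffer:
    """Append-only dequantized key buffer: decompress only the newest
    token, never the whole multi-thousand-token cache, then hand the
    running buffer to standard SDPA."""
    def __init__(self):
        self.k = None

    def append(self, packed, codebook, norm):
        new_k = dequantize_token(packed, codebook, norm)[None, :]
        self.k = new_k if self.k is None else np.concatenate([self.k, new_k])
        return self.k  # (seq_len, head_dim)
```

The design point is that per decode step the dequantization cost is O(1) in sequence length, while the full-cache approach pays O(seq_len) at every layer of every step.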
### The fused kernel isn't wasted
The kernel is correct (1.0 cosine similarity on the micro-benchmark) and fast (17.8x). It just needs to be part of a full Flash Attention-style fusion — computing softmax and V multiplication inside the kernel, not just Q@K^T scores. That's a future project.
## Implementation Details
The full implementation is at github.com/Alberto-Codes/turboquant-consumer:
- Core algorithm: Lloyd-Max codebook solver, TurboQuantMSE (Stage 1), TurboQuantProd (Stage 2 with QJL)
- CompressedDynamicCache: Drop-in KV cache wrapper with nibble packing and incremental dequant
- Fused Triton kernel: Nibble unpack + centroid gather + GQA mapping
- Benchmark harness: A/B testing CLI for any HuggingFace model
- 62 tests including long-sequence regression (36 layers, 1024 tokens)
- 5 experiment logs with full results
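For reference, the core of a Lloyd-Max codebook solver fits in a few lines (a 1-D NumPy sketch of the standard alternating update, not the repo's actual solver):

```python
import numpy as np

def lloyd_max(samples, n_levels=16, iters=50):
    # Initialize centroids at evenly spaced quantiles of the data
    c = np.quantile(samples, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(iters):
        # Optimal decision boundaries: midpoints between adjacent centroids
        b = (c[:-1] + c[1:]) / 2
        cell = np.searchsorted(b, samples)
        # Optimal centroids: conditional mean of each cell
        for k in range(n_levels):
            members = samples[cell == k]
            if members.size:
                c[k] = members.mean()
    return c

x = np.random.default_rng(0).standard_normal(10_000)
codebook = lloyd_max(x, n_levels=16)
bounds = (codebook[:-1] + codebook[1:]) / 2
mse = np.mean((x - codebook[np.searchsorted(bounds, x)]) ** 2)
```

For a unit-variance Gaussian, a 16-level (4-bit) Lloyd-Max quantizer lands near the classical optimum of roughly 0.0095 MSE, which is where the ~97% cosine figures come from.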
```bash
# Quick start
git clone https://github.com/Alberto-Codes/turboquant-consumer.git
cd turboquant-consumer && uv sync

# Run tests
uv run pytest tests/ -v

# Benchmark (requires GPU)
uv run python -m turboquant_consumer.benchmark \
  --model allenai/Molmo2-4B --bits 4 --compressed \
  --video /path/to/video.mp4
```
## What's Next
- Molmo2-8B validation — the 8B model recognizes character names
- Flash Attention-style fused kernel — full softmax+V fusion for multi-layer correctness
- vLLM integration — waiting for upstream cache backend API
This is the first TurboQuant implementation validated on a vision-language model with video input. If you're working on KV cache compression, I'd love to hear about your experiences — especially if you've hit the fp16 norms issue.