Google published TurboQuant at ICLR 2026 — a technique that compresses transformer KV caches to 3-4 bits per coordinate with zero accuracy loss. Their paper reports 5-6x memory reduction and 8x attention speedup on H100 GPUs, tested on text-only models like Gemma and Mistral.
I wanted to know: does it work on a vision-language model processing video? On a consumer GPU?
So I implemented it from the paper and ran it on Molmo2-4B analyzing Seinfeld clips (~11,000 visual tokens) on an RTX 4090.
## TL;DR Results
| Metric | Baseline | TQ4 Compressed |
|---|---|---|
| KV cache | 1,639 MiB | 435 MiB (3.76x) |
| Output quality | Detailed scene description | Near-identical |
| Decode overhead | — | 1.78x |
The model describes the same Seinfeld scene with the same visual details. Different phrasing on minor points, both equally valid.
## Three Things the Paper Doesn't Tell You
### 1. 4-bit nibble packing beats 3-bit unpacked
The paper focuses on 3-bit quantization (8 centroids). But 3-bit values cross byte boundaries, and no existing Python/Triton implementation I could find actually packs them. They all store 3-bit indices in full 8-bit bytes, wasting 62.5% of the storage.
4-bit indices pack trivially into nibbles:
```python
# Pack two 4-bit indices into one byte
packed = (indices[..., 0::2] << 4) | indices[..., 1::2]

# Unpack
high = packed >> 4
low = packed & 0x0F
```
And 16 centroids (4-bit) give ~97% cosine similarity vs ~95% for 8 centroids (3-bit). Better quality AND nearly double the compression of unpacked 3-bit, with two lines of code.
| Mode | Storage | Compression | Quality |
|---|---|---|---|
| 3-bit unpacked (uint8) | 132 B/block | 1.94x | ~95% cosine |
| 4-bit nibble-packed | 68 B/block | 3.76x | ~97% cosine |
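The packing snippet above can be made a runnable round-trip in NumPy (a sketch; `pack_nibbles` and `unpack_nibbles` are illustrative names, not the repo's API). The only subtle part is restoring the interleaved even/odd order on unpack:

```python
import numpy as np

def pack_nibbles(indices):
    # indices: uint8 array of 4-bit values (0..15), even-length last dim
    return (indices[..., 0::2] << 4) | indices[..., 1::2]

def unpack_nibbles(packed):
    high = packed >> 4
    low = packed & 0x0F
    # Re-interleave so element order matches the original index array
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = high
    out[..., 1::2] = low
    return out

idx = np.random.default_rng(0).integers(0, 16, size=(4, 128), dtype=np.uint8)
packed = pack_nibbles(idx)          # half the bytes of idx
assert np.array_equal(unpack_nibbles(packed), idx)  # lossless round-trip
```

The storage win is exactly the last-dimension halving: 128 one-byte indices become 64 packed bytes per block.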
### 2. FP16 norms fail silently at scale
I initially stored vector norms as float16 to save 2 bytes per vector. It worked fine on short sequences.
Then I tested with an 11,397-token video clip and got:

```
Output: "In the video,1.0 0 0 0 0 0 0 0 0 0 0 0..."
```
The model produced 4 correct tokens ("In the video,") then completely degenerated. The same prompt with 11,385 tokens (12 fewer) worked perfectly.
Root cause: FP16 precision loss (~0.01% per vector) accumulated across 36 transformer layers, shifting attention logits at low-confidence decision boundaries. Token-by-token analysis showed the divergence at step 5 where the logit margin was < 0.5.
Fix: Float32 norms. The 2 extra bytes per vector barely affect the compression ratio (1.97x → 1.94x). No other TurboQuant implementation I've found documents this failure mode.
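The magnitude of that per-vector error is easy to reproduce (a NumPy sketch; the dangerous part, compounding across 36 layers of attention, is not modeled here):

```python
import numpy as np

rng = np.random.default_rng(0)
# 128-dim key vectors, one per token, as in a single attention head
keys = rng.standard_normal((10000, 128)).astype(np.float32)

norms32 = np.linalg.norm(keys, axis=1)                   # stored as float32 (the fix)
norms16 = norms32.astype(np.float16).astype(np.float32)  # stored as float16 (the bug)

rel_err = np.abs(norms16 - norms32) / norms32
print(f"mean relative error: {rel_err.mean():.2e}")  # on the order of 1e-4, i.e. ~0.01%
```

A ~1e-4 relative error looks harmless in isolation; it only matters because every layer's attention logits inherit it, and a single low-margin token decision is enough to derail decoding.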
### 3. The boring optimization wins
I built a fused Triton kernel that computes `Q @ compressed_K^T` directly from nibble-packed indices. It achieves 17.8x speedup on the micro-benchmark by:
- Pre-rotating queries once (`q_rot = Q @ rotation_matrix^T`)
- Having the kernel do simple centroid lookups and dot products
- Eliminating the 128x128 rotation matmul from the inner loop
Sounds great. But when I wired it into all 36 layers of Molmo2, the output degenerated into "pizza pizza pizza pizza."
Root cause: The fused kernel computes in fp32, but the model expects bf16 attention behavior (via SDPA/FlashAttention). The 0.023 per-layer cosine gap between fp32 kernel output and bf16 SDPA output compounds catastrophically across 36 layers.
The fix that actually worked: incremental dequantization. Instead of decompressing the entire 11K-token cache at every layer at every decode step (the 3.36x overhead), decompress only the 1 new token and append it to a running buffer. Standard SDPA handles the attention.
Overhead went from 3.36x to 1.78x. No custom kernels needed.
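The incremental scheme is simple enough to sketch (illustrative only, assuming a per-coordinate scalar codebook; `dequantize_token` and `IncrementalKBuffer` are hypothetical names, not the repo's API):

```python
import numpy as np

def dequantize_token(packed, codebook, norm):
    """Unpack one token's nibble indices and gather centroids.

    packed:   (head_dim // 2,) uint8, two 4-bit indices per byte
    codebook: (16,) float32 centroids (scalar codebook assumed)
    norm:     float32 norm stored alongside the indices
    """
    idx = np.empty(packed.size * 2, dtype=np.int64)
    idx[0::2] = packed >> 4
    idx[1::2] = packed & 0x0F
    return codebook[idx] * norm  # (head_dim,) reconstructed key

class IncrementalKBuffer:
    """Append-only dequantized key buffer: decompress only the newest
    token, never the whole multi-thousand-token cache, then hand the
    running buffer to standard SDPA."""
    def __init__(self):
        self.k = None

    def append(self, packed, codebook, norm):
        new_k = dequantize_token(packed, codebook, norm)[None, :]
        self.k = new_k if self.k is None else np.concatenate([self.k, new_k])
        return self.k  # (seq_len, head_dim)
```

The design point is that per decode step the dequantization cost is O(1) in sequence length, while the full-cache approach pays O(seq_len) at every layer of every step.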
### The fused kernel isn't wasted
The kernel is correct (1.0 cosine similarity on the micro-benchmark) and fast (17.8x). It just needs to be part of a full Flash Attention-style fusion — computing softmax and V multiplication inside the kernel, not just Q@K^T scores. That's a future project.
## Implementation Details
The full implementation is at github.com/Alberto-Codes/turboquant-consumer:
- Core algorithm: Lloyd-Max codebook solver, TurboQuantMSE (Stage 1), TurboQuantProd (Stage 2 with QJL)
- CompressedDynamicCache: Drop-in KV cache wrapper with nibble packing and incremental dequant
- Fused Triton kernel: Nibble unpack + centroid gather + GQA mapping
- Benchmark harness: A/B testing CLI for any HuggingFace model
- 62 tests including long-sequence regression (36 layers, 1024 tokens)
- 5 experiment logs with full results
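For reference, the core of a Lloyd-Max codebook solver fits in a few lines (a 1-D NumPy sketch of the standard alternating update, not the repo's actual solver):

```python
import numpy as np

def lloyd_max(samples, n_levels=16, iters=50):
    # Initialize centroids at evenly spaced quantiles of the data
    c = np.quantile(samples, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(iters):
        # Optimal decision boundaries: midpoints between adjacent centroids
        b = (c[:-1] + c[1:]) / 2
        cell = np.searchsorted(b, samples)
        # Optimal centroids: conditional mean of each cell
        for k in range(n_levels):
            members = samples[cell == k]
            if members.size:
                c[k] = members.mean()
    return c

x = np.random.default_rng(0).standard_normal(10_000)
codebook = lloyd_max(x, n_levels=16)
bounds = (codebook[:-1] + codebook[1:]) / 2
mse = np.mean((x - codebook[np.searchsorted(bounds, x)]) ** 2)
```

For a unit-variance Gaussian, a 16-level (4-bit) Lloyd-Max quantizer lands near the classical optimum of roughly 0.0095 MSE, which is where the ~97% cosine figures come from.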
```bash
# Quick start
git clone https://github.com/Alberto-Codes/turboquant-consumer.git
cd turboquant-consumer && uv sync

# Run tests
uv run pytest tests/ -v

# Benchmark (requires GPU)
uv run python -m turboquant_consumer.benchmark \
  --model allenai/Molmo2-4B --bits 4 --compressed \
  --video /path/to/video.mp4
```
## What's Next
- Molmo2-8B validation — the 8B model recognizes character names
- Flash Attention-style fused kernel — full softmax+V fusion for multi-layer correctness
- vLLM integration — waiting for upstream cache backend API
This is the first TurboQuant implementation validated on a vision-language model with video input. If you're working on KV cache compression, I'd love to hear about your experiences — especially if you've hit the fp16 norms issue.