Google published TurboQuant at ICLR 2026 — a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral.
I wanted to know: does it work on a vision-language model processing video? On a consumer GPU?
72 hours later, turboquant-vllm is on PyPI.
## Quick Start
```shell
pip install "turboquant-vllm[vllm]"
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```
That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.
For HuggingFace users:
```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not the wrapper) to model.generate()
```
## Why Vision-Language Models Matter
Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second video clip through Molmo2-4B produces ~11,000 visual tokens — 1.6 GB of KV cache on a 24 GB GPU.
That's 10x more memory, 10x more opportunities for precision bugs to compound across 36 transformer layers. The existing VLM compression literature (VL-Cache, Dynamic-LLaVA, ZipVL) is all token pruning — deciding which tokens to discard. TurboQuant compresses the tokens you keep. They're complementary approaches, and nobody had validated whether vector quantization survives the visual token regime.
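The 1.6 GB figure is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming a grouped-query layout of 8 KV heads (my assumption; the 36 layers and head_dim=128 come from this post) stored in fp16:

```python
# Back-of-the-envelope KV-cache size. 8 KV heads is an assumed value,
# not a confirmed Molmo2-4B architecture detail.
tokens = 11_000               # ~visual tokens for a 12-second clip (from the post)
layers = 36                   # transformer layers (from the post)
kv_heads, head_dim = 8, 128   # assumed grouped-query layout
bytes_fp16 = 2

per_token = layers * 2 * kv_heads * head_dim * bytes_fp16  # K and V per layer
total = tokens * per_token
print(f"{total / 2**30:.2f} GiB")  # ≈ 1.51 GiB, in line with the ~1.6 GB above
```

Under those assumptions the cache lands right in the reported range, so the headline number is plausible rather than inflated.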
It does.
## Results
Molmo2-4B on RTX 4090, 11K visual tokens from a Seinfeld video clip:
| Metric | Baseline | TQ4 Compressed |
|---|---|---|
| KV cache | 1,639 MiB | 435 MiB (3.76x) |
| Output quality | Detailed scene description | Near-identical (100+ tokens match) |
| Decode overhead | — | 1.78x |
Molmo2-8B: same 3.76x ratio, correctly identifies all Seinfeld characters. Full 23-minute episode processed at 24 tok/s.
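The 3.76x ratio (rather than an ideal 4x) is consistent with per-vector metadata overhead. A sketch of the arithmetic, assuming one fp32 scale stored per 128-dim vector (the metadata layout is my assumption, not the package's documented format):

```python
# Why 4-bit quantization yields ~3.76x rather than 4x, assuming each
# head_dim-sized vector carries one fp32 scale alongside its codes.
head_dim = 128
raw_bits = head_dim * 16             # fp16 baseline: 2048 bits per vector
quant_bits = head_dim * 4 + 32       # 4-bit codes + one assumed fp32 scale: 544 bits
print(f"{raw_bits / quant_bits:.2f}x")  # 3.76x, matching the table above
```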
## What I Built Differently

### Plugin, not fork
Other vLLM TurboQuant efforts are forks or monkey-patches. turboquant-vllm uses vLLM's official plugin entry point:
```toml
[project.entry-points."vllm.general_plugins"]
tq4_backend = "turboquant_vllm.vllm:register_tq4_backend"
```
### Incremental dequantization
The naive approach decompresses the full KV cache at every layer, every step — 3.36x overhead. Incremental dequantization decompresses only the 1 new token per step and appends to a running buffer. Overhead drops to 1.78x. This isn't in Google's paper.
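The idea can be sketched with a toy symmetric 4-bit quantizer. This is an illustration only: the real path is fused Triton kernels, and `quant4`/`IncrementalKV` are names I made up, not the package's API.

```python
import numpy as np

# Toy sketch of incremental dequantization; illustrative names, not the real API.
def quant4(v):
    """Symmetric 4-bit quantization with one scale per vector (assumed scheme)."""
    scale = np.abs(v).max() / 7 + 1e-12
    codes = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
    return codes, np.float32(scale)

def dequant4(codes, scale):
    return codes.astype(np.float32) * scale

class IncrementalKV:
    """Keep quantized codes plus a running fp32 buffer for attention.

    Naive: dequantize the whole cache every decode step.
    Incremental: dequantize only the newest token and append.
    """
    def __init__(self, head_dim):
        self.codes, self.scales = [], []
        self.buffer = np.empty((0, head_dim), dtype=np.float32)

    def append(self, v):
        codes, scale = quant4(v)
        self.codes.append(codes)
        self.scales.append(scale)
        # Incremental step: decompress just this one token, not the full cache.
        self.buffer = np.vstack([self.buffer, dequant4(codes, scale)[None, :]])

rng = np.random.default_rng(0)
kv = IncrementalKV(head_dim=4)
vs = rng.standard_normal((3, 4)).astype(np.float32)
for v in vs:
    kv.append(v)
print(kv.buffer.shape)  # (3, 4)
```

Each decode step does O(1) dequantization work instead of O(sequence length), which is where the drop from 3.36x to 1.78x overhead would come from.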
### Cross-platform Triton
Fused kernels run on both NVIDIA CUDA and AMD ROCm without code changes. 84/84 GPU tests pass on a Radeon 890M iGPU.
## Bugs Nobody Else Has Found
- **FP16 norms fail at scale.** Works at 11,385 tokens, garbles output at 11,397. The ~0.01% error per vector compounds across 36 layers. Always use fp32.
- **QJL correction is invisible in standard attention.** The paper's Stage 2 (2-bit MSE + 1-bit QJL) wastes 1 bit of precision: standard `Q @ K^T` can't use the correction. Full 3-bit MSE produces identical output.
- **Multi-layer precision drift in fused kernels.** A 0.023 cosine-similarity gap per layer between fp32 Triton and bf16 SDPA compounds into "pizza pizza pizza" output at 36 layers. Flash Attention-style fusion is needed.
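The fp16-norm failure is easy to reproduce in miniature: compute one head_dim-sized vector's L2 norm with fp16 accumulation versus fp32, against an fp64 reference. This is a standalone illustration, not the package's actual kernel code:

```python
import numpy as np

# Standalone illustration of the fp16-norm bug class; not the real kernels.
x = np.full(128, 0.1)            # one head_dim-sized vector
ref = np.linalg.norm(x)          # fp64 reference

norm16 = np.sqrt(np.sum(x.astype(np.float16) ** 2, dtype=np.float16))
norm32 = np.sqrt(np.sum(x.astype(np.float32) ** 2, dtype=np.float32))

err16 = abs(float(norm16) - ref) / ref   # ~2e-4: the "0.01% per vector" regime
err32 = abs(float(norm32) - ref) / ref   # orders of magnitude smaller
print(err16, err32)
```

One such error is harmless; repeated across 36 layers and thousands of tokens, it is exactly the kind of drift that flips output from coherent to garbled past a token threshold.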
## Validation
- 180+ tests, 9 test files, 95%+ coverage
- 16 GPU experiments with documented failures
- Cross-platform: NVIDIA RTX 4090 + AMD Radeon 890M
- End-to-end: installed from PyPI into a stock `vllm/vllm-openai:latest` container
## What's Next
- Upstream contribution to vLLM (issue #38171, 49 upvotes)
- Full Flash Attention fusion for the fused Triton kernels
- Stacking with VL-Cache-style token pruning for multiplicative VLM savings