Google published TurboQuant at ICLR 2026 — a technique that compresses transformer KV caches to 4 bits per coordinate with zero accuracy loss. The paper reports 5-6x memory reduction on H100 GPUs, tested on text models like Gemma and Mistral.
I wanted to know: does it work on a vision-language model processing video? On a consumer GPU?
72 hours later, turboquant-vllm is on PyPI.
## Quick Start
```shell
pip install "turboquant-vllm[vllm]"
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```
That's it. The plugin auto-registers via vLLM's entry point system. No code changes, no forking, no monkey-patching.
For HuggingFace users:
```python
from transformers import DynamicCache
from turboquant_vllm import CompressedDynamicCache

cache = DynamicCache()
compressed = CompressedDynamicCache(cache, head_dim=128, bits=4)
# Pass cache (not the wrapper) to model.generate()
```
## Why Vision-Language Models Matter
Every other TurboQuant implementation tests on text-only models with hundreds of tokens. But a 12-second video clip through Molmo2-4B produces ~11,000 visual tokens — 1.6 GB of KV cache on a 24 GB GPU.
That's 10x more memory, 10x more opportunities for precision bugs to compound across 36 transformer layers. The existing VLM compression literature (VL-Cache, Dynamic-LLaVA, ZipVL) is all token pruning — deciding which tokens to discard. TurboQuant compresses the tokens you keep. They're complementary approaches, and nobody had validated whether vector quantization survives the visual token regime.
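The 1.6 GB figure is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming a grouped-query layout of 8 KV heads (my assumption; the 36 layers and head_dim=128 come from this post) stored in fp16:

```python
# Back-of-the-envelope KV-cache size. 8 KV heads is an assumed value,
# not a confirmed Molmo2-4B architecture detail.
tokens = 11_000               # ~visual tokens for a 12-second clip (from the post)
layers = 36                   # transformer layers (from the post)
kv_heads, head_dim = 8, 128   # assumed grouped-query layout
bytes_fp16 = 2

per_token = layers * 2 * kv_heads * head_dim * bytes_fp16  # K and V per layer
total = tokens * per_token
print(f"{total / 2**30:.2f} GiB")  # ≈ 1.51 GiB, in line with the ~1.6 GB above
```

Under those assumptions the cache lands right in the reported range, so the headline number is plausible rather than inflated.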
It does.
## Results
Molmo2-4B on RTX 4090, 11K visual tokens from a Seinfeld video clip:
| Metric | Baseline | TQ4 Compressed |
|---|---|---|
| KV cache | 1,639 MiB | 435 MiB (3.76x) |
| Output quality | Detailed scene description | Near-identical (100+ tokens match) |
| Decode overhead | — | 1.78x |
Molmo2-8B: same 3.76x ratio, correctly identifies all Seinfeld characters. Full 23-minute episode processed at 24 tok/s.
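The 3.76x ratio (rather than an ideal 4x) is consistent with per-vector metadata overhead. A sketch of the arithmetic, assuming one fp32 scale stored per 128-dim vector (the metadata layout is my assumption, not the package's documented format):

```python
# Why 4-bit quantization yields ~3.76x rather than 4x, assuming each
# head_dim-sized vector carries one fp32 scale alongside its codes.
head_dim = 128
raw_bits = head_dim * 16             # fp16 baseline: 2048 bits per vector
quant_bits = head_dim * 4 + 32       # 4-bit codes + one assumed fp32 scale: 544 bits
print(f"{raw_bits / quant_bits:.2f}x")  # 3.76x, matching the table above
```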
## What I Built Differently

### Plugin, not fork
Other vLLM TurboQuant efforts are forks or monkey-patches. turboquant-vllm uses vLLM's official plugin entry point:
```toml
[project.entry-points."vllm.general_plugins"]
tq4_backend = "turboquant_vllm.vllm:register_tq4_backend"
```
### Incremental dequantization
The naive approach decompresses the full KV cache at every layer, every step — 3.36x overhead. Incremental dequantization decompresses only the 1 new token per step and appends to a running buffer. Overhead drops to 1.78x. This isn't in Google's paper.
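The idea can be sketched with a toy symmetric 4-bit quantizer. This is an illustration only: the real path is fused Triton kernels, and `quant4`/`IncrementalKV` are names I made up, not the package's API.

```python
import numpy as np

# Toy sketch of incremental dequantization; illustrative names, not the real API.
def quant4(v):
    """Symmetric 4-bit quantization with one scale per vector (assumed scheme)."""
    scale = np.abs(v).max() / 7 + 1e-12
    codes = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
    return codes, np.float32(scale)

def dequant4(codes, scale):
    return codes.astype(np.float32) * scale

class IncrementalKV:
    """Keep quantized codes plus a running fp32 buffer for attention.

    Naive: dequantize the whole cache every decode step.
    Incremental: dequantize only the newest token and append.
    """
    def __init__(self, head_dim):
        self.codes, self.scales = [], []
        self.buffer = np.empty((0, head_dim), dtype=np.float32)

    def append(self, v):
        codes, scale = quant4(v)
        self.codes.append(codes)
        self.scales.append(scale)
        # Incremental step: decompress just this one token, not the full cache.
        self.buffer = np.vstack([self.buffer, dequant4(codes, scale)[None, :]])

rng = np.random.default_rng(0)
kv = IncrementalKV(head_dim=4)
vs = rng.standard_normal((3, 4)).astype(np.float32)
for v in vs:
    kv.append(v)
print(kv.buffer.shape)  # (3, 4)
```

Each decode step does O(1) dequantization work instead of O(sequence length), which is where the drop from 3.36x to 1.78x overhead would come from.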
### Cross-platform Triton
Fused kernels run on both NVIDIA CUDA and AMD ROCm without code changes. 84/84 GPU tests pass on a Radeon 890M iGPU.
## Bugs Nobody Else Has Found
- **FP16 norms fail at scale.** Works at 11,385 tokens, garbles output at 11,397. The ~0.01% error per vector compounds across 36 layers. Always use fp32.
- **QJL correction is invisible in standard attention.** The paper's Stage 2 (2-bit MSE + 1-bit QJL) wastes 1 bit of precision: standard `Q @ K^T` can't use the correction. Full 3-bit MSE produces identical output.
- **Multi-layer precision drift in fused kernels.** A 0.023 cosine-similarity gap per layer between fp32 Triton and bf16 SDPA compounds into "pizza pizza pizza" output at 36 layers. Flash Attention-style fusion is needed.
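The fp16-norm failure is easy to reproduce in miniature: compute one head_dim-sized vector's L2 norm with fp16 accumulation versus fp32, against an fp64 reference. This is a standalone illustration, not the package's actual kernel code:

```python
import numpy as np

# Standalone illustration of the fp16-norm bug class; not the real kernels.
x = np.full(128, 0.1)            # one head_dim-sized vector
ref = np.linalg.norm(x)          # fp64 reference

norm16 = np.sqrt(np.sum(x.astype(np.float16) ** 2, dtype=np.float16))
norm32 = np.sqrt(np.sum(x.astype(np.float32) ** 2, dtype=np.float32))

err16 = abs(float(norm16) - ref) / ref   # ~2e-4: the "0.01% per vector" regime
err32 = abs(float(norm32) - ref) / ref   # orders of magnitude smaller
print(err16, err32)
```

One such error is harmless; repeated across 36 layers and thousands of tokens, it is exactly the kind of drift that flips output from coherent to garbled past a token threshold.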
## Validation
- 180+ tests, 9 test files, 95%+ coverage
- 16 GPU experiments with documented failures
- Cross-platform: NVIDIA RTX 4090 + AMD Radeon 890M
- End-to-end: installed from PyPI into a stock `vllm/vllm-openai:latest` container
## What's Next
- Upstream contribution to vLLM (issue #38171, 49 upvotes)
- Full Flash Attention fusion for the fused Triton kernels
- Stacking with VL-Cache-style token pruning for multiplicative VLM savings