Alberto Nieto

Posted on • Originally published at alberto.codes

From one model to seven — what it took to make TurboQuant model-portable

A KV cache compression plugin that only works on one model is a demo, not a tool. turboquant-vllm v1.0.0 shipped four days ago with one validated architecture: Molmo2. v1.3.0 validates seven — Llama 3.1, Mistral 7B, Qwen2.5, Phi-3-mini, Phi-4, Gemma-2, and Gemma-3. The path between those two points was more interesting than the destination.

What Changed

Fused paged kernels (v1.2.0). The original architecture decompressed KV cache from TQ4 to FP16 in HBM, then ran standard attention on the result. The new fused kernel reads compressed blocks directly from vLLM's page table, decompresses in SRAM, and computes attention in a single pass. HBM traffic: 1,160 → 136 bytes per token.

```shell
# One flag. Same as before.
vllm serve meta-llama/Llama-3.1-8B --attention-backend CUSTOM
```
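A hedged back-of-envelope using the two per-token figures above shows why this matters at long context (illustrative only; real traffic depends on model shape, page size, and kernel details):

```python
# Per-token HBM traffic figures from the post.
BASELINE_BYTES = 1160  # old path: decompress TQ4 -> FP16 in HBM, then attend
FUSED_BYTES = 136      # fused kernel: decompress in SRAM, single pass

context_len = 16_384   # tokens attended to on one decode step
old_mb = context_len * BASELINE_BYTES / 1e6
new_mb = context_len * FUSED_BYTES / 1e6
print(f"{old_mb:.1f} MB -> {new_mb:.1f} MB read per decode step "
      f"({BASELINE_BYTES / FUSED_BYTES:.1f}x less HBM traffic)")
```

At decode time attention is bandwidth-bound, so an ~8.5x cut in bytes moved translates almost directly into headroom for more concurrent requests.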

Non-pow2 head dimensions (v1.3.0). Triton's tl.arange requires power-of-two ranges. Phi-3-mini has head_dim=96. Gemma has head_dim=256. All five Triton kernels needed pad-to-next-power-of-two with boundary masking. 23 new tests cover the three new dimension classes.
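The padding scheme can be sketched in plain Python standing in for the Triton logic (`next_pow2` and `lane_mask` are illustrative helpers, not the project's API):

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n; tl.arange only accepts pow2 extents."""
    return 1 << (n - 1).bit_length()

def lane_mask(head_dim: int) -> list:
    """Boundary mask: lanes past head_dim are padding and must be masked
    on every load/store so they never touch real memory."""
    padded = next_pow2(head_dim)
    return [i < head_dim for i in range(padded)]

assert next_pow2(96) == 128      # Phi-3-mini pads 96 -> 128
assert next_pow2(256) == 256     # Gemma's 256 is already a power of two
assert sum(lane_mask(96)) == 96  # only 96 of the 128 lanes are live
```

The masked lanes are where the per-kernel differences live: a kernel that reads keys and values must mask both tensors' loads independently, which is why the fix touched all five kernels rather than one shared helper.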

Sliding window attention bypass (v1.3.0). Gemma-2 and Gemma-3 mix global and sliding window attention layers. Compressing SWA layers breaks cache eviction. The fix: SWA layers bypass compression automatically via the is_sliding attribute. Global layers compress normally.
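Conceptually the per-layer dispatch looks like this (a sketch; `is_sliding` is the attribute named above, everything else is a hypothetical stand-in):

```python
def should_compress(layer) -> bool:
    # Sliding-window layers evict old cache blocks as the window advances;
    # compressing them breaks that eviction bookkeeping, so they take the
    # uncompressed path. Global-attention layers compress normally.
    return not getattr(layer, "is_sliding", False)

class Layer:  # stand-in for a vLLM attention layer
    def __init__(self, is_sliding):
        self.is_sliding = is_sliding

# Gemma-2 interleaves global and sliding-window layers.
layers = [Layer(is_sliding=(i % 2 == 1)) for i in range(4)]
print([should_compress(l) for l in layers])  # [True, False, True, False]
```

Keying off an attribute the model config already exposes means new interleaved-attention models get the right behavior without a per-model allowlist.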

Verify CLI. Check any model in thirty seconds:

```shell
python -m turboquant_vllm.verify --model google/gemma-2-2b --bits 4
# PASS — all layers, cosine 0.9951
```
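What the verifier measures, in miniature: round-trip the KV values through 4-bit quantization and compare direction with cosine similarity. This sketch uses a naive per-tensor symmetric quantizer, which is cruder than turboquant's actual TQ4 scheme, so expect a slightly lower score:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def quant_dequant_4bit(xs):
    """Naive symmetric 4-bit round-trip: one scale shared by the whole tensor."""
    scale = max(abs(x) for x in xs) / 7.0
    return [max(-8, min(7, round(x / scale))) * scale for x in xs]

random.seed(0)
kv = [random.gauss(0.0, 1.0) for _ in range(4096)]
print(f"cosine = {cosine(kv, quant_dequant_4bit(kv)):.4f}")
```

A single global scale wastes levels on outliers; per-block scales (the usual production choice) are what push the real kernels past the 0.99 bar.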

Why This Design

The fused kernel architecture was a prerequisite for everything else. Without it, model expansion would have multiplied a slow path — decompress-to-HBM on every decode step across more models means more wasted bandwidth. Fusing first meant each new model gets the fast path for free.

The non-pow2 fix was not a config change. It was a kernel rewrite across five files, each with different padding constraints depending on whether the kernel reads keys, values, or both. The ~5–15% throughput penalty for non-pow2 dimensions is real and documented — but for head_dim=128 models (the majority), it's zero.

The production hotfixes (v1.2.1, v1.2.2) are worth mentioning because they came from container benchmarking, not unit tests. Running TQ4 inside the vLLM container against real video clips surfaced OOM bugs that synthetic tests never would. Both patches landed within 24 hours.

Getting Started

```shell
pip install "turboquant-vllm[vllm]>=1.3.0"
vllm serve meta-llama/Llama-3.1-8B --attention-backend CUSTOM
```

Verify compression quality on any supported model:

```shell
python -m turboquant_vllm.verify --model <model-id> --bits 4
```

Validated models: Molmo2-4B, Llama 3.1 8B, Mistral 7B, Qwen2.5-3B, Phi-3-mini, Phi-4, Gemma-2-2b, Gemma-3-4B-it. All pass at cosine ≥0.99.

Benchmarks:

  • VLM (Molmo2-4B, FP16 baseline): 3.76x KV compression
  • Text-only (Llama 3.1 / Mistral, FP8 baseline): 1.88x KV capacity, lossless at temperature=0
  • At 16K context: 6x concurrent requests vs baseline 3x
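The measured ratio maps to effective bits per cached value with simple arithmetic (the reading of the gap as metadata overhead is my assumption, not from the post):

```python
FP16_BITS = 16
ratio = 3.76  # measured KV compression vs FP16 (Molmo2-4B)

effective_bits = FP16_BITS / ratio
print(f"{effective_bits:.2f} effective bits/value")
# ~4.26: the gap above an ideal 4.0 is presumably per-block scale metadata
```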

What's Next

  • Upstream vLLM contribution (vllm#38171 — 49 upvotes)
  • Flash Attention kernel fusion for multi-layer correctness
  • VL-Cache stacking for multiplicative VLM savings

PyPI | Docs | GitHub
