## The Problem
If you've run LLMs locally, you know the pain: a 14B model eats 10+ GB just for the KV cache on long prompts. The model weights fit in memory, but the cache — where attention stores every key and value vector for every token — grows linearly with context length and eventually pushes you into swap or OOM.
The standard approach is to quantize the model weights (Q4, Q8), but the KV cache usually goes untouched. It sits there in full FP16 precision, quietly eating 30-50% of your total memory.
## The Paper
Google Research published TurboQuant at ICLR 2026. The core idea is surprisingly elegant:
- Rotate the KV vectors by a random orthogonal matrix — this spreads information uniformly across all coordinates
- Quantize each coordinate independently using precomputed optimal codebooks
- Store the norm separately in FP16
That's it. No training. No calibration data. No model-specific tuning. The same codebooks work for Llama, Qwen, Mistral — anything.
The key insight is that after rotation, each coordinate of a normalized KV vector approximately follows a known Gaussian distribution, N(0, 1/d), where d is the head dimension. Since you know the distribution in advance, you can precompute the optimal Lloyd-Max quantizer offline. This makes the whole scheme data-oblivious — you don't need to see a single token from the model to set up compression.
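A minimal numpy sketch of the idea — rotate, split off the norm, scalar-quantize each coordinate. This is illustrative, not tqai's actual code: a uniform grid over ±3σ stands in for the precomputed Lloyd-Max codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                             # head dimension
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random orthogonal rotation, fixed offline

def compress(v, bits=4):
    """Rotate, store the norm in FP16, quantize each coordinate of the unit vector."""
    r = Q @ v
    norm = np.linalg.norm(r)
    u = r / norm                                    # coordinates ~ N(0, 1/d) after rotation
    sigma = 1.0 / np.sqrt(d)
    # Toy uniform codebook; the real library ships Lloyd-Max tables instead.
    codebook = np.linspace(-3 * sigma, 3 * sigma, 2 ** bits)
    idx = np.argmin(np.abs(u[:, None] - codebook[None, :]), axis=1)
    return idx.astype(np.uint8), np.float16(norm), codebook

def decompress(idx, norm, codebook):
    return Q.T @ (codebook[idx] * np.float32(norm))  # undo the rotation

v = rng.standard_normal(d)
idx, norm, cb = compress(v)
v_hat = decompress(idx, norm, cb)
cos = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))
print(f"cosine similarity at 4 bits: {cos:.4f}")
```

Because the rotation is orthogonal, errors introduced in the rotated space carry over unchanged to the original space, so the per-coordinate quantizer is all that matters.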
### Why not both stages?
The paper actually has two stages. Stage 2 (QJL) adds a 1-bit residual correction for unbiased inner products. We skip it. Independent research found that QJL's variance amplification actually degrades softmax-based attention. Stage 1 alone produces better results for KV cache compression.
## The Library
We turned this into tqai — a pip-installable Python library with two backends (PyTorch and MLX) and a CLI.
### Two lines to compress
```python
import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# This is the only change
cache = tqai.patch(model, bits_k=4, bits_v=2)

inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt")
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
```
On Apple Silicon with MLX:
```python
import tqai
import mlx_lm

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")

response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
```
### Compression numbers
| Config | Avg Bits | Memory Saved | Use Case |
|---|---|---|---|
| K4/V2 | 3.0 | 80% | Production |
| K3/V2 | 2.5 | 84% | Extended context |
| K4/V3 | 3.5 | 78% | Quality-sensitive |
Original KV cache: 16 bits per coordinate (FP16). With K4/V2 and head dimension 128, that works out to 512 bytes per token per head (K + V in FP16) down to 100 bytes (96 bytes of packed 4-bit key and 2-bit value indices, plus two FP16 norms).
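The arithmetic behind those per-token numbers, assuming head dimension 128 and counting one attention head (layer and head counts multiply everything equally):

```python
# Per-token KV footprint for one attention head, head_dim = 128 (assumed).
head_dim = 128

fp16_bytes = head_dim * 2 * 2                # K + V, 2 bytes per coordinate
k4v2_bits = head_dim * 4 + head_dim * 2      # 4-bit keys + 2-bit values
k4v2_bytes = k4v2_bits // 8 + 2 * 2          # packed indices + two FP16 norms

print(fp16_bytes, "->", k4v2_bytes)          # 512 -> 100
```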
### Does it actually work?
We tested across model sizes. The pattern is clear: how well compression holds up tracks model size. Larger models absorb it with no visible loss; smaller ones degrade:
| Model | Baseline | + tqai K4/V2 | + tqai K3/V2 |
|---|---|---|---|
| Qwen 0.5B | Good | Degraded | Poor |
| Qwen 3B | Excellent | Good | Degraded |
| Llama 8B | Excellent | Excellent | Excellent |
| Qwen 14B | Excellent | Excellent | Excellent |
On 8B+ models, the compressed output is indistinguishable from baseline. Here's a real example from Qwen 14B Q4:
Baseline: "particles become interconnected so that the state of one particle cannot be described independently of the state of the others"
K4/V2: "particles become interconnected so that the state of one particle cannot be described without including the state of the other"
K3/V2: "two or more particles become interconnected such that the state of one particle can instantly influence the state of another"
All three are coherent, factually correct, and grammatically clean.
## The CLI
tqai ships with a CLI tool for quick testing:
```bash
# Environment info
tqai info

# Accuracy benchmark (no model needed)
tqai benchmark
# Output:
#   Keys   (4-bit): NMSE=0.009287, SNR=20.3 dB, cosine sim=0.9954
#   Values (2-bit): NMSE=0.115653, SNR=9.4 dB, cosine sim=0.9408

# Generate with compression
tqai run "Explain gravity" -m mlx-community/Llama-3.1-8B-Instruct-4bit

# Side-by-side comparison
tqai compare "Explain gravity" -m mlx-community/Llama-3.1-8B-Instruct-4bit

# Pre-convert for faster startup
tqai convert -m mlx-community/Llama-3.1-8B-Instruct-4bit -o ./llama-tqai/
```
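The benchmark numbers are standard reconstruction metrics. A sketch of how they are computed, with a toy 4-bit uniform quantizer standing in for the real one (this is not tqai's code, just the definitions):

```python
import numpy as np

rng = np.random.default_rng(1)

def metrics(x, x_hat):
    """NMSE, SNR in dB, and cosine similarity between a signal and its reconstruction."""
    nmse = np.mean((x - x_hat) ** 2) / np.mean(x ** 2)
    snr_db = 10 * np.log10(1.0 / nmse)
    cos = np.sum(x * x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))
    return nmse, snr_db, cos

# Toy 4-bit uniform quantizer over +-3 sigma of a standard Gaussian
x = rng.standard_normal(10_000)
levels = np.linspace(-3, 3, 16)
x_hat = levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]

nmse, snr, cos = metrics(x, x_hat)
print(f"NMSE={nmse:.4f}  SNR={snr:.1f} dB  cosine={cos:.4f}")
```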
## Under the Hood
The architecture is intentionally simple:
```
src/tqai/
├── quantizer.py   # PolarQuantizer — the core algorithm (~100 lines)
├── backend/       # PyTorch + MLX abstraction (Protocol-based, ~80 lines each)
├── codebook/      # Precomputed Lloyd-Max codebooks (12 .npz files, ~50KB)
├── cache/         # HuggingFace DynamicCache + mlx-lm KVCache wrappers
├── convert.py     # Offline model conversion
└── cli.py         # CLI tool
```
Backend abstraction: A Python Protocol with ~15 ops (matmul, qr, norm, argmin, etc.). Each backend is ~80 lines. Adding a new backend (JAX, ONNX) means implementing one file.
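A sketch of what such a Protocol looks like — the op names here are illustrative, not tqai's actual interface. Because `Protocol` uses structural typing, a new backend just has to define matching methods; no inheritance required:

```python
import numpy as np
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class ArrayBackend(Protocol):
    """A few of the ~15 ops a backend provides (names are illustrative)."""
    def matmul(self, a: Any, b: Any) -> Any: ...
    def qr(self, a: Any) -> Any: ...
    def norm(self, a: Any) -> Any: ...
    def argmin(self, a: Any, axis: int) -> Any: ...

class NumpyBackend:
    """What a hypothetical third backend file might look like."""
    def matmul(self, a, b): return np.matmul(a, b)
    def qr(self, a): return np.linalg.qr(a)
    def norm(self, a): return np.linalg.norm(a)
    def argmin(self, a, axis): return np.argmin(a, axis=axis)

backend: ArrayBackend = NumpyBackend()    # satisfies the Protocol structurally
print(isinstance(backend, ArrayBackend))  # True
```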
Codebooks: Precomputed for head dimensions 64, 96, 128, 256 at 2/3/4 bits. Shipped as package data. If your model uses an unusual head dim, they're generated at runtime (requires scipy).
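Runtime generation can be sketched as the classic Lloyd iteration (1-D k-means) on Gaussian samples — this is the textbook algorithm, not tqai's implementation, which uses scipy:

```python
import numpy as np

def lloyd_max_gaussian(bits, n=200_000, iters=50, seed=0):
    """Fit a Lloyd-Max codebook to N(0, 1) samples: alternate nearest-level
    assignment and centroid update until the levels settle."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    levels = np.linspace(-2, 2, 2 ** bits)  # initial guess
    for _ in range(iters):
        idx = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return np.sort(levels)

cb = lloyd_max_gaussian(bits=2)
print(np.round(cb, 3))  # close to the known 2-bit optimum: +-0.4528 and +-1.510
```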
No monkey-patching of model code: For HuggingFace, we subclass DynamicCache — the model calls cache.update() as normal, we compress transparently. For MLX, we replace the cache factory.
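The wrapper pattern, sketched against a stub base class — the real code subclasses transformers' `DynamicCache`, and the rounding quantizer here is a crude stand-in for the rotate-and-codebook path:

```python
import numpy as np

class DynamicCache:
    """Stub for transformers' DynamicCache: store K/V per layer, return the full cache."""
    def __init__(self):
        self.k, self.v = {}, {}
    def update(self, key_states, value_states, layer_idx):
        self.k.setdefault(layer_idx, []).append(key_states)
        self.v.setdefault(layer_idx, []).append(value_states)
        return (np.concatenate(self.k[layer_idx], axis=0),
                np.concatenate(self.v[layer_idx], axis=0))

def fake_quantize(x, bits):
    """Stand-in for the real quantizer: round to 2**bits uniform levels."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    return np.round((x - lo) / scale) * scale + lo

class CompressedCache(DynamicCache):
    """Compress on write; the model keeps calling update() as usual."""
    def update(self, key_states, value_states, layer_idx):
        return super().update(fake_quantize(key_states, 4),
                              fake_quantize(value_states, 2),
                              layer_idx)

cache = CompressedCache()
k = np.random.default_rng(2).standard_normal((8, 128)).astype(np.float32)
k_out, _ = cache.update(k, k.copy(), layer_idx=0)
print(np.abs(k_out - k).max() < 0.5)  # True: 4-bit keys stay close to the originals
```

Since the model only ever talks to `update()`, nothing in the attention code changes.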
## Test Suite
179 tests covering:
- Mathematical guarantees: MSE distortion within the paper's theoretical bound (√3π/2 · 4^(−b))
- Attention fidelity: Full softmax(Q@K^T/√d)@V simulation with cosine similarity checks
- Inner product preservation: Correlation and absolute error of Q@K^T
- Edge cases: Zero vectors, extreme values, sparse vectors, high dimensions
- Statistical properties: Unbiasedness, rotation distribution validation
- Cross-backend: Torch and MLX produce equivalent results
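The attention-fidelity check can be sketched like this (toy per-tensor quantizers, not the library's test code): run full attention once with exact K/V and once with quantized K/V, then compare outputs by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 128, 64  # head dim, sequence length
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(q, k, v):
    return softmax(q @ k.T / np.sqrt(d)) @ v

def quantize(x, levels):
    """Nearest-level scalar quantization, levels scaled to the tensor's std."""
    levels = np.asarray(levels) * x.std()
    return levels[np.argmin(np.abs(x[..., None] - levels), axis=-1)]

out = attn(q, k, v)
out_q = attn(q,
             quantize(k, np.linspace(-3, 3, 16)),             # 4-bit keys
             quantize(v, [-1.510, -0.453, 0.453, 1.510]))     # 2-bit Lloyd-Max values
cos = np.sum(out * out_q) / (np.linalg.norm(out) * np.linalg.norm(out_q))
print(f"attention output cosine similarity: {cos:.4f}")  # typically > 0.9
```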
CI runs on both Linux (PyTorch) and macOS (PyTorch + MLX).
## Install
```bash
# Just the library
pip install tqai

# With PyTorch (extras quoted for zsh compatibility)
pip install "tqai[torch]"

# With MLX (Apple Silicon)
pip install "tqai[mlx]"

# macOS via Homebrew
brew install alphawavesystems/tap/tqai
```
## What's Next
- Bit-packing: Currently indices are stored as uint8. Packing to actual 2/3/4 bits would achieve the full theoretical 5-6x compression in memory (not just on disk).
- Triton kernels: Fused decode kernels that compute attention directly on compressed data without dequantizing.
- vLLM adapter: Production serving integration.
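The bit-packing item is a few lines of numpy shift-and-mask; a sketch (not tqai code) for the 2-bit case, packing four indices per byte:

```python
import numpy as np

def pack2(idx):
    """Pack 2-bit indices (values 0..3) four to a byte."""
    idx = idx.reshape(-1, 4)
    return (idx[:, 0] | (idx[:, 1] << 2) | (idx[:, 2] << 4) | (idx[:, 3] << 6)).astype(np.uint8)

def unpack2(packed):
    """Recover the original 2-bit indices by shifting and masking."""
    return ((packed[:, None] >> np.array([0, 2, 4, 6])) & 0b11).reshape(-1)

idx = np.random.default_rng(4).integers(0, 4, size=128, dtype=np.uint8)
packed = pack2(idx)
print(idx.nbytes, "->", packed.nbytes)  # 128 -> 32
```

The round trip is exact; the win is purely in resident memory, since the uint8 representation wastes 6 of every 8 bits at 2-bit precision.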
## Links
- GitHub: AlphaWaveSystems/tqai
- PyPI: pypi.org/project/tqai
- Paper: arXiv:2504.19874 (TurboQuant, Google Research, ICLR 2026)
- Related: PolarQuant (AISTATS 2026), QJL (AAAI 2025)
MIT licensed. 179 tests. Contributions welcome — DCO sign-off required.