## The Problem
If you've run LLMs locally, you know the pain: a 14B model eats 10+ GB just for the KV cache on long prompts. The model weights fit in memory, but the cache — where attention stores every key and value vector for every token — grows linearly with context length and eventually pushes you into swap or OOM.
The standard approach is to quantize the model weights (Q4, Q8), but the KV cache usually goes untouched. It sits there in full FP16 precision, quietly eating 30-50% of your total memory.
## The Paper
Google Research published TurboQuant at ICLR 2026. The core idea is surprisingly elegant:
- Rotate the KV vectors by a random orthogonal matrix — this spreads information uniformly across all coordinates
- Quantize each coordinate independently using precomputed optimal codebooks
- Store the norm separately in FP16
That's it. No training. No calibration data. No model-specific tuning. The same codebooks work for Llama, Qwen, Mistral — anything.
The key insight is that after rotation, each coordinate of a normalized KV vector approximately follows a known Gaussian distribution, N(0, 1/d), where d is the head dimension. Since you know the distribution in advance, you can precompute the optimal Lloyd-Max quantizer offline. This makes the whole scheme data-oblivious — you don't need to see a single token from the model to set up compression.
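A minimal numpy sketch of the idea — rotate, split off the norm, scalar-quantize each coordinate. This is illustrative, not tqai's actual code: a uniform grid over ±3σ stands in for the precomputed Lloyd-Max codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                             # head dimension
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random orthogonal rotation, fixed offline

def compress(v, bits=4):
    """Rotate, store the norm in FP16, quantize each coordinate of the unit vector."""
    r = Q @ v
    norm = np.linalg.norm(r)
    u = r / norm                                    # coordinates ~ N(0, 1/d) after rotation
    sigma = 1.0 / np.sqrt(d)
    # Toy uniform codebook; the real library ships Lloyd-Max tables instead.
    codebook = np.linspace(-3 * sigma, 3 * sigma, 2 ** bits)
    idx = np.argmin(np.abs(u[:, None] - codebook[None, :]), axis=1)
    return idx.astype(np.uint8), np.float16(norm), codebook

def decompress(idx, norm, codebook):
    return Q.T @ (codebook[idx] * np.float32(norm))  # undo the rotation

v = rng.standard_normal(d)
idx, norm, cb = compress(v)
v_hat = decompress(idx, norm, cb)
cos = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))
print(f"cosine similarity at 4 bits: {cos:.4f}")
```

Because the rotation is orthogonal, errors introduced in the rotated space carry over unchanged to the original space, so the per-coordinate quantizer is all that matters.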
### Why not both stages?
The paper actually has two stages. Stage 2 (QJL) adds a 1-bit residual correction for unbiased inner products. We skip it. Independent research found that QJL's variance amplification actually degrades softmax-based attention. Stage 1 alone produces better results for KV cache compression.
## The Library
We turned this into tqai — a pip-installable Python library with two backends (PyTorch and MLX) and a CLI.
### Two lines to compress
```python
import tqai
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# This is the only change
cache = tqai.patch(model, bits_k=4, bits_v=2)

inputs = tokenizer("Explain quantum entanglement:", return_tensors="pt")
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
```
On Apple Silicon with MLX:
```python
import tqai
import mlx_lm

model, tokenizer = mlx_lm.load("mlx-community/Llama-3.1-8B-Instruct-4bit")
tqai.patch(model, bits_k=4, bits_v=2, backend="mlx")

response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement:", max_tokens=200)
```
### Compression numbers
| Config | Avg Bits | Memory Saved | Use Case |
|---|---|---|---|
| K4/V2 | 3.0 | 80% | Production |
| K3/V2 | 2.5 | 84% | Extended context |
| K4/V3 | 3.5 | 78% | Quality-sensitive |
Original KV cache: 16 bits per coordinate (FP16). With K4/V2 and head dimension 128, that works out to 512 bytes per token per head (K + V in FP16) down to 100 bytes (96 bytes of packed 4-bit key and 2-bit value indices, plus two FP16 norms).
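The arithmetic behind those per-token numbers, assuming head dimension 128 and counting one attention head (layer and head counts multiply everything equally):

```python
# Per-token KV footprint for one attention head, head_dim = 128 (assumed).
head_dim = 128

fp16_bytes = head_dim * 2 * 2                # K + V, 2 bytes per coordinate
k4v2_bits = head_dim * 4 + head_dim * 2      # 4-bit keys + 2-bit values
k4v2_bytes = k4v2_bits // 8 + 2 * 2          # packed indices + two FP16 norms

print(fp16_bytes, "->", k4v2_bytes)          # 512 -> 100
```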
### Does it actually work?
We tested across model sizes. The pattern is clear: how well compression holds up tracks model size. Larger models absorb it with no visible loss; smaller ones degrade:
| Model | Baseline | + tqai K4/V2 | + tqai K3/V2 |
|---|---|---|---|
| Qwen 0.5B | Good | Degraded | Poor |
| Qwen 3B | Excellent | Good | Degraded |
| Llama 8B | Excellent | Excellent | Excellent |
| Qwen 14B | Excellent | Excellent | Excellent |
On 8B+ models, the compressed output is indistinguishable from baseline. Here's a real example from Qwen 14B Q4:
Baseline: "particles become interconnected so that the state of one particle cannot be described independently of the state of the others"
K4/V2: "particles become interconnected so that the state of one particle cannot be described without including the state of the other"
K3/V2: "two or more particles become interconnected such that the state of one particle can instantly influence the state of another"
All three are coherent, factually correct, and grammatically clean.
## The CLI
tqai ships with a CLI tool for quick testing:
```bash
# Environment info
tqai info

# Accuracy benchmark (no model needed)
tqai benchmark
# Output:
#   Keys   (4-bit): NMSE=0.009287, SNR=20.3 dB, cosine sim=0.9954
#   Values (2-bit): NMSE=0.115653, SNR=9.4 dB, cosine sim=0.9408

# Generate with compression
tqai run "Explain gravity" -m mlx-community/Llama-3.1-8B-Instruct-4bit

# Side-by-side comparison
tqai compare "Explain gravity" -m mlx-community/Llama-3.1-8B-Instruct-4bit

# Pre-convert for faster startup
tqai convert -m mlx-community/Llama-3.1-8B-Instruct-4bit -o ./llama-tqai/
```
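The benchmark numbers are standard reconstruction metrics. A sketch of how they are computed, with a toy 4-bit uniform quantizer standing in for the real one (this is not tqai's code, just the definitions):

```python
import numpy as np

rng = np.random.default_rng(1)

def metrics(x, x_hat):
    """NMSE, SNR in dB, and cosine similarity between a signal and its reconstruction."""
    nmse = np.mean((x - x_hat) ** 2) / np.mean(x ** 2)
    snr_db = 10 * np.log10(1.0 / nmse)
    cos = np.sum(x * x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))
    return nmse, snr_db, cos

# Toy 4-bit uniform quantizer over +-3 sigma of a standard Gaussian
x = rng.standard_normal(10_000)
levels = np.linspace(-3, 3, 16)
x_hat = levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]

nmse, snr, cos = metrics(x, x_hat)
print(f"NMSE={nmse:.4f}  SNR={snr:.1f} dB  cosine={cos:.4f}")
```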
## Under the Hood
The architecture is intentionally simple:
```
src/tqai/
├── quantizer.py   # PolarQuantizer — the core algorithm (~100 lines)
├── backend/       # PyTorch + MLX abstraction (Protocol-based, ~80 lines each)
├── codebook/      # Precomputed Lloyd-Max codebooks (12 .npz files, ~50KB)
├── cache/         # HuggingFace DynamicCache + mlx-lm KVCache wrappers
├── convert.py     # Offline model conversion
└── cli.py         # CLI tool
```
Backend abstraction: A Python Protocol with ~15 ops (matmul, qr, norm, argmin, etc.). Each backend is ~80 lines. Adding a new backend (JAX, ONNX) means implementing one file.
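A sketch of what such a Protocol looks like — the op names here are illustrative, not tqai's actual interface. Because `Protocol` uses structural typing, a new backend just has to define matching methods; no inheritance required:

```python
import numpy as np
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class ArrayBackend(Protocol):
    """A few of the ~15 ops a backend provides (names are illustrative)."""
    def matmul(self, a: Any, b: Any) -> Any: ...
    def qr(self, a: Any) -> Any: ...
    def norm(self, a: Any) -> Any: ...
    def argmin(self, a: Any, axis: int) -> Any: ...

class NumpyBackend:
    """What a hypothetical third backend file might look like."""
    def matmul(self, a, b): return np.matmul(a, b)
    def qr(self, a): return np.linalg.qr(a)
    def norm(self, a): return np.linalg.norm(a)
    def argmin(self, a, axis): return np.argmin(a, axis=axis)

backend: ArrayBackend = NumpyBackend()    # satisfies the Protocol structurally
print(isinstance(backend, ArrayBackend))  # True
```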
Codebooks: Precomputed for head dimensions 64, 96, 128, 256 at 2/3/4 bits. Shipped as package data. If your model uses an unusual head dim, they're generated at runtime (requires scipy).
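Runtime generation can be sketched as the classic Lloyd iteration (1-D k-means) on Gaussian samples — this is the textbook algorithm, not tqai's implementation, which uses scipy:

```python
import numpy as np

def lloyd_max_gaussian(bits, n=200_000, iters=50, seed=0):
    """Fit a Lloyd-Max codebook to N(0, 1) samples: alternate nearest-level
    assignment and centroid update until the levels settle."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    levels = np.linspace(-2, 2, 2 ** bits)  # initial guess
    for _ in range(iters):
        idx = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return np.sort(levels)

cb = lloyd_max_gaussian(bits=2)
print(np.round(cb, 3))  # close to the known 2-bit optimum: +-0.4528 and +-1.510
```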
No monkey-patching of model code: For HuggingFace, we subclass DynamicCache — the model calls cache.update() as normal, we compress transparently. For MLX, we replace the cache factory.
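The wrapper pattern, sketched against a stub base class — the real code subclasses transformers' `DynamicCache`, and the rounding quantizer here is a crude stand-in for the rotate-and-codebook path:

```python
import numpy as np

class DynamicCache:
    """Stub for transformers' DynamicCache: store K/V per layer, return the full cache."""
    def __init__(self):
        self.k, self.v = {}, {}
    def update(self, key_states, value_states, layer_idx):
        self.k.setdefault(layer_idx, []).append(key_states)
        self.v.setdefault(layer_idx, []).append(value_states)
        return (np.concatenate(self.k[layer_idx], axis=0),
                np.concatenate(self.v[layer_idx], axis=0))

def fake_quantize(x, bits):
    """Stand-in for the real quantizer: round to 2**bits uniform levels."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    return np.round((x - lo) / scale) * scale + lo

class CompressedCache(DynamicCache):
    """Compress on write; the model keeps calling update() as usual."""
    def update(self, key_states, value_states, layer_idx):
        return super().update(fake_quantize(key_states, 4),
                              fake_quantize(value_states, 2),
                              layer_idx)

cache = CompressedCache()
k = np.random.default_rng(2).standard_normal((8, 128)).astype(np.float32)
k_out, _ = cache.update(k, k.copy(), layer_idx=0)
print(np.abs(k_out - k).max() < 0.5)  # True: 4-bit keys stay close to the originals
```

Since the model only ever talks to `update()`, nothing in the attention code changes.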
## Test Suite
179 tests covering:
- Mathematical guarantees: MSE distortion within the paper's theoretical bound (√3π/2 · 4^(−b))
- Attention fidelity: Full softmax(Q@K^T/√d)@V simulation with cosine similarity checks
- Inner product preservation: Correlation and absolute error of Q@K^T
- Edge cases: Zero vectors, extreme values, sparse vectors, high dimensions
- Statistical properties: Unbiasedness, rotation distribution validation
- Cross-backend: Torch and MLX produce equivalent results
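The attention-fidelity check can be sketched like this (toy per-tensor quantizers, not the library's test code): run full attention once with exact K/V and once with quantized K/V, then compare outputs by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 128, 64  # head dim, sequence length
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(q, k, v):
    return softmax(q @ k.T / np.sqrt(d)) @ v

def quantize(x, levels):
    """Nearest-level scalar quantization, levels scaled to the tensor's std."""
    levels = np.asarray(levels) * x.std()
    return levels[np.argmin(np.abs(x[..., None] - levels), axis=-1)]

out = attn(q, k, v)
out_q = attn(q,
             quantize(k, np.linspace(-3, 3, 16)),             # 4-bit keys
             quantize(v, [-1.510, -0.453, 0.453, 1.510]))     # 2-bit Lloyd-Max values
cos = np.sum(out * out_q) / (np.linalg.norm(out) * np.linalg.norm(out_q))
print(f"attention output cosine similarity: {cos:.4f}")  # typically > 0.9
```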
CI runs on both Linux (PyTorch) and macOS (PyTorch + MLX).
## Install
```bash
# Just the library
pip install tqai

# With PyTorch (extras quoted for zsh compatibility)
pip install "tqai[torch]"

# With MLX (Apple Silicon)
pip install "tqai[mlx]"

# macOS via Homebrew
brew install alphawavesystems/tap/tqai
```
## What's Next
- Bit-packing: Currently indices are stored as uint8. Packing to actual 2/3/4 bits would achieve the full theoretical 5-6x compression in memory (not just on disk).
- Triton kernels: Fused decode kernels that compute attention directly on compressed data without dequantizing.
- vLLM adapter: Production serving integration.
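The bit-packing item is a few lines of numpy shift-and-mask; a sketch (not tqai code) for the 2-bit case, packing four indices per byte:

```python
import numpy as np

def pack2(idx):
    """Pack 2-bit indices (values 0..3) four to a byte."""
    idx = idx.reshape(-1, 4)
    return (idx[:, 0] | (idx[:, 1] << 2) | (idx[:, 2] << 4) | (idx[:, 3] << 6)).astype(np.uint8)

def unpack2(packed):
    """Recover the original 2-bit indices by shifting and masking."""
    return ((packed[:, None] >> np.array([0, 2, 4, 6])) & 0b11).reshape(-1)

idx = np.random.default_rng(4).integers(0, 4, size=128, dtype=np.uint8)
packed = pack2(idx)
print(idx.nbytes, "->", packed.nbytes)  # 128 -> 32
```

The round trip is exact; the win is purely in resident memory, since the uint8 representation wastes 6 of every 8 bits at 2-bit precision.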
## Links
- GitHub: AlphaWaveSystems/tqai
- PyPI: pypi.org/project/tqai
- Paper: arXiv:2504.19874 (TurboQuant, Google Research, ICLR 2026)
- Related: PolarQuant (AISTATS 2026), QJL (AAAI 2025)
MIT licensed. 179 tests. Contributions welcome — DCO sign-off required.