Running out of GPU memory at long context lengths? The KV cache grows linearly with sequence length — at 128K tokens, a 7B model accumulates over 60 GB of KV state. That's more than a single A100.
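The "over 60 GB" figure is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a Llama-style full-attention 7B shape (32 layers, 32 heads, head dim 128, FP16); GQA variants would be several times smaller:

```python
def kv_cache_bytes(seq_len, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each heads * head_dim values per token
    return 2 * layers * heads * head_dim * dtype_bytes * seq_len

gb = kv_cache_bytes(128 * 1024) / 1024**3
print(f"{gb:.0f} GB")  # 64 GB at 128K tokens
```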
I built NexusQuant, a library that compresses the KV cache 10-33x at inference time. No training, no calibration data, no model changes.
**Before:**

```python
# OOM at 32K tokens on a 24GB GPU
output = model.generate(input_ids, max_new_tokens=512)
```
**After:**

```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)
```
128K context now fits in the memory that used to hold 7.5K.
## What it does
Six stages, applied once after the prefill pass:
- Rank tokens by attention importance
- Drop the lowest-scoring tokens (token eviction)
- Undo rotary position embeddings on keys
- Apply Hadamard rotation to spread energy uniformly
- Quantize 8-float groups onto the E8 lattice (densest sphere packing in 8D)
- Delta-code consecutive indices and compress with zstd
Eviction reduces token count. Quantization reduces precision. They're nearly orthogonal — 60% eviction (2.5x) combined with 2-bit E8 quantization (~7x) gives ~17x total.
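Since the two reductions act on different axes (token count vs. bits per entry), their ratios multiply. A trivial sketch of that arithmetic, with the ~7x figure taken from the text as the effective 2-bit ratio after metadata overhead:

```python
def combined_ratio(keep_frac, quant_ratio):
    # eviction keeps a fraction of tokens (keep_frac=0.4 -> 2.5x);
    # quantization shrinks each surviving entry independently
    return (1.0 / keep_frac) * quant_ratio

print(combined_ratio(0.4, 7.0))  # 17.5, i.e. ~17x for the "balanced" preset
```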
## Numbers

Measured on Mistral-7B, A100, FP16, all overhead included:

| Preset | Compression | PPL change |
|---|---|---|
| high | 10x | +0.4% |
| balanced | 17x | +1.3% |
| max | 33x | +2.6% |
## How it compares
| Method | Ratio | Quality | Training needed |
|---|---|---|---|
| NexusQuant | 10-33x | +0.4-2.6% | No |
| TurboQuant (Google) | ~5-6x | ~0% | No |
| KVTC (NVIDIA) | up to 20x | <1% | Yes (calibration) |
| CommVQ (Apple) | ~8x | ~0% | Yes (training) |
NexusQuant is the highest-compression training-free method available. Only KVTC achieves comparable ratios, but it needs calibration data.
## Install

```shell
pip install nexusquant
pip install "nexusquant[hf]"  # with transformers
```
Best regards,
João Marques