Running out of GPU memory at long context lengths? The KV cache grows linearly with sequence length — at 128K tokens, a 7B model accumulates over 60 GB of KV state. That's more than a single A100.
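The "over 60 GB" figure is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a Llama-style full-attention 7B shape (32 layers, 32 heads, head dim 128, FP16); GQA variants would be several times smaller:

```python
def kv_cache_bytes(seq_len, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each heads * head_dim values per token
    return 2 * layers * heads * head_dim * dtype_bytes * seq_len

gb = kv_cache_bytes(128 * 1024) / 1024**3
print(f"{gb:.0f} GB")  # 64 GB at 128K tokens
```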
I built NexusQuant, a library that compresses the KV cache 10-33x at inference time. No training, no calibration data, no model changes.
**Before:**

```python
# OOM at 32K tokens on a 24GB GPU
output = model.generate(input_ids, max_new_tokens=512)
```
**After:**

```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)
```
128K context now fits in the memory that used to hold 7.5K.
## What it does
Six stages, applied once after the prefill pass:
- Rank tokens by attention importance
- Drop the lowest-scoring tokens (token eviction)
- Undo rotary position embeddings on keys
- Apply Hadamard rotation to spread energy uniformly
- Quantize 8-float groups onto the E8 lattice (densest sphere packing in 8D)
- Delta-code consecutive indices and compress with zstd
Eviction reduces token count. Quantization reduces precision. They're nearly orthogonal — 60% eviction (2.5x) combined with 2-bit E8 quantization (~7x) gives ~17x total.
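Since the two reductions act on different axes (token count vs. bits per entry), their ratios multiply. A trivial sketch of that arithmetic, with the ~7x figure taken from the text as the effective 2-bit ratio after metadata overhead:

```python
def combined_ratio(keep_frac, quant_ratio):
    # eviction keeps a fraction of tokens (keep_frac=0.4 -> 2.5x);
    # quantization shrinks each surviving entry independently
    return (1.0 / keep_frac) * quant_ratio

print(combined_ratio(0.4, 7.0))  # 17.5, i.e. ~17x for the "balanced" preset
```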
## Numbers

Measured on Mistral-7B, A100, FP16, all overhead included:

| Preset | Compression | PPL change |
|---|---|---|
| high | 10x | +0.4% |
| balanced | 17x | +1.3% |
| max | 33x | +2.6% |
## How it compares
| Method | Ratio | Quality | Training needed |
|---|---|---|---|
| NexusQuant | 10-33x | +0.4-2.6% | No |
| TurboQuant (Google) | ~5-6x | ~0% | No |
| KVTC (NVIDIA) | up to 20x | <1% | Yes (calibration) |
| CommVQ (Apple) | ~8x | ~0% | Yes (training) |
NexusQuant is the highest-compression training-free method available. Only KVTC achieves comparable ratios, but it needs calibration data.
## Install

```shell
pip install nexusquant
pip install "nexusquant[hf]"  # with transformers
```
Best regards,
João Marques