DEV Community

João André Gomes Marques

Posted on • Originally published at github.com

Compress your LLM's KV cache 33x with zero training

Running out of GPU memory at long context lengths? The KV cache grows linearly with sequence length: at 128K tokens, a 7B model accumulates over 60 GB of KV state, more than a 40 GB A100 can hold.
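The 60 GB figure is easy to sanity-check. A back-of-the-envelope sketch, assuming a 7B-class model with 32 layers and 32 KV heads of dimension 128 (full multi-head attention rather than grouped-query, which is what makes the cache this large):

```python
# Rough KV cache size for a 7B-class transformer at 128K context.
# Assumed shape: 32 layers, 32 KV heads, head_dim 128, FP16 storage.
layers, kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2          # FP16
seq_len = 128 * 1024         # 128K tokens

# Both keys and values are cached, hence the leading factor of 2.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
kv_gb = kv_bytes / 1024**3

print(f"{kv_gb:.0f} GiB")    # 64 GiB, in line with the "over 60 GB" figure
```

Models that use grouped-query attention (fewer KV heads) shrink this proportionally, but the linear growth with sequence length is the same.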

I built NexusQuant, a library that compresses the KV cache 10-33x at inference time. No training, no calibration data, no model changes.

Before

# OOM at 32K tokens on a 24GB GPU
output = model.generate(input_ids, max_new_tokens=512)

After

from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)

128K context now fits in the memory that used to hold 7.5K.

What it does

Six stages, applied once after the prefill pass:

  1. Rank tokens by attention importance
  2. Drop the lowest-scoring tokens (token eviction)
  3. Undo rotary position embeddings on keys
  4. Apply Hadamard rotation to spread energy uniformly
  5. Quantize 8-float groups onto the E8 lattice (densest sphere packing in 8D)
  6. Delta-code consecutive indices and compress with zstd
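Step 5 is the interesting one. E8 is the densest lattice sphere packing in eight dimensions, and its nearest-point search is cheap because E8 is the union of D8 (integer vectors with even coordinate sum) and D8 shifted by 1/2 in every coordinate. A minimal nearest-point sketch of that search, my own illustration rather than NexusQuant's internals (scale handling and codebook indexing omitted):

```python
import numpy as np

def closest_d8(x):
    """Nearest point in D8: integer vectors whose coordinates sum to an even number."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is odd: re-round the coordinate with the largest rounding
        # error in the other direction to restore an even sum.
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] >= f[i] else -1.0
    return f

def closest_e8(x):
    """Nearest point in E8 = D8 union (D8 + 1/2): take the closer of the two candidates."""
    a = closest_d8(x)
    b = closest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

def quantize_groups(v, scale=1.0):
    """Snap a key/value vector to the E8 lattice in groups of 8 floats."""
    groups = v.reshape(-1, 8) / scale
    return np.stack([closest_e8(g) for g in groups]) * scale
```

Two rounding passes and a distance comparison per group of 8 values, which is why the whole pipeline can run training-free at inference time.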

Eviction reduces token count; quantization reduces precision. The two are nearly orthogonal, so their ratios multiply: 60% eviction (keeping 40% of tokens, 2.5x) combined with 2-bit E8 quantization (~7x) gives ~17x total.
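The multiplication is worth spelling out, since it explains the "balanced" preset's ratio. A quick sanity check, with the numbers taken from the paragraph above:

```python
# Eviction: drop 60% of tokens, i.e. keep 40% of them.
eviction_ratio = 1 / 0.40          # 2.5x fewer tokens

# Quantization: the ~7x figure corresponds to FP16 (16 bits) dropping to
# roughly 2.3 effective bits per value once index/metadata overhead is
# included (my inference from the stated ratio, not a documented number).
quant_ratio = 7.0

total = eviction_ratio * quant_ratio
print(f"{total:.1f}x")             # 17.5x, matching the ~17x claim
```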

Numbers

Measured on Mistral-7B, A100, FP16, all overhead included:

Preset     Compression   PPL change
high       10x           +0.4%
balanced   17x           +1.3%
max        33x           +2.6%

How it compares

Method                Ratio       PPL change   Training needed
NexusQuant            10-33x      +0.4-2.6%    No
TurboQuant (Google)   ~5-6x       ~0%          No
KVTC (NVIDIA)         up to 20x   <1%          Yes (calibration)
CommVQ (Apple)        ~8x         ~0%          Yes (training)

NexusQuant is the highest-compression training-free method available. Only KVTC achieves comparable ratios, but it needs calibration data.

Install

pip install nexusquant
pip install "nexusquant[hf]"  # with transformers

GitHub repo | Paper (PDF)

Best regards,
João Marques
