João André Gomes Marques
NexusQuant benchmarks: every number, honestly

When you build a KV cache compression system and plan to publish a paper, you face a choice: present the best-looking numbers, or present all of them.

We chose all of them. This post is every benchmark result we have, including the ones that did not work.


The pipeline

Quick context. NexusQuant compresses the KV cache of transformer models at inference time, training-free:

```
Prefill → Key-Key Attention Score → Evict → RoPE-remove → Hadamard → 2-bit E8 VQ → Temporal Delta → zstd
```

The context manager API:

```python
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=200)
```

All numbers below are from an A10G GPU (24 GB). Perplexity delta is measured against the uncompressed baseline on the same passages.
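For reference, the perplexity delta is just the percent change in exp(mean NLL) between the compressed and uncompressed runs. A minimal pure-Python sketch (`perplexity` and `ppl_delta_pct` are illustrative helper names, not part of the NexusQuant API):

```python
import math

def perplexity(nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(nlls) / len(nlls))

def ppl_delta_pct(baseline_nlls, compressed_nlls):
    """Percent change in perplexity vs. the uncompressed baseline.
    Positive = compression hurt; negative = it helped."""
    base = perplexity(baseline_nlls)
    comp = perplexity(compressed_nlls)
    return (comp - base) / base * 100.0
```

Both runs score the same passages, so the only variable is the compressed cache.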


Mistral-7B: the full picture

These are our numbers at different prefix lengths and eviction rates. Every row is real.

| Prefix | Evict % | Compression | PPL delta | Verdict |
|---|---|---|---|---|
| 500 tok | 35% | 10.1x | +0.90% | Usable for most tasks |
| 1664 tok | 35% | 10.4x | +0.14% | Near-lossless |
| 1664 tok | 60% | 16.6x | +0.82% | Strong at long context |
| 1664 tok | 80% | 32.7x | +2.13% | Maximum compression |
| 2924 tok | 35% | 10.5x | +1.50% | Text-dependent |
| 2924 tok | 60%+ | n/a | +42%+ | CATASTROPHIC |

The 3K catastrophe is real and we are not hiding it. At 2924-token prefixes, evicting 60% or more of KV tokens causes perplexity to blow up to +42% and beyond. This is not a scorer bug. We tried four different attention-score variants; all fail at this eviction rate for this prefix length. It is a fundamental capacity loss: the model simply cannot attend to the right tokens when that much context is gone.

If you are deploying NexusQuant on prompts longer than ~2K tokens, use quality="conservative" (35% eviction maximum).
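That rule is simple enough to encode directly in the calling code. A sketch (the `pick_quality` helper and the exact 2000-token cutoff are our assumptions; the preset names match the `nexusquant_evict` API shown earlier):

```python
def pick_quality(prompt_tokens: int) -> str:
    """Cap eviction at 35% ("conservative") for long prompts.

    The 3K catastrophe above shows 60%+ eviction blowing up
    perplexity once the prefix approaches ~3K tokens.
    """
    return "conservative" if prompt_tokens > 2000 else "balanced"
```

Then `with nexusquant_evict(model, quality=pick_quality(len(input_ids[0]))):` keeps long prompts in the safe regime.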


Llama-3-8B: the surprise

We validated on a second architecture to make sure Mistral results were not flukes. What we found was unexpected.

| Config | Compression | PPL delta | Notes |
|---|---|---|---|
| 2-bit, no evict | 6.71x | -1.20% | Lower PPL with compression |
| 2-bit + 35% evict | 10.25x | -1.47% | Better than baseline |
| 2-bit + 60% evict | 16.48x | -1.35% | Still better |
| 2-bit + 80% evict | 32.45x | -0.61% | Still better |

All four configurations have a negative perplexity delta: compression makes the model perform better than the uncompressed baseline on the WikiText-2 distribution.

Our hypothesis: Llama-3-8B uses grouped-query attention (GQA). The compression and eviction act as a regularizer that suppresses noise in the attention pattern. The GQA architecture appears to have redundancy in its KV cache that quantization removes beneficially. This is a standalone publishable finding that we did not anticipate.


Domain sensitivity

Not all text compresses equally. We tested three text domains at 500-token prefixes on Mistral-7B:

| Domain | 35% evict | 70% evict | 80% evict |
|---|---|---|---|
| Academic | +0.39% | +4.81% | +6.58% |
| Technical | +0.90% | +3.87% | +6.09% |
| Creative/narrative | +2.48% | +4.62% | +4.73% |

Academic text (dense factual prose, repetitive structure) compresses best. Creative/narrative text is the hardest — at 35% eviction you already lose 2.48%, compared to 0.39% on academic text.

If you are building a RAG system over structured documents, the 35% eviction preset works well. If you are doing creative writing assistance or summarizing fiction, use quality="lossless" (no eviction) or test your specific domain before deploying.
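One way to apply the table above: treat your acceptable perplexity loss as a budget and check the measured per-domain deltas against it. A sketch using the 35%-eviction column (`safe_for_35_evict` is a hypothetical helper, not part of NexusQuant):

```python
# 35%-eviction PPL deltas (%) from the 500-token Mistral-7B table above
DOMAIN_DELTA_35 = {
    "academic": 0.39,
    "technical": 0.90,
    "creative": 2.48,
}

def safe_for_35_evict(domain: str, budget_pct: float = 1.0) -> bool:
    """True if 35% eviction stays inside your perplexity budget."""
    return DOMAIN_DELTA_35[domain] <= budget_pct
```

For a domain not in the table, the honest move is to measure your own delta first rather than interpolate.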


Downstream task evaluation (Mistral-7B, 5 tasks)

Perplexity is a proxy. We ran five QA-style downstream tasks with and without compression.

| Task type | 10x (35% evict) | 16x (60% evict) |
|---|---|---|
| Factual recall | MATCH | MATCH |
| Single-hop reasoning | MATCH | PARTIAL |
| Multi-detail extraction | PARTIAL | PARTIAL |

"MATCH" means the compressed model gave the same answer as the uncompressed model. "PARTIAL" means it got the main point but missed a specific detail.

Factual recall is fully preserved at 10x. Nuanced multi-detail questions lose some specificity at both compression levels. Single-hop reasoning holds at 10x but starts slipping at 16x.
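A MATCH/PARTIAL grading rule can be approximated mechanically. A rough sketch (the `grade` function and its 0.5 token-overlap threshold are illustrative assumptions, not necessarily the exact rubric we used):

```python
def grade(baseline_answer: str, compressed_answer: str) -> str:
    """MATCH = identical answer; PARTIAL = main point preserved
    (majority token overlap) but details lost; MISS = otherwise."""
    b = set(baseline_answer.lower().split())
    c = set(compressed_answer.lower().split())
    if b == c:
        return "MATCH"
    overlap = len(b & c) / len(b)
    return "PARTIAL" if overlap >= 0.5 else "MISS"
```

Note the comparison is against the uncompressed model's answer, not a gold label, matching how the table above is defined.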


What failed (12 approaches we killed)

We killed twelve approaches along the way. Here are the six most instructive:

Variable-rate E8 (per-group water-filling): Allocate more bits to high-energy groups. Result: +3.67% PPL at matched rate — 3.3x worse than fixed-rate. Energy does not equal importance.

Cross-layer KV prediction: Predict layer L+1 from layer L. Result: 0.0 bits saved. Layers project to orthogonal subspaces.

L2 norm as token importance proxy: Use L2 norm to select which tokens to keep. Result: +207% PPL. L2 norm and attention importance are completely different quantities.

Strict E8 parity enforcement: Fix our half-integer parity to be mathematically correct E8. Result: 0.3-0.4% worse quality. The relaxed parity acts as dithering on sub-Gaussian KV data.

Token zeroing for eviction: Set evicted KV to zero. Result: catastrophic attention mass stealing. exp(Q @ 0) = exp(0) = 1 in softmax; evicted tokens steal attention weight. Fix: -inf masking.
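The attention-mass stealing is easy to reproduce numerically: a zeroed KV entry still contributes exp(0) = 1 to the softmax denominator, while a -inf logit contributes exactly 0. A minimal sketch, independent of NexusQuant:

```python
import math

def softmax(xs):
    m = max(xs)  # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Two kept tokens with logits 2.0 and 1.0, plus one evicted token.
kept = [2.0, 1.0]

# Zeroing the evicted KV gives it logit q . 0 = 0, so exp(0) = 1
# still competes in the softmax and steals attention mass:
zeroed = softmax(kept + [0.0])   # evicted token keeps ~9% of the mass

# Masking with -inf drives its weight to exactly zero:
masked = softmax(kept + [float("-inf")])
```

The stolen mass grows with the number of evicted tokens, which is why zeroing was catastrophic at high eviction rates.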

Eviction on small models (TinyLlama 1.1B): 80-90% eviction rate. Result: +322% PPL at 1K tokens, +1195% at 2K. TinyLlama's top 10% of tokens capture only 28% of attention mass vs. 65-88% on Mistral-7B. Eviction only works when attention is sharp enough.
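The sharpness criterion from the TinyLlama failure can be checked cheaply before enabling eviction on a new model. A sketch (`top_k_attention_mass` is a hypothetical helper; any go/no-go threshold would come from your own calibration):

```python
def top_k_attention_mass(weights, frac=0.10):
    """Fraction of total attention mass captured by the top `frac`
    of tokens. Eviction is only safe when attention is sharp, i.e.
    this fraction is high (65-88% on Mistral-7B vs. 28% on TinyLlama)."""
    k = max(1, int(len(weights) * frac))
    top = sorted(weights, reverse=True)[:k]
    return sum(top) / sum(weights)
```

Averaging this over layers and heads on a few calibration prompts gives a quick read on whether a model is in the eviction-friendly regime.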


Honest compression numbers

We made an early mistake: reporting compression ratios without scale overhead. Here is the honest table.

| Config | Naive ratio | Real ratio (with FP16 scales) |
|---|---|---|
| E8 3-bit | 5.3x | 3.2x |
| E8 3-bit + zstd | | 4.0x |
| 2-bit + 35% evict | | 10.1x |
| 2-bit + 60% evict | | 16.6x |

The 16-bit scale per 8-element group adds 2 bits/element overhead. At 3 bits/element stored, the real cost is 5 bits/element, not 3. We got burned by ignoring this early on. Now every ratio is measured with torch.cuda.memory_allocated() before and after.
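The overhead arithmetic is worth writing down explicitly. A sketch (assuming one FP16 scale per 8-element group against an FP16 baseline, as described above):

```python
def real_ratio(code_bits, group_size=8, scale_bits=16, baseline_bits=16):
    """Effective compression ratio once scale overhead is counted.

    Each group of `group_size` elements carries one scale of
    `scale_bits`, adding scale_bits / group_size bits per element.
    """
    bits_per_elem = code_bits + scale_bits / group_size
    return baseline_bits / bits_per_elem

print(real_ratio(3))  # 3.2: a naive "5.3x" 3-bit code really costs 5 bits/element
```

Eviction multiplies the ratio further, which is how the 2-bit configs reach 10.1x and 16.6x despite the same per-element scale cost.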


The number that does not exist

One number appeared in an early draft of our paper that cannot be traced to any experiment log: +0.339% at 8x for Mistral-7B. Our validated 50-passage result for that config is +0.47%. We do not know where +0.339% came from. It is not in any research file. It is gone from the paper.

This kind of thing happens. You run an experiment, write a number down, run more experiments, and the earlier number stops being reproducible. If you cannot trace a number to a specific run with a specific config, it does not belong in a paper.


What is still missing

For full honesty:

  • No LongBench results yet (needs multi-hour GPU run on proper dataset)
  • No latency benchmarks (CPU-bound compression loop, needs Triton kernels)
  • No 16K+ context validation
  • Downstream evals above are small-N; full MMLU/GSM8K runs in progress

These are gaps we are filling. The numbers above are what we have validated.


Best regards, João Marques
