When you build a KV cache compression system and plan to publish a paper, you face a choice: present the best-looking numbers, or present all of them.
We chose all of them. This post is every benchmark result we have, including the ones that did not work.
## The pipeline
Quick context. NexusQuant compresses the KV cache of transformer models at inference time, training-free:
Prefill → Key-Key Attention Score → Evict → RoPE-remove → Hadamard → 2-bit E8 VQ → Temporal Delta → zstd
The context manager API:

```python
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=200)
```
All numbers below are from an A10G GPU (24 GB). Perplexity delta is measured against the uncompressed baseline on the same passages.
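For clarity, the perplexity delta reported everywhere below is the relative change against the uncompressed baseline, in percent. A minimal sketch (the helper name is ours, not part of NexusQuant):

```python
def ppl_delta_pct(ppl_compressed: float, ppl_baseline: float) -> float:
    """Relative perplexity change vs. the uncompressed baseline, in percent."""
    return (ppl_compressed - ppl_baseline) / ppl_baseline * 100.0

# Example: baseline PPL 8.00, compressed PPL 8.07 -> +0.875%
print(round(ppl_delta_pct(8.07, 8.00), 3))  # → 0.875
```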
## Mistral-7B: the full picture
These are our numbers at different prefix lengths and eviction rates. Every row is real.
| Prefix | Evict% | Compression | PPL Delta | Verdict |
|---|---|---|---|---|
| 500 tok | 35% | 10.1x | +0.90% | Usable for most tasks |
| 1664 tok | 35% | 10.4x | +0.14% | Near-lossless |
| 1664 tok | 60% | 16.6x | +0.82% | Strong at long context |
| 1664 tok | 80% | 32.7x | +2.13% | Maximum compression |
| 2924 tok | 35% | 10.5x | +1.50% | Text-dependent |
| 2924 tok | 60%+ | — | +42%+ | CATASTROPHIC |
The 3K catastrophe is real and we are not hiding it. At 2924-token prefixes, evicting 60% or more of KV tokens causes perplexity to blow up to +42% and beyond. This is not a scorer bug. We tried four different attention-score variants; all fail at this eviction rate for this prefix length. It is a fundamental capacity loss: the model simply cannot attend to the right tokens when that much context is gone.
If you are deploying NexusQuant on prompts longer than ~2K tokens, use quality="conservative" (35% eviction maximum).
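In practice that rule is easy to automate. A hypothetical guard (the `pick_quality` helper is ours; only the preset names come from the NexusQuant API):

```python
def pick_quality(prompt_len_tokens: int) -> str:
    """Pick a NexusQuant quality preset from the prompt length (illustrative helper)."""
    # Beyond ~2K tokens, eviction rates above 35% collapse (the 3K catastrophe),
    # so fall back to the "conservative" preset (35% eviction maximum) there.
    return "conservative" if prompt_len_tokens > 2000 else "balanced"

print(pick_quality(1664))  # → balanced
print(pick_quality(2924))  # → conservative
```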
## Llama-3-8B: the surprise
We validated on a second architecture to make sure Mistral results were not flukes. What we found was unexpected.
| Config | Compression | PPL Delta | Notes |
|---|---|---|---|
| 2-bit, no evict | 6.71x | -1.20% | Lower PPL with compression |
| 2-bit + 35% evict | 10.25x | -1.47% | Better than baseline |
| 2-bit + 60% evict | 16.48x | -1.35% | Still better |
| 2-bit + 80% evict | 32.45x | -0.61% | Still better |
All four configurations have negative perplexity delta. Compression is making the model perform better than the uncompressed baseline on the wikitext-2 distribution.
Our hypothesis: Llama-3-8B uses grouped-query attention (GQA). The compression and eviction act as a regularizer that suppresses noise in the attention pattern. The GQA architecture appears to have redundancy in its KV cache that quantization removes beneficially. This is a standalone publishable finding that we did not anticipate.
## Domain sensitivity
Not all text compresses equally. We tested three text domains at 500-token prefixes on Mistral-7B:
| Domain | 35% evict | 70% evict | 80% evict |
|---|---|---|---|
| Academic | +0.39% | +4.81% | +6.58% |
| Technical | +0.90% | +3.87% | +6.09% |
| Creative/narrative | +2.48% | +4.62% | +4.73% |
Academic text (dense factual prose, repetitive structure) compresses best. Creative/narrative text is the hardest — at 35% eviction you already lose 2.48%, compared to 0.39% on academic text.
If you are building a RAG system over structured documents, the 35% eviction preset works well. If you are doing creative writing assistance or summarising fiction, use quality="lossless" (no eviction) or test your specific domain before deploying.
## Downstream task evaluation (Mistral-7B, 5 tasks)
Perplexity is a proxy. We ran five QA-style downstream tasks with and without compression.
| Task type | 10x (35% evict) | 16x (60% evict) |
|---|---|---|
| Factual recall | MATCH | MATCH |
| Single-hop reasoning | MATCH | PARTIAL |
| Multi-detail extraction | PARTIAL | PARTIAL |
"MATCH" means the compressed model gave the same answer as the uncompressed model. "PARTIAL" means it got the main point but missed a specific detail.
Factual recall is fully preserved at 10x. Nuanced multi-detail questions lose some specificity at both compression levels. Single-hop reasoning holds at 10x but starts slipping at 16x.
## What failed (12 approaches we killed)
We tried plenty of things that did not work. Here are six of the most instructive failures:
Variable-rate E8 (per-group water-filling): Allocate more bits to high-energy groups. Result: +3.67% PPL at matched rate — 3.3x worse than fixed-rate. Energy does not equal importance.
Cross-layer KV prediction: Predict layer L+1 from layer L. Result: 0.0 bits saved. Layers project to orthogonal subspaces.
L2 norm as token importance proxy: Use L2 norm to select which tokens to keep. Result: +207% PPL. L2 norm and attention importance are completely different quantities.
Strict E8 parity enforcement: Fix our half-integer parity to be mathematically correct E8. Result: 0.3-0.4% worse quality. The relaxed parity acts as dithering on sub-Gaussian KV data.
Token zeroing for eviction: Set evicted KV to zero. Result: catastrophic attention-mass stealing. A zeroed key gives logit Q · K = 0, and exp(0) = 1 in the softmax, so evicted tokens keep soaking up attention weight. Fix: mask evicted positions with -inf instead.
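A toy softmax makes the mass-stealing concrete (a plain-Python sketch, not the NexusQuant kernel):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three keys; the last one was "evicted".
zeroed = softmax([2.0, 1.0, 0.0])            # zeroed key -> logit 0, exp(0) = 1
masked = softmax([2.0, 1.0, float("-inf")])  # -inf mask -> exactly zero weight

print(round(zeroed[-1], 3))  # the evicted token still holds ~9% of the mass
print(masked[-1])            # → 0.0
```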
Eviction on small models (TinyLlama 1.1B): 80-90% eviction rate. Result: +322% PPL at 1K tokens, +1195% at 2K. TinyLlama's top 10% of tokens capture only 28% of attention mass vs. 65-88% on Mistral-7B. Eviction only works when attention is sharp enough.
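The "sharp enough attention" criterion above is cheap to check before deploying. A sketch of the metric (the function and the toy distributions are ours; the 28% vs. 65-88% figures are from the experiments above):

```python
def top_fraction_mass(attn_weights, frac=0.10):
    """Share of total attention mass held by the top `frac` of tokens."""
    ranked = sorted(attn_weights, reverse=True)
    k = max(1, int(len(ranked) * frac))
    return sum(ranked[:k]) / sum(ranked)

sharp = [0.5, 0.3, 0.1] + [0.1 / 7] * 7  # peaked attention (Mistral-like)
flat = [0.1] * 10                        # diffuse attention (TinyLlama-like)

print(top_fraction_mass(sharp))  # top 10% of tokens carry half the mass
print(top_fraction_mass(flat))   # top 10% carry only ~10% -> do not evict
```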
## Honest compression numbers
We made an early mistake: reporting compression ratios without scale overhead. Here is the honest table.
| Config | Naive ratio | Real ratio (with FP16 scales) |
|---|---|---|
| E8 3-bit | 5.3x | 3.2x |
| E8 3-bit + zstd | — | 4.0x |
| 2-bit + 35% evict | — | 10.1x |
| 2-bit + 60% evict | — | 16.6x |
The 16-bit scale per 8-element group adds 2 bits/element overhead. At 3 bits/element stored, the real cost is 5 bits/element, not 3. We got burned by ignoring this early on. Now every ratio is measured with torch.cuda.memory_allocated() before and after.
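The accounting is simple enough to sanity-check in a few lines (helper name is ours):

```python
def real_bits_per_element(code_bits, group_size=8, scale_bits=16):
    """Stored bits per element once the per-group FP16 scale is counted."""
    return code_bits + scale_bits / group_size

bpe = real_bits_per_element(3)  # 3-bit codes + 16-bit scale per 8 elements
print(bpe)                      # → 5.0 bits/element
print(round(16 / bpe, 1))       # → 3.2 (ratio vs. an FP16 cache, matching the table)
```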
## The number that does not exist
One number appeared in an early draft of our paper that cannot be traced to any experiment log: +0.339% at 8x for Mistral-7B. Our validated 50-passage result for that config is +0.47%. We do not know where +0.339% came from. It is not in any research file. It is gone from the paper.
This kind of thing happens. You run an experiment, write a number down, run more experiments, and the earlier number stops being reproducible. If you cannot trace a number to a specific run with a specific config, it does not belong in a paper.
## What is still missing
For full honesty:
- No LongBench results yet (needs multi-hour GPU run on proper dataset)
- No latency benchmarks (CPU-bound compression loop, needs Triton kernels)
- No 16K+ context validation
- Downstream evals above are small-N; full MMLU/GSM8K runs in progress
These are gaps we are filling. The numbers above are what we have validated.
Best regards,
João Marques