The common assumption: longer sequences are harder to compress because there's more information to retain. Our experiments show the opposite.
## The data
Same model (Mistral-7B), same compression method, same eviction rate. Only the prefix length changes:
| Prefix length | 35% eviction | 60% eviction | 80% eviction |
|---|---|---|---|
| 500 tokens | +0.90% PPL | +4.5% PPL | +6.6% PPL |
| 1,600 tokens | +0.14% PPL | +0.82% PPL | +2.1% PPL |
| 3,500 tokens | +0.43% PPL | +1.3% PPL | +2.6% PPL |
At 1,600 tokens, 60% eviction gives +0.82% degradation. At 500 tokens, the same eviction rate gives +4.5%. That's a 5-6x reduction in degradation just from having more context.
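A quick sanity check on the ratio, straight from the 60% eviction column of the table:

```python
# PPL degradation (%) at 60% eviction, taken from the table above
ppl_increase = {500: 4.5, 1600: 0.82, 3500: 1.3}

# Short-context vs. mid-context degradation at the same eviction rate
ratio = ppl_increase[500] / ppl_increase[1600]
print(f"{ratio:.1f}x")  # prints "5.5x" — the "5-6x" claim
```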
## Why this happens
The importance scorer ranks tokens by how much attention they receive from recent positions. At 500 tokens, the scorer has ~30 query positions to aggregate over. The attention distribution is flat — every token looks roughly equally important. Hard to separate signal from noise.
At 1,600 tokens, the scorer has ~100 query positions. Attention concentrates on genuinely important tokens. The scorer can confidently identify which 40% of tokens contribute nothing and safely evict them.
More data → sharper importance estimates → safer eviction.
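The mechanism above can be sketched in a few lines. This is an illustrative implementation, not NexusQuant's actual scorer — the attention tensor layout and the `num_queries` / `evict_frac` parameters are assumptions for the sake of the example:

```python
import numpy as np

def importance_scores(attn, num_queries):
    """Score each key token by the attention it receives from the last
    `num_queries` query positions, averaged over heads and queries.

    attn: [heads, queries, keys] attention weights for one layer.
    """
    recent = attn[:, -num_queries:, :]   # restrict to recent query positions
    return recent.mean(axis=(0, 1))      # -> one score per key token

def evict(attn, num_queries, evict_frac):
    """Return the (sorted) indices of key tokens to KEEP after evicting
    the lowest-scoring `evict_frac` fraction."""
    scores = importance_scores(attn, num_queries)
    n_keep = int(len(scores) * (1 - evict_frac))
    keep = np.argsort(scores)[-n_keep:]  # highest-scoring tokens survive
    return np.sort(keep)                 # preserve original token order

# Averaging over ~100 query positions instead of ~30 gives a less noisy
# score estimate, so the same evict_frac removes fewer useful tokens.
```

The key point the sketch makes concrete: the scorer's confidence comes entirely from the `mean` over query positions, so a longer prefix means more samples per key token and sharper importance estimates.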
## What this means in practice
Production LLM inference typically operates at thousands of tokens, not 500. Short-context PPL benchmarks (512-token windows) systematically understate the compression quality of eviction-based methods.
Our NexusQuant library achieves:
- 10x compression at 500-token context (<1% PPL)
- 17x compression at 1,600-token context (<1% PPL)
- 33x compression at any context length (~2.5% PPL)
The safe operating point shifts with context length. If you're compressing 4K+ token contexts, you can be more aggressive than short-context benchmarks suggest.
```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(long_input_ids, max_new_tokens=512)
```
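One way to act on the shifting operating point is to pick the eviction rate from the prefix length. The helper below is a hypothetical sketch, not part of the NexusQuant API; its thresholds are read off the table above:

```python
def max_safe_eviction(prefix_len, ppl_budget=1.0):
    """Pick an eviction rate that keeps PPL degradation under
    `ppl_budget` (percent), using the table above as a rough guide."""
    if prefix_len < 1000:
        # Short contexts: only 35% eviction stays near +1% PPL (+0.90%)
        return 0.35 if ppl_budget >= 1.0 else 0.0
    # 1,600+ tokens: 60% eviction stays under +1% PPL (+0.82%)
    return 0.60 if ppl_budget >= 1.0 else 0.35
```

In short-context deployments the helper falls back to the conservative 35% rate, while long-context workloads get the aggressive setting the data supports.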
Best regards,
João Marques