The common assumption: longer sequences are harder to compress because there's more information to retain. Our experiments show the opposite.
## The data
Same model (Mistral-7B), same compression method, same eviction rate. Only the prefix length changes:
| Prefix length | 35% eviction | 60% eviction | 80% eviction |
|---|---|---|---|
| 500 tokens | +0.90% PPL | +4.5% PPL | +6.6% PPL |
| 1,600 tokens | +0.14% PPL | +0.82% PPL | +2.1% PPL |
| 3,500 tokens | +0.43% PPL | +1.3% PPL | +2.6% PPL |
At 1,600 tokens, 60% eviction gives +0.82% degradation. At 500 tokens, the same eviction rate gives +4.5%. That's a 5-6x reduction in degradation just from having more context.
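A quick sanity check on the ratio, straight from the 60% eviction column of the table:

```python
# PPL degradation (%) at 60% eviction, taken from the table above
ppl_increase = {500: 4.5, 1600: 0.82, 3500: 1.3}

# Short-context vs. mid-context degradation at the same eviction rate
ratio = ppl_increase[500] / ppl_increase[1600]
print(f"{ratio:.1f}x")  # prints "5.5x" — the "5-6x" claim
```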
## Why this happens
The importance scorer ranks tokens by how much attention they receive from recent positions. At 500 tokens, the scorer has ~30 query positions to aggregate over. The attention distribution is flat — every token looks roughly equally important. Hard to separate signal from noise.
At 1,600 tokens, the scorer has ~100 query positions. Attention concentrates on genuinely important tokens. The scorer can confidently identify which 40% of tokens contribute nothing and safely evict them.
More data → sharper importance estimates → safer eviction.
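The mechanism above can be sketched in a few lines. This is an illustrative implementation, not NexusQuant's actual scorer — the attention tensor layout and the `num_queries` / `evict_frac` parameters are assumptions for the sake of the example:

```python
import numpy as np

def importance_scores(attn, num_queries):
    """Score each key token by the attention it receives from the last
    `num_queries` query positions, averaged over heads and queries.

    attn: [heads, queries, keys] attention weights for one layer.
    """
    recent = attn[:, -num_queries:, :]   # restrict to recent query positions
    return recent.mean(axis=(0, 1))      # -> one score per key token

def evict(attn, num_queries, evict_frac):
    """Return the (sorted) indices of key tokens to KEEP after evicting
    the lowest-scoring `evict_frac` fraction."""
    scores = importance_scores(attn, num_queries)
    n_keep = int(len(scores) * (1 - evict_frac))
    keep = np.argsort(scores)[-n_keep:]  # highest-scoring tokens survive
    return np.sort(keep)                 # preserve original token order

# Averaging over ~100 query positions instead of ~30 gives a less noisy
# score estimate, so the same evict_frac removes fewer useful tokens.
```

The key point the sketch makes concrete: the scorer's confidence comes entirely from the `mean` over query positions, so a longer prefix means more samples per key token and sharper importance estimates.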
## What this means in practice
Production LLM inference typically operates at thousands of tokens, not 500. Short-context PPL benchmarks (512-token windows) systematically understate the compression quality of eviction-based methods.
Our NexusQuant library achieves:
- 10x compression at 500-token context (<1% PPL)
- 17x compression at 1,600-token context (<1% PPL)
- 33x compression at any context length (~2.5% PPL)
The safe operating point shifts with context length. If you're compressing 4K+ token contexts, you can be more aggressive than short-context benchmarks suggest.
```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(long_input_ids, max_new_tokens=512)
```
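One way to act on the shifting operating point is to pick the eviction rate from the prefix length. The helper below is a hypothetical sketch, not part of the NexusQuant API; its thresholds are read off the table above:

```python
def max_safe_eviction(prefix_len, ppl_budget=1.0):
    """Pick an eviction rate that keeps PPL degradation under
    `ppl_budget` (percent), using the table above as a rough guide."""
    if prefix_len < 1000:
        # Short contexts: only 35% eviction stays near +1% PPL (+0.90%)
        return 0.35 if ppl_budget >= 1.0 else 0.0
    # 1,600+ tokens: 60% eviction stays under +1% PPL (+0.82%)
    return 0.60 if ppl_budget >= 1.0 else 0.35
```

In short-context deployments the helper falls back to the conservative 35% rate, while long-context workloads get the aggressive setting the data supports.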
Best regards,
João Marques