João André Gomes Marques

Longer contexts are easier to compress (not harder)

The common assumption: longer sequences are harder to compress because there's more information to retain. Our experiments show the opposite.

The data

Same model (Mistral-7B), same compression method, same eviction rate. Only the prefix length changes:

| Prefix length | 35% eviction | 60% eviction | 80% eviction |
|---|---|---|---|
| 500 tokens | +0.90% PPL | +4.5% PPL | +6.6% PPL |
| 1,600 tokens | +0.14% PPL | +0.82% PPL | +2.1% PPL |
| 3,500 tokens | +0.43% PPL | +1.3% PPL | +2.6% PPL |

At 1,600 tokens, 60% eviction gives +0.82% degradation. At 500 tokens, the same eviction rate gives +4.5%. That's a 5-6x quality improvement just from having more context.

Why this happens

The importance scorer ranks tokens by how much attention they receive from recent positions. At 500 tokens, the scorer has ~30 query positions to aggregate over. The attention distribution is flat — every token looks roughly equally important. Hard to separate signal from noise.

At 1,600 tokens, the scorer has ~100 query positions. Attention concentrates on genuinely important tokens. The scorer can confidently identify which 60% of tokens contribute little and safely evict them.

More data → sharper importance estimates → safer eviction.
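The score-then-evict step can be sketched in a few lines of NumPy. This is a hypothetical illustration of attention-based importance scoring, not NexusQuant's actual implementation; the `importance_scores` and `evict_mask` helpers are mine:

```python
import numpy as np

def importance_scores(attn, num_queries):
    """Aggregate attention from the last `num_queries` query positions.

    attn: array of shape (total_queries, seq_len) with attention weights.
    More query positions -> a less noisy per-token importance estimate.
    """
    recent = attn[-num_queries:]   # queries used for scoring
    return recent.mean(axis=0)     # per-token importance, shape (seq_len,)

def evict_mask(scores, evict_frac):
    """Keep the top (1 - evict_frac) fraction of tokens by importance."""
    k = int(len(scores) * (1 - evict_frac))
    keep = np.argsort(scores)[-k:]          # indices of top-k tokens
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask

# Toy demo: 100 query positions over a 1,600-token prefix, 60% eviction.
rng = np.random.default_rng(0)
seq_len = 1600
attn = rng.dirichlet(np.ones(seq_len) * 0.1, size=100)  # peaked attention rows
scores = importance_scores(attn, num_queries=100)
mask = evict_mask(scores, evict_frac=0.6)
print(mask.sum())  # 640 tokens kept out of 1,600
```

With only ~30 query rows (the 500-token case), the averaged scores stay close to uniform and the top-k cut is essentially arbitrary; with ~100 rows, the ranking stabilizes.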

What this means in practice

Production LLM inference typically operates at thousands of tokens, not 500. Short-context PPL benchmarks (512-token windows) systematically understate the compression quality of eviction-based methods.

Our NexusQuant library achieves:

  • 10x compression at 500-token context (<1% PPL)
  • 17x compression at 1,600-token context (<1% PPL)
  • 33x compression at any context length (~2.5% PPL)

The safe operating point shifts with context length. If you're compressing 4K+ token contexts, you can be more aggressive than short-context benchmarks suggest.

```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(long_input_ids, max_new_tokens=512)
```
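If you want to pick the eviction rate yourself, the table above can be turned into a simple budget-based rule of thumb. The thresholds below are illustrative values read off our Mistral-7B numbers, not part of the library API; tune them for your own model and workload:

```python
def max_safe_eviction(prefix_len, ppl_budget=0.01):
    """Pick an eviction rate for a given prefix length and PPL budget.

    Illustrative thresholds from the Mistral-7B table above:
    - 500 tokens:   35% eviction cost +0.90% PPL
    - 1,600 tokens: 60% eviction cost +0.82% PPL
    """
    if prefix_len < 1000:
        # Short contexts: only 35% eviction stays near a 1% budget.
        return 0.35 if ppl_budget >= 0.009 else 0.0
    # 1,600+ tokens: 60% eviction was under 1% PPL in our runs.
    return 0.60 if ppl_budget >= 0.009 else 0.35

print(max_safe_eviction(500))   # 0.35
print(max_safe_eviction(4000))  # 0.6
```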

GitHub | Paper

Best regards,
João Marques
