João André Gomes Marques

Why attention-aware eviction beats random eviction (with data)

At high eviction rates, choosing which tokens to drop matters enormously. Here is what the numbers show.

The experiment

We ran KV cache eviction at two rates on Llama-3-8B, measuring perplexity degradation (lower is better) versus a full-cache baseline:

| Eviction rate | Importance-based | Random | Advantage |
|---------------|------------------|--------|-----------|
| 70% | +2.59% PPL | +3.86% PPL | 1.27 pp |
| 80% | +3.61% PPL | +5.13% PPL | 1.52 pp |

The gap grows as you evict more. At 70% eviction the importance scorer saves you 1.27 percentage points of perplexity degradation; push to 80% and it saves 1.52 pp. This is not a coincidence.

Why it happens

Random eviction is indiscriminate: it is just as likely to drop the single token that unlocks subject-verb agreement across 400 tokens as it is to drop a filler word. The attention-aware scorer instead assigns each token an importance weight based on how much accumulated attention mass it has received across all heads. Tokens that many heads consistently attend to survive; tokens that nobody looks at get evicted first.
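The scoring idea is simple enough to sketch in a few lines. This is a minimal NumPy illustration of accumulated-attention eviction, not NexusQuant's actual implementation; the function names and shapes are my own assumptions:

```python
import numpy as np

def importance_scores(attn):
    """attn: (num_heads, num_queries, num_keys) attention weights.

    Each key token's importance is the total attention mass it received,
    summed over every head and every query position.
    """
    return attn.sum(axis=(0, 1))

def evict(scores, rate):
    """Return the sorted indices of tokens to KEEP after evicting a
    `rate` fraction of the cache, dropping the lowest-scoring tokens first."""
    n_keep = int(round(len(scores) * (1 - rate)))
    keep = np.argsort(scores)[-n_keep:]  # highest-scoring survivors
    return np.sort(keep)                 # preserve original token order
```

A random policy would replace `np.argsort(scores)` with a shuffled index, which is exactly why it sometimes discards the one token every head is attending to.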

At low eviction rates there is enough slack that random and importance-based look similar. As you push the eviction rate up, the budget gets tight and every dropped token counts. That is when the scorer earns its keep.

Run it yourself

```bash
pip install nexusquant
```
```python
from nexusquant import NexusQuantConfig, apply_nexusquant
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Importance-based eviction at 80%
cfg = NexusQuantConfig(eviction_rate=0.80, eviction_mode="importance")
apply_nexusquant(model, cfg)

# Compare: random eviction at 80%
# (reload the model so the two runs start from the same unpatched weights)
model_rand = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
cfg_rand = NexusQuantConfig(eviction_rate=0.80, eviction_mode="random")
apply_nexusquant(model_rand, cfg_rand)
```

The full benchmark script is in the NexusQuant repo.
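The metric itself is worth spelling out: the "+X% PPL" numbers in the table are the relative increase in perplexity against the full-cache baseline. A minimal sketch of that computation (the helper names are mine, not from the benchmark script):

```python
import math

def perplexity(nlls):
    """Perplexity from a list of per-token negative log-likelihoods."""
    return math.exp(sum(nlls) / len(nlls))

def ppl_degradation(nlls_evicted, nlls_baseline):
    """Relative perplexity increase of the evicted-cache run versus the
    full-cache baseline, as a percentage (the '+3.61% PPL' style numbers)."""
    base = perplexity(nlls_baseline)
    evicted = perplexity(nlls_evicted)
    return 100.0 * (evicted - base) / base
```

The "advantage" column is then just the random policy's degradation minus the importance policy's, in percentage points.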

Takeaway

If you are evicting KV cache tokens, use an attention-aware scorer. At 80% eviction the gap is 1.52 pp, and the trend from 70% to 80% suggests it keeps widening at higher rates. Random eviction is a baseline, not a strategy.


Best regards, João Marques

NexusQuant — unlimited context windows for every AI model.
