At high eviction rates, choosing which tokens to drop matters enormously. Here is what the numbers show.
## The experiment
We ran KV cache eviction at two rates on Llama-3-8B, measuring perplexity degradation (lower is better) versus a full-cache baseline:
| Eviction rate | Importance-based | Random | Advantage |
|---|---|---|---|
| 70% | +2.59% PPL | +3.86% PPL | 1.27 pp |
| 80% | +3.61% PPL | +5.13% PPL | 1.52 pp |
The gap grows as you evict more. At 70% eviction the importance scorer saves you 1.27 percentage points of perplexity. Push to 80% and it saves 1.52 pp. This is not a coincidence.
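The "Advantage" column is just the gap between the two degradation figures, in percentage points. A quick check of the arithmetic, using the values from the table above:

```python
# Perplexity degradation vs. the full-cache baseline (percent), from the table
results = {
    0.70: {"importance": 2.59, "random": 3.86},
    0.80: {"importance": 3.61, "random": 5.13},
}

for rate, ppl in results.items():
    advantage = ppl["random"] - ppl["importance"]  # percentage points
    print(f"{rate:.0%} eviction: importance beats random by {advantage:.2f} pp")
```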
## Why it happens
Random eviction is memoryless — it has the same probability of dropping the single token that unlocks subject-verb agreement across 400 tokens as it does of dropping a filler word. The attention-aware scorer assigns each token an importance weight based on how much accumulated attention mass it has received across all heads. Tokens that many heads consistently attend to survive; tokens that nobody looks at get evicted first.
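To make "accumulated attention mass" concrete, here is a minimal sketch (not the NexusQuant implementation; the shapes and function name are my assumptions) that scores each cached key token by the total attention it has received, summed over heads and query positions:

```python
import numpy as np

def importance_scores(attn: np.ndarray) -> np.ndarray:
    """Score each cached key/value token by accumulated attention mass.

    attn: attention weights of shape (num_heads, num_queries, num_keys),
          where each (head, query) row sums to 1 (softmax output).
    Returns one score per key token; tokens many heads consistently
    attend to score high, tokens nobody looks at score near zero.
    """
    return attn.sum(axis=(0, 1))

# Toy example: 2 heads, 3 query positions, 4 cached tokens
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 4))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

scores = importance_scores(attn)
print(scores)        # one accumulated-mass score per cached token
print(scores.sum())  # total mass = num_heads * num_queries = 6.0
```

In a real decoder the attention weights would be accumulated across decoding steps rather than computed once, but the scoring idea is the same.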
At low eviction rates there is enough slack that random and importance-based look similar. As you push the eviction rate up, the budget gets tight and every dropped token counts. That is when the scorer earns its keep.
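Given such scores, the eviction step at a fixed budget reduces to keeping the top-scoring fraction of tokens. A sketch of the selection (the helper name is mine, not part of NexusQuant's API):

```python
import numpy as np

def keep_indices(scores: np.ndarray, eviction_rate: float) -> np.ndarray:
    """Return the indices of cached tokens that survive eviction.

    At eviction_rate=0.80 only the top 20% of tokens by score are kept.
    Indices come back in their original positional order so the
    surviving cache stays chronologically consistent.
    """
    n_keep = max(1, int(round(len(scores) * (1 - eviction_rate))))
    top = np.argpartition(scores, -n_keep)[-n_keep:]  # O(n) top-k selection
    return np.sort(top)

scores = np.array([0.1, 2.3, 0.4, 1.9, 0.05, 0.7, 3.1, 0.2, 0.9, 0.3])
print(keep_indices(scores, eviction_rate=0.80))  # → [1 6]: the two highest scorers
```

Random eviction replaces the `argpartition` call with a uniform sample of `n_keep` indices, which is exactly why it has the same chance of dropping a load-bearing token as a filler word.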
## Run it yourself

```bash
pip install nexusquant
```

```python
from nexusquant import NexusQuantConfig, apply_nexusquant
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Importance-based eviction at 80%
cfg = NexusQuantConfig(eviction_rate=0.80, eviction_mode="importance")
apply_nexusquant(model, cfg)

# Compare: random eviction at 80% (load a fresh model so the two
# configurations don't stack on the same instance)
model_rand = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
cfg_rand = NexusQuantConfig(eviction_rate=0.80, eviction_mode="random")
apply_nexusquant(model_rand, cfg_rand)
```
The full benchmark script is in the NexusQuant repo.
## Takeaway
If you are evicting KV cache tokens, use an attention-aware scorer. At 80% eviction the gap is already 1.52 pp, and the trend in the table suggests it keeps widening as the budget tightens. Random eviction is a baseline, not a strategy.
Best regards, João Marques
NexusQuant — unlimited context windows for every AI model.