Drop the first token from a transformer's KV cache during streaming and perplexity doesn't degrade gracefully — it detonates. The model that was fluent at token 4,000 starts emitting garbage the moment you evict position 0, even though position 0 was a meaningless <BOS> marker the prompt never referenced. This is the attention sink phenomenon, and if you run long-context inference with any kind of KV-cache eviction, it is probably the bug you don't know you have.
TL;DR
- An attention sink is a token (usually the first one, often <BOS>) that absorbs a large, near-constant share of attention weight across most heads and layers, not because it's semantically important, but because softmax has to put its mass somewhere.
- Softmax over attention scores must sum to 1. When a head has nothing relevant to attend to, it dumps that mandatory probability mass onto a stable, always-visible token. The first token is the obvious dumping ground.
- Sliding-window KV-cache eviction breaks models because it eventually evicts the sink. Once the sink is gone, the displaced attention mass scatters onto content tokens and corrupts every downstream representation — perplexity jumps by orders of magnitude.
- StreamingLLM's fix: keep the first ~4 tokens pinned in the cache permanently as sinks, and slide the window over everything after. This stabilizes generation across millions of tokens with no fine-tuning.
- Newer models are trained with dedicated sink tokens or learned "attention register" slots, which makes the sink explicit and the eviction-safe window cleaner to reason about.
What is an attention sink?
An attention sink is a position that consistently receives disproportionate attention weight regardless of content. If you log the attention matrices of a decoder-only transformer (Llama, Mistral, GPT-class, Claude-class — the behavior is architecture-general), you'll see the same pattern: after the first couple of layers, a huge fraction of every query's attention lands on token 0. Often 30–50%+ of the mass, head after head, layer after layer.
The token at position 0 is usually <BOS> or whatever the tokenizer prepends. It carries almost no information. Yet the model leans on it harder than on the actual content. Removing it is not like removing a low-value token — it removes the model's pressure-release valve.
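One way to look for the pattern in your own attention logs is to measure how much of each query's mass lands on the first key positions. The helper below is a minimal sketch: `sink_share` and the toy matrix are illustrative names and made-up numbers, not part of any library or taken from a real model.

```python
def sink_share(attn, n_sink=1):
    """Average fraction of attention mass that queries place on the
    first n_sink key positions.

    attn: causal attention weights for one head, as a list of rows;
          row i holds the weights of query i over keys 0..i.
    Queries at positions < n_sink are skipped: they can only see sink
    positions, so they would trivially score 1.0.
    """
    shares = [sum(row[:n_sink]) for i, row in enumerate(attn) if i >= n_sink]
    return sum(shares) / len(shares)


# Toy head (made-up numbers): every query parks about half its mass on key 0.
toy_attn = [
    [1.0],
    [0.6, 0.4],
    [0.5, 0.2, 0.3],
    [0.5, 0.1, 0.1, 0.3],
]
print(f"sink share: {sink_share(toy_attn):.2f}")
```

Run against real attention outputs, a value that stays high across heads and layers regardless of the prompt is the signature described above.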
Why does softmax force a sink to exist?
Because softmax can't output zeros, and the model frequently wants to attend to nothing.
Self-attention computes weights as softmax(QKᵀ / √d). That distribution is strictly positive and sums to exactly 1 over the visible context. A given head, for a given query, often has no token worth attending to — the feature it detects simply isn't present in this position's neighborhood. But the head cannot say "attend to nothing." It must spend its full unit of probability mass on some combination of keys.
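The constraint is easy to check numerically. This is a small stdlib-only sketch with made-up scores: even when every score is strongly negative (the head "matches nothing"), softmax still hands out exactly one unit of weight, because only the differences between scores matter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A head whose query matches nothing: every score is strongly negative.
scores = [-9.0, -9.5, -8.7, -9.2]
weights = softmax(scores)

print(weights)       # every entry is strictly positive
print(sum(weights))  # and together they still sum to 1 (up to float rounding)
```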
The cheapest way out is to learn a key vector for one stable, always-present token whose corresponding value contributes near-zero to the output. The model parks unwanted attention there. The first token is perfect for the job: it's visible to every query under causal masking (every position can attend back to position 0), and it's positionally constant across all sequences, so the model can specialize a clean "sink" key/value for it during training.
This is a learned workaround for a structural constraint of softmax, not a quirk of any one model. It's also why alternatives exist that give the head an explicit "attend to nothing" option so it stops hijacking a content token: "off-by-one softmax," which adds 1 to the denominator (equivalent to a fixed no-op logit of zero), and variants that add a dedicated learnable sink logit per head.
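To make the "no-op logit" idea concrete, here is a minimal sketch of a softmax with one implicit extra logit fixed at zero whose weight is simply discarded. The function name `softmax_with_null` is made up for illustration, and the fixed zero is an assumption; a learned variant would replace it with a trainable scalar.

```python
import math


def softmax_with_null(scores):
    """Softmax over `scores` plus one implicit logit fixed at 0.

    The weight assigned to the implicit logit is dropped, so the
    returned weights may sum to less than 1.
    """
    m = max(max(scores), 0.0)        # include the null logit for stability
    exps = [math.exp(s - m) for s in scores]
    null = math.exp(0.0 - m)         # the "attend to nothing" option
    total = sum(exps) + null
    return [e / total for e in exps]


nothing_relevant = softmax_with_null([-9.0, -9.5, -8.7, -9.2])
strong_match = softmax_with_null([8.0, 0.0])

print(sum(nothing_relevant))  # close to 0: the head attends to almost nothing
print(sum(strong_match))      # close to 1: behaves like ordinary softmax
```

With an escape hatch like this, the head no longer needs to commandeer a real token as its dumping ground.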
Why does sliding-window KV-cache eviction break the model?
It breaks because eviction eventually deletes the sink, and the model has no fallback for the attention mass that was parked there.
Long-context serving is memory-bound. The KV cache grows linearly with sequence length, so a common trick is a sliding window: keep only the most recent W tokens' keys and values, evict the rest. Cheap, bounded memory, seems obviously correct — recent tokens matter most, right?
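The linear growth is worth putting numbers on. The sketch below is back-of-the-envelope arithmetic under an assumed, hypothetical configuration (32 layers, 32 KV heads, head dimension 128, fp16); swap in your own model's numbers.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for seq_len cached tokens."""
    # Leading 2 = one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token


# Hypothetical config, not tied to any specific model.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
MiB = 1024 ** 2

for tokens in (4_096, 32_768, 1_000_000):
    print(f"{tokens:>9} tokens: {kv_cache_bytes(tokens, **cfg) / MiB:,.0f} MiB")

# Bounded alternative: a fixed 2,048-token window.
print(f"window only: {kv_cache_bytes(2_048, **cfg) / MiB:,.0f} MiB")
```

Under these assumptions each cached token costs 0.5 MiB, so an unbounded stream is untenable while a fixed window stays at a constant footprint.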
Here's the failure. As generation proceeds, the window slides forward and the early tokens fall off the back. The instant token 0 leaves the cache, every head that was dumping 40% of its attention onto the sink has nowhere to put it. Softmax renormalizes over the surviving window, so that mass redistributes onto ordinary content tokens. Those tokens now receive attention weights far larger than anything seen in training. Because the sink's value vector contributed almost nothing, the head's output (the weighted sum of value vectors) abruptly grows in magnitude, the layer's output goes out of distribution, the next layer inherits the corruption, and it compounds. Perplexity explodes within a few tokens of the eviction.
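A toy single-head calculation shows the mechanism. The numbers are made up: the sink holds 40% of the weight and has a zero value vector, and three content tokens share the rest. Dropping a key and re-running softmax over the survivors is the same as renormalizing the surviving weights, which is what `evict_and_renormalize` (an illustrative helper, not a library function) does.

```python
def evict_and_renormalize(weights, evicted=0):
    """Remove one key and renormalize, as softmax over the survivors would."""
    kept = [w for i, w in enumerate(weights) if i != evicted]
    total = sum(kept)
    return [w / total for w in kept]


def weighted_sum(weights, values):
    """Attention output: sum of value vectors weighted by attention."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Index 0 is the sink: large weight, near-zero (here exactly zero) value.
weights = [0.4, 0.3, 0.2, 0.1]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

before = weighted_sum(weights, values)
after = weighted_sum(evict_and_renormalize(weights), values[1:])

print("with sink:   ", before)
print("sink evicted:", after)
```

Every content weight is scaled by 1 / (1 - 0.4), so this head's output grows by the same factor in a single step. In a real model that shift hits many heads at once and feeds every later layer.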
The cruel part: the model was perfectly fine at sequence length 4,000 as long as the first tokens were still cached. It's not running out of context capacity. It's losing the one structural anchor the attention mechanism depends on.
How does StreamingLLM fix it with four tokens?
By never evicting the sink. Pin the first few tokens permanently and slide the window over the rest.
The StreamingLLM result (Xiao et al.) is almost embarrassingly simple: keep roughly 4 initial tokens in the cache as permanent attention sinks, then apply your sliding window of recent tokens on top. With the sinks retained, the displaced-mass catastrophe never happens — heads can keep dumping unwanted attention on the sink the way they were trained to. With this in place, a model trained on a few-thousand-token window can stream stably across millions of tokens without any fine-tuning and without perplexity drift.
Two details that trip people up:
- It's the absolute position, not the content, that matters. Keep the tokens that were at positions 0–3, even after you've evicted everything between them and the recent window. You're preserving the sink slot, not a meaningful prefix.
- Positional encoding uses cache-relative positions. When you concatenate [sink tokens] + [recent window], assign rotary/relative positions by their index within the cache, not their original index in the full stream. Keep the original gap between sink and window positions and the model's perplexity climbs again.
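Both details are easier to see with concrete indices. The sketch below uses hypothetical helper names and a deliberately tiny window; it only does index bookkeeping, with no tensors involved.

```python
def kept_original_indices(seq_len, n_sink=4, window=8):
    """Original stream indices that survive in a sink + sliding-window cache."""
    sinks = list(range(min(n_sink, seq_len)))
    recent_start = max(len(sinks), seq_len - window)
    return sinks + list(range(recent_start, seq_len))


def cache_relative_positions(kept):
    """Positions used for rotary/relative encoding: index within the cache."""
    return list(range(len(kept)))


kept = kept_original_indices(seq_len=20)
print("original indices:", kept)
print("positions used:  ", cache_relative_positions(kept))
```

After 20 tokens the cache holds original indices 0 to 3 and 12 to 19, but the positions fed to the encoding run contiguously from 0 to 11, so the model never sees a distance larger than the cache itself.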
How do you implement a sink-aware KV cache?
Reserve the first n_sink slots, slide the rest. Here's the core logic for a single attention layer's cache:
```python
import torch


class SinkKVCache:
    """Sliding-window KV cache that pins the first n_sink tokens as
    permanent attention sinks (StreamingLLM-style)."""

    def __init__(self, n_sink=4, window=2048):
        self.n_sink = n_sink
        self.window = window            # recent tokens to keep
        self.k_sink = self.v_sink = None
        self.k_win = self.v_win = None  # rolling recent KV

    def update(self, k_new, v_new):
        # k_new, v_new: [batch, heads, new_len, head_dim]
        # With rotary embeddings, cache keys *before* RoPE is applied so
        # positions can be reassigned from the cache layout at each step.
        if self.k_sink is None:
            # First call (prefill): assumed to carry at least n_sink tokens;
            # a shorter first call leaves fewer sinks pinned.
            n = min(self.n_sink, k_new.shape[2])
            self.k_sink, self.v_sink = k_new[:, :, :n], v_new[:, :, :n]
            self.k_win, self.v_win = k_new[:, :, n:], v_new[:, :, n:]
        else:
            self.k_win = torch.cat([self.k_win, k_new], dim=2)
            self.v_win = torch.cat([self.v_win, v_new], dim=2)
        # evict oldest *window* tokens, never touch the sinks
        if self.k_win.shape[2] > self.window:
            self.k_win = self.k_win[:, :, -self.window:]
            self.v_win = self.v_win[:, :, -self.window:]
        k = torch.cat([self.k_sink, self.k_win], dim=2)
        v = torch.cat([self.v_sink, self.v_win], dim=2)
        return k, v  # feed these to scaled_dot_product_attention
```
The position-id construction that pairs with this matters as much as the eviction. Build positions over the concatenated cache, not the original stream:
```python
import torch


def cache_positions(n_sink_kept, win_len):
    # contiguous positions across [sinks | recent window]
    # rotary embeddings are applied against THESE, not original indices
    return torch.arange(n_sink_kept + win_len)
```
If you're on vLLM, SGLang, or TensorRT-LLM, you don't hand-roll this, but you should check whether your eviction/quantization path is sink-aware before enabling aggressive cache compression. Naive "keep last W tokens" eviction, or KV quantization that treats the sink tokens like any other token, can reintroduce exactly this bug.
Do modern models still have attention sinks?
Yes — and increasingly by design rather than by accident. The phenomenon is intrinsic to softmax attention, so it shows up across current model families. The shift is that newer architectures make the sink explicit: a dedicated learned sink token, an always-on register slot, or a per-head learnable bias logit that gives the head a real "attend to nothing" option. When the sink is a first-class part of the architecture, the safe eviction boundary is unambiguous and serving infrastructure can rely on it.
The practical takeaway is unchanged regardless of which model you run. Any time you bound the KV cache — sliding window, H2O-style eviction, attention offloading, or aggressive KV quantization — verify that whatever the model uses as its sink survives. The cheapest possible bug here is also one of the most destructive: you save a few megabytes of cache, evict four meaningless tokens, and turn a coherent model into a noise generator.
There's a debugging tell, too. If your long-context model produces clean output up to some length and then sharply degrades into repetition or gibberish at a consistent point — rather than slowly drifting — suspect sink eviction before you suspect context-length limits or RoPE extrapolation. Slow drift is a positional-encoding problem. A cliff is usually a cache problem.
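If you log per-token loss during a long run, a rough heuristic can separate a cliff from a drift: find the point where the mean loss over the next few tokens most exceeds the mean over the previous few, then check whether that index lines up with the step at which position 0 left your cache. The function name, the `span` default, and the toy trace below are all assumptions for illustration.

```python
def largest_jump(losses, span=4):
    """Return (index, jump) where mean loss over the next `span` tokens
    most exceeds the mean over the previous `span` tokens."""
    best_idx, best_jump = None, float("-inf")
    for i in range(span, len(losses) - span + 1):
        before = sum(losses[i - span:i]) / span
        after = sum(losses[i:i + span]) / span
        if after - before > best_jump:
            best_idx, best_jump = i, after - before
    return best_idx, best_jump


# Toy traces (made-up numbers).
cliff = [2.0] * 10 + [9.0] * 6              # sudden break at index 10
drift = [2.0 + 0.1 * i for i in range(16)]  # slow, steady rise

print("cliff:", largest_jump(cliff))
print("drift:", largest_jump(drift))
```

A large jump concentrated at one index, especially one close to your window size, points at the cache. A small, roughly uniform increase everywhere points elsewhere.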
Direct answer: why does evicting your LLM's first token break it?
Evicting an LLM's first token breaks it because that token is an attention sink — the place where softmax-based attention parks the probability mass it's forced to assign but doesn't want to spend on any real content. Softmax over attention scores must sum to 1, so heads with nothing relevant to attend to dump their weight onto a stable, always-visible token, almost always position 0. Delete it via sliding-window KV-cache eviction and that mass scatters onto content tokens at magnitudes never seen in training, corrupting the layer outputs and exploding perplexity within a few tokens. The fix is StreamingLLM's: permanently pin the first ~4 tokens as sinks while sliding the window over the rest, which keeps long-context generation stable across millions of tokens with no fine-tuning.