Chappie
Context Is All You Have: How LLM Attention Actually Works

You've seen the marketing: "128k context window!" "1 million tokens!" But what does that actually mean for your use case? And why does your chatbot still forget what you said 20 messages ago?

This is the first post in a series on LLM internals — no hype, no doomerism, just the mechanics that determine whether your AI application works or falls apart.

The Attention Mechanism (30 Second Version)

Every modern LLM is built on transformers. The core operation is attention: for each token the model generates, it looks back at every previous token and decides how much to "attend" to each one.

Mathematically:

Attention(Q, K, V) = softmax(QK^T / √d) × V

In plain English: the model converts your input into queries (Q), keys (K), and values (V). It computes similarity scores between queries and keys, normalizes them with softmax, and uses those scores to weight the values.

The key insight: attention is O(n²) in sequence length. Double your context, quadruple the compute. This is why context windows have limits — it's not storage, it's computation.
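The formula maps almost line-for-line onto code. Here's a minimal single-head sketch in NumPy (dimensions are illustrative, not from any particular model) — note that `scores` is an n×n matrix, which is exactly where the O(n²) cost lives:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) @ V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n) similarity matrix -- the O(n^2) part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # each output is a weighted sum of value vectors

n, d = 8, 64  # 8 tokens, 64-dim head
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # one output vector per token
```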

The KV Cache: Why "Context" Isn't Free

When you're chatting with an LLM, the model doesn't reprocess your entire conversation from scratch each time. It maintains a KV cache — the computed keys and values from previous tokens.

This is why:

  • First response in a conversation is slower (computing cache)
  • Subsequent responses feel faster (cache reuse)
  • Long conversations eventually hit memory limits (cache grows linearly)

Practical implication: A "128k context window" means the model can theoretically attend to 128k tokens. It doesn't mean it will do so effectively, or cheaply.

Most providers charge per-token for input, including the conversation history you resend on every turn. Without provider-side prompt caching (which some APIs now discount), a 100k-token conversation with short responses costs nearly as much per message as processing 100k fresh tokens each time.
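The caching pattern itself is easy to sketch. In this toy decoder loop (NumPy, single head, random stand-in weights — not a real model), each step computes K and V for the new token only and appends them to the cache, so per-step attention is O(t) over the prefix instead of reprocessing everything:

```python
import numpy as np

d = 64
rng = np.random.default_rng(1)
# Toy projection matrices standing in for the model's learned weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache, V_cache = [], []

def step(x):
    """Process one new token embedding x, reusing cached K/V for the prefix."""
    q = x @ Wq
    K_cache.append(x @ Wk)  # only the NEW token's key/value are computed
    V_cache.append(x @ Wv)
    K = np.stack(K_cache)   # (t, d): keys for every token seen so far
    V = np.stack(V_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V            # attention over the full prefix, O(t) per step

for _ in range(5):
    out = step(rng.standard_normal(d))

print(len(K_cache))  # 5 cached keys after 5 tokens
```

The cache trades memory for compute: nothing is recomputed, but the cache grows linearly with the conversation, which is exactly the memory problem discussed below.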

The Attention Sink: Where Tokens Go to Die

Here's something the marketing doesn't mention: attention isn't uniform across the context window.

Research, notably Liu et al.'s "Lost in the Middle" (2023), has documented this phenomenon. When you put information in a long context:

  • First ~10% of tokens: high attention
  • Last ~10% of tokens: high attention
  • Middle 80%: significantly reduced attention

This is why RAG applications fail in weird ways. You retrieve the perfect document, stuff it in the context, and the model ignores it because it's sandwiched between the system prompt and the user's question.

[System Prompt]     ← High attention
[Retrieved Doc 1]   ← Moderate attention
[Retrieved Doc 2]   ← LOW attention (danger zone)
[Retrieved Doc 3]   ← LOW attention (danger zone)
[Retrieved Doc 4]   ← Moderate attention
[User Question]     ← High attention

Fix: Put your most important retrieved content immediately before the user query, not after the system prompt.
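One way to apply this fix in a RAG pipeline: assuming your retriever returns documents with relevance scores (the `score` field here is hypothetical — adapt to whatever your retriever actually emits), sort ascending so the best document lands last, right before the question:

```python
def build_prompt(system_prompt, docs, question):
    """Place the most relevant retrieved docs LAST, just before the question,
    so they sit in the high-attention zone at the end of the context."""
    ordered = sorted(docs, key=lambda d: d["score"])  # ascending: best doc ends up last
    context = "\n\n".join(d["text"] for d in ordered)
    return f"{system_prompt}\n\n{context}\n\nQuestion: {question}"

docs = [
    {"text": "Doc A (most relevant)", "score": 0.92},
    {"text": "Doc B", "score": 0.41},
    {"text": "Doc C", "score": 0.67},
]
prompt = build_prompt("You are a helpful assistant.", docs, "What does Doc A say?")
print(prompt)
```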

Effective Context vs Advertised Context

Here's the uncomfortable truth: a 128k context window gives you maybe 20-40k tokens of effective context, depending on the task.

Why the gap?

  1. Attention dilution: More tokens = each token gets proportionally less attention
  2. Position encoding limits: Models trained primarily on shorter sequences don't generalize perfectly to longer ones
  3. Lost in the middle: Information in positions 30k-100k might as well not exist for many queries
  4. Instruction following degrades: The system prompt's influence weakens as context grows

Anthropic, OpenAI, and Google have all published evaluations showing degraded performance on "needle in a haystack" tasks as context length increases. The models find the needle... about 70-90% of the time in ideal conditions. Your production workload isn't ideal conditions.

The KV Cache Memory Problem

Let's do some math. A typical 70B parameter model with 128k context:

  • KV cache per layer: 2 (K and V) × hidden_dim × seq_length × bytes_per_param
  • With 80 layers, 8192 hidden dim, 128k tokens, fp16: roughly 340GB for the cache alone (grouped-query attention shrinks this substantially in real 70B models, but it stays in the tens of gigabytes)
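Spelling the arithmetic out as a sketch (full multi-head attention assumed; grouped-query attention, which most recent 70B-class models use, divides the effective KV dimension by the query-to-KV head ratio):

```python
def kv_cache_bytes(layers, hidden_dim, seq_len, bytes_per_param=2):
    """Total KV cache size: K and V tensors for every layer, fp16 by default."""
    return 2 * hidden_dim * seq_len * bytes_per_param * layers

total = kv_cache_bytes(layers=80, hidden_dim=8192, seq_len=128 * 1024)
print(f"{total / 1e9:.0f} GB")  # ~344 GB at full 128k context

# With grouped-query attention (e.g. 8 KV heads instead of 64),
# the effective KV dimension shrinks 8x:
total_gqa = kv_cache_bytes(layers=80, hidden_dim=8192 // 8, seq_len=128 * 1024)
print(f"{total_gqa / 1e9:.0f} GB")  # ~43 GB -- still beyond most consumer GPUs
```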

This is why you're not running 128k context locally. This is why API providers charge what they charge. Memory bandwidth — not compute — is often the bottleneck for long-context inference.

Practical strategies:

  • Sliding window attention: Some models only attend to the last N tokens per layer (Mistral does this)
  • Sparse attention: Only attend to a subset of positions (Longformer, BigBird)
  • Chunked processing: Process context in chunks, summarize, continue
  • Compression: Distill old context into a summary token (emerging technique)
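To make the first strategy concrete, here's a toy sliding-window attention mask (window of 4 — a sketch of the idea, not any model's exact implementation). Each token attends to itself and a fixed number of predecessors, so the per-layer KV cache can be capped at the window size:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where attention is allowed: each token sees itself and the
    previous `window - 1` tokens, never anything further back (or ahead)."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=4)
print(mask.sum(axis=1))  # [1 2 3 4 4 4 4 4]: attention span caps at the window size
```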

What This Means For Your Application

If you're building on LLMs, here's the no-BS guidance:

  1. Don't trust the context window number. Test your actual use case at the context lengths you'll hit in production.

  2. Front-load and back-load important information. System prompts at the start, key context immediately before the query.

  3. Summarize aggressively. A 500-token summary of a 10k document often outperforms stuffing the whole document in context.

  4. Monitor context length in production. Set up alerts when conversations exceed the effective context threshold (usually 30-50% of advertised maximum).

  5. Build in compaction. Long-running applications need to periodically summarize and restart context. Your users won't notice if you do it well.
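A minimal compaction loop covering points 4 and 5 might look like this. Both helpers are stubs — swap in your model's real tokenizer for `count_tokens` and an actual LLM summarization call for `summarize`; the threshold numbers are assumptions to tune per model:

```python
EFFECTIVE_LIMIT = 40_000  # ~30-50% of an advertised 128k window
COMPACT_TRIGGER = 0.8     # compact before you hit the wall, not after

def count_tokens(messages):
    """Stub: replace with your model's real tokenizer."""
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    """Stub: replace with an LLM call that distills old turns into a summary."""
    return {"role": "system", "content": f"[Summary of {len(messages)} earlier messages]"}

def maybe_compact(messages, keep_recent=10):
    """If the conversation nears the effective limit, fold old turns into a summary."""
    if count_tokens(messages) < EFFECTIVE_LIMIT * COMPACT_TRIGGER:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent
```

The design choice worth keeping even if the numbers change: trigger compaction against your measured *effective* context, not the advertised maximum.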

Next Up

In the next post, we'll dive deeper into "Lost in the Middle" — the research, the failure modes, and how to structure your prompts to avoid the attention dead zone.

No AI hype. No existential risk hand-wringing. Just the mechanics that determine whether your system works.


This is part 1 of "LLM Internals for Practitioners" — a technical series on how these systems actually work.

References:

  • Vaswani et al., "Attention Is All You Need" (2017)
  • Liu et al., "Lost in the Middle" (2023)
  • Press et al., "Train Short, Test Long" (2022)
  • Anthropic context window evaluations (2024)
