Why Your LLM Doesn't Re-Read the Prompt: The KV-Cache

#ai #llm #machinelearning #beginners

The KV-cache is the single most important optimisation in LLM inference — and the reason real-time chat with a model is even feasible. Here's what it is and why it matters.

Generation is autoregressive

An LLM produces text one token at a time: emit a token, append it, run the whole model again for the next. Inside each attention layer, every token becomes a Query, a Key, and a Value. To produce the newest token, its Query is scored against the Keys of all previous tokens, and those weights blend their Values. So generating token t needs the K and V of tokens 1…t.

The naïve approach is quadratic

Without a cache, each step re-encodes the entire prefix to rebuild K/V for tokens 1…t. Step 1 processes 1 token, step 2 processes 2, …, step N processes N. Total work ≈ 1+2+…+N = N(N+1)/2 — quadratic. Token 1's K/V gets recomputed on every single step even though it never changes.

The key insight: past K/V never change

LLMs use causal masking — a token attends only to earlier tokens. So adding a new token at the end can't change the Keys and Values of earlier tokens. They're constant. Recomputing them is pure waste.

Cache them → linear generation

Store each token's K/V the first time. Each step computes K/V for just the one new token, appends it, and attends over the whole cache:

K_cache, V_cache = [], []
for t in range(N):
    k, v = kv(token_t)              # ONE token's work
    K_cache.append(k); V_cache.append(v)
    out = attend(Q_t, K_cache, V_cache)   # reuse all prior K/V

Per-step work is now constant → O(N) total instead of O(N²).

Prefill vs decode

This splits inference into two phases. Prefill: ingest the whole prompt in one parallel pass, filling the cache — compute-heavy, and why the first token can take a moment on a long prompt. Decode: generate output tokens one at a time, each a cheap cache-append. That's why "time to first token" and "time per output token" are different numbers.

The memory price of long context

The cache stores K and V for every token, every layer, every head. Its size grows linearly with context length — which is exactly why a 128k-token context is expensive: the cache can eat many gigabytes of GPU memory, often becoming the limit on how many users a GPU can serve. Tricks like paged attention (vLLM), grouped-query attention, quantised caches, and prompt caching all exist to tame it.

Watch a "no cache" vs "cached" generation diverge, op by op:
https://dev48v.infy.uk/ai/days/day22-kv-cache.html