The KV cache: why LLM inference is memory-bound, not compute-bound

#gpu #llm #inference #performance

Many of 2026's headline AI infrastructure wins (claims like "6x less inference memory") weren't about doing more math. They were about moving less data. That's because when a large language model generates text, it is usually memory-bound, not compute-bound: the GPU spends more time waiting on memory than doing arithmetic. The structure at the center of this is the KV cache, and once you understand it, token pricing, context-window limits, and the recent optimization wave all make sense.

The one idea: generation re-reads the whole past, every step

LLMs generate one token at a time. To produce the next token, attention scores the current token's query against the key vector of every previous token, then uses those scores to take a weighted sum of the matching value vectors. To generate token 1000, the model must attend to tokens 1 through 999.

The naive approach recomputes the keys and values for all previous tokens at every single step. That's enormous, redundant work: the keys and values for token 5 don't change when you're generating token 900. So you compute them once and cache them. That cache, the stored key and value vectors for every token so far, is the KV cache.
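To make the redundancy concrete, here is a minimal sketch of one attention head in plain Python. It is a toy, not how any real model is implemented: the sizes, the random weights, and the helper names (`matvec`, `attend`, `decode_naive`, `decode_cached`) are all made up for illustration. It decodes the same sequence twice, once recomputing every key and value at every step and once appending to a cache, and counts the key/value projections each approach performs.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Single-head attention: softmax(q.k / sqrt(d)), then a weighted sum of values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(d)]


def decode_naive(xs, wq, wk, wv):
    """Recompute every key and value from scratch at every step."""
    outputs, projections = [], 0
    for t in range(len(xs)):
        # In this toy, each token also attends to itself.
        keys = [matvec(wk, x) for x in xs[: t + 1]]
        values = [matvec(wv, x) for x in xs[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(matvec(wq, xs[t]), keys, values))
    return outputs, projections


def decode_cached(xs, wq, wk, wv):
    """Compute each key and value once, append it to the cache, and reuse it."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in xs:
        k_cache.append(matvec(wk, x))
        v_cache.append(matvec(wv, x))
        projections += 2
        outputs.append(attend(matvec(wq, x), k_cache, v_cache))
    return outputs, projections


def random_matrix(d):
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]


random.seed(0)
d, n = 4, 6  # toy sizes: vector width and number of tokens
wq, wk, wv = random_matrix(d), random_matrix(d), random_matrix(d)
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

naive_out, naive_work = decode_naive(xs, wq, wk, wv)
cached_out, cached_work = decode_cached(xs, wq, wk, wv)
print(naive_work, cached_work)  # 42 12
```

Both versions produce the same attention outputs. The naive one does work that grows with the square of the sequence length, while the cached one does a fixed amount per token and pays for it by keeping `k_cache` and `v_cache` around.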

Why the cache is the bottleneck

The KV cache trades compute for memory, and the memory bill is large. For standard multi-head attention, where every head stores its own keys and values, its size is roughly:

KV cache size  ≈  2 (K and V) × layers × tokens × hidden_size × bytes_per_number
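
As a worked example, the formula can be plugged in for a hypothetical model. The numbers below (32 layers, hidden size 4096, 16-bit cache entries) are assumptions chosen for round results, not the specs of any particular model.

```python
def kv_cache_bytes(layers, tokens, hidden_size, bytes_per_number):
    """KV cache size: 2 (K and V) x layers x tokens x hidden_size x bytes_per_number."""
    return 2 * layers * tokens * hidden_size * bytes_per_number


# Hypothetical model: 32 layers, hidden size 4096, 16-bit (2-byte) cache entries.
per_token = kv_cache_bytes(layers=32, tokens=1, hidden_size=4096, bytes_per_number=2)
long_context = kv_cache_bytes(layers=32, tokens=32_768, hidden_size=4096, bytes_per_number=2)

print(per_token / 2**20)     # 0.5  (MiB per token)
print(long_context / 2**30)  # 16.0 (GiB for 32,768 tokens)
```

Under these assumptions, every token in the context costs half a mebibyte of GPU memory, for a single request.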

Three things blow this up:

It grows linearly with sequence length. A long context (a big prompt + long output) means a big cache. This is a major reason long context windows are expensive and capped: the KV cache for tens of thousands of tokens can run to gigabytes.
It's per request. Every concurrent user has their own KV cache, so serving many users at once is limited by GPU memory, not GPU math.
Every generation step must read the entire cache from GPU memory to compute attention. As the cache grows, that read dominates, and memory bandwidth, not FLOPs, sets the speed.
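
A rough back-of-the-envelope sketch shows why the read dominates. It looks only at the attention step and ignores the model weights, which also have to be read. The accelerator figures (`BANDWIDTH`, `PEAK_FLOPS`) are hypothetical round numbers, and the model shape is the same assumed one as before: 32 layers, hidden size 4096, 16-bit entries.

```python
def attention_decode_cost(layers, tokens, hidden_size, bytes_per_number):
    """Return (bytes read from the cache, attention FLOPs) for one generated token."""
    cache_bytes = 2 * layers * tokens * hidden_size * bytes_per_number
    # Query-key scores and the weighted sum of values each cost one
    # multiply-add (2 FLOPs) per cached number.
    flops = 2 * (2 * layers * tokens * hidden_size)
    return cache_bytes, flops


# Hypothetical accelerator, round numbers chosen only for illustration.
BANDWIDTH = 2e12   # bytes per second of memory bandwidth
PEAK_FLOPS = 4e14  # floating-point operations per second

cache_bytes, flops = attention_decode_cost(32, 32_768, 4096, 2)
memory_time = cache_bytes / BANDWIDTH
compute_time = flops / PEAK_FLOPS

print(flops / cache_bytes)                # 1.0 FLOP per byte read
print(round(memory_time * 1e3, 2))        # 8.59 (ms just to stream the cache)
print(round(memory_time / compute_time))  # 200
```

With a 16-bit cache, attention during generation does about one floating-point operation per byte it reads. On hardware that can do hundreds of operations in the time it takes to fetch one byte, the arithmetic finishes long before the data arrives.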

That's the punchline: the GPU's arithmetic units often sit idle waiting for the KV cache to stream in. The model is memory-bound.

Why this explains things you've noticed

Why output tokens cost more than input tokens. Processing your prompt happens in one parallel pass. Generating output is sequential, and each generated token does a full KV-cache read. Output is the memory-bound part, so it's priced higher.
Why context windows have limits and a price. A longer context means a bigger KV cache, and so more memory and more bandwidth per step. The window isn't capped by the model's cleverness; it's capped by memory.
Why throughput drops with many users. Each session holds its own KV cache in GPU RAM. Memory, not compute, caps how many you can serve at once.
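
The concurrency limit can be estimated with simple division. This is a sketch under stated assumptions: an 80 GB accelerator, 14 GB of weights, and 0.5 MiB of cache per token are hypothetical figures, and a real server also needs memory for activations and other overhead.

```python
def max_concurrent_sessions(gpu_bytes, weight_bytes, tokens_per_session, cache_bytes_per_token):
    """How many per-session KV caches fit in the memory left after the weights."""
    free_bytes = gpu_bytes - weight_bytes
    return free_bytes // (tokens_per_session * cache_bytes_per_token)


GPU_BYTES = 80 * 10**9        # hypothetical 80 GB accelerator
WEIGHT_BYTES = 14 * 10**9     # hypothetical 7B-parameter model stored in 16-bit
CACHE_PER_TOKEN = 512 * 1024  # 0.5 MiB per token: 2 x 32 layers x 4096 x 2 bytes

short_sessions = max_concurrent_sessions(GPU_BYTES, WEIGHT_BYTES, 4_096, CACHE_PER_TOKEN)
long_sessions = max_concurrent_sessions(GPU_BYTES, WEIGHT_BYTES, 32_768, CACHE_PER_TOKEN)

print(short_sessions)  # 30 sessions at 4,096 tokens each
print(long_sessions)   # 3 sessions at 32,768 tokens each
```

Making the context eight times longer cuts the number of sessions that fit by roughly ten, without changing the amount of arithmetic the GPU can do.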

Why 2026's wins are memory tricks

Once you see inference as memory-bound, the optimization frontier makes sense: it's mostly about shrinking the KV cache or moving less of it.

KV-cache quantization. Store the cached keys and values in 8-bit or 4-bit instead of 16-bit (the same quantization idea used to shrink weights). Less data to store and stream translates directly into more speed and capacity. This is the kind of change behind "Nx less inference memory" claims.
Smarter attention layouts (grouped-query / multi-query attention) that let many attention heads share keys and values, shrinking the cache several-fold with minimal quality loss.
Paged KV caches (managing cache memory in pages, like an OS) so memory isn't wasted on fragmentation and more requests fit. This is the idea behind high-throughput serving systems.
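
The first two tricks compound, which a small size calculation makes visible. The model shape here (32 layers, 32 heads of dimension 128, a 32,768-token context) is hypothetical, and the sketch ignores the small extra storage that quantization needs for scale factors.

```python
def kv_cache_bytes(layers, tokens, kv_heads, head_dim, bits):
    """Cache size when only kv_heads heads store keys and values, at `bits` per number."""
    return 2 * layers * tokens * kv_heads * head_dim * bits // 8


GIB = 2**30
config = dict(layers=32, tokens=32_768, head_dim=128)  # hypothetical model

sizes = {
    "baseline": kv_cache_bytes(kv_heads=32, bits=16, **config),  # every head keeps K/V
    "int8": kv_cache_bytes(kv_heads=32, bits=8, **config),
    "int4": kv_cache_bytes(kv_heads=32, bits=4, **config),
    "gqa": kv_cache_bytes(kv_heads=8, bits=16, **config),        # 4 query heads share each K/V head
    "gqa+int4": kv_cache_bytes(kv_heads=8, bits=4, **config),
}

for name, size in sizes.items():
    print(f"{name:>9}: {size / GIB:4.1f} GiB ({sizes['baseline'] // size}x smaller)")
```

In this example the baseline is 16 GiB, either 4-bit storage or 8 shared key/value heads brings it to 4 GiB, and both together bring it to 1 GiB.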

All three attack memory, not math, because memory is the bottleneck.
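
The paging idea is the least obvious of the three, so here is a toy sketch of it. This is not the implementation of any particular serving system: `PagedKVAllocator` and its methods are hypothetical names, and the "blocks" are just integer IDs standing in for fixed-size chunks of GPU memory.

```python
class PagedKVAllocator:
    """Hands out fixed-size cache blocks on demand instead of one big slab per request."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size              # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # request id -> list of block ids
        self.lengths = {}                         # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve space for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)  # 128 token slots in total
for request_id, length in [("a", 20), ("b", 5), ("c", 40)]:
    for _ in range(length):
        alloc.append_token(request_id)

free_after_fill = len(alloc.free_blocks)
print(free_after_fill)  # 2 blocks free (6 used: 2 + 1 + 3)

alloc.release("b")
free_after_release = len(alloc.free_blocks)
print(free_after_release)  # 3
```

For comparison, reserving a contiguous 64-token region per request up front would need 192 slots for these three requests, more than the 128 in the pool, so only two would fit. Allocating block by block wastes at most part of one block per request.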

Why this is worth understanding

If you reason about LLMs only in terms of parameters and FLOPs, their real-world behavior (pricing, latency, context limits, throughput) looks arbitrary. Through the KV cache it becomes predictable: generation re-reads a growing per-request memory structure every step, and that read is the bottleneck. Optimizing inference means optimizing that structure.

This is the recurring lesson of performance work in general: the constraint is usually data movement, not arithmetic, which is the same reason cache-friendly code can be 10x faster with identical math. Understanding where the bytes go, on a CPU or a GPU, is the heart of the GPU programming track.