Take a 70B model in fp16 on an H100. Feed it a 4,000-token prompt and it chews through the whole thing in under a second. Then it emits tokens one at a time at 20-odd per second. Same weights, same GPU, same kernels, and yet per-token throughput differs by a factor of hundreds between reading the prompt and writing the answer.
That gap is not a bug or a bad kernel. It is the defining fact of LLM inference: prefill is compute-bound, decode is memory-bandwidth-bound. Once you internalize the arithmetic intensity behind that sentence, most of what confuses people about latency, batching, and serving cost stops being mysterious.
TL;DR
- Prefill (processing the prompt) is a matrix–matrix multiply over all prompt tokens at once. High arithmetic intensity → compute-bound → limited by GPU FLOP/s.
- Decode (generating each new token) is a matrix–vector multiply at batch 1. Low arithmetic intensity (~1 FLOP/byte) → memory-bound → limited by HBM bandwidth, not FLOP/s.
- The hard floor on single-request decode time per token is model_bytes / HBM_bandwidth. A 70B fp16 model (~140 GB) on ~3.35 TB/s HBM caps at ~24 tokens/sec no matter how fast the tensor cores are.
- Batching is the fix for decode: reading the weights once to serve B requests raises arithmetic intensity roughly B×, so throughput scales with batch until you hit the compute ridge or run out of KV-cache bandwidth.
- This is why TTFT (time-to-first-token) and TPOT (time-per-output-token) are governed by different hardware limits and must be optimized separately.
What is arithmetic intensity and why does it decide LLM speed?
Arithmetic intensity is the ratio of floating-point operations to bytes moved from memory: FLOPs / bytes. The roofline model says a kernel is memory-bound when its intensity is below the hardware's ridge point (peak FLOP/s ÷ peak bandwidth) and compute-bound above it.
For an H100 SXM: roughly 989 TFLOP/s dense FP16 tensor throughput and about 3.35 TB/s of HBM3 bandwidth. Ridge point:
ridge = 989e12 FLOP/s / 3.35e12 B/s ≈ 295 FLOP/byte
So a kernel must do ~295 float ops per byte it reads to saturate the tensor cores. Below that, the tensor cores sit idle waiting on memory. Hold onto 295 — it is the number the whole post revolves around.
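The roofline rule above can be sketched in a few lines. This is a minimal illustration using only the two H100 figures quoted in the text; the function name `attainable_flops` is made up for this example, and real kernels land below the roof.

```python
# Roofline sketch using the H100 SXM figures quoted in the text.
PEAK_FLOPS = 989e12  # dense FP16 tensor FLOP/s
HBM_BW = 3.35e12     # HBM3 bytes/s

ridge = PEAK_FLOPS / HBM_BW  # FLOP/byte needed to saturate the tensor cores


def attainable_flops(intensity: float) -> float:
    """Upper bound on FLOP/s for a kernel with the given FLOP/byte intensity."""
    # Below the ridge the bandwidth term wins; above it the compute peak wins.
    return min(PEAK_FLOPS, HBM_BW * intensity)


for intensity in (1, 32, 295, 4000):
    fraction = attainable_flops(intensity) / PEAK_FLOPS
    print(f"intensity {intensity:>5} FLOP/byte -> at most {fraction:.1%} of peak")
```

Everything to the left of the ridge scales linearly with intensity, which is why the rest of the post keeps asking how many FLOPs each byte of weights gets to do.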
Why is decode memory-bound?
Because generating one token is a matrix–vector product, and matrix–vector products have arithmetic intensity near 1.
During autoregressive decode you process exactly one new token per step. The dominant work is multiplying that token's hidden vector by each weight matrix W of shape d_in × d_out:
- FLOPs: 2 · d_in · d_out (a multiply and an add per element).
- Bytes read: 2 · d_in · d_out for the fp16 weights themselves (the input/output vectors are negligible by comparison).
intensity = (2 · d_in · d_out) / (2 · d_in · d_out) ≈ 1 FLOP/byte
One FLOP per byte against a ridge point of 295. Decode runs at roughly 1/295th of peak compute — the GPU is almost entirely waiting on HBM. You cannot fix this with a faster kernel or lower-precision matmul math, because the bottleneck is the bytes, not the ops.
The practical ceiling follows directly. Every decode step reads every weight once. So:
min time per token = model_size_in_bytes / HBM_bandwidth
For a 70B fp16 model, weights are ~140 GB:
140e9 B / 3.35e12 B/s ≈ 0.042 s → ~24 tokens/sec upper bound
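That is a hardware wall for a single request at one H100's bandwidth, before attention, sampling, or Python overhead; real measurements can only land at or below it. (Strictly, 140 GB of fp16 weights does not fit in a single H100's 80 GB of HBM, so a real deployment shards the model across GPUs, and the same bytes-over-bandwidth arithmetic then applies to each GPU's shard.) The bound also explains why FP8 or 4-bit quantization speeds up decode so dramatically: halving the bytes read roughly doubles the token rate. Quantization is a bandwidth optimization for decode far more than a compute one.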
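The ceiling is easy to recompute for other precisions. The sketch below assumes weights are the only bytes read per step (no KV cache, no quantization scales or zero-points), so treat the results as upper bounds; the function name `decode_ceiling_tok_s` is illustrative.

```python
# Single-request decode ceiling: every step streams all weights once.
HBM_BW = 3.35e12  # bytes/s, H100 SXM figure from the text
N_PARAMS = 70e9   # parameter count of the example model


def decode_ceiling_tok_s(n_params: float, bytes_per_param: float,
                         hbm_bw: float = HBM_BW) -> float:
    """Upper bound on tokens/sec when weight reads are the only memory traffic."""
    model_bytes = n_params * bytes_per_param
    min_time_per_token = model_bytes / hbm_bw
    return 1.0 / min_time_per_token


# Bytes per parameter are nominal; real 4-bit formats carry extra metadata.
for name, bytes_per_param in [("fp16", 2.0), ("fp8", 1.0), ("4-bit", 0.5)]:
    rate = decode_ceiling_tok_s(N_PARAMS, bytes_per_param)
    print(f"{name:>5}: at most {rate:.0f} tokens/sec")
```

The peak FLOP/s figure never appears in this calculation, which is the point: at batch 1 the tensor cores are not the constraint.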
Why is prefill compute-bound instead?
Prefill processes the entire prompt at once, so the matrix–vector multiply becomes a matrix–matrix multiply, and intensity scales with the number of tokens.
With S prompt tokens, the input is an S × d_in matrix, not a single vector. The same weight matrix W is now read once but reused across all S tokens:
- FLOPs: 2 · S · d_in · d_out
- Bytes read: still ~2 · d_in · d_out (weights loaded once, reused across all S rows)
intensity ≈ S FLOP/byte
Once S exceeds ~295, prefill crosses the ridge point and becomes compute-bound. A 4,000-token prompt sits comfortably above it, so prefill runs the tensor cores near their FLOP ceiling. At roughly 2 FLOPs per parameter per token, a 70B model needs about 140 GFLOP per prompt token, which puts the compute ceiling near 7,000 tokens/sec. This is why prefill throughput is measured in thousands of tokens/sec while decode is measured in tens.
It is the same weights and the same GPU. The only thing that changed is how many tokens share each weight read.
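To see how good the "intensity ≈ S" shortcut is, the sketch below also counts the activation bytes that the approximation drops. The layer size d_in = d_out = 8192 is an assumption chosen for illustration, and `prefill_time_floor` uses the usual rough rule of 2 FLOPs per parameter per token, ignoring attention.

```python
# Prefill intensity for one linear layer, with activations included.
PEAK_FLOPS = 989e12  # dense FP16 tensor FLOP/s
HBM_BW = 3.35e12     # bytes/s
RIDGE = PEAK_FLOPS / HBM_BW
BYTES_PER_ELEM = 2   # fp16


def matmul_intensity(tokens: int, d_in: int = 8192, d_out: int = 8192) -> float:
    """FLOP/byte for a (tokens x d_in) @ (d_in x d_out) matmul in fp16."""
    flops = 2 * tokens * d_in * d_out
    weight_bytes = BYTES_PER_ELEM * d_in * d_out
    # Inputs are read and outputs written once each; small unless tokens ~ d.
    activation_bytes = BYTES_PER_ELEM * tokens * (d_in + d_out)
    return flops / (weight_bytes + activation_bytes)


def prefill_time_floor(tokens: int, n_params: float = 70e9) -> float:
    """Lower bound on prefill seconds if the tensor cores ran at peak."""
    return 2 * n_params * tokens / PEAK_FLOPS


for s in (1, 64, 295, 1024, 4000):
    intensity = matmul_intensity(s)
    regime = "compute-bound" if intensity >= RIDGE else "memory-bound"
    print(f"S={s:>5}: intensity ~{intensity:7.1f} FLOP/byte -> {regime}")

print(f"4,000-token prefill takes at least {prefill_time_floor(4000):.2f} s")
```

With activations counted, intensity falls somewhat short of S once S becomes comparable to the layer width, so the crossover sits a little above S = 295 rather than exactly on it. The conclusion for a 4,000-token prompt does not change.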
Why does batching speed up generation but not a single request?
Batching raises decode's arithmetic intensity the same way a long prompt raises prefill's — by amortizing one weight read over many token computations.
Stack B concurrent requests and each decode step multiplies a B × d_in matrix by W. Weights are read once, reused across all B rows:
intensity ≈ B FLOP/byte
At B = 1 you are at intensity ~1, deeply memory-bound. Push B toward ~295 and decode approaches the compute ridge, at which point you are extracting near-peak throughput from the hardware. This is the entire economic argument for continuous batching in vLLM, TensorRT-LLM, and SGLang: it is how a serving stack turns a bandwidth-bound workload into a compute-bound one.
The catch every engineer trips over: batching improves throughput (tokens/sec across all users), not latency (tokens/sec for one user). A single request cannot go faster than its ~24 tokens/sec wall. Batching just serves many requests near that rate simultaneously, so aggregate output climbs while any individual user sees the same per-token speed. If a product manager asks why "the GPU utilization is high but my response still feels slow," this is the answer.
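The throughput-versus-latency split shows up directly in a two-term model of a decode step. This sketch deliberately ignores the KV cache and all overheads, and the function names are made up for the example, so it is the optimistic case.

```python
# Decode step time under a weights-only roofline (KV cache ignored).
PEAK_FLOPS = 989e12  # dense FP16 tensor FLOP/s
HBM_BW = 3.35e12     # bytes/s


def decode_step_time(batch: int, n_params: float = 70e9,
                     bytes_per_param: float = 2.0) -> float:
    """Seconds per decode step: the slower of weight streaming and matmul math."""
    mem_time = n_params * bytes_per_param / HBM_BW   # independent of batch
    compute_time = 2 * n_params * batch / PEAK_FLOPS  # grows with batch
    return max(mem_time, compute_time)


def aggregate_tok_s(batch: int) -> float:
    """Tokens/sec summed over all requests in the batch."""
    return batch / decode_step_time(batch)


for batch in (1, 8, 64, 256, 512):
    per_user = 1.0 / decode_step_time(batch)
    print(f"B={batch:>3}: per-user {per_user:5.1f} tok/s, "
          f"aggregate {aggregate_tok_s(batch):8.0f} tok/s")
```

Below the ridge the step time is flat, so every added request is nearly free and no individual request gets faster. Past the ridge the step time starts growing with B and per-user speed drops.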
What breaks the batching win: the KV cache
There is a second memory stream during decode that does not amortize across the batch: the KV cache. Attention at each step reads the keys and values for every previous token of every request in the batch, and those bytes are per-request, not shared like weights.
KV-cache size per token scales with 2 · n_layers · n_kv_heads · head_dim · precision. For a long-context, high-batch workload the cache can rival or exceed the weights in bytes moved per step. Two consequences:
- Long contexts slow decode even at fixed batch size, because attention's memory traffic grows with sequence length while the weight traffic stays constant.
- KV-cache bytes cap how large your batch can grow before you run out of HBM capacity and bandwidth, so you often hit a memory wall well before reaching the compute ridge at B ≈ 295.
Grouped-query attention exists largely to shrink this stream. So does KV-cache quantization. Both are attacking the same bottleneck: decode is starved for bandwidth, and the KV cache is the part of that bandwidth budget batching cannot rescue.
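A rough sizing of that stream, as a sketch. The layer and head counts below are assumed values in the style of Llama-3.1-70B (80 layers, 8 KV heads under grouped-query attention, head dimension 128); check the model's own config before relying on them. The batch and context length are arbitrary illustrative choices.

```python
# KV-cache bytes per token and per decode step.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_step(batch: int, context_len: int, per_token: float) -> float:
    """Bytes attention reads in one decode step: every cached token, every request."""
    return batch * context_len * per_token


# Assumed config; 64 KV heads models the same network without grouped-query attention.
gqa_fp16 = kv_bytes_per_token(80, 8, 128, 2)
mha_fp16 = kv_bytes_per_token(80, 64, 128, 2)
gqa_fp8 = kv_bytes_per_token(80, 8, 128, 1)

WEIGHT_BYTES = 140e9  # 70B parameters in fp16, from the text
step_bytes = kv_bytes_per_step(batch=64, context_len=8192, per_token=gqa_fp16)

print(f"per token: GQA fp16 {gqa_fp16 / 1e3:.0f} KB, "
      f"MHA fp16 {mha_fp16 / 1e3:.0f} KB, GQA fp8 {gqa_fp8 / 1e3:.0f} KB")
print(f"KV read per step / weight read per step: {step_bytes / WEIGHT_BYTES:.2f}")
```

Under these assumptions the KV traffic at batch 64 and 8K context is already larger than the weight traffic, and it grows with both batch and context while the weight read stays fixed.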
How do I turn this into serving config?
Because prefill and decode hit different hardware limits, modern servers schedule them separately — and you tune them with different knobs. Chunked prefill is the key lever: split long prompts into fixed-size chunks and interleave them with decode steps so a giant prefill doesn't stall everyone else's token generation.
Here is the roofline calculation as code, then the vLLM knobs it informs:
# H100 SXM class numbers
PEAK_FLOPS = 989e12  # dense FP16 tensor FLOP/s
HBM_BW = 3.35e12     # bytes/s
ridge = PEAK_FLOPS / HBM_BW  # ~295 FLOP/byte
print(f"ridge point: {ridge:.0f} FLOP/byte")

def regime(tokens_sharing_weight_read):
    # S for prefill, B for decode batch
    intensity = tokens_sharing_weight_read  # ~1 FLOP/byte per token
    return "compute-bound" if intensity >= ridge else "memory-bound"

print(regime(1))     # decode, batch 1 -> memory-bound
print(regime(4000))  # prefill, 4k prompt -> compute-bound
print(regime(256))   # decode, batch 256 -> still memory-bound, but close
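# vLLM: interleave prefill chunks with decode so long prefills (TTFT work)
# don't block ongoing generation (TPOT).
#   --max-num-batched-tokens  prefill chunk / batch token budget per step
#   --max-num-seqs            decode batch ceiling (KV-cache bound)
#   --kv-cache-dtype fp8      halve KV-cache bytes (bandwidth and capacity)
# Comments live up here because a comment after a trailing backslash
# breaks the shell line continuation.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 256 \
  --kv-cache-dtype fp8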
--max-num-batched-tokens governs how much prefill work lands in a step; too high and a long prompt monopolizes the GPU and spikes everyone's TPOT. --max-num-seqs and --kv-cache-dtype govern the decode side, where you are fighting for KV-cache bandwidth and capacity, not FLOPs. If you tune these as one knob you will always be sacrificing TTFT for TPOT or vice versa without understanding why.
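To make the token budget concrete, here is a toy model of how a long prompt gets split when it shares each step with running decodes. It is a deliberate simplification and not vLLM's actual scheduler: it assumes every running decode consumes one token of the budget per step and the remainder goes to the next prompt chunk. The function name `chunked_prefill_steps` is hypothetical.

```python
# Toy chunked-prefill budget model (not vLLM's real scheduling logic).
def chunked_prefill_steps(prompt_len: int, token_budget: int,
                          running_decodes: int) -> list[int]:
    """Return the prompt-chunk size processed in each step until prefill completes."""
    prefill_budget = token_budget - running_decodes  # decodes take 1 token each
    if prefill_budget <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, prefill_budget)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 4,000-token prompt arriving while 256 requests are decoding,
# with the 2048-token budget from the command above.
chunks = chunked_prefill_steps(prompt_len=4000, token_budget=2048, running_decodes=256)
print(f"{len(chunks)} steps, chunk sizes {chunks}")
```

In this model a smaller budget means more, shorter steps: the new request waits longer for its first token, while the requests already decoding see smaller stalls per step. That is the TTFT-versus-TPOT trade the flag controls.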
The direct answer
LLM decoding is memory-bound because generating one token at a time is a matrix–vector multiply with arithmetic intensity around 1 FLOP/byte, far below an H100's ~295 FLOP/byte ridge point, so every decode step is limited by HBM bandwidth reading the weights, not by tensor-core throughput. Prefill escapes this because it processes the whole prompt at once, turning the operation into a compute-bound matrix–matrix multiply. That single asymmetry explains the rest: single-request decode takes at least model_bytes / HBM_bandwidth per token (a ceiling of about 24 tokens/sec for a 70B fp16 model), quantization speeds decode by cutting bytes moved, batching raises intensity to reclaim compute-bound efficiency for throughput but never for one user's latency, and the KV cache is the per-request memory stream that batching can't amortize. That is why TTFT and TPOT are separate problems that need separate knobs.