Prefill/Decode Disaggregation: Stop Serving LLMs on One GPU

Put a single 32K-token prompt into a busy inference server and watch every other user's token stream stutter. That hiccup is not a bug. It is two fundamentally different workloads — prefill and decode — being forced through the same GPU at the same time. Prefill/decode disaggregation is the serving architecture that stops them from stepping on each other by running them on separate GPU pools.

If you run vLLM, SGLang, or TensorRT-LLM at any real concurrency, this is the single biggest latency lever you are probably not pulling.

Key takeaways

Prefill (processing the prompt) is compute-bound; decode (generating tokens one at a time) is memory-bandwidth-bound. They have opposite hardware bottlenecks.
Colocating them means a long prefill monopolizes the GPU and spikes the inter-token latency of every sequence currently decoding — the classic TTFT-vs-ITL tug of war.
Prefill/decode disaggregation runs prefill on one GPU pool and decode on another, then ships the KV cache between them over NVLink/RDMA.
It raises goodput (requests served within SLO), not raw throughput. The cost is KV-cache transfer bandwidth and a harder-to-tune provisioning ratio.
Skip it for low concurrency, short prompts, or single-GPU deployments — the transfer overhead dominates. It pays off at scale with long, uneven prompts.

Why do prefill and decode have opposite bottlenecks?

A transformer forward pass does the same math in both phases, but the shape of that math is completely different.

Prefill processes all N prompt tokens in parallel. Every weight matrix is multiplied against an N × d_model activation block — a big, dense GEMM. Arithmetic intensity (FLOPs per byte of memory traffic) scales with N. For a multi-thousand-token prompt you are firmly compute-bound: the tensor cores are the bottleneck and HBM bandwidth is mostly idle.

Decode generates one token per sequence per step. Each step is effectively a GEMV — a matrix times a single vector — plus a read of the entire KV cache accumulated so far. You reload the full weight matrices from HBM to produce a handful of output tokens. Arithmetic intensity scales only with batch size, and it is low. Decode is memory-bandwidth-bound. This is why decode is slow even when the GPU shows "100% utilization" — the ALUs are stalled waiting on memory.

Here is the split in napkin math:

# Rough arithmetic intensity (FLOPs per byte moved) for a dense layer.
# Weights live in HBM; we reload them every step in decode.

def intensity(tokens, d_model=8192, bytes_per_param=2):
    # One weight matrix: d_model x d_model
    flops  = 2 * tokens * d_model * d_model      # GEMM: scales with tokens
    wbytes = d_model * d_model * bytes_per_param  # weight read is fixed
    return flops / wbytes

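# With 2-byte weights the ratio reduces to `tokens` FLOPs per byte.
# Activation traffic is ignored, so treat these as upper-bound estimates.
print(intensity(tokens=4096))  # prefill: 4096.0 -> compute-bound
print(intensity(tokens=1))     # decode:  1.0    -> memory-bound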

An A100/H100-class GPU has a roofline "ridge point" in the hundreds of FLOPs/byte. Prefill sits far to the right of it (compute-bound); decode sits far to the left (bandwidth-bound). One workload wants FLOPs, the other wants bytes. No single scheduling policy makes both happy.
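
To make the ridge point concrete, here is a minimal sketch that divides peak compute by memory bandwidth. The peak-FLOPs and bandwidth figures are approximate dense-BF16 datasheet numbers used as assumptions, not measurements, and the helper names are made up for illustration.

```python
# Approximate dense BF16 peak FLOPs and HBM bandwidth (assumed datasheet-level
# figures; real sustained numbers are lower and vary by SKU).
GPUS = {
    "A100-80GB": {"peak_flops": 312e12, "hbm_bytes_per_s": 2.0e12},
    "H100-SXM": {"peak_flops": 989e12, "hbm_bytes_per_s": 3.35e12},
}


def ridge_point(peak_flops, hbm_bytes_per_s):
    """FLOPs per byte at which a kernel stops being bandwidth-bound."""
    return peak_flops / hbm_bytes_per_s


def weight_gemm_intensity(tokens, bytes_per_param=2):
    """Intensity of one weight GEMM, counting only weight reads: 2*t*d*d / (d*d*bytes)."""
    return 2 * tokens / bytes_per_param


def regime(tokens, ridge):
    return "compute-bound" if weight_gemm_intensity(tokens) > ridge else "bandwidth-bound"


for name, spec in GPUS.items():
    ridge = ridge_point(**spec)
    print(f"{name}: ridge ~{ridge:.0f} FLOPs/byte | "
          f"4096-token prefill: {regime(4096, ridge)} | "
          f"batch-32 decode: {regime(32, ridge)}")
```

Under these assumptions the ridge lands at roughly 156 FLOPs/byte for the A100 and roughly 295 for the H100, so a decode batch has to reach the hundreds before the weight GEMMs stop being bandwidth-bound.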

Why does colocating prefill and decode hurt latency?

Because the two phases compete for the same GPU, and prefill wins by brute force.

Your users care about two latency numbers:

TTFT (time to first token) — dominated by prefill.
ITL / TPOT (inter-token latency, or time per output token) — dominated by decode.

In a colocated server using continuous batching, the scheduler interleaves a new request's prefill with the ongoing decode steps of everyone else. When a 32K-token prefill lands, it saturates the tensor cores for hundreds of milliseconds to several seconds, depending on model size and parallelism. Every sequence that was happily emitting a token every ~20 ms now waits behind that prefill. Their ITL spikes. The token stream visibly stutters.

Chunked prefill — slicing a long prompt into pieces and interleaving them with decode batches — softens this but does not remove it. You are still time-sharing one set of tensor cores between a compute-hungry job and a latency-sensitive one. Tune the chunk size down to protect ITL and you inflate TTFT; tune it up to protect TTFT and you wreck ITL. You are moving pain around a fixed budget, not eliminating it.
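
The chunk-size trade-off can be sketched with a toy model in which every scheduler iteration runs one prefill chunk plus one decode step. The prefill rate and decode step time below are hypothetical placeholders, not benchmarks; only the direction of the trend matters.

```python
import math


def chunked_prefill(prompt_tokens, chunk_tokens,
                    prefill_tok_per_s=20_000.0,  # hypothetical prefill rate
                    decode_step_s=0.020):        # hypothetical decode step time
    """Toy model: each iteration = one prefill chunk + one decode step.

    Returns (ttft_s, itl_s). Every chunk is charged at full size, so the
    result is slightly pessimistic when the prompt is not a chunk multiple.
    """
    n_iters = math.ceil(prompt_tokens / chunk_tokens)
    iter_s = chunk_tokens / prefill_tok_per_s + decode_step_s
    ttft_s = n_iters * iter_s  # the new request waits for every chunk
    itl_s = iter_s             # decoding requests get one token per iteration
    return ttft_s, itl_s


for chunk in (512, 2048, 8192, 32768):
    ttft, itl = chunked_prefill(prompt_tokens=32768, chunk_tokens=chunk)
    print(f"chunk={chunk:>6}  TTFT={ttft:6.2f}s  ITL during prefill={itl * 1000:7.1f} ms")
```

In this model the total prefill work is constant; the chunk size only decides how much of it each decode step has to sit behind. Smaller chunks add per-iteration decode overhead to TTFT, larger chunks push the whole stall onto ITL.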

Disaggregation eliminates it by giving each phase its own hardware. Prefill nodes run large batches at high compute utilization without any decode job to protect. Decode nodes run tight, bandwidth-optimal batches with predictable per-token latency because no prefill ever lands on them.

How does disaggregation actually move the KV cache?

The prefill node computes the prompt's KV cache, then transfers it to a decode node, which continues generation from there. The KV cache is the handoff artifact.

That transfer is the whole ballgame. A KV cache is 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_bytes per request. For a long prompt on a large model this is hundreds of megabytes to gigabytes. Move it over PCIe and you have just recreated the stall you were trying to avoid. So disaggregated stacks assume a fast interconnect — NVLink within a node, RDMA/InfiniBand across nodes — and overlap the transfer with computation layer by layer, streaming each layer's KV as soon as it is produced instead of waiting for the full prompt.
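
To see why layer-wise streaming matters, here is a small simulation of the handoff as a two-stage pipeline (compute a layer's KV, then send it). The per-layer timings are hypothetical, and the function name is made up for this sketch.

```python
def handoff_done_s(n_layers, compute_s_per_layer, transfer_s_per_layer, streamed):
    """Time from prefill start until the decode node holds the full KV cache."""
    if not streamed:
        # Send everything only after the whole prompt has been processed.
        return n_layers * (compute_s_per_layer + transfer_s_per_layer)

    compute_done = 0.0  # when the current layer's KV is ready
    link_free = 0.0     # when the link finishes its previous send
    for _ in range(n_layers):
        compute_done += compute_s_per_layer
        # A layer's KV can go out once it exists AND the link is idle.
        link_free = max(link_free, compute_done) + transfer_s_per_layer
    return link_free


# Hypothetical numbers: 80 layers, 20 ms of prefill compute and 2 ms of
# transfer per layer.
compute_only = 80 * 0.020
for streamed in (False, True):
    total = handoff_done_s(80, 0.020, 0.002, streamed)
    print(f"streamed={streamed}: handoff at {total:.3f}s, "
          f"exposed transfer {total - compute_only:.3f}s")
```

When transfer per layer is shorter than compute per layer, only the last layer's send is exposed. If the link is the slower stage, the handoff time becomes link-bound and streaming hides much less, which is the case worth profiling for.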

Grouped-query attention helps enormously here: fewer KV heads means a smaller cache to move. It is not a coincidence that disaggregation and GQA rose together.
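
The size formula above is easy to turn into a calculator. The model shape here (80 layers, head dimension 128, 64 versus 8 KV heads) and the 50 GB/s link are illustrative assumptions, not a description of any specific deployment.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Per-request KV cache size; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


def transfer_seconds(n_bytes, link_bytes_per_s):
    """Lower bound on transfer time, ignoring latency and protocol overhead."""
    return n_bytes / link_bytes_per_s


SEQ_LEN = 32_768
LINK_BYTES_PER_S = 50e9  # assumed effective link bandwidth, for illustration only

for label, kv_heads in (("MHA, 64 KV heads", 64), ("GQA,  8 KV heads", 8)):
    size = kv_cache_bytes(SEQ_LEN, num_layers=80, num_kv_heads=kv_heads, head_dim=128)
    print(f"{label}: {size / 2**30:.0f} GiB, "
          f">= {transfer_seconds(size, LINK_BYTES_PER_S):.2f}s on the assumed link")
```

With this shape a 32K-token prompt needs 80 GiB of KV under full multi-head attention and 10 GiB under GQA with 8 KV heads, an 8x reduction in bytes to ship.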

Conceptually the config looks like this (vLLM-style, using a KV connector; the exact API is still evolving, so treat this as illustrative):

# Prefill worker: produces KV, pushes it to the transfer buffer
vllm serve MODEL \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector",
                         "kv_role":"kv_producer",
                         "kv_rank":0}'

# Decode worker: pulls KV, continues generation
vllm serve MODEL \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector",
                         "kv_role":"kv_consumer",
                         "kv_rank":1}'

A front-end router sends the request to a prefill worker, waits for the KV handoff, then routes the decode stream through a decode worker. Systems like DistServe and Mooncake formalized this; production stacks (vLLM's disaggregated-prefill path, SGLang, TensorRT-LLM, NVIDIA Dynamo) now ship variants of it.
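
The routing flow can be sketched as a toy asyncio program. Everything here (`PrefillWorker`, `DecodeWorker`, `Router`, the KV handle dict) is hypothetical scaffolding for illustration and does not mirror the API of vLLM, SGLang, or any other real stack.

```python
import asyncio
import itertools


class PrefillWorker:
    def __init__(self, name):
        self.name = name

    async def prefill(self, request_id, prompt):
        await asyncio.sleep(0)  # stand-in for the compute-bound prompt pass
        # A real system would return a handle to KV blocks that have already
        # been streamed to the decode node; here it is just a dict.
        return {"request_id": request_id, "prompt_tokens": len(prompt.split()),
                "prefilled_by": self.name}


class DecodeWorker:
    def __init__(self, name):
        self.name = name

    async def decode(self, kv_handle, max_new_tokens):
        for i in range(max_new_tokens):
            await asyncio.sleep(0)  # stand-in for one bandwidth-bound decode step
            yield f"tok{i}"


class Router:
    """Round-robin front end: prefill first, then stream from a decode worker."""

    def __init__(self, prefill_pool, decode_pool):
        self._prefill = itertools.cycle(prefill_pool)
        self._decode = itertools.cycle(decode_pool)

    async def generate(self, request_id, prompt, max_new_tokens):
        prefill_worker = next(self._prefill)
        decode_worker = next(self._decode)
        kv_handle = await prefill_worker.prefill(request_id, prompt)  # wait for handoff
        return [tok async for tok in decode_worker.decode(kv_handle, max_new_tokens)]


router = Router([PrefillWorker("p0"), PrefillWorker("p1")], [DecodeWorker("d0")])
print(asyncio.run(router.generate("req-1", "why is the sky blue", max_new_tokens=4)))
```

A production router would also pick workers by load and KV locality rather than round-robin, and would stream tokens to the client instead of collecting them into a list.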

When is prefill/decode disaggregation not worth it?

Most of the time, if you are small. Disaggregation is a scale optimization with real fixed costs, and below a threshold the costs win.

Skip it when:

You run a single GPU or a single small node. There is nowhere to send the KV cache. Colocation with chunked prefill is correct here.
Prompts are short and uniform. If prefill is cheap relative to decode, it never monopolizes the GPU long enough to hurt. The interference you are paying to remove barely exists.
Your interconnect is PCIe. Without NVLink or RDMA, KV transfer latency can eat the entire benefit. Measure your KV-cache size against your link bandwidth before committing.
Concurrency is low. With a handful of in-flight requests there is little contention to resolve; you would just be adding a network hop to every request.

Disaggregation pays off in the opposite regime: high concurrency, long and highly variable prompt lengths (RAG, agents, long documents), tight ITL SLOs for streaming UX, and a fat interconnect. That is exactly the shape of a production chat or agent backend.

How do you provision the prefill and decode pools?

You size the two pools independently to hit your TTFT and ITL targets, and the ratio is workload-specific — this is the new tuning knob disaggregation buys you.

Colocated serving forces one resource pool to satisfy both SLOs; you over-provision to protect the tighter one. Disaggregation lets you scale each phase to its own bottleneck. A RAG workload with 16K-token prompts and short answers is prefill-heavy — you might run more prefill workers than decode workers. A chatbot with short prompts and long, chatty responses is decode-heavy — invert the ratio.

The metric to optimize is goodput: requests per second served within SLO, not raw throughput. A colocated server can post high aggregate tokens/sec while missing its ITL target on half the requests. Disaggregation trades a little peak throughput (you pay for KV transfer and can't opportunistically pack a decode step into idle prefill cycles) for far higher goodput, because both latency distributions get tighter and more predictable.
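
The difference between goodput and throughput is simple to compute from per-request measurements. The record type, SLO values, and sample data below are invented for illustration; a real load test would supply its own.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_s: float      # time to first token
    p99_itl_s: float   # worst-case-ish inter-token latency for this request
    output_tokens: int


def goodput_and_throughput(requests, window_s, ttft_slo_s, itl_slo_s):
    """Return (requests/s that met both SLOs, total output tokens/s)."""
    within_slo = [r for r in requests
                  if r.ttft_s <= ttft_slo_s and r.p99_itl_s <= itl_slo_s]
    goodput_rps = len(within_slo) / window_s
    tokens_per_s = sum(r.output_tokens for r in requests) / window_s
    return goodput_rps, tokens_per_s


# Made-up sample: four requests finished in a 2-second window.
sample = [
    RequestStats(ttft_s=0.4, p99_itl_s=0.030, output_tokens=200),
    RequestStats(ttft_s=0.5, p99_itl_s=0.250, output_tokens=300),  # ITL blown by a prefill
    RequestStats(ttft_s=2.5, p99_itl_s=0.025, output_tokens=150),  # TTFT blown by queueing
    RequestStats(ttft_s=0.3, p99_itl_s=0.028, output_tokens=250),
]
goodput, tps = goodput_and_throughput(sample, window_s=2.0, ttft_slo_s=1.0, itl_slo_s=0.050)
print(f"goodput={goodput:.1f} req/s, throughput={tps:.0f} tok/s")
```

Here all four requests count toward tokens/sec, but only two count toward goodput, which is the gap a throughput-only dashboard hides.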

Practical starting recipe:

Measure your real prompt-length and output-length distributions — not averages, the p50/p95/p99.
Estimate prefill GPU-time vs decode GPU-time from those distributions to get a rough pool ratio. Compare time rather than FLOPs, because decode is limited by memory bandwidth, not compute.
Deploy with a fast interconnect, enable layer-wise KV streaming, and confirm the KV transfer overlaps compute (profile it — a non-overlapped transfer is the classic misconfiguration).
Load-test and tune the ratio against goodput, not tokens/sec.
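
One way to turn the second step of the recipe into a number is to compare the GPU-seconds each phase needs, since decode cost is set by step time and batch size rather than FLOPs. The rates, batch size, and sample workloads below are hypothetical; substitute figures profiled on your own hardware.

```python
def prefill_to_decode_ratio(prompt_lens, output_lens,
                            prefill_tok_per_s=20_000.0,  # hypothetical per-GPU prefill rate
                            decode_step_s=0.020,         # hypothetical decode step time
                            decode_batch=64):            # sequences sharing each decode step
    """Rough prefill:decode GPU ratio from sampled request lengths."""
    prefill_gpu_s = sum(prompt_lens) / prefill_tok_per_s
    # Each decode step yields one token for every sequence in the batch,
    # so the per-token GPU cost is step time divided by batch size.
    decode_gpu_s = sum(output_lens) * decode_step_s / decode_batch
    return prefill_gpu_s / decode_gpu_s


rag = prefill_to_decode_ratio(prompt_lens=[16_000] * 100, output_lens=[200] * 100)
chat = prefill_to_decode_ratio(prompt_lens=[500] * 100, output_lens=[800] * 100)
print(f"RAG-like:  {rag:.1f} prefill GPUs per decode GPU")
print(f"chat-like: {chat:.1f} prefill GPUs per decode GPU")
```

This only balances average load. The tail percentiles from the first step still decide how much headroom each pool needs, so use the result as a starting point for the load test, not as the answer.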

The bottom line

Prefill and decode are not two stages of one job; they are two different jobs with opposite hardware demands (one compute-bound, one memory-bandwidth-bound) that a colocated server can only serve by making one suffer for the other. Prefill/decode disaggregation ends that compromise by running each phase on its own GPU pool and shipping the KV cache between them over a fast interconnect. You trade a slice of peak throughput and some transfer bandwidth for tight, independent control over TTFT and inter-token latency, and for the ability to scale each phase to its own bottleneck. For a single GPU or short uniform prompts, don't bother: colocation with chunked prefill is simpler and just as good. At production scale with long, uneven prompts and strict streaming SLOs, disaggregation is how you keep one big prompt from freezing everyone else's token stream.