jidonglab

Posted on Jul 3

Chunked Prefill: Why One Long Prompt Freezes Your LLM Server

#ai #llm #performance #systemdesign

You ship an LLM service. p50 latency looks great. Then a user pastes a 40-page contract into the chat, and for the next 400 milliseconds every other user's tokens stop arriving. Their streams freeze, then catch up in a burst. Your dashboards show inter-token latency spikes with no obvious cause. Nothing crashed. Nothing is rate-limited. One long prompt did it.

This is prefill-decode interference, and the fix — chunked prefill — is one of the highest-leverage knobs in an LLM serving stack that almost nobody tunes deliberately. Here is the mechanism and the config.

TL;DR

Prefill is compute-bound and runs in one giant forward pass; decode is memory-bound and runs one token at a time. A naive scheduler runs a long prefill as a single batch step, and every in-flight decode request stalls until it finishes.
Chunked prefill splits the prompt into fixed-size chunks and interleaves them with decode tokens in the same forward step, bounding step time so decode latency stays smooth.
The trade-off is TTFT vs ITL. Smaller chunks smooth inter-token latency but raise time-to-first-token for the long prompt and cut total throughput; bigger chunks do the reverse.
In vLLM the lever is max_num_batched_tokens. Lower (~2048) for latency-sensitive chat, higher (8192+) for throughput/batch workloads.
If you need to isolate both fully, use disaggregated prefill — separate GPU pools for prefill and decode — instead of interleaving.

Why does one long prompt freeze the whole server?

Because prefill and decode are two different kinds of work fighting for the same GPU, and by default the long one wins the whole timestep.

Prefill processes every token in the prompt in parallel. For a 32K-token prompt that is a batch of 32K query positions running through every layer's attention and MLP — dense matmuls, high arithmetic intensity, compute-bound. It saturates the GPU's FLOPs for as long as it takes.

Decode is the opposite. Each running generation produces one token per step. That step loads the full model weights and the request's KV cache to compute a single new position — almost no arithmetic per byte moved, so it is memory-bandwidth-bound. Decode steps are individually cheap and want to happen often, at a steady cadence, because that cadence is the user's streaming experience.
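
A back-of-envelope calculation makes the gap concrete. The sketch below assumes a dense model where each token costs about 2 FLOPs per parameter and the weights are read once per forward step in 16-bit precision. It ignores attention and KV-cache traffic, so treat the output as order-of-magnitude only. The function name and the 8B parameter count are illustrative, not taken from any library.

```python
def arithmetic_intensity(tokens_in_step: int, n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weights read for one forward step.

    Assumptions: ~2 FLOPs per parameter per token (one multiply-add),
    weights read once per step, attention and KV-cache traffic ignored.
    """
    flops = 2.0 * n_params * tokens_in_step
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved


n_params = 8e9  # illustrative 8B-parameter model

prefill_ai = arithmetic_intensity(32_768, n_params)  # one 32K-token prefill step
decode_ai = arithmetic_intensity(1, n_params)        # one single-sequence decode step

print(f"prefill: {prefill_ai:.0f} FLOPs/byte, decode: {decode_ai:.0f} FLOPs/byte")
```

Under these assumptions the intensity is simply the number of tokens in the step, which is why batching many decodes together helps but a lone decode token can never keep the compute units busy.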

A naive continuous-batching scheduler treats a forward pass as one atomic step. When a big prefill arrives, it schedules that prefill as its own step (or lumped with a few decodes). That single step might take 300–500 ms on a long prompt. During those 300–500 ms, no decode step runs, so every streaming user sees their tokens pause. When the prefill finishes, decode resumes and the queued tokens flush out in a burst. That is your inter-token latency (ITL) spike. One request degraded all of them — a classic head-of-line blocking problem, just at the GPU-scheduling layer.
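
To see why the median hides this, here is a minimal sketch that computes inter-token gaps from token arrival timestamps. The stream is synthetic: a token every 20 ms with a single 400 ms stall, numbers chosen only to mirror the scenario above. The helper name is hypothetical.

```python
import statistics


def inter_token_gaps(arrival_times_ms: list[float]) -> list[float]:
    """Gaps between consecutive token arrival timestamps for one stream."""
    return [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]


# Synthetic stream: a token every 20 ms, except one 400 ms stall
# while a long prefill holds the GPU.
arrivals = [0.0]
for i in range(1, 101):
    gap = 400.0 if i == 50 else 20.0
    arrivals.append(arrivals[-1] + gap)

gaps = inter_token_gaps(arrivals)
median_gap = statistics.median(gaps)
worst_gap = max(gaps)

print(f"median ITL: {median_gap:.0f} ms, worst ITL: {worst_gap:.0f} ms")
```

The median stays at the steady cadence while the maximum exposes the stall, so alert on tail ITL (p99 or max per stream), not on the median.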

What is chunked prefill?

Chunked prefill splits a prompt's prefill into fixed-size chunks and interleaves those chunks with ongoing decode tokens inside a single forward pass, so no step is dominated by one long prompt.

Instead of "prefill all 32K tokens in one step," the scheduler defines a per-step token budget, say 2048 tokens. Each step it first schedules the decode tokens for every running request (one query position each), then fills the remaining budget with a chunk of the waiting prefill. A 32K prompt becomes ~16 chunks spread across ~16 steps (a few more when decode tokens take a share of each budget), and each of those steps also carries the decode tokens for everyone else.

The result: step time is bounded by the token budget, not by the largest prompt. Decode tokens now ride along in every step, so their cadence stays roughly constant. The ITL spike flattens into a small, steady tax.
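
The chunk arithmetic can be sketched in a few lines. This assumes every step reserves one token per running decode and hands the rest of the budget to a single prefilling prompt, and that the number of running decodes stays constant while the prompt prefills. The function name is hypothetical.

```python
import math


def prefill_steps(prompt_tokens: int, budget: int, running_decodes: int) -> int:
    """Number of forward steps needed to prefill one prompt.

    Assumes every step reserves one token per running decode and gives
    the rest of the budget to this prompt, and that the set of running
    decodes does not change while the prompt is prefilling.
    """
    chunk = budget - running_decodes
    if chunk <= 0:
        raise ValueError("decode tokens alone fill the budget; no room for prefill")
    return math.ceil(prompt_tokens / chunk)


print(prefill_steps(32_768, 2048, 0))    # idle server
print(prefill_steps(32_768, 2048, 64))   # 64 streams already decoding
```

With a 2048-token budget, a 32,768-token prompt needs 16 steps on an idle server and 17 once 64 decode streams each claim a slot per step.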

The kernel cost of this is real: each step now runs a mixed batch. Some positions are prefill (attending over their chunk plus all previously cached KV) and some are single-query decode positions. This is why modern serving stacks lean on variable-length FlashAttention kernels that handle prefill and decode positions in one launch. Without efficient mixed-batch attention support, chunked prefill isn't practical.

How do you configure chunked prefill in vLLM?

In vLLM, chunked prefill is controlled by a flag and a token budget. On recent versions it is on by default for many models, but you should set it explicitly.

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    # The per-step token budget. Decode tokens are scheduled first;
    # the remainder is filled with a prefill chunk.
    max_num_batched_tokens=2048,
    max_num_seqs=256,        # cap on concurrent sequences
    gpu_memory_utilization=0.90,
)

Or on the server:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 256

The scheduling logic each step is roughly:

budget = max_num_batched_tokens
batch  = []

# 1. Decode-priority: keep streams alive first.
for req in running_decodes:
    if budget == 0:
        break                        # budget exhausted by decodes alone
    batch.append(req.next_token())   # 1 token each
    budget -= 1

# 2. Fill the rest with slices of prefilling requests.
for req in waiting_prefills:
    if budget == 0:
        break                        # nothing left for prefill this step
    take = min(budget, req.remaining_prompt_tokens)
    batch.append(req.prefill_chunk(take))
    budget -= take

run_forward_pass(batch)   # one mixed prefill+decode step

Decode-priority is the key detail. By scheduling decodes before prefill chunks, vLLM lets every running generation advance one token per step (as long as the running sequences fit inside the budget), which is what actually protects ITL. The prefill just soaks up whatever budget is left.
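
For readers who want to step through it, here is a small runnable version of that loop. It is a toy, not vLLM's scheduler: the Request class, schedule_step, and the request names are all hypothetical, and it ignores KV-cache memory limits, preemption, and request completion.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; not vLLM's internal type."""
    name: str
    prompt_tokens: int
    prefilled: int = 0

    @property
    def remaining(self) -> int:
        return self.prompt_tokens - self.prefilled


def schedule_step(decoding: list[Request], prefilling: list[Request], budget: int):
    """Build one mixed batch as (name, kind, tokens) tuples, decode first."""
    batch = []
    # 1. Decode-priority: one token per running stream, while budget lasts.
    for req in decoding:
        if budget == 0:
            break
        batch.append((req.name, "decode", 1))
        budget -= 1
    # 2. Fill what is left with prefill chunks, first come first served.
    for req in prefilling:
        if budget == 0:
            break
        take = min(budget, req.remaining)
        req.prefilled += take
        batch.append((req.name, "prefill", take))
        budget -= take
    return batch


decoding = [Request("chat-a", 100, 100), Request("chat-b", 200, 200)]
prefilling = [Request("contract", 5000)]

steps = []
while prefilling:
    steps.append(schedule_step(decoding, prefilling, budget=2048))
    # Requests whose prompt is fully prefilled move on to decoding.
    for req in [r for r in prefilling if r.remaining == 0]:
        prefilling.remove(req)
        decoding.append(req)

for i, batch in enumerate(steps):
    print(i, batch)
```

The 5000-token prompt is spread over three steps, and both chat streams get their decode token in every one of them.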

What is the TTFT vs ITL trade-off?

Chunked prefill converts a latency spike for everyone into slightly slower first-token latency for the long prompt. That is the whole trade, and max_num_batched_tokens is the dial.

Set the budget low (e.g. 2048):

Each step is short, so decode ITL is smooth and jitter is low. Good for interactive chat.
The long prompt's prefill is spread across more steps, so its time-to-first-token (TTFT) rises.
Total throughput drops: smaller chunks mean lower arithmetic intensity per step and more repeated weight loads, so you leave GPU FLOPs on the table.

Set the budget high (e.g. 8192 or 16384):

Prefill finishes in fewer, fatter steps — better GPU utilization, higher tokens/sec throughput, lower TTFT for long prompts.
But each step is longer, so decode tokens that share those steps see more jitter. The freeze comes partway back.

There is no free lunch here — you are choosing which SLO to protect. If your product is a streaming chat assistant, ITL smoothness is the felt experience, so bias small. If you run offline batch summarization where nobody watches tokens stream, bias large and harvest throughput.

One non-obvious effect: with chunked prefill enabled, decode and prefill share a step, so decode throughput can dip slightly compared with a batch that contains only decode tokens. You are paying a small steady-state tax to eliminate the tail spikes. For latency-SLO-bound services that trade is almost always correct.
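
A toy cost model shows the shape of the trade. It assumes each step costs a fixed overhead plus a per-token cost; both constants below are made-up placeholders, not measurements, and the model ignores the extra attention work of re-reading cached KV for later chunks. The function name is hypothetical.

```python
def simulate_long_prefill(prompt_tokens, budget, running_decodes,
                          fixed_ms=10.0, ms_per_token=0.0125):
    """Toy model: returns (ttft_ms, worst_step_ms) for one long prompt.

    Step time is modelled as fixed_ms + tokens * ms_per_token. Both
    constants are made-up placeholders, not measurements.
    """
    chunk = budget - running_decodes
    if chunk <= 0:
        raise ValueError("budget must exceed the number of running decodes")
    remaining, ttft_ms, worst_step_ms = prompt_tokens, 0.0, 0.0
    while remaining > 0:
        take = min(chunk, remaining)
        step_ms = fixed_ms + (running_decodes + take) * ms_per_token
        ttft_ms += step_ms
        worst_step_ms = max(worst_step_ms, step_ms)
        remaining -= take
    return ttft_ms, worst_step_ms


# The last budget is large enough to prefill the whole prompt in one step.
for budget in (2048, 8192, 32_768 + 64):
    ttft, worst = simulate_long_prefill(32_768, budget, running_decodes=64)
    print(f"budget={budget:>6}  TTFT~{ttft:6.1f} ms  worst step~{worst:6.1f} ms")
```

In this model the small budget gives the shortest worst-case step (the ITL bound) and the longest TTFT, and the unchunked case gives the reverse. The real curve depends on your model and GPU, so measure before trusting the shape.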

How do you tune max_num_batched_tokens?

Start from your latency target, not from a throughput number. Pick the budget that keeps your worst-case single-step time under your ITL SLO, then raise it only until jitter reappears.

A practical procedure:

Measure a single forward-step time at candidate budgets (2048, 4096, 8192) for your model and GPU. Step time stays roughly flat while the step is memory-bound, then grows roughly linearly with token count once you saturate compute.
Set your ITL SLO — e.g. "no user waits more than 50 ms between tokens at p99." Your per-step time must fit inside that, because in the worst case a decode token waits one full step.
Pick the largest budget whose step time still fits the SLO. That maximizes throughput without breaking the latency promise.
Watch p99 ITL under load with real long prompts mixed in, not just synthetic short ones. The spike only shows up when a genuinely long prefill collides with active decodes.
Cap concurrency with max_num_seqs so decode tokens alone don't blow the budget — if you have 256 running sequences, that's 256 decode tokens per step before any prefill chunk fits.
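
The selection step in that procedure can be sketched as follows. The step times are placeholders standing in for your own measurements, and pick_budget is a hypothetical helper, not part of vLLM.

```python
def pick_budget(step_time_ms: dict[int, float], itl_slo_ms: float, max_num_seqs: int) -> int:
    """Largest candidate budget whose measured step time fits the ITL SLO.

    step_time_ms maps a candidate max_num_batched_tokens value to the
    step time you measured for it. Budgets that cannot leave room for a
    prefill chunk beyond max_num_seqs decode tokens are skipped.
    """
    fitting = [
        budget
        for budget, ms in step_time_ms.items()
        if ms <= itl_slo_ms and budget > max_num_seqs
    ]
    if not fitting:
        raise ValueError("no candidate budget meets the SLO")
    return max(fitting)


# Placeholder measurements, not benchmarks: replace with your own.
measured = {2048: 28.0, 4096: 47.0, 8192: 95.0}

print(pick_budget(measured, itl_slo_ms=50.0, max_num_seqs=256))
```

With these placeholder numbers and a 50 ms SLO the answer is 4096: the 8192 budget is faster in aggregate but its step time breaks the latency promise.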

Rule of thumb: interactive chat, 2048–4096; mixed traffic, 4096–8192; throughput-first offline, 8192+ and stop caring about ITL.

When should you use disaggregated prefill instead?

When you need to protect both TTFT and ITL at high load, and interleaving forces you to compromise one, move prefill and decode onto separate GPU pools — disaggregated prefill.

Chunked prefill shares one GPU between the two phases, so they still contend. Disaggregation runs prefill on one set of GPUs and decode on another, streaming the computed KV cache from prefill nodes to decode nodes over the interconnect. Prefill GPUs run compute-bound and stay saturated; decode GPUs run memory-bound at a steady cadence with zero prefill interference. Each phase gets hardware tuned to its bottleneck.

The cost is real complexity: you now move KV cache across the network (bandwidth- and latency-sensitive), you provision two pools whose ratio you must balance to traffic, and you add a transfer hop to TTFT. It pays off at scale, in large deployments with strict, separate TTFT and ITL SLOs, and is overkill for a single-node service. For most teams, chunked prefill with a well-chosen max_num_batched_tokens captures the majority of the benefit with almost none of the operational cost. Reach for disaggregation only when a single GPU genuinely can't serve both phases within SLO.
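
To get a feel for the transfer hop, you can estimate how much KV cache a long prompt produces. The sketch below assumes the published Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache; the 50 GB/s link speed is a hypothetical placeholder, and the estimate ignores protocol overhead and any overlap of transfer with compute.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for `tokens` positions: K and V for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, 16-bit cache.
size = kv_cache_bytes(tokens=32_768, layers=32, kv_heads=8, head_dim=128)

# Hypothetical link speed; substitute your interconnect's real throughput.
link_gb_per_s = 50.0
transfer_ms = size / (link_gb_per_s * 1e9) * 1e3

print(f"{size / 2**30:.1f} GiB of KV cache, ~{transfer_ms:.0f} ms at {link_gb_per_s:.0f} GB/s")
```

Under these assumptions a 32K-token prompt carries 4 GiB of KV cache, so the interconnect sits directly on the TTFT path and deserves the same scrutiny as the GPUs.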

So why does one long prompt freeze your LLM server?

Because prefill and decode compete for the same GPU, and a naive scheduler runs a long prefill as one atomic forward step that blocks every in-flight decode until it completes — head-of-line blocking at the batching layer. Chunked prefill fixes it by slicing the prompt into fixed-size chunks and interleaving them with decode tokens in each step, bounding step time so streaming stays smooth. You tune it with max_num_batched_tokens: smaller values (~2048) smooth inter-token latency for interactive chat at some cost to TTFT and throughput, larger values (8192+) maximize throughput for batch work. When one GPU can't satisfy both TTFT and ITL SLOs even with chunking, disaggregate prefill and decode onto separate pools. Set the budget from your latency target first, then raise it until jitter returns — that boundary is your answer.

Top comments (1)

speed engineer • Jul 4

The decode-priority scheduling detail is the part most chunked-prefill writeups skip, and it's the actual mechanism protecting ITL — everything else is budget arithmetic. Also appreciate the head-of-line-blocking framing: it's the same failure shape as one slow consumer starving a shared worker pool, which makes this very teachable to people who know classic backend systems but are new to LLM serving.

One thing worth adding for people tuning this: prefix caching changes the math. With long shared system prompts, cache hits shrink the effective prefill, so you can often afford a smaller max_num_batched_tokens than raw prompt lengths suggest.

And a question — how do you think about fairness BETWEEN long prefills? With several 30K prompts queued, FCFS chunk-filling stacks the second prompt's TTFT behind the first. Round-robin across waiting prefills smooths TTFT tails but delays everyone's first token. Curious where you land.