byeongsoo kang

Posted on Jun 10 • Originally published at bric.pe.kr

The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

#llm #performance #machinelearning #rag

My MTP post showed multi-token prediction roughly doubling Qwen3.6-27B's generation on a 3090. A reader asked the question I'd skipped: what about prompt processing at long context? So I measured it — and that turns out to be the real wall, the one MTP can't climb.

TL;DR

On a single RTX 3090, prefill (prompt processing) for Qwen3.6-27B drops from ~1,575 tok/s at 1k context to ~852 tok/s at 128k, so a 64k-token prompt takes ~59 seconds before the first token appears and a 128k prompt takes ~2.5 minutes. MTP speeds up the decode phase, not prefill, so on a long-context / short-answer request (the typical RAG shape) its roughly 2× generation win shrinks to a ~3% cut in total latency. MTP is real; it just stops mattering exactly where long-context RAG lives.

Prefill vs context size

llama-bench, Qwen3.6-27B IQ4_XS, prefill only (-n 0), flash-attention on, single RTX 3090:

context	prefill tok/s	time to first token	peak VRAM	fits 24 GB?
1,024	1,575	0.65 s	16.0 GB	yes
16,384	1,432	11.4 s	16.5 GB	yes
65,536	1,111	59.0 s	19.6 GB	yes
131,072	852	153.8 s	23.6 GB	barely (98.5%)

Prefill throughput falls ~46% from 1k to 128k as the attention cost grows with sequence length, so time-to-first-token climbs faster than linearly with prompt size: 128× the tokens costs about 237× the wait (0.65 s to 153.8 s). These numbers are rock-stable (CV < 0.5%); prefill is compute-bound, unlike the noisier MTP generation numbers from last time.
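The time-to-first-token column is just context size divided by prefill throughput, so it can be rechecked in a few lines. A minimal sketch using only the four measured rows above; the function and variable names are mine, not anything llama-bench emits:

```python
# Measured prefill throughput from the table above: context tokens -> tok/s.
PREFILL_TOK_S = {1_024: 1_575, 16_384: 1_432, 65_536: 1_111, 131_072: 852}


def ttft_seconds(context_tokens: int, prefill_tok_s: float) -> float:
    """Time to first token, counting prompt processing only."""
    return context_tokens / prefill_tok_s


def throughput_drop(base_tok_s: float, tok_s: float) -> float:
    """Fractional loss in throughput relative to the baseline."""
    return 1.0 - tok_s / base_tok_s


if __name__ == "__main__":
    for ctx, tps in PREFILL_TOK_S.items():
        print(f"{ctx:>7} tokens: {ttft_seconds(ctx, tps):6.1f} s")

    drop = throughput_drop(PREFILL_TOK_S[1_024], PREFILL_TOK_S[131_072])
    print(f"throughput drop 1k -> 128k: {drop:.0%}")

    # How much worse than linear scaling is the wait?
    ratio = ttft_seconds(131_072, 852) / ttft_seconds(1_024, 1_575)
    print(f"128x the tokens costs {ratio:.0f}x the time")
```

The last ratio is the useful one: the prompt grows 128-fold, but the wait grows more than that because throughput sags along the way.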

MTP speeds decode, not prefill

Speculative decoding (MTP) works during generation: a draft proposes several tokens ahead and the main model verifies them in one pass. Prefill is a different phase — a single forward pass over the whole prompt to build the KV cache, before any token is generated. MTP doesn't touch that pass, so it can't reduce time-to-first-token. What it reduces is the per-token cost of everything after the first token.

That's not the same as "MTP doesn't help long context." If you generate a lot of tokens, MTP still cuts the generation portion. The honest question is: how big is the generation portion relative to prefill?
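To make the draft-and-verify idea concrete, here is a toy sketch of greedy speculative decoding. Everything in it is a stand-in: `toy_target` and `toy_draft` are made-up functions, not Qwen or llama.cpp internals, and a real engine verifies all proposed positions in one batched forward pass, where this toy only counts the pass. Note where the loop starts: after the prompt is already in hand, which is why none of this touches prefill.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 3
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (new_tokens, verify_passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_new:
        # The draft cheaply proposes k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # One verification pass over the proposal (counted once here).
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # Every proposal matched, so the same pass yields one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], passes


def toy_target(seq: List[int]) -> int:
    """Hypothetical deterministic 'main model'."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Hypothetical draft: agrees with the target except at every 4th position."""
    guess = toy_target(seq)
    return guess if len(seq) % 4 else (guess + 1) % 11


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_draft, [1, 2], 20)
    assert tokens == greedy_decode(toy_target, [1, 2], 20)
    print(f"20 tokens in {passes} verify passes instead of 20")
```

The output is identical to plain greedy decoding; only the number of main-model passes drops. That is the whole speedup, and it lives entirely on the generation side.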

When MTP actually matters: the latency math

Using the measured prefill above and generation from the last post (~75 tok/s with MTP vs ~45 without), total latency = time-to-first-token + generation time:

request shape	prefill (TTFT)	generation (MTP / off)	total (MTP / off)	MTP saves
1k context, 200-token answer	0.65 s	2.7 s / 4.4 s	3.3 s / 5.1 s	~35%
64k context, 200-token answer	59 s	2.7 s / 4.4 s	61.7 s / 63.4 s	~3%
64k context, 2,000-token answer	59 s	27 s / 44 s	86 s / 103 s	~17%

So MTP's value is entirely a function of the generation-to-prefill ratio. Short prompt, long answer → MTP shines (a third off total latency). Long prompt, short answer → prefill swallows everything and the 2× barely registers. Same speedup on the decoder; completely different impact on the wall clock.
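The table rows come from one formula, so it is easy to plug in your own request shape. A small sketch, assuming the rounded rates from this post (~75 tok/s with MTP, ~45 without) as defaults; the function names are illustrative:

```python
def total_latency(ttft_s: float, answer_tokens: int, gen_tok_s: float) -> float:
    """Total wall-clock time: prefill (TTFT) plus generation."""
    return ttft_s + answer_tokens / gen_tok_s


def mtp_saving(
    ttft_s: float,
    answer_tokens: int,
    mtp_tok_s: float = 75.0,   # rounded figure from the previous post
    base_tok_s: float = 45.0,  # rounded figure from the previous post
) -> float:
    """Fraction of total latency that MTP removes for one request."""
    with_mtp = total_latency(ttft_s, answer_tokens, mtp_tok_s)
    without = total_latency(ttft_s, answer_tokens, base_tok_s)
    return 1.0 - with_mtp / without


if __name__ == "__main__":
    shapes = [
        ("1k context, 200-token answer", 0.65, 200),
        ("64k context, 200-token answer", 59.0, 200),
        ("64k context, 2,000-token answer", 59.0, 2_000),
    ]
    for label, ttft, n in shapes:
        print(f"{label}: MTP saves {mtp_saving(ttft, n):.0%}")
```

With these rates the saving can never exceed 1 − 45/75 = 40%, and it only approaches that ceiling when prefill is negligible next to generation.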

What this means for RAG

The middle row — long prompt, short answer — is exactly the shape of retrieval-augmented generation: you stuff thousands of tokens of retrieved context in and ask for a short, grounded answer. That's the case where MTP helps least, and it's why a fat-context RAG can feel sluggish even on a setup that benchmarks fast on generation. The thing you actually wait for is the one-time prefill of the context, once per query.

This is directly relevant to my own local paper-RAG: the lever that improves its latency isn't a faster decoder — it's keeping the retrieved context tight (good chunking and reranking so you pass fewer, better tokens), which keeps prefill cheap. A reranker that lets you send 4k of relevant context instead of 40k of marginal context buys more real-world latency than MTP does.
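To put a rough number on the 4k-versus-40k comparison: I did not measure those sizes, so the sketch below linearly interpolates throughput between the measured points. That interpolation is an assumption, and the results are estimates, not benchmarks.

```python
from bisect import bisect_right

# Measured (context tokens, prefill tok/s) pairs from the table above.
MEASURED = [(1_024, 1_575.0), (16_384, 1_432.0), (65_536, 1_111.0), (131_072, 852.0)]


def interp_tok_s(context_tokens: int) -> float:
    """Linearly interpolated prefill throughput (assumption, not a measurement)."""
    xs = [c for c, _ in MEASURED]
    if not xs[0] <= context_tokens <= xs[-1]:
        raise ValueError("context size is outside the measured range")
    i = min(bisect_right(xs, context_tokens), len(xs) - 1)
    (x0, y0), (x1, y1) = MEASURED[i - 1], MEASURED[i]
    return y0 + (y1 - y0) * (context_tokens - x0) / (x1 - x0)


def estimated_ttft(context_tokens: int) -> float:
    """Estimated prompt-processing time in seconds."""
    return context_tokens / interp_tok_s(context_tokens)


if __name__ == "__main__":
    tight, fat = 4_096, 40_960
    saved = estimated_ttft(fat) - estimated_ttft(tight)
    print(f"{tight} tokens: ~{estimated_ttft(tight):.1f} s")
    print(f"{fat} tokens: ~{estimated_ttft(fat):.1f} s")
    print(f"tighter context saves ~{saved:.0f} s per query")
```

For comparison, MTP saves under two seconds on a 200-token answer (4.4 s versus 2.7 s), so trimming the context wins by more than an order of magnitude under these assumptions.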

The 24 GB wall

128k context fit — barely. At 23.6 GB it used 98.5% of the card, leaving ~380 MiB of headroom and nothing for anything else. The model's native context goes higher (~256k), but on a 24 GB 3090 this quant tops out around 128k before the KV cache spills or OOMs. So if you're planning long-context work on a single 3090: ~128k is the practical ceiling, and the prefill at that point is a 2.5-minute wait before the model says a word.
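A back-of-the-envelope check on that ceiling, using only the table's last two rows and the headroom figure above. It assumes the table values are GiB and that VRAM keeps growing at the same per-token rate past 128k; neither is verified, so treat the result as a rough bound.

```python
# Values from the table and the paragraph above.
VRAM_64K_GIB = 19.6     # peak VRAM at 65,536 tokens (assumed to be GiB)
VRAM_128K_GIB = 23.6    # peak VRAM at 131,072 tokens (assumed to be GiB)
HEADROOM_MIB = 380.0    # approximate free memory left at 128k


def kib_per_token(vram_lo_gib: float, vram_hi_gib: float, ctx_lo: int, ctx_hi: int) -> float:
    """Average VRAM growth per extra context token, in KiB."""
    return (vram_hi_gib - vram_lo_gib) * 1024 * 1024 / (ctx_hi - ctx_lo)


def extra_tokens(headroom_mib: float, per_token_kib: float) -> int:
    """How many more tokens fit if growth stays at the same per-token rate."""
    return int(headroom_mib * 1024 / per_token_kib)


if __name__ == "__main__":
    per_token = kib_per_token(VRAM_64K_GIB, VRAM_128K_GIB, 65_536, 131_072)
    more = extra_tokens(HEADROOM_MIB, per_token)
    print(f"~{per_token:.0f} KiB per token, room for ~{more} more tokens")
```

Under those assumptions the remaining headroom buys only a few thousand tokens past 128k, which matches the "practical ceiling" reading.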

Honest caveats

Single RTX 3090, single request, Qwen3.6-27B IQ4_XS. Batching / concurrency is a different story and changes the prefill economics (chunked prefill, prefix caching, etc.).
The generation figures (~75 / ~45 tok/s) carry the run-to-run variance from the last post (MTP CV ~5–7%), so the latency-math rows are illustrative round numbers, not claimed to ±0.1 s. The pattern — MTP's share collapsing as context grows — is the robust part.
Prefill numbers themselves are tight (CV < 0.5%).
"Time to first token" here is pure prompt processing; real TTFT also includes a little sampling and setup overhead.

Reproduce it

RTX 3090 24 GB (sm86), llama.cpp commit e3471b3, Qwen3.6-27B IQ4_XS (bartowski).
Prefill sweep: llama-bench -m <iq4xs.gguf> -p 1024,16384,65536,131072 -n 0 -ngl 99 -fa 1 -r 2
Time-to-first-token = context_size ÷ prefill_tok_s.

Wrap-up

MTP is still the best single lever for generation speed on this card — but "generation speed" and "how long until I see an answer" are different questions, and at long context they diverge hard. If your workload is long-context RAG, the number that owns your latency is prefill, and no amount of speculative decoding will move it. The cheapest win there isn't a faster decoder; it's sending fewer, better tokens. Thanks to the reader who asked the question that made me measure it.
