My MTP post showed multi-token prediction (MTP) roughly doubling Qwen3.6-27B's generation speed on an RTX 3090. A reader asked the question I'd skipped: what about prompt processing at long context? So I measured it, and that turns out to be the real wall, the one MTP can't climb.
TL;DR
On a single RTX 3090, prefill (prompt processing) throughput for Qwen3.6-27B drops from ~1,575 tok/s at 1k context to ~852 tok/s at 128k, so a 64k-token prompt takes ~59 seconds before the first token appears, and a 128k prompt takes ~2.5 minutes. MTP speeds up the decode phase, not prefill, so on a long-context, short-answer request (the typical RAG shape) its generation speedup cuts total latency by only ~3%. MTP is real; it just stops mattering exactly where long-context RAG lives.
Prefill vs context size
llama-bench, Qwen3.6-27B IQ4_XS, prefill only (-n 0), flash-attention on, single RTX 3090:
| context | prefill tok/s | time to first token | peak VRAM | fits 24 GB? |
|---|---|---|---|---|
| 1,024 | 1,575 | 0.65 s | 16.0 GB | yes |
| 16,384 | 1,432 | 11.4 s | 16.5 GB | yes |
| 65,536 | 1,111 | 59.0 s | 19.6 GB | yes |
| 131,072 | 852 | 153.8 s | 23.6 GB | barely (98.5%) |
Prefill throughput falls ~46% from 1k to 128k as the attention cost grows with sequence length, so time-to-first-token climbs faster than linearly with prompt size: 128× the tokens costs about 237× the wait. These numbers are rock-stable (CV < 0.5%); prefill is compute-bound, and it shows none of the run-to-run noise of the MTP generation numbers from last time.
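The time-to-first-token column is just context size divided by prefill throughput. A minimal Python sketch of that arithmetic, using only the numbers from the table (the function and variable names are mine, chosen for illustration):

```python
# Measured prefill throughput (tok/s) by context size, copied from the table above.
PREFILL_TOK_S = {1024: 1575, 16384: 1432, 65536: 1111, 131072: 852}


def ttft_seconds(context_tokens: int, prefill_tok_s: float) -> float:
    """Prompt-processing time only: tokens divided by prefill throughput."""
    return context_tokens / prefill_tok_s


def throughput_drop(baseline_tok_s: float, current_tok_s: float) -> float:
    """Fractional loss in throughput relative to the baseline."""
    return 1 - current_tok_s / baseline_tok_s


for ctx, rate in PREFILL_TOK_S.items():
    print(f"{ctx:>7} tokens: {ttft_seconds(ctx, rate):7.2f} s to first token")

drop = throughput_drop(PREFILL_TOK_S[1024], PREFILL_TOK_S[131072])
print(f"1k -> 128k throughput drop: {drop:.0%}")
```

Swapping in your own llama-bench figures for `PREFILL_TOK_S` gives the same table for a different card or quant.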
MTP speeds decode, not prefill
MTP is a form of speculative decoding, and it works during generation: a draft proposes several tokens ahead and the main model verifies them in one pass. Prefill is a different phase: a single forward pass over the whole prompt to build the KV cache, before any token is generated. MTP doesn't touch that pass, so it can't reduce time-to-first-token. What it reduces is the per-token cost of everything after the first token.
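To make the draft-and-verify loop concrete, here is a toy sketch of greedy speculative decoding. Everything in it is a stand-in: `target_next` and `draft_next` are hypothetical arithmetic functions rather than real models, and this is not llama.cpp's MTP implementation. A real engine checks all drafted tokens in one batched forward pass, which the inner loop below only imitates.

```python
from typing import List, Tuple


def target_next(seq: List[int]) -> int:
    """Hypothetical 'main model': a deterministic next token for a sequence."""
    return (31 * sum(seq) + len(seq)) % 50


def draft_next(seq: List[int]) -> int:
    """Hypothetical 'draft': agrees with the target except at some positions."""
    guess = target_next(seq)
    return (guess + 1) % 50 if len(seq) % 4 == 3 else guess


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the target agrees with, repeat."""
    seq = list(prompt)
    verify_passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1) The draft proposes k tokens ahead, feeding on its own guesses.
        proposed: List[int] = []
        for _ in range(k):
            proposed.append(draft_next(seq + proposed))
        # 2) The target verifies them (one batched pass in a real engine).
        verify_passes += 1
        accepted: List[int] = []
        for tok in proposed:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: take the target's token and stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the verify pass yields one bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], verify_passes


tokens, passes = speculative_decode([1, 2, 3], n_new=20)
print(tokens == greedy_decode([1, 2, 3], 20), passes)
```

Note that both decoders start from a prompt that has already been processed. The loop only changes how many target passes the generated tokens cost; nothing in it shortens the work done on the prompt itself.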
That's not the same as "MTP doesn't help long context." If you generate a lot of tokens, MTP still cuts the generation portion. The honest question is: how big is the generation portion relative to prefill?
When MTP actually matters: the latency math
Using the measured prefill above and generation from the last post (~75 tok/s with MTP vs ~45 without), total latency = time-to-first-token + generation time:
| request shape | prefill (TTFT) | generation (MTP / off) | total (MTP / off) | MTP saves |
|---|---|---|---|---|
| 1k context, 200-token answer | 0.65 s | 2.7 s / 4.4 s | 3.3 s / 5.1 s | ~35% |
| 64k context, 200-token answer | 59 s | 2.7 s / 4.4 s | 61.7 s / 63.4 s | ~3% |
| 64k context, 2,000-token answer | 59 s | 27 s / 44 s | 86 s / 103 s | ~17% |
So MTP's value is entirely a function of the generation-to-prefill ratio. Short prompt, long answer → MTP shines (about a third off total latency). Long prompt, short answer → prefill swallows everything and the decode speedup (75 vs 45 tok/s, about 1.7×) barely registers. Same speedup on the decoder; completely different impact on the wall clock.
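The rows above can be reproduced with a few lines of Python. This is a sketch under the same simplifications as the table: total latency is prefill time plus tokens divided by a flat generation rate, with the ~75 and ~45 tok/s figures as defaults. The function names are mine.

```python
def total_latency(ttft_s: float, n_gen: int, gen_tok_s: float) -> float:
    """Time to first token plus time to generate n_gen tokens at a flat rate."""
    return ttft_s + n_gen / gen_tok_s


def mtp_saving(ttft_s: float, n_gen: int,
               mtp_tok_s: float = 75.0, base_tok_s: float = 45.0) -> float:
    """Fraction of total latency removed by switching MTP on."""
    with_mtp = total_latency(ttft_s, n_gen, mtp_tok_s)
    without = total_latency(ttft_s, n_gen, base_tok_s)
    return (without - with_mtp) / without


def max_saving(mtp_tok_s: float = 75.0, base_tok_s: float = 45.0) -> float:
    """Ceiling on the saving: the limit as generation time dwarfs prefill."""
    return 1 - base_tok_s / mtp_tok_s


# (prefill seconds, answer tokens) for the three request shapes in the table.
for ttft, n_gen in [(0.65, 200), (59.0, 200), (59.0, 2000)]:
    print(f"TTFT {ttft:5.2f} s, {n_gen:4d} tokens: MTP saves {mtp_saving(ttft, n_gen):.0%}")

print(f"ceiling with zero prefill: {max_saving():.0%}")
```

With these rates the saving can never exceed 40% of total latency, even with no prefill at all, and every second of prefill pulls it further below that ceiling.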
What this means for RAG
The middle row — long prompt, short answer — is exactly the shape of retrieval-augmented generation: you stuff thousands of tokens of retrieved context in and ask for a short, grounded answer. That's the case where MTP helps least, and it's why a fat-context RAG can feel sluggish even on a setup that benchmarks fast on generation. The thing you actually wait for is the one-time prefill of the context, once per query.
This is directly relevant to my own local paper-RAG: the lever that improves its latency isn't a faster decoder — it's keeping the retrieved context tight (good chunking and reranking so you pass fewer, better tokens), which keeps prefill cheap. A reranker that lets you send 4k of relevant context instead of 40k of marginal context buys more real-world latency than MTP does.
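For a rough sense of scale on the 4k-versus-40k comparison, the sketch below estimates prefill time at those sizes. Neither size was measured here, so it linearly interpolates throughput between the measured points in the prefill table. That interpolation is my assumption, and so is treating 4k and 40k as 4,096 and 40,960 tokens.

```python
# Measured (context tokens, prefill tok/s) pairs from the prefill table.
MEASURED = [(1024, 1575.0), (16384, 1432.0), (65536, 1111.0), (131072, 852.0)]


def interp_prefill_rate(context_tokens: int) -> float:
    """Assumption: throughput varies linearly between neighbouring measurements."""
    if not MEASURED[0][0] <= context_tokens <= MEASURED[-1][0]:
        raise ValueError("context is outside the measured range")
    for (c0, r0), (c1, r1) in zip(MEASURED, MEASURED[1:]):
        if c0 <= context_tokens <= c1:
            frac = (context_tokens - c0) / (c1 - c0)
            return r0 + frac * (r1 - r0)
    raise AssertionError("unreachable: range was checked above")


def est_ttft(context_tokens: int) -> float:
    """Estimated prompt-processing time in seconds."""
    return context_tokens / interp_prefill_rate(context_tokens)


tight, fat = est_ttft(4096), est_ttft(40960)
print(f"4k context:  ~{tight:.1f} s prefill")
print(f"40k context: ~{fat:.1f} s prefill")
print(f"trimming the context saves ~{fat - tight:.1f} s per query")
```

Under those assumptions the estimates come out at about 2.6 s and 32 s, so trimming the context saves on the order of 30 seconds per query, against the ~1.7 s MTP saves on a 200-token answer.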
The 24 GB wall
128k context fit — barely. At 23.6 GB it used 98.5% of the card, leaving ~380 MiB of headroom and nothing for anything else. The model's native context goes higher (~256k), but on a 24 GB 3090 this quant tops out around 128k before the KV cache spills or OOMs. So if you're planning long-context work on a single 3090: ~128k is the practical ceiling, and the prefill at that point is a 2.5-minute wait before the model says a word.
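As a back-of-envelope check on that ceiling, the VRAM column can be extrapolated from its last two rows. This is only a sketch: it assumes memory keeps growing linearly with context, uses the table's GB figures as-is against a nominal 24 GB, and ignores compute buffers and fragmentation, so treat the result as an order-of-magnitude estimate rather than a limit you can count on.

```python
# (context tokens, peak VRAM in GB) from the last two rows of the prefill table.
VRAM_POINTS = [(65536, 19.6), (131072, 23.6)]
CARD_GB = 24.0  # nominal capacity; usable memory is lower in practice


def gb_per_token(points) -> float:
    """Slope of peak VRAM against context length between two measurements."""
    (c0, v0), (c1, v1) = points
    return (v1 - v0) / (c1 - c0)


def est_max_context(points, card_gb: float) -> float:
    """Context length at which the linear trend would hit the card's capacity."""
    last_ctx, last_gb = points[-1]
    return last_ctx + (card_gb - last_gb) / gb_per_token(points)


slope = gb_per_token(VRAM_POINTS)
print(f"~{slope * 1e6:.0f} KB of VRAM per extra token of context")
print(f"linear trend reaches {CARD_GB:.0f} GB near {est_max_context(VRAM_POINTS, CARD_GB):,.0f} tokens")
```

The trend leaves room for only a few thousand tokens beyond 128k, which fits the observation that this quant tops out around there.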
Honest caveats
- Single RTX 3090, single request, Qwen3.6-27B IQ4_XS. Batching / concurrency is a different story and changes the prefill economics (chunked prefill, prefix caching, etc.).
- The generation figures (~75 / ~45 tok/s) carry the run-to-run variance from the last post (MTP CV ~5–7%), so the latency-math rows are illustrative round numbers, not claimed to ±0.1 s. The pattern — MTP's share collapsing as context grows — is the robust part.
- Prefill numbers themselves are tight (CV < 0.5%).
- "Time to first token" here is pure prompt processing; real TTFT also includes a little sampling and setup overhead.
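For readers who haven't met it, CV (coefficient of variation) is the standard deviation of repeated runs divided by their mean. A minimal sketch follows; the sample values are made up purely to illustrate the calculation and are not measurements from this post.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean (needs at least 2 runs)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return stdev(samples) / mean(samples)


# Made-up throughput samples (tok/s), for illustration only.
tight_runs = [1000.0, 1002.0, 998.0, 1001.0]
noisy_runs = [70.0, 78.0, 74.0, 80.0]

print(f"tight runs: CV = {coefficient_of_variation(tight_runs):.2%}")
print(f"noisy runs: CV = {coefficient_of_variation(noisy_runs):.2%}")
```

A CV under 0.5% means repeated runs land within a fraction of a percent of each other, while 5–7% is enough to blur differences of a few tok/s.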
Reproduce it
- RTX 3090 24 GB (sm86), llama.cpp commit `e3471b3`, Qwen3.6-27B IQ4_XS (bartowski).
- Prefill sweep: `llama-bench -m <iq4xs.gguf> -p 1024,16384,65536,131072 -n 0 -ngl 99 -fa 1 -r 2`
- Time-to-first-token = context_size ÷ prefill_tok_s.
Wrap-up
MTP is still the best single lever for generation speed on this card — but "generation speed" and "how long until I see an answer" are different questions, and at long context they diverge hard. If your workload is long-context RAG, the number that owns your latency is prefill, and no amount of speculative decoding will move it. The cheapest win there isn't a faster decoder; it's sending fewer, better tokens. Thanks to the reader who asked the question that made me measure it.