MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent

#llm #machinelearning #performance #gemma

In my MTP post, speculative decoding roughly doubled Qwen3.6-27B generation on a 3090. It's tempting to read that as "turn on MTP, go faster." So I measured it on a different model — Gemma 4 12B QAT — and it's a big win on my 3090. But the same model with the same MTP draft runs slower on an M1 Max. MTP isn't a free switch; it's a hardware-dependent lever.

My 3090 numbers

Gemma 4 12B QAT (UD-Q4_K_XL) + an MTP draft head (Q8_0-MTP, a 0.47 GB nextn head, not a full second model), single RTX 3090, decode tok/s, 3 runs each:

config	mean tok/s	speedup	draft acceptance
baseline (no MTP)	85.9	1.00×	—
MTP `n-max 2`	159.4	1.86×	0.77
MTP `n-max 3`	167.4	1.95×	0.69

A clean ~1.9× on the 3090, and unusually stable — run-to-run CV was under 0.5% (my earlier Qwen3.6-27B MTP runs were far noisier at 5–7%, needing a dozen runs; Gemma here settled in three). Same n-max 3 sweet spot as Qwen, same counterintuitive shape: deeper speculation has lower per-token acceptance (0.69 vs 0.77) but higher throughput, because more tokens land per verify step. Per-category, the win ranged 1.8×–2.2× (RAG and coding best at ~2.2×). The whole thing fit in ~8 GB of VRAM.

The same model, slower on an M1 Max

Here's the part worth the post. A cross-hardware benchmark (another tester's runs, same speed_bench.py harness and --jinja settings, so it's comparable) put Gemma 4 12B QAT+MTP at:

hardware	MTP speedup (n-max 2)
RTX 3090 (mine)	1.86×
RTX 5070 Ti laptop	1.74×
M1 Max (16", 64 GB)	0.87× — slower

Same model, same MTP draft, and the M1 Max actually loses ~13% by turning MTP on.

Why MTP can make you slower

Speculative decoding wins when verifying a batch of drafted tokens is cheap relative to generating them one at a time. The draft head proposes several tokens, the main model checks them in one parallel pass, and accepted drafts give you multiple tokens for about the cost of one verify. That only pays off when you have spare compute and the verify pass is cheap:

Capable CUDA GPU (3090, 5070 Ti): lots of compute headroom, the parallel verify is cheap, the drafts land → 1.7–1.9×.
Apple Silicon (M1 Max), unified memory: running the draft adds compute the architecture doesn't have to spare relative to its memory bandwidth, and that overhead outweighs the parallel-verify gain. Net result: slower than just decoding normally.

So MTP's speedup is a function of the draft-cost-to-verify-cost ratio on your specific hardware, not a property of the technique. Capable CUDA GPU: yes. Apple Silicon: measure first — it can backfire.

Honest caveats

Only the 3090 numbers are mine; the M1 Max / 5070 Ti figures are another tester's runs. Same harness and settings, so the comparison is fair, but it isn't a single controlled rig — read the cross-machine table as directional.
Gemma 4 12B it runs as a thinking model under --jinja (output routes to a reasoning channel); this doesn't affect the decode-tok/s measurement, which comes from the server's own token timings, and all machines used the same --jinja.
MTP throughput can vary run to run; the 3090 numbers were stable (CV < 0.5%), but the M1 Max's 0.87× is close enough to 1.0 that it's worth re-running before treating it as exact — the direction (a net loss) is the point.

Reproduce it (3090)

llama.cpp mainline, commit e3471b3 (already accepts the Gemma MTP draft — no special build), CUDA sm86.
Models from unsloth/gemma-4-12B-it-qat-GGUF: main gemma-4-12B-it-qat-UD-Q4_K_XL.gguf, draft gemma-4-12B-it-Q8_0-MTP.gguf.
Server: llama-server -m <main> --model-draft <draft> --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -ub 512 -c 16384 -ngl 99 -fa on --jinja
Client: speed_bench.py --bench qualitative --category all --osl 1024 --concurrency 1, compared with speed_bench_compare.py.

Takeaway

MTP is one of the best generation-speed levers on a capable CUDA GPU — ~1.9× here on a 3090, for free quality-wise (the verify keeps the output exact). But "speculative decoding makes you faster" is a hardware claim, not a universal one. On Apple Silicon it can be pure overhead. Measure it on your own box before you commit to it.