Last week, I spent two days banging my head against a wall. I had just spun up a fresh llama.cpp build with multi-token prediction (MTP) support, loaded a quantized Qwen3 model, and ran my benchmark suite expecting that sweet 2-3x speedup everyone keeps talking about.
The result? Roughly the same tokens per second. Sometimes slower. After a lot of profiling, I figured out what was happening — and it turns out the issue is more common than the celebratory benchmark posts suggest.
This post is for anyone who's enabled MTP, expected a speedup, and got nothing.
What MTP actually does (the short version)
Multi-token prediction is a form of speculative decoding baked into the model itself. Instead of running a separate, smaller draft model to guess the next few tokens, the main model drafts several candidate tokens per forward pass using extra prediction heads. The same model then verifies those drafts with its regular next-token head and accepts or rejects them in one shot.
The theory is simple. If the acceptance rate is high, you get 2-3 tokens per forward pass instead of one, with roughly the same latency per pass. In practice, MTP can make things worse if any one of three things goes wrong.
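To put a number on "the theory", here is a rough back-of-the-envelope helper. It uses the standard speculative-decoding estimate for tokens produced per forward pass, and it assumes every drafted token is accepted independently with the same probability, which real workloads only approximate. The function name and the sample values are made up for illustration, and the estimate ignores any drafting or verification overhead.

```shell
# Expected tokens per forward pass for draft length k and per-token
# acceptance probability p (assumption: acceptances are independent):
#   E = (1 - p^(k+1)) / (1 - p)
expected_tokens() {
    awk -v p="$1" -v k="$2" 'BEGIN {
        # With perfect acceptance every draft lands, plus the one verified token
        if (p >= 1) { printf "%.2f\n", k + 1; exit }
        printf "%.2f\n", (1 - p ^ (k + 1)) / (1 - p)
    }'
}

expected_tokens 0.8 2   # prints 2.44
expected_tokens 0.5 2   # prints 1.75
```

This is the best case. The overheads described in the next section all eat into it.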
The three reasons MTP fails to speed things up
Here are the actual root causes I hit, in order of frequency:
1. Low acceptance rate
This is the big one. MTP only helps if predictions are accepted. If your acceptance rate is below ~60%, you're paying the extra compute cost of generating and verifying drafts without getting enough accepted tokens back to cover it. Wall-clock time goes up.
I see this most often when:
- The prompt is unusual (specific code style, niche domain)
- Temperature is too high (anything above ~0.7 starts hurting)
- The model was quantized aggressively and the MTP head suffered more than the main weights
2. KV cache thrashing
When you generate multiple candidates per step, you churn the KV cache more aggressively. On consumer GPUs with limited VRAM, this can spill into slower memory or cause re-allocation. The forward pass speedup gets eaten by memory stalls.
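To get a feel for how much VRAM the KV cache takes before MTP adds anything on top, a quick estimate helps. This sketch uses the usual size formula for a transformer KV cache (one K and one V tensor per layer). The function name is hypothetical, and the model dimensions in the example are placeholders rather than the real Qwen3 numbers, so substitute the values from your model's metadata.

```shell
# kv_cache_mib N_LAYERS N_KV_HEADS HEAD_DIM CTX BYTES_PER_ELEM
# Prints the approximate KV cache size in MiB.
kv_cache_mib() {
    awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" 'BEGIN {
        # Factor 2: one K tensor and one V tensor per layer
        printf "%d\n", 2 * l * h * d * c * b / (1024 * 1024)
    }'
}

# Placeholder dimensions: 32 layers, 8 KV heads, head dim 128, f16 cache
kv_cache_mib 32 8 128 8192 2    # prints 1024
kv_cache_mib 32 8 128 16384 2   # prints 2048
```

Doubling the context doubles the cache, and that is the headroom MTP's extra candidates have to fit into.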
3. CUDA graph capture failures
This one bit me hard. llama.cpp tries to capture CUDA graphs for the inference loop. If MTP introduces dynamic shapes (variable number of accepted tokens per step), the graph gets re-captured every step. You lose the performance win of graphs entirely, and the per-step overhead actually goes up.
Step-by-step: diagnosing your setup
Here's the order I work through now whenever MTP doesn't seem to help.
Step 1: Measure the actual acceptance rate
llama.cpp surfaces speculation metrics with verbose logging. Build with CUDA support and run with -v:
# Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Run with verbose stats so we can see acceptance numbers
./build/bin/llama-cli \
-m models/qwen3-quantized.gguf \
-p "Write a Python function for binary search" \
--n-predict 256 \
-ngl 99 \
-v 2>&1 | tee run.log
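Then grep the log for the speculation stats. You're looking for the ratio of accepted to drafted tokens (the n_accept count over the draft count). Going by the ~60% rule of thumb above, a ratio below 0.6 means MTP is probably hurting throughput on your workload.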
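If you'd rather not eyeball the log, a small helper can compute the ratio. Treat this as a sketch: it assumes the log contains lines of the form `n_drafted = <count>` and `n_accept = <count>`, and the exact wording of these stats differs between llama.cpp versions and tools, so check your own run.log and adjust the field names. The function name is made up.

```shell
# accept_rate LOGFILE
# Assumed log format (verify against your build):
#   n_drafted = 120
#   n_accept  = 84
accept_rate() {
    awk '
        $1 == "n_drafted" && $2 == "=" { drafted = $3 }
        $1 == "n_accept"  && $2 == "=" { accepted = $3 }
        END {
            if (drafted > 0) printf "%.2f\n", accepted / drafted
            else print "no speculation stats found"
        }
    ' "$1"
}

# Usage: accept_rate run.log
```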
Step 2: Check VRAM headroom
If acceptance is fine but throughput is still bad, you're probably memory-bound. Watch VRAM usage during inference in a separate terminal:
# Poll memory and GPU utilization once per second
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
--format=csv -l 1
If you're sitting at >95% VRAM utilization while running, MTP's extra KV cache pressure is pushing you over the edge. The fix is usually to reduce context length, drop to a more aggressive quant (Q4_K_M instead of Q5_K_M), or shorten the draft window.
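To turn the 95% rule into something you can leave running, here is a small filter, offered as a sketch. It reads `used, total` pairs in MiB from standard input, which is what nvidia-smi emits with `--format=csv,noheader,nounits`. The function name, the TIGHT/ok labels, and the default threshold argument are my own choices.

```shell
# vram_check [LIMIT_PCT]
# Reads lines like "23456, 24576" (MiB used, MiB total) on stdin and
# flags any sample above the limit (default 95).
vram_check() {
    awk -F', *' -v limit="${1:-95}" '{
        pct = 100 * $1 / $2
        printf "%.1f%% %s\n", pct, (pct > limit ? "TIGHT" : "ok")
    }'
}

# Usage while inference is running:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits -l 1 | vram_check
```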
Step 3: Disable CUDA graphs as a control
To check whether graph re-capture is killing you, force graphs off and re-run:
# Disable CUDA graphs to test if they're being re-captured each step
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-cli \
-m models/qwen3-quantized.gguf \
-p "Write a Python function for binary search" \
--n-predict 256 \
-ngl 99
If throughput is roughly the same with graphs disabled, capture isn't your problem. If throughput goes up with this flag set, that's the smoking gun — graphs were being re-captured every step under MTP and the overhead was worse than not using them at all.
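The comparison is easy to misjudge by eye when run-to-run variance is a few percent, so it can help to make the decision rule explicit. A minimal sketch: pass in the tokens-per-second figure from the normal run and from the graphs-disabled run. The function name and the 5% noise band are assumptions, so pick a band that matches the variance you see between identical runs.

```shell
# compare_tps WITH_GRAPHS_TPS WITHOUT_GRAPHS_TPS [NOISE_PCT]
compare_tps() {
    awk -v a="$1" -v b="$2" -v noise="${3:-5}" 'BEGIN {
        # Relative change when graphs are disabled, in percent
        delta = 100 * (b - a) / a
        if (delta > noise)       verdict = "re-capture suspected"
        else if (delta < -noise) verdict = "graphs are helping"
        else                     verdict = "capture is not the problem"
        printf "%+.1f%% %s\n", delta, verdict
    }'
}

compare_tps 40 52   # prints +30.0% re-capture suspected
compare_tps 40 41   # prints +2.5% capture is not the problem
```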
The actual fix
Once you've identified which of the three issues you're hitting, the fix is usually simple:
- Low acceptance: shorten the draft window. Most MTP implementations let you set a draft length of 1-4 tokens. Dropping from 4 to 2 often pushes acceptance above 70% because the model has to commit to fewer guesses in a row.
- VRAM pressure: reduce context length or quantize more aggressively. KV cache size scales linearly with context, so cutting --ctx-size in half buys you real headroom.
- Graph capture churn: pull the latest llama.cpp. The speculation code path changes frequently and padded graph capture has improved a lot recently.
Here's the config that finally worked for me on a quantized Qwen3 model with around 24 GB of VRAM available:
# Final working config — moderate draft length, conservative context
./build/bin/llama-cli \
-m models/qwen3-quantized.gguf \
-p "$PROMPT" \
--n-predict 512 \
--ctx-size 8192 \
--draft-max 2 \
--draft-min 1 \
-ngl 99
That gave me roughly 1.7x throughput over the no-MTP baseline on my workload. Not the magical 3x some posts claim, but a real, repeatable win that I could ship.
Prevention tips
A few things I now do by default whenever I touch MTP:
- Always benchmark with and without MTP. Don't trust that it's helping just because it's enabled. Run both, measure both, save the numbers.
- Pin your llama.cpp version. The MTP code path changes frequently. A config that works today can regress between commits.
- Match quantization to the head carefully. Some MTP heads are sensitive to aggressive quantization. If acceptance rate suddenly tanks after a re-quant, that's usually why.
- Log acceptance rate as a metric, not just throughput. Throughput tells you the symptom; acceptance rate tells you the cause. When you can see both side by side, regressions become obvious.
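The tips above are easier to stick to with a tiny logging helper. This is one possible sketch, not a standard tool: it appends one CSV row per benchmark run with the llama.cpp commit, whether MTP was on, throughput, and acceptance rate. The function name, the column names, and the file name in the usage comment are all made up.

```shell
# log_run CSV_FILE COMMIT MTP(on|off) TOKENS_PER_SEC ACCEPT_RATE
log_run() {
    file=$1
    # Write the header once, when the file is missing or empty
    [ -s "$file" ] || echo "timestamp,commit,mtp,tokens_per_sec,accept_rate" > "$file"
    printf '%s,%s,%s,%s,%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" "$4" "$5" >> "$file"
}

# Usage (run from the llama.cpp checkout so the commit is recorded):
#   log_run bench.csv "$(git rev-parse --short HEAD)" on 41.3 0.72
#   log_run bench.csv "$(git rev-parse --short HEAD)" off 24.1 0
```

With both rows in one file, a regression after pulling a new commit shows up as a change in throughput, in acceptance rate, or in both.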
The honest takeaway is that MTP is a real win when the conditions line up, but it isn't free. If you've enabled it and gotten nothing, you're not doing it wrong — you've just hit one of the failure modes nobody talks about in the benchmark threads. Walk the three steps above and you'll usually find the culprit within an hour.