Everything changed for local Gemma 4 inference on June 7, 2026. PR #23398 by am17an landed in llama.cpp b9549, bringing official Multi-Token Prediction (MTP) support. Users are reporting 140 tok/s on a 12GB RTX 4070 Super and 2x speedups on dual 3090s.
Here's exactly how to set it up.
What is Multi-Token Prediction?
Instead of generating one token at a time, MTP predicts multiple future tokens simultaneously using a lightweight drafter model, then verifies them in a single forward pass.
"I spent my Friday night & Saturday getting my agentic setup to mostly use local models (Gemma 4 31B and Qwen 3.6 35B-A3B) - local models are really good these days."
— Mike Masnick (@masnick.com) on Bluesky
Gemma 4 ships with dedicated assistant drafter models that were co-trained with the base model. Unlike generic speculative decoding, these drafters share activations with the target model, making them unusually efficient.
Prerequisites
Hardware:
- NVIDIA GPU with 12GB+ VRAM (RTX 4070, 3090, etc.) or Apple Silicon Mac
- 16GB+ system RAM
Software:
- llama.cpp b9549 or later
- Gemma 4 model in GGUF format
- Gemma 4 MTP assistant drafter in GGUF format
- Hugging Face CLI
Step 1: Build llama.cpp with MTP
MTP support is in mainline as of b9549. Build from source:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --target llama-server --parallel
For Apple Silicon, replace -DGGML_CUDA=ON with -DGGML_METAL=ON.
Step 2: Download the Models
You need both the base model and the MTP assistant drafter:
# Gemma 4 12B (fits 12GB VRAM at Q4)
huggingface-cli download google/gemma-4-12B-it-GGUF --include "*Q4*" --local-dir ~/models/gemma-4
huggingface-cli download google/gemma-4-12B-it-assistant-GGUF --include "*Q8*" --local-dir ~/models/gemma-4
Pro tip: Use the QAT (Quantization-Aware Training) checkpoints. Google designed them specifically to preserve MTP speedup even when quantized.
For the 31B dense model (needs 24GB+ VRAM):
huggingface-cli download google/gemma-4-31B-it-GGUF --include "*Q4*" --local-dir ~/models/gemma-4
huggingface-cli download google/gemma-4-31B-it-assistant-GGUF --include "*Q8*" --local-dir ~/models/gemma-4
Step 3: Run with MTP
Launch llama-server with both models:
llama-server \
-m ~/models/gemma-4/gemma-4-12B-it-Q4_K_M.gguf \
--model-draft ~/models/gemma-4/gemma-4-12B-it-assistant-Q8_0.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 4 \
--flash-attn on \
--ctx-size 32768 \
--host 0.0.0.0 \
--port 8080
Key flags explained:
-
--spec-type draft-mtp— Use Gemma 4's native MTP pipeline -
--spec-draft-n-max 4— How many tokens to speculate ahead (4 is the sweet spot) -
--model-draft— Path to the assistant drafter GGUF
Performance Benchmarks
Real community results from the first weekend with the merge:
| Configuration | Without MTP | With MTP | Speedup |
|---|---|---|---|
| Gemma 4 12B, RTX 4070 Super (Q4) | ~55 tok/s | ~140 tok/s | 2.5x |
| Gemma 4 31B, RTX PRO 6000 (Q8) | ~40 tok/s | ~83 tok/s | 2.0x |
| Gemma 4 31B, Dual RTX 3090 | ~35 tok/s | ~62 tok/s | 1.8x |
| Gemma 4 26B MoE, RTX 4070 | ~44 tok/s | ~58 tok/s | 1.3x |
| Gemma 4 31B, DGX Spark (Q8) | ~6 tok/s | ~14.3 tok/s | 2.4x |
Important Caveats
1. MoE models see smaller gains
MTP was designed for dense architectures. The 26B MoE variant gets only 1.2-1.3x. Consider EAGLE3 for MoE.
2. Q8 KV cache breaks MTP
There's a known bug where -ctk q8_0 -ctv q8_0 drops the acceptance rate to exactly 0%. Fixed in a later commit. Workaround: use f16 KV cache.
"I can reproduce the 0% acceptance rate when the main model's KV cache is quantized to q8_0. With f16 KV cache, the acceptance rate seems normal."
— Contributor theo77186
3. n=3 sometimes beats n=4
On Q4 quantized models, speculating 3 tokens ahead can outperform 4. The marginal gain from the fourth token doesn't justify the overhead.
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| 0% acceptance rate | Q8 KV cache bug | Remove -ctk q8_0 -ctv q8_0
|
| No speedup on MoE | Architectural mismatch | Use EAGLE3 or traditional speculative decoding |
| Out of memory | Drafter adds VRAM overhead | Drop to lower quantization |
| Vision not speeding up | Drafter overhead on 12GB cards | Fine on high-end hardware |
What's Next
The llama.cpp team is working on EAGLE3 integration for better MoE support. If you're running Gemma 4 on consumer hardware right now, MTP is the single biggest performance upgrade available.
This guide was originally published on everylocalai.com, where we track the best ways to run AI models on your own hardware.
Top comments (0)