EveryLocalAI

Posted on Jun 10

How to Get 2x Speed on Gemma 4 with Multi-Token Prediction in llama.cpp

#machinelearning

Everything changed for local Gemma 4 inference on June 7, 2026. PR #23398 by am17an landed in llama.cpp b9549, bringing official Multi-Token Prediction (MTP) support. Users are reporting 140 tok/s on a 12GB RTX 4070 Super and 2x speedups on dual 3090s.

Here's exactly how to set it up.

What is Multi-Token Prediction?

Instead of generating one token at a time, MTP predicts multiple future tokens simultaneously using a lightweight drafter model, then verifies them in a single forward pass.

"I spent my Friday night & Saturday getting my agentic setup to mostly use local models (Gemma 4 31B and Qwen 3.6 35B-A3B) - local models are really good these days."
— Mike Masnick (@masnick.com) on Bluesky

Gemma 4 ships with dedicated assistant drafter models that were co-trained with the base model. Unlike generic speculative decoding, these drafters share activations with the target model, making them unusually efficient.

Prerequisites

Hardware:

NVIDIA GPU with 12GB+ VRAM (RTX 4070, 3090, etc.) or Apple Silicon Mac
16GB+ system RAM

Software:

llama.cpp b9549 or later
Gemma 4 model in GGUF format
Gemma 4 MTP assistant drafter in GGUF format
Hugging Face CLI

Step 1: Build llama.cpp with MTP

MTP support is in mainline as of b9549. Build from source:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --target llama-server --parallel

For Apple Silicon, replace -DGGML_CUDA=ON with -DGGML_METAL=ON.

Step 2: Download the Models

You need both the base model and the MTP assistant drafter:

# Gemma 4 12B (fits 12GB VRAM at Q4)
huggingface-cli download google/gemma-4-12B-it-GGUF --include "*Q4*" --local-dir ~/models/gemma-4

huggingface-cli download google/gemma-4-12B-it-assistant-GGUF --include "*Q8*" --local-dir ~/models/gemma-4

Pro tip: Use the QAT (Quantization-Aware Training) checkpoints. Google designed them specifically to preserve MTP speedup even when quantized.

For the 31B dense model (needs 24GB+ VRAM):

huggingface-cli download google/gemma-4-31B-it-GGUF --include "*Q4*" --local-dir ~/models/gemma-4
huggingface-cli download google/gemma-4-31B-it-assistant-GGUF --include "*Q8*" --local-dir ~/models/gemma-4

Step 3: Run with MTP

Launch llama-server with both models:

llama-server \
  -m ~/models/gemma-4/gemma-4-12B-it-Q4_K_M.gguf \
  --model-draft ~/models/gemma-4/gemma-4-12B-it-assistant-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --flash-attn on \
  --ctx-size 32768 \
  --host 0.0.0.0 \
  --port 8080

Key flags explained:

--spec-type draft-mtp — Use Gemma 4's native MTP pipeline
--spec-draft-n-max 4 — How many tokens to speculate ahead (4 is the sweet spot)
--model-draft — Path to the assistant drafter GGUF

Performance Benchmarks

Real community results from the first weekend with the merge:

Configuration	Without MTP	With MTP	Speedup
Gemma 4 12B, RTX 4070 Super (Q4)	~55 tok/s	~140 tok/s	2.5x
Gemma 4 31B, RTX PRO 6000 (Q8)	~40 tok/s	~83 tok/s	2.0x
Gemma 4 31B, Dual RTX 3090	~35 tok/s	~62 tok/s	1.8x
Gemma 4 26B MoE, RTX 4070	~44 tok/s	~58 tok/s	1.3x
Gemma 4 31B, DGX Spark (Q8)	~6 tok/s	~14.3 tok/s	2.4x

Important Caveats

1. MoE models see smaller gains

MTP was designed for dense architectures. The 26B MoE variant gets only 1.2-1.3x. Consider EAGLE3 for MoE.

2. Q8 KV cache breaks MTP

There's a known bug where -ctk q8_0 -ctv q8_0 drops the acceptance rate to exactly 0%. Fixed in a later commit. Workaround: use f16 KV cache.

"I can reproduce the 0% acceptance rate when the main model's KV cache is quantized to q8_0. With f16 KV cache, the acceptance rate seems normal."
— Contributor theo77186

3. n=3 sometimes beats n=4

On Q4 quantized models, speculating 3 tokens ahead can outperform 4. The marginal gain from the fourth token doesn't justify the overhead.

Troubleshooting

Problem	Cause	Fix
0% acceptance rate	Q8 KV cache bug	Remove `-ctk q8_0 -ctv q8_0`
No speedup on MoE	Architectural mismatch	Use EAGLE3 or traditional speculative decoding
Out of memory	Drafter adds VRAM overhead	Drop to lower quantization
Vision not speeding up	Drafter overhead on 12GB cards	Fine on high-end hardware

What's Next

The llama.cpp team is working on EAGLE3 integration for better MoE support. If you're running Gemma 4 on consumer hardware right now, MTP is the single biggest performance upgrade available.

This guide was originally published on everylocalai.com, where we track the best ways to run AI models on your own hardware.

DEV Community