<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: EveryLocalAI</title>
    <description>The latest articles on DEV Community by EveryLocalAI (@everylocalai).</description>
    <link>https://dev.to/everylocalai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3976359%2F71d4622f-e778-4e5d-9bb3-63d38db82c18.png</url>
      <title>DEV Community: EveryLocalAI</title>
      <link>https://dev.to/everylocalai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/everylocalai"/>
    <language>en</language>
    <item>
      <title>How to Get 2x Speed on Gemma 4 with Multi-Token Prediction in llama.cpp</title>
      <dc:creator>EveryLocalAI</dc:creator>
      <pubDate>Wed, 10 Jun 2026 06:42:00 +0000</pubDate>
      <link>https://dev.to/everylocalai/how-to-get-2x-speed-on-gemma-4-with-multi-token-prediction-in-llamacpp-1b8e</link>
      <guid>https://dev.to/everylocalai/how-to-get-2x-speed-on-gemma-4-with-multi-token-prediction-in-llamacpp-1b8e</guid>
      <description>&lt;p&gt;Everything changed for local Gemma 4 inference on June 7, 2026. PR #23398 by am17an landed in llama.cpp b9549, bringing official Multi-Token Prediction (MTP) support. Users are reporting &lt;strong&gt;140 tok/s on a 12GB RTX 4070 Super&lt;/strong&gt; and &lt;strong&gt;2x speedups on dual 3090s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's exactly how to set it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Multi-Token Prediction?
&lt;/h2&gt;

&lt;p&gt;Instead of generating one token at a time, MTP predicts multiple future tokens simultaneously using a lightweight drafter model, then verifies them in a single forward pass.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I spent my Friday night &amp;amp; Saturday getting my agentic setup to mostly use local models (Gemma 4 31B and Qwen 3.6 35B-A3B) - local models are really good these days."&lt;br&gt;
— Mike Masnick (@masnick.com) on Bluesky&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Gemma 4 ships with dedicated assistant drafter models that were co-trained with the base model. Unlike generic speculative decoding, these drafters share activations with the target model, making them unusually efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA GPU with 12GB+ VRAM (RTX 4070, 3090, etc.) or Apple Silicon Mac&lt;/li&gt;
&lt;li&gt;16GB+ system RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp b9549 or later&lt;/li&gt;
&lt;li&gt;Gemma 4 model in GGUF format&lt;/li&gt;
&lt;li&gt;Gemma 4 MTP assistant drafter in GGUF format&lt;/li&gt;
&lt;li&gt;Hugging Face CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Build llama.cpp with MTP
&lt;/h2&gt;

&lt;p&gt;MTP support is in mainline as of b9549. Build from source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
&lt;span class="nb"&gt;mkdir &lt;/span&gt;build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build
cmake .. &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt; llama-server &lt;span class="nt"&gt;--parallel&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Apple Silicon, replace &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt; with &lt;code&gt;-DGGML_METAL=ON&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Download the Models
&lt;/h2&gt;

&lt;p&gt;You need both the base model and the MTP assistant drafter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Gemma 4 12B (fits 12GB VRAM at Q4)&lt;/span&gt;
huggingface-cli download google/gemma-4-12B-it-GGUF &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*Q4*"&lt;/span&gt; &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ~/models/gemma-4

huggingface-cli download google/gemma-4-12B-it-assistant-GGUF &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*Q8*"&lt;/span&gt; &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ~/models/gemma-4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Use the QAT (Quantization-Aware Training) checkpoints. Google designed them specifically to preserve MTP speedup even when quantized.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For the 31B dense model (needs 24GB+ VRAM):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;huggingface-cli download google/gemma-4-31B-it-GGUF &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*Q4*"&lt;/span&gt; &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ~/models/gemma-4
huggingface-cli download google/gemma-4-31B-it-assistant-GGUF &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*Q8*"&lt;/span&gt; &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ~/models/gemma-4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Run with MTP
&lt;/h2&gt;

&lt;p&gt;Launch llama-server with both models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; ~/models/gemma-4/gemma-4-12B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-draft&lt;/span&gt; ~/models/gemma-4/gemma-4-12B-it-assistant-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--spec-type&lt;/span&gt; draft-mtp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--spec-draft-n-max&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key flags explained:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--spec-type draft-mtp&lt;/code&gt; — Use Gemma 4's native MTP pipeline&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--spec-draft-n-max 4&lt;/code&gt; — How many tokens to speculate ahead (4 is the sweet spot)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--model-draft&lt;/code&gt; — Path to the assistant drafter GGUF&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Benchmarks
&lt;/h2&gt;

&lt;p&gt;Real community results from the first weekend with the merge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Without MTP&lt;/th&gt;
&lt;th&gt;With MTP&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 12B, RTX 4070 Super (Q4)&lt;/td&gt;
&lt;td&gt;~55 tok/s&lt;/td&gt;
&lt;td&gt;~140 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B, RTX PRO 6000 (Q8)&lt;/td&gt;
&lt;td&gt;~40 tok/s&lt;/td&gt;
&lt;td&gt;~83 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B, Dual RTX 3090&lt;/td&gt;
&lt;td&gt;~35 tok/s&lt;/td&gt;
&lt;td&gt;~62 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.8x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B MoE, RTX 4070&lt;/td&gt;
&lt;td&gt;~44 tok/s&lt;/td&gt;
&lt;td&gt;~58 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B, DGX Spark (Q8)&lt;/td&gt;
&lt;td&gt;~6 tok/s&lt;/td&gt;
&lt;td&gt;~14.3 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.4x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Important Caveats
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. MoE models see smaller gains
&lt;/h3&gt;

&lt;p&gt;MTP was designed for dense architectures. The 26B MoE variant gets only 1.2-1.3x. Consider EAGLE3 for MoE.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Q8 KV cache breaks MTP
&lt;/h3&gt;

&lt;p&gt;There's a known bug where &lt;code&gt;-ctk q8_0 -ctv q8_0&lt;/code&gt; drops the acceptance rate to &lt;strong&gt;exactly 0%&lt;/strong&gt;. Fixed in a later commit. Workaround: use f16 KV cache.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I can reproduce the 0% acceptance rate when the main model's KV cache is quantized to q8_0. With f16 KV cache, the acceptance rate seems normal."&lt;br&gt;
— Contributor theo77186&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. n=3 sometimes beats n=4
&lt;/h3&gt;

&lt;p&gt;On Q4 quantized models, speculating 3 tokens ahead can outperform 4. The marginal gain from the fourth token doesn't justify the overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0% acceptance rate&lt;/td&gt;
&lt;td&gt;Q8 KV cache bug&lt;/td&gt;
&lt;td&gt;Remove &lt;code&gt;-ctk q8_0 -ctv q8_0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No speedup on MoE&lt;/td&gt;
&lt;td&gt;Architectural mismatch&lt;/td&gt;
&lt;td&gt;Use EAGLE3 or traditional speculative decoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out of memory&lt;/td&gt;
&lt;td&gt;Drafter adds VRAM overhead&lt;/td&gt;
&lt;td&gt;Drop to lower quantization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision not speeding up&lt;/td&gt;
&lt;td&gt;Drafter overhead on 12GB cards&lt;/td&gt;
&lt;td&gt;Fine on high-end hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The llama.cpp team is working on EAGLE3 integration for better MoE support. If you're running Gemma 4 on consumer hardware right now, MTP is the single biggest performance upgrade available.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This guide was originally published on &lt;a href="https://everylocalai.com" rel="noopener noreferrer"&gt;everylocalai.com&lt;/a&gt;, where we track the best ways to run AI models on your own hardware.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
