<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Martin Andrews</title>
    <description>The latest articles on DEV Community by Martin Andrews (@mdda).</description>
    <link>https://dev.to/mdda</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3735726%2Fb2dda2c7-8429-4442-9244-4a60118e26a1.jpg</url>
      <title>DEV Community: Martin Andrews</title>
      <link>https://dev.to/mdda</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mdda"/>
    <language>en</language>
    <item>
      <title>Running Gemma 4 26B on an Old GTX 1080 with llama.cpp</title>
      <dc:creator>Martin Andrews</dc:creator>
      <pubDate>Sun, 24 May 2026 19:36:21 +0000</pubDate>
      <link>https://dev.to/mdda/running-gemma-4-26b-on-an-old-gtx-1080-with-llamacpp-4fi5</link>
      <guid>https://dev.to/mdda/running-gemma-4-26b-on-an-old-gtx-1080-with-llamacpp-4fi5</guid>
      <description>&lt;p&gt;&lt;em&gt;How to get Google's Gemma 4 26B-A4B Mixture-of-Experts model running locally — including speculative decoding — on hardware that has no business running it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Google's &lt;strong&gt;Gemma 4 26B-A4B&lt;/strong&gt; is a Mixture-of-Experts model: 25.2 billion total parameters, but only 3.8 billion are active per token. That distinction matters enormously for running it locally, because it means you can keep the cold expert weights in system RAM and stream them over PCIe, while a much smaller working set lives on the GPU.&lt;/p&gt;

&lt;p&gt;This post walks through getting Gemma 4 running on a GeForce GTX 1080 — a 2016-vintage card with 8 GiB of VRAM — on Fedora 42, achieving &lt;strong&gt;~24.5 tokens/second&lt;/strong&gt; with 128k context, including fully-GPU-resident speculative decoding via Gemma 4's MTP assistant head.&lt;/p&gt;

&lt;p&gt;For comparison, I also ran the Qwen 3.6 35B-A3B model through the same process. It generated slightly more slowly at the same context length and was much more verbose for the same prompts, so for typical assistant workloads Gemma 4 finishes sooner end-to-end by more than the raw tok/s gap suggests.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardware
&lt;/h2&gt;

&lt;p&gt;The full system spec matters here, because the CPU and RAM are as important as the GPU when streaming MoE weights over PCIe:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Intel i7-6700 (Skylake, 4c/8t, 2015)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;32 GiB system RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA GeForce GTX 1080, 8 GiB VRAM (Pascal, 2016)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fedora 42&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nothing here is new: I bought the GPU second-hand in 2025 for under $200 USD.&lt;/p&gt;

&lt;p&gt;The key bottleneck to understand upfront:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check PCIe link state while the model is generating&lt;/span&gt;
lspci &lt;span class="nt"&gt;-vv&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 01:00.0 | &lt;span class="nb"&gt;grep &lt;/span&gt;LnkSta
&lt;span class="c"&gt;#   LnkSta: Speed 8GT/s, Width x16&lt;/span&gt;
&lt;span class="c"&gt;# i.e. running at PCIe 3.0 maximum&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the same time, &lt;code&gt;nvidia-smi&lt;/code&gt; shows the GPU at roughly 40–50% utilisation. &lt;strong&gt;PCIe maxed out + GPU half-idle = bandwidth-limited, not compute-limited.&lt;/strong&gt; This is the single most important fact for this setup: anything that reduces the volume of weight data crossing the PCIe bus per token helps; just having a faster GPU wouldn't.&lt;/p&gt;
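
&lt;p&gt;As a rough back-of-envelope check on that bottleneck, the sketch below works out the theoretical PCIe 3.0 x16 payload bandwidth from the link state reported above (8 GT/s, 16 lanes, 128b/130b line encoding) and the upper bound on data that could cross the bus per token at a given generation rate. Real transfers achieve less than this ceiling, and the function names are purely illustrative.&lt;/p&gt;

```python
# Theoretical one-direction PCIe bandwidth from the LnkSta values above.
# PCIe 3.0 uses 128b/130b line encoding, so 128 of every 130 bits are payload.

def pcie_bytes_per_second(gt_per_s, lanes, payload_bits=128, line_bits=130):
    bits_per_second = gt_per_s * 1e9 * lanes * payload_bits / line_bits
    return bits_per_second / 8

def max_mib_per_token(bandwidth_bytes_per_s, tokens_per_s):
    # Upper bound on data that could cross the bus per generated token,
    # if the bus were the only limit.
    return bandwidth_bytes_per_s / tokens_per_s / (1024 * 1024)

link = pcie_bytes_per_second(gt_per_s=8, lanes=16)
print(f"PCIe 3.0 x16 ceiling: {link / 1e9:.2f} GB/s")
print(f"At 16 tok/s: at most {max_mib_per_token(link, 16):.0f} MiB per token")
```

&lt;p&gt;The ceiling is a little under 16 GB/s, which is why shaving even a couple of hundred MiB of traffic per token is noticeable on this machine.&lt;/p&gt;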




&lt;h2&gt;
  
  
  Gemma 4 26B-A4B: What You're Working With
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total parameters&lt;/td&gt;
&lt;td&gt;25.2B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active parameters per token&lt;/td&gt;
&lt;td&gt;3.8B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layers&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trained context&lt;/td&gt;
&lt;td&gt;256K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trick: with MoE models, only a few experts activate per token. &lt;code&gt;llama.cpp&lt;/code&gt; exposes this directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--n-cpu-moe N&lt;/code&gt; — keep the MoE weights of the first N layers on the CPU&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--n-gpu-layers 999&lt;/code&gt; — everything else on the GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On this card the sweet spot for 128k context turns out to be &lt;code&gt;--n-cpu-moe 21&lt;/code&gt; (with MTP) or &lt;code&gt;--n-cpu-moe 20&lt;/code&gt; (without).&lt;/p&gt;
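
&lt;p&gt;To make the split concrete, here is a small sketch of which layers' expert weights would stay on the CPU for a given N, taking the flag description above at face value (the first N of the model's 30 layers). The helper name is hypothetical and this is not &lt;code&gt;llama.cpp&lt;/code&gt; code.&lt;/p&gt;

```python
def split_moe_layers(n_layers, n_cpu_moe):
    # Clamp so that values larger than the layer count keep every expert on the CPU.
    n_cpu = max(0, min(n_cpu_moe, n_layers))
    cpu_layers = list(range(n_cpu))
    gpu_layers = list(range(n_cpu, n_layers))
    return cpu_layers, gpu_layers

cpu, gpu = split_moe_layers(n_layers=30, n_cpu_moe=21)
print(f"experts on CPU: layers {cpu[0]}..{cpu[-1]} ({len(cpu)} layers)")
print(f"experts on GPU: layers {gpu[0]}..{gpu[-1]} ({len(gpu)} layers)")
```

&lt;p&gt;Lowering N by one moves one more layer's experts into VRAM, which is why the sweeps later in this post step the value down until the card runs out of memory.&lt;/p&gt;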




&lt;h2&gt;
  
  
  Step 1: Pin the NVIDIA Driver to the 580xx Branch
&lt;/h2&gt;

&lt;p&gt;Pascal (GTX 1080) is approaching legacy status. On Fedora 42 you need to pin &lt;code&gt;akmod-nvidia&lt;/code&gt; to the 580xx branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dnf swap akmod-nvidia akmod-nvidia-580xx &lt;span class="nt"&gt;--allowerasing&lt;/span&gt; &lt;span class="nt"&gt;--releasever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;44
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--releasever=44&lt;/code&gt; is necessary to pull 580xx packaging from the newer repo metadata, even though the running system is Fedora 42.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: CUDA Toolkit and a Working &lt;code&gt;nvcc&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dnf reinstall cuda-nvcc-12-9.x86_64
find / &lt;span class="nt"&gt;-name&lt;/span&gt; nvcc 2&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null
&lt;span class="c"&gt;# /usr/local/cuda-12.9/bin/nvcc&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CUDACXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda-12.9/bin/nvcc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Force &lt;code&gt;gcc-14&lt;/code&gt; for the CUDA Build
&lt;/h2&gt;

&lt;p&gt;CUDA 12.9 doesn't accept the newest gcc that Fedora 42 ships by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dnf &lt;span class="nb"&gt;install &lt;/span&gt;gcc14 gcc14-c++
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The straightforward &lt;code&gt;-DCMAKE_C_COMPILER&lt;/code&gt; CMake flags don't work here — somewhere inside the NVIDIA/CUDA CMake modules, plain &lt;code&gt;gcc&lt;/code&gt; is hard-coded. The least-bad workaround is a symlink early on PATH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.local/bin
&lt;span class="nb"&gt;pushd&lt;/span&gt; ~/.local/bin/
  &lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /usr/bin/gcc-14 gcc
  &lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /usr/bin/g++-14 g++
&lt;span class="nb"&gt;popd&lt;/span&gt;

&lt;span class="c"&gt;# Confirm ~/.local/bin is at the front of PATH&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$PATH&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to remove these symlinks afterwards if you don't want every other build using gcc-14.&lt;/p&gt;
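
&lt;p&gt;One way to double-check which compiler a build would actually pick up is to resolve the name through PATH and follow the symlink. This is a minimal sketch using only the Python standard library; &lt;code&gt;resolve_compiler&lt;/code&gt; is a made-up helper name.&lt;/p&gt;

```python
import os
import shutil

def resolve_compiler(name, path=None):
    # shutil.which walks PATH in order, so the first match is what a build would invoke.
    found = shutil.which(name, path=path)
    if found is None:
        return None
    # Follow symlinks to see the real binary behind the name.
    return os.path.realpath(found)

for tool in ("gcc", "g++"):
    print(tool, "resolves to", resolve_compiler(tool))
```

&lt;p&gt;With the symlinks in place, both names should resolve to the gcc-14 binaries; after removing them, they should resolve to the system default again.&lt;/p&gt;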




&lt;h2&gt;
  
  
  Step 4: Patch CUDA's &lt;code&gt;math_functions.h&lt;/code&gt; for glibc 2.41
&lt;/h2&gt;

&lt;p&gt;CUDA 12.9 headers were written against an older &lt;code&gt;glibc&lt;/code&gt;. On Fedora 42 (&lt;code&gt;glibc 2.41&lt;/code&gt;) some inline definitions collide. Gentoo has a clean patch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/stefantalpalaru/gentoo-overlay/blob/75b7793fdc88314165ca28c75b557defcb38f6d8/dev-util/nvidia-cuda-toolkit/files/nvidia-cuda-toolkit-glibc-2.41-r1.patch" rel="noopener noreferrer"&gt;nvidia-cuda-toolkit-glibc-2.41-r1.patch&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Edit by hand, applying the patch:&lt;/span&gt;
&lt;span class="nv"&gt;$EDITOR&lt;/span&gt; /usr/local/cuda-12.9/targets/x86_64-linux/include/crt/math_functions.h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core change: replace &lt;code&gt;rsqrt(double x);&lt;/code&gt; with &lt;code&gt;rsqrt(double x) noexcept (true);&lt;/code&gt;, and &lt;code&gt;__func__(double rsqrt(double a));&lt;/code&gt; with &lt;code&gt;__func__(double rsqrt(double a)) throw();&lt;/code&gt;, for these six functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;rsqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;sinpi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;cospi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="nf"&gt;rsqrtf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="nf"&gt;sinpif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="nf"&gt;cospif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 5: Choose the Right &lt;code&gt;llama.cpp&lt;/code&gt; Fork
&lt;/h2&gt;

&lt;p&gt;Vanilla &lt;code&gt;llama.cpp&lt;/code&gt; works for most cases, but for Gemma 4 on an 8 GiB card you need two things standard &lt;code&gt;llama.cpp&lt;/code&gt; doesn't have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RotorQuant&lt;/strong&gt; — a Gemma-specific KV-cache quantisation scheme that makes the difference between fitting at 16k context and fitting at 128k context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MTP speculative decoding support&lt;/strong&gt; — for Gemma 4's assistant head&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The right fork is &lt;a href="https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant" rel="noopener noreferrer"&gt;AtomicBot-ai/atomic-llama-cpp-turboquant&lt;/a&gt;, which combines RotorQuant, TurboQuant KV cache, and Gemma 4 MTP support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git
&lt;span class="nb"&gt;cd &lt;/span&gt;atomic-llama-cpp-turboquant/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 6: Build &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CUDACXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda-12.9/bin/nvcc

cmake &lt;span class="nt"&gt;--fresh&lt;/span&gt; &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;native
&lt;span class="c"&gt;# 'native' picks up the compute capability of the installed GPU automatically.&lt;/span&gt;

cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release
&lt;span class="c"&gt;# NB: --parallel seemed to cause problems, so leave it off&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sanity check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ./build/bin

./llama-cli &lt;span class="nt"&gt;--list-devices&lt;/span&gt;
&lt;span class="c"&gt;# ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8107 MiB):&lt;/span&gt;
&lt;span class="c"&gt;#   Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes, VRAM: 8107 MiB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's also worth dumping the full help text — there are a lot of flags and you'll be grepping it constantly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="nt"&gt;--help&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; llama.cpp-man.txt
&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; llama.cpp-man.txt
&lt;span class="c"&gt;# 570 llama.cpp-man.txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 7: Download Gemma 4
&lt;/h2&gt;

&lt;p&gt;You need two GGUFs: the main model and the MTP assistant head.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Via the Hugging Face CLI (hf):&lt;/span&gt;
hf download AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*Q4_K_M.gguf"&lt;/span&gt; &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ./Models

hf download unsloth/gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*Q4_K_M*.gguf"&lt;/span&gt; &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ./Models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or directly via wget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://huggingface.co/AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF/resolve/main/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf
wget https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Move them to &lt;code&gt;~/Models/&lt;/code&gt; for sanity.&lt;/p&gt;

&lt;p&gt;Model cards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/google/gemma-4-26B-A4B-it#mixture-of-experts-moe-model" rel="noopener noreferrer"&gt;google/gemma-4-26B-A4B-it&lt;/a&gt; — main model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/google/gemma-4-26B-A4B-it-assistant" rel="noopener noreferrer"&gt;google/gemma-4-26B-A4B-it-assistant&lt;/a&gt; — MTP speculative decoding head&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF" rel="noopener noreferrer"&gt;AtomicChat GGUFs&lt;/a&gt; — quantised assistant + main&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 8: First Runs (Baseline, No MTP)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ./build/bin

./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe&lt;/span&gt; 29 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-mmap&lt;/span&gt; &lt;span class="nt"&gt;--mlock&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 16384
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A smoke-test query from another terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
       "messages": [
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Please write a program to stream the Fibonacci numbers under 1000 - with the restriction that there should be only one print statement in the loop"}
       ]
     }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
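
&lt;p&gt;The same smoke test can be scripted. The sketch below builds the identical request with the Python standard library; it assumes the server is on &lt;code&gt;localhost:8080&lt;/code&gt; as in the curl call, and that the reply follows the usual OpenAI-style shape with the text under &lt;code&gt;choices[0].message.content&lt;/code&gt;. The function names are hypothetical.&lt;/p&gt;

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080"):
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ]
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, timeout=600):
    # Needs a running llama-server; raises URLError if nothing is listening.
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

&lt;p&gt;Calling &lt;code&gt;ask("...")&lt;/code&gt; with the Fibonacci prompt from the curl example should return the generated text, while the timing lines still appear in the server's own log.&lt;/p&gt;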



&lt;p&gt;Timing from the server log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time =    1035.77 ms /    52 tokens (   19.92 ms/tok,    50.20 tok/s)
       eval time =   67336.21 ms /  1076 tokens (   62.58 ms/tok,    15.98 tok/s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
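
&lt;p&gt;Since the rest of this post compares these timing lines run after run, it can help to pull the numbers out programmatically. A minimal sketch, assuming the log format stays exactly as shown above; it recomputes tok/s from the raw milliseconds and token counts rather than trusting the printed figure.&lt;/p&gt;

```python
import re

# Matches lines such as:
#   eval time =   67336.21 ms /  1076 tokens (   62.58 ms/tok,    15.98 tok/s)
TIMING = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")

def parse_timings(log_text):
    results = {}
    for kind, ms, tokens in TIMING.findall(log_text):
        seconds = float(ms) / 1000.0
        results[kind] = {"tokens": int(tokens), "tok_per_s": int(tokens) / seconds}
    return results

log = """
prompt eval time =    1035.77 ms /    52 tokens (   19.92 ms/tok,    50.20 tok/s)
       eval time =   67336.21 ms /  1076 tokens (   62.58 ms/tok,    15.98 tok/s)
"""
for kind, stats in parse_timings(log).items():
    print(f"{kind}: {stats['tok_per_s']:.2f} tok/s over {stats['tokens']} tokens")
```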



&lt;p&gt;About 16 tok/s. Pushing to 128k context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe&lt;/span&gt; 29 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-mmap&lt;/span&gt; &lt;span class="nt"&gt;--mlock&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 128000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time =     985.96 ms /    52 tokens (   18.96 ms/tok,    52.74 tok/s)
       eval time =   98309.88 ms /  1538 tokens (   63.92 ms/tok,    15.64 tok/s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The memory breakdown at startup is informative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (GTX 1080)   |  8107 = 4632 + ( 3299 =  2103 +     664 +     532) +         174 |
llama_memory_breakdown_print: |   - Host               |                 14747 = 14477 +       0 +     270                |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 3.2 GiB (3299 MiB) is on the GPU and about 14.4 GiB (14747 MiB) on the host. That leaves about 4.5 GiB (4632 MiB) free on the GPU, enough to pull more layers over. Trying &lt;code&gt;--n-cpu-moe 20&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-mmap&lt;/span&gt; &lt;span class="nt"&gt;--mlock&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 128000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt eval time =     584.25 ms /    29 tokens (   20.15 ms/tok,    49.64 tok/s)
       eval time =   88887.26 ms /  1758 tokens (   50.56 ms/tok,    19.78 tok/s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At &lt;code&gt;--n-cpu-moe 19&lt;/code&gt; it OOMs. &lt;strong&gt;20 is the floor for Gemma 4 at 128k context without MTP&lt;/strong&gt; — giving ~20 tok/s.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 9: Adding MTP Speculative Decoding
&lt;/h2&gt;

&lt;p&gt;Gemma 4 ships with a small "assistant" MTP (Multi-Token Prediction) head designed for speculative decoding. The idea: the small assistant drafts several tokens cheaply, then the main model verifies them in one pass. If enough drafts are accepted, effective throughput goes up.&lt;/p&gt;
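
&lt;p&gt;The draft-then-verify idea can be sketched as a toy greedy loop over integer tokens. Both "models" below are stand-in functions invented for illustration, and the verification is written as a plain loop for readability, whereas a real engine checks all drafted positions in one batched forward pass. This is not how the fork implements it, only the general shape of the technique.&lt;/p&gt;

```python
def speculative_step(target_next, draft_next, context, k):
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. The target model checks each drafted position. A real engine does this
    #    in a single batched forward pass; the loop here is only for clarity.
    accepted, ctx = [], list(context)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft matched, so the verification pass yields one extra token.
    accepted.append(target_next(ctx))
    return accepted

# Stand-in "models": the target counts upward mod 10; the draft agrees
# except right after a 3, where it guesses wrong.
def target(ctx):
    return (ctx[-1] + 1) % 10

def draft(ctx):
    return 0 if ctx[-1] == 3 else target(ctx)

print(speculative_step(target, draft, [0], k=3))
print(speculative_step(target, draft, [2], k=3))
```

&lt;p&gt;The first call accepts all three drafts plus a bonus token, so one verification yields four tokens. The second stops at the first mismatch and yields two. The output always matches what the target alone would have produced, so the only question is whether drafting is cheap enough to pay for itself.&lt;/p&gt;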

&lt;p&gt;Initial attempt with MTP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mtp-head&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers-draft&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--spec-type&lt;/span&gt; mtp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--draft-block-size&lt;/span&gt; 3 &lt;span class="nt"&gt;--draft-max&lt;/span&gt; 16 &lt;span class="nt"&gt;--draft-min&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k-draft&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v-draft&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-mmap&lt;/span&gt; &lt;span class="nt"&gt;--mlock&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 128000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval time =   55049.45 ms /  1151 tokens (   47.83 ms/tok,    20.91 tok/s)
draft acceptance rate = 0.76096 (694 accepted / 912 generated)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~21 tok/s. A 76% acceptance rate sounds good — but we only gained ~1 tok/s over the no-MTP baseline. Something is wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 10: Debugging Why MTP Barely Helps
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;llama_memory_breakdown_print&lt;/code&gt; line at startup showed ~4.5 GiB (4632 MiB) free on GPU. The assistant head is small, so why isn't it helping more?&lt;/p&gt;

&lt;p&gt;The clue is in the &lt;strong&gt;per-model &lt;code&gt;load_tensors&lt;/code&gt; stanzas&lt;/strong&gt; in the server's startup log. There are two of them — one for the main model, one for the assistant. Here's what they showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Main Gemma 4 26B-A4B:&lt;/span&gt;
load_tensors: offloaded 31/31 layers to GPU
load_tensors:          CPU model buffer size =   577.50 MiB
load_tensors:        CUDA0 model buffer size =  6504.39 MiB
load_tensors:    CUDA_Host model buffer size =  9498.51 MiB

&lt;span class="gh"&gt;# MTP assistant (5 layers):&lt;/span&gt;
load_tensors: offloaded 5/5 layers to GPU
load_tensors:          CPU model buffer size =   210.00 MiB   ← problem
load_tensors:        CUDA0 model buffer size =    82.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The assistant reports 5/5 layers "offloaded to GPU" — but 210 MiB is still on plain CPU, vs only 82 MiB on &lt;code&gt;CUDA0&lt;/code&gt;. The sum is ~292 MiB; what's in that CPU chunk?&lt;/p&gt;

&lt;p&gt;The answer is in &lt;code&gt;llama.cpp&lt;/code&gt;'s source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// assign the input layer&lt;/span&gt;
&lt;span class="c1"&gt;// there is very little benefit to offloading the input layer,&lt;/span&gt;
&lt;span class="c1"&gt;// so always keep it on the CPU&lt;/span&gt;
&lt;span class="n"&gt;pimpl&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;dev_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;cpu_dev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pimpl&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;cpu_buft_list&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token embedding table is &lt;strong&gt;unconditionally pinned to the CPU&lt;/strong&gt;, regardless of &lt;code&gt;--n-gpu-layers&lt;/code&gt;. For most models this is fine: the embedding lookup is a &lt;code&gt;get_rows&lt;/code&gt; operation — it pulls a handful of vocab rows per forward pass and is cheap from CPU.&lt;/p&gt;

&lt;p&gt;But Gemma 4 26B-A4B's assistant has a &lt;strong&gt;tied LM head&lt;/strong&gt;: the LM head matrix is the &lt;em&gt;same tensor&lt;/em&gt; as &lt;code&gt;token_embd.weight&lt;/code&gt;. Every single draft token generation performs a full &lt;code&gt;mul_mat(tok_embd, hidden_state)&lt;/code&gt;: a &lt;code&gt;262144 × 1024&lt;/code&gt; matmul against that 210 MiB table. The table is stored as q6_K even in the Q4_K_M file (the verbose log in Step 11 reports it as 210 MiB q6_K), so roughly the full 210 MiB is hauled across PCIe &lt;strong&gt;for every draft token generated&lt;/strong&gt;.&lt;/p&gt;
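
&lt;p&gt;The size of that table can be checked from its shape alone. The sketch below assumes ggml's k-quant block layouts (256 weights per block, 210 bytes for Q6_K and 144 bytes for Q4_K); those block sizes are my addition rather than something taken from the logs. Under that assumption, q6_K reproduces the 210 MiB CPU buffer exactly, while a q4_K table would come to 144 MiB.&lt;/p&gt;

```python
# Bits per weight for two llama.cpp k-quant block formats:
# Q6_K packs 256 weights into 210 bytes, Q4_K packs 256 weights into 144 bytes.
BITS_PER_WEIGHT = {"q6_K": 210 * 8 / 256, "q4_K": 144 * 8 / 256}

def tensor_mib(rows, cols, quant):
    return rows * cols * BITS_PER_WEIGHT[quant] / 8 / (1024 * 1024)

vocab, hidden = 262144, 1024
for quant in ("q6_K", "q4_K"):
    print(f"{quant}: {tensor_mib(vocab, hidden, quant):.0f} MiB per full pass over the table")
```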

&lt;p&gt;The supposed-to-be-free speculative decoding was actually adding PCIe load on top of the target model's MoE streaming. That's why MTP barely moved the needle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 11: Fix — Force the Embedding Table onto the GPU
&lt;/h2&gt;

&lt;p&gt;Two subtleties to get this right:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Use &lt;code&gt;--override-tensor-draft&lt;/code&gt;, not &lt;code&gt;--override-tensor&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; has parallel flags for the target model and the speculative draft model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-ot&lt;/span&gt;,  &lt;span class="nt"&gt;--override-tensor&lt;/span&gt;         &lt;span class="c"&gt;# affects the target model only&lt;/span&gt;
&lt;span class="nt"&gt;-otd&lt;/span&gt;, &lt;span class="nt"&gt;--override-tensor-draft&lt;/span&gt;   &lt;span class="c"&gt;# affects the assistant/draft model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Use the on-disk tensor name, not the C++ field name.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tensor is stored on disk as &lt;code&gt;token_embd.weight&lt;/code&gt;, not &lt;code&gt;mtp.tok_embd&lt;/code&gt;. The override flag matches against the on-disk name.&lt;/p&gt;
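
&lt;p&gt;A quick way to see why the name matters: the override pattern is a regular expression tested against each tensor name. The sketch below mimics that with Python's &lt;code&gt;re&lt;/code&gt; module. Treating the match as a search anywhere in the name is an assumption on my part, and apart from &lt;code&gt;token_embd.weight&lt;/code&gt; the tensor names are just examples.&lt;/p&gt;

```python
import re

def matching_tensors(pattern, tensor_names):
    # Search semantics (match anywhere in the name) are an assumption here.
    rx = re.compile(pattern)
    return [name for name in tensor_names if rx.search(name)]

# token_embd.weight is the on-disk name quoted above; the other entries are
# made-up examples for illustration.
names = ["token_embd.weight", "blk.0.attn_q.weight", "output_norm.weight"]

print(matching_tensors(r"token_embd\.weight", names))
print(matching_tensors(r"mtp\.tok_embd", names))
```

&lt;p&gt;The first pattern selects exactly one tensor; the second selects nothing, which is what a silently ineffective override looks like.&lt;/p&gt;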

&lt;p&gt;The corrected invocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mtp-head&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--spec-type&lt;/span&gt; mtp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--draft-block-size&lt;/span&gt; 3 &lt;span class="nt"&gt;--draft-max&lt;/span&gt; 16 &lt;span class="nt"&gt;--draft-min&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe&lt;/span&gt; 21 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers-draft&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe-draft&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--override-tensor-draft&lt;/span&gt; &lt;span class="s2"&gt;"token_embd&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;weight=CUDA0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k-draft&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v-draft&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-mmap&lt;/span&gt; &lt;span class="nt"&gt;--mlock&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 128000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: &lt;code&gt;--n-cpu-moe 21&lt;/code&gt; rather than 20 — moving 210 MiB into VRAM consumes that headroom. 20 now OOMs; 21 is the new floor.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;--verbose&lt;/code&gt;, you can confirm the override fired:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor token_embd.weight (210 MiB q6_K) buffer type overridden to CUDA0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the assistant's &lt;code&gt;load_tensors&lt;/code&gt; stanza now shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;load_tensors: offloaded 5/5 layers to GPU
load_tensors:        CUDA0 model buffer size =   292.24 MiB   # was 82 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
                                                              # CPU line gone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;82 + 210 = 292 MiB: the embedding table now lives in the CUDA0 buffer, and the CPU buffer has disappeared entirely.&lt;/p&gt;
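&lt;p&gt;As a sanity check on the 210 MiB figure, here is a minimal sketch of the arithmetic. It assumes the table has the 262144×1024 shape quoted later in this post and that q6_K packs each block of 256 weights into 210 bytes; &lt;code&gt;q6k_mib&lt;/code&gt; is a hypothetical helper name, not part of llama.cpp.&lt;/p&gt;

```shell
# Approximate size in MiB of a q6_K tensor with the given shape.
# Assumption: q6_K stores 256 weights per 210-byte block (6.5625 bits per weight).
q6k_mib() {
    rows=$1
    cols=$2
    echo $(( rows * cols / 256 * 210 / 1024 / 1024 ))
}

q6k_mib 262144 1024    # prints 210
```

&lt;p&gt;Adding that to the 82.24 MiB the assistant already had on the GPU gives the 292.24 MiB reported in the log above.&lt;/p&gt;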




&lt;h2&gt;
  
  
  Step 12: Results
&lt;/h2&gt;

&lt;p&gt;Sweeping &lt;code&gt;--n-cpu-moe&lt;/code&gt; to find the sweet spot (more expert layers on the CPU means less VRAM pressure but more PCIe traffic per target token):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;--n-cpu-moe 25&lt;/code&gt; (conservative):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval time = 48.07 ms/tok,  20.80 tok/s
draft acceptance rate = 0.74150 (829 accepted / 1118 generated)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;--n-cpu-moe 22&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval time = 40.95 ms/tok,  24.42 tok/s
draft acceptance rate = 0.82300 (637 accepted / 774 generated)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;--n-cpu-moe 21&lt;/code&gt; (OOM floor, sweet spot):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval time = 40.85 ms/tok,  24.48 tok/s
draft acceptance rate = 0.78587 (712 accepted / 906 generated)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(20 = OOM)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~24.5 tok/s at 128k context&lt;/strong&gt; — a real ~22% improvement over the ~20 tok/s no-MTP baseline.&lt;/p&gt;
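&lt;p&gt;The percentage is easy to reproduce. This is a minimal sketch that treats the rounded ~20 tok/s baseline as exact; &lt;code&gt;speedup_pct&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Percentage speedup of a measured throughput over a baseline, both in tok/s.
speedup_pct() {
    awk -v cur="$1" -v base="$2" 'BEGIN { printf "%.1f\n", (cur / base - 1) * 100 }'
}

speedup_pct 24.48 20    # prints 22.4
```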

&lt;p&gt;The &lt;code&gt;statistics mtp&lt;/code&gt; line tells the full story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;statistics mtp: #calls(b,g,a) = 1 453 389  dur(b,g,a) = 0.004, 3048.331, 0.086 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;dur(b,g,a)&lt;/code&gt; tuple is time in each MTP phase: &lt;em&gt;batch&lt;/em&gt; (prefill), &lt;em&gt;generation&lt;/em&gt; (drafting), &lt;em&gt;acceptance&lt;/em&gt; (verification). Generation takes ~3 seconds total across 453 calls (~6.7 ms per draft call); acceptance is essentially free at 0.086 ms total. That's exactly what you want: the draft model is CUDA-compute-bound, not PCIe-bound.&lt;/p&gt;
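&lt;p&gt;The per-call figure can be pulled straight from the log. This is a minimal sketch that assumes the line has exactly the whitespace-separated layout shown above (draft-call count in field 6, total generation time in field 11); &lt;code&gt;mtp_ms_per_draft&lt;/code&gt; is a hypothetical helper name, and a build that prints extra fields would need different field numbers.&lt;/p&gt;

```shell
# Read a llama-server log on stdin and print the mean milliseconds per draft call.
# Assumes the "statistics mtp:" line layout shown in this post.
mtp_ms_per_draft() {
    awk '/statistics mtp:/ {
        calls  = $6 + 0      # second entry of #calls(b,g,a): number of draft calls
        gen_ms = $11 + 0     # second entry of dur(b,g,a); "+ 0" drops the trailing comma
        if (calls != 0) printf "%.2f\n", gen_ms / calls
    }'
}

echo "statistics mtp: #calls(b,g,a) = 1 453 389  dur(b,g,a) = 0.004, 3048.331, 0.086 ms" | mtp_ms_per_draft    # prints 6.73
```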

&lt;p&gt;Before the fix, each draft call was individually slower — the matmul was crossing PCIe. After moving to CUDA0, per-call duration dropped and acceptance rate improved.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Diagnostic: How to Tell If Your MTP Head Is Actually on the GPU
&lt;/h2&gt;

&lt;p&gt;The startup &lt;code&gt;llama_memory_breakdown_print&lt;/code&gt; line is &lt;strong&gt;not reliable&lt;/strong&gt; — it covers the target model only, not the assistant. The correct check is the &lt;strong&gt;second &lt;code&gt;load_tensors&lt;/code&gt; stanza&lt;/strong&gt; in the startup log.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good&lt;/strong&gt; — no CPU line for the assistant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;load_tensors:        CUDA0 model buffer size =   292.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bad&lt;/strong&gt; — embedding table is on the CPU, MTP won't benefit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;load_tensors:          CPU model buffer size =   210.00 MiB
load_tensors:        CUDA0 model buffer size =    82.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a CPU line appears in the assistant's stanza, check whether the model has a tied LM head and add &lt;code&gt;--override-tensor-draft "token_embd\.weight=CUDA0"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The generation time in the &lt;code&gt;statistics mtp&lt;/code&gt; line is also a tell: a few milliseconds per draft call means the head is GPU-resident; tens of milliseconds means it is PCIe-bound.&lt;/p&gt;
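&lt;p&gt;The stanza check can also be scripted. This is a minimal sketch, assuming each loaded model prints one &lt;code&gt;load_tensors: offloaded&lt;/code&gt; line before its buffer sizes (as in the excerpts above) and that the assistant is the second model loaded; &lt;code&gt;check_assistant_placement&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Read a llama-server startup log on stdin and report where the assistant weights landed.
# Assumption: the second "load_tensors: offloaded" line opens the assistant stanza.
# Example: cat server.log | check_assistant_placement
check_assistant_placement() {
    awk '
        /load_tensors: offloaded/ { stanza++ }
        stanza == 2 {
            # CUDA_Host does not match: only a buffer whose name starts with CPU counts.
            if ($0 ~ /load_tensors: +CPU/) cpu_line = $0
        }
        END {
            if (stanza == 0 || stanza == 1) {
                print "UNKNOWN: no second load_tensors stanza found"
                exit 2
            }
            if (cpu_line != "") {
                print "BAD: " cpu_line
                exit 1
            }
            print "OK: assistant has no CPU buffer"
        }'
}
```

&lt;p&gt;The function exits with status 0 when the assistant's stanza has no CPU line and 1 when it does, so it can gate a launch script.&lt;/p&gt;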




&lt;h2&gt;
  
  
  Summary: Working Configuration
&lt;/h2&gt;

&lt;p&gt;Here's the fastest configuration found on this hardware (GTX 1080, 8 GiB VRAM, 128k context):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;atomic-llama-cpp-turboquant/build/bin

./llama-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe&lt;/span&gt; 21 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mtp-head&lt;/span&gt; ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-gpu-layers-draft&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--n-cpu-moe-draft&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--override-tensor-draft&lt;/span&gt; &lt;span class="s2"&gt;"token_embd&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;weight=CUDA0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--spec-type&lt;/span&gt; mtp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--draft-block-size&lt;/span&gt; 3 &lt;span class="nt"&gt;--draft-max&lt;/span&gt; 16 &lt;span class="nt"&gt;--draft-min&lt;/span&gt; 0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cache-type-k-draft&lt;/span&gt; turbo3 &lt;span class="nt"&gt;--cache-type-v-draft&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-mmap&lt;/span&gt; &lt;span class="nt"&gt;--mlock&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 128000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: ~24.5 tok/s, 128k context, ~79% draft acceptance rate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The MoE architecture is what makes this possible.&lt;/strong&gt; Only ~3.8B parameters are active per token; the rest sit cold in RAM and stream on demand. &lt;code&gt;--n-cpu-moe 21&lt;/code&gt; is the sweet spot between VRAM pressure and PCIe bandwidth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TurboQuant KV cache matters.&lt;/strong&gt; The &lt;code&gt;--cache-type-k turbo3 --cache-type-v turbo3&lt;/code&gt; flags (from the AtomicBot fork) are what get you from 16k context to 128k context on 8 GiB VRAM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MTP works — but only once you force the embedding table onto the GPU.&lt;/strong&gt; &lt;code&gt;--n-gpu-layers-draft 999&lt;/code&gt; is not enough. Gemma 4's assistant has a tied LM head; without &lt;code&gt;--override-tensor-draft "token_embd\.weight=CUDA0"&lt;/code&gt;, the 262144×1024 matmul runs against CPU memory, adding ~150 MiB of PCIe traffic per draft token and negating almost all of the speculative decoding benefit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check the second &lt;code&gt;load_tensors&lt;/code&gt; stanza, not &lt;code&gt;llama_memory_breakdown_print&lt;/code&gt;.&lt;/strong&gt; The breakdown line covers the target model only. The per-model load stanzas are the only reliable way to confirm where the assistant's weights actually landed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;The &lt;code&gt;llama.cpp&lt;/code&gt; fork used throughout: &lt;a href="https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant" rel="noopener noreferrer"&gt;AtomicBot-ai/atomic-llama-cpp-turboquant&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
