DEV Community

Cover image for Running Gemma 4 26B on an Old GTX 1080 with llama.cpp
Martin Andrews
Martin Andrews

Posted on

Running Gemma 4 26B on an Old GTX 1080 with llama.cpp

Gemma 4 Challenge: Write about Gemma 4 Submission

How to get Google's Gemma 4 26B-A4B Mixture-of-Experts model running locally — including speculative decoding — on hardware that has no business running it.


Google's Gemma 4 26B-A4B is a Mixture-of-Experts model: 25.2 billion total parameters, but only 3.8 billion are active per token. That distinction matters enormously for running it locally, because it means you can keep the cold expert weights in system RAM and stream them over PCIe, while a much smaller working set lives on the GPU.

This post walks through getting Gemma 4 running on a GeForce GTX 1080 — a 2016-vintage card with 8 GiB of VRAM — on Fedora 42, achieving ~24.5 tokens/second with 128k context, including fully-GPU-resident speculative decoding via Gemma 4's MTP assistant head.

For comparison: I also ran the Qwen 3.6 35B-A3B model through the same process. It produced slightly slower output at the same context length, and was much more verbose given the same prompts — so for typical assistant workloads, Gemma 4 ends up faster end-to-end regardless of tok/s.


The Hardware

The full system spec matters here, because the CPU and RAM are as important as the GPU when streaming MoE weights over PCIe:

Component Spec
CPU Intel i7-6700 (Skylake, 4c/8t, 2015)
RAM 32 GiB system RAM
GPU NVIDIA GeForce GTX 1080, 8 GiB VRAM (Pascal, 2016)
OS Fedora 42

Nothing here is new : I bought the GPU second-hand in 2025 for under $200 USD.

The key bottleneck to understand upfront:

# Check PCIe link state while the model is generating
lspci -vv -s 01:00.0 | grep LnkSta
#   LnkSta: Speed 8GT/s, Width x16
# i.e. running at PCIe 3.0 maximum
Enter fullscreen mode Exit fullscreen mode

At the same time, nvidia-smi shows the GPU at roughly 40–50% utilisation. PCIe maxed out + GPU half-idle = bandwidth-limited, not compute-limited. This is the single most important fact for this setup: anything that reduces the volume of weight data crossing the PCIe bus per token helps; just having a faster GPU wouldn't.


Gemma 4 26B-A4B: What You're Working With

Property Value
Total parameters 25.2B
Active parameters per token 3.8B
Layers 30
Trained context 256K tokens

The trick: with MoE models, only a few experts activate per token. llama.cpp exposes this directly:

  • --n-cpu-moe N — keep the MoE weights of the first N layers on the CPU
  • --n-gpu-layers 999 — everything else on the GPU

On this card the sweet spot for 128k context turns out to be --n-cpu-moe 21 (with MTP) or --n-cpu-moe 20 (without).


Step 1: Pin the NVIDIA Driver to the 580xx Branch

Pascal (GTX 1080) is approaching legacy status. On Fedora 42 you need to pin akmod-nvidia to the 580xx branch:

dnf swap akmod-nvidia akmod-nvidia-580xx --allowerasing --releasever=44
Enter fullscreen mode Exit fullscreen mode

The --releasever=44 is necessary to pull 580xx packaging from the newer repo metadata, even though the running system is Fedora 42.


Step 2: CUDA Toolkit and a Working nvcc

dnf reinstall cuda-nvcc-12-9.x86_64
find / | grep nvcc
# /usr/local/cuda-12.9/bin/nvcc

export CUDACXX=/usr/local/cuda-12.9/bin/nvcc
Enter fullscreen mode Exit fullscreen mode

Step 3: Force gcc-14 for the CUDA Build

CUDA 12.9 doesn't accept the newest gcc that Fedora 42 ships by default:

dnf install gcc14 gcc14-c++
Enter fullscreen mode Exit fullscreen mode

The straightforward -DCMAKE_C_COMPILER CMake flags don't work here — somewhere inside the NVIDIA/CUDA CMake modules, plain gcc is hard-coded. The least-bad workaround is a symlink early on PATH:

mkdir -p ~/.local/bin
pushd ~/.local/bin/
  ln -s /usr/bin/gcc-14 gcc
  ln -s /usr/bin/g++-14 g++
popd

# Confirm ~/.local/bin is at the front of PATH
echo $PATH
Enter fullscreen mode Exit fullscreen mode

Remember to remove these symlinks afterwards if you don't want every other build using gcc-14.


Step 4: Patch CUDA's math_functions.h for glibc 2.41

CUDA 12.9 headers were written against an older glibc. On Fedora 42 (glibc 2.41) some inline definitions collide. Gentoo has a clean patch:

# Edit by hand, applying the patch:
$EDITOR /usr/local/cuda-12.9/targets/x86_64-linux/include/crt/math_functions.h
Enter fullscreen mode Exit fullscreen mode

The core change: replace rsqrt(double x); with rsqrt(double x) noexcept (true);, and __func__(double rsqrt(double a)); with __func__(double rsqrt(double a)) throw();, for these six functions:

double rsqrt(double a);  double sinpi(double a);  double cospi(double a);
float rsqrtf(float a);   float sinpif(float a);   float cospif(float a);
Enter fullscreen mode Exit fullscreen mode

Step 5: Choose the Right llama.cpp Fork

Vanilla llama.cpp works for most cases, but for Gemma 4 on an 8 GiB card you need two things standard llama.cpp doesn't have:

  1. RotorQuant — a Gemma-specific KV-cache quantisation scheme that makes the difference between fitting at 16k context and fitting at 128k context
  2. MTP speculative decoding support — for Gemma 4's assistant head

The right fork is AtomicBot-ai/atomic-llama-cpp-turboquant, which combines RotorQuant, TurboQuant KV cache, and Gemma 4 MTP support.

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git
cd atomic-llama-cpp-turboquant/
Enter fullscreen mode Exit fullscreen mode

Step 6: Build llama.cpp

export CUDACXX=/usr/local/cuda-12.9/bin/nvcc

cmake --fresh -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
# 'native' picks up the compute capability of the installed GPU automatically.

cmake --build build --config Release
# NB: --parallel seemed to cause problems, so leave it off
Enter fullscreen mode Exit fullscreen mode

Sanity check:

cd ./build/bin

./llama-cli --list-devices
# ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8107 MiB):
#   Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes, VRAM: 8107 MiB
Enter fullscreen mode Exit fullscreen mode

It's also worth dumping the full help text — there are a lot of flags and you'll be grepping it constantly:

./llama-server --help > llama.cpp-man.txt
wc -l llama.cpp-man.txt
# 570 llama.cpp-man.txt
Enter fullscreen mode Exit fullscreen mode

Step 7: Download Gemma 4

You need two GGUFs: the main model and the MTP assistant head.

# Via huggingface-cli:
hf download AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF \
    --include "*Q4_K_M.gguf" --local-dir ./Models

hf download unsloth/gemma-4-26B-A4B-it-GGUF \
    --include "*Q4_K_M*.gguf" --local-dir ./Models
Enter fullscreen mode Exit fullscreen mode

Or directly via wget:

wget https://huggingface.co/AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF/resolve/main/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf
wget https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf
Enter fullscreen mode Exit fullscreen mode

Move them to ~/Models/ for sanity.

Model cards:


Step 8: First Runs (Baseline, No MTP)

cd ./build/bin

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 29 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 16384
Enter fullscreen mode Exit fullscreen mode

A smoke-test query from another terminal:

curl -X POST http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "messages": [
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Please write a program to stream the Fibonacci numbers under 1000 - with the restriction that there should be only one print statement in the loop"}
       ]
     }'
Enter fullscreen mode Exit fullscreen mode

Timing from the server log:

prompt eval time =    1035.77 ms /    52 tokens (   19.92 ms/tok,    50.20 tok/s)
       eval time =   67336.21 ms /  1076 tokens (   62.58 ms/tok,    15.98 tok/s)
Enter fullscreen mode Exit fullscreen mode

About 16 tok/s. Pushing to 128k context:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 29 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000
Enter fullscreen mode Exit fullscreen mode
prompt eval time =     985.96 ms /    52 tokens (   18.96 ms/tok,    52.74 tok/s)
       eval time =   98309.88 ms /  1538 tokens (   63.92 ms/tok,    15.64 tok/s)
Enter fullscreen mode Exit fullscreen mode

The memory breakdown at startup is informative:

llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (GTX 1080)   |  8107 = 4632 + ( 3299 =  2103 +     664 +     532) +         174 |
llama_memory_breakdown_print: |   - Host               |                 14747 = 14477 +       0 +     270                |
Enter fullscreen mode Exit fullscreen mode

~3.3 GiB on GPU, ~14.7 GiB on host. There's ~4.6 GiB free on GPU — enough to pull more layers over. Trying --n-cpu-moe 20:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 20 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000
Enter fullscreen mode Exit fullscreen mode
prompt eval time =     584.25 ms /    29 tokens (   20.15 ms/tok,    49.64 tok/s)
       eval time =   88887.26 ms /  1758 tokens (   50.56 ms/tok,    19.78 tok/s)
Enter fullscreen mode Exit fullscreen mode

At --n-cpu-moe 19 it OOMs. 20 is the floor for Gemma 4 at 128k context without MTP — giving ~20 tok/s.


Step 9: Adding MTP Speculative Decoding

Gemma 4 ships with a small "assistant" MTP (Multi-Token Prediction) head designed for speculative decoding. The idea: the small assistant drafts several tokens cheaply, then the main model verifies them in one pass. If enough drafts are accepted, effective throughput goes up.

Initial attempt with MTP:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 20 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --mtp-head ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
    --n-gpu-layers-draft 999 \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 16 --draft-min 0 \
    --cache-type-k-draft turbo3 --cache-type-v-draft turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000
Enter fullscreen mode Exit fullscreen mode
eval time =   55049.45 ms /  1151 tokens (   47.83 ms/tok,    20.91 tok/s)
draft acceptance rate = 0.76096 (694 accepted / 912 generated)
Enter fullscreen mode Exit fullscreen mode

~21 tok/s. A 76% acceptance rate sounds good — but we only gained ~1 tok/s over the no-MTP baseline. Something is wrong.


Step 10: Debugging Why MTP Barely Helps

The llama_memory_breakdown_print line at startup showed ~4.6 GiB free on GPU. The assistant head is small — why isn't it helping more?

The clue is in the per-model load_tensors stanzas in the server's startup log. There are two of them — one for the main model, one for the assistant. Here's what they showed:

# Main Gemma 4 26B-A4B:
load_tensors: offloaded 31/31 layers to GPU
load_tensors:          CPU model buffer size =   577.50 MiB
load_tensors:        CUDA0 model buffer size =  6504.39 MiB
load_tensors:    CUDA_Host model buffer size =  9498.51 MiB

# MTP assistant (5 layers):
load_tensors: offloaded 5/5 layers to GPU
load_tensors:          CPU model buffer size =   210.00 MiB   ← problem
load_tensors:        CUDA0 model buffer size =    82.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
Enter fullscreen mode Exit fullscreen mode

The assistant reports 5/5 layers "offloaded to GPU" — but 210 MiB is still on plain CPU, vs only 82 MiB on CUDA0. The sum is ~292 MiB; what's in that CPU chunk?

The answer is in llama.cpp's source:

// assign the input layer
// there is very little benefit to offloading the input layer,
// so always keep it on the CPU
pimpl->dev_input = { cpu_dev, &pimpl->cpu_buft_list };
Enter fullscreen mode Exit fullscreen mode

The token embedding table is unconditionally pinned to the CPU, regardless of --n-gpu-layers. For most models this is fine: the embedding lookup is a get_rows operation — it pulls a handful of vocab rows per forward pass and is cheap from CPU.

But Gemma 4 26B-A4B's assistant has a tied LM head: the LM head matrix is the same tensor as token_embd.weight. Every single draft token generation performs a full mul_mat(tok_embd, hidden_state) — a 262144 × 1024 matmul against that 210 MiB table. At Q4_K_M that's ~150 MiB hauled across PCIe for every draft token generated.

The supposed-to-be-free speculative decoding was actually adding PCIe load on top of the target model's MoE streaming. That's why MTP barely moved the needle.


Step 11: Fix — Force the Embedding Table onto the GPU

Two subtleties to get this right:

1. Use --override-tensor-draft, not --override-tensor.

llama.cpp has parallel flags for the target model and the speculative draft model:

-ot,  --override-tensor         # affects the target model only
-otd, --override-tensor-draft   # affects the assistant/draft model
Enter fullscreen mode Exit fullscreen mode

2. Use the on-disk tensor name, not the C++ field name.

The tensor is stored on disk as token_embd.weight, not mtp.tok_embd. The override flag matches against the on-disk name.

The corrected invocation:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --mtp-head ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 16 --draft-min 0 \
    --n-gpu-layers 999 \
    --n-cpu-moe 21 \
    --n-gpu-layers-draft 999 \
    --n-cpu-moe-draft 0 \
    --override-tensor-draft "token_embd\.weight=CUDA0" \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --cache-type-k-draft turbo3 --cache-type-v-draft turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000
Enter fullscreen mode Exit fullscreen mode

Note: --n-cpu-moe 21 rather than 20 — moving 210 MiB into VRAM consumes that headroom. 20 now OOMs; 21 is the new floor.

With --verbose, you can confirm the override fired:

tensor token_embd.weight (210 MiB q6_K) buffer type overridden to CUDA0
Enter fullscreen mode Exit fullscreen mode

And the assistant's load_tensors stanza now shows:

load_tensors: offloaded 5/5 layers to GPU
load_tensors:        CUDA0 model buffer size =   292.24 MiB   # was 82 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
                                                              # CPU line gone
Enter fullscreen mode Exit fullscreen mode

82 + 210 = 292. The CPU buffer has disappeared entirely.


Step 12: Results

Sweeping --n-cpu-moe to find the sweet spot (more CPU layers = less VRAM pressure but more PCIe per target token):

--n-cpu-moe 25 (conservative):

eval time = 48.07 ms/tok,  20.80 tok/s
draft acceptance rate = 0.74150 (829 accepted / 1118 generated)
Enter fullscreen mode Exit fullscreen mode

--n-cpu-moe 22:

eval time = 40.95 ms/tok,  24.42 tok/s
draft acceptance rate = 0.82300 (637 accepted / 774 generated)
Enter fullscreen mode Exit fullscreen mode

--n-cpu-moe 21 (OOM floor, sweet spot):

eval time = 40.85 ms/tok,  24.48 tok/s
draft acceptance rate = 0.78587 (712 accepted / 906 generated)
Enter fullscreen mode Exit fullscreen mode

(20 = OOM)

~24.5 tok/s at 128k context — a real ~22% improvement over the ~20 tok/s no-MTP baseline.

The mtp statistics line tells the full story:

statistics mtp: #calls(b,g,a) = 1 453 389  dur(b,g,a) = 0.004, 3048.331, 0.086 ms
Enter fullscreen mode Exit fullscreen mode

The dur(b,g,a) tuple is time in each MTP phase: batch (prefill), generation (drafting), acceptance (verification). Generation takes ~3 seconds total across 453 calls (~6.7 ms per draft call); acceptance is essentially free at 0.086 ms total. That's exactly what you want: the draft model is CUDA-compute-bound, not PCIe-bound.

Before the fix, each draft call was individually slower — the matmul was crossing PCIe. After moving to CUDA0, per-call duration dropped and acceptance rate improved.


The Diagnostic: How to Tell If Your MTP Head Is Actually on the GPU

The startup llama_memory_breakdown_print line is not reliable — it covers the target model only, not the assistant. The correct check is the second load_tensors stanza in the startup log.

Good — no CPU line for the assistant:

load_tensors:        CUDA0 model buffer size =   292.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
Enter fullscreen mode Exit fullscreen mode

Bad — embedding table is on the CPU, MTP won't benefit:

load_tensors:          CPU model buffer size =   210.00 MiB
load_tensors:        CUDA0 model buffer size =    82.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
Enter fullscreen mode Exit fullscreen mode

If the CPU line is non-zero for the assistant, check whether the model has a tied LM head and add --override-tensor-draft "token_embd\.weight=CUDA0".

The mtp statistics generation time is also a tell: a few milliseconds per draft call means GPU-resident; tens of milliseconds means PCIe-bound.


Summary: Working Configuration

Here's the fastest configuration found on this hardware (GTX 1080, 8 GiB VRAM, 128k context):

cd atomic-llama-cpp-turboquant/build/bin

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 21 \
    --mtp-head ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
    --n-gpu-layers-draft 999 \
    --n-cpu-moe-draft 0 \
    --override-tensor-draft "token_embd\.weight=CUDA0" \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 16 --draft-min 0 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --cache-type-k-draft turbo3 --cache-type-v-draft turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000
Enter fullscreen mode Exit fullscreen mode

Result: ~24.5 tok/s, 128k context, ~79% draft acceptance rate.

The key lessons:

  1. The MoE architecture is what makes this possible. Only ~3.8B parameters are active per token; the rest sit cold in RAM and stream on demand. --n-cpu-moe 21 is the sweet spot between VRAM pressure and PCIe bandwidth.

  2. RotorQuant KV cache matters. The --cache-type-k turbo3 --cache-type-v turbo3 flags (from the AtomicBot fork) are what get you from 16k context to 128k context on 8 GiB VRAM.

  3. MTP works — but only once you force the embedding table onto the GPU. --n-gpu-layers-draft 999 is not enough. Gemma 4's assistant has a tied LM head; without --override-tensor-draft "token_embd\.weight=CUDA0", the 262144×1024 matmul runs against CPU memory, adding ~150 MiB of PCIe traffic per draft token and negating almost all of the speculative decoding benefit.

  4. Check the second load_tensors stanza, not llama_memory_breakdown_print. The breakdown line covers the target model only. The per-model load stanzas are the only reliable way to confirm where the assistant's weights actually landed.


The llama.cpp fork used throughout: AtomicBot-ai/atomic-llama-cpp-turboquant

Top comments (0)