RTX 4070 + Qwen 35B: 2.8x Speedup From One llama.cpp Flag (--cpu-moe)

#llm #performance #ai #hardware

Ollama's defaults gave me 12.2 tok/s on Qwen3.5-35B-A3B with an RTX 4070 (12 GB). Switching to llama.cpp with two flags got me 34.6 tok/s, a 2.8x speedup.

The two flags were -ngl 99 (offload all layers to the GPU) and --cpu-moe (but keep the MoE expert weights on the CPU). One of them is the obvious one. The other is the entire post.

llama-server -m qwen35.gguf -ngl 99 --cpu-moe -c 4096

I want to walk through why that specific combination works on a 12 GB card that "should not" fit a 35B model, the full offload sweep so you can pick the right setting for your VRAM tier, and the three things that will cause the same command to under-perform on your machine.

The setup, briefly

GPU: RTX 4070, 12 GB
RAM: 31 GB
OS: WSL2 Ubuntu 24.04
CUDA: 12.9
Model: Qwen3.5-35B-A3B, Q4_K_M quantization (20.49 GiB, 34.66B params, 128 experts, 8 active per token)
llama.cpp: built with cmake -B build -DGGML_CUDA=ON

If you are on the newer Qwen3.6-35B-A3B (released early 2026) or the smaller Qwen3-Coder-30B-A3B, the same flag math applies. Both are 128-expert MoEs with 8 active per forward pass, so the offload ratio is identical and the speedup pattern transfers.

Why "no chunks in VRAM" wins on a 12 GB card

The reflexive instinct on a 12 GB card is to pack as much of the model as possible into VRAM and let the rest spill to RAM. That instinct is wrong for MoE.

A dense 35B model wants every layer warm because every weight participates in every token. A MoE model is the opposite: of the 128 experts in each layer, only 8 fire per token. The other 120 sit idle for that token, so parking them in scarce VRAM buys nothing.
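To put a number on that sparsity, here is a minimal sketch that computes the share of experts used per token from the counts quoted in this post (128 experts, 8 active). The helper name active_pct is made up for illustration.

```shell
#!/bin/sh
# Share of experts that fire (or sit idle) per token.
# Counts come from this post: 128 experts per layer, 8 active per token.
active_pct() {
    # $1 = number of experts, $2 = total experts per layer
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * a / t }'
}

active_pct 8 128     # experts used per token   -> 6.25
active_pct 120 128   # experts idle per token   -> 93.75
```

Roughly 6% of each layer's experts do work on any given token, which is why they are the natural candidates to leave in system RAM.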

-ngl 99 says "put every layer on the GPU." On a 12 GB card with a 20 GiB model, this should be impossible.

--cpu-moe (added to llama.cpp in mid-2025 as the all-CPU shortcut for the more granular --n-cpu-moe N) is the escape hatch: put every layer on the GPU except the MoE experts, which stay on the CPU. Now what is on the GPU is the bandwidth-hungry part (attention, KV cache, layer norms, the router) and what is on the CPU is the sparse part (the experts themselves, which barely fire).

The result on this machine:

VRAM used: 11.7 GB (over 97% of 12 GB)
Generation: 34.6 tok/s (vs Ollama default 12.2)

Ollama's autoloader is not dumb — it figures out that ~58% of the model has to go on CPU and ~42% on GPU. But it splits the model the dense way: by layer. So you end up with half the attention paths on CPU (where bandwidth chokes) and half the experts on GPU (where they sit idle most of the time). It is the worst of both worlds, and it costs you 2.8x.

The full offload sweep

The flag people will reach for next is --n-cpu-moe N, which lets you offload only N layers of experts to CPU and put the rest on GPU. The instinct is "well, GPU is faster, so put as many experts back on GPU as fit." This is also wrong, and the sweep shows by how much.

-ngl 99 is fixed. Only N changes:

`--n-cpu-moe`	Expert layers on GPU	tok/s (tg128)	vs Ollama default
48 (all experts on CPU)	0	34.60	2.8x
44	4	27.19	2.2x
40	8	16.88	1.4x
36	12	15.29	1.3x
32	16	14.06	1.2x
28	20	12.85	1.1x
24	24	11.71	0.96x

Monotonic decay. The moment you start pulling experts back onto the GPU, you steal VRAM from the parts that actually need bandwidth, and the whole pipeline slows down. By the time half the experts (24 layers) are back on GPU, you are slower than Ollama's automatic split.

The reading is direct: the optimum for a 12 GB card on a 128-expert MoE is all experts on CPU. Not "as many as fit." All of them.
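If you want to check the "vs Ollama default" column, or redo it with your own baseline, the arithmetic is a single division per row. A small sketch, with the tok/s values copied from the table and the 12.2 tok/s baseline from the top of the post; the speedup helper is a name chosen here for illustration.

```shell
#!/bin/sh
# Recompute the speedup column from the tg128 numbers in the sweep table.
BASELINE=12.2   # Ollama default in tok/s, as reported in this post

speedup() {
    # $1 = measured tok/s, $2 = baseline tok/s
    awk -v m="$1" -v b="$2" 'BEGIN { printf "%.2f\n", m / b }'
}

# Each entry is N:tok/s, where N is the --n-cpu-moe value.
for row in 48:34.60 44:27.19 40:16.88 36:15.29 32:14.06 28:12.85 24:11.71; do
    n=${row%%:*}
    tps=${row#*:}
    echo "N=$n  $tps tok/s  $(speedup "$tps" "$BASELINE")x"
done
```

Any row that prints below 1.00x is slower than the baseline, which in this table is only N=24.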

The bench command to reproduce, if you want to:

./build/bin/llama-bench -m qwen35.gguf -ngl 99 -ncmoe 48 -n 128 -r 3

-ncmoe 48 is the bench-tool equivalent of --n-cpu-moe 48, which in turn is the explicit form of --cpu-moe for a 48-layer model. Same setting, three names. The MoE offload flags went through a period of active iteration in llama.cpp, so if your build rejects one of these names, check its --help output.
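To reproduce the whole sweep rather than a single point, one option is to generate the seven commands and run them in order. This is a sketch, not the script used for the table: the model path and binary location are placeholders, and it only prints the commands unless you pipe them to a shell.

```shell
#!/bin/sh
# Print one llama-bench command per sweep point (dry run).
# MODEL and BENCH are placeholders; adjust them to your own paths.
MODEL=qwen35.gguf
BENCH=./build/bin/llama-bench

sweep_cmds() {
    # Same N values as the table: 48 down to 24 in steps of 4.
    for n in 48 44 40 36 32 28 24; do
        echo "$BENCH -m $MODEL -ngl 99 -ncmoe $n -n 128 -r 3"
    done
}

sweep_cmds
```

Once the printed commands look right, running sweep_cmds | sh executes them one after another. Expect the lower N values to fail with an out-of-memory error on cards smaller than the one measured here.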

What VRAM tier maps to what

You can almost read your sweet spot off the table above. Roughly:

8–10 GB cards (RTX 4060 / 3070): full --cpu-moe. You will not have headroom to put any experts on GPU.
12 GB cards (RTX 4070 / 3060 12 GB): full --cpu-moe. The sweep above is yours. 34.6 tok/s is the realistic ceiling for Q4 35B-A3B.
16 GB cards (RTX 4060 Ti 16 GB / 4070 Ti SUPER): you can start putting a few expert layers back on GPU (N=44, i.e. 4 layers on GPU) and gain a little, but only a little. Past that point, PCIe-bus transfers for the experts still on CPU dominate and eat the win.
24 GB cards (RTX 4090 / 3090): you can fit the model fully and skip this entire post. Lucky you.

The crossover where "all on GPU" beats "experts on CPU" is somewhere around 24 GB for Q4 35B-A3B. Below that, the numbers say split the model: experts on CPU, everything else on GPU.
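The tiers above can be written down as a starting-point lookup. Treat this as a sketch: the function name suggest_flags and the exact cut-offs at 16 and 24 GB are assumptions, and only the 12 GB row was actually measured in this post.

```shell
#!/bin/sh
# Map a VRAM size in whole GB to the starting flags suggested by the tiers.
# Cut-offs (16, 24) are assumptions; only 12 GB was benchmarked here.
suggest_flags() {
    vram_gb=$1
    if [ "$vram_gb" -ge 24 ]; then
        printf '%s\n' "-ngl 99"                  # model fits; no expert offload
    elif [ "$vram_gb" -ge 16 ]; then
        printf '%s\n' "-ngl 99 --n-cpu-moe 44"   # try 4 expert layers on GPU
    else
        printf '%s\n' "-ngl 99 --cpu-moe"        # all experts on CPU
    fi
}

for gb in 8 12 16 24; do
    echo "$gb GB: $(suggest_flags "$gb")"
done
```

Whatever it prints is only where to start the sweep on your card, not a substitute for running it.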

Three things that will tank your number

The 34.6 tok/s above is not what you get by pasting the command. It is what you get by pasting the command after clearing three traps.

Trap 1: VRAM is not actually free. WSL2 happily shares VRAM with Windows-side processes. If Edge has 200 MB of "hardware acceleration" stuck in VRAM, your attention layer fights for it and loses. Check with nvidia-smi before benching. The number you want next to "Used" is under 500 MB before you start.
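A pre-bench check for Trap 1 can be scripted. The sketch below assumes a single GPU (it reads only the first line of output) and uses the 500 MB threshold from the paragraph above; nvidia-smi reports the value in MiB, which is close enough for this purpose. The function names are illustrative.

```shell
#!/bin/sh
# Pre-bench check: is VRAM usage below the ~500 MB idle threshold?
vram_used_mib() {
    # First GPU only; output is a bare number in MiB.
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits \
        | head -n 1 | tr -d ' '
}

vram_is_free() {
    # $1 = used MiB, $2 = limit in MiB (defaults to 500)
    [ "$1" -lt "${2:-500}" ]
}

# Only query the GPU when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    used=$(vram_used_mib)
    if vram_is_free "$used"; then
        echo "OK: $used MiB in use"
    else
        echo "Busy: $used MiB in use; close GPU-accelerated apps first"
    fi
fi
```

Run it right before llama-bench, since a browser tab opened in between can change the answer.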

Trap 2: Qwen thinking mode. Qwen3.5 has a "thinking" mode that emits reasoning tokens before the answer. If you benchmark with a generic prompt, it will think for 400 tokens about "what is 2+2" and your tok/s number is meaningless. Either disable thinking via the system prompt or measure with prompts that bypass it.
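To see how much reasoning tokens can distort what you perceive as speed, it helps to convert them into seconds. A rough sketch, assuming the 400-token figure above and assuming reasoning tokens generate at the same rate as answer tokens; think_delay is a made-up helper name.

```shell
#!/bin/sh
# Seconds spent emitting reasoning tokens before the answer starts.
think_delay() {
    # $1 = reasoning tokens, $2 = generation speed in tok/s
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

think_delay 400 34.6   # at this post's best speed       -> 11.6
think_delay 400 12.2   # at the Ollama default speed     -> 32.8
```

Even at the faster setting, that is more than ten seconds of output before the first useful token, so compare runs only when thinking is handled the same way in both.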

Trap 3: Quantization and build flags. The 34.6 figure is Q4_K_M with a CUDA-enabled build. Q5_K_M will fall to roughly 28–30 tok/s on the same card because the experts get heavier per token. A CPU-only build of llama.cpp will obviously sit at single digits. If your number is off by 40%, check nvidia-smi during inference — llama-server should show 95%+ GPU utilization on the prompt and 30–60% on generation. If it shows 5%, you are running CPU-only without realizing it.
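The utilization ranges in Trap 3 can be turned into a quick sanity check. This is a sketch: the 30 to 60% band comes from the paragraph above, while the 10% cut-off for "probably CPU-only" and the function name are assumptions.

```shell
#!/bin/sh
# Classify GPU utilization observed during token generation.
# The 30-60% band is from this post; the 10% cut-off is an assumption.
classify_gen_util() {
    pct=$1
    if [ "$pct" -le 10 ]; then
        echo "cpu-only-suspected"
    elif [ "$pct" -ge 30 ] && [ "$pct" -le 60 ]; then
        echo "expected"
    else
        echo "investigate"
    fi
}

# To sample utilization once per second while llama-server is generating:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1
classify_gen_util 5     # -> cpu-only-suspected
classify_gen_util 45    # -> expected
```

A reading in the "investigate" range is not necessarily a problem, just a prompt to look at the build flags and quantization before trusting the benchmark.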

The line worth memorizing

For dense models, "put as much as possible on GPU" is correct. For sparse MoE on consumer GPUs, "put everything on GPU except the experts" is correct, and the gap between those two heuristics is 2.8x on this card.

The one-line version: the bottleneck on a 12 GB card running a 35B MoE is not parameters. It is bandwidth. The right partition puts the bandwidth-hungry part on the bandwidth-rich device, even when that means leaving the bulk of the parameter count on the CPU.

If you take one thing away from this post: run llama-bench with -ngl 99 -ncmoe 48 -n 128 -r 3 on your card and write the number down. If it is more than 2x your Ollama default, the rest of your local-LLM tuning is variance. If it is less, your VRAM is not actually free.

If you want the full data-driven engineering pattern behind this kind of measurement-first tuning — same logic applied to broader system harnesses, build pipelines, and agent loops — Harness Engineering Guide is the long form.