SomeOddCodeGuy

Originally published at someoddcodeguy.dev

Understanding MoE Offloading

So last night/early this morning, I decided to go down a "how does this work?" rabbit hole. I was trying to answer someone's question about how Llama.cpp handles offloading with Mixture of Experts models on a regular gaming PC with a 24GB GPU, and ended up spending a few hours in a deep dive. I figured I'd write up a description here, too, in case anyone else stumbles across this and is curious how it works.

So most folks have seen how these models label themselves with one total parameter count and a much smaller active parameter count. Take Qwen3 30B-A3B: a 30B model with 3B active parameters. As the name implies, when you send inference through it, only about 3B parameters' worth of the model does the work for any given token.

The trick is that the model is built with a "router" that, for each token, selects a small subset of "experts" to process it. Again, old news for most of you, but just covering the bases.

Now that the obvious is out of the way, let's get into what does trip people up.

An expert is essentially a self-contained "feed-forward network" with its own set of parameters. The model has a whole library of them to choose from. Let's use gpt-oss-120b as a concrete example. It's listed as a ~120B parameter model, but only activates 5.1B parameters per token. The active parameter count is made up of two parts:

  • The "dense" or shared parts (~1.5B parameters): These are components like the self-attention mechanism and the router itself. They're always on and are used for every single token processed.
  • The active expert parameters (~3.5B parameters): This is the "on-demand" part. This model has 36 layers, and each layer contains 128 distinct experts, but only 4 are used per token, per layer (this is all in their documentation). That's a pool of 4,608 experts total, and as the token works its way through the 36 layers one by one, the router in each layer picks 4 of those experts to use. Since the total expert parameter pool is around 114.7B, each expert works out to roughly 24M parameters. The calculation for the active portion is: 36 layers * 4 experts/layer * 24M params/expert ≈ 3.5B parameters.

Add the dense part (~1.5B) to the active expert part (~3.5B), and you get your ~5.1B active parameters. So for any given token, the computational load is that of a 5.1B model, not a 120B one.
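
If you want to sanity-check that math yourself, here's a throwaway little C++ program. It has nothing to do with llama.cpp itself; it's just the arithmetic from above, using the layer/expert counts and the ~114.7B expert pool quoted earlier, so the output is approximate.

// Back-of-the-napkin check of the active-parameter math above. All the
// inputs are the figures quoted in this post, so the output is approximate.
#include <cstdio>

int main() {
    const double dense_params      = 1.5e9;    // always-on attention + router weights
    const int    layers            = 36;
    const int    experts_per_layer = 128;
    const int    active_per_layer  = 4;        // experts the router picks per token, per layer
    const double expert_pool       = 114.7e9;  // total parameters across all experts

    const int    total_experts  = layers * experts_per_layer;                  // 4,608
    const double params_per_exp = expert_pool / total_experts;                 // ~24.9M
    const double active_experts = layers * active_per_layer * params_per_exp;  // ~3.6B
    const double active_total   = dense_params + active_experts;               // ~5.1B

    printf("experts in the pool:  %d\n", total_experts);
    printf("params per expert:    ~%.1fM\n", params_per_exp / 1e6);
    printf("active expert params: ~%.2fB\n", active_experts / 1e9);
    printf("total active params:  ~%.2fB\n", active_total / 1e9);
    return 0;
}

Run it and you get roughly 24.9M parameters per expert and right around the advertised ~5.1B active.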

So that, in a nutshell, is what an MoE is doing. But that all just explains the computation, not the memory.

Even quantized, the full 120B parameter set often won't fit in 24GB of VRAM. The rough size math is (bits per weight ÷ 8) * parameter count, so you'd have to go down to around 1bpw to manage it ((1 ÷ 8) * 120 = 15GB for a 1bpw model). This model is a little more fun in that it's an MXFP4-trained model, meaning the bulk of it is already stored at roughly 4bpw and the whole thing is about half the size you'd expect from an 8-bit quant ((4 ÷ 8) * 120 = 60GB). But even a ~2bpw quant lands around (2 ÷ 8) * 120 = 30GB, still well over 24GB. Of course MXFP4 is a little bigger than just plain 4bpw, but close enough.
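
If you want to play with the size math, here's the same kind of napkin sketch for the VRAM estimate. It only counts the weights themselves, with no KV cache or runtime buffers, so treat the output as a floor rather than a real memory requirement.

// Napkin-math weight size: parameter count * bits-per-weight / 8.
// Weights only; KV cache and runtime overhead come on top of this.
#include <cstdio>

int main() {
    const double params = 120e9; // gpt-oss-120b, roughly
    const double bpws[] = {1.0, 2.0, 4.0, 8.0};

    for (double bpw : bpws) {
        const double gb = params * bpw / 8.0 / 1e9;
        printf("%.0f bpw -> ~%.0f GB of weights\n", bpw, gb);
    }
    return 0;
}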

The point is: you usually don't even want to try. Quantizing an MoE sucks badly enough without dipping into the really crappy quants. This is where llama.cpp's offloading comes in.

Llama.cpp has the ability to choose which layers run in system RAM on the CPU and which run on the GPU. That's what the -ngl flag does: it sets how many layers to offload onto the GPU. If you pass a value of 99, since that's higher than the number of layers most models have, it pretty much means you'll offload the whole model onto the GPU.

Anyhow, it has a special offload just for MoEs; you do it by combining -ngl 99 with --n-cpu-moe N.

  • -ngl 99 tells llama.cpp to try and load all layers of the model into the GPU's VRAM. Since the model likely has fewer than 99 layers (36 for gpt-oss-120b), this is effectively an "offload everything to the GPU" command.
  • --n-cpu-moe 20 (as an example value) then acts as an exception. It tells the engine: "For the first 20 layers of the model, take the expert components and move them to the CPU's system RAM."

NOTE: In my answer to the person who asked the question, I incorrectly said it was the last 20 layers, but looking at the code, it appears to be the first 20 instead:

for (int i = 0; i < value; ++i) {
    // ...
    // for each of the first N layers, register a buffer-type override that
    // pins that layer's expert tensors (ffn_up/ffn_down/ffn_gate) to CPU RAM
    buft_overrides.push_back(string_format("blk\\.%d\\.ffn_(up|down|gate)_exps", i));
    // ...
}

Anyhow, with these two flags the model is now split: the dense, always-on parts and the experts from layers 21-36 are in VRAM, while the experts for layers 1-20 sit in slower system RAM.
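
To make that layout concrete, here's a tiny standalone sketch for a 36-layer model with --n-cpu-moe 20. This is not actual llama.cpp code; it just prints the split described above. One wrinkle: llama.cpp counts layers from 0, so "the first 20 layers" means blk.0 through blk.19.

// Standalone illustration of the split; this isn't how llama.cpp represents
// anything internally, it just prints where each layer's pieces end up.
#include <cstdio>

int main() {
    const int n_layers  = 36; // gpt-oss-120b
    const int n_cpu_moe = 20; // experts of blk.0 .. blk.19 pinned to system RAM

    for (int i = 0; i < n_layers; ++i) {
        printf("blk.%-2d  attention/router: VRAM   experts: %s\n",
               i, i < n_cpu_moe ? "system RAM" : "VRAM");
    }
    return 0;
}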

When a prompt comes in and each token bounces through each layer like a pachinko ball, the router for that layer determines which experts to use for that token.

If it picks an expert that's in VRAM, the computation is extremely fast on the GPU. If it picks an expert that was offloaded to system RAM, that's the bottleneck.

Here's what happens under the hood: Llama.cpp doesn't move the entire 24M parameter expert from RAM into VRAM—that would be way too slow. Instead, it sends the token's small activation vector from VRAM across the PCIe bus to system RAM. The CPU then performs the math using the expert weights residing in RAM, and the result is sent back across the PCIe bus to the VRAM for the GPU to continue its work. Even though the CPU is slower, it's still plenty for what is essentially a 24M model.

This round-trip happens for every single token that gets routed to an offloaded expert. The real bottleneck isn't the CPU's raw processing speed so much as the data movement: the constant PCIe round-trips, plus having to read the expert weights out of slower system RAM for every token.
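
To get a feel for why the activation travels instead of the weights, compare the sizes involved. The ~24M-params-per-expert figure comes from the math earlier in the post; the hidden size below is an assumed order-of-magnitude placeholder rather than a number from the gpt-oss docs, so the activation figure is only a ballpark.

// Comparing "move the weights" vs "move the activation" for one expert call.
// The ~24M params per expert comes from the math earlier in the post; the
// hidden size is an ASSUMED placeholder, so the activation size is only
// order-of-magnitude.
#include <cstdio>

int main() {
    const double expert_params = 24e6;   // per expert, from earlier
    const double bytes_per_w   = 0.5;    // ~4 bpw, MXFP4-ish
    const double hidden_size   = 3000;   // assumed, not from the model card
    const double act_bytes     = hidden_size * sizeof(float);

    printf("shipping one expert's weights:  ~%.1f MB\n", expert_params * bytes_per_w / 1e6);
    printf("shipping one activation vector: ~%.1f KB\n", act_bytes / 1e3);
    return 0;
}

A few kilobytes across the bus beats tens of megabytes of weights every time, which is why the weights stay put.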

So in the end, you're not really processing a 120B model. You're processing a ~5B model, where some of the work takes a slower path through the CPU and system RAM. It's slower than a native 5B model that fits entirely in VRAM, but vastly faster than trying to run a dense 120B model.

As far as I know, finding the right --n-cpu-moe value for your specific hardware is just a matter of trial and error to find the best performance sweet spot. There are also other flags, like --override-tensor, which give you even more fine-grained control... but at that point you really gotta understand what's going on lol.
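
That said, if you want a rough first guess before you start the trial and error, you can run the earlier napkin math in reverse: figure out how many layers' worth of experts fit in whatever VRAM is left after the dense weights, KV cache, and overhead. The reserve value below is a pure guess for a 24GB card, so treat the output as a starting point, not a recommendation.

// Very rough first guess for --n-cpu-moe on a 24GB card, using the numbers
// from earlier in the post. The "reserve" for dense weights, KV cache, and
// buffers is a guess; this is napkin math, not how llama.cpp allocates memory.
#include <cmath>
#include <cstdio>

int main() {
    const double vram_gb     = 24.0;
    const double reserve_gb  = 6.0;      // assumed: dense weights + KV cache + overhead
    const int    layers      = 36;
    const double expert_pool = 114.7e9;  // total expert params
    const double bytes_per_w = 0.5;      // ~4 bpw, MXFP4-ish

    const double gb_per_layer   = expert_pool / layers * bytes_per_w / 1e9;  // experts per layer
    const int    layers_in_vram = (int) std::floor((vram_gb - reserve_gb) / gb_per_layer);
    const int    n_cpu_moe      = layers - layers_in_vram;

    printf("expert weights per layer: ~%.2f GB\n", gb_per_layer);
    printf("layers that fit in VRAM:  ~%d\n", layers_in_vram);
    printf("starting guess:           --n-cpu-moe %d\n", n_cpu_moe > 0 ? n_cpu_moe : 0);
    return 0;
}

From there, lower the value if you still have VRAM to spare, or raise it if you're hitting out-of-memory errors.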
