plasmon

MoE Beat Dense 27B by 2.4x on 8GB VRAM — The 35B-A3B Benchmark Nobody Expected

Start with the benchmarks

In a previous article, I compared three Qwen3.5 models on the same hardware. Here are the MoE-relevant numbers.

Test environment: RTX 4060 8GB / Ryzen 7 / 32GB DDR5 / llama.cpp / Q4_K_M

```
Model                Speed (t/s)  VRAM    GPU%   CPU%   RAM      ngl
Qwen3.5-9B           33.0         7.1GB   91%    32%    22.6GB   99 (all layers on GPU)
Qwen3.5-27B          3.57         7.7GB   60%    74%    28.3GB   24 (24/58 layers on GPU)
Qwen3.5-35B-A3B      8.61         7.6GB   95%    65%    30.8GB   99 (all layers on GPU)
```

All three models consume nearly the same VRAM (7.1-7.7GB), yet speed varies by almost 10x: 33.0, 3.57, and 8.61 t/s.

The critical comparison is Dense 27B vs MoE 35B-A3B. The 35B model is 2.4x faster than the 27B model, despite having more total parameters.
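The speed gaps are simple arithmetic on the table above:

```python
# Measured speeds from the benchmark table (t/s).
speeds = {"Qwen3.5-9B": 33.0, "Qwen3.5-27B": 3.57, "Qwen3.5-35B-A3B": 8.61}

moe_vs_dense27 = speeds["Qwen3.5-35B-A3B"] / speeds["Qwen3.5-27B"]
spread = speeds["Qwen3.5-9B"] / speeds["Qwen3.5-27B"]

print(f"MoE 35B-A3B vs Dense 27B: {moe_vs_dense27:.1f}x")  # ~2.4x
print(f"fastest vs slowest:       {spread:.1f}x")          # ~9.2x
```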


Why 35B beats 27B

The answer is in the GPU utilization numbers.

Dense 27B (GPU 60%): Q4_K_M size is about 16GB. Can't fit in 8GB, so only 24 out of 58 layers run on the GPU (ngl=24). The remaining 34 layers run on CPU. The GPU finishes its portion and sits idle waiting for the CPU. 60% GPU utilization means the GPU is wasting 40% of its time.

MoE 35B-A3B (GPU 95%): Q4_K_M size is about 21GB. Also can't fit in 8GB. Yet ngl=99 — all layers on GPU. How?

MoE's structure is the key. The 35B-A3B has 256 experts, but only 8 routed experts plus 1 shared expert activate per token — roughly 3B parameters' worth of compute. llama.cpp can keep the hot weights in VRAM while the bulk of the expert weights stay in system RAM (the exact placement depends on llama.cpp's memory mapping). Result: 7.6GB of VRAM with the GPU 95% utilized on ~3B parameters of computation per token.

The 30.8GB of system RAM consumption is consistent with this: 8.2GB more than the 9B model's 22.6GB. That delta roughly matches the expert weights kept in RAM, though the exact internal behavior depends on the llama.cpp implementation.
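A rough way to see why only 24 of 58 layers fit for the dense model: assume layers are uniform in size and reserve a couple of GB for KV cache and buffers. This is a back-of-envelope heuristic, not llama.cpp's actual allocator; `overhead_gb` is a guessed constant:

```python
def estimate_ngl(model_size_gb, n_layers, vram_gb, overhead_gb=2.0):
    """Rough estimate of how many transformer layers fit in VRAM.

    Assumes layers are uniform in size and reserves `overhead_gb` for
    KV cache, activations, and runtime buffers. A heuristic sketch,
    not llama.cpp's real allocation logic.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = vram_gb - overhead_gb
    return max(0, min(n_layers, int(usable_gb / per_layer_gb)))

# Dense 27B: ~16GB at Q4_K_M, 58 layers, 8GB card
print(estimate_ngl(16.0, 58, 8.0))  # ~21, in the ballpark of the measured ngl=24
```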


How MoE works: call only the specialists

A standard Transformer (dense model) routes every token through all parameters. A 27B model runs 27B parameters of computation per token.

MoE has N experts (small FFNs) per layer, and a gating network selects top-K per token.

```
Dense 27B:
  Input token → [All 27B parameters] → Output
  27B of computation every time. Can't fit in 8GB → 24/58 layers on GPU → rest on CPU → slow

MoE 35B-A3B:
  Input token → Gating → Select 8+1 experts
                       → [Active params: ~3B] → Output
  3B of computation per token. Fits on the GPU → fast
  The remaining ~32B of inactive experts wait in RAM
```

The reason this works on 8GB is clear. Dense 27B needs all parameters through the GPU but can't fit them. MoE 35B only needs active parameters through the GPU, and 3B fits easily. When both models exceed VRAM, MoE has a structural advantage.
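The gating step can be sketched in a few lines of plain Python. This is a minimal illustration of top-K routing, not any framework's implementation; real routers operate on batched tensors and add load-balancing losses and the shared expert:

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(gate_logits, top_k=8):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]  # 256 experts, as in 35B-A3B
weights = route(logits, top_k=8)
# 8 experts selected, combination weights sum to 1
print(sorted(weights), round(sum(weights.values()), 6))
```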


MoE efficiency in numbers

Major MoE model specs (official published values):

```
Model                    Total Params  Active Params  Active %  Experts
Mixtral 8x7B             46.7B         12.9B          27.6%     8
Mixtral 8x22B            141B          39B            27.7%     8
Qwen3-235B-A22B          235B          22B            9.4%      128
Qwen3.5-35B-A3B          35B           3B             8.6%      256
DeepSeek-V3              671B          37B            5.5%      256

Active % trend: Early MoE (~27%) → Latest MoE (5-9%)
Expert count:   Early (8) → Latest (64-256)
```

There's a clear trend toward lower activation ratios: from Mixtral-era 27% down to DeepSeek-V3's 5.5%. The direction is to split experts finer and select fewer of them, more precisely.
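The Active % column is just active/total; recomputing it from the published parameter counts:

```python
specs = {
    # model: (total_params_B, active_params_B), official published values
    "Mixtral 8x7B":    (46.7, 12.9),
    "Mixtral 8x22B":   (141,  39),
    "Qwen3-235B-A22B": (235,  22),
    "Qwen3.5-35B-A3B": (35,   3),
    "DeepSeek-V3":     (671,  37),
}
for name, (total, active) in specs.items():
    print(f"{name:16s} {100 * active / total:5.1f}% active")
```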

A key finding from the DeepSeekMoE paper: finer expert granularity improves performance even with the same compute budget. More small experts with top-K selection outperforms fewer large experts.

For 8GB users, the key insight: the lower the activation ratio, the less computation needs to happen inside VRAM. Qwen3.5-35B-A3B's 8.6% activation ratio means using 35B worth of knowledge at 3B of GPU load.


8GB user decision framework

Based on the benchmark data:

Speed priority → Dense 9B (33 t/s). All layers fit in GPU, 91% utilization. Best for interactive chat and code completion.

Quality priority → MoE 35B-A3B (8.61 t/s). 2.4x faster than Dense 27B (3.57 t/s), with comparable or better output quality. In my tests, 35B-A3B produced more concise thinking (8 lines vs 27B's 11 lines) and generated code with parallel test cases.

Dense 27B has no advantage. 3.57 t/s is too slow for interactive use. Quality doesn't clearly beat MoE 35B-A3B. The partial GPU offload (ngl=24) creates an inefficient state at GPU 60% / CPU 74%.

One constraint: MoE 35B-A3B consumes 30.8GB of system RAM. On a 16GB RAM machine, swapping would occur and these conclusions may not hold (on my 32GB system, 30.8GB consumed leaves only 1.2GB free — already near the limit).
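The framework above collapses to a tiny decision rule — a sketch of this article's recommendation for 8GB-VRAM machines, nothing more:

```python
def pick_model(priority, ram_gb):
    """Sketch of the article's decision rule for an 8GB-VRAM machine.

    priority: 'speed' or 'quality'; ram_gb: system RAM in GB.
    """
    if priority == "speed":
        return "Qwen3.5-9B"       # 33 t/s, all layers on GPU
    if ram_gb < 32:
        return "Qwen3.5-9B"       # 35B-A3B needs ~31GB RAM and would swap
    return "Qwen3.5-35B-A3B"      # 8.61 t/s, 2.4x faster than Dense 27B

print(pick_model("quality", 32))  # Qwen3.5-35B-A3B
print(pick_model("quality", 16))  # Qwen3.5-9B — not enough system RAM
```

Note that Dense 27B never appears as an output: per the benchmarks, there is no priority under which it wins on this hardware.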


What the literature got wrong

Most MoE explainers assume "all parameters fit in VRAM." They target A100 80GB or RTX 4090 24GB setups.

This assumption leads to "if it doesn't fit in VRAM, use Dense instead." That conclusion contradicts the 8GB benchmark data. Here's why.

The speed advantage the literature describes for MoE is "active-parameter compute at dense-model quality." When everything fits in VRAM, 3B active params give you 35B-class quality. That's correct.

But what the literature misses: Dense models that exceed VRAM are even slower. 27B Dense also doesn't fit in 8GB. MoE also doesn't fit. When comparing two models that both exceed VRAM, the one that can complete its GPU computation with fewer parameters wins.

```
[When both exceed 8GB VRAM]

Dense 27B:     Needs all 27B computed → 24/58 layers on GPU → 3.57 t/s
MoE 35B-A3B:   Needs only ~3B computed → all layers on GPU → 8.61 t/s

MoE wins when "neither fits."
The literature's "MoE needs abundant VRAM" only holds when the Dense comparison model DOES fit.
```

For 8GB users, MoE's value isn't "fast when you have enough VRAM." It's "faster than comparable Dense when you DON'T have enough VRAM." The advantage shows up precisely in constrained environments.
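A back-of-envelope tally makes the asymmetry concrete: count how many parameters each model must push through the slow CPU path per token, using the layer split reported above (a rough tally assuming uniform layers, not a timing model):

```python
# Dense 27B with 24 of 58 layers offloaded to the GPU
dense_gpu = 27 * 24 / 58        # ≈11.2B params computed on GPU per token
dense_cpu = 27 - dense_gpu      # ≈15.8B params computed on CPU per token

# MoE 35B-A3B: only the ~3B active params compute, all on GPU
moe_gpu, moe_cpu = 3.0, 0.0

print(f"Dense 27B:   {dense_cpu:.1f}B params on the CPU path per token")
print(f"MoE 35B-A3B: {moe_cpu:.1f}B params on the CPU path per token")
```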


MoE limitations: not a silver bullet

The benchmarks also reveal MoE's constraints.

Context exhaustion: The 35B-A3B's deep knowledge means its thinking consumes context aggressively. In my tests, a knowledge-summarization task (quantum computing in 500 characters) exhausted the 8192-token context in 20 minutes. The 9B completed the same task. Deep knowledge becomes a handicap under context constraints.

System RAM consumption: 30.8GB pushes a 32GB system to its limit. Running other applications alongside becomes difficult. Not practical on 16GB RAM systems.

Speed ceiling: 8.61 t/s is usable but far from 9B's 33 t/s. For real-time streaming output, the difference is immediately felt.

