pulkitgovrani

Posted on May 24

Gemma 4 26B A4B: What "Mixture of Experts" Actually Means for Your Inference Budget

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Write about Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Gemma 4's most interesting model isn't the 31B flagship. It's the 26B A4B — a Mixture-of-Experts model that activates only 4 billion parameters per token while delivering performance nearly identical to the dense 31B.

If that sounds like magic, it's not. But the engineering behind it is worth understanding, because it changes what hardware you need to run a near-frontier model locally.

Dense vs MoE: The Core Difference

In a standard dense transformer (like Gemma 4 31B), every token that passes through the model activates every parameter. All 31 billion of them, every forward pass.

In a Mixture-of-Experts model, the network is split into a large pool of "expert" sub-networks. Each token is routed — by a learned gating function — to a small subset of those experts. Only the selected experts do computation for that token.

The Gemma 4 26B A4B has:

128 total expert sub-networks
8 experts activated per token (hence "A4B" — ~4B active params)
26B total parameters in the full model

During inference, you're doing the compute of roughly a 4B model. But the model has 26B parameters of learned knowledge available to route between.

Dense 31B:  [token] → ALL 31B params → output
            Cost: 31B FLOPs per token

MoE 26B A4B: [token] → router → 8 of 128 experts → output  
             Cost: ~4B FLOPs per token
             But knowledge from: 26B params

Why This Matters for VRAM

This is where things get practical. VRAM requirements are dominated by parameter count in memory, not by compute per token.

The 26B A4B still needs to hold all 26B parameters in memory — or at least the layers that might be needed for any given batch. At bfloat16, that's ~52GB. At 4-bit quantization (Q4_K_M), it's roughly 13-14GB.

Compare to the dense 31B at 4-bit: ~17-18GB.

So you save meaningful VRAM versus the dense 31B, and you get near-identical output quality. The tradeoff compared to a true 4B dense model: you need 3-4x the VRAM, but you get 20-25x better benchmark performance.

Model	Active params	VRAM (bf16)	VRAM (Q4)	AIME 2026
Gemma 4 E4B	4.5B	~9GB	~3GB	—
Gemma 4 26B A4B	4B active	~52GB	~14GB	88.3%
Gemma 4 31B	31B	~62GB	~17GB	89.2%

For the 26B A4B: a 16GB consumer GPU (RTX 4080, 4090) can run it at 4-bit. A Mac with 32GB unified memory runs it comfortably at 8-bit. No multi-GPU setup required.

Running the 26B A4B Locally

Ollama

ollama pull gemma4:26b
ollama run gemma4:26b

Ollama handles quantization automatically. On a 16GB GPU it applies Q4 by default.

llama.cpp

# Download the quantized GGUF
huggingface-cli download unsloth/gemma-4-26b-a4b-it-GGUF \
  --local-dir ./gemma4-26b \
  --include "gemma-4-26b-a4b-it-Q4_K_M.gguf"

# Run
llama-server \
  -m ./gemma4-26b/gemma-4-26b-a4b-it-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 40 \
  --host 0.0.0.0 \
  --port 8080

MLX (Apple Silicon)

pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/gemma-4-26b-a4b-it-4bit \
  --prompt "Explain the tradeoffs between B-trees and LSM-trees for write-heavy workloads" \
  --max-tokens 1024

On an M3 Max (128GB), this runs at 30-40 tokens/second. On an M4 Pro (48GB), 20-30 t/s at 4-bit.

How the Router Works

The gating network is a small learned linear layer that maps each token's hidden state to a score over all 128 experts. The top-8 scoring experts are selected, their outputs are weighted by their gate scores, and the weighted sum is the layer output.

# Simplified MoE forward pass (conceptual)
def moe_forward(x, experts, gate):
    # x: [batch, seq_len, hidden_dim]
    scores = gate(x)                          # [batch, seq_len, 128]
    top_k_scores, top_k_idx = scores.topk(8)  # select 8 experts
    top_k_scores = F.softmax(top_k_scores, dim=-1)

    output = torch.zeros_like(x)
    for i, expert_idx in enumerate(top_k_idx.unbind(-1)):
        expert_output = experts[expert_idx](x)
        output += top_k_scores[..., i:i+1] * expert_output

    return output

The interesting part: different experts specialize during training. Some become better at code, some at reasoning, some at factual recall. The router learns to dispatch accordingly — a form of implicit task routing without any explicit labeling.

The Latency Picture: Where MoE Wins and Where It Doesn't

MoE wins on throughput (batch inference): When processing many requests simultaneously, the reduced compute per token means you can serve more requests per second on the same hardware.

MoE is roughly equal on single-token latency: The routing overhead is small, and you're still doing full attention across the sequence.

MoE loses on memory bandwidth: All 26B parameters sit in VRAM. If your GPU's memory bandwidth is the bottleneck (common on consumer GPUs), you pay the full bandwidth cost even though you only activate 4B params per forward pass.

The practical upshot: for a local inference server handling multiple concurrent users, the 26B A4B is a better choice than the 31B dense. For single-user interactive use, they'll feel similar. For production batch jobs, the 26B A4B has a clear advantage.

Comparing 26B A4B vs Dense Alternatives

The natural comparison isn't just "vs Gemma 4 31B" — it's "vs everything else you could run at this VRAM budget."

At ~14GB Q4:

Gemma 4 26B A4B (AIME 88.3%, Codeforces 2100+ ELO)
Llama 3.3 70B Q2 (heavy quantization, quality degrades significantly)
Qwen 2.5 14B (good, but single architecture with ~70B-scale quality gap)
Mistral Small 22B Q4 (~12GB, strong but narrower multimodal support)

For reasoning tasks specifically — math, code, multi-step logic — the 26B A4B has a substantial quality lead over everything else in the 14-16GB VRAM bracket. That's the genuine breakthrough of the MoE architecture here.

When to Use 26B A4B vs 31B Dense

Use 26B A4B when:

You're running on a single 16GB consumer GPU
You need high throughput (multiple concurrent users)
Your tasks are reasoning-heavy (math, code, logic) — this is where MoE specialization shines
You're on Apple Silicon with 32-64GB unified memory

Use 31B dense when:

You have 24GB+ VRAM (or multi-GPU)
You need maximum consistency across diverse task types
You're doing long-context work where 256K context matters and you have the VRAM headroom
You're fine-tuning — dense models are easier to fine-tune than MoE (routing can shift unpredictably during LoRA adaptation)

The Architecture Bet That Paid Off

MoE is not new — it's been in research since the 1990s and production since Switch Transformer. What's new is Google making it work at this quality level in an open-weight model at a size that fits on consumer hardware.

The 26B A4B is the clearest evidence yet that "bigger dense model" is not the only path to capability. For the developer who doesn't have a multi-GPU server, this model is the reason Gemma 4 is meaningfully different from every open-weight release before it.

DEV Community