This is a submission for the Gemma 4 Challenge: Write About Gemma 4
"Everyone describes Gemma 4 26B as a standard MoE. The architecture says otherwise. Here's the design choice nobody's unpacking."
When Google released Gemma 4 on April 2, 2026, the headlines about the 26B model were predictable: "It's an MoE!" — "Activates only 3.8B parameters!" — "26B quality at 4B cost!"
All of those are technically true. None of them describe what Google actually built.
Read any of the great write-ups on this topic — and there are a lot of them — and you'll find the same explanation of Gemma 4's MoE: "A router picks a small subset of experts for each token." That's the description of Mixtral. Of DeepSeek. Of Qwen. It's a fine description of Mixture-of-Experts in general.
It is not what Gemma 4 26B does.
Google built something quietly different. If you're planning to fine-tune Gemma 4, serve it at scale, or just want to know what's inside the binary you're downloading, this is the architecture detail that's worth your time.
The standard MoE pattern (Mixtral, DeepSeek, Qwen)
Pretend you've never seen MoE before. A transformer is a stack of blocks. Each block has two main components:
- Attention — figures out which other tokens matter for the current one.
- Feed-forward network (FFN), also called the MLP — applies a learned non-linear transformation. This is where most of the parameters live.
The standard MoE trick is: replace the FFN with multiple FFNs (called "experts") and add a small "router" that picks which ones to run for each token.
Standard MoE block (Mixtral / DeepSeek / Qwen style):
                Input
                  │
                  ▼
             [Attention]
                  │
                  ▼
              [Router] ── decides top-K of N experts
                  │
                  ▼
       ┌──────────┼──────────┬──────────────┐
       ▼          ▼          ▼              ▼
    [Exp 1]    [Exp 2]    [Exp 3]  ...   [Exp N]
       │          │          │              │
       └──────────┴──────────┴──────────────┘
                  │  (only top-K fire)
                  ▼
               Output
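Mixtral 8x7B: 8 experts, 2 active per token. DeepSeek V3: 256 routed experts, 8 active per token (DeepSeek also adds one always-on shared expert, which the simplified diagram above leaves out). The router gives you sparse compute: all parameters are stored in VRAM, but only a small fraction does math for any given token.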
When people say "MoE," this is what they mean. The dense FFN is replaced by sparse experts. That word matters: replaced.
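If the router step feels abstract, here is a minimal sketch of top-K routing in plain Python. It is a toy, not any library's implementation: the "experts" are scalar functions, the router logits are hand-picked, and renormalizing the chosen softmax weights so they sum to 1 is one common convention that I am assuming here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1 (an assumed convention, not the only one).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def standard_moe_ffn(x, experts, router_logits, k):
    """Sparse FFN: only the k routed experts run; the others are skipped."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy setup: 4 scalar "experts" (each just scales its input), 2 active per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.3, 2.0]  # hand-picked so experts 1 and 3 win

routes = top_k_route(logits, k=2)
y = standard_moe_ffn(1.0, experts, logits, k=2)
print(routes, y)
```

Experts 0 and 2 never execute for this token. That skipped work is the entire compute saving, and it is also why the result depends so heavily on the router's choice.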
What Gemma 4 26B actually does
Gemma 4's MoE block keeps the dense FFN. It doesn't replace it. The sparse experts run as a parallel path alongside the dense one, and the outputs are summed. There's also a shared expert that fires on every token regardless of what the router decides.
Three pathways. Two of them always on. One sparse.
Gemma 4 26B MoE block (the actual design):
                     Input
                       │
                       ▼
                  [Attention]
                       │
         ┌─────────────┼─────────────────────┐
         ▼             ▼                     ▼
    [Dense FFN]  [Shared Exp]            [Router]
    (always on)   (always on)                │
         │             │                     ▼
         │             │          ┌──────────┼──────────────┐
         │             │          ▼          ▼              ▼
         │             │       [Exp 1]    [Exp 2]  ...  [Exp 128]
         │             │          │          │              │
         │             │          └──────────┴──────────────┘
         │             │                     │ (only 8 fire)
         └─────────────┼─────────────────────┘
                       ▼
           [Sum all three pathways]
                       │
                       ▼
                    Output
The numbers, with the caveat that Google hasn't released a full technical report at time of writing: roughly 128 routed experts, 8 active per token, one always-on shared expert, plus the always-on dense FFN. Total parameters ≈ 25.2B, active per token ≈ 3.8B.
That third always-on pathway is what makes this design unusual. Standard MoE is a substitution. Gemma 4's MoE is an addition.
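Here is the same idea as a toy sketch, following the block as described above: a dense FFN, a shared expert, and a top-K routed path, summed. Everything concrete in it is an assumption for illustration: the function names are made up, the pathways are scalar functions with arbitrary scales, and the plain unweighted sum stands in for whatever scaling or normalization the real model applies.

```python
import math

def top_k_weights(logits, k):
    """Softmax over the k largest logits; returns {expert_index: weight}."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def parallel_dense_moe_block(x, dense_ffn, shared_expert, routed_experts, logits, k):
    """Sum of three pathways: two always on, one sparse.

    dense_ffn and shared_expert run for every token; only k routed experts run.
    """
    dense_out = dense_ffn(x)
    shared_out = shared_expert(x)
    weights = top_k_weights(logits, k)
    routed_out = sum(w * routed_experts[i](x) for i, w in weights.items())
    return dense_out + shared_out + routed_out

# Toy pathways with arbitrary scales (hypothetical, chosen only for readability).
dense_ffn = lambda x: 10.0 * x
shared_expert = lambda x: 1.0 * x
routed_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Same token, two different router decisions.
good = parallel_dense_moe_block(1.0, dense_ffn, shared_expert, routed_experts,
                                [0.0, 5.0, 0.0, 5.0], k=2)
bad = parallel_dense_moe_block(1.0, dense_ffn, shared_expert, routed_experts,
                               [5.0, 0.0, 5.0, 0.0], k=2)
print(good, bad)
```

In this toy, changing the router's decision only moves the routed term; the dense and shared contributions are identical in both calls. How large those always-on contributions are in the real model is not something this sketch can tell you.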
Why Google built it this way
Here's the part nobody is unpacking. Why bother keeping a dense path when you've already built an expert system?
The cleanest answer is robustness.
In a standard MoE, the output quality of each block depends entirely on the router making the right choice. If the router picks the wrong experts for a token, the representation degrades, and there's no fallback. Routers are small networks, easy to break during training, and prone to collapse — a failure mode where the router gets stuck picking the same few experts and the rest go to waste.
In Gemma 4, the dense FFN provides a stable, always-on representational backbone. The shared expert adds a second always-on signal. The routed experts add specialization on top. If the router whiffs on a token, two coherent signals still carry it forward.
That buys you several real things:
- Easier training. Router collapse is less catastrophic when the dense path is doing useful work in parallel — your loss still goes down even when the routing is bad.
- Cleaner distillation from a dense teacher. Gemma 4 was trained alongside a 31B dense sibling. A student that already has a dense path can absorb a dense teacher's signal naturally; a pure MoE student has to learn to route at the same time it's learning to imitate.
- Graceful degradation under quantization. If you quantize only the routed experts down to Q3 or Q2 to save memory and leave the dense path at higher precision, the dense path keeps the model coherent.
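Router collapse, mentioned above, is easy to watch for if you can log which experts the router picks. One simple diagnostic (my choice for illustration, not something taken from Gemma 4's training recipe) is the normalized entropy of expert usage: close to 1.0 when load is spread evenly, close to 0 when a few experts take everything.

```python
import math
from collections import Counter

def expert_usage_entropy(assignments, num_experts):
    """Normalized entropy of expert usage, in [0, 1].

    assignments: iterable of expert indices, one per routing decision.
    num_experts: total number of routed experts (must be > 1).
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    # Divide by the maximum possible entropy (uniform usage over all experts).
    return entropy / math.log(num_experts)

# Toy logs for a 4-expert router.
balanced = expert_usage_entropy([0, 1, 2, 3] * 25, num_experts=4)
collapsed = expert_usage_entropy([0] * 97 + [1, 2, 3], num_experts=4)
print(round(balanced, 3), round(collapsed, 3))
```

A falling value over training steps is a hint that routing is concentrating on a few experts. Where to set an alarm threshold is a judgment call this sketch does not settle.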
The tradeoff is efficiency. A pure MoE wastes nothing — every FLOP serves the network. Gemma 4 spends FLOPs on a dense path that runs even when the experts could have handled a token alone. Google deliberately traded peak MoE efficiency for training robustness and distillation friendliness.
That choice has downstream consequences. Let's get into them.
The catch: memory ≠ compute
Here is the most expensive mistake you can make with Gemma 4 26B: assuming "3.8B active" means it'll run like a 4B model on your hardware.
It computes like a 4B model. It does not fit like one.
The router doesn't know in advance which experts will be needed for the next token, so all 128 of them must be loaded in VRAM simultaneously. You save nothing on memory.
Real numbers (a recent controlled benchmark on Gemma 4, Phi-4, and Qwen3 measured peak VRAM under realistic inference loads):
| Model | Active params / token | Total in VRAM | Measured peak VRAM |
|---|---|---|---|
| Gemma 4 E4B (dense) | 4.5B | 4.5B | 14.9 GB |
| Gemma 4 26B A4B | 3.8B | 25.2B | 48.1 GB |
Similar active parameter counts (4.5B vs 3.8B). Roughly 3.2x the memory footprint (48.1 GB vs 14.9 GB). People running the 26B MoE on Macs with 24 GB have reported swap thrashing and ~2 tok/s. The model is fast in terms of compute per token; it just doesn't fit on consumer hardware.
The 31B Dense, by comparison, is more honest about its appetite. You see 31B, you allocate for 31B. The MoE label hides the fact that you still need workstation- or server-class memory to use it.
Rule of thumb: if you can't comfortably hold ≥48 GB of model + KV cache + activation buffers, skip the 26B MoE. Use E4B for an edge build, or 31B with quantization-and-offloading for a quality build.
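The arithmetic behind that rule of thumb is short enough to write down. This sketch only estimates the memory needed to hold the weights, using the parameter counts quoted above; KV cache, activations, and framework overhead come on top, and the measured peak in the table may use different units, so treat the result as a floor rather than a reproduction of that figure.

```python
def weight_memory_gib(total_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB (2**30 bytes)."""
    return total_params * bytes_per_param / 2**30

TOTAL_PARAMS = 25.2e9   # all experts must be resident
ACTIVE_PARAMS = 3.8e9   # what actually does math per token

bf16_total = weight_memory_gib(TOTAL_PARAMS, 2)     # BF16 = 2 bytes per parameter
bf16_active = weight_memory_gib(ACTIVE_PARAMS, 2)   # the tempting but wrong estimate
int4_total = weight_memory_gib(TOTAL_PARAMS, 0.5)   # idealized 4-bit, ignoring scales

print(f"BF16, total params:  {bf16_total:.1f} GiB")
print(f"BF16, active params: {bf16_active:.1f} GiB")
print(f"4-bit, total params: {int4_total:.1f} GiB")
```

Sizing from the active count underestimates the weight footprint by the same factor as the total-to-active ratio, a bit over 6x here.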
What this means for production serving
Standard playbook for serving a 26B-ish model: one beefy GPU, one request at a time, cap concurrency to control latency.
That's exactly the wrong setup for Gemma 4 26B.
The parallel-dense-plus-experts design rewards concurrency. Different tokens from different users activate different sets of experts. With a sane batcher (vLLM, SGLang, TGI), you can:
- Send token A from user 1 through experts {3, 17, 42, 88, 91, 102, 110, 124}
- Send token B from user 2 through experts {1, 5, 9, 26, 33, 57, 71, 119}
- …in the same forward pass.
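Batching like this is what a continuous batcher does for you. When the model is spread across several GPUs, vLLM can also place different experts on different devices (expert parallelism), and this is the flag that turns it on: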
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-4-26B-A4B-it \
--dtype bfloat16 \
--max-model-len 32768 \
--enable-expert-parallel \
--served-model-name gemma4-moe
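On a multi-GPU deployment, leaving out --enable-expert-parallel means vLLM uses tensor parallelism for the MoE layers instead. On a single GPU there is nothing to shard, so the flag should have little to no effect and the throughput gain comes from batching alone. Either way, the MoE only earns its keep when enough concurrent tokens keep many experts busy.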
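To make the batching picture concrete, here is a toy sketch of the bookkeeping: invert "token → experts" into "expert → tokens" so each expert processes all of its tokens at once. This is an illustration of the idea, not how vLLM or any other server implements it, and the third request's expert set is hypothetical.

```python
from collections import defaultdict

def group_tokens_by_expert(token_routes):
    """Invert token -> experts into expert -> tokens for one forward pass.

    token_routes: {token_id: iterable of expert indices chosen by the router}
    """
    batches = defaultdict(list)
    for token_id, expert_ids in token_routes.items():
        for expert_id in expert_ids:
            batches[expert_id].append(token_id)
    return dict(batches)

routes = {
    "user1/tokA": [3, 17, 42, 88, 91, 102, 110, 124],
    "user2/tokB": [1, 5, 9, 26, 33, 57, 71, 119],
    "user3/tokC": [3, 5, 42, 60, 71, 90, 102, 127],  # hypothetical third request
}

batches = group_tokens_by_expert(routes)
busy_experts = len(batches)
print(f"{busy_experts} of 128 experts have work this step")
print("expert 3 handles:", batches[3])
```

With one request, 8 experts do work per step. With three, 19 do in this example, and some experts serve more than one token in a single call. More concurrency means more of the memory you already paid for is doing something.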
The decision tree, in short:
- Single user, laptop, predictable latency → Gemma 4 31B Dense.
- Multi-user API with 8+ concurrent requests → Gemma 4 26B MoE with expert parallelism enabled.
Pick one based on your traffic shape, not just the parameter counts on the model card.
What this means for fine-tuning
This is where the parallel-dense design becomes really interesting, and where most tooling hasn't caught up yet.
In Mixtral, DeepSeek, or Qwen, fine-tuning an MoE means choosing what to update:
- Update the router → risks expert collapse.
- Update the experts → safe but slow and gradient-sparse.
- LoRA on the experts → the standard pattern.
In Gemma 4 26B you have a third surface to fine-tune: the dense FFN that runs on every token. That changes the strategy.
A LoRA on the dense FFN gives you a steady, broadly-applied signal that doesn't require any router decision to take effect. The gradients are clean — they fire on every single token in your dataset, no sparse activation problem. Your domain knowledge "lifts the floor" across the whole model.
This suggests a layered approach:
- First pass: LoRA on the dense FFN paths only. Cheap, fast, broadly effective. Best for general domain adaptation (legal tone, medical terminology, your company's writing style).
- Second pass: LoRA on the shared expert. Similar broad coverage, smaller capacity, slightly more specialized.
- Third pass (only if needed): LoRA on selected routed experts. Targeted, expensive, useful when you need specialization the dense path can't absorb (a rare schema, an unusual task format).
I haven't seen this layered strategy spelled out anywhere yet for Gemma 4 — it falls naturally out of how the architecture works, but standard MoE fine-tuning guides don't anticipate having a dense path to lean on.
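As a rough sketch of how the three passes could be wired up, the helper below picks LoRA target modules by name for each stage. The module-name patterns are hypothetical: I have not checked them against the released checkpoint, so list the real names first (for example by iterating over the model's named modules) and adjust the regexes. Whether a given LoRA library can attach adapters to the expert weights at all depends on how they are stored, which is also an open question here.

```python
import re

# Hypothetical name patterns; real Gemma 4 module names may differ.
STAGE_PATTERNS = {
    "dense_ffn": r"\.mlp\.(gate|up|down)_proj$",
    "shared_expert": r"\.shared_expert\.(gate|up|down)_proj$",
    "routed_experts": r"\.experts\.(\d+)\.(gate|up|down)_proj$",
}

def select_lora_targets(module_names, stage, expert_ids=None):
    """Return the module names to adapt for one fine-tuning pass.

    stage: "dense_ffn", "shared_expert", or "routed_experts".
    expert_ids: optional set of routed-expert indices to keep (third pass only).
    """
    pattern = re.compile(STAGE_PATTERNS[stage])
    selected = []
    for name in module_names:
        match = pattern.search(name)
        if match is None:
            continue
        if stage == "routed_experts" and expert_ids is not None:
            if int(match.group(1)) not in expert_ids:
                continue
        selected.append(name)
    return selected

# Made-up names standing in for one layer of the model.
names = [
    "layers.0.self_attn.q_proj",
    "layers.0.mlp.gate_proj",
    "layers.0.mlp.down_proj",
    "layers.0.moe.shared_expert.up_proj",
    "layers.0.moe.experts.7.down_proj",
    "layers.0.moe.experts.42.down_proj",
]

first_pass = select_lora_targets(names, "dense_ffn")
second_pass = select_lora_targets(names, "shared_expert")
third_pass = select_lora_targets(names, "routed_experts", expert_ids={42})
print(first_pass, second_pass, third_pass)
```

A list like `first_pass` is the kind of thing you would hand to a LoRA library's target-module setting, such as `target_modules` in PEFT's `LoraConfig`, once tooling for this model is ready.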
Practical heads-up: as of launch, full QLoRA tooling for Gemma 4 26B was not ready in the major libraries. Unsloth and the TRL team are working on it. If you're planning a serious fine-tune, watch their repos before committing to a stack.
What this tells us about Google's research direction
Gemma 4 26B's hybrid design isn't an accident. It's a deliberate bet on a few principles:
- Distillation matters more than peak MoE efficiency. A model that distills cleanly from a dense teacher is more useful than one that wrings every FLOP out of routing.
- Robustness beats benchmark peaks. Quality you can count on across diverse inputs is worth more than a leaderboard score that depends on the router being perfectly trained.
- Open weights have to fine-tune well. A model the community can't reliably fine-tune won't grow the ecosystem Google needs to compete with Llama.
The other open-model labs are running a different play. DeepSeek and Qwen are pushing pure MoE further — more experts, finer sparsity, more peak efficiency. Google is going the opposite direction: more pathways in parallel, more redundancy, more graceful failure.
It's too early to say which approach wins long-term. But the next time someone tells you "Gemma 4 26B is just Google's MoE," you can correct them. It's something else — a hybrid that quietly rewrites the rules of how a Mixture-of-Experts model should look.
What to do with this
Three concrete takeaways:
- Read architecture diagrams before benchmarks. "MoE" is not one design. The label hides differences that show up in production — in your memory bill, your throughput, your fine-tuning recipe.
- Size your hardware for total parameters, not active parameters. Active is compute. Total is memory. The 26B MoE eats ~48 GB at BF16; if you can't hold that, use E4B or 31B with offloading.
- If you're fine-tuning, target the dense FFN first. It's the most stable surface in Gemma 4 26B, and most MoE models don't even have one. Use that.
Architecture details aren't trivia. They're the gap between a model that fits your workload and one that quietly disappoints.
Sources: Gemma 4 model card, Welcome Gemma 4 on Hugging Face, Louis Wang's Gemma 4 architecture analysis (cited for the parallel-dense observation), and a controlled multi-model benchmark on Gemma 4 / Phi-4 / Qwen3 (cited for VRAM numbers). Exact expert counts (8 of 128) come from third-party architecture write-ups; Google hasn't released a full Gemma 4 technical report at time of writing. If a paper drops and numbers shift, this post will be updated.
Built on a workstation; nothing here required cloud access. If you want to verify the memory figures, run nvidia-smi during a vLLM serve of google/gemma-4-26B-A4B-it at BF16. The numbers are reproducible.
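If you would rather log the peak than eyeball it, a small poller does the job. This is a minimal sketch that assumes an NVIDIA GPU with `nvidia-smi` on the PATH; the helper names are mine, and it reports whole-GPU memory use, so other processes on the same card will inflate the number.

```python
import subprocess

def parse_used_mib(csv_text):
    """Parse nvidia-smi CSV output: one integer MiB value per GPU line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def query_used_mib():
    """Ask nvidia-smi for current memory use per GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_used_mib(result.stdout)

def peak_total_mib(samples):
    """Highest total across all GPUs over a sequence of per-GPU samples."""
    return max(sum(sample) for sample in samples)

# Offline example with made-up readings (no GPU needed to run this part).
example_samples = [parse_used_mib("100\n200\n"), parse_used_mib("300\n50\n")]
print(peak_total_mib(example_samples))
```

To use it for real, call `query_used_mib()` in a loop with a short sleep while you send requests to the server, collect the samples, and pass them to `peak_total_mib`.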