Tech_Nuggets

Posted on Jun 13

Mixture of Experts (MoE): what it actually does under the hood, and when it pays off

#llm #architecture #opensource #ai

You deployed a 7B model in production. Response times are fine (45 ms per token), but you want to scale to a 70B without buying four more GPUs. Someone mentions MoE: "70B performance at 7B compute." It sounds like a free lunch. So you look at the Mixtral 8x7B paper, you see about 47 billion parameters and a claim that each token only activates about 13 billion of them, and you wonder: how is that physically possible, and what is the catch?

This post explains the sparse MoE architecture that powers Mixtral, DeepSeek-MoE, Qwen2.5-MoE, DBRX, and Grok-1: what the router actually does, why load-balancing is the hardest problem in training them, and the three specific constraints that determine whether MoE is the right choice for your deployment.

Why the distinction between total parameters and active parameters matters

A dense transformer (like Llama 3.2) activates 100 percent of its parameters for every token. The FFN layer in each transformer block runs the same matrix multiplication for every input. This makes memory use predictable and throughput easy to model, but it also means that scaling from 7B to 70B multiplies both memory and compute by 10x.

MoE decouples the two. The model stores more parameters (more memory), but each token only uses a fraction of them (less compute). Here is the core trade-off expressed in numbers:

Metric	Dense 7B	Dense 70B	MoE ~47B (Mixtral 8x7B)
Total parameters	7B	70B	~47B (8 experts)
Active per token	7B	70B	~12.9B (2 experts)
Compute per token	7B-equiv	70B-equiv	~13B-equiv
Memory (weights, FP16)	~14 GB	~140 GB	~93 GB
Throughput (tokens/s)	high	low	medium-high

The headline is this: MoE gives you better compute efficiency than a dense 70B, but you still pay the memory cost of a much larger model. You cannot run Mixtral on a single consumer GPU: at FP16 the weights alone take more than 90 GB, and even with 4-bit quantization (roughly a quarter of that) you need at least two 24 GB cards. The computational savings only show up once the model is already loaded. That is the catch that the "70B performance at 7B compute" tagline often omits.
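To make the memory-versus-compute split concrete, here is a minimal back-of-the-envelope sketch. The helper name weight_memory_gb is hypothetical, the Mixtral counts are the published 46.7B total and 12.9B active, and the estimate covers weights only (no KV cache, activations, or quantization overhead).

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Approximate parameter counts; the MoE entry is Mixtral 8x7B.
models = {
    "dense-7b": {"total": 7e9, "active": 7e9},
    "dense-70b": {"total": 70e9, "active": 70e9},
    "mixtral-8x7b": {"total": 46.7e9, "active": 12.9e9},
}

for name, p in models.items():
    fp16_gb = weight_memory_gb(p["total"])        # memory follows TOTAL parameters
    int4_gb = weight_memory_gb(p["total"], 0.5)   # 4-bit weights, ignoring overhead
    active_share = p["active"] / p["total"]       # compute follows ACTIVE parameters
    print(f"{name}: {fp16_gb:.0f} GB fp16, {int4_gb:.1f} GB int4, "
          f"{active_share:.0%} of weights active per token")
```

For the Mixtral entry this gives roughly 93 GB at FP16 and about 23 GB at 4 bits, which is why a single 24 GB card is not enough once the KV cache is added.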

How sparse MoE works in a transformer

In a standard transformer, every layer has an FFN block (two linear projections with an activation in between). In a sparse MoE transformer, each FFN is replaced by multiple parallel "expert" FFNs plus a learned router that picks which experts to use for each token.

Here is the data flow for a single token passing through one MoE layer:

flowchart LR
    A[Input token<br/>hidden states] --> B[Router / Gate<br/>learned linear layer]
    B --> C{Softmax over<br/>N experts}
    C --> D[Select top-k<br/>experts]
    D --> E1[Expert 1<br/>FFN]
    D --> E2[Expert 2<br/>FFN]
    D --> E3[...<br/>idle]
    D --> E4[Expert N<br/>idle]
    E1 --> F[Weighted sum<br/>by router scores]
    E2 --> F
    F --> G[Output token<br/>hidden states]

The router is a small learned linear layer that takes the token's hidden state and outputs a score for each expert. You take the softmax over all experts, pick the k with the highest scores, run the token through only those experts, and combine the results weighted by the router scores. For Mixtral, k=2 out of 8 experts. For DeepSeek-MoE, k=6 out of 64 experts. The router itself adds negligible compute — a single matrix multiply of size (hidden_dim, n_experts).
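The routing step described above can be sketched in a few lines of dependency-free Python. This is an illustration, not any model's actual implementation: route_token, the toy experts, and the router weights are all hypothetical, and the selected scores are renormalized to sum to 1 (an assumption that matches taking the softmax over only the top-k logits).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(hidden, router_weights, experts, k=2):
    """Route one token through the top-k of len(experts) experts.

    hidden:         list of floats, the token's hidden state
    router_weights: one weight row per expert (n_experts x hidden_dim)
    experts:        list of callables mapping a hidden state to a new one
    """
    # Router: one dot product per expert gives one logit per expert.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)

    # Keep the k experts with the highest probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize the selected scores so the mixing weights sum to 1.
    norm = sum(probs[i] for i in top)
    gates = {i: probs[i] / norm for i in top}

    # Only the selected experts run; the others stay idle for this token.
    output = [0.0] * len(hidden)
    for i, gate in gates.items():
        expert_out = experts[i](hidden)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, gates

# Toy setup: 4 "experts" that just scale the input by a different factor.
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, gates = route_token([2.0, 1.0], router_weights, experts, k=2)
print(gates)   # experts 0 and 1 are selected, with weights of about 0.73 and 0.27
print(output)
```

Experts 2 and 3 are never called for this token, which is where the compute saving comes from.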

The router is not just "which GPU does this go to"

A common mental model is that the router is a load balancer that assigns tokens to experts similar to how a distributed scheduler assigns work to machines. This is misleading. The router is a learned differentiable gate trained end-to-end with the rest of the model through backpropagation. It learns which experts specialize in which types of patterns — subject-matter expertise, syntactic structures, token positions — without any explicit supervision.

Expert specialization emerges, it is not designed

When you inspect the routed outputs after training, individual experts do develop preferences. One expert might handle arithmetic-heavy tokens disproportionately often, another function words and punctuation, a third code syntax. In Mixtral's published routing analysis, the visible patterns were more syntactic than topical. Either way, these specializations are soft, not hard: there is no constraint that says "expert 3 is the math expert." The router simply learns the assignment that minimizes the loss.

Training an MoE model: the load-balancing problem

The hardest part of MoE training is preventing the router from sending every token to the same two experts. If there is no corrective signal, the router quickly collapses: it sends everything to the experts that happen to initialize well, those experts get more gradient updates, they get better, the router sends even more traffic their way, and the unused experts atrophy.

The standard fix is an auxiliary load-balancing loss added to the total training loss. The most common formulation (introduced with the Switch Transformer, with closely related variants in GShard, ST-MoE, and Mixtral) penalizes the router for imbalance:

# Simplified load-balancing loss (following the Switch Transformer formulation)
import torch

def load_balancing_loss(router_logits, num_experts, num_tokens, top_k=2):
    """
    router_logits: (num_tokens, num_experts), raw router scores before softmax
    """
    router_probs = torch.softmax(router_logits, dim=-1)             # (tokens, experts)
    mean_prob_per_expert = router_probs.mean(dim=0)                 # (experts,) avg router probability

    # Fraction of routed token slots assigned to each expert
    _, selected_experts = router_probs.topk(k=top_k, dim=-1)        # (tokens, top_k)
    tokens_per_expert = torch.zeros(num_experts, device=router_logits.device)
    tokens_per_expert.scatter_add_(0, selected_experts.flatten(),
                                   torch.ones(num_tokens * top_k, device=router_logits.device))
    load_per_expert = tokens_per_expert / (num_tokens * top_k)      # (experts,) sums to 1

    # Auxiliary loss: scaled dot product of mean probability and load.
    # Equals 1.0 (not zero) under perfectly uniform routing and grows as
    # probability and load concentrate on a few experts.
    aux_loss = num_experts * (mean_prob_per_expert * load_per_expert).sum()
    return aux_loss

The num_experts multiplier keeps the loss on the same scale regardless of the number of experts: perfectly uniform routing gives a value of 1 whether you have 8 experts or 64. Typical aux_loss coefficients are between 0.001 and 0.01. Too high and the router loses discriminative power. Too low and the expert collapse returns.
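A quick worked example shows how this loss behaves at the two extremes. It is a plain-Python restatement of the formula N * sum(f_i * P_i), with a hypothetical helper name and hand-picked inputs rather than real router outputs.

```python
def aux_loss(mean_probs, token_fractions):
    """Switch-style auxiliary loss: N * sum_i(f_i * P_i).

    mean_probs:      P_i, mean router probability per expert (sums to 1)
    token_fractions: f_i, share of routed token slots per expert (sums to 1)
    """
    n = len(mean_probs)
    return n * sum(f * p for f, p in zip(token_fractions, mean_probs))

n_experts = 8
balanced = [1 / n_experts] * n_experts
# Collapsed router: two experts split all the probability and all the traffic.
collapsed = [0.5, 0.5] + [0.0] * (n_experts - 2)

print(aux_loss(balanced, balanced))    # 1.0 under perfectly uniform routing
print(aux_loss(collapsed, collapsed))  # 4.0, i.e. n_experts / 2 when two experts take everything
```

With a coefficient of 0.01, the collapsed case adds 0.04 to the training loss versus 0.01 for the balanced one, which is the gradient signal that pushes the router back toward even usage.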

Beyond the auxiliary loss: modern routing strategies

Recent work has introduced alternatives that reduce or eliminate the auxiliary loss:

DeepSeek-MoE uses a combination of shared experts (always-on, handles common patterns) and routed experts with top-6 selection. The shared experts cover the base computation that every token needs, so the routed experts can specialize more aggressively without collapsing.
Qwen2.5-MoE uses finer-grained experts (smaller intermediate size) with more of them, combined with shared experts and a "route-constrained" auxiliary loss.
Dense-to-Sparse training (DeepSpeed-MoE) starts with a dense checkpoint and incrementally sparsifies it, avoiding the collapse problem at initialization entirely.
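The shared-expert idea from the first item can be sketched as follows. This is a simplified illustration under stated assumptions, not DeepSeek's code: the function name and toy experts are hypothetical, shared expert outputs are summed unweighted, and the gates for the routed experts are assumed to come from a top-k router.

```python
def shared_plus_routed(hidden, shared_experts, routed_experts, gates):
    """Combine always-on shared experts with gate-weighted routed experts.

    hidden:         list of floats, the token's hidden state
    shared_experts: callables that run for every token
    routed_experts: callables, of which only the selected ones run
    gates:          {routed_expert_index: weight} for the selected experts
    """
    out = [0.0] * len(hidden)
    # Shared experts see every token, regardless of the router's decision.
    for expert in shared_experts:
        out = [o + y for o, y in zip(out, expert(hidden))]
    # Routed experts contribute only if selected, scaled by their gate.
    for idx, gate in gates.items():
        out = [o + gate * y for o, y in zip(out, routed_experts[idx](hidden))]
    return out

# Toy experts: one shared expert adds 1, routed experts scale by 1, 2, 3.
shared = [lambda h: [x + 1.0 for x in h]]
routed = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0)]
result = shared_plus_routed([1.0, 2.0], shared, routed, gates={0: 0.25, 2: 0.75})
print(result)  # [4.5, 8.0]
```

Because the shared path always runs, the routed experts do not each need to relearn the common computation, which is the intuition behind letting them specialize more aggressively.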

MoE serving: where throughput meets memory

Serving an MoE model requires different infrastructure than a dense model. The key insight is that expert weights are wide but narrowly used:

Expert parallelism: place different experts on different GPUs. Since each token activates only k of the N experts, each expert (and the GPU hosting it) processes on average k/N of the tokens, which is 2/8 for Mixtral. This is the standard approach in vLLM, TGI, and SGLang for MoE models.
Memory overhead: all expert weights must be resident across the combined GPU memory. The footprint follows the total parameter count, not the active count. For Mixtral (~47B total, 12.9B active), that is roughly 3.6x the active parameters (less than the 8/2 = 4x the expert ratio suggests, because attention and embedding weights are not replicated per expert). At FP16 you need ~93 GB of VRAM for the weights alone, which means at least 2x A100-80GB or 4x L40S.
All-to-all communication: before the MoE layer, tokens must be grouped by which expert they were routed to, sent to the correct GPU, processed, and then sent back. The router dispatch and combine operations are the main latency bottleneck in MoE inference, not the expert compute itself.
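The dispatch and combine steps can be illustrated in a single process. In this sketch the "send to the correct GPU" step is simulated by grouping token indices per expert; the function names, scalar hidden states, and toy experts are all hypothetical.

```python
from collections import defaultdict

def dispatch(assignments):
    """Group token indices by expert.

    assignments: for each token, a list of (expert_id, gate) pairs
    returns:     {expert_id: [(token_index, gate), ...]}
    """
    per_expert = defaultdict(list)
    for token_idx, routes in enumerate(assignments):
        for expert_id, gate in routes:
            per_expert[expert_id].append((token_idx, gate))
    return per_expert

def moe_forward(tokens, assignments, experts):
    """Dispatch tokens to experts, run each expert on its batch, combine."""
    outputs = [0.0] * len(tokens)
    for expert_id, routed in dispatch(assignments).items():
        # In a multi-GPU setup this batch would be sent to the expert's device.
        batch = [tokens[i] for i, _ in routed]
        results = [experts[expert_id](x) for x in batch]
        # Combine: return each result to its token's slot, scaled by the gate.
        for (token_idx, gate), y in zip(routed, results):
            outputs[token_idx] += gate * y
    return outputs

# Toy example: scalar "hidden states" and experts that multiply by a constant.
experts = {0: lambda x: 10 * x, 1: lambda x: 100 * x, 2: lambda x: 1000 * x}
tokens = [1.0, 2.0, 3.0]
assignments = [
    [(0, 0.5), (1, 0.5)],     # token 0 -> experts 0 and 1
    [(1, 0.75), (2, 0.25)],   # token 1 -> experts 1 and 2
    [(0, 1.0)],               # token 2 -> expert 0 only
]
print(dict(dispatch(assignments)))
print(moe_forward(tokens, assignments, experts))  # [55.0, 650.0, 30.0]
```

On real hardware the grouping and the scatter back are all-to-all exchanges between devices, and their cost grows with the number of GPUs involved rather than with the size of each expert.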

Here is a concrete serving comparison:

# vLLM configuration for MoE vs dense on 4x A100-80GB
# (throughput figures are rough estimates, not benchmark results)
# Dense 70B:
  model: meta-llama/Llama-3.3-70B-Instruct
  tensor_parallel_size: 2
  max_model_len: 8192
  # estimated throughput: ~1800 tokens/s

# MoE ~47B (Mixtral):
  model: mistralai/Mixtral-8x7B-Instruct-v0.1
  tensor_parallel_size: 2
  max_model_len: 32768  # Mixtral's full 32k context window
  # estimated throughput: ~3200 tokens/s

The MoE throughput advantage is real but narrower than the parameter count suggests, because the dispatch overhead and the memory ceiling eat into the margin.

Common pitfalls

Router collapse during training. Even with load-balancing loss, the router can still collapse in the first few thousand steps. Monitor the expert utilization histogram during training. If one expert receives more than 30 percent of tokens while another receives less than 5 percent, increase the auxiliary loss coefficient or switch to a different routing strategy (e.g., DeepSeek's shared-expert design).
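A utilization check along these lines is easy to add to a training loop. The sketch below is one possible way to do it, with hypothetical function names; it measures each expert's share of routed token slots (tokens times k), which is an assumption about how to read the 30 and 5 percent thresholds.

```python
from collections import Counter

def expert_utilization(selected_experts, num_experts):
    """Share of routed token slots that each expert received.

    selected_experts: one tuple of expert ids per token (the top-k picks)
    """
    counts = Counter(e for picks in selected_experts for e in picks)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]

def check_balance(utilization, high=0.30, low=0.05):
    """Return (overloaded, starved) expert ids for the given thresholds."""
    overloaded = [i for i, u in enumerate(utilization) if u > high]
    starved = [i for i, u in enumerate(utilization) if u < low]
    return overloaded, starved

# Toy batch: 10 tokens, top-2 routing over 8 experts, skewed toward experts 0 and 1.
picks = [(0, 1)] * 7 + [(0, 2)] * 2 + [(1, 2)]
util = expert_utilization(picks, num_experts=8)
overloaded, starved = check_balance(util)
print(util)
print("overloaded:", overloaded)  # [0, 1]
print("starved:", starved)        # [3, 4, 5, 6, 7]
```

Logging this per MoE layer every few hundred steps is usually enough to catch a collapse before it becomes self-reinforcing.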

Ignoring dispatch overhead in latency budgets. The all-to-all communication in expert routing can add 5-15 ms per MoE layer depending on batch size and interconnect bandwidth. For a 32-layer model with 16 MoE layers, that is 80-240 ms of overhead per forward pass on top of the expert compute itself. For latency-sensitive applications, this cost can erase the throughput gains.

Training on too-small batch sizes. MoE models require larger batch sizes than dense models because each expert sees only a fraction of the batch. A batch of 256 tokens with 8 experts and k=2 means each expert processes roughly 64 tokens (256 x 2 / 8) when routing is balanced. Training on small batches leads to underutilized experts and noisy gradients.
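The per-expert batch size is simple arithmetic, and it is worth running for your own configuration. The helper below is a hypothetical sketch that assumes perfectly balanced routing; real loads will be more uneven.

```python
def expected_tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Average number of tokens each expert sees under perfectly balanced routing."""
    return batch_tokens * top_k / num_experts

for batch in (32, 256, 4096):
    mixtral_like = expected_tokens_per_expert(batch, num_experts=8, top_k=2)
    fine_grained = expected_tokens_per_expert(batch, num_experts=64, top_k=6)
    print(f"batch={batch}: {mixtral_like:.0f} tokens/expert (8 experts, k=2), "
          f"{fine_grained:.0f} tokens/expert (64 experts, k=6)")
```

Finer-grained designs with many experts shrink the per-expert batch even further, so they need proportionally larger global batches to keep gradients stable.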

Fine-tuning MoE without adaptation. Most MoE models were trained from scratch with MoE architecture. Taking a dense checkpoint and converting it to MoE (as in DeepSpeed-MoE's d2s approach) requires careful initialization and a warm-up schedule. Simple LoRA fine-tuning on an existing MoE model can break the learned routing patterns. Always evaluate the downstream task and compare expert utilization before and after fine-tuning to verify the routing did not drift.

Measuring memory wrong. The memory you need to serve an MoE model is set by its total parameter count (all experts plus the shared layers), not by the active parameter count. For DeepSeek-MoE-16B, the 64 routed experts per MoE layer (each with intermediate_size 1408 at hidden_size 2048) add up to roughly 15B parameters, or about 30 GB at FP16, even though only about 2.8B parameters are active per token. The 16B label is the total parameter count, and that is the number that determines the storage requirement.
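The expert-weight estimate can be reproduced with a few multiplications. This sketch rests on two assumptions that are not stated in the text above: each expert is a gated FFN with three projection matrices (gate, up, down), and the model has 27 MoE layers (28 layers with the first one dense). The function name is hypothetical.

```python
def expert_params(hidden_size: int, intermediate_size: int, num_matrices: int = 3) -> int:
    """Parameters in one gated FFN expert (gate, up, and down projections)."""
    return num_matrices * hidden_size * intermediate_size

hidden_size = 2048
moe_intermediate_size = 1408
routed_experts = 64
top_k = 6
moe_layers = 27  # ASSUMPTION: 28 layers, with the first one dense

per_expert = expert_params(hidden_size, moe_intermediate_size)
routed_total = per_expert * routed_experts * moe_layers   # must all be stored
routed_active = per_expert * top_k * moe_layers           # used per token

print(f"per expert:         {per_expert / 1e6:.2f}M params")
print(f"all routed experts: {routed_total / 1e9:.2f}B params, "
      f"{routed_total * 2 / 1e9:.1f} GB at FP16")
print(f"routed, per token:  {routed_active / 1e9:.2f}B params")
```

Under these assumptions the routed experts alone account for about 14.9B parameters (roughly 30 GB at FP16), while a single token touches only about 1.4B of them; shared experts, attention, and embeddings make up the rest of the total.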

When NOT to use it

MoE is not always the right architecture for your model:

You need consistent latency for every request. Because the router's top-k selection varies per token, and because batch composition affects which experts are active, MoE latency has higher variance than dense models. If your SLO requires 99th percentile latency under 200 ms per token, a dense model is easier to calibrate.
You are deploying on a single GPU with less than 48 GB VRAM. MoE models with real quality (anything above 2-3 billion active parameters) usually carry too many total parameters to fit on one such card at FP16, so expect to need at least two GPUs or aggressive quantization. If your deployment is a single RTX 4090 or A5000, stick with dense models in the 7B-13B range.
You are building a small model under 3B parameters. The overhead of the router, the auxiliary loss, and the expert parallelism infrastructure is not worth it at this scale. MoE starts to pay off when the dense baseline you are trying to beat is above 30-50B parameters.
Your batch size is small and latency-critical. A batch of 1 (streaming chat) does not benefit from expert parallelism because the dispatch overhead dominates. The throughput advantage of MoE is most visible at batch sizes above 64.
You cannot afford the engineering complexity. MoE serving requires custom kernel support (Triton or CUDA kernels for fused experts, dispatch, and combine), non-trivial CI for load-balancing validation, and integration with inference engines that are still maturing their MoE support. If your team has limited ML infrastructure, a dense model with QLoRA is the safer bet.

TL;DR

MoE decouples total parameters from per-token compute by routing each token to a subset of expert FFNs.
Mixtral 8x7B has ~47B total parameters but only activates ~13B per token: it targets 70B-class quality at roughly 13B-class compute per token.
The router is a learned linear layer trained end-to-end, not a scheduler. Expert specialization emerges naturally.
Load-balancing loss is essential during training to prevent router collapse. Typical coefficients range from 0.01 to 0.001.
Serving MoE requires expert parallelism across GPUs. Dispatch overhead is the main latency bottleneck, not the expert compute.
MoE memory footprint is proportional to total parameters (all experts), not active parameters. You cannot fit Mixtral on a single 24 GB GPU.
MoE pays off at large scale (target dense baseline above 30B). For small models, single-GPU deployments, or latency-sensitive applications, dense is simpler and often better.

Next post: structured output — how JSON mode, function calling, and grammar-constrained decoding work under the hood, and when each approach fails.
