Why Most MoE Explanations Skip the Routing Problem
Mixture-of-Experts models are everywhere now — DeepSeek-V3, Mixtral 8x7B, GPT-4 (rumored) — but most tutorials just show you the sparsity math and call it a day. They skip the part that actually matters in production: how do tokens decide which experts to visit?
The routing mechanism is where MoE models live or die. Route poorly and you get load imbalance (some experts idle while others bottleneck), collapsed diversity (all tokens pick the same 2 experts), or straight-up training instability. Route well and you get 3-5x more parameters for the same compute budget.
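Load imbalance is easy to quantify: count how many token slots each expert receives and compare to a uniform split. Here is a minimal NumPy sketch; the function name `expert_load` and the coefficient-of-variation metric are my own illustration, not anything from a specific MoE codebase.

```python
import numpy as np

def expert_load(expert_ids, num_experts):
    """Fraction of token slots routed to each expert.

    expert_ids: integer array of chosen expert indices, any shape.
    Returns a (num_experts,) vector that sums to 1.
    """
    counts = np.bincount(expert_ids.ravel(), minlength=num_experts)
    return counts / counts.sum()

rng = np.random.default_rng(1)

# Simulate collapsed routing: most tokens pile onto experts 0 and 1.
skewed = rng.choice(
    8, size=(1024, 2),
    p=[0.4, 0.4, 0.05, 0.05, 0.025, 0.025, 0.025, 0.025],
)
load = expert_load(skewed, num_experts=8)

# Coefficient of variation: 0 means perfectly balanced load;
# the skewed distribution above scores well over 1.
imbalance = load.std() / load.mean()
```

Auxiliary load-balancing losses (as in the Switch Transformer line of work) penalize exactly this kind of skew during training, pushing the router back toward a uniform `load` vector.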
DeepSeek-V3 and Mixtral both use top-$k$ gating, but their routing strategies differ in ways that matter for throughput, training stability, and hardware utilization. I'm going to show you both approaches with actual code, explain where each one breaks, and tell you which design choice I'd pick.
The Core MoE Primitive: Gating Function
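Before diving into the DeepSeek-V3 and Mixtral specifics, here is the primitive both build on: a router scores every expert for every token, keeps the top-$k$ scores, and renormalizes them into mixture weights. A minimal NumPy sketch (the function name `topk_gate` and the shapes are my choices; production routers run this as fused GPU kernels with extra machinery for capacity limits and balancing):

```python
import numpy as np

def topk_gate(logits, k=2):
    """Top-k gating: route each token to its k highest-scoring experts.

    logits: (num_tokens, num_experts) raw router scores.
    Returns (indices, weights): the chosen expert ids per token and
    softmax weights renormalized over just those k experts.
    """
    # Indices of the k largest logits for each token.
    idx = np.argsort(logits, axis=-1)[:, -k:]
    top = np.take_along_axis(logits, idx, axis=-1)

    # Softmax over only the selected experts (Mixtral-style:
    # renormalize after the top-k cut, not before).
    top = top - top.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(top)
    w = w / w.sum(axis=-1, keepdims=True)
    return idx, w

# 4 tokens, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
idx, w = topk_gate(logits, k=2)
```

Each token's output is then the weighted sum of its `k` chosen experts' outputs, so only `k` of the 8 expert FFNs run per token; that is the entire source of MoE's compute savings.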