The 8-Expert Model That Only Uses 2
You train a Mixture of Experts model with 8 experts, expecting distributed specialization. After a few thousand steps, you check the routing statistics and find 87% of tokens going to experts 0 and 3. The other six experts? Basically decorative.
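Checking for this is cheap: tally how many top-k routing slots each expert receives. Here's a minimal PyTorch sketch of that kind of utilization check; the name `router_logits` and its `[num_tokens, num_experts]` shape are illustrative assumptions, not details from this article:

```python
import torch

def expert_utilization(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Fraction of top-k routing slots assigned to each expert.

    router_logits: [num_tokens, num_experts] raw router scores.
    Returns a [num_experts] tensor that sums to 1.0.
    """
    top_k = router_logits.topk(k, dim=-1).indices  # [num_tokens, k]
    counts = torch.bincount(top_k.flatten(), minlength=router_logits.size(-1))
    return counts.float() / counts.sum()

# With 8 experts and top-2 routing, a healthy router hovers near 0.125 per
# expert; a collapsed one concentrates most of the mass on one or two entries.
logits = torch.randn(10_000, 8)
print(expert_utilization(logits, k=2))
```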
This is router collapse, and it's one of the most frustrating failure modes in MoE training: your model carries 8x the parameters but routes nearly everything through a fraction of them. The paper that first systematically addressed it, Shazeer et al.'s "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (2017), remains the foundational reference.
The core insight is deceptively simple: without explicit load balancing, routers learn to send everything to whichever experts happen to perform slightly better early in training. Those experts get more gradient signal, improve faster, and attract even more tokens. It's a rich-get-richer dynamic that starves most experts of training signal entirely.
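The standard countermeasure, introduced by Shazeer et al. and refined in the Switch Transformer, is an auxiliary load-balancing loss that penalizes concentrated routing. Below is a minimal PyTorch sketch of the Switch-style formulation; the names (`router_logits`) and the 0.01 coefficient are illustrative assumptions, not this article's own code:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Switch Transformer-style auxiliary loss:
    num_experts * sum_i (fraction of slots dispatched to expert i)
                        * (mean router probability for expert i).
    Equals 1.0 under perfectly uniform routing and grows as routing
    concentrates on a few experts.
    """
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)       # [tokens, experts]
    top_k = probs.topk(k, dim=-1).indices          # [tokens, k]
    # f_i: hard dispatch fraction per expert (counted from top-k assignments).
    dispatch = torch.bincount(top_k.flatten(), minlength=num_experts)
    f = dispatch.float() / top_k.numel()
    # P_i: mean soft router probability per expert.
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

# Typically added to the task loss with a small coefficient, e.g.
# loss = task_loss + 0.01 * load_balancing_loss(router_logits)
```

Note the design choice: the hard dispatch fractions `f` carry no gradient, so the gradient flows entirely through the soft probabilities `p`, nudging the router to spread probability mass before the rich-get-richer loop locks in.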