How does a model have hundreds of billions of parameters but still run affordably? Mixture of Experts. Instead of every token using the whole network, a router sends each token to just a few specialists. Here's the routing, visualized.
🧠 Watch the router route each token: https://dev48v.infy.uk/ai/days/day19-mixture-of-experts.html
The idea
A layer holds N expert sub-networks (say 8) plus a small router/gating network. For each token, the router scores the experts and sends the token to only the top-k (e.g. top-2). Those experts light up and do the work; the other six stay dark. Sparse activation.
In the demo you feed a sentence token-by-token and watch different tokens route to different experts (some specialize in function words, others in numbers, etc.), with a live "2/8 active" tally.
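To make the routing step concrete, here is a minimal pure-Python sketch of top-k gating for a single token. It is an illustration under assumptions, not the demo's actual code: the names `route_token` and `moe_layer`, the random router weights, and the toy "experts" that just scale their input are all hypothetical. It also normalizes the gate weights over the selected experts only, which is one common choice; other implementations apply softmax over all experts before picking the top-k.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, k=2):
    """Score every expert for one token and keep only the top-k.

    Returns (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    # One router score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalize over the chosen experts only, so their gate weights sum to 1.
    gates = softmax([scores[i] for i in top])
    return list(zip(top, gates))


def moe_layer(token, router_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = [0.0] * len(token)
    for idx, gate in route_token(token, router_weights, k):
        expert_out = experts[idx](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output


# Toy setup (all values hypothetical): 8 experts, 4-dimensional tokens.
random.seed(0)
NUM_EXPERTS, DIM = 8, 4
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Stand-in "experts": each just scales its input by a different factor.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
chosen = route_token(token, router_weights, k=2)
print(f"{len(chosen)}/{NUM_EXPERTS} active:", chosen)
print(moe_layer(token, router_weights, experts, k=2))
```

The six experts that were not selected are never called, which is where the compute saving comes from.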
Why it's a big deal
It decouples capacity from compute. Total params can be enormous (all experts exist in memory), but the active params per token stay small, so you get much of the quality of a far larger model at a fraction of the per-token compute. Flip to "Dense" in the demo and all 8 experts fire instead of 2 = 4× the expert compute per token.
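The capacity-versus-compute split is easy to check with a little arithmetic. The sketch below uses a hypothetical helper, `moe_param_counts`, and made-up sizes in billions of parameters; it assumes some parameters (attention, embeddings, the router) are shared by every token while only the expert parameters are sparse.

```python
def moe_param_counts(shared, per_expert, num_experts=8, top_k=2):
    """Return (total, active) parameter counts for a sparse MoE model.

    shared:     parameters every token uses (attention, embeddings, router)
    per_expert: parameters in one expert, summed over all MoE layers
    """
    total = shared + num_experts * per_expert  # must all sit in memory
    active = shared + top_k * per_expert       # actually used per token
    return total, active


# Hypothetical sizes, in billions of parameters, chosen only for illustration.
total, active = moe_param_counts(shared=2.0, per_expert=5.0)
print(f"total: {total:.0f}B, active per token: {active:.0f}B")

# Expert compute alone scales with how many experts fire: 8 (dense) vs 2 (top-2).
dense_vs_sparse = 8 / 2
print(f"dense/sparse expert compute: {dense_vs_sparse:.0f}x")
```

Note that the 4× figure applies to the expert layers only; the shared parameters cost the same either way, so the end-to-end saving is somewhat smaller.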
The catches
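- Load balancing: left alone, the router tends to overuse a few experts, so training adds an auxiliary loss that spreads tokens across all of them.
- Memory: you must hold all experts in memory even though each token uses only a few.
- Training stability: routing can be unstable to train, since a small change in router scores can flip which experts a token reaches.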
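As a rough illustration of the load-balancing point, here is one common formulation of the auxiliary loss, in the style popularized by the Switch Transformer: multiply, per expert, the fraction of tokens routed to it by the mean router probability it receives, sum over experts, and scale by the expert count. The function name `load_balance_loss` and the toy inputs are assumptions for this sketch, and the demo page may compute its loss differently.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: num_experts * sum_i(f_i * P_i).

    router_probs: per-token list of router probabilities over all experts
    assignments:  per-token list of the expert indices the token was sent to
    """
    num_tokens = len(router_probs)
    total_slots = sum(len(a) for a in assignments)

    # f_i: fraction of routing slots that went to expert i.
    f = [0.0] * num_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / total_slots

    # P_i: mean router probability given to expert i across the batch.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced: 4 tokens spread evenly over 4 experts with uniform probabilities.
uniform = [[0.25] * 4 for _ in range(4)]
balanced = load_balance_loss(uniform, [[0], [1], [2], [3]], num_experts=4)

# Collapsed: the router sends every token to expert 0 with full confidence.
peaked = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]
collapsed = load_balance_loss(peaked, [[0], [0], [0], [0]], num_experts=4)

print(balanced, collapsed)
```

The loss is lowest when tokens and router probability are spread evenly, and it grows as routing collapses onto a few experts. In practice it is added to the main training loss with a small weight.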
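Mixtral's "8x7B" is exactly this: 8 experts per layer, top-2 per token. Because the experts share the attention layers, the total is about 47B parameters rather than 56B, with roughly 13B active per token.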
🔨 Built from concept (experts + gating → top-k route → combine → load-balance loss) on the page: https://dev48v.infy.uk/ai/days/day19-mixture-of-experts.html
Part of AIFromZero. 🌐 https://dev48v.infy.uk