Mixture of Experts: Big Models, Cheap Inference

#ai #llm #machinelearning #beginners

How does a model have hundreds of billions of parameters but still run affordably? Mixture of Experts. Instead of every token using the whole network, a router sends each token to just a few specialists. Here's the routing, visualized.

🧠 Watch the router route each token: https://dev48v.infy.uk/ai/days/day19-mixture-of-experts.html

The idea

A layer holds N expert sub-networks (say 8) plus a small router/gating network. For each token, the router scores the experts and sends the token to only the top-k (e.g. top-2). Those experts light up and do the work; the other six stay dark. Sparse activation.
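To make the routing step concrete, here is a minimal pure-Python sketch of one MoE layer handling one token. Everything in it is an illustrative assumption rather than the demo's actual code: the names `top_k_route` and `moe_layer`, the random router weights, and the toy "experts" that only scale their input. It also normalizes the gate weights with a softmax over just the chosen experts, which is one common option; other designs apply the softmax across all experts before selecting.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalize over the chosen experts only, so the gate weights sum to 1.
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(token, router_weights, experts, k=2):
    # Router: one dot product per expert gives that expert's score for this token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    for idx, gate in routes:
        expert_out = experts[idx](token)  # only the selected experts run
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, routes

# Hypothetical toy setup: 8 experts, 4-dimensional tokens.
random.seed(0)
DIM, NUM_EXPERTS = 4, 8
router_weights = [[random.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
# Toy "experts": each one just scales the input by a different factor.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, routes = moe_layer(token, router_weights, experts)
print(f"{len(routes)}/{NUM_EXPERTS} active:", [(i, round(g, 3)) for i, g in routes])
```

In a real model the experts are feed-forward networks and the router weights are learned, but the shape of the computation is the same: score, pick the top-k, run only those, and combine their outputs weighted by the gates.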

In the demo you feed a sentence token-by-token and watch different tokens route to different experts (some specialize in function words, others in numbers, etc.), with a live "2/8 active" tally.

Why it's a big deal

It decouples capacity from compute. Total parameters can be enormous (every expert sits in memory), but the active parameters per token stay small, so you get much of the capacity of a huge model at a fraction of the per-token compute. Flip to "Dense" in the demo and all 8 experts fire instead of 2: 4× the expert compute.
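The 4× figure is just counting, and a few lines of arithmetic show where it comes from. The sketch below uses made-up parameter counts (1M per expert, 2M shared) purely for illustration; `moe_budget` is a hypothetical helper, not anything from the demo. It also shows that the ratio shrinks once you include parameters every token uses regardless of routing, such as attention.

```python
def moe_budget(num_experts, top_k, params_per_expert, shared_params=0):
    """Return (total_params, active_params_per_token) for one MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Experts only: 8 experts, top-2, hypothetical 1M params each.
total, active = moe_budget(num_experts=8, top_k=2, params_per_expert=1_000_000)
print(total, active, total / active)  # 8000000 2000000 4.0

# Add hypothetical shared (non-expert) params that every token always uses.
total_s, active_s = moe_budget(8, 2, 1_000_000, shared_params=2_000_000)
print(total_s, active_s, total_s / active_s)  # 10000000 4000000 2.5
```

So "dense = 4× the compute" holds for the expert layers themselves; the whole-model saving depends on how much of the network is shared.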

The catches

Load balancing: left alone, the router tends to overuse a few experts, so an auxiliary loss pushes it to spread tokens more evenly.
Memory: you must hold all experts in memory even though each token uses only a few.
Training stability: routing can be unstable to train.
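The auxiliary load-balancing loss from the first point can be sketched in a few lines. This follows one common formulation, in the style of the Switch Transformer (number of experts times the sum over experts of "fraction of tokens sent there" times "mean router probability for it"). The post does not say which loss the demo uses, so treat this as an assumption; the function name, the top-1 assignment, and the toy probabilities are all illustrative, and the usual scaling coefficient is left out.

```python
def load_balance_loss(router_probs, num_experts):
    """router_probs: one probability list per token, each summing to 1."""
    num_tokens = len(router_probs)
    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]
    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced: each of 4 tokens prefers a different one of 4 experts.
balanced = [[0.4 if j == i else 0.2 for j in range(4)] for i in range(4)]
# Collapsed: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

print(load_balance_loss(balanced, 4))   # roughly 1.0, the minimum
print(load_balance_loss(collapsed, 4))  # roughly 2.8, penalized
```

The loss bottoms out near 1.0 when tokens are spread evenly and grows as the router piles tokens onto a few experts, which is the pressure that keeps all experts in use during training.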

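Mixtral's "8x7B" is exactly this: 8 experts per layer, top-2 per token. The name overstates the size, though. Only the expert feed-forward blocks are replicated, so the model has roughly 47B parameters in total and about 13B active per token, not 8 × 7B = 56B.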

🔨 Built from concept (experts + gating → top-k route → combine → load-balance loss) on the page: https://dev48v.infy.uk/ai/days/day19-mixture-of-experts.html

Part of AIFromZero. 🌐 https://dev48v.infy.uk
