The Obvious Problem with Transformers That Everyone Ignores
Every token in a Transformer gets the same compute budget. In a GPT model, the token for "the" burns exactly as many FLOPs as a token from "quantum entanglement." That's wasteful.
Mixture-of-Depths (MoD), introduced by Raposo et al. in their 2024 paper, asks a simple question: what if tokens could choose whether to go through each layer? Not every token needs full computation at every depth. Some tokens coast through early layers and do heavy lifting later. Others front-load their work and skip the rest.
The result: 40% FLOPs reduction at iso-quality in GPT-scale models. No accuracy drop. Same perplexity, same downstream performance.
How MoD Works: Routing Tokens Through Layers
Standard Transformers process all $N$ tokens through all $L$ layers. Total compute: roughly $O(N \cdot L \cdot d^2)$ for the projections and MLPs, plus $O(N^2 \cdot L \cdot d)$ for the attention scores, where $d$ is the hidden dimension.
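To make that scaling concrete, here's a quick back-of-the-envelope estimate. The values of `N`, `L`, and `d` below are arbitrary illustrative choices (not from any particular model), and constant factors are ignored.

```python
# Rough scaling estimate for a dense Transformer forward pass.
# N, L, d are illustrative values; constant factors (e.g. the 2x for
# multiply-accumulate) are omitted.
N, L, d = 4096, 32, 4096           # sequence length, layers, hidden dim

matmul_flops = N * L * d**2        # projections + MLP term: O(N * L * d^2)
attention_flops = L * N**2 * d     # attention-score term:  O(L * N^2 * d)

print(f"matmul ~ {matmul_flops:.2e} FLOPs, attention ~ {attention_flops:.2e} FLOPs")
# With these values the two terms are equal (N == d); for shorter sequences
# the N * L * d^2 term dominates.
```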
MoD introduces a top-k routing mechanism at each layer. Before the self-attention block, a lightweight router scores every token:
$$
r_i = w_\theta^\top x_i
$$

Only the tokens with the top-$k$ scores in the sequence go through the block's attention and MLP; everything else skips the block via the residual connection and arrives at the next layer unchanged.
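Here's a minimal PyTorch sketch of one MoD-style layer to make the routing concrete. This is an illustration, not the paper's implementation: the `MoDLayer` name, the `capacity` fraction, and the sigmoid gate are my assumptions, and `block` stands in for an attention-plus-MLP sub-block that returns the residual update.

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """One Transformer layer with MoD-style top-k token routing (illustrative)."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1, bias=False)  # one scalar score per token
        self.block = block        # attention + MLP, returns the residual update
        self.capacity = capacity  # fraction of tokens that get full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, N, D = x.shape
        k = max(1, int(N * self.capacity))

        scores = self.router(x).squeeze(-1)            # (B, N) router scores r_i
        top = scores.topk(k, dim=-1).indices           # indices of the k routed tokens
        idx = top.unsqueeze(-1).expand(-1, -1, D)      # (B, k, D) gather/scatter index

        selected = x.gather(1, idx)                    # routed tokens only
        update = self.block(selected)                  # they attend only to each other

        # Gate the update with the router score so the router receives gradients;
        # a sigmoid is one choice here, the paper scales by the score directly.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)

        out = x.clone()                                # skipped tokens pass through untouched
        out.scatter_(1, idx, selected + gate * update)
        return out
```

One thing the sketch glosses over: the top-$k$ is computed over the whole sequence, which works during training but is non-causal at sampling time. The paper handles this with a small auxiliary predictor that guesses, token by token, whether the router would have selected each position.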