The Obvious Problem with Transformers That Everyone Ignores
Every token in a Transformer gets the same compute budget. In a GPT model, the token for "the" burns exactly as many FLOPs as a token from "quantum entanglement." That's wasteful.
Mixture-of-Depths (MoD), introduced by Raposo et al. in their 2024 paper, asks a simple question: what if tokens could choose whether to go through each layer? Not every token needs full computation at every depth. Some tokens coast through early layers and do heavy lifting later. Others front-load their work and skip the rest.
The result: 40% FLOPs reduction at iso-quality in GPT-scale models. No accuracy drop. Same perplexity, same downstream performance.
How MoD Works: Routing Tokens Through Layers
Standard Transformers process all $N$ tokens through all $L$ layers. Total compute: roughly $O(N \cdot L \cdot d^2)$ for the projections and MLPs, plus $O(N^2 \cdot L \cdot d)$ for the attention scores, where $d$ is the hidden dimension.
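To make that scaling concrete, here's a quick back-of-the-envelope estimate. The values of `N`, `L`, and `d` below are arbitrary illustrative choices (not from any particular model), and constant factors are ignored.

```python
# Rough scaling estimate for a dense Transformer forward pass.
# N, L, d are illustrative values; constant factors (e.g. the 2x for
# multiply-accumulate) are omitted.
N, L, d = 4096, 32, 4096           # sequence length, layers, hidden dim

matmul_flops = N * L * d**2        # projections + MLP term: O(N * L * d^2)
attention_flops = L * N**2 * d     # attention-score term:  O(L * N^2 * d)

print(f"matmul ~ {matmul_flops:.2e} FLOPs, attention ~ {attention_flops:.2e} FLOPs")
# With these values the two terms are equal (N == d); for shorter sequences
# the N * L * d^2 term dominates.
```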
MoD introduces a top-k routing mechanism at each layer. Before the self-attention block, a lightweight router scores every token:
$$
r_i = w_\theta^\top x_i
$$

Only the tokens with the top-$k$ scores in the sequence go through the block's attention and MLP; everything else skips the block via the residual connection and arrives at the next layer unchanged.
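Here's a minimal PyTorch sketch of one MoD-style layer to make the routing concrete. This is an illustration, not the paper's implementation: the `MoDLayer` name, the `capacity` fraction, and the sigmoid gate are my assumptions, and `block` stands in for an attention-plus-MLP sub-block that returns the residual update.

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """One Transformer layer with MoD-style top-k token routing (illustrative)."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1, bias=False)  # one scalar score per token
        self.block = block        # attention + MLP, returns the residual update
        self.capacity = capacity  # fraction of tokens that get full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, N, D = x.shape
        k = max(1, int(N * self.capacity))

        scores = self.router(x).squeeze(-1)            # (B, N) router scores r_i
        top = scores.topk(k, dim=-1).indices           # indices of the k routed tokens
        idx = top.unsqueeze(-1).expand(-1, -1, D)      # (B, k, D) gather/scatter index

        selected = x.gather(1, idx)                    # routed tokens only
        update = self.block(selected)                  # they attend only to each other

        # Gate the update with the router score so the router receives gradients;
        # a sigmoid is one choice here, the paper scales by the score directly.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)

        out = x.clone()                                # skipped tokens pass through untouched
        out.scatter_(1, idx, selected + gate * update)
        return out
```

One thing the sketch glosses over: the top-$k$ is computed over the whole sequence, which works during training but is non-causal at sampling time. The paper handles this with a small auxiliary predictor that guesses, token by token, whether the router would have selected each position.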