
TildAlice

Originally published at tildalice.io

Mixture-of-Depths: Dynamic Token Skip Cuts 40% FLOPs

The Obvious Problem with Transformers That Everyone Ignores

Every token in a Transformer gets the same compute budget. The word "the" gets as many FLOPs as the phrase "quantum entanglement" in a GPT model. That's wasteful.

Mixture-of-Depths (MoD), introduced by Raposo et al. in their 2024 paper, asks a simple question: what if tokens could choose whether to go through each layer? Not every token needs full computation at every depth. Some tokens coast through early layers and do heavy lifting later. Others front-load their work and skip the rest.

The result: 40% FLOPs reduction at iso-quality in GPT-scale models. No accuracy drop. Same perplexity, same downstream performance.


How MoD Works: Routing Tokens Through Layers

Standard Transformers process all $N$ tokens through all $L$ layers. The matmul-dominated compute is $O(N \cdot L \cdot d^2)$, where $d$ is the hidden dimension (this ignores attention's $O(N^2 \cdot d)$ term, which matters at long sequence lengths).
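To get a feel for where the savings come from, here is a back-of-the-envelope sketch. The model shape, the every-other-layer routing pattern, and the capacity value are illustrative assumptions, not the paper's exact configuration, and constant factors plus the attention $O(N^2)$ term are omitted:

```python
# Rough per-forward-pass FLOP estimate for the matmul-dominated cost
# O(N * L * d^2). Constants and the O(N^2 * d) attention term are
# omitted, so these numbers are illustrative only.
N, L, d = 2048, 24, 2048   # tokens, layers, hidden dim (hypothetical config)
dense_flops = N * L * d**2

# Mixture-of-Depths style budget: suppose every other layer routes only
# a fraction (the "capacity") of tokens through its attention + MLP block.
capacity = 0.125
mod_flops = sum((capacity if layer % 2 else 1.0) * N * d**2
                for layer in range(L))

print(f"dense: {dense_flops:.3e}")
print(f"MoD:   {mod_flops:.3e}")
print(f"saving: {1 - mod_flops / dense_flops:.1%}")
```

With these assumed numbers, half the layers run at 12.5% capacity, which already lands the total budget in the same ballpark as the headline FLOPs reduction.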

MoD introduces a top-k routing mechanism at each layer. Before the self-attention block, a lightweight router scores every token:

$$
r_i = w_\theta^\top x_i
$$

The top-$k$ tokens by score $r_i$ pass through the attention and MLP block; the remaining $N - k$ tokens skip it via the residual stream. The block's cost at that layer drops from $O(N \cdot d^2)$ to $O(k \cdot d^2)$, and $k$ is fixed ahead of time, so the compute graph stays static.
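The routing step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `w_theta`, the `block` callable (standing in for attention + MLP), and the capacity value are all placeholder assumptions.

```python
import numpy as np

def mod_layer(x, w_theta, block, capacity=0.125):
    """One Mixture-of-Depths-style layer (illustrative sketch).

    x:       (N, d) token activations
    w_theta: (d,)   router weights producing one scalar score per token
    block:   callable (k, d) -> (k, d), stands in for attention + MLP
    """
    n = x.shape[0]
    k = max(1, int(capacity * n))
    scores = x @ w_theta                  # r_i = w_theta^T x_i
    top = np.argsort(scores)[-k:]         # indices of the k highest-scoring tokens
    out = x.copy()                        # residual path: skipped tokens pass through unchanged
    # Scale the block output by the router score so the routing decision
    # stays on the gradient path during training.
    out[top] = x[top] + scores[top, None] * block(x[top])
    return out

# Toy usage with a cheap placeholder block on random activations.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
y = mod_layer(x, rng.standard_normal(8), block=lambda h: 0.1 * h, capacity=0.25)
```

Only the selected rows are touched; every other token rides the residual stream through the layer for free.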
