The Linear-Time Transformer Replacement Everyone's Building
The quadratic complexity of attention — $O(n^2)$ for sequence length $n$ — stopped being theoretical the moment context windows hit 128k tokens. State Space Models (SSMs) promise $O(n)$ complexity without sacrificing quality, and three architectures dominate the 2026 landscape: Mamba-2, Griffin, and RWKV-6.
I benchmarked all three on the same 1.3B parameter budget. The results challenged what I thought I knew about attention alternatives.
What Makes SSMs Different From Transformers
Transformers compute attention scores between every token pair. For a 10k token sequence, that's 100M comparisons. SSMs instead maintain a fixed-size hidden state that gets updated sequentially:
$$h_t = \bar{A}h_{t-1} + \bar{B}x_t$$
$$y_t = Ch_t$$
The matrices $\bar{A}, \bar{B}, C$ are learned, but crucially: $h_t$ doesn't grow with sequence length. You process 10 tokens or 100k tokens with the same memory footprint.
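The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration of the sequential scan, not any particular architecture's implementation — the function name, shapes, and values are all hypothetical, chosen only to show that the hidden state $h_t$ stays the same size no matter how long the input is.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Run h_t = A_bar @ h_{t-1} + B_bar @ x_t, y_t = C @ h_t over a sequence.

    A_bar: (d, d), B_bar: (d, m), C: (p, d), x: (n, m)  ->  y: (n, p)
    """
    d = A_bar.shape[0]
    h = np.zeros(d)                  # fixed-size hidden state: never grows with n
    ys = []
    for x_t in x:                    # one O(1)-memory update per token: O(n) total
        h = A_bar @ h + B_bar @ x_t  # state update
        ys.append(C @ h)             # readout
    return np.stack(ys)

# Memory for h is identical whether n = 10 or n = 100_000.
rng = np.random.default_rng(0)
d, m, p, n = 16, 8, 4, 32
y = ssm_scan(0.9 * np.eye(d), rng.normal(size=(d, m)),
             rng.normal(size=(p, d)), rng.normal(size=(n, m)))
print(y.shape)  # (32, 4)
```

Note the contrast with attention: there is no $n \times n$ score matrix anywhere — the only per-sequence cost is the loop itself.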