The Linear-Time Transformer Replacement Everyone's Building
The quadratic complexity of attention — $O(n^2)$ for sequence length $n$ — stopped being theoretical the moment context windows hit 128k tokens. State Space Models (SSMs) promise $O(n)$ complexity without sacrificing quality, and three architectures dominate the 2026 landscape: Mamba-2, Griffin, and RWKV-6.
I benchmarked all three on the same 1.3B parameter budget. The results challenged what I thought I knew about attention alternatives.
What Makes SSMs Different From Transformers
Transformers compute attention scores between every token pair. For a 10k token sequence, that's 100M comparisons. SSMs instead maintain a fixed-size hidden state that gets updated sequentially:
$$h_t = \bar{A}h_{t-1} + \bar{B}x_t$$
$$y_t = Ch_t$$
The matrices $\bar{A}, \bar{B}, C$ are learned, but crucially: $h_t$ doesn't grow with sequence length. You process 10 tokens or 100k tokens with the same memory footprint.
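The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration of the sequential scan, not any particular architecture's implementation — the function name, shapes, and values are all hypothetical, chosen only to show that the hidden state $h_t$ stays the same size no matter how long the input is.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Run h_t = A_bar @ h_{t-1} + B_bar @ x_t, y_t = C @ h_t over a sequence.

    A_bar: (d, d), B_bar: (d, m), C: (p, d), x: (n, m)  ->  y: (n, p)
    """
    d = A_bar.shape[0]
    h = np.zeros(d)                  # fixed-size hidden state: never grows with n
    ys = []
    for x_t in x:                    # one O(1)-memory update per token: O(n) total
        h = A_bar @ h + B_bar @ x_t  # state update
        ys.append(C @ h)             # readout
    return np.stack(ys)

# Memory for h is identical whether n = 10 or n = 100_000.
rng = np.random.default_rng(0)
d, m, p, n = 16, 8, 4, 32
y = ssm_scan(0.9 * np.eye(d), rng.normal(size=(d, m)),
             rng.normal(size=(p, d)), rng.normal(size=(n, m)))
print(y.shape)  # (32, 4)
```

Note the contrast with attention: there is no $n \times n$ score matrix anywhere — the only per-sequence cost is the loop itself.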