The Promise vs The Reality
Mamba and RWKV both claim to solve the $O(n^2)$ attention bottleneck that kills Transformer scaling beyond 8K tokens. The pitch is seductive: linear-time inference, constant memory per token, no KV cache explosion. I wanted to see if either delivers on real long-context tasks — not the cherry-picked benchmarks from the papers, but the messy reality of 32K-token summarization and QA retrieval.
Spoiler: one architecture chokes past 16K. The other scales but trades accuracy for speed in ways its paper glossed over.
If you want the primary sources, see the original Mamba paper (Gu & Dao, 2023) and the RWKV paper (Peng et al., 2023).
What State Space Models Actually Do
Both Mamba and RWKV replace attention with recurrent state updates. The core idea: instead of comparing every token to every other token (the $O(n^2)$ operation in self-attention), maintain a fixed-size hidden state that gets updated as you process each token sequentially.
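To make that concrete, here is a minimal sketch of a linear recurrence with a fixed-size state. This is not Mamba's selective scan or RWKV's time-mixing; the matrices, dimensions, and function names are illustrative assumptions, just to show why per-token cost stays constant as the sequence grows.

```python
import numpy as np

# Illustrative toy, not either architecture's real kernel.
d_state = 16    # fixed hidden-state size, independent of sequence length
d_model = 64    # per-token embedding size

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
B = rng.normal(scale=0.1, size=(d_state, d_model))  # input projection
C = rng.normal(scale=0.1, size=(d_model, d_state))  # output projection

def ssm_forward(tokens):
    """Process tokens left to right with a fixed-size state.

    Cost is O(n) in sequence length and memory per token is constant,
    unlike attention's O(n^2) pairwise comparisons and growing KV cache.
    """
    h = np.zeros(d_state)
    outputs = []
    for x in tokens:             # x: (d_model,) embedding of one token
        h = A @ h + B @ x        # update a compressed summary of the past
        outputs.append(C @ h)    # output depends only on the current state
    return np.stack(outputs)

# Usage: only `h` is carried between steps, so a 32K-token sequence
# costs 32x a 1K-token one in time, not 1024x, and no cache accumulates.
seq = rng.normal(size=(1024, d_model))
y = ssm_forward(seq)
print(y.shape)  # (1024, 64)
```

The catch, and the reason the benchmarks below matter, is that everything the model will ever need about earlier tokens has to fit into that fixed-size state.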