TildAlice

Posted on • Originally published at tildalice.io

Mamba vs RWKV: 32K Context Benchmark on A100

The Promise vs The Reality

Mamba and RWKV both claim to solve the $O(n^2)$ attention bottleneck that kills Transformer scaling beyond 8K tokens. The pitch is seductive: linear-time inference, constant memory per token, no KV cache explosion. I wanted to see if either delivers on real long-context tasks — not the cherry-picked benchmarks from the papers, but the messy reality of 32K-token summarization and QA retrieval.
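
To put the "KV cache explosion" in numbers, here's a back-of-the-envelope sketch assuming a generic 7B-class Transformer (32 layers, d_model of 4096, fp16, full multi-head attention). The configuration is an assumption for illustration, not something measured in this benchmark:

```python
# Hypothetical 7B-class Transformer config (assumed for illustration, not measured here):
# 32 layers, d_model = 4096, fp16 keys/values, full multi-head attention.
layers, d_model, bytes_per_value, context = 32, 4096, 2, 32_768

kv_per_token = 2 * layers * d_model * bytes_per_value   # one K row and one V row per layer
kv_total = kv_per_token * context

print(f"{kv_per_token / 2**10:.0f} KiB per token")    # 512 KiB
print(f"{kv_total / 2**30:.0f} GiB at 32K context")   # 16 GiB per sequence, before activations
```

That cache grows linearly with context and has to live in GPU memory for the whole generation, which is exactly the cost both architectures claim to eliminate.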

Spoiler: one architecture chokes past 16K. The other scales but trades accuracy for speed in ways the paper glossed over.

For background, see the original Mamba paper (Gu & Dao, 2023) and the RWKV paper (Peng et al., 2023).

What State Space Models Actually Do

Both Mamba and RWKV replace attention with recurrent state updates. The core idea: instead of comparing every token to every other token (the $O(n^2)$ operation in self-attention), maintain a fixed-size hidden state that gets updated as you process each token sequentially.
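
To make that concrete, here's a minimal NumPy sketch of the shared recurrence. The matrices and dimensions are invented for illustration; real Mamba makes the update input-dependent (selective) and RWKV uses a time-mixing formulation, but both keep the same constant-size state property:

```python
import numpy as np

# Illustrative sizes only, not taken from either paper.
d_model, d_state, seq_len = 64, 16, 32_768

rng = np.random.default_rng(0)
A = rng.standard_normal((d_state, d_state)) * 0.01   # state transition
B = rng.standard_normal((d_state, d_model)) * 0.01   # input projection
C = rng.standard_normal((d_model, d_state)) * 0.01   # output projection

def recurrent_scan(x):
    """Process tokens sequentially with a fixed-size hidden state.

    x: (seq_len, d_model) token embeddings.
    Returns (seq_len, d_model) outputs in O(seq_len) time.
    """
    h = np.zeros(d_state)
    y = np.empty_like(x)
    for t in range(len(x)):
        h = A @ h + B @ x[t]   # state update: cost is constant per token
        y[t] = C @ h           # readout from the current state
    return y

tokens = rng.standard_normal((seq_len, d_model))
out = recurrent_scan(tokens)
print(out.shape)  # (32768, 64); no n-by-n attention matrix is ever materialized
```

The loop touches each token once, and the only thing carried forward is the d_state-sized vector `h`, which is what lets these models skip the KV cache entirely.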


Continue reading the full article on TildAlice
