The Promise vs The Reality
Mamba and RWKV both claim to solve the $O(n^2)$ attention bottleneck that kills Transformer scaling beyond 8K tokens. The pitch is seductive: linear-time inference, constant memory per token, no KV cache explosion. I wanted to see if either delivers on real long-context tasks — not the cherry-picked benchmarks from the papers, but the messy reality of 32K-token summarization and QA retrieval.
Spoiler: one architecture chokes past 16K. The other scales but trades accuracy for speed in ways its paper glossed over.
If you want the primary sources, see the original Mamba paper (Gu & Dao, 2023) and the RWKV paper (Peng et al., 2023).
What State Space Models Actually Do
Both Mamba and RWKV replace attention with recurrent state updates. The core idea: instead of comparing every token to every other token (the $O(n^2)$ operation in self-attention), maintain a fixed-size hidden state that gets updated as you process each token sequentially.
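To make that concrete, here is a minimal sketch of a linear recurrence with a fixed-size state. This is not Mamba's selective scan or RWKV's time-mixing; the matrices, dimensions, and function names are illustrative assumptions, just to show why per-token cost stays constant as the sequence grows.

```python
import numpy as np

# Illustrative toy, not either architecture's real kernel.
d_state = 16    # fixed hidden-state size, independent of sequence length
d_model = 64    # per-token embedding size

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
B = rng.normal(scale=0.1, size=(d_state, d_model))  # input projection
C = rng.normal(scale=0.1, size=(d_model, d_state))  # output projection

def ssm_forward(tokens):
    """Process tokens left to right with a fixed-size state.

    Cost is O(n) in sequence length and memory per token is constant,
    unlike attention's O(n^2) pairwise comparisons and growing KV cache.
    """
    h = np.zeros(d_state)
    outputs = []
    for x in tokens:             # x: (d_model,) embedding of one token
        h = A @ h + B @ x        # update a compressed summary of the past
        outputs.append(C @ h)    # output depends only on the current state
    return np.stack(outputs)

# Usage: only `h` is carried between steps, so a 32K-token sequence
# costs 32x a 1K-token one in time, not 1024x, and no cache accumulates.
seq = rng.normal(size=(1024, d_model))
y = ssm_forward(seq)
print(y.shape)  # (1024, 64)
```

The catch, and the reason the benchmarks below matter, is that everything the model will ever need about earlier tokens has to fit into that fixed-size state.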