Mamba-2 vs Mamba vs Transformer: Long Range Arena Results

#mamba2 #mamba #transformer #statespacemodels

The Promise vs the Reality

Mamba-2 claims to fix Mamba's hardware inefficiency while keeping its linear-time magic. The original paper reports impressive throughput numbers: a core layer that runs 2-8x faster than Mamba's, with speed competitive with Transformers on A100s. But I wanted to see whether that speed came at an accuracy cost, especially on tasks where long-range dependencies actually matter.

Long Range Arena (LRA) is the benchmark everyone uses to prove their architecture "handles long sequences better." It's a suite of tasks (ListOps, text classification, document retrieval, image classification, Pathfinder, and Path-X) designed to stress-test models on sequences ranging from 1K to 16K tokens. If you're going to claim you beat Transformers at long context, you need to show LRA numbers.

Here's what I found: Mamba-2 doesn't just match Mamba's accuracy; it actually improves on several LRA tasks while being substantially faster. But there's a catch the paper downplays.