DEV Community

Kevin

Mamba-3 and AttnRes: AI Architecture Research Is Finally Building for Inference, Not Just Training

The dominant narrative in AI research for the past few years has been about training: bigger models, better loss curves, faster matmuls. But something shifted quietly in 2025, and by early 2026, it's becoming impossible to ignore. Two papers that dropped this week — Mamba-3 from researchers at CMU, Princeton, and Together AI, and Attention Residuals (AttnRes) from MoonshotAI — signal that the field is finally starting to take inference seriously as a first-class architectural concern.

These aren't incremental improvements. They're a rethinking of foundational design choices that have been baked into modern LLMs since before GPT-2 landed.


The Context: Why Inference Is Eating Training Alive

Here's the thing that's been quietly obvious to anyone actually running AI systems in production: inference demand has exploded beyond all reasonable expectation.

It's not just that you're serving models to users. It's the compounding effect of everything that's happened since 2024:

  • RL post-training at scale requires generating millions of rollouts per training run. Reinforcement learning with verifiable rewards (RLVR) — the technique behind reasoning models — is inference-bound, not training-bound. You need the model to generate answers to grade, over and over.
  • Agentic workflows — Claude Code, Codex, OpenCode — have pushed per-session token counts through the roof. An agentic coding session might generate 50,000+ tokens where a simple chat completion generates 500.
  • Long-context tasks like document analysis and RAG pipelines are becoming table stakes, not edge cases.

The result: inference compute is growing faster than training compute. Your GPU is now mostly serving, not learning.

This is the backdrop against which Mamba-3 was designed. And it changes everything about what "good" looks like in an architecture.


Mamba-3: What Changed, and Why It Matters

State Space Models (SSMs) have been the most credible alternative to Transformers in language modeling for the past couple of years. The pitch is simple: a fixed-size recurrent state gives you O(1) per-token inference cost instead of the Transformer's O(n) KV cache growth. At long sequence lengths, this is a massive advantage.
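To make the asymptotics concrete, here is a toy decode loop (illustrative only, not the Mamba-3 kernels; all shapes and names are hypothetical) showing why an SSM's per-token memory stays flat while a Transformer's KV cache grows with every generated token:

```python
import numpy as np

# Toy decode loop: a recurrent SSM keeps a fixed-size state,
# while a Transformer appends one (k, v) pair per generated token.
d_state, d_head = 16, 64

def ssm_decode_step(state, x, A, B):
    """One recurrent update: the state stays a fixed-size array."""
    return A * state + B * x  # elementwise toy recurrence

A = np.full(d_state, 0.9)
B = np.ones(d_state)

state = np.zeros(d_state)
kv_cache = []
for t in range(1000):
    x = 1.0
    state = ssm_decode_step(state, x, A, B)
    kv_cache.append((np.zeros(d_head), np.zeros(d_head)))

print(state.size)     # fixed: 16 floats regardless of sequence length
print(len(kv_cache))  # grows linearly: 1000 entries after 1000 tokens
```

At generation step 1,000 the SSM is still touching the same 16 floats, while the attention decoder must read 1,000 cached key/value pairs.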

But here's the problem: Mamba-2, the previous state of the art, was optimized for training speed.

"Since the release of Mamba-2 in mid-2024, most architectures have switched from Mamba-1. Why? Mamba-2 made the bet that training efficiency was the largest bottleneck for state space models." — Together AI blog

Mamba-2 achieved fast training by simplifying the underlying recurrence — specifically, collapsing the transition matrix to a scalar times identity. This made the math clean and training fast. But it left the inference step "too simple and squarely memory-bound." The GPUs weren't computing; they were just moving memory around.

Mamba-3 inverts this. Designed with inference efficiency as the primary goal, the team made three concrete changes:

1. More Expressive Recurrence

The team derived a new recurrence using an exponential-trapezoidal discretization scheme. Without getting lost in the math: this makes each hidden state update richer and more computationally dense, meaning the GPU's tensor cores actually have something to chew on during decoding. More work per memory access = better hardware utilization = faster wall-clock inference.

As a side effect, this new discretization implicitly handles what the old "short causal convolution" used to do explicitly — so Mamba-3 drops that component entirely, simplifying the overall architecture.
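For intuition about why a trapezoid-style rule is "richer" per step, compare it against forward Euler on a simple linear ODE. This is a generic numerical-analysis sketch, not Mamba-3's exact exponential-trapezoidal scheme:

```python
import numpy as np

# Generic discretization sketch for dx/dt = a*x + b*u.
# Forward Euler uses only the left endpoint of each interval;
# the trapezoidal rule averages both endpoints, so each state
# update mixes in more information (and more arithmetic) per step.
a, b, dt = -0.5, 1.0, 0.1

def euler_step(x, u):
    return x + dt * (a * x + b * u)

def trapezoid_step(x, u_prev, u_next):
    denom = 1.0 - 0.5 * dt * a
    return ((1.0 + 0.5 * dt * a) * x
            + 0.5 * dt * b * (u_prev + u_next)) / denom

x_euler = x_trap = 1.0
for t in range(100):
    u_now, u_next = np.sin(0.1 * t), np.sin(0.1 * (t + 1))
    x_euler = euler_step(x_euler, u_now)
    x_trap = trapezoid_step(x_trap, u_now, u_next)
```

The trapezoidal update does more work per memory access, which is exactly the property you want when decoding is memory-bound.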

2. Complex-Valued States

Mamba-1 and 2 operated with real-valued hidden states. Mamba-3 introduces a complex-valued SSM system.

This matters because complex numbers naturally encode rotation and oscillation — phenomena that are useful for tracking position, periodicity, and phase relationships in sequences. RoPE (Rotary Position Embeddings), which is now standard in transformers, exploits exactly this intuition. Mamba-3 brings it to the SSM world, expressing complex transitions via rotations and implementing them through a RoPE module — avoiding the need for expensive kernel reimplementations.
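The rotation equivalence is easy to verify numerically. The sketch below (purely illustrative, not Mamba-3's implementation) shows that multiplying a complex state by e^{iθ} each step is identical to applying a RoPE-style 2-D rotation to a real vector:

```python
import numpy as np

# Complex transition vs. real 2-D rotation: the same operation.
theta = 0.3
z = 1.0 + 0.0j                 # complex-valued state
v = np.array([1.0, 0.0])       # equivalent real 2-vector
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

for _ in range(10):
    z = z * np.exp(1j * theta)  # complex multiplication by e^{i*theta}
    v = R @ v                   # RoPE-style rotation matrix

# Both encode the same phase after 10 steps:
print(np.allclose([z.real, z.imag], v))  # True
```

This is why Mamba-3 can express the complex transitions through a RoPE module instead of writing new complex-arithmetic kernels.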

3. MIMO: Multiple SSMs in Parallel

Traditional SSMs are SISO — Single Input, Single Output. Each layer processes one channel of information through one state. Mamba-3 introduces MIMO (Multi-Input, Multi-Output) SSMs, running multiple SSMs in parallel per layer.

This is clever because of how GPU arithmetic works. During decoding, each timestep performs so little compute that hardware tensor cores sit idle while memory buses are saturated. MIMO adds more FLOPs per timestep, but since those FLOPs fit within the idle compute capacity, they don't increase latency — you get a free accuracy upgrade.

The result: Mamba-3 MIMO boosts downstream accuracy by over 1 percentage point at 1.5B scale versus SISO, with no increase in decoding latency. Training is somewhat slower, but inference is not.
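A minimal sketch of the MIMO idea, with hypothetical shapes (not the real kernels): instead of one scalar recurrence per channel, a small matrix recurrence mixes several input and output channels per state update, turning memory-bound scalar ops into compute-dense matmuls:

```python
import numpy as np

# SISO -> MIMO sketch: project multiple inputs into the state and
# read multiple outputs back out, so each decode step does matmuls
# (more FLOPs) rather than isolated scalar updates.
d_state, n_in, n_out = 16, 4, 4
rng = np.random.default_rng(0)

A = 0.9 * np.eye(d_state)              # state transition
B = rng.normal(size=(d_state, n_in))   # multi-input projection
C = rng.normal(size=(n_out, d_state))  # multi-output projection

state = np.zeros(d_state)
x = rng.normal(size=n_in)

state = A @ state + B @ x  # one decode step: dense matmuls
y = C @ state              # several output channels per step
```

Because those extra FLOPs fit inside the tensor cores' idle capacity during decoding, the added expressiveness is effectively free in wall-clock terms.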

The Benchmarks

The headline result: Mamba-3 SISO beats Mamba-2, Gated DeltaNet, and even Llama-3.2-1B (a full Transformer) on prefill+decode latency across all sequence lengths at the 1.5B scale.

That last point deserves emphasis: a Mamba-3 SSM at 1.5B is faster than a Transformer of comparable quality. Not just on long sequences where SSMs have an obvious advantage, but across all sequence lengths.

Language modeling quality is also improved over Mamba-2 across all tested scales. The MIMO variant goes further, though with higher training costs.

The team open-sourced the kernels, built using Triton, TileLang, and CuTe DSL for maximum hardware performance. Everything is available at Goomba Lab.


AttnRes: MoonshotAI Rethinks the Residual Connection

On the Transformer side, MoonshotAI dropped something equally interesting: Attention Residuals (AttnRes), a drop-in replacement for the humble residual connection that's been a fixture of neural network design since ResNet.

Standard residual connections are dead simple:

h_l = h_{l-1} + F(h_{l-1})

Each layer takes the previous layer's output, applies a transformation, and adds the result back. Uniform weights. Fixed accumulation. No selectivity whatsoever.

The problem at scale: as you add more layers, this uniform aggregation dilutes each layer's contribution. Every new layer competes equally with all previous layers for influence over the final representation. The hidden-state magnitudes also grow unboundedly with depth — a well-documented issue with PreNorm architectures.
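The dilution claim is easy to check numerically. In this quick sketch, each layer adds a random unit-norm update; the hidden state's norm keeps growing with depth, so any single layer's fixed-size contribution becomes a shrinking fraction of the whole:

```python
import numpy as np

# Numeric check of residual-stream growth: with plain additive
# residuals, near-orthogonal unit-norm updates make the hidden
# state's norm grow with depth (roughly like sqrt(L)).
rng = np.random.default_rng(0)
d, L = 256, 64

h = rng.normal(size=d)
h /= np.linalg.norm(h)

norms = []
for _ in range(L):
    f = rng.normal(size=d)
    f /= np.linalg.norm(f)  # each layer contributes a unit-norm update
    h = h + f
    norms.append(np.linalg.norm(h))

print(norms[-1] > norms[0])  # the stream's norm keeps growing with depth
```

So at depth 64, a new layer's unit-norm update is fighting a residual stream roughly eight times its own magnitude.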

AttnRes replaces this with softmax attention over all preceding layer outputs:

h_l = Σ α(i→l) · v_i   for i in 0..l-1

Where the weights α are computed via a single learned pseudo-query per layer. This gives every layer selective, input-dependent access to earlier representations — instead of being forced to accept whatever the previous layer handed it.
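A minimal numpy sketch of that mechanism as described above. The names and shapes are assumptions for illustration, not MoonshotAI's code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_residual(layer_outputs, pseudo_query):
    """Selective aggregation of earlier layer outputs.

    layer_outputs: list of (d,) vectors v_0 .. v_{l-1}
    pseudo_query:  (d,) learned vector for the current layer
    """
    V = np.stack(layer_outputs)  # (l, d)
    scores = V @ pseudo_query    # one score per earlier layer
    alpha = softmax(scores)      # input-dependent weights
    return alpha @ V             # weighted combination, not fixed sum

rng = np.random.default_rng(0)
d = 8
outputs = [rng.normal(size=d) for _ in range(5)]
q = rng.normal(size=d)
h = attn_residual(outputs, q)
```

The key difference from a standard residual: the weights `alpha` are learned and input-dependent, so a deep layer can up-weight an early representation it needs and down-weight the rest.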

The results are striking. Block AttnRes (a practical variant that groups layers into blocks to keep memory manageable) matches the loss of a baseline trained with 1.25× more compute. That's a free 25% compute efficiency gain, achieved purely through a better residual connection.

Block AttnRes groups layers into N blocks (~8 blocks recovers most of the gain), applies standard residuals within blocks, and uses attention only at block boundaries. Memory footprint is O(Nd) instead of O(Ld), making it practical even at scale.
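The memory arithmetic behind that O(Nd) vs O(Ld) claim, with hypothetical sizes:

```python
# Why Block AttnRes shrinks the activation footprint:
# full AttnRes must keep every layer's output alive for later
# attention; Block AttnRes keeps only one output per block boundary.
d = 4096  # hidden size
L = 64    # layers
N = 8     # blocks

full_attnres_floats = L * d   # O(L*d): all layer outputs retained
block_attnres_floats = N * d  # O(N*d): one output per block

print(full_attnres_floats // block_attnres_floats)  # 8x fewer activations
```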

The implementation is available on GitHub with clean PyTorch pseudocode, and the paper is on arXiv at 2603.15031.


The Deeper Thread: Architecture Research Is Growing Up

Reading these two papers together, a pattern emerges.

For most of the deep learning era, architecture research was driven by a single question: how do we make training faster? Batch normalization, skip connections, attention — all of these were primarily evaluated on training metrics. Get loss down. Win benchmarks. Ship.

That made sense when training was the bottleneck. But training isn't the bottleneck anymore.

Mamba-3 is explicit about this shift. The paper's framing is almost confrontational: other linear models were designed with a training-first perspective. We didn't do that. And then they show you why it matters.

AttnRes is less overtly inference-focused, but the insight is similar: the standard residual connection was designed for convergence, not necessarily for quality of representation at depth. When you actually think carefully about what you need at inference time — rich, selective aggregation of layer-wise information — a fixed accumulation scheme looks pretty crude.


What About the Hybrid Future?

One thing both papers agree on: pure SSM models still have a retrieval problem. Because SSMs maintain a fixed-size state, they have to compress everything into that representation. Transformers, with their ever-growing KV cache, can do exact lookup of any prior token. For needle-in-a-haystack tasks, attention wins.

The Mamba-3 team's prediction: linear layers will predominantly be used in conjunction with global self-attention layers going forward. Not either/or. Hybrid architectures that combine SSM efficiency with Transformer recall.

This matches what we're already seeing in production models. Jamba, Zamba, and similar hybrid architectures interleave attention and SSM layers — getting the efficiency of SSMs for most of the sequence while using attention where precision matters.

Mamba-3 and AttnRes both make those components better. Which means hybrid architectures just got a free upgrade on multiple fronts.


The Practical Takeaway

If you're building or fine-tuning models, here's what this week's research means for you:

  1. If you're working on inference-heavy applications (agents, RL pipelines, long-context tasks): watch the SSM space closely. Mamba-3's inference-first design philosophy is going to become the norm, not the exception.

  2. If you're training from scratch or experimenting with custom architectures: AttnRes is a low-risk, meaningful improvement. One changed component, 1.25× compute equivalent gain. That's a good trade.

  3. If you're thinking about architecture at a systems level: the training-first era is ending. Chips are getting faster, but inference demand is growing faster than chips can keep up. Architecture choices that were "good enough" when training dominated are going to look increasingly expensive.

The Transformer isn't going away. But the version of the Transformer (and SSM) that ships in 2027 is going to look meaningfully different from what we have today. Both of these papers point in the same direction.


Inference is the new training. The architectures are catching up.
