State Space Models offer linear-time sequence modeling with content-aware selective filtering, challenging Transformers for long-context inference.
Why This Matters
State Space Models (SSMs) provide a principled alternative to Transformers for long-sequence modeling. In production systems handling long contexts (e.g., code generation, genomic analysis), the cost of Transformer attention, which is quadratic in sequence length, becomes a bottleneck. Mamba, a selective SSM, runs in time linear in sequence length and decodes with a constant-size state, making it viable for million-token contexts where attention-based models are prohibitively expensive.
Core Idea
SSMs originate from continuous-time control theory: a latent state evolves over time driven by input, and observations are linear projections of that state. Mamba's key innovation is making the SSM parameters input-dependent (selective): the model learns to gate which information enters and exits the state, mimicking attention's ability to focus on relevant tokens without its quadratic cost.
Technical Details
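The continuous-time SSM is defined as:
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$
where $h(t) \in \mathbb{R}^N$ is the latent state, $x(t) \in \mathbb{R}$ is the input, and $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$. Using zero-order hold discretization with step $\Delta$:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$
The recurrent update becomes:
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$
Mamba's selective mechanism makes $B$, $C$, and $\Delta$ input-dependent:
$$B_t = \mathrm{Linear}_B(x_t), \qquad C_t = \mathrm{Linear}_C(x_t), \qquad \Delta_t = \mathrm{softplus}\left(\mathrm{Linear}_\Delta(x_t)\right)$$
The parallel scan algorithm computes this recurrence in $O(L)$ work (with $O(\log L)$ parallel depth) over a sequence of length $L$ during training. Inference is $O(1)$ per token with fixed state size $N$, yielding constant-memory decoding regardless of sequence length.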
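To make the discretization and the recurrence concrete, here is a minimal NumPy sketch, not Mamba's actual kernel. It assumes a diagonal $A$ stored as a vector, a single scalar input channel, and the update $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$ with zero-order hold. The function names (`zoh_discretize`, `recurrent_scan`, `combine`, `prefix_scan`) and the choice of $A$ are illustrative assumptions. The divide-and-conquer scan shown here does $O(L \log L)$ work; it only demonstrates that the recurrence is an associative operation that can be split into independent halves, which is what a parallel scan exploits.

```python
import numpy as np

def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal A (a_diag holds the diagonal entries)."""
    a_bar = np.exp(delta * a_diag)
    # (delta*A)^{-1} (exp(delta*A) - I) * delta*B, elementwise because A is diagonal
    b_bar = (a_bar - 1.0) / a_diag * b
    return a_bar, b_bar

def recurrent_scan(a_bar, b_bar, c, x):
    """Sequential form: h_t = a_bar * h_{t-1} + b_bar * x_t, y_t = c . h_t."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        ys.append(c @ h)
    return np.array(ys)

def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def prefix_scan(elems):
    """Inclusive prefix scan; the two recursive calls are independent."""
    if len(elems) == 1:
        return elems
    mid = len(elems) // 2
    left = prefix_scan(elems[:mid])
    right = prefix_scan(elems[mid:])
    carry = left[-1]
    return left + [combine(carry, r) for r in right]

# Illustrative setup: state size N = 4, sequence length L = 16.
rng = np.random.default_rng(0)
N, L = 4, 16
a_diag = -np.arange(1.0, N + 1)   # assumed stable (negative) diagonal
b = rng.standard_normal(N)
c = rng.standard_normal(N)
x = rng.standard_normal(L)

a_bar, b_bar = zoh_discretize(a_diag, b, delta=0.1)
y_seq = recurrent_scan(a_bar, b_bar, c, x)

# Each step is the affine map (a_bar, b_bar * x_t); with h_{-1} = 0,
# the second component of each prefix is exactly h_t.
elems = [(a_bar, b_bar * x_t) for x_t in x]
y_par = np.array([c @ h for _, h in prefix_scan(elems)])
```

Both paths produce the same outputs up to floating-point error, which can be checked with `np.allclose(y_seq, y_par)`. At inference time only the sequential form is needed, and it carries nothing forward except the length-$N$ vector `h`.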
How It Works
- Project input: Map token $x_t$ to an expanded inner dimension (the model dimension $D$ scaled by an expansion factor).
- Generate selective parameters: Compute input-dependent $B_t$, $C_t$, $\Delta_t$ from $x_t$.
- Discretize: Convert continuous $(A, B_t)$ to discrete $(\bar{A}_t, \bar{B}_t)$ using $\Delta_t$.
- Recurrent scan: Apply parallel scan (training) or sequential update (inference) to compute hidden states $h_t$.
- Output projection: Compute $y_t = C_t h_t$, then project through gating (SiLU) to output dimension.
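The five steps above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions, not the reference implementation: it uses the sequential update only, a diagonal $A$ per channel, dense random weights, and a full-rank projection for $\Delta$. It also omits pieces of the real Mamba block such as the short convolution and the skip connection. The helper `init_params` and all weight names (`W_in`, `W_gate`, `W_B`, `W_C`, `W_delta`, `W_out`, `A_log`) are hypothetical.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def silu(z):
    return z / (1.0 + np.exp(-z))

def init_params(d_model, d_inner, d_state, seed=0):
    """Hypothetical random weights, scaled by fan-in."""
    rng = np.random.default_rng(seed)
    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)
    return {
        "W_in": w(d_model, d_inner),
        "W_gate": w(d_model, d_inner),
        "W_B": w(d_inner, d_state),
        "W_C": w(d_inner, d_state),
        "W_delta": w(d_inner, d_inner),
        "delta_bias": np.zeros(d_inner),
        # A = -exp(A_log) keeps every diagonal entry negative (stable).
        "A_log": np.log(np.tile(np.arange(1.0, d_state + 1), (d_inner, 1))),
        "W_out": w(d_inner, d_model),
    }

def selective_ssm_block(tokens, params):
    """tokens: (L, d_model) array. Returns an (L, d_model) array."""
    # 1. Project input to the expanded dimension; a second branch feeds the gate.
    u = tokens @ params["W_in"]        # (L, d_inner)
    gate = tokens @ params["W_gate"]   # (L, d_inner)

    # 2. Generate selective (input-dependent) parameters.
    B = u @ params["W_B"]              # (L, d_state)
    C = u @ params["W_C"]              # (L, d_state)
    delta = softplus(u @ params["W_delta"] + params["delta_bias"])  # (L, d_inner)

    A = -np.exp(params["A_log"])       # (d_inner, d_state), diagonal per channel
    h = np.zeros_like(A)               # fixed-size state
    ys = np.empty_like(u)

    for t in range(u.shape[0]):
        dt = delta[t][:, None]                        # (d_inner, 1)
        # 3. Discretize with zero-order hold (elementwise for diagonal A).
        A_bar = np.exp(dt * A)
        B_bar = (A_bar - 1.0) / A * B[t][None, :]
        # 4. Sequential update of the hidden state.
        h = A_bar * h + B_bar * u[t][:, None]
        # 5. Read out y_t = C_t h_t for every channel.
        ys[t] = h @ C[t]

    # Gate with SiLU, then project back to the model dimension.
    return (ys * silu(gate)) @ params["W_out"]

params = init_params(d_model=8, d_inner=16, d_state=4)
tokens = np.random.default_rng(1).standard_normal((10, 8))
out = selective_ssm_block(tokens, params)
```

Because `delta`, `B`, and `C` are recomputed from each token, the amount written into and read out of `h` varies per position, which is the selectivity the steps describe. The loop is the inference-time path; a training implementation would replace it with a parallel scan.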
Key Insights
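- Selectivity is essential: Non-selective (time-invariant) SSMs such as S4 apply the same dynamics to every token and so struggle with content-based, in-context retrieval; making $B$, $C$, and $\Delta$ input-dependent enables content-aware filtering.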
- Structure on $A$ keeps the recurrence cheap: S4 uses a diagonal-plus-low-rank parameterization of $A$, whereas Mamba uses diagonal matrices exclusively.
- Hardware-aware design: The scan kernel is IO-bound (limited by memory bandwidth), not compute-bound. Mamba's CUDA kernel fuses discretization, scan, and output projection to minimize memory reads.
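- Linear decoding cost: Unlike a KV cache, which grows linearly with sequence length, the SSM state has a fixed size ($N$ per channel), so generation memory stays constant and total decoding cost grows only linearly with the number of tokens.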
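The memory contrast in the last point is simple arithmetic, sketched below. The two helper functions and every model dimension in the example are hypothetical, chosen only to show the scaling; the SSM estimate counts the recurrent state alone and ignores any other per-layer buffers.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values: two tensors per layer, each seq_len x n_kv_heads x head_dim."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """One d_inner x d_state state per layer, independent of sequence length."""
    return n_layers * d_inner * d_state * bytes_per_elem

# Hypothetical configurations, 16-bit elements.
for seq_len in (1_000, 100_000, 1_000_000):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128)
    ssm = ssm_state_bytes(n_layers=64, d_inner=8192, d_state=16)
    print(f"L={seq_len:>9,}  KV cache: {kv / 2**30:8.2f} GiB  SSM state: {ssm / 2**20:6.2f} MiB")
```

The KV-cache figure scales with `seq_len` while the SSM figure does not change between rows, which is the whole point of the comparison.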
Sources
- Gu, A. & Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752 (2023). https://arxiv.org/abs/2312.00752
- Gu, A. & Dao, T. "Mamba-3: Improved Sequence Modeling using State Space Principles." arXiv:2603.15569 (2026). https://arxiv.org/abs/2603.15569