Mamba/SSM Basics

#ai #mamba #llm #ssm

State Space Models offer linear-time sequence modeling with content-aware selective filtering, challenging Transformers for long-context inference.

Why This Matters

State Space Models (SSMs) provide a principled alternative to Transformers for long-sequence modeling. In production systems handling long contexts (e.g., code generation, genomic analysis), Transformer attention's quadratic cost becomes a bottleneck. Mamba achieves linear-time inference with constant-memory state, making it viable for million-token contexts where attention-based models are prohibitively expensive.

Core Idea

SSMs originate from continuous-time control theory: a latent state evolves over time driven by input, and observations are linear projections of that state. Mamba's key innovation is making the SSM parameters input-selective — the model learns to gate which information enters and exits the state, mimicking attention's ability to focus on relevant tokens without the $O(n^2)$ cost.

Technical Details

The continuous-time SSM is defined as:

x'(t) = Ax(t) + Bu(t), \quad y(t) = Cx(t) + Du(t)

where $x(t) \in \mathbb{R}^N$ is the latent state, $u(t)$ is the scalar input, $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, and $D$ is a scalar skip term that the discrete form below omits. Applying zero-order hold discretization with step size $\Delta$ gives:

\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B
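
For a diagonal $A$, the zero-order hold formulas reduce to scalar operations on each diagonal entry. The following is a minimal sketch under that assumption; the function name `zoh_discretize` and the example values are illustrative, not taken from any library.

```python
import math

def zoh_discretize(a, b, delta):
    """Zero-order hold for one diagonal entry of A.

    a: nonzero diagonal entry of A (assumed negative for stability)
    b: matching entry of B
    delta: step size, assumed positive
    """
    a_bar = math.exp(delta * a)
    # (delta*a)^{-1} * (exp(delta*a) - 1) * delta*b simplifies to
    # (exp(delta*a) - 1) / a * b; expm1 keeps precision for small delta.
    b_bar = math.expm1(delta * a) / a * b
    return a_bar, b_bar

a_bar, b_bar = zoh_discretize(a=-1.0, b=1.0, delta=0.1)
# a_bar ≈ 0.9048, b_bar ≈ 0.0952
```

For small $\Delta$, $\bar{B}$ approaches $\Delta B$, which is the first-order (Euler) approximation.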

The recurrent update becomes:

x_k = \bar{A}x_{k-1} + \bar{B}u_k, \quad y_k = Cx_k
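
As a rough illustration, the recurrence can be unrolled in a few lines. This sketch assumes a diagonal $\bar{A}$ stored as a list, a zero initial state, and scalar inputs; `ssm_recurrence` is a hypothetical helper name.

```python
def ssm_recurrence(a_bar, b_bar, c, inputs):
    """Run x_k = a_bar * x_{k-1} + b_bar * u_k and y_k = c . x_k.

    a_bar, b_bar, c: lists of length N (diagonal A assumed)
    inputs: list of scalar inputs u_k
    """
    state = [0.0] * len(a_bar)  # assumed zero initial state
    outputs = []
    for u in inputs:
        # Elementwise update, valid because A is diagonal
        state = [a * x + b * u for a, x, b in zip(a_bar, state, b_bar)]
        outputs.append(sum(ci * xi for ci, xi in zip(c, state)))
    return outputs

ys = ssm_recurrence(a_bar=[0.5], b_bar=[1.0], c=[1.0], inputs=[1.0, 1.0, 1.0])
# ys == [1.0, 1.5, 1.75]
```

The state list keeps the same length for every step, which is why decoding memory does not grow with sequence length.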

Mamba's selective mechanism makes $B$, $C$, and $\Delta$ functions of the input $u_k$:

B_k = \text{Linear}_B(u_k), \quad C_k = \text{Linear}_C(u_k), \quad \Delta_k = \text{softplus}(\text{Linear}_\Delta(u_k))
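
A minimal sketch of the selective parameters, computed from the input token as the step list in How It Works describes. It simplifies to a scalar input $u_k$, so each linear map becomes a weight vector or scalar; the weights `w_b`, `w_c`, `w_delta`, and `bias_delta` are hypothetical placeholders for learned parameters.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)); always positive
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def selective_params(u, w_b, w_c, w_delta, bias_delta):
    """Compute input-dependent B_k, C_k, Delta_k for a scalar input u.

    w_b, w_c: lists of length N (stand-ins for Linear_B, Linear_C)
    w_delta, bias_delta: scalars (stand-in for Linear_Delta)
    """
    b_k = [w * u for w in w_b]
    c_k = [w * u for w in w_c]
    delta_k = softplus(w_delta * u + bias_delta)  # softplus keeps the step positive
    return b_k, c_k, delta_k

b_k, c_k, delta_k = selective_params(
    u=2.0, w_b=[0.5, -1.0], w_c=[1.0, 0.25], w_delta=0.0, bias_delta=0.0
)
# b_k == [1.0, -2.0], c_k == [2.0, 0.5], delta_k == log(2) ≈ 0.6931
```

The softplus guarantees $\Delta_k > 0$, so the discretization always moves forward in time.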

The parallel scan algorithm computes this recurrence with $O(n)$ work and $O(\log n)$ sequential depth during training. At inference, each token costs $O(1)$ time with a fixed state size $N$, so decoding memory stays constant regardless of sequence length.
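
The recurrence can be parallelized because the update $x_k = a_k x_{k-1} + b_k$ composes associatively. The sketch below shows the combine operator on scalar pairs $(a_k, b_k)$ and checks it against the sequential loop. The divide-and-conquer `scan` is written for clarity and runs serially here; it is not the work-efficient kernel a GPU implementation would use, and all names are illustrative.

```python
def combine(left, right):
    """Compose two affine updates x -> a*x + b, applying left first."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_r * a_l, a_r * b_l + b_r)

def scan(elems):
    """Inclusive scan under `combine`; the two halves are independent,
    so they could be computed in parallel."""
    if len(elems) == 1:
        return list(elems)
    mid = len(elems) // 2
    left = scan(elems[:mid])
    right = scan(elems[mid:])
    carry = left[-1]  # summary of everything in the left half
    return left + [combine(carry, r) for r in right]

def sequential(elems):
    x, states = 0.0, []
    for a, b in elems:
        x = a * x + b
        states.append(x)
    return states

elems = [(0.5, 1.0), (0.25, 2.0), (2.0, 3.0), (0.5, 4.0), (1.0, 5.0)]
scanned_states = [b for _, b in scan(elems)]
# Both give [1.0, 2.25, 7.5, 7.75, 12.75]
```

With a zero initial state, the second component of each scanned pair is exactly the hidden state at that position.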

How It Works

Project input: Map token $u_k$ to an expanded channel dimension $D$; each channel carries its own $N$-dimensional state, for $D \cdot N$ state entries in total.
Generate selective parameters: Compute input-dependent $B_k$, $C_k$, $\Delta_k$ from $u_k$.
Discretize: Convert continuous $(A, B_k)$ to discrete $(\bar{A}_k, \bar{B}_k)$ using $\Delta_k$.
Recurrent scan: Apply a parallel scan (training) or a sequential update (inference) to compute the hidden states $x_k$.
Output projection: Compute $y_k = C_k x_k$, apply SiLU gating, then project to the output dimension.
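
The steps above can be sketched end to end for a single channel in inference mode. This is a simplified illustration, not Mamba's actual layer: it skips the input and output projections, uses a scalar input, assumes a diagonal $A$ with nonzero negative entries, and treats `w_b`, `w_c`, `w_delta`, and `w_gate` as hypothetical learned weights.

```python
import math

def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def silu(z):
    return z / (1.0 + math.exp(-z))

def selective_ssm(inputs, a, w_b, w_c, w_delta, w_gate):
    """Sequential selective SSM over scalar inputs for one channel.

    a: list of N nonzero diagonal entries of A
    Returns the gated outputs and the final state.
    """
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        # Generate selective parameters from the current input
        delta = softplus(w_delta * u)
        b_k = [w * u for w in w_b]
        c_k = [w * u for w in w_c]
        # Discretize with zero-order hold (diagonal case)
        a_bar = [math.exp(delta * ai) for ai in a]
        b_bar = [math.expm1(delta * ai) / ai * bi for ai, bi in zip(a, b_k)]
        # Recurrent update of the fixed-size state
        state = [ab * x + bb * u for ab, x, bb in zip(a_bar, state, b_bar)]
        # Output readout followed by SiLU gating
        y = sum(ci * xi for ci, xi in zip(c_k, state))
        outputs.append(y * silu(w_gate * u))
    return outputs, state

outs, final_state = selective_ssm(
    [1.0, -0.5, 2.0], a=[-1.0, -2.0],
    w_b=[1.0, 0.5], w_c=[0.5, 1.0], w_delta=1.0, w_gate=1.0,
)
```

However long `inputs` is, `final_state` keeps length $N$; only the output list grows.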

Key Insights

Selectivity is essential: Non-selective SSMs such as S4 apply the same dynamics to every token, so they struggle with content-based tasks like in-context retrieval; making $B$, $C$, $\Delta$ input-dependent enables content-aware filtering.
Structured $A$: S4 uses a diagonal-plus-low-rank $A$ to keep the recurrence cheap; Mamba uses a diagonal $A$ exclusively, so each state update is an elementwise $O(N)$ operation per channel.
Hardware-aware design: The scan is memory-bound (IO-bound) rather than compute-bound, so Mamba's CUDA kernel fuses discretization, scan, and output projection to minimize memory reads.
Constant-memory decoding: Unlike a Transformer's KV cache, which grows linearly with sequence length, the SSM state has a fixed size of $O(ND)$, so generation memory stays constant.
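
To make the memory comparison concrete, here is a back-of-the-envelope calculation. The layer count and dimensions are hypothetical, chosen only for illustration. The KV-cache formula assumes standard multi-head attention with no cache compression, and the SSM formula counts only the recurrent state.

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    # Keys and values: two (n_tokens, d_model) tensors per layer
    return 2 * n_layers * n_tokens * d_model * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    # One (d_inner, d_state) state per layer, independent of sequence length
    return n_layers * d_inner * d_state * bytes_per_value

# Hypothetical configuration, 16-bit values
kv = kv_cache_bytes(n_tokens=1_000_000, n_layers=48, d_model=2048)
ssm = ssm_state_bytes(n_layers=48, d_inner=4096, d_state=16)
# kv  == 393_216_000_000 bytes (about 393 GB)
# ssm == 6_291_456 bytes (about 6.3 MB)
```

Doubling the context doubles `kv` and leaves `ssm` unchanged.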

Sources

Gu, A. & Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752 (2023). https://arxiv.org/abs/2312.00752
Gu, A. & Dao, T. "Mamba-3: Improved Sequence Modeling using State Space Principles." arXiv:2603.15569 (2026). https://arxiv.org/abs/2603.15569

DEV Community