I've been spending a lot of time recently understanding how the next generation of AI models is being built. Not just using them — actually understanding what's happening under the hood.
And here's what I've realized: the Transformer isn't going anywhere. But it's also not enough on its own anymore.
Let me explain.
The Problem Nobody Talks About
Every major LLM today — GPT-4, Claude, Gemini, Llama — is built on the Transformer architecture. It works. It works incredibly well. But it has one fundamental problem that gets worse as we push for longer contexts:
Self-attention is O(L²).
That means if you double your context window from 64K to 128K tokens, the compute doesn't just double — it quadruples. And the KV cache (the memory that stores past token information during inference) grows linearly with sequence length.
At 128K context with a 32-layer Transformer, you're looking at roughly 128 GB of KV cache. That's not fitting on a single GPU.
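A quick back-of-the-envelope check of that number. This is a sketch, not the exact figure for any real model: it assumes an 8192-dim model, full multi-head attention, and fp16 storage, and ignores KV-sharing tricks like GQA that real deployments use to shrink the cache:

```python
def kv_cache_gib(n_layers, d_model, seq_len, bytes_per_elem=2):
    # 2x for keys and values; one cached vector per layer, per token,
    # per model dimension; bytes_per_elem = 2 assumes fp16
    return 2 * n_layers * d_model * seq_len * bytes_per_elem / 2**30

size = kv_cache_gib(n_layers=32, d_model=8192, seq_len=128 * 1024)
# size == 128.0 GiB; it doubles every time the context window doubles
```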
This is where Mamba enters the picture. And this is where things get interesting.
Part 1: The Transformer — What You Already Use
Let me break down what's actually happening inside a Transformer. If you've used any modern LLM, this is the engine running underneath.
The Full Architecture
The original Transformer (Vaswani et al., 2017 — "Attention Is All You Need") has two halves — an Encoder and a Decoder:
Most modern LLMs (GPT, Llama, Claude) use decoder-only Transformers — they drop the encoder entirely and just stack decoder layers. But the core mechanism is the same.
How Self-Attention Actually Works
This is the heart of the Transformer. Let me walk through it with a concrete example.
Say you have the sentence: "The cat sat on it"
When the model processes the token "it", it needs to figure out: what does "it" refer to?
Step 1: Each token creates three vectors from its embedding:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I carry?"
Step 2: Compute attention scores = dot product of Q("it") with K(every other token). (The values below are the weights after the softmax in Step 3, which is why they sum to 1.0.)
```mermaid
flowchart LR
T1["The = 0.05"]
T2["cat = 0.72"]
T3["sat = 0.12"]
T4["on = 0.03"]
T5["it = 0.08"]
classDef low fill:#f1f5f9,stroke:#94a3b8,color:#334155
classDef high fill:#c084fc,stroke:#9333ea,stroke-width:3px,color:#3b0764
class T1,T3,T4,T5 low
class T2 high
```
Step 3: Softmax normalizes scores into attention weights (sum = 1.0)
Step 4: Output = weighted sum of all Value vectors
The result? "it" attends most strongly to "cat" (weight 0.72) — it learned that "it" refers to "the cat" without anyone telling it to. That's the magic.
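The four steps can be sketched in a few lines of NumPy. This is a toy illustration: random vectors stand in for the learned Q/K/V projections, so the weights won't reproduce the 0.72 from the example, but the mechanics are identical:

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # numerically stable softmax over each row of scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted sum of all Value vectors
    return weights @ V, weights

# 5 tokens ("The cat sat on it") with toy 4-dim embeddings
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out, w = attention(Q, K, V)
# every row of w sums to 1.0: each token's attention is a distribution
```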
Multi-Head Attention
The model runs this process multiple times in parallel (typically 8-16 heads). Each head learns different relationships:
The Feed-Forward Network
After attention, each token goes through a two-layer MLP independently:
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂
This is where the model stores factual knowledge — it acts like a key-value memory. Research has shown you can actually locate specific facts in specific neurons of the FFN layers.
Residual Connections + Layer Norm
Every sub-layer is wrapped with: LayerNorm(x + SubLayer(x))
This prevents gradient degradation through deep stacks. Without residuals, gradients in a deep stack degrade rapidly and training stalls after only a handful of layers. With them, you can stack 80+ layers.
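A minimal NumPy sketch of the FFN and the post-norm residual wrapper, with toy dimensions. One caveat: most modern LLMs actually moved to pre-norm, i.e. x + SubLayer(LayerNorm(x)); the post-norm form shown here is the original paper's:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # position-wise two-layer MLP: max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-5):
    # normalize each token's vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def residual_block(x, sublayer):
    # post-norm wrapper: LayerNorm(x + SubLayer(x))
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d, d_ff = 8, 32                      # hidden dim is typically ~4x d_model
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
x = rng.normal(size=(5, d))          # 5 tokens
out = residual_block(x, lambda v: ffn(v, W1, b1, W2, b2))
```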
The Cost
| Operation | Training Cost | Inference (per token) | Memory |
|---|---|---|---|
| Self-Attention | O(L²·d) | O(L·d) | O(L·d) KV cache |
| FFN | O(L·d²) | O(d²) | O(d²) weights |
That L² in attention is the killer. Everything else scales linearly.
Part 2: Mamba — The Challenger
Mamba was introduced by Albert Gu and Tri Dao in December 2023. It takes a completely different approach: instead of letting every token look at every other token, it processes the sequence through a Selective State Space Model (SSM).
The Core Idea: A Learned Compression
Think of it like this:
- Transformer: keeps a complete photo album of every past token (KV cache)
- Mamba: keeps a single, constantly-updated summary note (hidden state)
The math comes from continuous dynamical systems:
Continuous form:
h'(t) = A·h(t) + B·x(t) ← hidden state evolution
y(t) = C·h(t) ← output
Discretized (for actual tokens):
h_t = A_bar · h_(t-1) + B_bar · x_t ← update hidden state
y_t = C · h_t ← read output
Where:
- A (transition/decay matrix) — controls how much old state is retained
- B (input matrix) — controls how much of the new token gets written in
- C (output matrix) — controls what gets read out
- Delta (step size) — controls discretization granularity
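The discretized recurrence is tiny to implement. Here's a toy version with a diagonal A and a single input channel (real Mamba runs D of these channels in parallel, with much larger state sizes):

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    # h_t = A_bar * h_{t-1} + B_bar * x_t ;  y_t = C . h_t
    # A_bar is a diagonal matrix stored as a vector of size N
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t   # write the new token into the state
        ys.append(C @ h)              # read the output back out
    return np.array(ys)

# an impulse followed by silence: the state decays geometrically under A_bar
y = ssm_scan(np.full(2, 0.5), np.ones(2), np.ones(2), [1.0, 0.0, 0.0])
# y == [2.0, 1.0, 0.5]
```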
What Makes Mamba Different: Selectivity
Previous state space models (S4, H3) used fixed A, B, C parameters — same for every token. Mamba's key innovation: these parameters are input-dependent.
This is selectivity. The model dynamically decides, per-token:
- When to remember (small Delta = gentle update)
- When to forget (large Delta = aggressive reset)
- What to write (B controls input projection)
- What to output (C controls readout)
A delimiter token might trigger high forgetting. A content word triggers careful accumulation. The model learns this entirely from data.
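Mechanically, selectivity just means Delta, B, and C are recomputed from each input before the state update. This single-channel sketch is heavily simplified: the softplus for Delta and the exp(Delta*A) discretization follow the paper, the plain Delta*B rule for B_bar is a common simplification, and all shapes and weight names here are illustrative:

```python
import numpy as np

def selective_ssm(x, A, w_delta, W_B, W_C):
    # Toy single-channel selective SSM with diagonal A of size N.
    # The key difference from S4: Delta, B, C depend on the current input.
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        delta = np.log1p(np.exp(w_delta * x_t))   # softplus keeps step size positive
        A_bar = np.exp(delta * A)                 # zero-order-hold discretization
        B_bar = delta * (W_B * x_t)               # input-dependent write strength
        C = W_C * x_t                             # input-dependent readout
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = -np.abs(rng.normal(size=4))   # negative entries keep the state stable
y = selective_ssm(rng.normal(size=10), A, 1.0,
                  rng.normal(size=4), rng.normal(size=4))
```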
The Mamba Block
```mermaid
flowchart TB
INPUT["Input - B, L, D"] --> NORM["RMS Norm"]
NORM --> LP1["Branch A: Linear Proj D to E"]
NORM --> LP2["Branch B: Linear Proj D to E - gate"]
LP1 --> CONV["Causal Conv1D kernel=4"]
CONV --> SILU1["SiLU Activation"]
SILU1 --> SSSM["SELECTIVE SSM - Input-dependent A, B, C, Delta"]
LP2 --> SILU2["SiLU Activation"]
SSSM --> MUL["Multiply - Gating"]
SILU2 --> MUL
MUL --> LP3["Linear Proj E to D"]
LP3 --> RES["Residual Add"]
INPUT -.->|"Skip connection"| RES
RES --> OUT["Output - B, L, D"]
classDef neutral fill:#f1f5f9,stroke:#64748b,color:#1e293b
classDef norm fill:#e2e8f0,stroke:#64748b,color:#1e293b
classDef proj fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef conv fill:#fed7aa,stroke:#f97316,stroke-width:2px,color:#7c2d12
classDef ssm fill:#fecaca,stroke:#ef4444,stroke-width:3px,color:#7f1d1d
classDef gate fill:#fce7f3,stroke:#ec4899,stroke-width:2px,color:#831843
classDef res fill:#d1fae5,stroke:#10b981,color:#064e3b
class INPUT,OUT neutral
class NORM,SILU1,SILU2 norm
class LP1,LP2,LP3 proj
class CONV conv
class SSSM ssm
class MUL gate
class RES res
```
The split-branch design is key: one path does the heavy computation (Conv1D followed by SSM), the other acts as a gate. Multiplying them together gives the model fine control over information flow.
The Hardware Trick: Parallel Scan
"But wait," you're thinking, "if Mamba processes tokens sequentially through a recurrence, isn't training slow?"
Here's the engineering brilliance: each state update is an affine map, and composing affine maps is associative. That means you can parallelize the whole recurrence with a scan (prefix sum) operation, a pattern GPUs are incredibly good at.
During training, Mamba achieves near-Transformer parallelism. During inference, it's purely recurrent — just update the hidden state with each new token. No KV cache. Constant memory. Constant time per token.
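You can verify the associativity claim directly. Each update h_t = a_t * h_{t-1} + b_t is the affine map h -> a_t*h + b_t, and composing two such maps gives another one, so prefix-combining them reproduces the sequential hidden states exactly (a real kernel evaluates the combines in a parallel tree, in O(log L) steps):

```python
def combine(left, right):
    # compose two affine maps: applying (a1, b1) then (a2, b2)
    # gives h -> a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2)
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def scan_sequential(a, b):
    # the inference-time view: one state update per token
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_with_combine(a, b):
    # same answer via the associative operator
    acc, out = (a[0], b[0]), [b[0]]
    for t in range(1, len(a)):
        acc = combine(acc, (a[t], b[t]))
        out.append(acc[1])
    return out

a = [0.9, 0.5, 0.8, 0.1]
b = [1.0, 2.0, 0.5, 3.0]
```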
The Hidden State Flow
```mermaid
flowchart LR
X1["x1"] -->|"B"| H1["h1"]
X2["x2"] -->|"B"| H2["h2"]
X3["x3"] -->|"B"| H3["h3"]
X4["x4"] -->|"B"| H4["h4"]
X5["x5"] -->|"B"| H5["h5"]
H1 -->|"A decay"| H2
H2 -->|"A decay"| H3
H3 -->|"A decay"| H4
H4 -->|"A decay"| H5
H1 -->|"C"| Y1["y1"]
H2 -->|"C"| Y2["y2"]
H3 -->|"C"| Y3["y3"]
H4 -->|"C"| Y4["y4"]
H5 -->|"C"| Y5["y5"]
classDef inp fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef hid fill:#ccfbf1,stroke:#14b8a6,stroke-width:2px,color:#134e4a
classDef outp fill:#e0e7ff,stroke:#6366f1,color:#312e81
class X1,X2,X3,X4,X5 inp
class H1,H2,H3,H4,H5 hid
class Y1,Y2,Y3,Y4,Y5 outp
```
Each hidden state h_t carries a compressed summary of all past tokens. The A matrix controls decay (how much old info fades), B controls what gets written in, C controls what gets read out. All input-dependent in Mamba.
The Cost
| Operation | Training Cost | Inference (per token) | Memory |
|---|---|---|---|
| Selective SSM | O(L·d) linear | O(d) constant | O(N·d) fixed state |
| Conv1D | O(L·d) | O(d) | O(k·d) kernel |
Everything is linear or constant. No quadratic anywhere.
Part 3: The Head-to-Head Comparison
Let me be real about where each architecture wins and loses:
| Dimension | Transformer | Mamba |
|---|---|---|
| Core mechanism | Self-attention (token-to-token) | Selective SSM (compressed state) |
| Training compute | O(L²·d) quadratic | O(L·d) linear |
| Inference per token | O(L·d) grows with context | O(d) constant |
| Memory (KV cache) | O(L·d) grows | O(N·d) fixed |
| Context window | Fixed (4K-128K) hard wall | Theoretically unlimited |
| Exact retrieval | Perfect — can pinpoint any token | Lossy — compressed into finite state |
| In-context learning | Strong — dynamic pattern routing | Weaker on exact copy/retrieval |
| Scale proven | 400B+ parameters | Up to ~8B so far |
| Ecosystem | Mature (FlashAttention, vLLM) | Newer, custom CUDA kernels |
The Fundamental Tradeoff
It comes down to this:
Attention = complete, addressable memory of every past token. You pay O(L²) for it.
SSM = fixed-size compressed summary of history. Much cheaper, but lossy.
The question is: is that compression sufficient for the task at hand?
For most of the computation in a forward pass — yes. You don't need to look at every single past token to process the word "the". But for some critical operations — resolving coreferences, copying exact numbers, following specific instructions — you need the precision of attention.
And that's exactly why hybrids are the answer.
Part 4: The Hybrid Architecture — Best of Both Worlds
This is where I got really excited. The hybrid approach isn't a compromise — it's genuinely better than either architecture alone.
The Core Insight
You don't need quadratic attention at every single layer.
Most layers are doing local, incremental processing. Mamba handles this beautifully at linear cost. But at certain critical points, you need the model to "check its work" against the full context with exact attention.
Jamba: The Reference Implementation
AI21 Labs built Jamba — the first production-scale hybrid (52B total / 12B active, 256K context). Here's the layer stack:
```mermaid
flowchart TB
INPUT["Token Embedding"] --> M1
subgraph BLOCK1["Layers 1-7 : Mamba Foundation"]
M1["Mamba SSM + Dense FFN"]
M2["Mamba SSM + Dense FFN"]
M3["... x 7 layers total"]
end
M3 --> A1
subgraph ATTN1["Layer 8 : Attention Checkpoint"]
A1["Self-Attention + MoE FFN -- KV cache stored"]
end
A1 --> M4
subgraph BLOCK2["Layers 9-15 : Mamba Efficient Flow"]
M4["Mamba SSM + MoE FFN"]
M5["Mamba SSM + MoE FFN"]
M6["... x 7 layers total"]
end
M6 --> A2
subgraph ATTN2["Layer 16 : Attention Checkpoint"]
A2["Self-Attention + MoE FFN -- KV cache stored"]
end
A2 --> M7
subgraph BLOCK3["Layers 17-23 : Mamba Efficient Flow"]
M7["... x 7 Mamba SSM layers"]
end
M7 --> A3
subgraph ATTN3["Layer 24 : Attention Checkpoint"]
A3["Self-Attention + MoE FFN"]
end
A3 --> M8
subgraph BLOCK4["Layers 25-31 : Mamba Efficient Flow"]
M8["... x 7 Mamba SSM layers"]
end
M8 --> A4
subgraph ATTN4["Layer 32 : Attention Checkpoint"]
A4["Self-Attention + MoE FFN"]
end
A4 --> NORM["RMS Norm + Linear Head"]
NORM --> OUTPUT["Output Logits"]
classDef mambaBlock fill:#d1fae5,stroke:#10b981,stroke-width:2px,color:#064e3b
classDef attnBlock fill:#ede9fe,stroke:#8b5cf6,stroke-width:2px,color:#3b1f7e
classDef mambaNode fill:#a7f3d0,stroke:#059669,color:#064e3b
classDef attnNode fill:#c4b5fd,stroke:#7c3aed,color:#3b1f7e
classDef ioNode fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef normNode fill:#e2e8f0,stroke:#64748b,color:#1e293b
class BLOCK1,BLOCK2,BLOCK3,BLOCK4 mambaBlock
class ATTN1,ATTN2,ATTN3,ATTN4 attnBlock
class M1,M2,M3,M4,M5,M6,M7,M8 mambaNode
class A1,A2,A3,A4 attnNode
class INPUT,OUTPUT ioNode
class NORM normNode
```
Pattern: 7 Mamba layers followed by 1 Attention layer, repeat. That's 87.5% Mamba, 12.5% Attention.
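That placement rule is simple enough to write down. A sketch (the function name is mine, not AI21's):

```python
def jamba_layer_pattern(n_layers=32, attn_every=8):
    # "A" = attention checkpoint, "M" = Mamba layer; attention sits at
    # every attn_every-th position, giving the 1:7 ratio described above
    return ["A" if (i + 1) % attn_every == 0 else "M" for i in range(n_layers)]

pattern = jamba_layer_pattern()
# pattern.count("A") == 4, pattern.count("M") == 28, i.e. 12.5% attention
```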
Why This Pattern Works
Mamba layers (87.5% of the stack) handle:
- Local patterns and syntax
- Sequential information flow
- Building up contextual representations
- General "thinking" that doesn't need exact retrieval
Attention layers (12.5% of the stack) handle:
- Disambiguating words ("bank" = river bank or financial bank?)
- Coreference resolution ("it" → "the cat")
- Exact information retrieval from context
- Following specific instructions from the prompt
The Mamba layers enrich the token representations before they hit the attention layer. By the time attention fires, the representations are already pretty good — attention just needs to do targeted cleanup and retrieval.
The MoE Layer: The Third Ingredient
Jamba also uses Mixture of Experts (MoE) in its feed-forward layers:
```mermaid
flowchart TD
TOKEN["Input Token"] --> ROUTER["Router Network - scores all 16 experts"]
ROUTER -->|"Selected"| E3["Expert 3 - active"]
ROUTER -->|"Selected"| E11["Expert 11 - active"]
ROUTER -.->|"Not selected"| E1["Expert 1"]
ROUTER -.->|"Not selected"| E2["Expert 2"]
ROUTER -.->|"..."| E16["Expert 16"]
E3 --> SUM["Weighted Sum"]
E11 --> SUM
SUM --> OUT["Output"]
classDef token fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef router fill:#fecaca,stroke:#ef4444,stroke-width:2px,color:#7f1d1d
classDef active fill:#bbf7d0,stroke:#22c55e,stroke-width:2px,color:#14532d
classDef inactive fill:#f1f5f9,stroke:#cbd5e1,color:#94a3b8
classDef sum fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef neutral fill:#f1f5f9,stroke:#64748b,color:#1e293b
class TOKEN token
class ROUTER router
class E3,E11 active
class E1,E2,E16 inactive
class SUM sum
class OUT neutral
```
16 experts total, 2 active per token. This means:
- Total parameters: 52B (lots of knowledge capacity)
- Active parameters per token: 12B (fast inference)
- Each expert can specialize (math, code, language, reasoning, etc.)
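Top-2 routing is easy to sketch. A toy NumPy version where the "experts" are just random linear maps (real experts are full FFNs, and production routers also add load-balancing losses not shown here):

```python
import numpy as np

def moe_forward(x, router_W, experts):
    # score all experts, keep the top-2, renormalize their gates with a
    # softmax over just those two, and mix only the selected outputs
    logits = router_W @ x
    top2 = np.argsort(logits)[-2:]
    gates = np.exp(logits[top2] - logits[top2].max())
    gates /= gates.sum()
    return sum(g * experts[i](x) for g, i in zip(gates, top2))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
router_W = rng.normal(size=(n_experts, d))
expert_Ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_Ws]   # toy experts
y = moe_forward(rng.normal(size=d), router_W, experts)
```

Note that only 2 of the 16 expert functions ever run per token, which is exactly how Jamba keeps 52B total parameters but only 12B active.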
The Memory Savings Are Massive
```mermaid
flowchart LR
subgraph TRANSFORMER["Pure Transformer - 32 layers"]
T1["32 attention layers | 32 KV caches | ~128 GB at 256K | Needs multiple GPUs"]
end
subgraph HYBRID["Jamba Hybrid - 32 layers"]
H1["4 attn + 28 Mamba | 4 KV caches | ~16 GB at 256K | Single 80GB GPU"]
end
TRANSFORMER -->|"8x memory reduction"| HYBRID
classDef bad fill:#fecaca,stroke:#ef4444,stroke-width:2px,color:#7f1d1d
classDef good fill:#bbf7d0,stroke:#22c55e,stroke-width:2px,color:#14532d
classDef badNode fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
classDef goodNode fill:#dcfce7,stroke:#22c55e,color:#14532d
class TRANSFORMER bad
class HYBRID good
class T1 badNode
class H1 goodNode
```
That's the difference between needing a cluster and fitting on a single GPU.
Part 5: How Data Flows Through a Hybrid
Let me trace what happens when you send "The bank of the river was steep and muddy" through a hybrid model:
Phase 1: Mamba Layers Build Compressed Context
```mermaid
flowchart LR
H1["h: the bank..."] -->|"A decay"| H2["h: bank of river..."]
H2 -->|"A decay"| H3["h: river was steep..."]
H3 -->|"A decay"| H4["h: steep and..."]
classDef s1 fill:#a7f3d0,stroke:#059669,color:#064e3b
classDef s2 fill:#6ee7b7,stroke:#059669,color:#064e3b
classDef s3 fill:#34d399,stroke:#059669,color:#064e3b
classDef s4 fill:#10b981,stroke:#047857,color:#f0fdf4
class H1 s1
class H2 s2
class H3 s3
class H4 s4
```
Good at: local patterns, syntax, general context. Weak at: pinpointing a specific distant token.
Phase 2: Attention Layer Resolves Ambiguities
```mermaid
flowchart TD
AND["and - query token"] -->|"weight: 0.35"| STEEP["steep"]
AND -->|"weight: 0.28"| RIVER["river"]
AND -->|"weight: 0.22"| BANK["bank"]
AND -->|"weight: 0.08"| WAS["was"]
BANK2["bank - query"] -->|"weight: 0.65 STRONG"| RIVER2["river - disambiguates!"]
BANK2 -->|"weight: 0.12"| OF["of"]
BANK2 -->|"weight: 0.10"| THE["the"]
classDef query fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef high fill:#c4b5fd,stroke:#8b5cf6,color:#3b1f7e
classDef mid fill:#93c5fd,stroke:#3b82f6,color:#1e3a5f
classDef low fill:#f1f5f9,stroke:#94a3b8,color:#334155
classDef match fill:#fecaca,stroke:#ef4444,stroke-width:3px,color:#7f1d1d
class AND,BANK2 query
class STEEP high
class RIVER,BANK mid
class WAS,OF,THE low
class RIVER2 match
```
This is the checkpoint moment. "bank" attends to "river" with weight 0.65 — it gets disambiguated as "river bank", not "financial bank". After this layer, the representation is precise.
Phase 3-4: More Mamba → More Attention → Repeat
Back to efficient Mamba processing, now with enriched representations. The disambiguation carries forward. Next attention checkpoint does higher-level reasoning.
Part 6: The Three Design Decisions
If you're building or fine-tuning a hybrid model, these are the knobs:
1. Attention-to-Mamba Ratio
```mermaid
flowchart LR
TOO_MUCH["Too much attention - back to Transformer costs"] --- SWEET["Sweet spot 1:6 to 1:8 - best tradeoff"] --- TOO_LITTLE["Too little attention - retrieval quality drops"]
classDef bad fill:#fecaca,stroke:#ef4444,color:#7f1d1d
classDef good fill:#bbf7d0,stroke:#22c55e,stroke-width:2px,color:#14532d
class TOO_MUCH,TOO_LITTLE bad
class SWEET good
```
2. Where to Place Attention Layers
- Early layers learn low-level features (syntax, local patterns) → Mamba handles fine
- Deep layers handle abstract reasoning, long-range deps → attention more valuable
- Jamba distributes evenly (every 8th layer)
3. MoE vs Dense FFN
- First few layers: Dense FFN (shared features needed by all tokens)
- Later layers: MoE (specialized knowledge, different tokens need different experts)
Part 7: Other Hybrid Models to Watch
Jamba isn't alone. The hybrid approach is becoming a movement:
| Model | Approach | Key Innovation |
|---|---|---|
| Jamba (AI21) | Mamba + Attention + MoE | First production hybrid, 256K context |
| Zamba (Zyphra) | Mamba + shared attention | Shared KV cache across attention insertion points |
| Griffin (DeepMind) | Gated linear recurrence + attention | Google's take on efficient recurrence |
| RWKV-6 | Linear attention + recurrence | Open-source, Transformer-quality at linear cost |
| Mamba-2 (Gu & Dao) | Improved SSM as structured matrix mult | Faster hardware utilization, easier hybridization |
Part 8: The Future — My Take
Here's what I think is going to happen in the next 1-2 years:
Transformers Are NOT Dead
Let me be clear. Transformers won't die. They're:
- Proven at 400B+ parameter scale
- Backed by massive ecosystem (FlashAttention, vLLM, TensorRT, ONNX)
- Still the best at tasks requiring precise information routing
- The architecture behind every frontier model today
You can't just throw away 7+ years of engineering and optimization. The tooling, the infrastructure, the deployment pipelines — all built for Transformers.
But Pure Transformers Are Over-Engineered
Here's the thing: most of the computation in a Transformer is wasted. Not every layer needs quadratic attention. Not every token needs to look at every other token at every layer.
The research is converging on a clear signal: use the minimum attention necessary, fill the rest with efficient recurrence.
Hybrids Are the Path Forward
```mermaid
flowchart TD
F1["Frontier labs adopt hybrids -- 8x memory savings too compelling"]
F2["Attention becomes premium -- used sparingly like DB indexes"]
F3["Open-source hybrids close the gap -- Mamba + RWKV already open"]
F4["1M+ context windows standard -- constant-memory inference"]
F5["New hardware optimizations -- custom silicon for selective scan"]
F6["Cost per token drops -- model tiering strategies evolve"]
classDef c1 fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef c2 fill:#e0e7ff,stroke:#6366f1,color:#312e81
classDef c3 fill:#d1fae5,stroke:#10b981,color:#064e3b
classDef c4 fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef c5 fill:#fce7f3,stroke:#ec4899,color:#831843
classDef c6 fill:#ccfbf1,stroke:#14b8a6,color:#134e4a
class F1 c1
class F2 c2
class F3 c3
class F4 c4
class F5 c5
class F6 c6
```
For Us Developers
The architectural shift means real changes in how we build:
- Cost per token will drop — hybrid models need less compute
- Context windows will explode — entire codebases in context becomes feasible
- Real-time apps become easier — Mamba's O(1) inference enables streaming use cases
- Model tiering evolves — hybrid for bulk processing, Transformer for precision
Wrapping Up
The AI architecture landscape is evolving fast. Transformers gave us the foundation. Mamba showed us there's a better way to handle long sequences. And hybrids are proving you don't have to choose — you can have both.
The future isn't "Transformer vs Mamba." It's "Transformer AND Mamba, used where each excels."
If you're an engineer working with AI, now is a great time to understand these architectures. The next generation of models you'll be using — and maybe building on — will likely be hybrids.
If you found this useful, drop a like or a comment. I'm happy to go deeper on any section — the attention math, the SSM discretization, or the MoE routing if there's interest.
I'm Himanshu — an AI-augmented full-stack developer exploring how these architectures change the way we build software. Follow me for more deep dives on AI engineering.