I've been spending a lot of time recently understanding how the next generation of AI models is being built. Not just using them — actually understanding what's happening under the hood.
And here's what I've realized: the Transformer isn't going anywhere. But it's also not enough on its own anymore.
Let me explain.
The Problem Nobody Talks About
Every major LLM today — GPT-4, Claude, Gemini, Llama — is built on the Transformer architecture. It works. It works incredibly well. But it has one fundamental problem that gets worse as we push for longer contexts:
Self-attention is O(L²).
That means if you double your context window from 64K to 128K tokens, the compute doesn't just double — it quadruples. And the KV cache (the memory that stores past token information during inference) grows linearly with sequence length.
At 128K context with a 32-layer Transformer, you're looking at roughly 128 GB of KV cache. That's not fitting on a single GPU.
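A quick back-of-the-envelope check of that number. This is a sketch, not the exact figure for any real model: it assumes an 8192-dim model, full multi-head attention, and fp16 storage, and ignores KV-sharing tricks like GQA that real deployments use to shrink the cache:

```python
def kv_cache_gib(n_layers, d_model, seq_len, bytes_per_elem=2):
    # 2x for keys and values; one cached vector per layer, per token,
    # per model dimension; bytes_per_elem = 2 assumes fp16
    return 2 * n_layers * d_model * seq_len * bytes_per_elem / 2**30

size = kv_cache_gib(n_layers=32, d_model=8192, seq_len=128 * 1024)
# size == 128.0 GiB; it doubles every time the context window doubles
```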
This is where Mamba enters the picture. And this is where things get interesting.
Part 1: The Transformer — What You Already Use
Let me break down what's actually happening inside a Transformer. If you've used any modern LLM, this is the engine running underneath.
The Full Architecture
The original Transformer (Vaswani et al., 2017 — "Attention Is All You Need") has two halves — an Encoder and a Decoder:
Most modern LLMs (GPT, Llama, Claude) use decoder-only Transformers — they drop the encoder entirely and just stack decoder layers. But the core mechanism is the same.
How Self-Attention Actually Works
This is the heart of the Transformer. Let me walk through it with a concrete example.
Say you have the sentence: "The cat sat on it"
When the model processes the token "it", it needs to figure out: what does "it" refer to?
Step 1: Each token creates three vectors from its embedding:
- Query (Q) — "What am I looking for?"
- Key (K) — "What do I contain?"
- Value (V) — "What information do I carry?"
Step 2: Compute attention scores = dot product of Q("it") with K(every other token). (The values below are the weights after the softmax in Step 3, which is why they sum to 1.0.)
```mermaid
flowchart LR
T1["The = 0.05"]
T2["cat = 0.72"]
T3["sat = 0.12"]
T4["on = 0.03"]
T5["it = 0.08"]
classDef low fill:#f1f5f9,stroke:#94a3b8,color:#334155
classDef high fill:#c084fc,stroke:#9333ea,stroke-width:3px,color:#3b0764
class T1,T3,T4,T5 low
class T2 high
```
Step 3: Softmax normalizes scores into attention weights (sum = 1.0)
Step 4: Output = weighted sum of all Value vectors
The result? "it" attends most strongly to "cat" (weight 0.72) — it learned that "it" refers to "the cat" without anyone telling it to. That's the magic.
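The four steps can be sketched in a few lines of NumPy. This is a toy illustration: random vectors stand in for the learned Q/K/V projections, so the weights won't reproduce the 0.72 from the example, but the mechanics are identical:

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # numerically stable softmax over each row of scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted sum of all Value vectors
    return weights @ V, weights

# 5 tokens ("The cat sat on it") with toy 4-dim embeddings
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out, w = attention(Q, K, V)
# every row of w sums to 1.0: each token's attention is a distribution
```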
Multi-Head Attention
The model runs this process multiple times in parallel (typically 8-16 heads). Each head learns different relationships:
The Feed-Forward Network
After attention, each token goes through a two-layer MLP independently:
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂
This is where the model stores factual knowledge — it acts like a key-value memory. Research has shown you can actually locate specific facts in specific neurons of the FFN layers.
Residual Connections + Layer Norm
Every sub-layer is wrapped with: LayerNorm(x + SubLayer(x))
This prevents gradient degradation through deep stacks. Without residuals, gradients in a deep stack degrade rapidly and training stalls after only a handful of layers. With them, you can stack 80+ layers.
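A minimal NumPy sketch of the FFN and the post-norm residual wrapper, with toy dimensions. One caveat: most modern LLMs actually moved to pre-norm, i.e. x + SubLayer(LayerNorm(x)); the post-norm form shown here is the original paper's:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # position-wise two-layer MLP: max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-5):
    # normalize each token's vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def residual_block(x, sublayer):
    # post-norm wrapper: LayerNorm(x + SubLayer(x))
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d, d_ff = 8, 32                      # hidden dim is typically ~4x d_model
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
x = rng.normal(size=(5, d))          # 5 tokens
out = residual_block(x, lambda v: ffn(v, W1, b1, W2, b2))
```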
The Cost
| Operation | Training Cost | Inference (per token) | Memory |
|---|---|---|---|
| Self-Attention | O(L²·d) | O(L·d) | O(L·d) KV cache |
| FFN | O(L·d²) | O(d²) | O(d²) weights |
That L² in attention is the killer. Everything else scales linearly.
Part 2: Mamba — The Challenger
Mamba was introduced by Albert Gu and Tri Dao in December 2023. It takes a completely different approach: instead of letting every token look at every other token, it processes the sequence through a Selective State Space Model (SSM).
The Core Idea: A Learned Compression
Think of it like this:
- Transformer: keeps a complete photo album of every past token (KV cache)
- Mamba: keeps a single, constantly-updated summary note (hidden state)
The math comes from continuous dynamical systems:
Continuous form:
h'(t) = A·h(t) + B·x(t) ← hidden state evolution
y(t) = C·h(t) ← output
Discretized (for actual tokens):
h_t = A_bar · h_(t-1) + B_bar · x_t ← update hidden state
y_t = C · h_t ← read output
Where:
- A (transition/decay matrix) — controls how much old state is retained
- B (input matrix) — controls how much of the new token gets written in
- C (output matrix) — controls what gets read out
- Delta (step size) — controls discretization granularity
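The discretized recurrence is tiny to implement. Here's a toy version with a diagonal A and a single input channel (real Mamba runs D of these channels in parallel, with much larger state sizes):

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    # h_t = A_bar * h_{t-1} + B_bar * x_t ;  y_t = C . h_t
    # A_bar is a diagonal matrix stored as a vector of size N
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t   # write the new token into the state
        ys.append(C @ h)              # read the output back out
    return np.array(ys)

# an impulse followed by silence: the state decays geometrically under A_bar
y = ssm_scan(np.full(2, 0.5), np.ones(2), np.ones(2), [1.0, 0.0, 0.0])
# y == [2.0, 1.0, 0.5]
```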
What Makes Mamba Different: Selectivity
Previous state space models (S4, H3) used fixed A, B, C parameters — same for every token. Mamba's key innovation: these parameters are input-dependent.
This is selectivity. The model dynamically decides, per-token:
- When to remember (small Delta = gentle update)
- When to forget (large Delta = aggressive reset)
- What to write (B controls input projection)
- What to output (C controls readout)
A delimiter token might trigger high forgetting. A content word triggers careful accumulation. The model learns this entirely from data.
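Mechanically, selectivity just means Delta, B, and C are recomputed from each input before the state update. This single-channel sketch is heavily simplified: the softplus for Delta and the exp(Delta*A) discretization follow the paper, the plain Delta*B rule for B_bar is a common simplification, and all shapes and weight names here are illustrative:

```python
import numpy as np

def selective_ssm(x, A, w_delta, W_B, W_C):
    # Toy single-channel selective SSM with diagonal A of size N.
    # The key difference from S4: Delta, B, C depend on the current input.
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        delta = np.log1p(np.exp(w_delta * x_t))   # softplus keeps step size positive
        A_bar = np.exp(delta * A)                 # zero-order-hold discretization
        B_bar = delta * (W_B * x_t)               # input-dependent write strength
        C = W_C * x_t                             # input-dependent readout
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = -np.abs(rng.normal(size=4))   # negative entries keep the state stable
y = selective_ssm(rng.normal(size=10), A, 1.0,
                  rng.normal(size=4), rng.normal(size=4))
```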
The Mamba Block
```mermaid
flowchart TB
INPUT["Input - B, L, D"] --> NORM["RMS Norm"]
NORM --> LP1["Branch A: Linear Proj D to E"]
NORM --> LP2["Branch B: Linear Proj D to E - gate"]
LP1 --> CONV["Causal Conv1D kernel=4"]
CONV --> SILU1["SiLU Activation"]
SILU1 --> SSSM["SELECTIVE SSM - Input-dependent A, B, C, Delta"]
LP2 --> SILU2["SiLU Activation"]
SSSM --> MUL["Multiply - Gating"]
SILU2 --> MUL
MUL --> LP3["Linear Proj E to D"]
LP3 --> RES["Residual Add"]
INPUT -.->|"Skip connection"| RES
RES --> OUT["Output - B, L, D"]
classDef neutral fill:#f1f5f9,stroke:#64748b,color:#1e293b
classDef norm fill:#e2e8f0,stroke:#64748b,color:#1e293b
classDef proj fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef conv fill:#fed7aa,stroke:#f97316,stroke-width:2px,color:#7c2d12
classDef ssm fill:#fecaca,stroke:#ef4444,stroke-width:3px,color:#7f1d1d
classDef gate fill:#fce7f3,stroke:#ec4899,stroke-width:2px,color:#831843
classDef res fill:#d1fae5,stroke:#10b981,color:#064e3b
class INPUT,OUT neutral
class NORM,SILU1,SILU2 norm
class LP1,LP2,LP3 proj
class CONV conv
class SSSM ssm
class MUL gate
class RES res
```
The split-branch design is key: one path does the heavy computation (Conv1D followed by SSM), the other acts as a gate. Multiplying them together gives the model fine control over information flow.
The Hardware Trick: Parallel Scan
"But wait," you're thinking, "if Mamba processes tokens sequentially through a recurrence, isn't training slow?"
Here's the engineering brilliance: each state update is an affine map, and composing affine maps is associative. That means you can parallelize the whole recurrence with a scan (prefix sum) operation, a pattern GPUs are incredibly good at.
During training, Mamba achieves near-Transformer parallelism. During inference, it's purely recurrent — just update the hidden state with each new token. No KV cache. Constant memory. Constant time per token.
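You can verify the associativity claim directly. Each update h_t = a_t * h_{t-1} + b_t is the affine map h -> a_t*h + b_t, and composing two such maps gives another one, so prefix-combining them reproduces the sequential hidden states exactly (a real kernel evaluates the combines in a parallel tree, in O(log L) steps):

```python
def combine(left, right):
    # compose two affine maps: applying (a1, b1) then (a2, b2)
    # gives h -> a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2)
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def scan_sequential(a, b):
    # the inference-time view: one state update per token
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_with_combine(a, b):
    # same answer via the associative operator
    acc, out = (a[0], b[0]), [b[0]]
    for t in range(1, len(a)):
        acc = combine(acc, (a[t], b[t]))
        out.append(acc[1])
    return out

a = [0.9, 0.5, 0.8, 0.1]
b = [1.0, 2.0, 0.5, 3.0]
```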
The Hidden State Flow
```mermaid
flowchart LR
X1["x1"] -->|"B"| H1["h1"]
X2["x2"] -->|"B"| H2["h2"]
X3["x3"] -->|"B"| H3["h3"]
X4["x4"] -->|"B"| H4["h4"]
X5["x5"] -->|"B"| H5["h5"]
H1 -->|"A decay"| H2
H2 -->|"A decay"| H3
H3 -->|"A decay"| H4
H4 -->|"A decay"| H5
H1 -->|"C"| Y1["y1"]
H2 -->|"C"| Y2["y2"]
H3 -->|"C"| Y3["y3"]
H4 -->|"C"| Y4["y4"]
H5 -->|"C"| Y5["y5"]
classDef inp fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef hid fill:#ccfbf1,stroke:#14b8a6,stroke-width:2px,color:#134e4a
classDef outp fill:#e0e7ff,stroke:#6366f1,color:#312e81
class X1,X2,X3,X4,X5 inp
class H1,H2,H3,H4,H5 hid
class Y1,Y2,Y3,Y4,Y5 outp
```
Each hidden state h_t carries a compressed summary of all past tokens. The A matrix controls decay (how much old info fades), B controls what gets written in, C controls what gets read out. All input-dependent in Mamba.
The Cost
| Operation | Training Cost | Inference (per token) | Memory |
|---|---|---|---|
| Selective SSM | O(L·d) linear | O(d) constant | O(N·d) fixed state |
| Conv1D | O(L·d) | O(d) | O(k·d) kernel |
Everything is linear or constant. No quadratic anywhere.
Part 3: The Head-to-Head Comparison
Let me be real about where each architecture wins and loses:
| Dimension | Transformer | Mamba |
|---|---|---|
| Core mechanism | Self-attention (token-to-token) | Selective SSM (compressed state) |
| Training compute | O(L²·d) quadratic | O(L·d) linear |
| Inference per token | O(L·d) grows with context | O(d) constant |
| Memory (KV cache) | O(L·d) grows | O(N·d) fixed |
| Context window | Fixed (4K-128K) hard wall | Theoretically unlimited |
| Exact retrieval | Perfect — can pinpoint any token | Lossy — compressed into finite state |
| In-context learning | Strong — dynamic pattern routing | Weaker on exact copy/retrieval |
| Scale proven | 400B+ parameters | Up to ~8B so far |
| Ecosystem | Mature (FlashAttention, vLLM) | Newer, custom CUDA kernels |
The Fundamental Tradeoff
It comes down to this:
Attention = complete, addressable memory of every past token. You pay O(L²) for it.
SSM = fixed-size compressed summary of history. Much cheaper, but lossy.
The question is: is that compression sufficient for the task at hand?
For most of the computation in a forward pass — yes. You don't need to look at every single past token to process the word "the". But for some critical operations — resolving coreferences, copying exact numbers, following specific instructions — you need the precision of attention.
And that's exactly why hybrids are the answer.
Part 4: The Hybrid Architecture — Best of Both Worlds
This is where I got really excited. The hybrid approach isn't a compromise — it's genuinely better than either architecture alone.
The Core Insight
You don't need quadratic attention at every single layer.
Most layers are doing local, incremental processing. Mamba handles this beautifully at linear cost. But at certain critical points, you need the model to "check its work" against the full context with exact attention.
Jamba: The Reference Implementation
AI21 Labs built Jamba — the first production-scale hybrid (52B total / 12B active, 256K context). Here's the layer stack:
```mermaid
flowchart TB
INPUT["Token Embedding"] --> M1
subgraph BLOCK1["Layers 1-7 : Mamba Foundation"]
M1["Mamba SSM + Dense FFN"]
M2["Mamba SSM + Dense FFN"]
M3["... x 7 layers total"]
end
M3 --> A1
subgraph ATTN1["Layer 8 : Attention Checkpoint"]
A1["Self-Attention + MoE FFN -- KV cache stored"]
end
A1 --> M4
subgraph BLOCK2["Layers 9-15 : Mamba Efficient Flow"]
M4["Mamba SSM + MoE FFN"]
M5["Mamba SSM + MoE FFN"]
M6["... x 7 layers total"]
end
M6 --> A2
subgraph ATTN2["Layer 16 : Attention Checkpoint"]
A2["Self-Attention + MoE FFN -- KV cache stored"]
end
A2 --> M7
subgraph BLOCK3["Layers 17-23 : Mamba Efficient Flow"]
M7["... x 7 Mamba SSM layers"]
end
M7 --> A3
subgraph ATTN3["Layer 24 : Attention Checkpoint"]
A3["Self-Attention + MoE FFN"]
end
A3 --> M8
subgraph BLOCK4["Layers 25-31 : Mamba Efficient Flow"]
M8["... x 7 Mamba SSM layers"]
end
M8 --> A4
subgraph ATTN4["Layer 32 : Attention Checkpoint"]
A4["Self-Attention + MoE FFN"]
end
A4 --> NORM["RMS Norm + Linear Head"]
NORM --> OUTPUT["Output Logits"]
classDef mambaBlock fill:#d1fae5,stroke:#10b981,stroke-width:2px,color:#064e3b
classDef attnBlock fill:#ede9fe,stroke:#8b5cf6,stroke-width:2px,color:#3b1f7e
classDef mambaNode fill:#a7f3d0,stroke:#059669,color:#064e3b
classDef attnNode fill:#c4b5fd,stroke:#7c3aed,color:#3b1f7e
classDef ioNode fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef normNode fill:#e2e8f0,stroke:#64748b,color:#1e293b
class BLOCK1,BLOCK2,BLOCK3,BLOCK4 mambaBlock
class ATTN1,ATTN2,ATTN3,ATTN4 attnBlock
class M1,M2,M3,M4,M5,M6,M7,M8 mambaNode
class A1,A2,A3,A4 attnNode
class INPUT,OUTPUT ioNode
class NORM normNode
```
Pattern: 7 Mamba layers followed by 1 Attention layer, repeat. That's 87.5% Mamba, 12.5% Attention.
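That placement rule is simple enough to write down. A sketch (the function name is mine, not AI21's):

```python
def jamba_layer_pattern(n_layers=32, attn_every=8):
    # "A" = attention checkpoint, "M" = Mamba layer; attention sits at
    # every attn_every-th position, giving the 1:7 ratio described above
    return ["A" if (i + 1) % attn_every == 0 else "M" for i in range(n_layers)]

pattern = jamba_layer_pattern()
# pattern.count("A") == 4, pattern.count("M") == 28, i.e. 12.5% attention
```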
Why This Pattern Works
Mamba layers (87.5% of the stack) handle:
- Local patterns and syntax
- Sequential information flow
- Building up contextual representations
- General "thinking" that doesn't need exact retrieval
Attention layers (12.5% of the stack) handle:
- Disambiguating words ("bank" = river bank or financial bank?)
- Coreference resolution ("it" → "the cat")
- Exact information retrieval from context
- Following specific instructions from the prompt
The Mamba layers enrich the token representations before they hit the attention layer. By the time attention fires, the representations are already pretty good — attention just needs to do targeted cleanup and retrieval.
The MoE Layer: The Third Ingredient
Jamba also uses Mixture of Experts (MoE) in its feed-forward layers:
```mermaid
flowchart TD
TOKEN["Input Token"] --> ROUTER["Router Network - scores all 16 experts"]
ROUTER -->|"Selected"| E3["Expert 3 - active"]
ROUTER -->|"Selected"| E11["Expert 11 - active"]
ROUTER -.->|"Not selected"| E1["Expert 1"]
ROUTER -.->|"Not selected"| E2["Expert 2"]
ROUTER -.->|"..."| E16["Expert 16"]
E3 --> SUM["Weighted Sum"]
E11 --> SUM
SUM --> OUT["Output"]
classDef token fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef router fill:#fecaca,stroke:#ef4444,stroke-width:2px,color:#7f1d1d
classDef active fill:#bbf7d0,stroke:#22c55e,stroke-width:2px,color:#14532d
classDef inactive fill:#f1f5f9,stroke:#cbd5e1,color:#94a3b8
classDef sum fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef neutral fill:#f1f5f9,stroke:#64748b,color:#1e293b
class TOKEN token
class ROUTER router
class E3,E11 active
class E1,E2,E16 inactive
class SUM sum
class OUT neutral
```
16 experts total, 2 active per token. This means:
- Total parameters: 52B (lots of knowledge capacity)
- Active parameters per token: 12B (fast inference)
- Each expert can specialize (math, code, language, reasoning, etc.)
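Top-2 routing is easy to sketch. A toy NumPy version where the "experts" are just random linear maps (real experts are full FFNs, and production routers also add load-balancing losses not shown here):

```python
import numpy as np

def moe_forward(x, router_W, experts):
    # score all experts, keep the top-2, renormalize their gates with a
    # softmax over just those two, and mix only the selected outputs
    logits = router_W @ x
    top2 = np.argsort(logits)[-2:]
    gates = np.exp(logits[top2] - logits[top2].max())
    gates /= gates.sum()
    return sum(g * experts[i](x) for g, i in zip(gates, top2))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
router_W = rng.normal(size=(n_experts, d))
expert_Ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_Ws]   # toy experts
y = moe_forward(rng.normal(size=d), router_W, experts)
```

Note that only 2 of the 16 expert functions ever run per token, which is exactly how Jamba keeps 52B total parameters but only 12B active.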
The Memory Savings Are Massive
```mermaid
flowchart LR
subgraph TRANSFORMER["Pure Transformer - 32 layers"]
T1["32 attention layers | 32 KV caches | ~128 GB at 256K | Needs multiple GPUs"]
end
subgraph HYBRID["Jamba Hybrid - 32 layers"]
H1["4 attn + 28 Mamba | 4 KV caches | ~16 GB at 256K | Single 80GB GPU"]
end
TRANSFORMER -->|"8x memory reduction"| HYBRID
classDef bad fill:#fecaca,stroke:#ef4444,stroke-width:2px,color:#7f1d1d
classDef good fill:#bbf7d0,stroke:#22c55e,stroke-width:2px,color:#14532d
classDef badNode fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
classDef goodNode fill:#dcfce7,stroke:#22c55e,color:#14532d
class TRANSFORMER bad
class HYBRID good
class T1 badNode
class H1 goodNode
```
That's the difference between needing a cluster and fitting on a single GPU.
Part 5: How Data Flows Through a Hybrid
Let me trace what happens when you send "The bank of the river was steep and muddy" through a hybrid model:
Phase 1: Mamba Layers Build Compressed Context
```mermaid
flowchart LR
H1["h: the bank..."] -->|"A decay"| H2["h: bank of river..."]
H2 -->|"A decay"| H3["h: river was steep..."]
H3 -->|"A decay"| H4["h: steep and..."]
classDef s1 fill:#a7f3d0,stroke:#059669,color:#064e3b
classDef s2 fill:#6ee7b7,stroke:#059669,color:#064e3b
classDef s3 fill:#34d399,stroke:#059669,color:#064e3b
classDef s4 fill:#10b981,stroke:#047857,color:#f0fdf4
class H1 s1
class H2 s2
class H3 s3
class H4 s4
```
Good at: local patterns, syntax, general context. Weak at: pinpointing a specific distant token.
Phase 2: Attention Layer Resolves Ambiguities
```mermaid
flowchart TD
AND["and - query token"] -->|"weight: 0.35"| STEEP["steep"]
AND -->|"weight: 0.28"| RIVER["river"]
AND -->|"weight: 0.22"| BANK["bank"]
AND -->|"weight: 0.08"| WAS["was"]
BANK2["bank - query"] -->|"weight: 0.65 STRONG"| RIVER2["river - disambiguates!"]
BANK2 -->|"weight: 0.12"| OF["of"]
BANK2 -->|"weight: 0.10"| THE["the"]
classDef query fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef high fill:#c4b5fd,stroke:#8b5cf6,color:#3b1f7e
classDef mid fill:#93c5fd,stroke:#3b82f6,color:#1e3a5f
classDef low fill:#f1f5f9,stroke:#94a3b8,color:#334155
classDef match fill:#fecaca,stroke:#ef4444,stroke-width:3px,color:#7f1d1d
class AND,BANK2 query
class STEEP high
class RIVER,BANK mid
class WAS,OF,THE low
class RIVER2 match
```
This is the checkpoint moment. "bank" attends to "river" with weight 0.65 — it gets disambiguated as "river bank", not "financial bank". After this layer, the representation is precise.
Phase 3-4: More Mamba → More Attention → Repeat
Back to efficient Mamba processing, now with enriched representations. The disambiguation carries forward. Next attention checkpoint does higher-level reasoning.
Part 6: The Three Design Decisions
If you're building or fine-tuning a hybrid model, these are the knobs:
1. Attention-to-Mamba Ratio
```mermaid
flowchart LR
TOO_MUCH["Too much attention - back to Transformer costs"] --- SWEET["Sweet spot 1:6 to 1:8 - best tradeoff"] --- TOO_LITTLE["Too little attention - retrieval quality drops"]
classDef bad fill:#fecaca,stroke:#ef4444,color:#7f1d1d
classDef good fill:#bbf7d0,stroke:#22c55e,stroke-width:2px,color:#14532d
class TOO_MUCH,TOO_LITTLE bad
class SWEET good
```
2. Where to Place Attention Layers
- Early layers learn low-level features (syntax, local patterns) → Mamba handles fine
- Deep layers handle abstract reasoning, long-range deps → attention more valuable
- Jamba distributes evenly (every 8th layer)
3. MoE vs Dense FFN
- First few layers: Dense FFN (shared features needed by all tokens)
- Later layers: MoE (specialized knowledge, different tokens need different experts)
Part 7: Other Hybrid Models to Watch
Jamba isn't alone. The hybrid approach is becoming a movement:
| Model | Approach | Key Innovation |
|---|---|---|
| Jamba (AI21) | Mamba + Attention + MoE | First production hybrid, 256K context |
| Zamba (Zyphra) | Mamba + shared attention | Shared KV cache across attention insertion points |
| Griffin (DeepMind) | Gated linear recurrence + attention | Google's take on efficient recurrence |
| RWKV-6 | Linear attention + recurrence | Open-source, Transformer-quality at linear cost |
| Mamba-2 (Gu & Dao) | Improved SSM as structured matrix mult | Faster hardware utilization, easier hybridization |
Part 8: The Future — My Take
Here's what I think is going to happen in the next 1-2 years:
Transformers Are NOT Dead
Let me be clear. Transformers won't die. They're:
- Proven at 400B+ parameter scale
- Backed by massive ecosystem (FlashAttention, vLLM, TensorRT, ONNX)
- Still the best at tasks requiring precise information routing
- The architecture behind every frontier model today
You can't just throw away 7+ years of engineering and optimization. The tooling, the infrastructure, the deployment pipelines — all built for Transformers.
But Pure Transformers Are Over-Engineered
Here's the thing: most of the computation in a Transformer is wasted. Not every layer needs quadratic attention. Not every token needs to look at every other token at every layer.
The research is converging on a clear signal: use the minimum attention necessary, fill the rest with efficient recurrence.
Hybrids Are the Path Forward
```mermaid
flowchart TD
F1["Frontier labs adopt hybrids -- 8x memory savings too compelling"]
F2["Attention becomes premium -- used sparingly like DB indexes"]
F3["Open-source hybrids close the gap -- Mamba + RWKV already open"]
F4["1M+ context windows standard -- constant-memory inference"]
F5["New hardware optimizations -- custom silicon for selective scan"]
F6["Cost per token drops -- model tiering strategies evolve"]
classDef c1 fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef c2 fill:#e0e7ff,stroke:#6366f1,color:#312e81
classDef c3 fill:#d1fae5,stroke:#10b981,color:#064e3b
classDef c4 fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef c5 fill:#fce7f3,stroke:#ec4899,color:#831843
classDef c6 fill:#ccfbf1,stroke:#14b8a6,color:#134e4a
class F1 c1
class F2 c2
class F3 c3
class F4 c4
class F5 c5
class F6 c6
```
For Us Developers
The architectural shift means real changes in how we build:
- Cost per token will drop — hybrid models need less compute
- Context windows will explode — entire codebases in context becomes feasible
- Real-time apps become easier — Mamba's O(1) inference enables streaming use cases
- Model tiering evolves — hybrid for bulk processing, Transformer for precision
Wrapping Up
The AI architecture landscape is evolving fast. Transformers gave us the foundation. Mamba showed us there's a better way to handle long sequences. And hybrids are proving you don't have to choose — you can have both.
The future isn't "Transformer vs Mamba." It's "Transformer AND Mamba, used where each excels."
If you're an engineer working with AI, now is a great time to understand these architectures. The next generation of models you'll be using — and maybe building on — will likely be hybrids.
If you found this useful, drop a like or a comment. I'm happy to go deeper on any section — the attention math, the SSM discretization, or the MoE routing if there's interest.
I'm Himanshu — an AI-augmented full-stack developer exploring how these architectures change the way we build software. Follow me for more deep dives on AI engineering.