Google Releases DiffusionGemma: Parallel Block Decoding

#ai #llm #machinelearning #tutorial

What: Google released DiffusionGemma, an open-weight model whose headline trick is parallel block decoding — it writes text by refining a whole block of tokens at once through iterative denoising, instead of predicting one next token at a time.

Why: Decoding is the slow, sequential part of running an LLM: emitting N tokens normally costs N forward passes that each wait on the last. Laying down a block in parallel is why DiffusionGemma reports up to 4x faster decode and 1000+ tokens/sec on an H100.

vs prior: Versus standard autoregressive decoding — left-to-right, one token per forward pass under a causal mask — DiffusionGemma starts from a canvas of 256 placeholder tokens and refines them all at once with bidirectional attention, so it can revise an early token using later context.

Think of it as a Polaroid photo developing all at once vs a printer typing left to right.

                       THE PARAGRAPH
                           │
             ┌─────────────┴─────────────┐
             │                           │
     ┌───────▼────────┐         ┌────────▼───────┐
     │    PRINTER     │         │    POLAROID    │
     │ (autoregress.) │         │  (diffusion)   │
     └───────┬────────┘         └────────┬───────┘
             │                           │
     types one token            lays the whole block
     left-to-right, each        down at once, then
     waiting on the last        sharpens it in passes
             │                           │
             ▼                           ▼
        ✗ N passes              ✓ a few parallel
          per N tokens            passes per block
        (sequential)            (up to ~4x faster)

printer (autoregressive) = types one token left-to-right, each waiting on the last
blank Polaroid = a block of 256 placeholder tokens, all present at once but unreadable
the photo developing = iterative denoising — the whole block sharpens in parallel over a few passes
no corner-first rule = bidirectional attention, so any token can use any other to fix itself

Quick glossary

Autoregressive decoding — The standard way LLMs write: predict the next token, append it, feed the longer sequence back in, repeat. Each token needs its own forward pass, and they happen strictly in order — that sequential chain is what makes decode slow.

Diffusion language model — A text model that borrows the recipe behind image generators: start from noise (random placeholder tokens) and repeatedly denoise toward a clean output. Unlike image diffusion it works over discrete tokens, refining a block rather than left-to-right.

Iterative denoising — The refinement loop. Each pass locks in the high-confidence tokens and re-evaluates the rest, so a blank block sharpens into readable text over a handful of passes instead of one token at a time.

Bidirectional (non-causal) attention — Attention with no left-to-right rule: every position can look at every other, future included. It is what lets the model fix an early token using context that appears later — the opposite of the causal mask autoregressive decoders rely on.

Forward pass — One run of the input through the network. Autoregressive decode pays one forward pass per token; DiffusionGemma emits 256 tokens from each pass, then spends a few more passes cleaning them up.

Mixture-of-Experts (MoE) — A model split into many expert sub-networks where each token activates only a few. DiffusionGemma is 26B total / ~3.8B active, so it has the knowledge of a big model but the per-token compute of a small one.
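To make the causal versus bidirectional distinction from the glossary concrete, here is a minimal sketch of the two attention masks for a tiny sequence. The function names are hypothetical and chosen for illustration; this is not DiffusionGemma's code, only the general shape of the two masks.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Under a causal mask a position sees only itself and earlier positions.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position, future included."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(render(causal_mask(4)))
    print("bidirectional:")
    print(render(bidirectional_mask(4)))
```

The causal mask is a lower triangle, so row 0 (the first token) can never see anything later. The bidirectional mask is fully filled in, which is what allows an early slot to be revised using later context.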

The news. On June 10, 2026, Google released DiffusionGemma, an Apache-2.0 model that generates text by iterative denoising rather than left-to-right sampling. It seeds a block with placeholder tokens and refines 256 tokens in parallel per forward pass using bidirectional attention, reaching 1000+ tokens/sec on an H100 and 700+ tokens/sec on an RTX 5090, and fitting in 18 GB of VRAM when quantized. It is a 26B-parameter mixture-of-experts model with about 3.8B active.

Picture two machines printing the same paragraph. The first is a dot-matrix printer: it types one character at a time, left to right, and the next character can't start until the last one lands. That is autoregressive decoding, the way nearly every LLM you have used writes one token at a time. The second is a Polaroid: the whole photo comes out at once, blank and blurry, then sharpens everywhere simultaneously over a few seconds. DiffusionGemma is the Polaroid. It lays down a whole block of placeholder tokens up front and then develops them in parallel, so the paragraph appears all at once and gets clearer with each pass.

Underneath the metaphor, "developing the photo" is iterative denoising. The model seeds a block with 256 noisy placeholder slots, then makes several refinement passes; each pass locks in the tokens it is now confident about and re-evaluates the rest. The trick that makes this legal is bidirectional attention — dropping the causal mask that forces a normal decoder to only look backward. Because every slot can attend to every other slot, future included, the model can self-correct an early token using words that only got resolved later. A left-to-right decoder can never do that: once it commits token 5, tokens 6 onward can lean on it, but it can't lean on them.
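The "lock in the confident tokens, re-evaluate the rest" loop can be sketched as a toy in Python. Everything here is an assumption made for illustration: `toy_predict` is a hypothetical stand-in that cheats by reading the answer from `target`, and the confidence formula and the 0.5 threshold are invented. It shows the control flow of confidence-based refinement, not Google's actual schedule.

```python
import random

MASK = None  # a placeholder slot that has not been committed yet


def toy_predict(block, target, rng):
    """Hypothetical stand-in for the model (not DiffusionGemma's network).

    Proposes a token and a confidence for every still-masked slot. Confidence
    rises when the left or right neighbour is already committed, a crude nod
    to bidirectional context. It cheats by reading the answer from `target`.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok is not MASK:
            continue
        left = i > 0 and block[i - 1] is not MASK
        right = i < len(block) - 1 and block[i + 1] is not MASK
        conf = 0.3 + 0.3 * left + 0.3 * right + 0.1 * rng.random()
        proposals[i] = (target[i], conf)
    return proposals


def denoise(target, threshold=0.5, seed=0):
    """Refine a fully masked block until every slot is committed."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    passes = 0
    while any(tok is MASK for tok in block):
        proposals = toy_predict(block, target, rng)  # one "forward pass"
        passes += 1
        best = max(proposals, key=lambda i: proposals[i][1])
        for i, (tok, conf) in proposals.items():
            # Lock confident slots; always lock the single best one so the
            # loop makes progress. Everything else is re-evaluated next pass.
            if conf >= threshold or i == best:
                block[i] = tok
    return block, passes


if __name__ == "__main__":
    words = "the whole block sharpens in parallel over passes".split()
    final, n_passes = denoise(words)
    print(" ".join(final), "| passes:", n_passes)
```

Note that a slot's confidence depends on neighbours on both sides, so a slot near the start can be resolved after, and because of, a slot that sits later in the block. A left-to-right loop has no equivalent of that.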

Property	Autoregressive (standard Gemma)	Parallel block decoding (DiffusionGemma)
How a token is produced	predict the single next token, append, repeat	seed a block of placeholders, denoise all at once
Tokens per forward pass	1	256 (per Google)
Attention	causal (look backward only)	bidirectional (look both ways)
Can fix an earlier token?	no — already committed	yes — re-evaluated each pass
Reported decode speed	baseline	up to ~4x faster, 1000+ tok/s on H100 (reported by Google)

Why does generating in blocks win? Run the numbers on a 512-token answer (illustrative). The autoregressive printer needs 512 forward passes, one per token, each stalled waiting on the previous one. That serial chain is why decode is the latency-sensitive, memory-bandwidth-bound phase of LLM inference: every step reloads the weights to produce a single token. DiffusionGemma instead lays those 512 tokens down as two blocks of 256 and refines each over a handful of denoising passes, say ~16 passes total (illustrative; Google reports the speedup, not the pass count). That collapses hundreds of strictly sequential steps into a few parallel ones, and a parallel-friendly pass keeps the GPU busy, which is where the reported up to 4x faster decode and 1000+ tokens/sec on an H100 come from.
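The pass-count arithmetic above can be written out as a short sketch. The `passes_per_block=8` default is an assumption picked only to reproduce the article's illustrative total of ~16 passes; Google does not publish a pass count, and the function names are hypothetical.

```python
import math


def autoregressive_passes(n_tokens):
    """One forward pass per generated token, strictly in order."""
    return n_tokens


def block_passes(n_tokens, block_size=256, passes_per_block=8):
    """Sequential passes when tokens are refined block by block.

    passes_per_block is an illustrative assumption, not a published figure.
    """
    blocks = math.ceil(n_tokens / block_size)
    return blocks * passes_per_block


if __name__ == "__main__":
    n = 512
    ar = autoregressive_passes(n)
    bd = block_passes(n)
    print(f"autoregressive: {ar} sequential passes")
    print(f"block decoding: {bd} sequential passes")
    print(f"ratio of sequential steps: {ar / bd:.0f}x")
```

Under these assumptions the sequential step count drops from 512 to 16, a 32x reduction. That ratio is not the wall-clock speedup: each denoising pass does far more work than one autoregressive step, which is consistent with the reported figure being "up to 4x" rather than 32x.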

The catch is that each denoising pass is heavier than a single autoregressive step. Bidirectional attention re-reads the whole block every pass, so it can't reuse a backward-only KV cache the way a causal decoder does, and the headline 4x is measured on dedicated GPUs where that parallel work has lanes to fill. DiffusionGemma offsets the cost with a mixture-of-experts design (26B total parameters but only ~3.8B active per token) and fits in 18 GB of VRAM when quantized, so it still runs on a high-end consumer GPU such as the RTX 5090 that Google benchmarks. The payoff is a different shape of LLM: not a faster printer, but a model that drafts a paragraph all at once and sharpens it, a live, open-weight alternative to left-to-right decoding.
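Two back-of-the-envelope figures follow from the numbers quoted above: the share of parameters active per token, and a ceiling on bits per parameter implied by the 18 GB footprint. This is a rough sketch under stated assumptions: it treats 1 GB as 10^9 bytes and pretends the whole 18 GB holds weights, which overstates the budget because activations and other buffers also need room. The source does not state the quantization bit width, so the second number is only an upper bound.

```python
def active_fraction(active_params, total_params):
    """Share of the model's parameters used for any single token."""
    return active_params / total_params


def bits_per_param_ceiling(vram_bytes, total_params):
    """Upper bound on bits per weight if all VRAM were spent on weights.

    Assumption: nothing else occupies memory, so the real figure is lower.
    """
    return vram_bytes * 8 / total_params


if __name__ == "__main__":
    total = 26e9    # total parameters, from the article
    active = 3.8e9  # approximate active parameters per token, from the article
    vram = 18e9     # 18 GB, assuming 1 GB = 10^9 bytes

    print(f"active share per token: {active_fraction(active, total):.1%}")
    print(f"bits per parameter (ceiling): {bits_per_param_ceiling(vram, total):.2f}")
```

Roughly one parameter in seven does work for a given token, and the memory budget leaves at most about 5.5 bits per weight, so all 26B parameters have to be stored at well under 8 bits each even though only a fraction are active at a time.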


Related explainers

PSD — parallel speculative decoding for diffusion LLMs — a different lever on the same family: speed up diffusion decoding by drafting and verifying, rather than tuning the denoising schedule
dMoE — block-level expert routing — the memory side of serving a diffusion-LLM mixture-of-experts, the model family DiffusionGemma belongs to
Gemma 4 12B — encoder-free multimodal projection — a sibling open Gemma release chasing efficiency from the architecture side instead of the decoding side

FAQ

What is parallel block decoding?

It is a way to generate text a whole block at a time instead of one token at a time. DiffusionGemma seeds a block with 256 placeholder tokens, then makes several "denoising" passes that lock in the confident tokens and re-evaluate the rest, so the whole block sharpens in parallel. Because it uses bidirectional attention, the model can revise an early token using context that appears later — something an autoregressive, left-to-right decoder cannot do.

Why is it faster than autoregressive generation?

Autoregressive decoding produces one token per forward pass, and the passes happen strictly in order, so a 512-token answer needs 512 sequential steps. DiffusionGemma emits 256 tokens per pass and finishes a block in a handful of passes, collapsing hundreds of serial steps into a few parallel ones. Google reports up to 4x faster decode and 1000+ tokens/sec on an H100. The trade-off is that each denoising pass is heavier and can't reuse a backward-only KV cache, so the win is largest on dedicated GPUs.

How does it relate to diffusion image models and normal text generation?

It borrows the core idea from image diffusion — start from noise and repeatedly denoise toward a clean result — but applies it to discrete tokens and refines a block rather than a 2D image. Compared with normal autoregressive text generation, it swaps "predict the next token under a causal mask" for "refine a whole block under bidirectional attention." The output is still text; only the decoding procedure changes.

Originally posted on Learn AI Visually.