This article was originally published on runaihome.com
TL;DR: DiffusionGemma 26B-A4B is Google DeepMind's experimental text-diffusion model that denoises 256-token blocks in parallel instead of writing one token at a time. That gets you 1,000+ tok/s on an H100 and ~700 tok/s on an RTX 5090 — roughly 4× a same-size autoregressive model. The catch: the headline 18GB footprint needs NVFP4, which is Blackwell-only, and there's no GGUF/Ollama path yet.
| | RTX 5090 32GB | RTX 4090 24GB | DGX Spark 128GB |
|---|---|---|---|
| Best for | Native NVFP4, best home speed | Transformers path, no GGUF yet | Compact deskside, big context |
| Quantization | NVFP4 (18 GB) | bf16 (~52 GB, won't fit) / int4 | NVFP4 (18 GB) |
| Generation speed | ~700 tok/s | ~200–400 tok/s (community) | 150+ tok/s |
| Street price (Jun 2026) | $2,000+ | ~$2,250 used | $3,999 |
Honest take: DiffusionGemma is a research preview, not your daily driver — it scores below standard Gemma 4 on every published benchmark, and Google says so. But if you have a Blackwell card and a throughput-bound workload, the 4× speedup is real. On anything older than RTX 50-series, wait for llama.cpp support before you bother.
Google DeepMind released DiffusionGemma on June 10, 2026 under the Apache 2.0 license. It's the first open-weights model in the Gemma line built on discrete text diffusion rather than autoregression, and the local-AI community noticed immediately because the speed numbers look like a different category of hardware. Before you git clone anything, you need to understand what diffusion text generation actually demands from your GPU — because it's not the same shape as running a normal LLM.
What "text diffusion" changes about your hardware
A normal LLM — Llama, Qwen, standard Gemma — is autoregressive. It produces one token, feeds it back in, produces the next, and repeats. Throughput is gated by how fast you can do one forward pass per token, which on consumer hardware is dominated by memory bandwidth: every token requires streaming the active weights through the GPU.
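A back-of-the-envelope sketch of that bandwidth gate: if every generated token has to stream the active weights once, memory bandwidth divided by the bytes of active weights gives a ceiling on decode speed. The function name and the inputs (3.8B active parameters in bf16) are illustrative assumptions, not measurements. Real throughput lands below this bound once KV-cache reads and kernel overheads are counted.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bytes_per_param: float) -> float:
    """Upper bound on tokens/s when each token streams all active weights once."""
    # billions of parameters * bytes per parameter = gigabytes read per token
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


# Illustrative inputs (assumption): 3.8B active parameters stored in bf16 (2 bytes each).
# Bandwidth figures are the ones quoted elsewhere in this article.
for name, bandwidth in [("RTX 3090", 936.0), ("RTX 4090", 1008.0), ("RTX 5090", 1792.0)]:
    ceiling = decode_ceiling_tok_s(bandwidth, active_params_b=3.8, bytes_per_param=2.0)
    print(f"{name}: at most ~{ceiling:.0f} tok/s if purely bandwidth-bound")
```

Treat the result as a ceiling only: it says nothing about compute limits, which is exactly where diffusion decoding behaves differently.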
DiffusionGemma works differently. It starts from a block of 256 masked "noise" tokens and iteratively denoises the entire block in parallel, refining all 256 positions at once and self-correcting earlier guesses as it goes. Evaluations use what Google calls the Entropy-Bounded Denoising with Adaptive Stopping (EB) sampler, capped at 48 denoising steps. So instead of 256 sequential forward passes to fill a 256-token block, you do up to 48 passes that each operate on the whole block. That's where the parallelism — and the speed — comes from.
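The pass-count arithmetic can be sketched in a few lines. This is a simplified model, assuming blocks are generated one after another and that every block uses the full 48-step budget; adaptive stopping may finish earlier. The function names are made up for illustration.

```python
import math

BLOCK_SIZE = 256  # tokens denoised together, per the article
MAX_STEPS = 48    # cap on denoising steps for the EB sampler, per the article


def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens


def diffusion_passes_upper_bound(num_tokens: int,
                                 block_size: int = BLOCK_SIZE,
                                 max_steps: int = MAX_STEPS) -> int:
    """Worst case: every block spends the full step budget."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * max_steps


n = 1024
ar = autoregressive_passes(n)
diff = diffusion_passes_upper_bound(n)
print(f"{n} tokens: {ar} autoregressive passes vs at most {diff} denoising passes")
print(f"ratio: {ar / diff:.1f}x fewer passes")
```

Fewer passes is not the same as an equal speedup: each denoising pass processes a whole 256-token block, so it does more work than a single-token pass.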
The architecture underneath is the Gemma 4 26B-A4B Mixture-of-Experts: 25.2B total parameters, ~3.8B active per step. It's multimodal on input (text, image, and video in; text out) and supports a context window up to 256K tokens.
The practical consequence for your hardware: DiffusionGemma is more compute-bound and less purely bandwidth-bound than an equivalent autoregressive model, because each denoising step does dense work across a full block. That matters when you pick a GPU. A card with monster bandwidth but weak compute won't extract the full 4× — and the format you can run determines whether you even get in the door.
The 18GB number, and the NVFP4 asterisk
Every headline says DiffusionGemma "fits in 18GB of VRAM." That's true, but only at NVFP4.
At bf16 the 25.2B weights occupy roughly 52GB — the whole expert set has to be resident even though only 3.8B activate per step, the same MoE memory trap that applies to Qwen 3.6 35B-A3B and every other A-class MoE. 52GB doesn't fit on any single consumer card. To get to 18GB you need NVFP4, NVIDIA's 4-bit floating-point format.
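For a sanity check on those footprints, here is a weights-only estimate from the parameter count and nominal bits per weight. It is a rough floor, not a reproduction of the published figures: it ignores quantization scale factors, any layers kept at higher precision, activations, and the KV cache, which is presumably part of why the quoted numbers (about 52 GB, 26 GB, and 18 GB) come out higher. The helper name is illustrative.

```python
TOTAL_PARAMS_B = 25.2  # total parameters in billions, per the article

# Nominal bits per weight. Assumption: NVFP4 is counted as a flat 4 bits,
# ignoring its per-block scale factors.
BITS_PER_WEIGHT = {"bf16": 16, "FP8": 8, "NVFP4": 4}


def weights_gb(params_b: float, bits: int) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes)."""
    return params_b * bits / 8


for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt}: ~{weights_gb(TOTAL_PARAMS_B, bits):.1f} GB weights-only")
```

The key point the estimate makes is that the total parameter count, not the ~3.8B active count, sets the memory bill.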
Here's the part most write-ups skip: NVFP4 is a Blackwell-native format. It has hardware tensor-core support on RTX 50-series and the RTX PRO line, but not on Ada (RTX 40-series) or Ampere (RTX 30-series). So the clean 18GB-in-an-RTX-4090 story you see repeated is misleading — a 4090 has the 24GB capacity, but it can't run NVFP4 with native acceleration. We cover the format in depth in the ComfyUI NVFP4 guide; the same generation rule applies here.
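One way to apply that generation rule in a script is to branch on the GPU's CUDA compute capability. The sketch below is a hypothetical helper, not part of any library. The capability numbers are my addition rather than the article's: 8.6 for RTX 3090, 8.9 for RTX 4090, 9.0 for H100, 12.0 for RTX 5090. In PyTorch, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple of this shape.

```python
from typing import List, Tuple


def native_formats(capability: Tuple[int, int]) -> List[str]:
    """Map a CUDA compute capability (major, minor) to natively accelerated formats."""
    formats = ["bf16"]        # the article's format table lists bf16 as native on all cards
    if capability >= (8, 9):  # Ada (8.9), Hopper (9.0), and newer
        formats.append("FP8")
    if capability[0] >= 10:   # Blackwell: 10.x datacenter parts, 12.x RTX 50-series
        formats.append("NVFP4")
    return formats


# Example capabilities (assumed values, see the note above).
for name, cc in [("RTX 3090", (8, 6)), ("RTX 4090", (8, 9)),
                 ("H100", (9, 0)), ("RTX 5090", (12, 0))]:
    print(f"{name} (sm_{cc[0]}{cc[1]}): {', '.join(native_formats(cc))}")
```

A check like this only tells you what the hardware accelerates; whether a given runtime ships kernels for the format on your card is a separate question.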
| Format | VRAM (weights) | Native on | Reality in June 2026 |
|---|---|---|---|
| bf16 | ~52 GB | All | Needs 2× 24GB or a 48GB+ card |
| FP8 | ~26 GB | Ada, Blackwell, Hopper | Datacenter path; tight on 24GB consumer |
| NVFP4 | ~18 GB | Blackwell (RTX 50-series, RTX PRO) | The "18GB" headline number |
| GGUF int4 | ~16 GB (projected) | Any (via llama.cpp) | Not available yet |
That last row is the one that stings for most home labs. As of mid-June 2026, llama.cpp GGUF support for DiffusionGemma's block-diffusion sampler is still an open PR, not a release. No GGUF means no Ollama and no LM Studio yet — those wrap llama.cpp. Day-zero support shipped for vLLM, HuggingFace Transformers, MLX, Unsloth, and NVIDIA NeMo, so the supported local path today is vLLM or raw Transformers, not the one-line ollama pull most readers want. If your stack is Ollama-first, see our vLLM vs Ollama breakdown for what switching costs you.
Real speed numbers, and the 1,000 tok/s ceiling
The "1,000 tokens per second" headline is a datacenter number. At batch size 1, the FP8 build reaches about 1,008 tok/s on a single H100 and 1,288 tok/s on an H200; NVIDIA quotes up to 2,000 tok/s on a DGX Station. Those are the figures behind "4× faster" — for reference, autoregressive Gemma 4 27B does roughly 40 tok/s on an RTX 4090.
What you actually get at home:
| Hardware | Memory BW | DiffusionGemma speed | Notes |
|---|---|---|---|
| H100 SXM | ~3.35 TB/s | ~1,008 tok/s (FP8) | The "1,000 tok/s" headline |
| H200 | ~4.8 TB/s | ~1,288 tok/s (FP8) | Datacenter |
| RTX 5090 32GB | 1,792 GB/s | ~700 tok/s | Best consumer number, native NVFP4 |
| DGX Spark 128GB | (LPDDR5X) | 150+ tok/s | Compact deskside, huge context headroom |
| RTX 4090 24GB | 1,008 GB/s | ~200–400 tok/s | Community estimate; no native NVFP4 |
So no consumer card reaches the 1,000 tok/s headline figure; that takes H100-class bandwidth. The RTX 5090 gets closest at ~700 tok/s because it pairs 1,792 GB/s of bandwidth with native NVFP4 tensor cores. It is, today, the only consumer GPU that runs DiffusionGemma the way it was designed to run. The RTX PRO 6000 Blackwell also qualifies and adds 96GB for long-context work, but at workstation prices.
The RTX 4090 is the interesting tweener. It has the VRAM and the bandwidth, but no NVFP4 acceleration, so you're stuck running a heavier format through Transformers — community reports land around 200–400 tok/s. That's still several times faster than autoregressive Gemma 4 on the same card, but it's not the 4× story, and you're paying ~$2,250 used for a card that's now mid-pack for this model.
What about RTX 3090 and the budget tier?
This is where DiffusionGemma diverges hard from the usual local-AI advice. Normally a used RTX 3090 — around $1,050 on eBay in June 2026, down from its peak but no longer the $500 bargain it once was — is the value king for 24GB workloads. Here it's a poor fit:
- No NVFP4. Ampere can't accelerate the format that makes 18GB possible.
- No GGUF yet. The int4 path that would let a 3090 run this hasn't shipped.
- Bandwidth gap. At 936 GB/s the 3090 trails the 5090's 1,792 GB/s by nearly half, and diffusion's compute-heavy steps don't favor Ampere.
If you own a 3090, the right move is to keep running autoregressive Gemma 4 or Qwen on it and revisit DiffusionGemma when llama.cpp lands a GGUF. Buying a 3090 for DiffusionGemma makes no sense to