Jovan Chan

Originally published at runaihome.com

DiffusionGemma 26B for Local AI in 2026: 18GB VRAM, 4× Faster Generation, and Which Consumer GPUs Actually Saturate the 1,000 tok/s Ceiling

TL;DR: DiffusionGemma 26B-A4B is Google DeepMind's experimental text-diffusion model that denoises 256-token blocks in parallel instead of writing one token at a time. That gets you 1,000+ tok/s on an H100 and ~700 tok/s on an RTX 5090 — roughly 4× a same-size autoregressive model. The catch: the headline 18GB footprint needs NVFP4, which is Blackwell-only, and there's no GGUF/Ollama path yet.

| | RTX 5090 32GB | RTX 4090 24GB | DGX Spark 128GB |
|---|---|---|---|
| Best for | Native NVFP4, best home speed | Transformers path, no GGUF yet | Compact deskside, big context |
| Quantization | NVFP4 (18 GB) | bf16 (~52 GB, won't fit) / int4 | NVFP4 (18 GB) |
| Generation speed | ~700 tok/s | ~200–400 tok/s (community) | 150+ tok/s |
| Street price (Jun 2026) | $2,000+ | ~$2,250 used | $3,999 |

Honest take: DiffusionGemma is a research preview, not your daily driver — it scores below standard Gemma 4 on every published benchmark, and Google says so. But if you have a Blackwell card and a throughput-bound workload, the 4× speedup is real. On anything older than RTX 50-series, wait for llama.cpp support before you bother.

Google DeepMind released DiffusionGemma on June 10, 2026 under the Apache 2.0 license. It's the first open-weights model in the Gemma line built on discrete text diffusion rather than autoregression, and the local-AI community noticed immediately because the speed numbers look like a different category of hardware. Before you git clone anything, you need to understand what diffusion text generation actually demands from your GPU — because it's not the same shape as running a normal LLM.

What "text diffusion" changes about your hardware

A normal LLM — Llama, Qwen, standard Gemma — is autoregressive. It produces one token, feeds it back in, produces the next, and repeats. Throughput is gated by how fast you can do one forward pass per token, which on consumer hardware is dominated by memory bandwidth: every token requires streaming the active weights through the GPU.
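To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes decode speed is capped by streaming the active weights once per token and ignores KV-cache reads, kernel overhead, and compute limits, so treat the result as a ceiling rather than a prediction. The function name and the bf16 assumption are illustrative and not from any benchmark.

```python
def bandwidth_bound_tok_per_s(bandwidth_gb_s: float,
                              active_params_b: float,
                              bytes_per_param: float) -> float:
    """Upper bound on autoregressive decode speed.

    Assumes every generated token streams the active weights through the
    GPU exactly once (a simplification: no KV cache, no overhead).
    """
    # Billions of parameters times bytes per parameter gives gigabytes.
    active_weight_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / active_weight_gb


# Illustrative inputs: 1,008 GB/s (RTX 4090), ~3.8B active parameters,
# 2 bytes per parameter (bf16).
ceiling = bandwidth_bound_tok_per_s(1008, 3.8, 2.0)
print(f"bandwidth ceiling: {ceiling:.0f} tok/s")
```

Real-world numbers land well below this kind of ceiling, which is why measured tok/s, not the formula, should drive a purchase.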

DiffusionGemma works differently. It starts from a block of 256 masked "noise" tokens and iteratively denoises the entire block in parallel, refining all 256 positions at once and self-correcting earlier guesses as it goes. Evaluations use what Google calls the Entropy-Bounded Denoising with Adaptive Stopping (EB) sampler, capped at 48 denoising steps. So instead of 256 sequential forward passes to fill a 256-token block, you do up to 48 passes that each operate on the whole block. That's where the parallelism — and the speed — comes from.
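The pass-count arithmetic is easy to check. The sketch below uses only the two figures quoted above (256-token blocks, at most 48 denoising steps per block); the function name is hypothetical, and it counts forward passes only. Each diffusion pass works on a whole block, so it is heavier than a single-token pass, and the pass ratio is not the wall-clock speedup.

```python
import math

BLOCK_TOKENS = 256       # block size quoted in the article
MAX_DENOISE_STEPS = 48   # sampler step cap quoted in the article


def forward_passes(n_tokens: int) -> dict:
    """Count forward passes needed to produce n_tokens.

    Autoregressive: one pass per token.
    Block diffusion: at most MAX_DENOISE_STEPS passes per 256-token block
    (adaptive stopping may use fewer, so this is a worst case).
    """
    blocks = math.ceil(n_tokens / BLOCK_TOKENS)
    return {
        "autoregressive": n_tokens,
        "diffusion_max": blocks * MAX_DENOISE_STEPS,
    }


counts = forward_passes(1024)
ratio = counts["autoregressive"] / counts["diffusion_max"]
print(counts, f"pass ratio: {ratio:.2f}x")
```

For 1,024 tokens that is 1,024 passes versus at most 192, a bit over 5× fewer passes in the worst case.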

The architecture underneath is the Gemma 4 26B-A4B Mixture-of-Experts: 25.2B total parameters, ~3.8B active per step. It's multimodal on input (text, image, and video in; text out) and supports a context window up to 256K tokens.

The practical consequence for your hardware: DiffusionGemma is more compute-bound and less purely bandwidth-bound than an equivalent autoregressive model, because each denoising step does dense work across a full block. That matters when you pick a GPU. A card with monster bandwidth but weak compute won't extract the full 4× — and the format you can run determines whether you even get in the door.

The 18GB number, and the NVFP4 asterisk

Every headline says DiffusionGemma "fits in 18GB of VRAM." That's true, but only at NVFP4.

At bf16 the 25.2B weights occupy roughly 52GB. The whole expert set has to be resident even though only ~3.8B parameters activate per step, the same MoE memory trap that applies to Qwen 3.6 35B-A3B and every other sparse MoE with an "A" (active-parameter) suffix in its name. 52GB doesn't fit on any single consumer card. To get to 18GB you need NVFP4, NVIDIA's 4-bit floating-point format.
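The footprint figures follow from simple arithmetic: parameter count times bits per weight. The sketch below computes the raw weights-only size for the 25.2B parameter count quoted above. The published figures (~52 GB, ~18 GB) sit above these raw numbers; the source does not break down the difference, so the helper deliberately models nothing beyond the weights themselves.

```python
TOTAL_PARAMS = 25.2e9  # total parameter count quoted in the article

# Bits per weight for each storage width (raw, no scale factors or metadata).
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "4-bit": 4}


def raw_weight_gb(params: float, bits: int) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring all overhead."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>5}: {raw_weight_gb(TOTAL_PARAMS, bits):.1f} GB raw")
```

The key point is that the total parameter count, not the ~3.8B active count, sets the memory bill.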

Here's the part most write-ups skip: NVFP4 is a Blackwell-native format. It has hardware tensor-core support on RTX 50-series and the RTX PRO line, but not on Ada (RTX 40-series) or Ampere (RTX 30-series). So the clean 18GB-in-an-RTX-4090 story you see repeated is misleading — a 4090 has the 24GB capacity, but it can't run NVFP4 with native acceleration. We cover the format in depth in the ComfyUI NVFP4 guide; the same generation rule applies here.

| Format | VRAM (weights) | Native on | Reality in June 2026 |
|---|---|---|---|
| bf16 | ~52 GB | All | Exceeds 2× 24GB (48 GB total); needs more than 52 GB across cards |
| FP8 | ~26 GB | Ada, Blackwell, Hopper | Datacenter path; exceeds 24GB consumer cards |
| NVFP4 | ~18 GB | Blackwell (RTX 50-series, RTX PRO) | The "18GB" headline number |
| GGUF int4 | ~16 GB (projected) | Any (via llama.cpp) | Not available yet |

That last row is the one that stings for most home labs. As of mid-June 2026, llama.cpp GGUF support for DiffusionGemma's block-diffusion sampler is still an open PR, not a release. No GGUF means no Ollama and no LM Studio yet — those wrap llama.cpp. Day-zero support shipped for vLLM, HuggingFace Transformers, MLX, Unsloth, and NVIDIA NeMo, so the supported local path today is vLLM or raw Transformers, not the one-line ollama pull most readers want. If your stack is Ollama-first, see our vLLM vs Ollama breakdown for what switching costs you.
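If you want to sanity-check your own card against the format table, a small helper like the one below captures the logic. It is a sketch under stated assumptions: the footprints are copied from the table, the CUDA compute-capability cutoffs (Ada at 8.9, Hopper at 9.0, Blackwell at 10.x and 12.x) are my mapping of the "Native on" column, and the 2 GB headroom for activations and cache is an arbitrary illustrative value. The function names are hypothetical.

```python
from typing import List, Tuple

# Weight footprints in GB, copied from the format table above.
FORMAT_VRAM_GB = {"bf16": 52.0, "fp8": 26.0, "nvfp4": 18.0}


def native_formats(compute_capability: Tuple[int, int]) -> List[str]:
    """Formats with hardware support, following the 'Native on' column.

    Assumed mapping: Ampere 8.0/8.6, Ada 8.9, Hopper 9.0, Blackwell 10.x/12.x.
    """
    formats = ["bf16"]  # the table lists bf16 as native everywhere
    if compute_capability >= (8, 9):   # Ada and newer
        formats.append("fp8")
    if compute_capability >= (10, 0):  # Blackwell
        formats.append("nvfp4")
    return formats


def usable_formats(compute_capability: Tuple[int, int],
                   vram_gb: float,
                   headroom_gb: float = 2.0) -> List[str]:
    """Native formats whose weights fit in VRAM with some headroom."""
    return [
        fmt for fmt in native_formats(compute_capability)
        if FORMAT_VRAM_GB[fmt] + headroom_gb <= vram_gb
    ]


print("RTX 5090:", usable_formats((12, 0), 32))
print("RTX 4090:", usable_formats((8, 9), 24))
print("RTX 3090:", usable_formats((8, 6), 24))
```

An empty list for the 24GB cards matches the argument above: they have the capacity for an 18GB model, but not a natively accelerated format that gets the weights down to that size.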

Real speed numbers, and the 1,000 tok/s ceiling

The "1,000 tokens per second" headline is a datacenter number. At batch size 1, the FP8 build reaches about 1,008 tok/s on a single H100 and 1,288 tok/s on an H200; NVIDIA quotes up to 2,000 tok/s on a DGX Station. Those are the figures behind "4× faster," a comparison against a same-size autoregressive model. As a consumer reference point, autoregressive Gemma 4 27B does roughly 40 tok/s on an RTX 4090.

What you actually get at home:

| Hardware | Memory BW | DiffusionGemma speed | Notes |
|---|---|---|---|
| H100 SXM | ~3.35 TB/s | ~1,008 tok/s (FP8) | The "1,000 tok/s" headline |
| H200 | ~4.8 TB/s | ~1,288 tok/s (FP8) | Datacenter |
| RTX 5090 32GB | 1,792 GB/s | ~700 tok/s | Best consumer number, native NVFP4 |
| DGX Spark 128GB | (LPDDR5X) | 150+ tok/s | Compact deskside, huge context headroom |
| RTX 4090 24GB | 1,008 GB/s | ~200–400 tok/s | Community estimate; no native NVFP4 |

So no consumer card saturates the 1,000 tok/s ceiling — that requires H100-class bandwidth. The RTX 5090 gets closest at ~700 tok/s because it pairs 1,792 GB/s of bandwidth with native NVFP4 tensor cores. It is, today, the only consumer GPU that runs DiffusionGemma the way it was designed to run. The RTX PRO 6000 Blackwell also qualifies and adds 96GB for long-context work, but at workstation prices.
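To see how far each home option sits from the datacenter figure, the sketch below expresses the speeds quoted above as a fraction of the H100 number and as wall-clock time for a long generation. The speeds are the article's own figures (the RTX 4090 range is a community estimate and the DGX Spark value is a lower bound); the helper names and the 10,000-token workload are illustrative.

```python
H100_TOK_S = 1008  # FP8, batch size 1, from the table above

# (low, high) tok/s as quoted in the table above.
HOME_SPEEDS = {
    "RTX 5090 32GB": (700, 700),
    "RTX 4090 24GB": (200, 400),    # community estimate
    "DGX Spark 128GB": (150, 150),  # quoted as "150+", so a lower bound
}


def fraction_of_h100(tok_s: float) -> float:
    """Speed relative to the single-H100 figure."""
    return tok_s / H100_TOK_S


def seconds_to_generate(n_tokens: int, tok_s: float) -> float:
    """Wall-clock seconds to produce n_tokens at a steady rate."""
    return n_tokens / tok_s


for name, (low, high) in HOME_SPEEDS.items():
    share = f"{fraction_of_h100(low):.0%}"
    if high != low:
        share += f" to {fraction_of_h100(high):.0%}"
    slowest = seconds_to_generate(10_000, low)
    print(f"{name}: {share} of H100, up to {slowest:.0f} s per 10k tokens")
```

By this measure the RTX 5090 reaches roughly 70% of the H100 figure, which is the gap the "ceiling" framing refers to.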

The RTX 4090 is the interesting tweener. It has the VRAM and the bandwidth, but no NVFP4 acceleration, so you're stuck running a heavier format through Transformers; community reports land around 200–400 tok/s. That's still several times faster than autoregressive Gemma 4 on the same card, but it's well short of the RTX 5090's ~700 tok/s, and you're paying ~$2,250 used for a card that's now mid-pack for this model.

What about RTX 3090 and the budget tier?

This is where DiffusionGemma diverges hard from the usual local-AI advice. Normally a used RTX 3090 — around $1,050 on eBay in June 2026, down from its peak but no longer the $500 bargain it once was — is the value king for 24GB workloads. Here it's a poor fit:

  • No NVFP4. Ampere can't accelerate the format that makes 18GB possible.
  • No GGUF yet. The int4 path that would let a 3090 run this hasn't shipped.
  • Bandwidth gap. At 936 GB/s the 3090 trails the 5090's 1,792 GB/s by nearly half, and diffusion's compute-heavy steps don't favor Ampere.

If you own a 3090, the right move is to keep running autoregressive Gemma 4 or Qwen on it and revisit DiffusionGemma when llama.cpp lands a GGUF. Buying a 3090 for DiffusionGemma makes no sense to
