DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at aifoss.dev

DiffusionGemma 26B Review 2026: 4x Faster, At a Cost

This article was originally published on aifoss.dev

TL;DR: DiffusionGemma generates text in parallel 256-token blocks instead of one token at a time, which makes it roughly 4x faster than a comparable Gemma 4 model — over 1,000 tokens/sec on an H100. You pay for that speed with a measurable quality drop (about 5 points on MMLU Pro) and an awkward runtime story. It's Apache 2.0 and self-hostable, but it is explicitly experimental.

DiffusionGemma 26B-A4B Gemma 4 26B-A4B Ollama + Qwen3.6 35B
Best for Low-latency drafting, high-throughput generation Quality-critical chat, RAG, coding Balanced local assistant
Decoding Parallel block diffusion (256 tok/block) Autoregressive (1 tok/step) Autoregressive
Speed 1,000+ tok/s (H100), ~700 (RTX 5090) ~250 tok/s class ~40–60 tok/s on 24GB
VRAM (Q4) ~18 GB ~15 GB ~22 GB
License Apache 2.0 Apache 2.0 Apache 2.0
The catch Lower benchmarks, runtime still maturing Slower Slower, bigger

Honest take: If you're streaming output to a user and latency is the product, DiffusionGemma is worth testing today. For anything where the answer has to be right, run standard Gemma 4 and don't look back.

What DiffusionGemma actually is

Google DeepMind released DiffusionGemma on June 10, 2026. It's the first open-weight model in the Gemma family to drop autoregressive decoding — the token-by-token loop every other local LLM uses — in favor of text diffusion.

The architecture is built on the Gemma 4 26B-A4B backbone: 26B total parameters, mixture-of-experts, with roughly 4B active per forward pass (the "A4B" in the name). On top of that backbone, Google bolted a diffusion head. Instead of predicting the next token, the model predicts a whole block of up to 256 tokens at once as noise, then refines that block over several denoising steps until it settles into coherent text.

If you've used Stable Diffusion or Flux for images, the mental model is the same: start from noise, denoise toward a target. DiffusionGemma applies that idea to text. The payoff is parallelism — generating 256 tokens in a handful of refinement passes is far less work than 256 sequential forward passes.

The model card lists it as Apache 2.0, which matters: this is genuinely open for commercial self-hosting, unlike the restrictive community licenses on some "open" frontier models. The context window is 256K tokens (262,144), inherited from the Gemma 4 line.

The speed is real

Diffusion decoding is the whole reason this model exists, and the throughput numbers hold up. On a single NVIDIA H100, DiffusionGemma exceeds 1,000 tokens per second. On an RTX 5090, reports put it around 700 tok/s. Google's own framing is "up to 4x faster generation than comparable Gemma models."

That's not a marginal win. A standard 26B-class autoregressive model on the same hardware lands in the low hundreds of tokens per second. For workloads where you're generating long outputs — bulk summarization, synthetic data, draft generation, autocomplete that has to feel instant — 4x is the difference between a tool that feels sluggish and one that feels immediate.

Where the speedup shrinks: very short outputs. If you're generating a 20-token answer, the block-diffusion overhead and refinement passes eat into the advantage. Diffusion wins on long generations, not one-liners.

The quality cost, with numbers

This is the part most launch coverage glosses over, so here are the actual figures. DiffusionGemma 26B-A4B versus standard Gemma 4 26B-A4B:

Benchmark DiffusionGemma Gemma 4 26B-A4B Gap
MMLU Pro 77.6% 82.6% −5.0
GPQA Diamond 73.2% 82.3% −9.1
LiveCodeBench v6 69.1% 77.1% −8.0
Codeforces (Elo) 1429 1718 −289

A 5-point MMLU Pro drop is tolerable for many tasks. The 9-point GPQA Diamond gap and the coding regressions are not noise — they're the kind of gap you feel on hard reasoning and on code that has to compile. Google is unambiguous about this: DiffusionGemma is experimental, and the official recommendation is to use standard Gemma 4 for quality-critical production workloads.

That's a refreshingly honest position from a model maker, and you should take it at face value. This is a research release that happens to be genuinely useful for a specific shape of problem, not a Gemma 4 replacement.

Running it: the runtime story is messy

Here's where you need to pay attention, because "self-hostable" comes with asterisks in June 2026.

vLLM (the clean path). vLLM shipped day-zero support. This is the most reliable way to serve DiffusionGemma right now:

vllm serve google/diffusiongemma-26B-A4B-it \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85
Enter fullscreen mode Exit fullscreen mode

That gives you an OpenAI-compatible endpoint on port 8000. If you've set up vLLM before, nothing here is new — see our vLLM setup guide for the base install. On a single 80GB H100 this runs comfortably; the --max-model-len flag is what unlocks the full 256K context.

llama.cpp / GGUF (experimental). This is where it gets awkward. DiffusionGemma is a block-diffusion architecture, so the standard llama-cli and llama-server binaries cannot generate from it. You need the dedicated DiffusionGemma branch (PR ggml-org/llama.cpp#24423) and a separate llama-diffusion-cli runner:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423
cmake -B build -DGGML_CUDA=ON && cmake --build build
Enter fullscreen mode Exit fullscreen mode

Unsloth publishes pre-quantized GGUFs at unsloth/diffusiongemma-26B-A4B-it-GGUF. Pull the Q4_K_M build (~16 GB download):

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
  --local-dir diffusiongemma-gguf --include "*Q4_K_M*"
Enter fullscreen mode Exit fullscreen mode

Then run it through the diffusion runner:

./build/bin/llama-diffusion-cli \
  -m diffusiongemma-gguf/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual
Enter fullscreen mode Exit fullscreen mode

That --diffusion-visual flag is genuinely fun — it shows the canvas denoising in real time so you can watch text emerge from noise. Quantized to Q4, the model fits in roughly 18 GB of VRAM, which puts 24 GB consumer cards like the RTX 4090 and RTX 3090 comfortably in range.

The catch: as of June 2026, this GGUF path lives on an unmerged PR. If you want a stable, supported 4-bit setup, the recommended route is NVFP4 quantization through HuggingFace Transformers, or just use vLLM. For background on what Q4_K_M and friends actually mean for quality, see our GGUF quantization guide.

Ollama. Not yet. Ollama wraps standard llama.cpp inference, and since the base runner can't do block diffusion, there's no ollama pull diffusiongemma that works today. Watch the upstream PR.

A real gotcha: don't judge it on short prompts

The first thing most people do with a new model is throw a one-line question at it and eyeball the answer. With DiffusionGemma that's the worst possible test. Two reasons.

First, the speed advantage barely shows on short outputs — you're paying block-diffusion overhead for tokens you'd have gotten fast anyway. Second, the quality gap is most visible on exactly the kind of single-shot reasoning question (GPQA-style) where it's weakest. You'll come away thinking "slower than I expected and dumber than Gemma 4," which is the wrong conclusion for the workload it's built for.

Test it the way you'd use it: long generations, batched throughput, latency-sensitive streaming. Run a summarization job over a few hundred documents and compare wall-clock time against autoregressive Gemma 4. That's where the 4x lives.

When NOT to use DiffusionGemma

  • Quality-critical work. Hard reasoning, math proofs,

Top comments (0)