DiffusionGemma: Run Google's 4x Faster Diffusion LLM Locally
Google DeepMind just open-sourced DiffusionGemma — and it's not just another Gemma model. It's a fundamentally different way to generate text: diffusion instead of autoregression.
Here's what you need to know to run it on your own machine.
What Makes It Different
Standard LLMs (GPT, Llama, Qwen) generate text one token at a time, left-to-right. Each token depends on all previous tokens, and once it's generated, it's permanent.
DiffusionGemma works like a text version of Stable Diffusion: it fills in 256 tokens at once through iterative denoising. Every token in that block can attend to every other token. If the model loses confidence in a token mid-generation, it can go back and fix it — something autoregressive models fundamentally cannot do.
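To make the "go back and fix it" idea concrete, here is a toy sketch of confidence-based block denoising. It is not DiffusionGemma's actual algorithm: the function names, the commit schedule, the `remask_below` threshold, and the random `toy_propose` stand-in for the model are all assumptions for illustration, and the block is 16 positions instead of 256 to keep it readable.

```python
import math
import random

MASK = None  # marks a position that has not been decided yet


def denoise_block(length, propose, steps=8, remask_below=0.3):
    """Fill `length` positions over `steps` parallel refinement passes.

    `propose(block)` is a stand-in for the model: it returns one
    (token, confidence) pair per position, looking at the whole block.
    """
    block = [MASK] * length
    for step in range(steps):
        proposals = propose(block)
        last = step == steps - 1
        if not last:
            # Revisit earlier decisions: re-mask tokens the model now doubts.
            for i, (_, conf) in enumerate(proposals):
                if block[i] is not MASK and conf < remask_below:
                    block[i] = MASK
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit a share of the open positions, most confident first.
        # On the last pass, everything still open gets filled.
        quota = len(masked) if last else math.ceil(len(masked) / (steps - step))
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            block[i] = proposals[i][0]
    return block


rng = random.Random(0)


def toy_propose(block):
    # Hypothetical model: random confidence, token named after its position.
    return [(f"tok{i}", rng.random()) for i in range(len(block))]


print(denoise_block(16, toy_propose))
```

The part an autoregressive decoder has no equivalent for is the re-masking loop: a token committed in an early pass can be reopened in a later one.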
Tech specs:
- 26B parameters total, 3.8B activated (Mixture of Experts)
- 256K context window, 140+ languages
- Apache 2.0 license: permissive, with no field-of-use restrictions (attribution and license notices still apply when you redistribute)
- Multimodal: text + image + video inputs
The Speed Difference
| Hardware | Tokens/s | Quantization | Source |
|---|---|---|---|
| H200 | 1,288 | BF16 | vLLM (verified) |
| H100 | ~1,000+ | BF16 | Google (self-reported) |
| RTX 5090 | ~700+ | NVFP4 | Google (self-reported) |
| RTX 4090 | ~200-400 | Q4_K_M | Community estimate |
⚡️ This is a fundamentally different speed tier from autoregressive models — a 26B-class model running at 700+ tokens/s on a single GPU.
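To put those numbers in wall-clock terms, here is a quick back-of-the-envelope calculation. It uses the throughput figures from the table (taking 700 and 200 as the low ends of the quoted ranges) and ignores prompt processing and model load time, so treat the results as rough estimates only.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit `num_tokens` at a steady decode rate."""
    return num_tokens / tokens_per_second


# Tokens/s taken from the table above; lower bounds where a range is given.
throughput = {
    "H200 (BF16)": 1288,
    "RTX 5090 (NVFP4)": 700,
    "RTX 4090 (Q4_K_M, low end)": 200,
}

for name, tps in throughput.items():
    print(f"{name}: {generation_seconds(2048, tps):.2f} s for 2048 tokens")
```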
3 Ways to Run It
Method 1: vLLM (Fastest)
pip install "vllm>=0.12.0"

from vllm import LLM, SamplingParams
llm = LLM(
    model="google/diffusiongemma-26B-A4B-it",
    trust_remote_code=True,
    tensor_parallel_size=1,
    max_model_len=65536,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=2048)
outputs = llm.generate(["Explain how diffusion language models work."], sampling_params)
print(outputs[0].outputs[0].text)  # first completion for the first prompt
vLLM has day-zero optimized support with dedicated block-denoising kernels. This is the gold standard for production inference.
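If you want to check the throughput figures on your own hardware, a small timing helper is enough. This is a minimal sketch: `measure_throughput` and `fake_generate` are made-up names, and the fake generator only exists so the snippet runs without a GPU. The commented adapter shows how it could be wired to the `llm` and `sampling_params` objects from the vLLM snippet above.

```python
import time


def measure_throughput(generate, prompts):
    """Return generated tokens per second for one batch.

    `generate(prompts)` must return one list of token ids per prompt.
    """
    start = time.perf_counter()
    token_lists = generate(prompts)
    elapsed = time.perf_counter() - start
    total = sum(len(ids) for ids in token_lists)
    return total / elapsed if elapsed > 0 else float("inf")


# With vLLM, the adapter could look like this:
#   generate = lambda ps: [o.outputs[0].token_ids
#                          for o in llm.generate(ps, sampling_params)]


def fake_generate(prompts):
    # Stand-in for a real model: 50 dummy tokens per prompt after a short delay.
    time.sleep(0.01)
    return [[0] * 50 for _ in prompts]


print(f"{measure_throughput(fake_generate, ['a', 'b']):.0f} tokens/s")
```

Run it with a realistic batch of prompts and more than once; the first call usually includes warm-up cost.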
Method 2: llama.cpp (For GGUF quantization)
llama.cpp support is via PR #24427 — not merged yet, but functional:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24427/head:diffusiongemma
git checkout diffusiongemma
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Then download a GGUF quantized weight and run:
./build/bin/llama-cli \
    -m DiffusionGemma-26B-A4B-it-Q4_K_M.gguf \
    -p "Explain how diffusion language models work." \
    -n 512 -ngl 99
Method 3: HuggingFace Transformers (Simplest)
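pip install "transformers>=4.55.0" accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "google/diffusiongemma-26B-A4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# The generation call depends on the model's remote code; check the model card.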
Quantization Guide
| Method | VRAM | Quality | Best For |
|---|---|---|---|
| NVFP4 | ~15GB | Near-lossless | RTX 5090 / Blackwell only |
| Bitsandbytes 4-bit | ~16GB | Good | RTX 3090/4090 |
| GGUF Q4_K_M | ~16GB | Good | llama.cpp users |
| BF16 | ~52GB | Full | H100 / A100 (80GB) |
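The VRAM column roughly follows from parameter count times bits per weight. Here is a small sketch of that arithmetic, using the 26B total from the spec list. It assumes every expert has to sit in memory even though only 3.8B parameters are active per token, and it counts weights only.

```python
def weight_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


TOTAL_PARAMS_B = 26  # total parameters, not the 3.8B active per token

print(weight_gb(TOTAL_PARAMS_B, 16))  # BF16
print(weight_gb(TOTAL_PARAMS_B, 4))   # pure 4-bit
```

The 4-bit rows in the table come out a few GB above the pure 4-bit figure. That gap is most likely quantization scales, layers kept at higher precision, and runtime buffers such as the KV cache.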
The Honest Tradeoffs
What it's great at:
- Short-form generation (1-2 paragraphs): quality competitive with Gemma 4
- Summarization and translation (140+ languages)
- Data augmentation / synthetic text: far higher throughput than autoregressive models
- Real-time interactive applications: sub-100ms latency feels instant
Where it falls short:
- Complex reasoning (math, logic, multi-hop): significantly below Gemma 4
- Long-form writing (3+ paragraphs): coherence degrades after block boundaries
- Code generation: functional for snippets, struggles with multi-file code
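The long-form weakness is easier to picture if you count how many block boundaries an output crosses. This sketch assumes the model emits consecutive fixed-size blocks of 256 tokens, which is a simplification, and `block_boundaries` is a made-up helper.

```python
BLOCK_SIZE = 256  # tokens denoised together, per the description above


def block_boundaries(num_tokens, block_size=BLOCK_SIZE):
    """Token offsets where one block ends and the next begins."""
    return list(range(block_size, num_tokens, block_size))


print(block_boundaries(200))   # short answer: fits in a single block
print(block_boundaries(1000))  # longer piece: crosses several boundaries
```

A one-paragraph answer never leaves its first block, while a 1,000-token piece has to stay coherent across three seams.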
Hardware caveat: The speed advantage requires high-compute GPUs (RTX 3090+). Apple Silicon Macs and entry-level GPUs won't see the 4x speedup — Google explicitly warns about this in their blog.
Should You Try It?
✅ Yes, if you have RTX 4090/5090 and need high-throughput generation
✅ Yes, if you're generating synthetic data at scale
✅ Yes, if you're curious about diffusion LLMs as a research direction
❌ No, if you need top-tier reasoning or code quality
❌ No, if you're on Apple Silicon (the speed advantage evaporates)
❌ No, if you need Ollama support (not available yet)
Full Guide
I built a complete deployment guide with hardware requirements, troubleshooting, and honest benchmarks at diffusiongemma.dev — 4 deployment methods with copy-paste commands.
Published 2026-06-11. Model released by Google DeepMind under Apache 2.0.