TL;DR: Google released DiffusionGemma, an open Apache 2.0 diffusion-based LLM that generates text up to 4x faster than autoregressive models, hitting 1,000+ tokens/sec on a single H100 and fitting in 18 GB VRAM. It trades some accuracy for speed. Here is what that means in practice.
What DiffusionGemma Actually Is
Google DeepMind released DiffusionGemma, the first production-grade open-weight model that applies discrete diffusion to text generation. The same family of techniques behind image generators like Stable Diffusion, now applied to language.
Instead of predicting one token at a time left-to-right, DiffusionGemma fills a 256-token block with noise and iteratively refines the entire block across multiple denoising passes until confidence thresholds are met. It commits roughly 15-20 tokens per forward pass on average, not one.
This is a fundamentally different compute pattern from everything shipping in production today.
The Numbers
| Metric | Value |
|---|---|
| Tokens/sec (H100, FP8, low batch) | 1,100+ |
| Tokens/sec (RTX 5090) | 700+ |
| Total parameters | 25.2B (marketed as 26B) |
| Active parameters at inference | 3.8B |
| MoE expert config | 8 active / 128 total |
| VRAM required (quantized) | 18 GB |
| Canvas (block) size | 256 tokens |
| Tokens committed per forward pass | ~15-20 |
| Max denoising steps | 48 |
| Context window | 256K tokens |
| License | Apache 2.0 |
For context: comparable autoregressive models on the same H100 generate roughly 200-250 tokens/sec. DiffusionGemma is up to 4x faster on throughput. The jump comes from shifting the decode bottleneck from memory bandwidth to compute.
Why the Architecture Matters
DiffusionGemma is a 26B Mixture of Experts (MoE) model built on the Gemma 4 backbone, but it replaces the autoregressive decoder with a diffusion head.
How a single generation works:
- The model initializes a 256-token block with random placeholder tokens
- It runs up to 48 denoising steps, refining all tokens simultaneously with bidirectional attention (every token attends to every other token in the block)
- Tokens that cross an entropy confidence threshold get committed to the KV cache early via adaptive stopping
- For sequences longer than 256 tokens, committed blocks are cached and the next block begins
The key difference from GPT-style models: token N can see tokens N+1 through N+256 during generation. This enables genuine self-correction across the block. Autoregressive models structurally cannot do this.
Where It Wins and Where It Does Not
Structural advantages
- Code infilling: It sees the code on both sides of the gap before generating the fill, not just the left side
- Inline document editing: Revising a paragraph in context of surrounding paragraphs
- Real-time latency-sensitive apps: 1,100 tokens/sec on H100 vs ~230 tokens/sec from a comparable autoregressive model
- Single-GPU efficiency: 3.8B active parameters means 18 GB VRAM at quantized precision, which fits on an RTX 4090 or 5090
Benchmark trade-offs vs Gemma 4 26B (autoregressive)
| Benchmark | DiffusionGemma | Gemma 4 26B |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 | 69.1% | 88.3% |
| GPQA Diamond | 73.2% | 82.3% |
| MMMU Pro (Vision) | 54.3% | 73.8% |
Google describes it as experimental. For reasoning-heavy workloads (complex math, multi-step logic, vision understanding) the autoregressive Gemma 4 is still ahead. DiffusionGemma is the right tool when latency and throughput matter more than peak accuracy.
Multi-modal capabilities
The model processes interleaved text, images (5 resolution tiers up to 1120 tokens), and video (up to 60 seconds at 1 fps). It supports OCR, chart comprehension, screen understanding, and handwriting recognition across 35+ languages, with training data covering 140+ languages.
Deploy It in 5 Minutes with vLLM
pip install vllm
vllm serve google/diffusiongemma-26B-A4B-it \
--max-model-len 262144 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.85 \
--attention-backend TRITON_ATTN \
--generation-config vllm \
--hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
--diffusion-config '{"canvas_length": 256}' \
--enable-chunked-prefill
The endpoint is OpenAI-compatible. Point your existing client at http://localhost:8000 with no other code changes needed.
Supported inference runtimes: vLLM, Hugging Face Transformers, SGLang, MLX (Apple Silicon), NVIDIA NIM containers, Google Cloud Vertex AI Model Garden.
Fine-Tuning
The ecosystem arrived fast for a day-1 release:
- Hackable Diffusion: Google's JAX-based modular research toolbox
- Hugging Face Transformers: standard PEFT/LoRA workflows
- Unsloth: memory-efficient fine-tuning
- NVIDIA NeMo: enterprise training pipelines
A published case study fine-tuned DiffusionGemma on a Sudoku dataset and improved success rate from approximately 0% to 80%. Fine-tuning can also teach the model to stop denoising early when confidence is already high, reducing inference steps further. Autoregressive models have no equivalent lever.
What to Evaluate Right Now
This week:
- [ ] Spin up the model on an H100 or RTX 4090 (18 GB VRAM quantized)
- [ ] Benchmark it on your actual latency-sensitive workload, not synthetic tasks
- [ ] Compare serving cost ($/1M tokens) against your current stack
Next sprint:
- [ ] Test code infilling quality in IDE tooling, its structural sweet spot due to bidirectional attention
- [ ] If you run real-time chat or inline editing, measure UX metrics, not just accuracy scores
- [ ] Follow Unsloth + LoRA support for DiffusionGemma, it is maturing fast
Architecture signal:
This model is built on the same Gemini Diffusion research that will likely inform future proprietary Gemini releases. If diffusion inference stabilizes at this quality level, it rewrites autoregressive serving assumptions at scale.
The Bottom Line
DiffusionGemma is not a production replacement for your current LLM stack today. Accuracy trade-offs are real and Google is transparent about the experimental status.
But the throughput numbers are genuine, the hardware requirements are accessible, and the license is Apache 2.0.
1,100 tokens per second. 18 GB VRAM. Open weights. From Google.
That combination is worth benchmarking on your actual workload this week.
Resources:
Found this useful? Follow for more signal-over-noise breakdowns of AI releases that matter.
Top comments (0)