Vigoss Luke

Posted on Jun 11

DiffusionGemma: Run Google's 4x Faster Diffusion LLM Locally (Full Setup Guide)

#ai #machinelearning #opensource #tutorial

Google DeepMind just open-sourced DiffusionGemma — and it's not just another Gemma model. It's a fundamentally different way to generate text: diffusion instead of autoregression.

Here's what you need to know to run it on your own machine.

What Makes It Different

Standard LLMs (GPT, Llama, Qwen) generate text one token at a time, left-to-right. Each token depends on all previous tokens, and once it's generated, it's permanent.

DiffusionGemma works like a text version of Stable Diffusion: it fills in 256 tokens at once through iterative denoising. Every token in that block can attend to every other token. If the model loses confidence in a token mid-generation, it can go back and fix it — something autoregressive models fundamentally cannot do.
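The loop described above can be sketched in a few lines of Python. This is a toy illustration of confidence-based block denoising, not DiffusionGemma's actual algorithm: `fake_denoiser`, `denoise_block`, the linear unmasking schedule, and the random confidences are all assumptions made for the example.

```python
import math
import random

MASK = "<mask>"

def fake_denoiser(block, vocab, rng):
    """Hypothetical stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.choice(vocab), rng.random()) for _ in block]

def denoise_block(block_size, vocab, steps=4, seed=0):
    """Fill a fully masked block over several parallel refinement steps."""
    rng = random.Random(seed)
    block = [MASK] * block_size  # start from a fully masked block
    for step in range(1, steps + 1):
        # Every position is re-predicted at once, including ones filled earlier.
        guesses = fake_denoiser(block, vocab, rng)
        # Keep a growing share of the most confident positions each step.
        keep = math.ceil(block_size * step / steps)
        ranked = sorted(range(block_size), key=lambda i: guesses[i][1], reverse=True)
        confident = set(ranked[:keep])
        # Low-confidence positions go back to MASK, so earlier choices can be revised.
        block = [guesses[i][0] if i in confident else MASK for i in range(block_size)]
    return block

print(denoise_block(8, ["the", "cat", "sat", "down"]))
```

The point to notice is that nothing is final until the last step: a position filled in step 1 can be masked again or overwritten in step 2, which a left-to-right decoder cannot do once a token has been emitted.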

Tech specs:

26B parameters total, 3.8B activated (Mixture of Experts)
256K context window, 140+ languages
Apache 2.0 license: permissive, with no field-of-use restrictions
Multimodal: text + image + video inputs

The Speed Difference

Hardware	Tokens/s	Quantization	Source
H200	1,288	BF16	vLLM (verified)
H100	~1,000+	BF16	Google (self-reported)
RTX 5090	~700+	NVFP4	Google (self-reported)
RTX 4090	~200-400	Q4_K_M	Community estimate

⚡️ This is a fundamentally different speed tier from autoregressive models — a 26B-class model running at 700+ tokens/s on a single GPU.
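To make the table concrete, here is a rough back-of-the-envelope conversion from throughput to wall-clock time. It assumes steady-state generation only (no prompt processing or model load time) and uses the lower bound wherever the table gives a range or a "+" figure; the helper name `seconds_to_generate` is made up for this sketch.

```python
# Throughput figures copied from the table above (lower bounds where a range is given).
THROUGHPUT_TOK_PER_S = {
    "H200 (BF16)": 1288,
    "H100 (BF16)": 1000,
    "RTX 5090 (NVFP4)": 700,
    "RTX 4090 (Q4_K_M)": 200,
}

def seconds_to_generate(num_tokens, tokens_per_second):
    """Idealized generation time: tokens divided by sustained throughput."""
    return num_tokens / tokens_per_second

for hardware, tps in THROUGHPUT_TOK_PER_S.items():
    print(f"{hardware}: {seconds_to_generate(2048, tps):.1f} s for 2048 tokens")
```

Real numbers will vary with batch size, prompt length, and the number of denoising steps, so treat these as order-of-magnitude estimates.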

3 Ways to Run It

Method 1: vLLM (Fastest)

pip install "vllm>=0.12.0"

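from vllm import LLM, SamplingParams

# Load the instruction-tuned checkpoint on a single GPU.
llm = LLM(
    model="google/diffusiongemma-26B-A4B-it",
    trust_remote_code=True,
    tensor_parallel_size=1,  # raise this to shard across multiple GPUs
    max_model_len=65536,     # cap the context length to limit memory use
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=2048)
outputs = llm.generate(["Explain how diffusion language models work."], sampling_params)

# Each result holds one or more completions; print the first one.
print(outputs[0].outputs[0].text)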

vLLM has day-zero optimized support with dedicated block-denoising kernels. This is the gold standard for production inference.
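If you want to check throughput on your own hardware rather than rely on reported figures, a small timing harness is enough. The sketch below is backend-agnostic: `measure_tokens_per_second` is a hypothetical helper that times any callable returning per-prompt token counts, and `fake_generate` is a stand-in so the example runs without a GPU.

```python
import time

def measure_tokens_per_second(generate_fn, prompts):
    """Time one batch and return generated tokens per second.

    generate_fn(prompts) must return a list of token counts, one per prompt.
    """
    start = time.perf_counter()
    token_counts = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed

def fake_generate(prompts):
    """Stand-in backend: pretends each prompt produced 50 tokens in about 10 ms total."""
    time.sleep(0.01)
    return [50 for _ in prompts]

print(f"{measure_tokens_per_second(fake_generate, ['a', 'b']):.0f} tokens/s")

# With vLLM, the callable could look like this (assuming the llm and
# sampling_params objects from the snippet above):
#   lambda ps: [len(o.outputs[0].token_ids) for o in llm.generate(ps, sampling_params)]
```

Run a warm-up batch first and measure a second one, since the first call usually includes one-time compilation and cache setup.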

Method 2: llama.cpp (For GGUF quantization)

llama.cpp support is via PR #24427 — not merged yet, but functional:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24427/head:diffusiongemma
git checkout diffusiongemma
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Then download a GGUF quantized weight and run:

./build/bin/llama-cli \
  -m DiffusionGemma-26B-A4B-it-Q4_K_M.gguf \
  -p "Explain how diffusion language models work." \
  -n 512 -ngl 99

Method 3: HuggingFace Transformers (Simplest)

pip install "transformers>=4.55.0" accelerate

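from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/diffusiongemma-26B-A4B-it"

# Load the tokenizer and the BF16 weights; device_map="auto" spreads layers across available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Assumption: the model's remote code exposes the standard generate() interface.
inputs = tokenizer("Explain how diffusion language models work.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))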

Quantization Guide

Method	VRAM	Quality	Best For
NVFP4	~15GB	Near-lossless	RTX 5090 / Blackwell only
Bitsandbytes 4-bit	~16GB	Good	RTX 3090/4090
GGUF Q4_K_M	~16GB	Good	llama.cpp users
BF16	~52GB	Full	H100/A100
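The VRAM column follows mostly from simple arithmetic on the parameter count. The sketch below estimates weight memory only, using the 26B total from the spec list; `weight_gb` is a name made up for this example, and the result ignores activations, KV cache, and runtime overhead.

```python
TOTAL_PARAMS = 26e9  # total parameters, from the spec list

def weight_gb(bits_per_param, params=TOTAL_PARAMS):
    """Approximate weight memory in GB (1 GB = 1e9 bytes) at a given precision."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB of weights")
```

BF16 comes out at about 52 GB, matching the table. Pure 4-bit weights come out at about 13 GB; the 15-16 GB figures in the table are plausibly explained by quantization scales, tensors kept at higher precision, and runtime buffers. Note that the estimate uses the total parameter count, not the 3.8B active ones: in a Mixture of Experts model, all experts generally need to be resident in memory even though only a few are used per token.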

The Honest Tradeoffs

What it's great at:

Short-form generation (1-2 paragraphs): quality competitive with Gemma 4
Summarization and translation (140+ languages)
Data augmentation / synthetic text: throughput >> autoregressive
Real-time interactive applications: sub-100ms latency feels instant

Where it falls short:

Complex reasoning (math, logic, multi-hop): significantly below Gemma 4
Long-form writing (3+ paragraphs): coherence degrades after block boundaries
Code generation: functional for snippets, struggles with multi-file code

Hardware caveat: The speed advantage requires high-compute GPUs (RTX 3090+). Apple Silicon Macs and entry-level GPUs won't see the 4x speedup — Google explicitly warns about this in their blog.

Should You Try It?

✅ Yes, if you have RTX 4090/5090 and need high-throughput generation
✅ Yes, if you're generating synthetic data at scale
✅ Yes, if you're curious about diffusion LLMs as a research direction

❌ No, if you need top-tier reasoning or code quality
❌ No, if you're on Apple Silicon (the speed advantage evaporates)
❌ No, if you need Ollama support (not available yet)

Full Guide

I built a complete deployment guide with hardware requirements, troubleshooting, and honest benchmarks at diffusiongemma.dev — 4 deployment methods with copy-paste commands.

Published 2026-06-11. Model released by Google DeepMind under Apache 2.0.
