Vigoss Luke

DiffusionGemma: Run Google's 4x Faster Diffusion LLM Locally (Full Setup Guide)

Google DeepMind just open-sourced DiffusionGemma — and it's not just another Gemma model. It's a fundamentally different way to generate text: diffusion instead of autoregression.

Here's what you need to know to run it on your own machine.

What Makes It Different

Standard LLMs (GPT, Llama, Qwen) generate text one token at a time, left-to-right. Each token depends on all previous tokens, and once it's generated, it's permanent.

DiffusionGemma works like a text version of Stable Diffusion: it fills in 256 tokens at once through iterative denoising. Every token in that block can attend to every other token. If the model loses confidence in a token mid-generation, it can go back and fix it — something autoregressive models fundamentally cannot do.
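
To make the fill-and-revise idea concrete, here is a toy sketch of confidence-based parallel decoding over one small block. It is not DiffusionGemma's actual sampler: `toy_denoiser`, the `TARGET` sentence, the confidence numbers, and the `threshold` are all hypothetical stand-ins, and a real block would be 256 tokens rather than 6.

```python
MASK = "<mask>"
TARGET = ["the", "cat", "sat", "on", "the", "mat"]  # toy "correct" output


def toy_denoiser(block):
    """Hypothetical stand-in for the model: predict every position at once.

    Returns one (token, confidence) pair per position, conditioned on how
    much of the block is already filled in.
    """
    filled = sum(tok != MASK for tok in block)
    preds = []
    for i, target in enumerate(TARGET):
        if i == 1:
            # Confidently wrong with little context, corrected once more is visible.
            preds.append(("dog", 0.85) if filled < 4 else ("cat", 0.95))
        elif i % 2 == 0:
            preds.append((target, 0.9))  # "easy" positions
        else:
            preds.append((target, min(1.0, 0.5 + 0.1 * filled)))  # need context
    return preds


def denoise_block(max_steps=8, threshold=0.8):
    """Fill a fully masked block by repeated parallel prediction."""
    block = [MASK] * len(TARGET)
    history = []
    for _ in range(max_steps):
        preds = toy_denoiser(block)
        # Every position is re-decided each step: keep confident predictions,
        # re-mask the rest. Earlier choices can therefore be overwritten.
        new_block = [tok if conf >= threshold else MASK for tok, conf in preds]
        if new_block == block:  # converged, nothing changed this step
            break
        block = new_block
        history.append(block)
    return block, history


final, history = denoise_block()
for step, state in enumerate(history, start=1):
    print(step, " ".join(state))
```

Running it prints `1 the dog sat <mask> the <mask>` and then `2 the cat sat on the mat`: four positions are filled in a single step, and the early "dog" is replaced once the rest of the block provides context. An autoregressive decoder would have been stuck with "dog".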

Tech specs:

  • 26B parameters total, 3.8B activated (Mixture of Experts)
  • 256K context window, 140+ languages
  • Apache 2.0 license — truly open, no usage restrictions
  • Multimodal: text + image + video inputs

The Speed Difference

| Hardware | Tokens/s | Quantization | Source |
| --- | --- | --- | --- |
| H200 | 1,288 | BF16 | vLLM (verified) |
| H100 | ~1,000+ | BF16 | Google (self-reported) |
| RTX 5090 | ~700+ | NVFP4 | Google (self-reported) |
| RTX 4090 | ~200-400 | Q4_K_M | Community estimate |

⚡️ This is a fundamentally different speed tier from autoregressive models — a 26B-class model running at 700+ tokens/s on a single GPU.
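
Since most of these figures are self-reported or estimated, it may be worth measuring on your own hardware. The sketch below turns a throughput number into wall-clock time and times a single generation call. The `generate` callable is a hypothetical interface (any function that takes a prompt and returns a list of generated tokens), not an API of vLLM or any other library.

```python
import time

# Throughput figures copied from the table above (tokens per second).
# The H100 and RTX 5090 entries are reported as lower bounds ("~1,000+", "~700+").
REPORTED_TPS = {"H200": 1288, "H100": 1000, "RTX 5090": 700}


def seconds_to_generate(num_tokens, tokens_per_second):
    """Wall-clock time to produce num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


def measure_tokens_per_second(generate, prompt):
    """Time one call to a hypothetical generate(prompt) -> list of tokens."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


for gpu, tps in REPORTED_TPS.items():
    print(f"{gpu}: {seconds_to_generate(2048, tps):.2f} s for 2048 tokens")
```

A single timed call includes prompt processing and warm-up, so averaging several runs after a warm-up call should give a fairer number.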

3 Ways to Run It

Method 1: vLLM (Fastest)

pip install "vllm>=0.12.0"

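from vllm import LLM, SamplingParams

# Load the instruction-tuned checkpoint on a single GPU.
llm = LLM(
    model="google/diffusiongemma-26B-A4B-it",
    trust_remote_code=True,   # allow custom model code shipped with the checkpoint
    tensor_parallel_size=1,   # number of GPUs to shard the model across
    max_model_len=65536,      # cap the context length to limit memory use
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=2048)
outputs = llm.generate(["Explain how diffusion language models work."], sampling_params)

# Each result holds one or more completions; print the first one.
for output in outputs:
    print(output.outputs[0].text)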

vLLM has day-zero optimized support with dedicated block-denoising kernels. This is the gold standard for production inference.

Method 2: llama.cpp (For GGUF quantization)

llama.cpp support is via PR #24427 — not merged yet, but functional:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24427/head:diffusiongemma
git checkout diffusiongemma
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Then download a GGUF quantized weight and run:

./build/bin/llama-cli \
  -m DiffusionGemma-26B-A4B-it-Q4_K_M.gguf \
  -p "Explain how diffusion language models work." \
  -n 512 -ngl 99
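
If you want to drive that command from a script, a minimal sketch is to build the argument list in Python. `llama_cli_args` is a hypothetical helper, not part of llama.cpp. It only mirrors the flags used in the command above, and it assumes the PR branch keeps the standard `llama-cli` options.

```python
import subprocess  # used only in the commented-out usage line below


def llama_cli_args(model_path, prompt, max_new_tokens=512, gpu_layers=99,
                   binary="./build/bin/llama-cli"):
    """Build the argv list for the llama-cli command shown above."""
    return [
        binary,
        "-m", str(model_path),       # GGUF weights file
        "-p", prompt,                # prompt text
        "-n", str(max_new_tokens),   # number of tokens to generate
        "-ngl", str(gpu_layers),     # layers to offload to the GPU
    ]


args = llama_cli_args("DiffusionGemma-26B-A4B-it-Q4_K_M.gguf",
                      "Explain how diffusion language models work.")
print(" ".join(args))
# To actually run it: subprocess.run(args, check=True)
```

Passing a list instead of a shell string avoids quoting problems when the prompt contains spaces or quotes. Depending on the build, `llama-cli` may start in interactive mode, so check its `--help` output before calling it non-interactively.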

Method 3: HuggingFace Transformers (Simplest)

pip install "transformers>=4.55.0" accelerate

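from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/diffusiongemma-26B-A4B-it"

# The tokenizer is needed to encode prompts and decode generated ids.
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # BF16 weights need roughly 52 GB
    device_map="auto",            # let accelerate place layers on available devices
    trust_remote_code=True,       # allow custom model code shipped with the checkpoint
)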

Quantization Guide

| Method | VRAM | Quality | Best For |
| --- | --- | --- | --- |
| NVFP4 | ~15 GB | Near-lossless | RTX 5090 / Blackwell only |
| Bitsandbytes 4-bit | ~16 GB | Good | RTX 3090/4090 |
| GGUF Q4_K_M | ~16 GB | Good | llama.cpp users |
| BF16 | ~52 GB | Full | H100/A100 |
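
The VRAM column can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below uses the 26B total from the spec list. Attributing the gap between the raw 4-bit figure and the ~15-16 GB in the table to quantization scales, higher-precision layers, and runtime buffers is my assumption, not a published breakdown.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# With a Mixture of Experts model, all experts normally stay resident in memory,
# so the total parameter count matters here, not the 3.8B active per token.
TOTAL_PARAMS = 26e9

print(f"BF16:  {weight_memory_gb(TOTAL_PARAMS, 16):.1f} GB")
print(f"4-bit: {weight_memory_gb(TOTAL_PARAMS, 4):.1f} GB before overhead")
```

This gives 52.0 GB for BF16, matching the table, and 13.0 GB for pure 4-bit weights. Leave additional headroom for the KV cache, which grows with context length.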

The Honest Tradeoffs

What it's great at:

  • Short-form generation (1-2 paragraphs): quality competitive with Gemma 4
  • Summarization and translation (140+ languages)
  • Data augmentation / synthetic text: throughput far above autoregressive models
  • Real-time interactive applications: sub-100ms latency feels instant

Where it falls short:

  • Complex reasoning (math, logic, multi-hop): significantly below Gemma 4
  • Long-form writing (3+ paragraphs): coherence degrades after block boundaries
  • Code generation: functional for snippets, struggles with multi-file code

Hardware caveat: The speed advantage requires high-compute GPUs (RTX 3090+). Apple Silicon Macs and entry-level GPUs won't see the 4x speedup — Google explicitly warns about this in their blog.

Should You Try It?

  • Yes, if you have an RTX 4090 or 5090 and need high-throughput generation
  • Yes, if you're generating synthetic data at scale
  • Yes, if you're curious about diffusion LLMs as a research direction

  • No, if you need top-tier reasoning or code quality
  • No, if you're on Apple Silicon (the speed advantage evaporates)
  • No, if you need Ollama support (not available yet)

Full Guide

I built a complete deployment guide with hardware requirements, troubleshooting, and honest benchmarks at diffusiongemma.dev — 4 deployment methods with copy-paste commands.


Published 2026-06-11. Model released by Google DeepMind under Apache 2.0.
