Max Quimby

Posted on Jun 11 • Originally published at computeleap.com

DiffusionGemma: Open-Weight Text at 1,000 Tokens/Sec

#ai #opensource #google #machinelearning

The same week Anthropic launched Claude Fable 5 — the most capable model on every benchmark, and the one that won't help you research frontier AI — Google DeepMind quietly shipped a different kind of milestone. DiffusionGemma is a 26-billion-parameter open-weight model that generates text the way image models generate pixels: by denoising an entire canvas of tokens in parallel, not predicting them one at a time.

📖 Read the full version with charts and embedded sources on ComputeLeap →

The result? Over 1,000 tokens per second on an H100. More than 700 on a consumer RTX 5090. Apache 2.0 licensed, no research restrictions, running in 18 GB of VRAM.

This isn't an incremental checkpoint. It's a fundamentally different architecture for text generation — and it landed at the exact moment the open-source camp needed a proof point.

How DiffusionGemma Works: From Noise to Text

Every large language model you've used — GPT, Claude, Gemma 4, Llama — generates text the same way: one token at a time, left to right. Each token depends on every token before it. This is autoregressive decoding, and it creates a hard bottleneck: no matter how fast your GPU is, you're limited by the sequential dependency chain.

DiffusionGemma breaks that chain. According to Google's developer guide, the model operates on a 256-token "canvas" that starts as random noise. Through iterative refinement passes, each token attends to every other token — including tokens that come after it — using bidirectional attention. Confident predictions lock in early; uncertain positions get renoised and refined in the next pass.

Think of it like editing a paragraph all at once rather than typing it character by character. The model sees the whole block simultaneously, making corrections everywhere in parallel until the text converges.

💡 The architecture is encoder-decoder: a causal encoder prefills the prompt into a KV cache, while a bidirectional decoder denoises the 256-token canvas. The model alternates between these modes using the same weights — no separate models needed.

The vLLM team's integration post calls out a critical innovation: self-conditioning, where the model is "conditioned on its own previous prediction" via probability-weighted embeddings rather than hard tokens. This lets it converge faster — simpler prompts and structured tasks like code need fewer denoising steps, so throughput actually scales with task complexity.

For developers used to autoregressive models, the key mental model shift is this: DiffusionGemma doesn't generate text sequentially. It generates text spatially — refining an entire block simultaneously, the way a painter works across a whole canvas rather than filling it in pixel by pixel from the top-left corner. Each 256-token block goes through multiple denoising passes. An entropy-bound sampler accepts confident predictions and renoises uncertain positions. When the entropy across all positions drops below a threshold, the block commits and the next canvas begins.

This means DiffusionGemma can reference tokens that come after the current position — something autoregressive models fundamentally cannot do. For tasks like code infilling (filling in a function body given the signature and the tests below), markdown formatting, or structured data generation, this bidirectional awareness is a structural advantage, not just a speed trick.

The Numbers: Speed Across the Stack

The headline is speed, and the numbers hold up across hardware tiers. According to NVIDIA's optimization blog and vLLM's benchmarks:

Hardware	Tokens/Sec	vs. Autoregressive
H200 (FP8)	1,288	~6× faster
H100 (FP8)	1,008	~5× faster
DGX Station	2,000	—
RTX 5090	700+	~4× faster
DGX Spark	150	—

The model is built on Gemma 4's mixture-of-experts architecture: 26B total parameters, but only 3.8B activate per step. When quantized with NVIDIA's NVFP4 format, it fits within 18 GB of VRAM — well within range of a consumer RTX 5090 or even a 4090 with careful configuration.

As AK (@_akhaliq) noted on X: "A 26B MoE multimodal model generating text via parallel diffusion, with 256K context and 1,100+ tokens/sec speed on Hopper."

The Quality Tradeoff — And Why It Might Not Matter

Here's the honest part: DiffusionGemma underperforms standard Gemma 4 on every quality benchmark. The Decoder's analysis puts it bluntly — it "runs about three and a half times faster than a same-size Gemma 4 but falls behind in every quality test." Google themselves recommend deploying standard Gemma 4 for applications that demand maximum quality.

But the Hacker News community surfaced a more nuanced take. In a thread with 99 points, the top comment — with 286 upvotes — came from user vineyardmike:

💡 "Recently I had switched to OpenCode to try out many of the Non-US-Frontier-Labs models. My unexpected favorite model to use was Mercury (a diffusion model). Not because it was 'smart' but because it was stupid fast." — vineyardmike, HN

The insight: raw speed changes the interaction pattern. When inference is fast enough, you stop treating the model as a batch oracle and start treating it as a pair-programming partner. You iterate instead of deliberating. You try five approaches instead of carefully crafting one prompt.

DiffusionGemma's sweet spot isn't replacing your frontier model for hard reasoning. It's the code infilling, the rapid iteration, the interactive editing where latency matters more than maximum intelligence — and that's a much larger surface area of daily AI usage than most people realize.

User hmate9 flagged another structural advantage: "The bidirectionality could be a big deal: being able to refine a sentence with both left and right context feels closer to how editing/thinking actually works than committing to each token forever." Autoregressive models can't unsay a token once it's generated. Diffusion models can.

As Merve Noyan (@mervenoyann) from Hugging Face noted: "DiffusionGemma is out — it's compute-bound so 4x faster compared to other Gemma-4 models (1k tok/s on H100) — also great on coding, generate and iterate on any code from 3D generation to front-end."

Where Diffusion Loses: Cloud Economics

Not everyone was bullish. User lambda offered the sharpest counterargument: "Diffusion kind of loses its benefit in hosted models... given that it also reduces accuracy, it's hard to see where you'd really want that."

The logic checks out. In high-QPS cloud serving, autoregressive models batch efficiently across requests — many users sharing the same GPU, each getting their tokens interleaved. Diffusion's advantage is per-request latency, not aggregate throughput. As the vLLM team noted, the speedup is "optimized for low-to-medium batch sizes on single accelerators."

This means DiffusionGemma's real territory is local inference and single-user workloads — the exact use case where you're running on your own hardware and paying the full cost of idle GPU cycles between tokens. That's not a weakness. That's a market.

The Diffusion LLM Landscape: Not Just Google

DiffusionGemma doesn't exist in isolation. Inception Labs' Mercury has been the diffusion LLM pioneer, with Mercury Coder hitting 1,109 tok/s on H100 and outperforming speed-optimized frontier models by up to 10×. Google's Gemini Diffusion demonstrated over 1,400 tok/s. The research direction has real momentum — BlockBatch and Fast-dLLM v2 are pushing the architectural boundaries further.

What makes DiffusionGemma significant isn't that it's the fastest — it's that it's the first major open-weight diffusion LLM with day-one framework support. On launch day, you can run it in vLLM, Hugging Face Transformers, MLX, Unsloth, NVIDIA NeMo, and SGLang. The vLLM integration is particularly notable: the team built a new ModelState abstraction that serves as a "reusable blueprint for integrating future block-diffusion models" — meaning the infrastructure now exists for every diffusion LLM that follows.

Sasha Rush (@srush_nlp) retweeted Sundar Pichai's announcement: "DiffusionGemma… up to 4x faster inference by generating entire blocks of text simultaneously." The signal from the NLP research community: this is being taken seriously.

The Timing: When the Frontier Locks Up, Open Speeds Up

Here's the context that turns DiffusionGemma from a technical curiosity into a strategic moment. Claude Fable 5 launched this same week as the most capable model available — scoring 92% on SWE-bench, topping every major benchmark. But it also shipped with hard restrictions on frontier-AI research, bio, and chemistry. The HN thread hit 2,517 points and 2,015 comments, with Paul Graham retweeting concerns about safety refusals being exploited to blind AI security scanners.

Meanwhile, the open camp shipped a model that runs on your hardware, under your control, with zero usage restrictions.

Chamath Palihapitiya crystallized the economic angle. In a post covered by Benzinga, he wrote: "The capability gap between the best open-weight/source models and the best closed models has narrowed much faster than the pricing gap — the pricing gap remains enormous."

His numbers tell the story: processing 1 billion input and output tokens per month costs roughly $105,000 on GPT-5.5 Pro, $30,000 on Claude Opus 4.8, $5,220 on DeepSeek V4 Pro, and $2,740 on DeepSeek R1. That's a 38× spread between top and bottom.

⚠️ DiffusionGemma doesn't close the quality gap to frontier. But it doesn't need to. For the vast majority of inference tasks — code completion, content iteration, interactive editing — the quality floor is already high enough. What matters is whether the speed and cost advantage compound into a different product category entirely.

Stanford's research found that local models handle 71% of daily coding tasks without reaching for the cloud. DiffusionGemma makes that 71% faster by 4×. When you combine "good enough quality" with "instant response" at zero marginal cost, you get a product experience the API model can't replicate regardless of quality — because latency is a feature, not a limitation.

How to Run DiffusionGemma Today

If you want to try it, the fastest path is through vLLM:

vllm serve google/diffusiongemma-26B-A4B-it \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --generation-config vllm \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --diffusion-config '{"canvas_length": 256}' \
  --enable-chunked-prefill

The model is available on Hugging Face with pre-quantized checkpoints for FP8 and NVFP4 via RedHatAI. For Mac users, MLX support means you can run it natively — and if you've already set up local Gemma 4 with LM Studio, the deployment story is nearly identical.

NVIDIA's day-one optimization across RTX, DGX Spark, and DGX Station means the performance numbers aren't theoretical — they're what you get out of the box with supported hardware.

For fine-tuning, Google released Hackable Diffusion — a JAX-based research toolbox — alongside training recipes. Early results are promising: supervised fine-tuning on Sudoku demonstrated 80% correctness with reduced inference steps, suggesting that task-specific tuning can close some of the quality gap while preserving the speed advantage. Unsloth and NVIDIA NeMo provide more production-oriented fine-tuning paths.

Who Should Use DiffusionGemma (And Who Shouldn't)

To be concrete about where DiffusionGemma fits today:

Use it for: Code completion and infilling. Interactive editing workflows. Rapid prototyping where you need sub-second responses. Structured text generation (markdown, JSON, config files). Local inference on a single GPU where you're paying for idle cycles. Any task where you'd rather get four decent drafts than one slightly better one.

Don't use it for: Production reasoning tasks requiring maximum accuracy. High-QPS cloud serving where autoregressive batching is more efficient. Long-form creative writing where token-level quality compounds over thousands of words. Anything where you'd currently reach for Claude Opus or GPT-5.5 — the quality tier isn't comparable, and diffusion's speed advantage disappears in batched cloud deployments.

Experiment with it for: Hybrid architectures that use DiffusionGemma for fast drafting and a frontier model for refinement. Multi-model workflows where speed on the inner loop matters more than peak intelligence. Edge deployments where 18 GB of VRAM is the constraint and autoregressive models of equivalent capability don't fit.

What This Means

DiffusionGemma is experimental. Google says so explicitly. The quality gap is real, and for hard reasoning tasks you should still use the best model available.

But the architecture is sound, the ecosystem support is unprecedented for a day-one open model, and the timing couldn't be more pointed. In a week where the frontier model told researchers they couldn't use it for frontier research, the open camp shipped a model that generates text at 1,000 tokens per second on hardware you own, under a license that lets you do whatever you want with it.

The capability gap between open and closed narrowed faster than the pricing gap. Now the speed gap is opening in the other direction.

Your margin is the open camp's opportunity.

Originally published at ComputeLeap

DEV Community