Google Open-Sources DiffusionGemma, 26B Model Hits 1K Tokens/Sec on H100

#ai #machinelearning #research #deeplearning

Google open-sourced DiffusionGemma, a 26B-parameter diffusion text model that Nvidia says reaches 1,000 tokens/sec on an H100, about 4x faster than autoregressive models, at the cost of lower output quality.

On June 10, Google released DiffusionGemma, a 26B-parameter open-weight model that generates text via diffusion. Nvidia claims 1,000 tokens per second on a single H100 GPU, roughly 4x faster than autoregressive models like Gemma 4.

Key facts

26 billion total parameters, ~4 billion active per token (MoE).
1,000 tokens per second claimed on a single H100 GPU.
Apache 2.0 license — fully open-weight.
Available on Hugging Face: google/diffusiongemma-26B-A4B-it.
Nvidia hosts free inference on its NIM cloud API.

Google released DiffusionGemma, a 26-billion-parameter model that generates text not token by token but through diffusion, similar to how image models turn noise into a picture. According to The Decoder and Simon Willison's blog, the model is available on Hugging Face as google/diffusiongemma-26B-A4B-it under the Apache 2.0 license, a significant departure from Google's typically more restricted model releases.

How it works and why speed matters

DiffusionGemma replaces the standard autoregressive approach (predicting one token at a time) with a continuous diffusion process that iteratively denoises a latent representation of the entire output sequence. Because every position in the sequence is refined in parallel, the model does not need one forward pass per token, and this is what enables the speedup: Nvidia claims about 1,000 tokens per second on a single H100 GPU, roughly four times faster than comparable autoregressive models. Simon Willison tested the model via Nvidia's NIM cloud API and reported 2,409 tokens generated in 4.4 seconds, or about 547 tokens/second end to end. That figure includes overhead from Python tooling, so raw inference is likely faster.
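
The throughput figures above are easy to sanity-check. The sketch below is a minimal illustration, not a benchmark: it divides tokens by wall-clock time for Willison's run and estimates how long the same 2,409 tokens would take at Nvidia's claimed rate. The 250 tokens/second baseline is an assumption derived only from the "roughly 4x faster" claim, and the function and constant names are hypothetical.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """End-to-end throughput: tokens generated divided by wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


def seconds_for(tokens: int, rate: float) -> float:
    """Time needed to emit `tokens` at a steady `rate` in tokens/second."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return tokens / rate


# Figures quoted in the article.
WILLISON_TOKENS = 2409
WILLISON_SECONDS = 4.4
CLAIMED_RATE = 1000.0  # Nvidia's claim for a single H100

# Assumption: "roughly 4x faster" implies a baseline near 250 tokens/second.
IMPLIED_BASELINE_RATE = CLAIMED_RATE / 4

measured = tokens_per_second(WILLISON_TOKENS, WILLISON_SECONDS)
at_claimed = seconds_for(WILLISON_TOKENS, CLAIMED_RATE)
at_baseline = seconds_for(WILLISON_TOKENS, IMPLIED_BASELINE_RATE)

print(f"measured end to end: {measured:.1f} tokens/s")
print(f"same output at claimed rate: {at_claimed:.2f} s")
print(f"same output at implied baseline: {at_baseline:.2f} s")
```

The measured number sits between the implied baseline and the claimed peak, which is consistent with a cloud round trip and client-side tooling adding latency on top of raw inference.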

This isn't Google's first diffusion-for-text experiment. Last May, Google briefly released an experimental Gemini Diffusion model; Willison recorded it running at 857 tokens/second at the time. That research has now returned as a fully open-weight Gemma model, suggesting Google is serious about making diffusion-based text generation a production-ready alternative.

Quality trade-off and positioning

Output quality is lower than that of comparable autoregressive models, so Google is positioning DiffusionGemma as an experimental tool for developers for now. The model is a 26B-parameter Mixture of Experts (26B-A4B), meaning only ~4B parameters are active per token, a design choice that keeps inference cheap. Nvidia is currently hosting the model for free on its NIM cloud API, lowering the barrier for developers to experiment.
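
To put the 26B-A4B split in numbers, here is a rough back-of-the-envelope sketch. The parameter counts come from the article; the 2 bytes per parameter (bf16 weights) and the 80 GB H100 capacity are assumptions for illustration, and the helper names are hypothetical. It shows that only a small share of the weights is used per token, while the full set still has to fit in memory.

```python
TOTAL_PARAMS = 26e9   # total parameters, from the article
ACTIVE_PARAMS = 4e9   # approximate active parameters per token, from the article

# Assumptions, not from the article:
BYTES_PER_PARAM_BF16 = 2  # weights stored as 16-bit floats
H100_MEMORY_GB = 80       # the 80 GB H100 variant


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
memory_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16)

print(f"active share per token: {fraction:.1%}")
print(f"weights in bf16: {memory_gb:.0f} GB")
print(f"fits on one {H100_MEMORY_GB} GB GPU: {memory_gb < H100_MEMORY_GB}")
```

Under these assumptions, roughly 15% of the weights do the work for each token, and the weights alone take about 52 GB. Activations and caches need additional memory that this estimate ignores.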

Community reaction and context

Hacker News commenters noted the strategic significance: "Google keeps flexin'. It's surprising that Gemini isn't more competitive against Claude or OpenAI models for code and agentic use, because it's clear Google still has some of the best AI people in the business." The model's speed makes it particularly relevant for on-device and near-realtime use cases — a domain where Google has invested heavily, from Gemini Nano to TPU v6e deployments.

What to watch

Watch for benchmark results on standard NLP tasks (MMLU, HellaSwag, HumanEval) as the community stress-tests DiffusionGemma against Gemma 4 and Llama 4. The key question is whether the quality gap narrows with fine-tuning or with more diffusion steps. Also watch Nvidia's NIM usage metrics: a spike in developer adoption would signal real demand for non-autoregressive architectures.