DEV Community: Gemini Team

DiffusionGemma: The Developer Guide

Gemini Team — Thu, 16 Jul 2026 14:27:09 +0000

Following our announcement in our launch blog post, we are sharing this developer guide to help you understand, serve and customize this experimental model.

Built on the Gemma 4 backbone, DiffusionGemma introduces several milestones for developer workflows:

Compute-bound parallel generation: Bypasses memory-bandwidth limitations by shifting the bottleneck to compute, delivering up to 4x faster token generation on GPUs (up to 700+ tokens per second on NVIDIA GeForce RTX 5090 and 1000+ tokens per second on a single NVIDIA H100).
Bidirectional context & self-correction: Uses bidirectional attention to evaluate the entire text block simultaneously during generation, enabling real-time error correction and parallel context propagation.
Developer-friendly sizes: Designed as a 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, allowing quantized deployment within 18 GB VRAM limits.

The Architecture

For developers building with traditional LLMs on GPUs, the primary bottleneck is memory bandwidth. Autoregressive language models must repeatedly load model weights from memory to generate text one token at a time. DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute, generating and refining a 256-token canvas in parallel. By providing the GPU with a large parallel workload, it utilizes tensor cores that would otherwise sit idle during local serving.

Uniform State Diffusion: Instead of predicting tokens sequentially, DiffusionGemma starts with a canvas of random placeholder tokens and iteratively refines them in parallel. Over multiple denoising passes, highly confident tokens help resolve adjacent positions, causing the entire sequence to snap into focus.
Block Autoregressive Diffusion for Variable Length Generation: For sequences longer than 256 tokens, once a 256-token block is fully denoised, the model processes and commits it to the KV cache. The model then transitions to the next block, initializing a fresh 256-token canvas conditioned on the previously committed history. This combines parallel block speed with the sequential stability of autoregressive models.

Showcase: Solving Sudoku with Parallel Denoising

Traditional autoregressive models struggle with strict, multivariable constrained problems like Sudoku. Because they generate text strictly from left to right, they cannot evaluate future placeholders or backtrack.

To demonstrate customization of DiffusionGemma, we are releasing a fine-tuning recipe and results using Hackable Diffusion, a modular JAX research toolbox. This training setup focuses on a classic multi-variable grid task: the Sudoku Solver.

Why Sudoku is Interesting for Diffusion

In an 81-character Sudoku string representation (where empty cells are marked with periods), every digit is bound by strict intersecting horizontal, vertical, and 9x9 grid constraints.

Bidirectional Context Propagation: Unlike autoregressive models, DiffusionGemma's denoising step allows every canvas query to attend to all positions in parallel. Information flows symmetrically across the board, resolving global dependencies in each step.

Error Correction via Re-Noising: Under Uniform State Diffusion, the model evaluates the entire board simultaneously. If confidence drops, the sampler replaces digits with random ones, allowing for continuous self-correction.
Efficient Early Stopping: Fine-tuning on Sudoku shows that adapters enhance early stopping. The SFT-tuned model stabilizes faster than the base model, allowing the engine to halt sooner, reducing latency and compute costs.

controls
loop
muted
autoplay
playsinline
width="100%"
style="border-radius: 8px; display: block; margin-bottom: 0; padding-bottom: 0px;"
aria-label="">

Your browser does not support the video tag.

Left: DiffusionGemma generating Sudoku output. The base model is unable to solve the Sudoku after 48 steps. Right: Fine-tuned (SFT) DiffusionGemma solves the puzzle after 12 steps. It is able to complete early thanks to adaptive stopping.

The Performance Impact: While the base DiffusionGemma model is not specifically trained to solve Sudoku puzzles (~0% success rate), applying the simple JAX SFT recipe on a Sudoku dataset raises correctness to 80% success, while decreasing the overall inference step count.

Block Autoregressive Denoising

To enable block autoregressive denoising, DiffusionGemma alternates between incremental prefill and denoising during inference:

Prefill / Incremental Prefill (Causal): Uses causal attention to ingest the prompt context and write to the KV cache. This runs once to prefill the initial context and then once per block to append each finalized 256-token canvas to the KV cache before proceeding to denoising the next canvas.
Denoising (Bidirectional): Uses bidirectional attention to iteratively denoise the canvas. Query tokens at any position on the canvas can attend to all other canvas tokens (as well as KV cache), letting the model process context bidirectionally.

This architectural choice makes the following possible:

Global Context Awareness: Unlike autoregressive (AR) models that only "look backward," the Denoiser's bidirectional attention allows every token on the canvas to attend to every other token. This makes the model much more effective at solving non-sequential problems, such as Sudoku, where a digit in the first cell must respect constraints in the last cell.
Self-Correction: Because the model iteratively refines the whole canvas, it can "fix" earlier mistakes. If a token's confidence drops during a pass, the sampler can re-noise and replace it. This is a capability AR models lack since they are "stuck" with a token once it is generated, especially during long output sequences.
Efficient Long-Context Scaling: The "block-autoregressive" approach allows the model to handle long sequences. It combines the parallel speed of diffusion for blocks with the proven sequential stability of AR models for long-form text.
Simplified Deployment: Using the same architecture as the Gemma 4 26B A4B model means developers only need to implement a denoising step, making it easier to integrate into existing serving frameworks like vLLM.

Serving DiffusionGemma

To serve this experimental architecture efficiently, we worked with the vLLM team to implement DiffusionGemma into vLLM. This integration allows the engine to run the iterative parallel denoising loops efficiently across batched request streams.

Developers can deploy DiffusionGemma out of the box using vLLM's standard OpenAI-compatible local server.

vllm serve google/diffusiongemma-26B-A4B-it \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --generation-config vllm \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --diffusion-config '{"canvas_length": 256}' \
  --enable-chunked-prefill

Getting Started Today

Ready to explore the frontier of non-autoregressive text generation? Take a look at the following resources to find out more:

Download the Weights: Access the weights of the experimental model (released under the Apache 2.0 license) directly on Hugging Face.
Integrate & Learn: Review the Visual Guide to DiffusionGemma to understand the mechanics of text-based diffusion models. Read more about DiffusionGemma in the Gemma documentation.
Use Your Favorite Inference Frameworks: Run the model efficiently using vLLM, Hugging Face Transformers, SGLang, and MLX.
Adapt & Fine-Tune: For rapid experimentation, we are releasing the official training recipes using Hackable Diffusion You can also explore efficient fine-tuning using Unsloth or NVIDIA NeMo.
Deploy Your Way: Instantly deploy on Google Cloud using Model Garden or via NVIDIA NIM. The model is optimized natively across the hardware stack from consumer RTX 4090 and 5090 cards to enterprise Hopper and Blackwell servers.

DiffusionGemma: 4x faster text generation

Gemini Team — Tue, 14 Jul 2026 15:06:56 +0000

Our newest open experimental model delivers up to 4x faster inference on dedicated GPUs and opens the door to exploring speed-critical, interactive local workflows.

Introducing DiffusionGemma, an experimental open model that explores text diffusion, an exceptionally fast approach to text generation. Released under an Apache 2.0 license, this 26B Mixture of Experts (MoE) model moves beyond the sequential token-by-token processing of typical autoregressive Large Language Models (LLMs). Instead, it generates entire blocks of text simultaneously, delivering up to 4x faster text generation on GPUs.

Built upon the industry-leading intelligence-per-parameter of our Gemma 4 family and cutting-edge Gemini Diffusion research, DiffusionGemma integrates a novel diffusion head designed to maximize generation speed. While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.

Unlocking new value for developers

Developers building real-time interactive AI applications often struggle with the latency bottlenecks of local inference. DiffusionGemma addresses these challenges directly, with some key trade-offs:

Blazing fast inference: By shifting the decode bottleneck from memory-bandwidth to compute, DiffusionGemma generates up to 4x faster token output on dedicated GPUs. (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on NVIDIA GeForce RTX 5090).¹
Accessible hardware footprint: Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.
Bi-directional attention: Generating 256 tokens in parallel with each forward pass allows every token to attend to all others. This provides significant advantages for non-linear domains such as in-line editing, code infilling, amino acid sequences or mathematical graphs.
Intelligent self-correction: The model iteratively refines its own output, allowing it to evaluate the entire text block at once to fix mistakes in real-time.
Experimental status & production recommendations: Because it prioritizes speed and parallel layout generation, DiffusionGemma’s overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4.

You can improve DiffusionGemma's performance on specific tasks through fine-tuning. In the example below, Unsloth fine-tuned DiffusionGemma to play Sudoku — a task autoregressive models struggle with because each token depends on future tokens. DiffusionGemma's bi-directional attention makes this much easier.

Fine-tuned DiffusionGemma solving Sudoku.

Why diffusion for text?

While the AI research community has explored diffusion-based text generation for years, applying it to large models has remained a challenge. DiffusionGemma changes this by shifting how models use hardware.

The trade-off with traditional models

Most language models act like a typewriter, generating one token at a time from left to right. In the cloud, this is efficient because servers can batch thousands of user requests together to share the hardware load. But when run locally for a single user, this word-by-word process leaves your dedicated GPU or TPU underutilized — it spends most of its time simply waiting for the next "keystroke."

DiffusionGemma reverses this inefficiency. Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.

controls
loop
muted
autoplay
playsinline
width="100%"
style="border-radius: 8px; display: block; margin-bottom: 0; padding-bottom: 0px;"
aria-label="DiffusionGemma text-to-3D SVG demo by Hugging Face. Step-by-step generation, slowed down for visualization.">

Your browser does not support the video tag.

DiffusionGemma text-to-3D SVG demo by Hugging Face. Step-by-step generation.

This means DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs. The throughput advantage is strongest at low-to-medium batch sizes on a single accelerator.

How text diffusion works

Similar to AI image generators that start with visual static and iteratively refine it into a clear picture, DiffusionGemma applies this to text:

The canvas: The model starts with a canvas of random placeholder tokens.
Iterative refinement: The model makes multiple passes, locking in correct tokens and using them as context clues to refine the rest.
Final polish: The text converges into high-quality output.

controls
loop
muted
autoplay
playsinline
width="100%"
style="border-radius: 8px; display: block; margin-bottom: 0; padding-bottom: 0px;"
aria-label="DiffusionGemma Process">

Your browser does not support the video tag.

Because the model can process the whole paragraph while generating, it unlocks new patterns of model behavior, like perfectly closing complex markdown formatting or generating and rendering code in near real-time.

Get started today

Download the weights: Access the experimental model weights (released under a permissive Apache 2.0 license) right now on Hugging Face.
Integrate & learn: Learn more in our DiffusionGemma developer guide. Or deep dive into A Visual Guide to DiffusionGemma to understand the mechanics under the hood.
Use your favorite development tools: Serve the model efficiently using MLX, vLLM (with integration supported by Red Hat), and Hugging Face Transformers. For rapid experimentation, we are releasing a fine-tuning tutorial using Hackable Diffusion, a modular JAX toolbox designed for composability. You can also explore fine-tuning with Unsloth and NVIDIA NeMo. Additionally, official support for llama.cpp is arriving soon.
Experience optimized performance: We worked with NVIDIA to optimize across their hardware stack, ensuring compatibility with consumer setups (quantized for GeForce RTX 5090 and 4090 GPUs) alongside high performance on enterprise systems (Hopper and Blackwell using advanced NVFP4 kernels), including NVIDIA DGX Spark and DGX Station for local deskside deployment, and RTX PRO for AI professionals. Native support for NVFP4 (4-bit floating-point) accelerates compute throughput, allowing the model to run at faster speeds with near-lossless accuracy.
Try your way: Run on your desktop dedicated GPU or in the cloud through Gemini Enterprise Agent Platform Model Garden or NVIDIA NIM.

Fluid, natural voice translation with Gemini 3.5 Live Translate

Gemini Team — Tue, 30 Jun 2026 15:54:14 +0000

Twenty years ago, translation at Google began as one of our pioneering machine learning experiments to turn the science of language into the magic of human connection. That experiment has come a long way with over a trillion words being translated for billions of users across our products every month.

We’re taking our next step with the release of Gemini 3.5 Live Translate, our latest audio model for live speech-to-speech translation.

The model automatically detects 70+ languages and generates smooth, natural-sounding translated speech that preserves the speakers' intonation, pacing and pitch. Unlike turn by turn systems that wait for the speaker to finish speaking before responding, 3.5 Live Translate generates speech continuously, balancing the trade-off between waiting for context to improve quality and translating immediately to stay in sync with the speaker. It delivers fluid audio without awkward pauses and stays just a few seconds behind the speaker throughout the session.

Gemini 3.5 Live Translate is rolling out across Google products:

For developers in public preview via the Gemini Live API and Google AI Studio
For enterprises in private preview starting this month in Google Meet
For everyone via Google Translate on Android and iOS

Build with 3.5 Live Translate

Gemini 3.5 Live Translate processes speech as it’s streamed, enabling a more seamless connection across languages. The model handles multilingual inputs without the need to manually configure settings. At the same time, its noise robustness ensures applications can handle loud, unpredictable environments. You can use its capabilities to help facilitate live interpretation for multilingual calls, meetings, lessons, broadcasts and more.

Watch the Gemini Live API in action, enabling dubbing and simultaneous multi-language translation. Dive into the demo or more example code in the Gemini Cookbook.

By utilizing the Gemini Live API, developer platforms like Agora, Fishjam, LiveKit, Pipecat, and Vision Agents enable developers to build and deploy voice translation apps with ease. These integrations handle the complex real-time media streaming infrastructure, so developers can focus on the user experience.

Our partners at Grab are testing the model to enable multilingual communication in near real-time between drivers and travelers at pickups. These users make over 10 million voice calls per month through Grab.

See how Grab has been testing 3.5 Live Translate to transform communication between users.

Read the early reviews

In addition to Grab, companies like CJ ENM, LiveKit and others have shared positive feedback on 3.5 Live Translate highlighting its impressive translation quality, accuracy and low latency:

"While testing Gemini 3.5 Live Translate, we’ve valued its ability to auto-detect multiple languages and translate speech accurately with low latency."
Philipp Kandal
Chief Product Office at Grab

"CJ ENM is excited to partner with Google DeepMind on 3.5 Live Translate. Early tests show promising quality for a more authentic experience for global & Korean viewers."
Bella Baek
Chief AI Officer at CJ ENM

"Gemini 3.5 Live Translate makes multilingual voice effortless. I built a demo on LiveKit Agents where everyone speaks their own language and understands each other live."
Jesse Hall
Staff Developer Advocate at LiveKit

"During our time with the 3.5 Live Translate model, we tested across several languages, and our team was blown away by the speed, accuracy, and liveliness of the model."
Nash Ramdial
Director at Vision Agents

"Gemini 3.5 Live Translate paired with Fishjam’s MoQ protocol sets a new frontier for real-time multimedia streaming, allowing speech-to-speech translation into over 70 languages."
Maciej Rys
VP of Engineering at Software Mansion

"We tested the Gemini 3.5 Live Translate model at Agora and in our opinion it provided SOTA results, with low latency and high accuracy that set a new bar for real-time translation."
Mason Adams
Developer Evangelist at Agora

Experience 3.5 Live Translate in your video meetings

Speech translation in Google Meet will soon use 3.5 Live Translate, improving the experience by:

Offering 70+ languages, an improvement from the previous limit of just five languages,
Enabling conversations across over 2000+ language combinations in one meeting, expanding from the previous state of only translating to and from English,
Updating the interface to provide instant access to speech translation.

We’re launching this update in private preview for select business Google Workspace customers starting this month, followed by a broader rollout later this year.

Google Meet participants use speech translation to communicate across English, Mandarin, and Swedish.

Get 3.5 Live Translate in the Google Translate app on Android or iOS

The model is also rolling out on the Google Translate app globally, on both Android and iOS. When using the Live translate feature, simply connect any pair of headphones to experience a more seamless translation that mirrors the speaker’s tone across 70+ languages.

For Android users, we’re also starting to roll out a new ‘listening mode’ with 3.5 Live Translate that lets you hear translations directly through your phone’s earpiece. Simply hold your phone to your ear just like a regular call, and the translated audio streams straight to you. This new experience can be helpful in situations where you want to quickly hear translations without others hearing, and you don’t have your headphones handy.

Your browser does not support the video tag. Using the new listening mode, users can hear a near real-time English translation of a guided tour in Spanish directly through their phone's earpiece.

Watermarked with SynthID

All audio generated by our models is watermarked with SynthID. This imperceptible watermark is woven directly into the audio output, ensuring AI-generated content remains detectable to help prevent misinformation. For details on our approach to safety and responsibility, review the model card.