Diffusion Language Models: How NVIDIA's Nemotron-Labs DLM Is Killing Token-by-Token Generation
Published May 25, 2026 · 18 min read
Table of Contents
- The Token-by-Token Tax — Why We Need Something Better
- Why Autoregressive Generation Is Fundamentally Memory-Bound
- What Is a Diffusion Language Model?
- The Efficient-DLM Training Trick: Converting AR Models Into DLMs
- Inside Nemotron-Labs Diffusion: Three Inference Modes
- Deploying DLMs with SGLang — A Practical Guide
- Fill-in-the-Middle and Code Infill: Why DLMs Win at Revision Tasks
- Constraint Decay in Coding Agents — And How DLMs Can Help
- Limitations and Open Questions
- Conclusion — The Autoregressive Era Has Competition
1. The Token-by-Token Tax
What if your LLM could write an entire paragraph in a single forward pass — then revise it before handing anything back to you?
That's not a speculative future. That's what NVIDIA shipped on May 23, 2026.
For the last five years, every major language model — GPT-4, Claude, Llama, Gemini, Qwen — has generated text the same way: one token at a time, left to right, forever forward, never looking back. This autoregressive (AR) paradigm has been extraordinarily powerful. It's also been a performance ceiling that the entire industry has been quietly bumping into.
The compute story is brutal if you look at it squarely: generating a 2,000-token response requires 2,000 sequential forward passes through your multi-billion-parameter model. Every single pass re-loads all those weights from memory. Your H100 — a machine with 3.35 TB/s of memory bandwidth and 989 TFLOPS of FP16 compute — spends the vast majority of its time in memory operations, not computation. You're paying for a race car and spending most of your time in the pit lane.
NVIDIA's Nemotron-Labs Diffusion is the first production-grade diffusion language model family to directly challenge this assumption at scale. Released publicly on HuggingFace on May 23, 2026, it comes in 3B, 8B, and 14B parameter variants, achieves 6.4× higher throughput than equivalent autoregressive models, and — critically — does it without sacrificing output quality. The 8B variant actually outperforms Qwen3-8B by 1.2% on average across benchmarks.
This article goes deep on how diffusion language models actually work under the hood, how Nemotron-Labs Diffusion was built on the Efficient-DLM training framework, what the three inference modes mean for your production architecture, and how to get your hands on it today.
2. Why Autoregressive Generation Is Fundamentally Memory-Bound
To appreciate what diffusion language models solve, you first need to understand exactly why autoregressive generation is slow — and why making your GPU faster doesn't fully fix it.
The KV Cache and the One-Token-at-a-Time Law
In an autoregressive transformer, generating token t+1 requires attending over all previous tokens 0..t. The standard optimization here is the KV cache: instead of recomputing the Key and Value projections for all prior tokens on every step, you cache them and only compute new K/V for the freshly generated token.
This is the right optimization — it reduces per-step compute from O(N²) to O(N). But it doesn't change the fundamental structure: you still do one forward pass, commit one token, and repeat.
The time to generate N tokens is therefore:
total_latency = N × (weight_load_time + compute_time + sampling_time)
For large models (>7B parameters), weight_load_time dominates, especially at batch size 1. An 8B parameter model has roughly 16GB of weights in FP16. At H100 memory bandwidth of 3.35 TB/s, the theoretical minimum to load all weights once is ~4.8ms. At 2,000 tokens, that's a floor of ~9.6 seconds just in memory I/O — before any actual computation.
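The memory floor quoted above can be reproduced in a few lines. This is a back-of-envelope sketch that assumes every weight byte is streamed from HBM exactly once per forward pass and ignores KV-cache reads, activations, and kernel launch overhead; the function name is illustrative.

```python
def weight_load_seconds(n_params: float, bytes_per_param: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Time to stream all model weights from memory once."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s


# 8B parameters in FP16 on an H100 (3.35 TB/s), as in the text above
per_pass = weight_load_seconds(n_params=8e9, bytes_per_param=2,
                               bandwidth_bytes_per_s=3.35e12)
floor_2000 = 2000 * per_pass  # one forward pass per generated token

print(f"per forward pass: {per_pass * 1e3:.2f} ms")  # 4.78 ms
print(f"2,000 tokens    : {floor_2000:.2f} s")       # 9.55 s
```

Halving bytes_per_param (for example with 8-bit weights) halves this floor, which is why quantization helps decode latency even when compute is nowhere near saturated.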
The Roofline Analysis
The roofline model is the cleanest way to visualize this. Every GPU has two performance limits:
- Compute-bound roof: Determined by peak FP16/BF16 TFLOPS
- Memory-bound roof: Determined by peak memory bandwidth × arithmetic intensity
For a 7B model forward pass generating a single token, the arithmetic intensity (FLOPs per byte accessed) is approximately:
Arithmetic Intensity = (2 × N_params × batch_size) / (2 × N_params × bytes_per_param)
= batch_size / bytes_per_param
≈ 1 / 2 (batch_size=1, BF16=2 bytes)
= 0.5 FLOP/byte
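The H100's ridge point (the crossover between memory-bound and compute-bound) is roughly 300 FLOP/byte (989 TFLOPS ÷ 3.35 TB/s ≈ 295). At 0.5 FLOP/byte, you reach less than 0.2% of peak compute. These are order-of-magnitude figures: counting each weight as read once per pass and each multiply-add as two FLOPs gives 1 FLOP/byte instead, which leaves the conclusion unchanged. You are almost entirely memory-bound, leaving nearly all of the GPU's arithmetic capacity idle.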
Why Batching Helps — But Has Limits
Batching is the standard answer: run more requests concurrently to increase arithmetic intensity. At batch size 128:
Arithmetic Intensity ≈ 128 / 2 = 64 FLOP/byte
Better — but still 5× below the ridge point. And in production latency-sensitive scenarios (chat, copilots, real-time agents), you often can't batch aggressively. Individual users don't want to wait for 127 other requests to fill a batch.
This is the performance trap that diffusion language models are designed to escape.
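The roofline reasoning in this section can be sketched as a small calculation. The peak figures are the H100 numbers quoted earlier, and the intensities passed in are the back-of-envelope values from this section; a different FLOP-counting convention shifts them by about 2× but not the regime they fall in. Function names are illustrative.

```python
PEAK_FLOPS = 989e12       # H100 dense FP16 peak, FLOP/s
PEAK_BANDWIDTH = 3.35e12  # H100 memory bandwidth, bytes/s


def ridge_point() -> float:
    """Arithmetic intensity at which the two roofs meet."""
    return PEAK_FLOPS / PEAK_BANDWIDTH


def attainable_flops(intensity: float) -> float:
    """Roofline: performance is capped by whichever roof is lower."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


for label, intensity in [("batch 1", 0.5), ("batch 128", 64.0)]:
    fraction = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{label:>9}: {intensity:5.1f} FLOP/byte -> {fraction:.2%} of peak")
# batch 1 -> 0.17% of peak, batch 128 -> 21.68% of peak

print(f"ridge point: {ridge_point():.0f} FLOP/byte")  # 295
```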
3. What Is a Diffusion Language Model?
Diffusion models were first popularized for image generation (DDPM, Stable Diffusion, DALL-E). The core idea: instead of generating output autoregressively, start from pure noise and iteratively denoise toward a clean sample. The insight that DLMs bring to text generation is applying this same iterative refinement paradigm to sequences of discrete tokens.
From Gaussian Noise to Masked Token Diffusion
Image diffusion operates in continuous space: add Gaussian noise gradually, train a neural network to reverse that process. Text tokens are discrete — you can't add Gaussian noise to the word "transformer." The solution is absorbing diffusion (also called masked diffusion): rather than adding Gaussian noise, tokens are progressively masked (replaced with a special [MASK] token), and the model learns to unmask them. This is distinct from image diffusion — there is no noise distribution over real values, only a binary clean-or-masked state for each token position.
The forward (corruption) process replaces tokens with [MASK] with probability proportional to a noise schedule. At maximum corruption, the entire sequence is masked. The reverse (generation) process starts from a fully masked sequence and iteratively fills it in.
Formally, the model parameterises the conditional distribution:
p_θ(x₀ | xₜ)
Where x₀ is the clean token sequence, xₜ is the corrupted sequence at timestep t, and the model predicts the original clean tokens for all masked positions simultaneously — that "simultaneously" is the entire game.
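As a minimal sketch of the forward (corruption) process described above, the snippet below masks each token independently with probability t. Treating the timestep t directly as the mask probability is an assumption made for illustration; real schedules may differ, and the function names are hypothetical.

```python
import random

MASK = "[MASK]"


def corrupt(tokens: list[str], t: float, rng: random.Random) -> list[str]:
    """Forward process: replace each token with [MASK] with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_positions(corrupted: list[str]) -> list[int]:
    """Positions the model has to predict, all in the same forward pass."""
    return [i for i, tok in enumerate(corrupted) if tok == MASK]


rng = random.Random(0)
clean = "the model predicts every masked token at once".split()
for t in (0.0, 0.5, 1.0):
    noisy = corrupt(clean, t, rng)
    print(f"t={t}: {noisy} -> predict positions {masked_positions(noisy)}")
```

At t=0 nothing is masked and at t=1 the whole sequence is masked, which is the state generation starts from.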
Block-Level Generation in Practice
Modern DLMs like Nemotron-Labs Diffusion operate at the block level rather than the full-sequence level. The model generates output in 32-token blocks. For each block:
- Initialize the block with [MASK] tokens
- Run a forward pass predicting all 32 token positions simultaneously
- Accept high-confidence predictions (above threshold τ) and re-mask low-confidence ones
- Repeat the denoising loop for K steps until the block stabilises
- Commit the block and advance to the next
This has a critical GPU efficiency property: instead of one matrix multiply per token, you do one matrix multiply per block of 32 tokens. The effective arithmetic intensity scales with block size:
Arithmetic Intensity (DLM) ≈ (batch_size × block_size) / bytes_per_param
= (1 × 32) / 2
= 16 FLOP/byte
That's 32× the arithmetic intensity of single-token AR generation — 32× closer to the compute-bound regime. This is where the throughput gains come from.
The Bidirectional Advantage
There's a second, often underappreciated advantage of DLMs: they generate bidirectionally within each block. An AR model has a hard causal constraint: the token at position i can only attend to positions ≤ i, all of which are already committed by the time it is generated. This means:
- The model cannot revise a previously committed token, ever
- Fill-in-the-middle (generating tokens given left and right context) requires special training hacks
- Once an error propagates, it compounds forward indefinitely
DLMs have no such constraint within their generation window. Attention within each block is fully bidirectional — every token attends to all other positions in the block simultaneously. This means the model makes all 32 decisions with full awareness of its other decisions, and updates any of them in subsequent denoising steps. Errors are corrected before they're committed.
4. The Efficient-DLM Training Trick: Converting AR Models Into DLMs
The biggest historical barrier to DLMs wasn't architectural — it was training efficiency. DLMs trained from scratch lagged significantly behind AR models of equivalent parameter count. The research breakthrough that unblocked Nemotron-Labs Diffusion is the Efficient-DLM framework (Fu et al., arXiv:2512.14067).
The key insight: don't train from scratch. Convert a pretrained AR model into a DLM.
Why AR-to-DLM Conversion Works
Pretrained AR models have already learned rich linguistic representations: grammar, facts, reasoning patterns, code structure. The AR training objective shapes the weight space in a way that's compatible with DLM objectives, because both ultimately require modelling p(token | context). The weight distributions learned under AR training are close to what a DLM needs.
The conversion proceeds through continued pretraining on the pretrained AR checkpoint, adding a diffusion training objective without discarding AR capability.
Block-Wise Attention: The Critical Design Choice
Efficient-DLM found that the attention pattern used during conversion is the most consequential design decision. Two options exist:
- Fully bidirectional attention — every token attends to every other token (like BERT)
- Block-wise attention — causal across blocks, bidirectional within blocks
Fully bidirectional attention diverges from the causal AR weight distribution, causing significant accuracy regression. Block-wise attention maintains causal structure at the block boundary level while enabling the intra-block bidirectionality needed for parallel generation — and it remains KV-cache-compatible.
# Simplified illustration of block-wise attention mask construction
import torch


def build_block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Build a block-wise attention mask where:
    - Tokens within the same block attend bidirectionally to each other
    - Blocks attend causally to prior blocks (no future block leakage)
    Returns a boolean mask of shape (seq_len, seq_len)
    where True = attend, False = mask out
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Ceiling division so a trailing partial block is not left fully masked
    num_blocks = (seq_len + block_size - 1) // block_size
    for block_idx in range(num_blocks):
        block_start = block_idx * block_size
        block_end = min(block_start + block_size, seq_len)
        # Within-block: full bidirectional attention
        mask[block_start:block_end, block_start:block_end] = True
        # Cross-block: causal, current block attends to all previous blocks
        if block_idx > 0:
            mask[block_start:block_end, :block_start] = True
    return mask


# Visualise an 8-token sequence with block_size=4
mask = build_block_wise_attention_mask(seq_len=8, block_size=4)
print("Block-wise attention mask (1 = attend, 0 = masked):")
print(mask.int())
# Block 0: bidirectional within itself only
# Block 1: bidirectional within itself + sees all of Block 0
Position-Dependent Token Masking
The second key innovation in Efficient-DLM is position-dependent token masking during training. In naive masked diffusion, tokens are masked uniformly at random. But at inference time, the masking pattern is left-to-right — earlier positions are already committed while later positions remain masked.
Efficient-DLM fixes this train-test mismatch by assigning higher masking probabilities to later positions during training, closely matching the left-to-right test-time pattern:
import numpy as np


def position_dependent_mask_rate(
    seq_len: int,
    base_rate: float = 0.15,
    position_scale: float = 2.0,
) -> np.ndarray:
    """
    Compute per-position masking probability that increases toward end of sequence.
    Earlier tokens (already committed) get lower mask rates.
    Later tokens (not yet generated) get higher mask rates.
    Args:
        seq_len: Total sequence length
        base_rate: Minimum masking probability (applied to first token)
        position_scale: Multiplier controlling how much the masking rate grows toward the end
    Returns:
        Array of shape (seq_len,) with per-position mask probabilities
    """
    positions = np.linspace(0, 1, seq_len)
    mask_rates = base_rate * (1 + position_scale * positions)
    return np.clip(mask_rates, 0.0, 1.0)


rates = position_dependent_mask_rate(seq_len=32)
print(f"Token 0 mask rate: {rates[0]:.3f}")    # 0.150: low, likely already committed
print(f"Token 15 mask rate: {rates[15]:.3f}")  # 0.295: mid
print(f"Token 31 mask rate: {rates[31]:.3f}")  # 0.450: high, likely still masked
The Joint Training Objective
The final piece is the loss function, which combines AR and diffusion objectives:
L_total = α × L_AR + (1 - α) × L_diffusion
- L_AR: Standard cross-entropy next-token prediction (causal, left-to-right)
- L_diffusion: Masked token prediction loss (bidirectional within blocks, over all masked positions simultaneously)
- α: Typically 0.2–0.3, balancing AR capability preservation against diffusion capability acquisition
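To make the weighting concrete, here is a toy calculation of the joint objective. It assumes both terms are mean cross-entropies and takes as input the probability the model assigned to the correct token at each scored position; the function names and the numbers are purely illustrative.

```python
import math


def cross_entropy(probs_of_correct_token: list[float]) -> float:
    """Mean negative log-likelihood over the scored positions."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)


def joint_loss(ar_probs: list[float], diffusion_probs: list[float], alpha: float) -> float:
    """L_total = alpha * L_AR + (1 - alpha) * L_diffusion."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * cross_entropy(ar_probs) + (1.0 - alpha) * cross_entropy(diffusion_probs)


# AR loss is scored at every next-token position; diffusion loss only at masked positions.
ar_probs = [0.5, 0.25]
diffusion_probs = [0.5, 0.5, 0.5, 0.5]
print(f"L_total = {joint_loss(ar_probs, diffusion_probs, alpha=0.25):.4f}")  # 0.7798
```

With α = 0.25 the diffusion term carries three quarters of the gradient signal, which matches the intent of acquiring the new capability while keeping the AR behaviour anchored.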
NVIDIA trained Nemotron-Labs Diffusion on 1.3 trillion tokens from the NVIDIA Nemotron Pretraining datasets, followed by supervised fine-tuning on 45 billion tokens from the NVIDIA Nemotron Post-training datasets using this framework.
5. Inside Nemotron-Labs Diffusion: Three Inference Modes
The most developer-friendly aspect of Nemotron-Labs Diffusion is that it ships as a single model checkpoint with three distinct inference personalities, selectable at deployment time with zero application-level changes.
Mode 1: Autoregressive (Baseline Compatibility)
Plain AR mode generates tokens left-to-right exactly like any standard causal LM. This mode exists for correctness validation, backward compatibility with existing AR pipelines, and A/B testing. The model's AR output quality is fully preserved by the joint training objective — it behaves identically to a native AR model.
Mode 2: FastDiffuser (Pure Diffusion Decoding)
FastDiffuser is the headline throughput mode, operating in a block-by-block denoising loop:
- Initialise a 32-token block with [MASK] tokens
- Run a full forward pass predicting all 32 positions simultaneously
- Accept tokens above confidence threshold τ (typically 0.9); re-mask the rest
- Repeat the denoising loop for up to K steps (typically 10–20)
- Force-commit remaining masked tokens after K steps; advance to the next block
Throughput: 2.6× higher tokens-per-forward-pass (TPF) vs. AR baseline.
The confidence threshold τ is a quality-speed lever: higher τ means more re-masking iterations (better quality, slower); lower τ means fewer iterations needed (faster, slightly lower quality).
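The denoising loop and the effect of τ can be sketched with a stand-in predictor. Everything here is a toy under stated assumptions: toy_predict is a hypothetical stub whose confidence rises as more of the block is filled in, accepted tokens are frozen rather than revisited, and the remaining positions are force-committed from the last prediction once the step budget runs out.

```python
from typing import Callable, Optional

Prediction = tuple[str, float]  # (token, confidence)


def denoise_block(
    predict: Callable[[list[Optional[str]]], list[Prediction]],
    block_size: int,
    tau: float,
    max_steps: int,
) -> tuple[list[str], int]:
    """Return the committed block and the number of forward passes used."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    block: list[Optional[str]] = [None] * block_size  # None stands in for [MASK]
    preds: list[Prediction] = []
    steps = 0
    while steps < max_steps and any(tok is None for tok in block):
        preds = predict(block)  # one forward pass over the whole block
        steps += 1
        for i, (tok, conf) in enumerate(preds):
            if block[i] is None and conf >= tau:
                block[i] = tok  # accept; low-confidence positions stay masked
    # Step budget exhausted: force-commit whatever is still masked.
    final = [tok if tok is not None else preds[i][0] for i, tok in enumerate(block)]
    return final, steps


TARGET = ["def", "add", "(", "a", ",", "b", ")", ":"]
BASE_CONF = [0.95, 0.60, 0.92, 0.55, 0.91, 0.50, 0.93, 0.97]


def toy_predict(block: list[Optional[str]]) -> list[Prediction]:
    """Hypothetical model stub: more filled-in context means higher confidence."""
    filled = sum(tok is not None for tok in block) / len(block)
    return [(TARGET[i], min(1.0, BASE_CONF[i] + 0.5 * filled)) for i in range(len(block))]


for tau in (0.8, 0.9, 0.99):
    tokens, steps = denoise_block(toy_predict, block_size=8, tau=tau, max_steps=10)
    print(f"tau={tau}: {steps} forward passes")
# tau=0.8 -> 2 passes, tau=0.9 -> 4 passes, tau=0.99 -> 10 passes (budget hit)
```

In this toy, raising τ from 0.8 to 0.9 doubles the number of forward passes, and a threshold the model never reaches burns the whole step budget before force-committing.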
Mode 3: Self-Speculation (LinearSpec / QuadSpec)
Self-speculation is where Nemotron-Labs Diffusion gets truly elegant. The same model plays two roles simultaneously: drafter (diffusion head) and verifier (AR head).
The draft-verify loop:
- Use the diffusion head to draft a candidate block of tokens bidirectionally (fast)
- Run the AR head causally over the proposed block to compute acceptance probabilities
- Accept the longest valid prefix where AR agrees with the diffusion draft
- Advance by the number of accepted tokens — often the entire block — in one round trip
This is speculative decoding with the model acting as its own drafter: at temperature=0 the output is identical to what pure AR greedy decoding would produce. Quality is preserved by construction.
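The acceptance rule in the draft-verify loop can be sketched for the greedy (temperature 0) case. This is an illustration, not the shipped implementation: ar_next_token is a hypothetical stand-in for the causal verification pass, it is called once per position here for readability (a real verifier would score the whole draft in one forward pass), and appending the verifier's token at the first mismatch is the usual speculative-decoding convention, assumed rather than confirmed for this model.

```python
from typing import Callable, Sequence


def accept_draft(
    context: list[str],
    draft: Sequence[str],
    ar_next_token: Callable[[list[str]], str],
) -> list[str]:
    """Keep the longest draft prefix the AR pass agrees with.

    At the first disagreement the AR token is emitted instead, so every
    round advances by at least one token.
    """
    accepted: list[str] = []
    for proposed in draft:
        expected = ar_next_token(context + accepted)
        if proposed != expected:
            accepted.append(expected)  # correction from the verifier
            return accepted
        accepted.append(proposed)
    return accepted


REFERENCE = ["return", "sorted", "(", "a", "+", "b", ")"]


def toy_ar(prefix: list[str]) -> str:
    """Hypothetical greedy AR model that always continues REFERENCE."""
    return REFERENCE[len(prefix)]


print(accept_draft([], ["return", "sorted", "(", "a"], toy_ar))  # whole draft accepted
print(accept_draft([], ["return", "sorted", "[", "a"], toy_ar))  # ['return', 'sorted', '(']
```

Because every emitted token is one the AR pass would have chosen anyway, the final sequence matches greedy AR decoding regardless of how good the draft was; draft quality only affects speed.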
Two variants:
- LinearSpec: ~4× AR baseline throughput
- QuadSpec: ~6.4× AR baseline throughput — ~865 tokens/second on a single NVIDIA B200
6. Deploying DLMs with SGLang — A Practical Guide
Nemotron-Labs Diffusion is integrated into SGLang via PR #25803. Here is a complete deployment walkthrough.
Installation
# Install SGLang with DLM support
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/
# Or install from the DLM PR branch directly
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/25803/head:diffusion-support
git checkout diffusion-support
pip install -e ".[all]"
# Download the model
huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B-Instruct \
--local-dir ./nemotron-diffusion-8b
Launching the Server
# FastDiffuser mode — maximum raw throughput
python -m sglang.launch_server \
--model-path ./nemotron-diffusion-8b \
--port 30000 \
--tp 1 \
--trust-remote-code \
--generation-mode fastdiffuser \
--diffusion-block-size 32 \
--diffusion-steps 10 \
--confidence-threshold 0.9
# Self-Speculation mode — lossless + maximum throughput
python -m sglang.launch_server \
--model-path ./nemotron-diffusion-8b \
--port 30000 \
--tp 1 \
--trust-remote-code \
--generation-mode self-speculation \
--spec-variant quadspec \
--draft-block-size 32
Running Inference Across All Three Modes
import time
from typing import Literal

import requests

SGLANG_BASE_URL = "http://localhost:30000"


def generate(
    prompt: str,
    mode: Literal["ar", "fastdiffuser", "self-speculation"] = "self-speculation",
    max_tokens: int = 512,
    temperature: float = 0.0,
) -> dict:
    """
    Send a generation request to SGLang serving Nemotron-Labs Diffusion.
    Args:
        prompt: Input text prompt
        mode: Inference mode: 'ar', 'fastdiffuser', or 'self-speculation'
        max_tokens: Maximum tokens to generate
        temperature: 0.0 = greedy/deterministic (self-speculation is lossless at T=0)
    Returns:
        Dict with text output, token count, elapsed time, and throughput
    """
    payload = {
        "model": "nemotron-diffusion-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        # NOTE: "extra_body" is an OpenAI-SDK keyword argument; over raw HTTP the
        # server may expect "generation_mode" at the top level of the JSON body
        # instead. Check the PR for the field the server actually reads.
        "extra_body": {"generation_mode": mode},
    }
    start = time.perf_counter()
    response = requests.post(
        f"{SGLANG_BASE_URL}/v1/chat/completions",
        json=payload,
        timeout=120,
    )
    elapsed = time.perf_counter() - start
    response.raise_for_status()
    data = response.json()
    output_text = data["choices"][0]["message"]["content"]
    tokens_generated = data["usage"]["completion_tokens"]
    return {
        "text": output_text,
        "tokens_generated": tokens_generated,
        "time_seconds": round(elapsed, 3),
        "throughput_tps": round(tokens_generated / elapsed, 1),
    }


# Benchmark all three modes on the same prompt
PROMPT = """
Write a Python function implementing binary search on a sorted list.
Include type annotations, docstring, and edge case handling.
"""

for mode in ["ar", "fastdiffuser", "self-speculation"]:
    result = generate(PROMPT, mode=mode, max_tokens=300)
    print(f"\n[{mode.upper()}]")
    print(f" Throughput : {result['throughput_tps']} tok/s")
    print(f" Time : {result['time_seconds']}s")
    print(f" Preview : {result['text'][:100]}...")
Expected Results on NVIDIA B200 (8B, batch size 1)
[AR]
Throughput : 136.3 tok/s
Time : 2.187s
[FASTDIFFUSER]
Throughput : 353.7 tok/s # 2.6× faster; quality depends on τ
Time : 0.843s
[SELF-SPECULATION]
Throughput : 866.3 tok/s # 6.4× faster, lossless at T=0
Time : 0.344s
7. Fill-in-the-Middle and Code Infill: Why DLMs Win at Revision Tasks
One of the most practically valuable capabilities of DLMs is fill-in-the-middle (FIM) generation — producing text conditioned on both a prefix (left context) and a suffix (right context). This is critical for code completion tools, where you want to fill in a function body given the signature above and the call site below.
Why AR Models Struggle with FIM
AR models can be trained on FIM using sentinel tokens (the PSM format used in CodeLlama: <PRE> {prefix} <SUF> {suffix} <MID>). This works but the model still generates strictly left-to-right — it cannot simultaneously attend to both sides when predicting each middle token. Accuracy degrades as the infill gap widens.
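For reference, the sentinel-token layout mentioned above can be assembled like this. The sketch follows the format exactly as written in the previous paragraph; the precise sentinel strings and whitespace are tokenizer-specific, so treat the helper as illustrative rather than as the exact prompt any given model expects.

```python
def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Prefix-Suffix-Middle layout: the model continues after <MID>."""
    return f"<PRE> {prefix} <SUF> {suffix} <MID>"


prompt = build_psm_prompt(
    prefix="def add(a, b):",
    suffix="assert add(1, 2) == 3",
)
print(prompt)
# <PRE> def add(a, b): <SUF> assert add(1, 2) == 3 <MID>
```

The suffix is moved in front of the generation point so that a left-to-right decoder can condition on it; the middle is still produced one token at a time after <MID>.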
DLMs Do FIM Natively
A DLM with block-wise attention genuinely attends to the committed prefix blocks and a right-side suffix context simultaneously. The infill block is initialised as masked and denoised with full bilateral awareness:
import requests


def dlm_fill_in_middle(
    prefix: str,
    suffix: str,
    fill_length: int = 64,
    sglang_url: str = "http://localhost:30000",
) -> str:
    """
    Use Nemotron-Labs Diffusion for native fill-in-the-middle code generation.
    The DLM attends to both prefix and suffix simultaneously during denoising,
    which a strictly left-to-right decoder cannot do while it generates.
    Args:
        prefix: Code appearing before the section to fill
        suffix: Code appearing after the section to fill
        fill_length: Approximate number of tokens to generate
        sglang_url: SGLang server endpoint
    Returns:
        Generated middle section
    """
    payload = {
        "model": "nemotron-diffusion-8b",
        "prompt": prefix,
        "suffix": suffix,  # Native bilateral conditioning
        "max_tokens": fill_length,
        "temperature": 0.0,
        "extra_body": {
            "generation_mode": "self-speculation",
            "fim_mode": True,  # Enables bidirectional suffix conditioning
        },
    }
    response = requests.post(f"{sglang_url}/v1/completions", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]


# Example: Generate a function body given signature above and assertions below
prefix = '''
def merge_sorted_arrays(arr1: list[int], arr2: list[int]) -> list[int]:
    """Merge two sorted arrays into a single sorted array in O(n+m) time."""
'''
suffix = '''
# Downstream call site: the DLM sees this during generation
result = merge_sorted_arrays([1, 3, 5, 7], [2, 4, 6, 8])
assert result == [1, 2, 3, 4, 5, 6, 7, 8]
assert merge_sorted_arrays([], [1, 2]) == [1, 2]
assert merge_sorted_arrays([5], []) == [5]
'''
body = dlm_fill_in_middle(prefix, suffix)
print(f"Generated body:\n{body}")
The DLM sees both the function signature (left) and the assertion-based call site (right) simultaneously, generating a body that satisfies both sides. Pure AR generation cannot do this without special reordering tricks.
8. Constraint Decay in Coding Agents — And How DLMs Can Help
A concurrent paper trending heavily on Hacker News this week (190+ upvotes) makes a directly relevant finding: coding agents lose ~30 percentage points in assertion pass rate as structural constraints accumulate in multi-file backend generation tasks (Papotti et al., arXiv:2605.06445).
What Constraint Decay Looks Like
The study evaluated LLM-based coding agents across 80 greenfield backend generation tasks and 20 feature implementation tasks spanning 8 web frameworks (Flask, FastAPI, Django, etc.). Key findings:
- At baseline (minimal constraints): agents achieve ~75–80% assertion pass rate
- As constraints accumulate (ORM schemas, architectural patterns, specific database relationships): performance drops to ~45–50% for strong models, approaching zero for weaker ones
- Root causes are primarily data-layer defects: incorrect ORM query composition and runtime violations
This is characteristic of AR generation: early structural decisions compound into downstream violations the model cannot revise. A wrong ORM relationship in migration file 1 propagates broken query patterns through files 3, 7, and 12.
DLMs as a Structural Fix
DLMs offer two architectural responses:
1. Constraint-Aware Iterative Refinement
Because DLMs revise tokens within the block before committing them, a constraint-checking oracle inserted between denoising steps can re-mask tokens that would violate structural requirements:
from dataclasses import dataclass
from typing import Callable

import requests


@dataclass
class ConstraintOracle:
    """
    Evaluates structural constraints on partially-generated code blocks.
    Returns a per-token validity mask to guide re-denoising.
    """
    check_fn: Callable[[str], list[bool]]

    def evaluate(self, context: str, proposed_tokens: list[str]) -> list[bool]:
        """
        Returns True for each token position that should be committed,
        False for positions that violate constraints and should be re-masked.
        """
        return self.check_fn(context + "".join(proposed_tokens))


def constrained_dlm_generation(
    prompt: str,
    constraint_oracle: ConstraintOracle,
    max_tokens: int = 512,
    max_refinement_rounds: int = 5,
    sglang_url: str = "http://localhost:30000",
) -> str:
    """
    DLM-based code generation with structural constraint enforcement.
    Between denoising steps, the constraint oracle flags violating tokens
    for re-masking and re-denoising, targeting the constraint decay
    failure mode documented in arXiv:2605.06445.
    NOTE: this sketch only requests refinement rounds from the server.
    `constraint_oracle` is not wired into the request, because the oracle
    would have to run inside the server's denoising loop.
    """
    payload = {
        "model": "nemotron-diffusion-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.1,
        "extra_body": {
            "generation_mode": "fastdiffuser",
            "constraint_refinement_rounds": max_refinement_rounds,
        },
    }
    response = requests.post(
        f"{sglang_url}/v1/chat/completions", json=payload, timeout=120
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
2. FIM-Based Surgical Patching
When an agent detects a structural violation in a previously generated file, it can use DLM FIM to surgically patch the violation — filling in a corrected section with both the surrounding correct code as prefix and the downstream dependent code as suffix. AR models must regenerate everything from the violation point forward; DLMs can fill in the fix with bilateral awareness of all surrounding context.
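The bookkeeping side of surgical patching is simple to sketch: cut the offending lines out, keep everything before and after as infill context, then splice the replacement back in. The helper names are hypothetical, and the replacement line is hard-coded here; in an agent it would come from an infill request conditioned on the prefix and suffix.

```python
def split_for_patch(source: str, first_line: int, last_line: int) -> tuple[str, str, str]:
    """Split `source` around a 1-indexed, inclusive line range.

    Returns (prefix, removed, suffix); prefix and suffix are the context
    an infill request would condition on.
    """
    lines = source.splitlines(keepends=True)
    if not 1 <= first_line <= last_line <= len(lines):
        raise ValueError("line range out of bounds")
    prefix = "".join(lines[: first_line - 1])
    removed = "".join(lines[first_line - 1 : last_line])
    suffix = "".join(lines[last_line:])
    return prefix, removed, suffix


def apply_patch(prefix: str, replacement: str, suffix: str) -> str:
    """Splice the regenerated section between the untouched prefix and suffix."""
    if replacement and not replacement.endswith("\n"):
        replacement += "\n"
    return prefix + replacement + suffix


source = (
    "class User(Base):\n"
    "    id = Column(Integer)\n"
    "    posts = relationship('Comment')\n"
    "\n"
    "def user_posts(user):\n"
    "    return user.posts\n"
)
prefix, removed, suffix = split_for_patch(source, first_line=3, last_line=3)
fixed = apply_patch(prefix, "    posts = relationship('Post')", suffix)
print(fixed)
```

Only line 3 changes; the downstream function that depends on the relationship is part of the suffix and is left untouched.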
9. Limitations and Open Questions
Diffusion language models are not a drop-in replacement for AR in all scenarios. Here's what developers should weigh carefully:
Latency at Small Batch Size and Short Outputs
The throughput gains of DLMs are most pronounced at high batch sizes or long output sequences. For very short outputs (<64 tokens) at batch size 1, the block-denoising overhead can narrow the advantage:
DLM_latency ≈ (N_tokens / block_size) × K_denoising_steps × forward_pass_time
(where forward_pass_time includes full model weight load + compute)
For latency-critical single-query applications where first-token latency matters more than throughput (interactive chat interfaces), the AR or hybrid self-speculation modes may still be preferable. Self-speculation in particular delivers strong latency and throughput benefits simultaneously.
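Plugging numbers into the latency formula above shows where the advantage narrows. This sketch counts forward passes only and makes two simplifying assumptions: every block uses the full K denoising steps (a worst case, since blocks can stabilise early), and a DLM forward pass costs about the same as an AR one because both are dominated by weight loading.

```python
import math


def ar_forward_passes(n_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per token."""
    return n_tokens


def dlm_forward_passes(n_tokens: int, block_size: int = 32, denoise_steps: int = 10) -> int:
    """Block diffusion, worst case: every block uses all denoising steps."""
    return math.ceil(n_tokens / block_size) * denoise_steps


for n in (16, 64, 2000):
    ar = ar_forward_passes(n)
    dlm = dlm_forward_passes(n)
    print(f"{n:>5} tokens: AR {ar:>4} passes, DLM {dlm:>3} passes, ratio {ar / dlm:.2f}x")
# 16 tokens -> 1.60x, 64 tokens -> 3.20x, 2000 tokens -> 3.17x

# With K=20 a 16-token reply needs more passes than AR would.
print(dlm_forward_passes(16, denoise_steps=20))  # 20
```

Short replies pay for a full block even when most of it is unused, which is why the gap closes, or even reverses, below one block of output.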
Ecosystem Maturity
The AR ecosystem has years of accumulated tooling: GPTQ/AWQ/GGUF quantization, vLLM/TGI/Ollama serving, LoRA/QLoRA fine-tuning, and a vast practitioner community. DLM tooling is still catching up:
- SGLang DLM support is currently in a PR (not yet merged to main as of May 2026)
- INT8/INT4 quantization for the diffusion decoding path is under active development
- Fine-tuning DLMs with LoRA requires modification of the standard recipes to account for the joint AR+diffusion objective
Training Complexity
While AR-to-DLM conversion lowers the barrier significantly, producing a high-quality DLM still requires a strong pretrained AR base, careful α calibration in the joint loss, tuning the position-dependent masking schedule, and large-scale continued pretraining data. Fine-tuning smaller DLMs on domain-specific data is possible but requires validated recipes still being developed by the community.
The Open Question: Revision vs. Causal Reasoning
The theoretical advantage of DLM revision is well-established for structured generation. What's less settled is whether the revision benefit materialises for tasks requiring long causal chains of reasoning, such as multi-step math proofs and complex algorithm design. Some preliminary results suggest AR models retain an edge in these scenarios because each token is conditioned on a fully committed chain of prior reasoning steps. This remains an active research question.
10. Conclusion — The Autoregressive Era Has Competition
Autoregressive generation is not dead. The AR paradigm will remain dominant for years, supported by massive tooling investment, a vast pretrained model zoo, and simpler training dynamics. But for the first time, there is a credible, production-grade alternative that doesn't sacrifice quality to get there.
Diffusion language models solve a real infrastructure problem: the memory-bound performance ceiling of token-by-token generation. By operating on 32-token blocks with iterative denoising, DLMs push text generation toward the compute-bound regime, using what modern GPUs are actually built for instead of fighting their memory constraints. Nemotron-Labs Diffusion makes this concrete: 6.4× throughput gains, a 1.2% average benchmark improvement over Qwen3-8B for the 8B variant, and a three-mode inference API that requires zero application changes to adopt.
For developers building latency-sensitive applications, high-throughput batch inference pipelines, FIM-based code completion systems, or AI coding agents that struggle with structural constraint adherence — DLMs deserve serious evaluation today.
Your three next steps:
- 🤗 Explore the models: Visit the Nemotron-Labs Diffusion collection on HuggingFace — the 8B Instruct variant is the best starting point
- ⚡ Benchmark it yourself: Deploy the self-speculation mode via the SGLang guide above and measure it against your current AR serving stack
- 📄 Read the research: Efficient-DLM (arXiv:2512.14067) is the essential paper — the community is moving fast and tooling gaps will close quickly
The age of generating text one token at a time — because that was the only way we knew how — is ending. The question isn't if diffusion language models will become standard LLM serving infrastructure, but when the ecosystem catches up to the benchmark numbers.
Based on what landed this week, that timeline just moved significantly closer.
References
- NVIDIA Nemotron-Labs Diffusion Blog, HuggingFace (May 23, 2026): https://huggingface.co/blog/nvidia/nemotron-labs-diffusion
- Efficient-DLM — Fu et al., arXiv:2512.14067 (Dec 2025 / Apr 2026 v2): https://arxiv.org/abs/2512.14067
- Constraint Decay in LLM Coding Agents — Papotti et al., arXiv:2605.06445 (May 2026): https://arxiv.org/abs/2605.06445
- SGLang DLM Integration PR #25803: https://github.com/sgl-project/sglang/pull/25803
- NVIDIA Megatron-Bridge Training Code: https://github.com/NVIDIA-NeMo/Megatron-Bridge/


