Manoranjan Rajguru

Posted on May 25

Diffusion Language Models: How NVIDIA's Nemotron-Labs DLM Is Killing Token-by-Token Generation

#ai #llm #machinelearning #python

Published May 25, 2026 · 18 min read

The Token-by-Token Tax — Why We Need Something Better
Why Autoregressive Generation Is Fundamentally Memory-Bound
What Is a Diffusion Language Model?
The Efficient-DLM Training Trick: Converting AR Models Into DLMs
Inside Nemotron-Labs Diffusion: Three Inference Modes
Deploying DLMs with SGLang — A Practical Guide
Fill-in-the-Middle and Code Infill: Why DLMs Win at Revision Tasks
Constraint Decay in Coding Agents — And How DLMs Can Help
Limitations and Open Questions
Conclusion — The Autoregressive Era Has Competition

1. The Token-by-Token Tax

What if your LLM could write an entire paragraph in a single forward pass — then revise it before handing anything back to you?

That's not a speculative future. That's what NVIDIA shipped on May 23, 2026.

For the last five years, every major language model — GPT-4, Claude, Llama, Gemini, Qwen — has generated text the same way: one token at a time, left to right, forever forward, never looking back. This autoregressive (AR) paradigm has been extraordinarily powerful. It's also been a performance ceiling that the entire industry has been quietly bumping into.

The compute story is brutal if you look at it squarely: generating a 2,000-token response requires 2,000 sequential forward passes through your multi-billion-parameter model. Every single pass re-loads all those weights from memory. Your H100 — a machine with 3.35 TB/s of memory bandwidth and 989 TFLOPS of FP16 compute — spends the vast majority of its time in memory operations, not computation. You're paying for a race car and spending most of your time in the pit lane.

NVIDIA's Nemotron-Labs Diffusion is the first production-grade diffusion language model family to directly challenge the token-by-token paradigm at scale. Released publicly on HuggingFace on May 23, 2026, it comes in 3B, 8B, and 14B parameter variants, reaches up to 6.4× the throughput of equivalent autoregressive decoding in its fastest mode (QuadSpec self-speculation, covered in section 5), and, critically, does so without sacrificing output quality. The 8B variant actually outperforms Qwen3-8B by 1.2% on average across benchmarks.

This article goes deep on how diffusion language models actually work under the hood, how Nemotron-Labs Diffusion was built on the Efficient-DLM training framework, what the three inference modes mean for your production architecture, and how to get your hands on it today.

2. Why Autoregressive Generation Is Fundamentally Memory-Bound

To appreciate what diffusion language models solve, you first need to understand exactly why autoregressive generation is slow — and why making your GPU faster doesn't fully fix it.

The KV Cache and the One-Token-at-a-Time Law

In an autoregressive transformer, generating token t+1 requires attending over all previous tokens 0..t. The standard optimization here is the KV cache: instead of recomputing the Key and Value projections for all prior tokens on every step, you cache them and only compute new K/V for the freshly generated token.

This is the right optimization — it reduces per-step compute from O(N²) to O(N). But it doesn't change the fundamental structure: you still do one forward pass, commit one token, and repeat.

The time to generate N tokens is therefore:

total_latency = N × (weight_load_time + compute_time + sampling_time)

For large models (>7B parameters), weight_load_time dominates, especially at batch size 1. An 8B parameter model has roughly 16GB of weights in FP16. At H100 memory bandwidth of 3.35 TB/s, the theoretical minimum to load all weights once is ~4.8ms. At 2,000 tokens, that's a floor of ~9.6 seconds just in memory I/O — before any actual computation.
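The arithmetic behind that floor can be checked in a few lines. This is a back-of-envelope sketch that uses only the figures quoted above (8B parameters, 2 bytes per parameter, 3.35 TB/s); the function name is illustrative, and the estimate ignores KV-cache reads and compute time.

```python
def weight_load_floor_seconds(
    n_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
    n_tokens: int,
) -> float:
    """Lower bound on decode time if every token needs one full pass over the weights."""
    seconds_per_pass = (n_params * bytes_per_param) / bandwidth_bytes_per_s
    return seconds_per_pass * n_tokens


per_token = weight_load_floor_seconds(8e9, 2, 3.35e12, n_tokens=1)
total = weight_load_floor_seconds(8e9, 2, 3.35e12, n_tokens=2000)
print(f"per token: {per_token * 1e3:.2f} ms, 2,000 tokens: {total:.2f} s")
```

Swapping in a different model size or bandwidth shows how the floor moves: halving the bytes per parameter halves it, while a faster compute unit does not change it at all.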

The Roofline Analysis

The roofline model is the cleanest way to visualize this. Every GPU has two performance limits:

Compute-bound roof: Determined by peak FP16/BF16 TFLOPS
Memory-bound roof: Determined by peak memory bandwidth × the arithmetic intensity of the workload

For a 7B model forward pass generating a single token, the arithmetic intensity (FLOPs per byte accessed) is approximately:

Arithmetic Intensity = (2 × N_params × batch_size) / (2 × N_params × bytes_per_param)
                     = batch_size / bytes_per_param
                     ≈ 1 / 2  (batch_size=1, BF16=2 bytes)
                     = 0.5 FLOP/byte

The H100's ridge point (the crossover between memory-bound and compute-bound) is roughly 300 FLOP/byte (989 TFLOPS ÷ 3.35 TB/s ≈ 295). At 0.5 FLOP/byte, you reach less than 0.2% of peak compute. You are almost entirely memory-bound, leaving nearly all of the GPU's arithmetic capacity idle.

Why Batching Helps — But Has Limits

Batching is the standard answer: run more requests concurrently to increase arithmetic intensity. At batch size 128:

Arithmetic Intensity ≈ 128 / 2 = 64 FLOP/byte

Better — but still 5× below the ridge point. And in production latency-sensitive scenarios (chat, copilots, real-time agents), you often can't batch aggressively. Individual users don't want to wait for 127 other requests to fill a batch.

This is the performance trap that diffusion language models are designed to escape.
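To make the roofline numbers concrete, the sketch below computes the ridge point and the attainable compute at the two arithmetic intensities quoted in this section. It assumes the H100 peak figures from section 1 and takes the intensities (0.5 and 64 FLOP/byte) as given inputs rather than deriving them; the helper name is made up for illustration.

```python
PEAK_FLOPS = 989e12   # H100 FP16 peak in FLOP/s, as quoted in section 1
BANDWIDTH = 3.35e12   # H100 memory bandwidth in bytes/s


def attainable_flops(intensity: float) -> float:
    """Roofline model: performance is capped by compute or by bandwidth x intensity."""
    return min(PEAK_FLOPS, BANDWIDTH * intensity)


# Intensity (FLOP/byte) at which the memory roof meets the compute roof
ridge_point = PEAK_FLOPS / BANDWIDTH

for intensity in (0.5, 64.0):
    fraction = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{intensity:>5} FLOP/byte -> {fraction:.2%} of peak compute")
print(f"ridge point = {ridge_point:.0f} FLOP/byte")
```

Anything left of the ridge point is limited by how fast weights arrive, which is why raising the work done per weight load matters more than raising peak FLOPS.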

3. What Is a Diffusion Language Model?

Diffusion models were first popularized for image generation (DDPM, Stable Diffusion, DALL-E). The core idea: instead of generating output autoregressively, start from pure noise and iteratively denoise toward a clean sample. The insight that DLMs bring to text generation is applying this same iterative refinement paradigm to sequences of discrete tokens.

From Gaussian Noise to Masked Token Diffusion

Image diffusion operates in continuous space: add Gaussian noise gradually, train a neural network to reverse that process. Text tokens are discrete — you can't add Gaussian noise to the word "transformer." The solution is absorbing diffusion (also called masked diffusion): rather than adding Gaussian noise, tokens are progressively masked (replaced with a special [MASK] token), and the model learns to unmask them. This is distinct from image diffusion — there is no noise distribution over real values, only a binary clean-or-masked state for each token position.

The forward (corruption) process replaces tokens with [MASK] with probability proportional to a noise schedule. At maximum corruption, the entire sequence is masked. The reverse (generation) process starts from a fully masked sequence and iteratively fills it in.

Formally, the model parameterises the conditional distribution:

p_θ(x₀ | xₜ)

Where x₀ is the clean token sequence, xₜ is the corrupted sequence at timestep t, and the model predicts the original clean tokens for all masked positions simultaneously — that "simultaneously" is the entire game.
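The forward corruption step is simple enough to sketch directly. The snippet below is a toy illustration, not NVIDIA's training code: it masks each token independently with probability t, so t=0 leaves the sequence clean and t=1 masks everything. The `[MASK]` string and the function name are placeholders.

```python
import random

MASK = "[MASK]"


def corrupt(tokens: list[str], t: float, rng: random.Random) -> list[str]:
    """Forward process of masked diffusion: replace each token with MASK with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
clean = "diffusion models denoise masked tokens in parallel".split()
for t in (0.0, 0.5, 1.0):
    print(t, corrupt(clean, t, rng))
```

Training pairs each corrupted sequence with its clean original, so the model learns to predict every masked position from whatever context survives.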

Block-Level Generation in Practice

Modern DLMs like Nemotron-Labs Diffusion operate at the block level rather than the full-sequence level. The model generates output in 32-token blocks. For each block:

Initialize the block with [MASK] tokens
Run a forward pass predicting all 32 token positions simultaneously
Accept high-confidence predictions (above threshold τ) and re-mask low-confidence ones
Repeat the denoising loop for K steps until the block stabilises
Commit the block and advance to the next

This has a critical GPU efficiency property: instead of one matrix multiply per token, you do one matrix multiply per block of 32 tokens. The effective arithmetic intensity scales with block size:

Arithmetic Intensity (DLM) ≈ (batch_size × block_size) / bytes_per_param
                            = (1 × 32) / 2
                            = 16 FLOP/byte

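That's 32× the arithmetic intensity of single-token AR generation, which moves each forward pass 32× closer to the compute-bound regime. Because a block needs several denoising passes before it is committed, the net gain in tokens per weight load is smaller than 32×, but this is where the throughput gains come from.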
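The block-level loop listed earlier in this section can be sketched with a toy predictor standing in for the model. Everything here is an assumption made for illustration: the `Predictor` interface, the confidence values, and the final force-commit when the step budget runs out (the behaviour the article describes for FastDiffuser). It is not the Nemotron-Labs implementation.

```python
from typing import Callable, Optional

Block = list[Optional[str]]  # None marks a masked position
Predictor = Callable[[Block], list[tuple[str, float]]]


def denoise_block(
    predict: Predictor, block_size: int, tau: float, max_steps: int
) -> tuple[list[str], int]:
    """Fill one fully masked block. Returns (tokens, forward passes spent in the loop)."""
    block: Block = [None] * block_size
    steps = 0
    while steps < max_steps and None in block:
        guesses = predict(block)  # one pass scores every position at once
        steps += 1
        for i, (token, confidence) in enumerate(guesses):
            # Keep only confident guesses; the rest stay masked for the next pass
            if block[i] is None and confidence >= tau:
                block[i] = token
    if None in block:
        # Step budget exhausted: force-commit the current best guess for what is left
        guesses = predict(block)
        block = [tok if tok is not None else guesses[i][0] for i, tok in enumerate(block)]
    return [str(tok) for tok in block], steps


TARGET = "the quick brown fox jumps over lazy dogs".split()


def toy_predict(block: Block) -> list[tuple[str, float]]:
    """Stand-in for the model: confident at anchor positions and next to known tokens."""
    guesses = []
    for i in range(len(block)):
        left_known = i > 0 and block[i - 1] is not None
        right_known = i + 1 < len(block) and block[i + 1] is not None
        confidence = 0.95 if (i % 4 == 0 or left_known or right_known) else 0.4
        guesses.append((TARGET[i], confidence))
    return guesses


tokens, steps = denoise_block(toy_predict, block_size=8, tau=0.9, max_steps=10)
print(steps, tokens)
```

With this toy predictor the 8-token block completes in 4 passes rather than 8, because each pass commits every position whose confidence clears τ instead of just the next one.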

The Bidirectional Advantage

There's a second — often underappreciated — advantage of DLMs: they generate bidirectionally within each block. An AR model has a hard causal constraint: token at position i can only attend to tokens at positions < i. This means:

The model cannot revise a previously committed token, ever
Fill-in-the-middle (generating tokens given left and right context) requires special training hacks
Once an error propagates, it compounds forward indefinitely

DLMs have no such constraint within their generation window. Attention within each block is fully bidirectional — every token attends to all other positions in the block simultaneously. This means the model makes all 32 decisions with full awareness of its other decisions, and updates any of them in subsequent denoising steps. Errors are corrected before they're committed.

4. The Efficient-DLM Training Trick: Converting AR Models Into DLMs

The biggest historical barrier to DLMs wasn't architectural — it was training efficiency. DLMs trained from scratch lagged significantly behind AR models of equivalent parameter count. The research breakthrough that unblocked Nemotron-Labs Diffusion is the Efficient-DLM framework (Fu et al., arXiv:2512.14067).

The key insight: don't train from scratch. Convert a pretrained AR model into a DLM.

Why AR-to-DLM Conversion Works

Pretrained AR models have already learned rich linguistic representations: grammar, facts, reasoning patterns, code structure. The AR training objective shapes the weight space in a way that's compatible with DLM objectives, because both ultimately require modelling p(token | context). The weight distributions learned under AR training are close to what a DLM needs.

The conversion proceeds through continued pretraining on the pretrained AR checkpoint, adding a diffusion training objective without discarding AR capability.

Block-Wise Attention: The Critical Design Choice

Efficient-DLM found that the attention pattern used during conversion is the most consequential design decision. Two options exist:

Fully bidirectional attention — every token attends to every other token (like BERT)
Block-wise attention — causal across blocks, bidirectional within blocks

Fully bidirectional attention diverges from the causal AR weight distribution, causing significant accuracy regression. Block-wise attention maintains causal structure at the block boundary level while enabling the intra-block bidirectionality needed for parallel generation — and it remains KV-cache-compatible.

# Simplified illustration of block-wise attention mask construction
import torch

def build_block_wise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Build a block-wise attention mask where:
    - Tokens within the same block attend bidirectionally to each other
    - Blocks attend causally to prior blocks (no future block leakage)

    Returns a boolean mask of shape (seq_len, seq_len)
    where True = attend, False = mask out
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Step by block_size so a trailing partial block is still covered
    for block_start in range(0, seq_len, block_size):
        block_end = min(block_start + block_size, seq_len)

        # Within-block: full bidirectional attention
        mask[block_start:block_end, block_start:block_end] = True

        # Cross-block: causal, so the current block attends to all previous blocks
        if block_start > 0:
            mask[block_start:block_end, :block_start] = True

    return mask


# Visualise an 8-token sequence with block_size=4
mask = build_block_wise_attention_mask(seq_len=8, block_size=4)
print("Block-wise attention mask (1 = attend, 0 = masked):")
print(mask.int())
# Block 0: bidirectional within itself only
# Block 1: bidirectional within itself + sees all of Block 0

Position-Dependent Token Masking

The second key innovation in Efficient-DLM is position-dependent token masking during training. In naive masked diffusion, tokens are masked uniformly at random. But at inference time, the masking pattern is left-to-right — earlier positions are already committed while later positions remain masked.

Efficient-DLM fixes this train-test mismatch by assigning higher masking probabilities to later positions during training, closely matching the left-to-right test-time pattern:

import numpy as np

def position_dependent_mask_rate(
    seq_len: int,
    base_rate: float = 0.15,
    position_scale: float = 2.0
) -> np.ndarray:
    """
    Compute per-position masking probability that increases toward end of sequence.
    Earlier tokens (already committed) get lower mask rates.
    Later tokens (not yet generated) get higher mask rates.

    Args:
        seq_len: Total sequence length
        base_rate: Minimum masking probability (applied to first token)
        position_scale: Multiplier — how much masking rate grows toward end

    Returns:
        Array of shape (seq_len,) with per-position mask probabilities
    """
    positions = np.linspace(0, 1, seq_len)
    mask_rates = base_rate * (1 + position_scale * positions)
    return np.clip(mask_rates, 0.0, 1.0)


rates = position_dependent_mask_rate(seq_len=32)
print(f"Token  0 mask rate: {rates[0]:.3f}")   # 0.150: low, likely already committed
print(f"Token 15 mask rate: {rates[15]:.3f}")  # 0.295: mid
print(f"Token 31 mask rate: {rates[31]:.3f}")  # 0.450: high, likely still masked

The Joint Training Objective

The final piece is the loss function, which combines AR and diffusion objectives:

L_total = α × L_AR + (1 - α) × L_diffusion

L_AR: Standard cross-entropy next-token prediction (causal, left-to-right)
L_diffusion: Masked token prediction loss (bidirectional within blocks, over all masked positions simultaneously)
α: Typically 0.2–0.3, balancing AR capability preservation against diffusion capability acquisition
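As a small worked example of the weighting, the sketch below combines two cross-entropy terms exactly as the formula states. It is plain Python with made-up probabilities, not a training loop; `joint_loss` and its default α of 0.25 (inside the range given above) are illustrative choices.

```python
import math


def cross_entropy(target_probs: list[float]) -> float:
    """Mean negative log-likelihood, given the probability assigned to each correct token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def joint_loss(
    ar_probs: list[float], diffusion_probs: list[float], alpha: float = 0.25
) -> float:
    """L_total = alpha * L_AR + (1 - alpha) * L_diffusion."""
    return alpha * cross_entropy(ar_probs) + (1 - alpha) * cross_entropy(diffusion_probs)


# Hypothetical per-token probabilities of the correct token under each objective
ar_probs = [0.9, 0.8, 0.7]
diffusion_probs = [0.6, 0.5, 0.4]
print(f"L_total = {joint_loss(ar_probs, diffusion_probs):.4f}")
```

Because α is well below 0.5, most of the gradient signal comes from the diffusion term, while the AR term acts as an anchor that keeps the original capability from drifting.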

NVIDIA trained Nemotron-Labs Diffusion on 1.3 trillion tokens from the NVIDIA Nemotron Pretraining datasets, followed by supervised fine-tuning on 45 billion tokens from the NVIDIA Nemotron Post-training datasets using this framework.

5. Inside Nemotron-Labs Diffusion: Three Inference Modes

The most developer-friendly aspect of Nemotron-Labs Diffusion is that it ships as a single model checkpoint with three distinct inference personalities, selectable at deployment time with zero application-level changes.

Mode 1: Autoregressive (Baseline Compatibility)

Plain AR mode generates tokens left-to-right exactly like any standard causal LM. This mode exists for correctness validation, backward compatibility with existing AR pipelines, and A/B testing. The model's AR output quality is fully preserved by the joint training objective — it behaves identically to a native AR model.

Mode 2: FastDiffuser (Pure Diffusion Decoding)

FastDiffuser is the headline throughput mode, operating in a block-by-block denoising loop:

Initialise a 32-token block with [MASK] tokens
Run a full forward pass predicting all 32 positions simultaneously
Accept tokens above confidence threshold τ (typically 0.9); re-mask the rest
Repeat the denoising loop for up to K steps (typically 10–20)
Force-commit remaining masked tokens after K steps; advance to the next block

Throughput: about 2.6 tokens per forward pass (TPF), versus 1 for the AR baseline.

The confidence threshold τ is a quality-speed lever: higher τ means more re-masking iterations (better quality, slower); lower τ means fewer iterations needed (faster, slightly lower quality).

Mode 3: Self-Speculation (LinearSpec / QuadSpec)

Self-speculation is where Nemotron-Labs Diffusion gets truly elegant. The same model plays two roles simultaneously: drafter (diffusion head) and verifier (AR head).

The draft-verify loop:

Use the diffusion head to generate a candidate block of tokens bidirectionally (fast)
Run the AR head causally over the proposed block to compute acceptance probabilities
Accept the longest valid prefix where AR agrees with the diffusion draft
Advance by the number of accepted tokens — often the entire block — in one round trip

This is speculative decoding with the model acting as its own drafter: at temperature=0 the committed tokens are exactly those that pure AR greedy decoding would produce, so quality is preserved by construction.
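The acceptance step of the draft-verify loop can be sketched for the greedy (temperature=0) case. This is a generic speculative-decoding illustration, not the LinearSpec or QuadSpec algorithm: token IDs are plain integers, and taking the verifier's own token at the first mismatch is the usual convention in speculative decoding rather than something the article specifies.

```python
def accept_longest_prefix(draft: list[int], verified: list[int]) -> list[int]:
    """
    draft[i]    : token the diffusion pass proposed at position i
    verified[i] : token the causal pass would pick at position i, given draft[:i]
    Returns the tokens to commit in this round.
    """
    accepted: list[int] = []
    for proposed, expected in zip(draft, verified):
        if proposed != expected:
            # First disagreement: commit the verifier's token and stop
            accepted.append(expected)
            break
        accepted.append(proposed)
    return accepted


print(accept_longest_prefix([11, 42, 7, 99], [11, 42, 7, 99]))  # full block accepted
print(accept_longest_prefix([11, 42, 7, 99], [11, 42, 8, 99]))  # stops at position 2
```

Every committed token is one the causal pass would have chosen anyway, which is why the output matches greedy AR decoding while needing far fewer round trips when the draft is good.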

Two variants:

LinearSpec: ~4× AR baseline throughput
QuadSpec: ~6.4× AR baseline throughput — ~865 tokens/second on a single NVIDIA B200

6. Deploying DLMs with SGLang — A Practical Guide

Nemotron-Labs Diffusion support for SGLang lives in PR #25803, which had not been merged to main as of May 2026. Here is a complete deployment walkthrough.

Installation

# Install a released SGLang build (DLM support is only included once the PR below is merged)
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/

# Or install from the DLM PR branch directly
git clone https://github.com/sgl-project/sglang.git
cd sglang
git fetch origin pull/25803/head:diffusion-support
git checkout diffusion-support
pip install -e ".[all]"

# Download the model
huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B-Instruct \
    --local-dir ./nemotron-diffusion-8b

Launching the Server

# FastDiffuser mode — maximum raw throughput
python -m sglang.launch_server \
    --model-path ./nemotron-diffusion-8b \
    --port 30000 \
    --tp 1 \
    --trust-remote-code \
    --generation-mode fastdiffuser \
    --diffusion-block-size 32 \
    --diffusion-steps 10 \
    --confidence-threshold 0.9

# Self-Speculation mode — lossless + maximum throughput
python -m sglang.launch_server \
    --model-path ./nemotron-diffusion-8b \
    --port 30000 \
    --tp 1 \
    --trust-remote-code \
    --generation-mode self-speculation \
    --spec-variant quadspec \
    --draft-block-size 32

Running Inference Across All Three Modes

import requests
import time
from typing import Literal

SGLANG_BASE_URL = "http://localhost:30000"

def generate(
    prompt: str,
    mode: Literal["ar", "fastdiffuser", "self-speculation"] = "self-speculation",
    max_tokens: int = 512,
    temperature: float = 0.0,
) -> dict:
    """
    Send a generation request to SGLang serving Nemotron-Labs Diffusion.

    Args:
        prompt: Input text prompt
        mode: Inference mode — 'ar', 'fastdiffuser', or 'self-speculation'
        max_tokens: Maximum tokens to generate
        temperature: 0.0 = greedy/deterministic (self-speculation is lossless at T=0)

    Returns:
        Dict with text output, token count, elapsed time, and throughput
    """
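    # "generation_mode" goes at the top level of the JSON body. "extra_body" is an
    # argument of the OpenAI Python client, not a field of the HTTP request itself.
    payload = {
        "model": "nemotron-diffusion-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "generation_mode": mode,
    }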

    start = time.perf_counter()
    response = requests.post(
        f"{SGLANG_BASE_URL}/v1/chat/completions",
        json=payload,
        timeout=120,
    )
    elapsed = time.perf_counter() - start
    response.raise_for_status()

    data = response.json()
    output_text = data["choices"][0]["message"]["content"]
    tokens_generated = data["usage"]["completion_tokens"]

    return {
        "text": output_text,
        "tokens_generated": tokens_generated,
        "time_seconds": round(elapsed, 3),
        "throughput_tps": round(tokens_generated / elapsed, 1),
    }


# Benchmark all three modes on the same prompt
PROMPT = """
Write a Python function implementing binary search on a sorted list.
Include type annotations, docstring, and edge case handling.
"""

for mode in ["ar", "fastdiffuser", "self-speculation"]:
    result = generate(PROMPT, mode=mode, max_tokens=300)
    print(f"\n[{mode.upper()}]")
    print(f"  Throughput : {result['throughput_tps']} tok/s")
    print(f"  Time       : {result['time_seconds']}s")
    print(f"  Preview    : {result['text'][:100]}...")

Expected Results on NVIDIA B200 (8B, batch size 1)

[AR]
  Throughput : 136.3 tok/s
  Time       : 2.187s

[FASTDIFFUSER]
  Throughput : 353.7 tok/s   # 2.6× faster, same quality
  Time       : 0.843s

[SELF-SPECULATION]
  Throughput : 866.3 tok/s   # 6.4× faster, lossless at T=0
  Time       : 0.344s

7. Fill-in-the-Middle and Code Infill: Why DLMs Win at Revision Tasks

One of the most practically valuable capabilities of DLMs is fill-in-the-middle (FIM) generation — producing text conditioned on both a prefix (left context) and a suffix (right context). This is critical for code completion tools, where you want to fill in a function body given the signature above and the call site below.

Why AR Models Struggle with FIM

AR models can be trained on FIM using sentinel tokens (the PSM format used in CodeLlama: <PRE> {prefix} <SUF> {suffix} <MID>). This works, and the suffix is visible to the model as part of its left context, but the middle is still generated strictly left-to-right: every middle token is committed before the tokens after it exist, so the model cannot revise the infill to make it join cleanly with the suffix. Accuracy degrades as the infill gap widens.
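The reordering behind PSM is easy to see in code. The helper below is a minimal sketch that assembles the prompt string in the layout quoted above; the function name is hypothetical, and real tokenizers emit the sentinels as dedicated special tokens rather than literal text.

```python
def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Arrange a fill-in-the-middle request in prefix-suffix-middle (PSM) order."""
    return f"<PRE> {prefix} <SUF> {suffix} <MID>"


prompt = build_psm_prompt(
    prefix="def add(a, b):\n",
    suffix="\nprint(add(1, 2))",
)
print(prompt)
```

The model then continues after `<MID>`, so the suffix sits in its left context even though it belongs to the right of the gap in the final file.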

DLMs Do FIM Natively

A DLM with block-wise attention genuinely attends to the committed prefix blocks and a right-side suffix context simultaneously. The infill block is initialised as masked and denoised with full bilateral awareness:

import requests


def dlm_fill_in_middle(
    prefix: str,
    suffix: str,
    fill_length: int = 64,
    sglang_url: str = "http://localhost:30000",
) -> str:
    """
    Use Nemotron-Labs Diffusion for native fill-in-the-middle code generation.
    The DLM conditions on both prefix and suffix during denoising, so it can
    revise the infill before committing it, which left-to-right decoding cannot.

    Args:
        prefix: Code appearing before the section to fill
        suffix: Code appearing after the section to fill
        fill_length: Approximate number of tokens to generate
        sglang_url: SGLang server endpoint

    Returns:
        Generated middle section
    """
    # Extra fields sit at the top level of the JSON body; "extra_body" is an
    # argument of the OpenAI Python client, not part of the HTTP request.
    payload = {
        "model": "nemotron-diffusion-8b",
        "prompt": prefix,
        "suffix": suffix,           # Native bilateral conditioning
        "max_tokens": fill_length,
        "temperature": 0.0,
        "generation_mode": "self-speculation",
        "fim_mode": True,           # Enables bidirectional suffix conditioning
    }
    response = requests.post(f"{sglang_url}/v1/completions", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]


# Example: Generate a function body given signature above and assertions below
prefix = '''
def merge_sorted_arrays(arr1: list[int], arr2: list[int]) -> list[int]:
    """Merge two sorted arrays into a single sorted array in O(n+m) time."""
'''

suffix = '''

# Downstream call site — the DLM sees this during generation
result = merge_sorted_arrays([1, 3, 5, 7], [2, 4, 6, 8])
assert result == [1, 2, 3, 4, 5, 6, 7, 8]
assert merge_sorted_arrays([], [1, 2]) == [1, 2]
assert merge_sorted_arrays([5], []) == [5]
'''

body = dlm_fill_in_middle(prefix, suffix)
print(f"Generated body:\n{body}")

The DLM sees both the function signature (left) and the assertion-based call site (right) simultaneously, generating a body that satisfies both sides. Pure AR generation cannot do this without special reordering tricks.

8. Constraint Decay in Coding Agents — And How DLMs Can Help

A concurrent paper trending heavily on Hacker News this week (190+ upvotes) makes a directly relevant finding: coding agents lose ~30 percentage points in assertion pass rate as structural constraints accumulate in multi-file backend generation tasks (Papotti et al., arXiv:2605.06445).

What Constraint Decay Looks Like

The study evaluated LLM-based coding agents across 80 greenfield backend generation tasks and 20 feature implementation tasks spanning 8 web frameworks (Flask, FastAPI, Django, etc.). Key findings:

At baseline (minimal constraints): agents achieve ~75–80% assertion pass rate
As constraints accumulate (ORM schemas, architectural patterns, specific database relationships): performance drops to ~45–50% for strong models, approaching zero for weaker ones
Root causes are primarily data-layer defects: incorrect ORM query composition and runtime violations

This is characteristic of AR generation: early structural decisions compound into downstream violations the model cannot revise. A wrong ORM relationship in migration file 1 propagates broken query patterns through files 3, 7, and 12.

DLMs as a Structural Fix

DLMs offer two architectural responses:

1. Constraint-Aware Iterative Refinement

Because DLMs revise tokens within the block before committing them, a constraint-checking oracle inserted between denoising steps can re-mask tokens that would violate structural requirements:

from dataclasses import dataclass
from typing import Callable

import requests


@dataclass
class ConstraintOracle:
    """
    Evaluates structural constraints on partially-generated code blocks.
    Returns a per-token validity mask to guide re-denoising.
    """
    check_fn: Callable[[str], list[bool]]

    def evaluate(self, context: str, proposed_tokens: list[str]) -> list[bool]:
        """
        Returns True for each token position that should be committed,
        False for positions that violate constraints and should be re-masked.
        """
        return self.check_fn(context + "".join(proposed_tokens))


def constrained_dlm_generation(
    prompt: str,
    constraint_oracle: ConstraintOracle,
    max_tokens: int = 512,
    max_refinement_rounds: int = 5,
    sglang_url: str = "http://localhost:30000",
) -> str:
    """
    DLM-based code generation with structural constraint checking.

    Re-masking between denoising steps needs a hook inside the serving engine,
    which this sketch does not assume. It approximates the idea from the client
    instead: generate, run the oracle on the result, and regenerate while any
    position is still flagged, up to max_refinement_rounds attempts. This targets
    the constraint decay failure mode documented in arXiv:2605.06445.
    """
    payload = {
        "model": "nemotron-diffusion-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.1,  # non-zero so a retry can produce a different draft
        "generation_mode": "fastdiffuser",
    }
    text = ""
    for _ in range(max_refinement_rounds):
        response = requests.post(
            f"{sglang_url}/v1/chat/completions", json=payload, timeout=120
        )
        response.raise_for_status()
        text = response.json()["choices"][0]["message"]["content"]
        # Treat the whole completion as one proposed span; stop once nothing is flagged
        if all(constraint_oracle.evaluate("", [text])):
            break
    return text

2. FIM-Based Surgical Patching

When an agent detects a structural violation in a previously generated file, it can use DLM FIM to surgically patch the violation — filling in a corrected section with both the surrounding correct code as prefix and the downstream dependent code as suffix. AR models must regenerate everything from the violation point forward; DLMs can fill in the fix with bilateral awareness of all surrounding context.
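The patching workflow comes down to splitting a file around the faulty lines and handing both halves to a fill-in-the-middle call. The helpers below are a hypothetical sketch of that split and reassembly only; the names are invented, and producing the replacement text is left to whatever FIM endpoint is in use.

```python
def split_for_infill(source: str, first_bad_line: int, last_bad_line: int) -> tuple[str, str]:
    """Split source into (prefix, suffix) around a 1-indexed, inclusive line range."""
    lines = source.splitlines(keepends=True)
    if not 1 <= first_bad_line <= last_bad_line <= len(lines):
        raise ValueError("line range is outside the file")
    prefix = "".join(lines[: first_bad_line - 1])
    suffix = "".join(lines[last_bad_line:])
    return prefix, suffix


def apply_patch(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the file with the regenerated middle section."""
    return prefix + middle + suffix


source = "class User(Base):\n    orders = bad_relationship()\n    name = Column(String)\n"
prefix, suffix = split_for_infill(source, first_bad_line=2, last_bad_line=2)
patched = apply_patch(prefix, "    orders = relationship('Order')\n", suffix)
print(patched)
```

Only the flagged range is regenerated, so code after the violation that was already correct is kept byte for byte.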

9. Limitations and Open Questions

Diffusion language models are not a drop-in replacement for AR in all scenarios. Here's what developers should weigh carefully:

Latency at Small Batch Size and Short Outputs

The throughput gains of DLMs are most pronounced at high batch sizes or long output sequences. For very short outputs (<64 tokens) at batch size 1, the block-denoising overhead can narrow the advantage:

DLM_latency ≈ (N_tokens / block_size) × K_denoising_steps × forward_pass_time
             (where forward_pass_time includes full model weight load + compute)

For latency-critical single-query applications where first-token latency matters more than throughput (interactive chat interfaces), the AR or hybrid self-speculation modes may still be preferable. Self-speculation in particular delivers strong latency and throughput benefits simultaneously.
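A quick way to see where the advantage narrows is to count forward passes under the latency formula above. This sketch assumes the worst case in which every block uses its full denoising budget, and it treats AR and DLM passes as costing the same, which is only roughly true while decoding is memory-bound. The function names are illustrative.

```python
import math


def ar_forward_passes(n_tokens: int) -> int:
    """Autoregressive decoding: one pass per generated token."""
    return n_tokens


def dlm_forward_passes(n_tokens: int, block_size: int, denoise_steps: int) -> int:
    """Block diffusion upper bound: every block uses its full denoising budget."""
    return math.ceil(n_tokens / block_size) * denoise_steps


for n_tokens in (48, 2000):
    for k in (10, 20):
        dlm = dlm_forward_passes(n_tokens, block_size=32, denoise_steps=k)
        ar = ar_forward_passes(n_tokens)
        print(f"N={n_tokens:>4} K={k:>2}: AR={ar:>4} passes, DLM<={dlm:>4} passes")
```

For a 48-token reply with K=20 the bound is 40 passes against 48 for AR, so the advantage nearly disappears; at 2,000 tokens with K=10 it is 630 against 2,000.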

Ecosystem Maturity

The AR ecosystem has years of accumulated tooling: GPTQ/AWQ/GGUF quantization, vLLM/TGI/Ollama serving, LoRA/QLoRA fine-tuning, and a vast practitioner community. DLM tooling is still catching up:

SGLang DLM support is currently in a PR (not yet merged to main as of May 2026)
INT8/INT4 quantization for the diffusion decoding path is under active development
Fine-tuning DLMs with LoRA requires modification of the standard recipes to account for the joint AR+diffusion objective

Training Complexity

While AR-to-DLM conversion lowers the barrier significantly, producing a high-quality DLM still requires a strong pretrained AR base, careful α calibration in the joint loss, tuning the position-dependent masking schedule, and large-scale continued pretraining data. Fine-tuning smaller DLMs on domain-specific data is possible but requires validated recipes still being developed by the community.

The Open Question: Revision vs. Causal Reasoning

The theoretical advantage of DLM revision is well-established for structured generation. What's less settled is whether the revision benefit materialises for tasks requiring long causal chains of reasoning, such as multi-step math proofs or complex algorithm design. Some preliminary results suggest AR models retain an edge in these scenarios because each token is conditioned on a fully committed prefix, so every reasoning step builds on earlier steps that can no longer change. This remains an active research question.

10. Conclusion — The Autoregressive Era Has Competition

Autoregressive generation is not dead. The AR paradigm will remain dominant for years, supported by massive tooling investment, a vast pretrained model zoo, and simpler training dynamics. But for the first time, there is a credible, production-grade alternative that doesn't sacrifice quality to get there.

Diffusion language models solve a real infrastructure problem: the memory-bound performance ceiling of token-by-token generation. By operating on 32-token blocks with iterative denoising, DLMs reframe text generation as a compute-bound problem — leveraging what modern GPUs are actually built for instead of fighting their memory constraints. Nemotron-Labs Diffusion makes this concrete: 6.4× throughput gains, 1.2% better accuracy than Qwen3-8B, and a three-mode inference API that requires zero application changes to adopt.

For developers building latency-sensitive applications, high-throughput batch inference pipelines, FIM-based code completion systems, or AI coding agents that struggle with structural constraint adherence — DLMs deserve serious evaluation today.

Your three next steps:

🤗 Explore the models: Visit the Nemotron-Labs Diffusion collection on HuggingFace — the 8B Instruct variant is the best starting point
⚡ Benchmark it yourself: Deploy the self-speculation mode via the SGLang guide above and measure it against your current AR serving stack
📄 Read the research: Efficient-DLM (arXiv:2512.14067) is the essential paper — the community is moving fast and tooling gaps will close quickly

The age of generating text one token at a time — because that was the only way we knew how — is ending. The question isn't if diffusion language models will become standard LLM serving infrastructure, but when the ecosystem catches up to the benchmark numbers.

Based on what landed this week, that timeline just moved significantly closer.

Found this useful? Share it with your team and star the Nemotron-Labs Diffusion repo. Questions or corrections? Drop them in the comments below — I read every one.

References

NVIDIA Nemotron-Labs Diffusion Blog, HuggingFace (May 23, 2026): https://huggingface.co/blog/nvidia/nemotron-labs-diffusion
Efficient-DLM — Fu et al., arXiv:2512.14067 (Dec 2025 / Apr 2026 v2): https://arxiv.org/abs/2512.14067
Constraint Decay in LLM Coding Agents — Papotti et al., arXiv:2605.06445 (May 2026): https://arxiv.org/abs/2605.06445
SGLang DLM Integration PR #25803: https://github.com/sgl-project/sglang/pull/25803
NVIDIA Megatron-Bridge Training Code: https://github.com/NVIDIA-NeMo/Megatron-Bridge/
