DEV Community

Manoranjan Rajguru
Diffusion Language Models Are Here: Deep Dive into NVIDIA's Nemotron-Labs DLM Architecture

Meta Description: NVIDIA just open-sourced Nemotron-Labs Diffusion — a family of 3B, 8B, and 14B diffusion language models that merge autoregressive and diffusion generation for up to 6.4× faster inference. Here's the complete technical deep dive into the architecture, training methodology, three generation modes, and how to run it today with SGLang.



Table of Contents

  1. The Speed Wall Autoregressive LLMs Hit
  2. What Are Diffusion Language Models?
  3. Why DLMs Struggled — Until Now
  4. NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM
  5. Nemotron-Labs Diffusion: The Model Family
  6. Three Generation Modes: AR, Diffusion, Self-Speculation
  7. Performance Deep Dive: The Numbers That Matter
  8. Under the Hood: Block-Wise Attention & KV Caching
  9. Getting Started: Running with SGLang
  10. What This Means for Production LLM Infrastructure
  11. Conclusion & The Road Ahead

1. The Speed Wall Autoregressive LLMs Hit

Every language model you've ever used — GPT-4, Claude, Llama, Qwen — generates text the same fundamental way: one token at a time, left to right, each new token conditioned on every previous one. It's called autoregressive (AR) generation, and it's been the undisputed king of language modeling since the original GPT paper in 2018.

But AR generation has a dirty secret. It's not a compute-bound problem. It's a memory-bandwidth-bound problem.

Here's why that matters: each new token requires a full model forward pass. That means loading all the model's weights (about 14GB for a 7B model in FP16) from HBM (High Bandwidth Memory) into the GPU's compute cores, every single decoding step. On modern GPUs, the arithmetic throughput is enormous, but the memory bandwidth is the bottleneck. This is why serving an LLM at batch size 1, a single user chatting with your model, leaves your GPU vastly underutilized.

The math is brutal. An A100 80GB GPU has ~2TB/s of HBM bandwidth. A 7B-parameter model in FP16 takes ~14GB, so reading every weight once takes at least ~7ms per decoding step. That puts a ceiling of roughly 140 tokens/second on single-stream decoding no matter how much arithmetic throughput the GPU has, and real deployments typically land below it once KV-cache reads and kernel overheads are added. Scale this to a production API endpoint handling thousands of concurrent users, and the economics become painful.
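The arithmetic above is easy to reproduce. The sketch below is a back-of-the-envelope bound only: it assumes weight reads are the sole cost of a decode step and ignores KV-cache traffic and kernel overheads. The function name `decode_ceiling` is made up for illustration.

```python
def decode_ceiling(params_billion: float, bytes_per_param: float,
                   bandwidth_tb_s: float) -> tuple[float, float]:
    """Return (ms per decode step, max tokens/s) if weight reads were the only cost."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds_per_step = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds_per_step * 1e3, 1.0 / seconds_per_step


# 7B parameters, FP16 (2 bytes each), ~2 TB/s of HBM bandwidth
ms, tps = decode_ceiling(params_billion=7, bytes_per_param=2, bandwidth_tb_s=2.0)
print(f"{ms:.1f} ms per step, at most {tps:.0f} tokens/s at batch size 1")
```

Halving the bytes per parameter (FP8) or doubling the bandwidth doubles the ceiling, which is why quantization and newer GPUs help but do not change the shape of the problem.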

The community has attacked this problem from many angles: speculative decoding (using a small draft model to propose tokens verified by the large model), quantization (FP8, INT4 to shrink weight footprint), and FlashAttention (an IO-aware attention kernel that cuts HBM traffic during the attention computation). These are all incremental improvements on the same fundamental loop.

NVIDIA's Nemotron-Labs Diffusion — released on HuggingFace on May 23, 2026 — is taking a fundamentally different approach. Instead of optimizing the autoregressive loop, it breaks the loop entirely.


2. What Are Diffusion Language Models?

If you've worked with image generation models (Stable Diffusion, DALL·E, Flux), you already know the concept of denoising diffusion. The idea is to start with pure noise and iteratively denoise it, guided by a conditioning signal, until you arrive at a coherent output.

Diffusion Language Models (DLMs) apply this same paradigm to text. Instead of generating tokens left-to-right, a DLM:

  1. Starts with a sequence of masked or noisy tokens (analogous to Gaussian noise in image diffusion)
  2. Runs multiple denoising refinement steps, predicting the clean token distribution at each step
  3. After several iterations, the entire sequence — or a large block of it — converges to the final output

Autoregressive vs Diffusion Language Model Architecture

The key theoretical advantage is parallelism. In a standard AR model, token t can only be generated after token t-1 exists. In a DLM, all positions in a block are refined simultaneously in each forward pass. This changes the computational profile dramatically: instead of being memory-bandwidth-bound by sequential weight loads, the GPU can be kept busy with dense matrix multiplications across the full block.

The conceptual roots of DLMs trace back to Masked Diffusion Language Models (MDLMs) — work like MDLM (Sahoo et al., 2024) and SEDD (Lou et al., 2023) — that framed text generation as a discrete denoising process over masked token sequences. However, these models had significant practical shortcomings when compared to the state-of-the-art AR models of the day. NVIDIA's work specifically addresses why, and more importantly, how to fix it.


3. Why DLMs Struggled — Until Now

The community has known about the theoretical appeal of diffusion language models for years. The reason they haven't taken over is a cluster of practical barriers that made them non-competitive with AR models in production:

1. Accuracy Gap: DLMs trained from scratch consistently underperformed comparably-sized AR models on standard benchmarks. The discrete, iterative denoising process is harder to optimize than the clean causal language modeling objective. Models like Dream 7B were impressive for DLMs, but still lagged behind Qwen3 4B — a smaller AR model — on reasoning and knowledge tasks.

2. Training Instability: Jointly learning to denoise across many noise levels with a bidirectional attention mask creates a different gradient landscape than causal language modeling. Loss curves are noisier, and the model is more sensitive to hyperparameter choices.

3. No KV Cache Compatibility: This was the killer for inference efficiency. KV caching — where you store key/value activations from previous tokens to avoid recomputing them — is the single most important optimization for AR inference. Standard DLMs use fully bidirectional attention across the entire sequence, which means you can't cache anything: every refinement step needs to attend over all positions with the updated token states. This essentially erased the theoretical throughput advantage.

4. Fill-in-the-Middle Mismatch: During DLM training, tokens are masked uniformly at random across the sequence. But at inference time, the model typically has a left-side prefix (the prompt) that is fully unmasked, and must fill in the right side. This creates a training-test distribution mismatch that degrades quality.

Each of these problems has a specific technical solution in NVIDIA's Efficient-DLM framework. Let's dig in.


4. NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM

The foundational insight behind Nemotron-Labs Diffusion (and the academic paper it builds on, arXiv:2512.14067) is deceptively simple: don't train DLMs from scratch — convert pretrained AR models into DLMs.

This avoids the accuracy gap problem entirely. You start with a model that already has world-class knowledge and reasoning capabilities baked into its weights, then teach it to also generate diffusion-style. The result is a model that retains AR accuracy while gaining diffusion parallelism.

But there are two critical technical challenges to solve for this conversion to work.

4.1 Block-Wise Attention: Preserving Weights, Enabling KV Caching

The attention mechanism is the crux of the problem. A standard AR model uses causal (lower-triangular) attention — each token attends only to itself and all previous tokens. A standard DLM uses bidirectional (full) attention — every token attends to every other token.

The issue: if you convert an AR model and suddenly change to fully bidirectional attention, you've broken the statistical assumptions baked into all those attention weights during pretraining. The key-value projections were trained to operate in a causal setting; they "expect" not to see future tokens. Loading them into a fully bidirectional context produces degraded output and requires extensive retraining to recover.

Efficient-DLM introduces block-wise causal attention as the solution:

  • The sequence is divided into non-overlapping blocks of size B (e.g., 32 tokens)
  • Within each block: full bidirectional attention (every token attends to every other token in the block)
  • Across blocks: standard left-to-right causal attention (block i can attend to blocks 0 through i-1)

Block-wise Attention Pattern with KV Caching

This hybrid pattern does something clever: it's structurally similar enough to causal attention that pretrained weight distributions are preserved — the model only needs to learn bidirectionality locally within blocks, not globally across the whole sequence. The result is a much smoother conversion that requires far less compute to recover quality.

Crucially, this also re-enables KV caching. Since attention is still causal across blocks, the KV activations of completed (committed) blocks can be cached and reused exactly like in a standard AR model. Only the current block being refined needs to be recomputed each refinement step.

4.2 Position-Dependent Token Masking

The second innovation addresses the training-test distribution mismatch. Instead of masking tokens uniformly at random during training, Efficient-DLM uses a position-dependent masking strategy that assigns higher masking probabilities to tokens in later positions in the sequence.

The intuition: at inference time, when filling in a response to a prompt, earlier tokens in the response have already been decided (or are more constrained by the left-side context), while later tokens remain more uncertain. By skewing the training mask distribution to match this pattern, the model learns a denoising objective that better mirrors what it actually faces at test time.
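To make the idea concrete, here is a minimal sketch of position-dependent masking. The article does not give the actual schedule used by Efficient-DLM, so the linear ramp, the endpoints `p_start` and `p_end`, and both function names are assumptions chosen purely for illustration.

```python
import random


def mask_probabilities(length: int, p_start: float = 0.2, p_end: float = 0.9) -> list[float]:
    """Per-position masking probability, rising linearly from p_start to p_end.

    The linear ramp is an illustrative assumption, not the published schedule.
    """
    if length == 1:
        return [p_end]
    step = (p_end - p_start) / (length - 1)
    return [p_start + i * step for i in range(length)]


def sample_mask(length: int, rng: random.Random) -> list[bool]:
    """True means the token at that position is replaced by a mask token."""
    return [rng.random() < p for p in mask_probabilities(length)]


probs = mask_probabilities(8)
mask = sample_mask(8, random.Random(0))
print([round(p, 2) for p in probs])  # later positions get higher probabilities
print(mask)
```

Under uniform masking every position would share one probability; here the tail of the sequence is masked more often, which is closer to the prefix-then-fill pattern seen at inference time.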

4.3 Joint AR + Diffusion Training Objective

Rather than optimizing purely for the diffusion objective, Nemotron-Labs Diffusion is trained with a joint AR and diffusion loss:

L_total = λ · L_AR + (1 - λ) · L_diffusion

Where L_AR is the standard cross-entropy causal language modeling loss and L_diffusion is the masked diffusion objective. This joint training ensures the model remains a first-class AR model while learning the diffusion generation capability.
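A toy version of the combined objective, written in plain Python so the weighting is easy to follow. This is a simplified sketch: the value of λ is not stated in the article, and real masked-diffusion losses usually add a noise-level-dependent weight that is omitted here. The function names and the toy probabilities are illustrative.

```python
import math


def cross_entropy(probs: list[list[float]], targets: list[int], positions: list[int]) -> float:
    """Mean negative log-likelihood of the target tokens at the given positions."""
    return -sum(math.log(probs[i][targets[i]]) for i in positions) / len(positions)


def joint_loss(ar_probs, diff_probs, targets, masked_positions, lam: float = 0.5) -> float:
    """L_total = lam * L_AR + (1 - lam) * L_diffusion (lam = 0.5 is an assumption)."""
    l_ar = cross_entropy(ar_probs, targets, list(range(len(targets))))  # every position
    l_diff = cross_entropy(diff_probs, targets, masked_positions)       # masked positions only
    return lam * l_ar + (1 - lam) * l_diff


# Three positions, vocabulary of two tokens
targets = [0, 1, 0]
ar_probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]      # causal pass predictions
diff_probs = [[0.5, 0.5], [0.25, 0.75], [0.5, 0.5]]  # denoising pass predictions
print(joint_loss(ar_probs, diff_probs, targets, masked_positions=[1]))
```

Setting `lam=1.0` recovers plain causal language modeling, and `lam=0.0` gives a pure masked-denoising objective.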

The pretrained base was trained on 1.3 trillion tokens from NVIDIA's Nemotron pretraining datasets, with an additional 45 billion tokens of supervised fine-tuning data for the instruct-tuned variants.


5. Nemotron-Labs Diffusion: The Model Family

NVIDIA released seven model checkpoints on HuggingFace under the NVIDIA Nemotron Open Model License (commercially friendly for text models):

| Model | Parameters | Type | Downloads (Day 1) |
| --- | --- | --- | --- |
| nvidia/Nemotron-Labs-Diffusion-3B | ~4B | Text, Instruct | 14.7K |
| nvidia/Nemotron-Labs-Diffusion-3B-Base | ~4B | Text, Base | 14.2K |
| nvidia/Nemotron-Labs-Diffusion-8B | 8B | Text, Instruct | 24.1K |
| nvidia/Nemotron-Labs-Diffusion-8B-Base | 8B | Text, Base | 228K |
| nvidia/Nemotron-Labs-Diffusion-14B | 14B | Text, Instruct | 3.28K |
| nvidia/Nemotron-Labs-Diffusion-14B-Base | 14B | Text, Base | 1.18K |
| nvidia/Nemotron-Labs-Diffusion-VLM-8B | ~9B | Vision-Language | 590 |

The 8B Base model is the most downloaded (228K in under 2 days), which suggests developer interest in using it as a foundation for custom fine-tuning.


6. Three Generation Modes: AR, Diffusion, Self-Speculation

The standout design decision in Nemotron-Labs Diffusion is that all three generation modes are supported from a single checkpoint. You don't need different models — just a different deployment config in SGLang.

Three Generation Modes Performance Comparison

Mode 1: Autoregressive (ar_mode=true)

Standard left-to-right token generation, identical to how you'd run any other causal LM. This mode is the correctness baseline — most useful for debugging, A/B testing against existing pipelines, or when you need strict adherence to specific decoding behaviors.

Use when: Debugging, regression testing, or exact reproduction of AR outputs.

Mode 2: Diffusion / FastDiffuser (diffusion_mode=true)

The model fills in a block of 32 tokens simultaneously, running multiple denoising refinement steps per block. A confidence threshold determines which tokens are "committed" after each refinement pass — tokens whose predicted distribution is peaked enough get locked in, reducing the number of positions that need further refinement.

The process per block:

  1. Initialize block positions with mask tokens
  2. Forward pass with block-wise attention → predict token distributions over all positions
  3. Commit tokens above confidence threshold; keep others masked
  4. Repeat steps 2–3 until all positions are committed or max steps reached
  5. Move to next block, using committed block tokens in KV cache
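The per-block loop above can be sketched with a stand-in predictor. Everything here is a toy: `toy_predict` replaces the real forward pass with a hand-written confidence rule, and the fallback that commits the single most confident token when nothing clears the threshold is an assumption added so the loop always makes progress.

```python
MASK = None  # placeholder for a mask token


def toy_predict(block: list, target: list[str]) -> list[tuple[str, float]]:
    """Stand-in for a forward pass: returns (token, confidence) per position.

    Confidence rises as more of the block is committed (hand-written rule).
    """
    committed = sum(tok is not MASK for tok in block)
    return [(tok, min(1.0, 0.55 + 0.2 * committed + 0.1 * (len(block) - i)))
            for i, tok in enumerate(target)]


def decode_block(target: list[str], threshold: float = 0.9, max_steps: int = 16):
    """Refine one block until every position is committed or max_steps is hit."""
    block = [MASK] * len(target)               # step 1: start fully masked
    passes = 0
    while MASK in block and passes < max_steps:
        preds = toy_predict(block, target)     # step 2: one forward pass
        passes += 1
        open_positions = [i for i, tok in enumerate(block) if tok is MASK]
        confident = [i for i in open_positions if preds[i][1] >= threshold]
        if not confident:                      # assumption: always commit at least one
            confident = [max(open_positions, key=lambda i: preds[i][1])]
        for i in confident:                    # step 3: commit confident tokens
            block[i] = preds[i][0]
    return block, passes


block, passes = decode_block(["the", "cat", "sat", "down"])
print(block, f"{len(block) / passes:.2f} tokens per forward pass")
```

Lowering `threshold` commits more tokens per pass and raises TPF; with a real model that comes at the cost of committing less certain predictions.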

In the published benchmarks, this mode produces 2.6× as many tokens per forward pass (TPF) as AR decoding.

Use when: High-throughput batch serving where speed matters more than exact AR equivalence.

Mode 3: Self-Speculation / LinearSpec (self_speculation=true)

This is the most sophisticated mode — it fuses diffusion and autoregressive decoding into a single hybrid loop:

  1. The model uses diffusion to draft a full block of k candidate tokens bidirectionally (fast, parallel)
  2. It then uses autoregressive decoding to verify the draft tokens causally from left to right
  3. Any prefix of the draft that matches what AR would have produced gets committed
  4. The process restarts from the first disagreement position

The same model plays both roles (drafter and verifier). Output is lossless vs AR at temperature=0.
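The acceptance rule at the heart of this loop fits in a few lines. This sketch follows the usual speculative-decoding convention, which is an assumption here: `verified[i]` is the token greedy AR decoding would emit at position i given the draft up to i, and the verifier's own token is kept at the first disagreement. The function name is illustrative.

```python
def accept_draft(draft: list[str], verified: list[str]) -> list[str]:
    """Keep the longest draft prefix the verifier agrees with.

    At the first mismatch the verifier's token is kept instead of the draft's,
    so every round commits at least one token that AR decoding would produce.
    """
    accepted = []
    for drafted, checked in zip(draft, verified):
        accepted.append(checked)
        if drafted != checked:
            break  # everything after this position is discarded and redrafted
    return accepted


print(accept_draft(["a", "b", "c", "d"], ["a", "b", "x", "y"]))
print(accept_draft(["a", "b", "c", "d"], ["a", "b", "c", "d"]))
```

Because only tokens that match the verifier are kept, the committed sequence equals what greedy AR decoding would have produced, which is the sense in which the mode is lossless at temperature 0.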

Key numbers: LinearSpec achieves ~6× higher TPF than AR, and ~865 tokens/second on NVIDIA B200 hardware — roughly 4× the AR baseline on identical hardware.

Use when: Production serving where you need maximum speed with no quality compromise.


7. Performance Deep Dive: The Numbers That Matter

Accuracy vs Qwen3 8B:
Nemotron-Labs Diffusion 8B achieves +1.2% higher average accuracy compared to Qwen3 8B across evaluated benchmarks. The DLM conversion doesn't hurt quality — it slightly improves it, likely because the joint AR+diffusion training objective acts as an additional regularizer.

vs Dream 7B (prior DLM SOTA):
Efficient-DLM 8B achieves +5.4% higher accuracy and 4.5× higher throughput compared to Dream 7B — a decisive improvement over the previous DLM state-of-the-art.

Throughput (Tokens Per Forward Pass — TPF):

| Mode | TPF (relative to AR) | Quality vs AR |
| --- | --- | --- |
| Autoregressive | 1× (baseline) | Exact match |
| Diffusion (FastDiffuser) | 2.6× | Slightly different |
| Self-Spec Linear (LinearSpec) | ~6× | Lossless at T=0 |
| Self-Spec Quadratic (QuadSpec) | ~6.4× | Lossless at T=0 |

TPF (Tokens Per Forward Pass) is a hardware-agnostic efficiency metric — it measures how many output tokens you get per model forward pass, making it useful for comparing across different GPU generations.
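TPF and wall-clock speedup are related but not identical, because a forward pass over a whole block costs more than a single-token pass. The sketch below computes TPF from raw counts and backs out the implied per-pass cost from the LinearSpec figures quoted earlier (~6× TPF, roughly 4× wall-clock). The run with 85 passes is a made-up example, and the function names are illustrative.

```python
def tokens_per_forward_pass(tokens_generated: int, forward_passes: int) -> float:
    """TPF: output tokens divided by the number of model forward passes."""
    return tokens_generated / forward_passes


def implied_pass_cost_ratio(tpf_gain: float, wall_clock_gain: float) -> float:
    """How much slower each pass must be for a TPF gain to yield this wall-clock gain."""
    return tpf_gain / wall_clock_gain


print(tokens_per_forward_pass(512, 512))   # AR: one token per pass
print(tokens_per_forward_pass(512, 85))    # hypothetical run that needed 85 passes
print(implied_pass_cost_ratio(6.0, 4.0))   # each pass about 1.5x as expensive
```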


8. Under the Hood: Block-Wise Attention & KV Caching

Let's look at exactly how the block-wise attention mechanism enables KV caching in a DLM setting.

In standard AR decoding, the KV cache stores the key and value projections for every previously generated token. When generating token t, the model attends to cached KV from tokens 0...(t-1) and computes new Q, K, V for position t only — O(1) cache update per step.

In a standard bidirectional DLM, this is impossible: since every token attends to every other token, and token values change with each refinement step, you'd need to recompute keys and values for all n positions at every step and rerun full O(n²) attention over them, with no caching benefit.

Block-wise causal attention resolves this with a two-level hierarchy:

Sequence: [Block 0 | Block 1 | Block 2 | ... | Block N]

For a token in Block i:
  - Attends to ALL tokens in blocks 0...(i-1)  → cached KV (never recomputed)
  - Attends to ALL tokens within Block i        → bidirectional, recomputed each step
  - CANNOT attend to tokens in blocks (i+1)+   → causal constraint maintained

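For a 32-token block size, the cached share grows as generation proceeds: while the final block of a 2048-token sequence is being refined, the keys and values for 2016 of the 2048 positions (98.4%) come from the cache, and only the 32 positions in the active block are recomputed at each refinement step.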
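The cached share depends on which block is currently being refined. A short sketch, assuming all earlier blocks are committed and cached and only the active block is recomputed; the function name is illustrative.

```python
def cached_fraction(block_index: int, block_size: int) -> float:
    """Share of attended positions served from the KV cache while refining
    block `block_index` (0-based): past blocks are cached, the active one is not."""
    cached = block_index * block_size
    attended = cached + block_size
    return cached / attended


for idx in (0, 1, 7, 63):
    print(f"block {idx:2d}: {cached_fraction(idx, 32):.1%} cached")
```

Block 63 is the last block of a 2048-token sequence with 32-token blocks, so it gives the 98.4% figure; the first block has nothing cached at all.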

Here's how to build the attention mask in PyTorch:

import torch

def build_block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Build a block-wise causal attention mask.

    Within each block: full bidirectional attention (True)
    Across blocks: causal left-to-right attention (True only for past blocks)
    Future blocks: masked out (False → -inf in softmax)

    Returns a boolean mask of shape [seq_len, seq_len],
    where True = can attend, False = masked.
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    # Ceiling division so a trailing partial block is not silently dropped
    num_blocks = (seq_len + block_size - 1) // block_size

    for block_i in range(num_blocks):
        q_start = block_i * block_size
        q_end   = min(q_start + block_size, seq_len)

        # Queries in this block see every past block (causal across blocks)
        # and every position in their own block (bidirectional within block)
        mask[q_start:q_end, :q_end] = True

    return mask


# Example: 4 blocks of 4 tokens each = 16 token sequence
mask = build_block_causal_mask(seq_len=16, block_size=4)
print(mask.int())

# Output (each row = query token, each col = key token):
# Block 0 rows: [1111 | 0000 | 0000 | 0000]
# Block 1 rows: [1111 | 1111 | 0000 | 0000]
# Block 2 rows: [1111 | 1111 | 1111 | 0000]
# Block 3 rows: [1111 | 1111 | 1111 | 1111]

The resulting mask has fully-connected 4×4 diagonal blocks (bidirectional within blocks) with a lower-triangular structure across block boundaries (causal across blocks). It's the AR causal mask, coarsened to block granularity — which is precisely why pretrained AR weight distributions are preserved.
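The same pattern can be stated as a one-line rule, which is handy for checking a mask or for kernels that compute visibility on the fly: a query may attend to a key whenever the key's block index is not greater than the query's. The sketch below uses plain Python (no PyTorch) and illustrative function names.

```python
def can_attend(q: int, k: int, block_size: int) -> bool:
    """Query position q may attend to key position k iff k's block is not after q's."""
    return k // block_size <= q // block_size


def block_causal_rows(seq_len: int, block_size: int) -> list[str]:
    """Render the mask as rows of 0/1 characters, one row per query position."""
    return ["".join("1" if can_attend(q, k, block_size) else "0" for k in range(seq_len))
            for q in range(seq_len)]


# 6 tokens with block size 4: one full block plus a partial block of 2
for row in block_causal_rows(seq_len=6, block_size=4):
    print(row)
```

With `block_size=1` the rule reduces to the ordinary lower-triangular causal mask, and with `block_size >= seq_len` it becomes fully bidirectional, so the block size interpolates between the two regimes.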


9. Getting Started: Running with SGLang

SGLang is the recommended serving framework for Nemotron-Labs Diffusion, with integration via PR #25803 (merging into main imminently). Here's a complete working example.

9.1 Installation

# Install SGLang with DLM support
pip install "sglang[all]>=0.4.5" --extra-index-url https://flashinfer.ai/whl/cu124/torch2.5/

# If the PR hasn't merged to main yet, install from the DLM branch directly:
# git clone https://github.com/sgl-project/sglang.git
# cd sglang && git fetch origin pull/25803/head:dlm-support
# git checkout dlm-support && pip install -e ".[all]"

# Pull the model weights
pip install huggingface-hub
huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B \
  --local-dir ./models/Nemotron-Labs-Diffusion-8B

9.2 Serving: Launch the SGLang Server

# Mode 1 — Autoregressive (standard baseline)
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm ar_mode

# Mode 2 — Diffusion (FastDiffuser): highest raw throughput
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm diffusion \
  --block-size 32 \
  --confidence-threshold 0.9

# Mode 3 — Self-Speculation (LinearSpec): lossless 6x speedup
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm linear_spec \
  --draft-block-size 32

9.3 Inference: Python Client (OpenAI-Compatible API)

import openai
import time

# SGLang exposes an OpenAI-compatible API endpoint
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require auth by default
)

PROMPT = """You are an expert in distributed systems.
Explain the CAP theorem and its practical implications for a microservices
architecture. Be specific with concrete trade-off examples."""

def benchmark_mode(label: str, mode_hint: str = ""):
    """Run a generation and measure wall-clock tokens/second."""
    start = time.perf_counter()

    response = client.chat.completions.create(
        model="nvidia/Nemotron-Labs-Diffusion-8B",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
        temperature=0,        # T=0 → LinearSpec is lossless vs AR
        extra_body={
            "mode": mode_hint  # "ar", "diffusion", or "linear_spec"
        } if mode_hint else {}
    )

    elapsed = time.perf_counter() - start
    tokens  = response.usage.completion_tokens
    tps     = tokens / elapsed

    print(f"\n{'='*60}")
    print(f"Mode        : {label}")
    print(f"Output      : {response.choices[0].message.content[:200]}...")
    print(f"Tokens      : {tokens}")
    print(f"Time (s)    : {elapsed:.2f}")
    print(f"Throughput  : {tps:.1f} tok/s")
    print(f"{'='*60}")
    return tps

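# Compare all three modes. This assumes the running server honors the
# per-request mode hint; if the mode is fixed at launch (as in 9.2),
# restart the server in each mode and run one call at a time.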
ar_tps   = benchmark_mode("Autoregressive",           mode_hint="ar")
diff_tps = benchmark_mode("Diffusion (FastDiffuser)", mode_hint="diffusion")
spec_tps = benchmark_mode("Self-Spec (LinearSpec)",   mode_hint="linear_spec")

print(f"\n📊 Speedup Summary:")
print(f"  Diffusion vs AR   : {diff_tps/ar_tps:.2f}×")
print(f"  LinearSpec vs AR  : {spec_tps/ar_tps:.2f}×")

9.4 Quick Start via HuggingFace Transformers (AR Mode)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/Nemotron-Labs-Diffusion-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user",   "content": "Explain masked diffusion in 3 sentences."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True
    )

response = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)

Note: The transformers path gives AR mode only. For diffusion and self-speculation modes, the SGLang integration is required as it implements the custom decoding loop.


10. What This Means for Production LLM Infrastructure

Latency vs Throughput Trade-off, Revisited

The classic LLM serving dilemma is that throughput optimizations (larger batch sizes, continuous batching) increase latency, and latency optimizations (small batches, low KV cache pressure) hurt throughput. Self-speculation in DLMs partially decouples this: at batch size 1, LinearSpec is reported to deliver roughly 4× the tokens per second of AR on the same hardware (about 6× the tokens per forward pass). This is the scenario where AR models are most inefficient, and where DLMs provide the biggest relative gain.

Cost Implications

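If a 4× throughput improvement at batch size 1 carried over to your serving workload, you could serve the same number of users with 1/4 the GPU compute, or equivalently serve 4× more users from the same GPU fleet. Expect the gain to shrink at larger batch sizes, where decoding is less memory-bound to begin with. At current B200/H100 pricing of $4–8/hour, even a fraction of that is a meaningful cost reduction for any team running a production LLM API.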
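For a rough feel of the numbers, the sketch below converts an hourly GPU price into a cost per million generated tokens. It assumes a single stream at batch size 1, uses the ~865 tokens/second LinearSpec figure and the roughly 4× ratio quoted earlier to back out an AR baseline, and ignores prefill, idle time, and batching. The function name is illustrative.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a steady decode rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


spec_tps = 865.0        # LinearSpec on B200, as quoted earlier
ar_tps = spec_tps / 4   # implied AR baseline from the "roughly 4x" figure

for price in (4.0, 8.0):
    ar_cost = cost_per_million_tokens(price, ar_tps)
    spec_cost = cost_per_million_tokens(price, spec_tps)
    print(f"${price:.0f}/h: AR ${ar_cost:.2f} vs LinearSpec ${spec_cost:.2f} per 1M tokens")
```

Real serving costs depend heavily on batch size and utilization, so treat this as a single-stream illustration only.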

Fill-in-the-Middle and Code Editing

DLMs have a natural advantage for fill-in-the-middle (FIM) tasks. AR models handle FIM awkwardly, requiring special training and prompt formatting to look "backwards" at the suffix. A DLM generating a block bidirectionally can natively condition on both prefix and suffix context within the block — making Nemotron-Labs Diffusion well-suited for code editing agents and inline completions.

Inference Budget Control

In diffusion mode, you can control the number of denoising steps as a runtime knob. Fewer steps = faster but potentially lower quality. More steps = slower but higher quality. This gives you a continuous quality-speed trade-off at inference time without retraining — something AR models simply can't offer. A production system could dynamically reduce diffusion steps during traffic spikes and increase them during low-load periods.
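One way such a policy could look is a simple mapping from current load to a per-block step cap. This is a hypothetical sketch: the function name, the load signal, and the 4 to 16 step range are all assumptions, and how the cap is passed to the serving framework is left out because the article does not name a setting for it.

```python
def max_denoising_steps(load: float, min_steps: int = 4, max_steps: int = 16) -> int:
    """Map server load in [0, 1] to a per-block step cap: busier means fewer steps."""
    load = min(max(load, 0.0), 1.0)  # clamp out-of-range readings
    return round(max_steps - load * (max_steps - min_steps))


for load in (0.0, 0.5, 1.0):
    print(f"load {load:.1f} -> up to {max_denoising_steps(load)} steps per block")
```

A production version would likely smooth the load signal and validate quality at each step budget before exposing the knob.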

When to Stick with AR

For long-context tasks (100K+ tokens) where the KV cache dominates memory, the efficiency story is less clear-cut. For streaming output where users see tokens as they're generated, block-wise generation may feel less smooth without careful rendering logic. And for tasks requiring strict constrained decoding (grammar-constrained generation, beam search), the diffusion loop needs further tooling work.


11. Conclusion & The Road Ahead

Diffusion Language Models have been a promising idea for years, perennially held back by a cluster of practical barriers: accuracy gaps, training instability, and the loss of KV caching. NVIDIA's Efficient-DLM work and Nemotron-Labs Diffusion have systematically addressed each of these barriers with concrete, principled solutions — block-wise causal attention, position-dependent masking, and joint AR+diffusion training objectives.

The result is a model family that is simultaneously:

  • A first-class AR model (backward compatible, lossless in LinearSpec mode)
  • A faster inference engine, at 2.6–6.4× the tokens per forward pass of AR (depending on mode)
  • A natural fit for fill-in-the-middle within a block, by architectural design
  • A tunable quality-speed dial at deployment time, with no retraining needed

With 24K+ downloads in the first 24 hours and SGLang integration landing imminently, this is one of the most practically significant open-source releases in the LLM inference space in 2026.

The next frontier: applying the same AR-to-DLM conversion recipe to frontier-scale models (70B+), exploring multimodal DLMs beyond the 8B VLM preview, and building out constrained decoding, streaming token rendering, and fine-tuning tooling for the DLM objective.

If you're building LLM-powered applications and care about inference cost and latency, it's time to start experimenting with Nemotron-Labs Diffusion. The autoregressive loop had a good run — but the next chapter of language model inference looks decidedly more parallel.




Written on May 24, 2026 — based on the HuggingFace blog post and arXiv:2512.14067 (Efficient-DLM). Performance numbers reflect published benchmarks; verify against your specific hardware and workload.
