NVIDIA's Nemotron Diffusion: One Model, Three Generation Modes, 6× Faster

#ai #machinelearning #llm #nvidia

NVIDIA just released Nemotron-Labs Diffusion: a family of open-weight language models (3B, 8B, 14B, plus an 8B VLM) that can run in three distinct generation modes from the same checkpoint — autoregressive, diffusion, or self-speculative — with no application-level changes required. The headline number: 6.4× higher token throughput versus standard autoregressive decoding, with accuracy that matches or beats Qwen3 8B on benchmarks.

"Autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model."

What actually changed

Autoregressive LLMs have a hard constraint: one token at a time, every token a full model pass. That's fine for quality but brutal for throughput at low batch sizes, where decoding is memory-bound: the GPU spends most of each pass moving weights through memory rather than computing.
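A rough back-of-the-envelope sketch shows why one-token-per-pass decoding caps throughput. It assumes each pass has to read every weight once and that nothing else matters (no KV cache traffic, no kernel overhead). The parameter count, weight precision, and bandwidth figure below are round illustrative numbers, not measurements of any Nemotron model or a spec for any particular GPU.

```python
def decode_tokens_per_second(param_count, bytes_per_param,
                             mem_bandwidth_bytes_per_s, tokens_per_pass=1.0):
    """Upper bound on batch-size-1 decode speed when weight loading dominates.

    Assumes every model pass streams all weights from memory exactly once.
    """
    bytes_per_pass = param_count * bytes_per_param
    passes_per_second = mem_bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_second * tokens_per_pass


# Illustrative assumptions: 8e9 parameters, 16-bit weights, 3.0e12 bytes/s.
ar_bound = decode_tokens_per_second(8e9, 2, 3.0e12)

# If each pass commits several tokens on average, the bound scales with it.
block_bound = decode_tokens_per_second(8e9, 2, 3.0e12, tokens_per_pass=4.0)

print(f"one token per pass:   {ar_bound:.1f} tok/s")
print(f"four tokens per pass: {block_bound:.1f} tok/s")
```

Under these assumptions the bound rises in direct proportion to the average number of tokens committed per pass, which is the lever that block-wise drafting pulls.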

Nemotron-Labs Diffusion breaks that constraint by adding parallel drafting on top of a pretrained AR model (rather than training a diffusion model from scratch). Three modes, switchable at deploy time:

Autoregressive — standard left-to-right decoding. Backward compatible with anything you run today.
Diffusion (FastDiffuser) — generates a 32-token block at a time, iteratively denoising until tokens hit a confidence threshold. Raw throughput gains here.
Self-speculation (LinearSpec / QuadraticSpec) — the model drafts a block bidirectionally using diffusion, then verifies it causally with AR. Lossless at temperature 0. Hits ~865 tok/s on an H100/B200 — roughly 4–6× the AR baseline on the same hardware.
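The confidence-threshold loop described for the diffusion mode can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_denoiser` returns random confidences instead of real predictions, the 0.7 threshold is arbitrary, and the fallback that commits the single most confident slot when nothing clears the threshold is an assumption made so the loop always terminates, not a documented detail of FastDiffuser.

```python
import random

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(block, rng):
    """Stand-in for one model pass: propose (token, confidence) per masked slot."""
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def denoise_block(block_size=32, threshold=0.7, seed=0, max_steps=100):
    """Fill a block by repeatedly committing proposals above the threshold."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block) and steps < max_steps:
        proposals = toy_denoiser(block, rng)
        accepted = {i: tok for i, (tok, conf) in proposals.items()
                    if conf >= threshold}
        if not accepted:
            # Assumed fallback: commit the most confident slot so that
            # every pass makes progress.
            best = max(proposals, key=lambda i: proposals[i][1])
            accepted = {best: proposals[best][0]}
        for i, tok in accepted.items():
            block[i] = tok
        steps += 1
    return block, steps


block, steps = denoise_block()
print(f"filled {len(block)} positions in {steps} passes")
```

The point of the sketch is the shape of the loop: the number of model passes depends on how quickly positions become confident, not on the block length, so a block can finish in far fewer passes than it has tokens.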

Models are available under the NVIDIA Nemotron Open Model License (commercially friendly). SGLang support is landing imminently via an open PR.

Why it matters

Most "fast inference" approaches force you to choose: a smaller model, a different model, or a speculative decoding setup with a separate draft model you have to maintain. Nemotron-Labs Diffusion sidesteps that choice by putting the drafter and the verifier in one checkpoint.

The deployment story is what makes this notable for practitioners. You swap inference modes by changing a single config line — same weights, same endpoint, same application code. That makes it much easier to tune the speed/accuracy tradeoff without rebuilding your stack.

The self-speculative mode is particularly interesting: it's essentially speculative decoding without the separate draft model. The AR verification pass means output quality is preserved at temperature 0, which is what you usually want in production.
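To make the "lossless at temperature 0" claim concrete, here is a minimal sketch of greedy draft-and-verify, assuming the standard rule from speculative decoding: keep the longest drafted prefix that matches what greedy AR decoding would have produced, then substitute the AR token at the first mismatch. The toy functions are hypothetical stand-ins, and the exact acceptance rules of LinearSpec and QuadraticSpec are not described in this post. A real model would verify the whole block in one parallel causal pass; the loop in `verify` only imitates that.

```python
def ar_next_token(prefix):
    """Toy greedy AR model: a deterministic next token for a given prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_block(prefix, block_size):
    """Toy drafter: agrees with the AR model except at periodic positions."""
    draft, ctx = [], list(prefix)
    for _ in range(block_size):
        tok = ar_next_token(ctx)
        if len(ctx) % 5 == 4:      # deliberately inject a wrong guess
            tok = (tok + 1) % 50
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify(prefix, draft):
    """Accept the matching prefix of the draft, then emit one AR token."""
    ctx = list(prefix)
    for tok in draft:
        target = ar_next_token(ctx)
        if tok != target:
            ctx.append(target)     # replace the first wrong guess
            return ctx[len(prefix):]
        ctx.append(tok)
    ctx.append(ar_next_token(ctx))  # whole draft accepted: one bonus token
    return ctx[len(prefix):]


def speculative_generate(prompt, n_tokens, block_size=8):
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(verify(out, draft_block(out, block_size)))
        passes += 1
    return out[:len(prompt) + n_tokens], passes


def greedy_generate(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(ar_next_token(out))
    return out


spec_out, passes = speculative_generate([1, 2, 3], 40)
assert spec_out == greedy_generate([1, 2, 3], 40)
print(f"identical output, {passes} verify passes instead of 40 AR steps")
```

Because every emitted token is either confirmed or produced by the AR rule, the output matches plain greedy decoding exactly; a bad draft costs speed, not correctness.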

Training approach is worth noting too: they started from a pretrained AR model and continued pretraining with a joint AR + diffusion objective on 1.3T tokens. Building on existing weights rather than training from scratch is a significant practical shortcut, and it preserves the AR capabilities rather than trading them away.

What to do

If you're evaluating inference infrastructure: Nemotron-Labs Diffusion 8B is a concrete candidate to benchmark against your current setup. The self-speculative mode's 4–6× throughput gain at batch size 1 is worth testing — that's where AR models leave the most performance on the table.
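A small timing harness is enough to start comparing modes at batch size 1. This sketch assumes you can wrap each setup in a function that takes a prompt and a token budget and returns the generated tokens; `fake_generate` is a hypothetical placeholder for that wrapper, not a real client for any serving stack.

```python
import time


def measure_throughput(generate_fn, prompt, max_new_tokens, warmup=1, runs=3):
    """Return the median tokens/second over several timed runs."""
    for _ in range(warmup):
        generate_fn(prompt, max_new_tokens)  # discard cold-start effects
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    rates.sort()
    return rates[len(rates) // 2]


def fake_generate(prompt, max_new_tokens):
    """Hypothetical stand-in: replace with a call to your own endpoint."""
    time.sleep(0.01)
    return list(range(max_new_tokens))


rate = measure_throughput(fake_generate, "hello", 64)
print(f"{rate:.0f} tok/s")
```

Run the same prompts and token budgets against each mode and against your current setup, and count the tokens actually returned rather than the number requested, since generation can stop early.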

If you're serving a latency-sensitive app: Watch the SGLang PR closely. Once it lands in main, you'll be able to use Nemotron as a faster drop-in replacement without touching your API layer.

If you're interested in the architecture: The technical report and training recipe on GitHub are both open. This is a practical implementation of diffusion LMs, not a research demo.

Source: NVIDIA Nemotron-Labs Diffusion on HuggingFace · Model collection

✏️ Drafted with KewBot (AI), edited and approved by Drew.