Prabhakar Chaudhary

Posted on Jun 19

Nemotron 3 Ultra: How NVIDIA Built a 550B Open Model That Runs Faster Than Its Smaller Rivals

#machinelearning #llm #opensource #ai

NVIDIA's Nemotron 3 Ultra, released on June 4, 2026, is a 550-billion-parameter open model that is reported to outrun several competing models, even though those competitors activate fewer parameters per token. The trick is a hybrid architecture that mixes Mamba state-space layers with standard Transformer attention, a combination that sidesteps the memory bottlenecks that typically make large models slow in long-context settings.

This post walks through what that architecture actually does, why it matters for agentic workloads, and what the training pipeline looks like.

The Core Problem: Attention Doesn't Scale Well to Long Contexts

Standard Transformer attention has quadratic complexity with respect to sequence length. Double the context, and the compute cost for attention quadruples. For agentic tasks — where a model might need to reason over a long conversation history, a large codebase, or many tool-call results — this becomes a real bottleneck.

One response is to replace some attention layers with state-space models (SSMs) like Mamba, which process sequences in linear time. The tradeoff is that SSMs are less precise at retrieving specific facts from long contexts. Nemotron 3 Ultra's hybrid design tries to get the best of both: Mamba layers handle the bulk of sequence processing at sub-quadratic cost, while a subset of full attention layers is retained for precise recall when it matters.
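
To make the two scaling behaviours concrete, here is a minimal sketch that counts only the dominant operations: one score per query-key pair for full attention, and one state update per token for a linear-time scan. It ignores heads, hidden size, and constant factors, and the function names are illustrative rather than taken from any library.

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention: grows as n^2."""
    return seq_len * seq_len


def linear_scan_steps(seq_len: int) -> int:
    """Sequential state updates in a linear-time scan: grows as n."""
    return seq_len


# Doubling the context quadruples the attention work but only doubles the scan.
for n in (8_000, 64_000, 512_000):
    attn_growth = attention_pair_count(2 * n) / attention_pair_count(n)
    scan_growth = linear_scan_steps(2 * n) / linear_scan_steps(n)
    print(f"{n:>7} -> {2 * n:>9} tokens: attention x{attn_growth:.0f}, scan x{scan_growth:.0f}")
```

The gap between the two growth rates is why keeping only a subset of attention layers matters most at the long end of the context range.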

The attention layers themselves are configured with 64 query heads but only 2 key-value heads, so each key-value head is shared by 32 query heads (a grouped-query layout). This keeps the KV cache small, a meaningful memory saving when you're running a 1-million-token context window.
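
A rough back-of-the-envelope sketch of that saving follows. Only the head counts (64 query, 2 key-value) and the 1-million-token context come from the post; the head dimension, the number of attention layers, and the 2-byte cache element are placeholder assumptions, so the absolute sizes are illustrative. The ratio between the two configurations does not depend on those placeholders.

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim=128, n_attn_layers=12, bytes_per_elem=2):
    """KV cache size for one sequence.

    head_dim, n_attn_layers and bytes_per_elem are assumed values for
    illustration; they are not taken from the model's published config.
    """
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


seq_len = 1_000_000
gqa = kv_cache_bytes(seq_len, n_kv_heads=2)    # 2 KV heads, as described in the post
mha = kv_cache_bytes(seq_len, n_kv_heads=64)   # hypothetical: one KV head per query head

print(f"2 KV heads:  {gqa / 2**30:.1f} GiB")
print(f"64 KV heads: {mha / 2**30:.1f} GiB")
print(f"reduction:   {mha // gqa}x")
```

Whatever the real layer count and head size turn out to be, sharing 2 key-value heads across 64 query heads cuts the cache by a factor of 32 relative to one key-value head per query head.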

LatentMoE: More Experts Without More Inference Cost

The model uses a Mixture-of-Experts (MoE) design with 512 total experts, of which 22 are activated per token. What makes this unusual is the "LatentMoE" routing mechanism: before tokens are routed to experts, they're projected into a compressed latent space. This lets NVIDIA pack in more specialized experts without proportionally increasing inference cost, since the routing decision happens in a lower-dimensional space.
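
The routing idea as described above can be sketched in a few lines. This is a toy under stated assumptions, not NVIDIA's implementation: the hidden and latent sizes, the random weights, and the plain top-k selection are all placeholders, and only the expert counts (512 total, 22 active) come from the post. The names `down_proj`, `router`, and `route` are hypothetical.

```python
import random

HIDDEN, LATENT = 32, 8        # illustrative sizes, not the model's real dimensions
N_EXPERTS, TOP_K = 512, 22    # expert counts quoted in the post

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


down_proj = rand_matrix(LATENT, HIDDEN)   # hidden -> compressed latent
router = rand_matrix(N_EXPERTS, LATENT)   # latent -> one score per expert


def route(token):
    """Project a token into the latent space, then pick the top-k experts there."""
    latent = matvec(down_proj, token)
    scores = matvec(router, latent)
    ranked = sorted(range(N_EXPERTS), key=lambda e: scores[e], reverse=True)
    return latent, ranked[:TOP_K]


token = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
latent, experts = route(token)
print(f"latent size: {len(latent)}, experts selected: {len(experts)}")
print(f"router weights: {N_EXPERTS * LATENT} (latent) vs {N_EXPERTS * HIDDEN} (full hidden)")
```

In this toy, the router needs `N_EXPERTS * LATENT` weights instead of `N_EXPERTS * HIDDEN`, which is the kind of saving a lower-dimensional routing space buys as the expert count grows.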

The result is a model with 550 billion total parameters but only 55 billion active per token — roughly a 10:1 ratio. That's why the inference throughput numbers are competitive despite the headline parameter count.

Multi-Token Prediction for Native Speculative Decoding

Nemotron 3 Ultra includes Multi-Token Prediction (MTP) heads that predict several future tokens in a single forward pass. During training, these heads share parameters with the main model. At inference time, they enable speculative decoding natively — the model proposes multiple tokens at once, which can then be verified in parallel, reducing the number of sequential forward passes needed.

This is different from the more common approach of using a separate, smaller draft model for speculative decoding. Having MTP built into the architecture means there's no need to maintain a separate model or tune the draft model separately.
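
The propose-then-verify loop can be illustrated with a toy. Everything here is a stand-in: `target_next` plays the main model's greedy choice, `draft_tokens` plays the MTP heads (with a deliberate mismatch injected now and then), and token IDs are small integers. A real system verifies all proposed positions in one batched forward pass; this sketch only counts how many such passes would be needed.

```python
def target_next(context):
    """Toy stand-in for the main model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_tokens(context, k):
    """Toy stand-in for MTP heads: usually agrees with the target, sometimes errs."""
    ctx, out = list(context), []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 5 == 0:          # inject an occasional wrong guess
            tok = (tok + 1) % 50
        out.append(tok)
        ctx.append(tok)
    return out


def speculative_step(context, k=4):
    """Accept the longest correct prefix of the proposal, plus one verifier token."""
    ctx, accepted = list(context), []
    for tok in draft_tokens(context, k):
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # verifier's correction replaces the bad guess
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all guesses accepted: one bonus token
    return accepted


def generate(prompt, n_tokens, k=4):
    context, passes = list(prompt), 0
    while len(context) - len(prompt) < n_tokens:
        context.extend(speculative_step(context, k))
        passes += 1                    # one verification pass per step
    return context[len(prompt):len(prompt) + n_tokens], passes


tokens, passes = generate([1, 2, 3], n_tokens=20)
print(f"generated {len(tokens)} tokens in {passes} verification passes")
```

The output is identical to plain greedy decoding with `target_next`; the only thing that changes is how many sequential passes it takes to get there.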

NVFP4 Training: 4-Bit Precision From the First Gradient Update

The model was trained using NVFP4, a 4-bit floating-point format (E2M1 datatype with two-dimensional block quantization on weights). NVIDIA describes this as one of the largest demonstrations of stable NVFP4 training to date. The deployed model runs at an average of 5.03 bits-per-element, mixing NVFP4, FP8, and BF16 layers depending on the layer's sensitivity.
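
For a feel of what E2M1 can represent, here is a simplified quantize/dequantize sketch. The eight magnitudes are the values an E2M1 element can encode (one sign bit, two exponent bits, one mantissa bit). The rest is an assumption for illustration: the block length of 16, the plain Python float used as the scale, and the one-dimensional blocking, which is simpler than the two-dimensional weight blocks mentioned above.

```python
# Magnitudes representable by an E2M1 element; the sign is stored separately.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed block length for this sketch


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then round to the grid."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for x in block:
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


block = [0.1 * i - 0.8 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}, worst-case error={worst:.4f}")
```

Because the grid is coarse near its top end (the step from 4 to 6), the per-block scale does most of the work in keeping error small, which is one reason sensitive layers are kept in FP8 or BF16.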

For deployment, the model supports W4A16 quantization on Hopper-generation hardware (H100/H200), which lacks native FP4 tensor cores, and can use native FP4 math on Blackwell (B200/GB200). This means the same model weights can be served efficiently across both hardware generations.

Post-Training: RL Across 55 Environments

Pre-training used specialized datasets including 173 billion tokens of GitHub code (up to September 2025), plus synthetic datasets for legal text, factual recall, and moral reasoning. Post-training combined supervised fine-tuning, reinforcement learning across 55 distinct environments, and Multi-teacher On-Policy Distillation (MOPD).

MOPD addresses a known problem with multi-environment RL: when you train across many different task types simultaneously, the learning signal from any one environment gets diluted. NVIDIA's solution was to distill knowledge from over ten domain-specialized teacher models into the student model, concentrating the signal from each domain.
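
A toy sketch of the multi-teacher idea follows. The post does not state the loss NVIDIA used, so the per-token reverse KL here is an assumption (it is a common choice for on-policy distillation), and the two teachers, their fixed logits, and the three-token vocabulary are purely hypothetical. The point is only that each sample is scored against the teacher for its own domain.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token's distribution (assumed loss)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical domain teachers: each maps a context to logits over a 3-token vocabulary.
TEACHERS = {
    "code": lambda context: [2.0, 0.5, -1.0],
    "math": lambda context: [-1.0, 0.5, 2.0],
}


def distill_loss(domain, student_logits_per_token, contexts):
    """Average per-token loss against the teacher that specialises in `domain`."""
    teacher = TEACHERS[domain]
    losses = [
        reverse_kl(student, teacher(context))
        for student, context in zip(student_logits_per_token, contexts)
    ]
    return sum(losses) / len(losses)


# Tokens the student sampled itself (on-policy), scored by two different teachers.
student_logits = [[1.8, 0.4, -0.9], [2.1, 0.6, -1.2]]
contexts = [None, None]  # placeholder contexts; the toy teachers ignore them
print(f"code teacher loss: {distill_loss('code', student_logits, contexts):.4f}")
print(f"math teacher loss: {distill_loss('math', student_logits, contexts):.4f}")
```

Routing each sample to its own domain teacher keeps the training signal specific to that domain instead of averaging it with unrelated tasks.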

Benchmark Performance

On inference throughput in 8K input / 64K output settings, Nemotron 3 Ultra is reported to be:

5.9x faster than GLM-5.1-754B-A40B
4.8x faster than Kimi-K2.6-1T-A32B
1.6x faster than Qwen-3.5-397B-A17B

On the RULER benchmark at 1-million-token context, it is reported to outperform other open LLMs. Accuracy on standard benchmarks is described as matching current state-of-the-art open models.

The model also supports three reasoning modes: "Reasoning-off," "Regular," and "Medium-effort." The medium-effort mode uses 2.5x fewer tokens than regular mode at the cost of roughly 7% accuracy — a useful knob for applications where inference cost matters more than peak accuracy.
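
As a quick worked example of the token side of that tradeoff, the sketch below applies the 2.5x figure to a hypothetical request. The 10,000-token starting point is made up for illustration, and the accuracy cost is left out because the post does not say whether the 7% is relative or in absolute points.

```python
def medium_effort_tokens(regular_tokens: float, reduction: float = 2.5) -> float:
    """Expected token count in medium-effort mode, given the regular-mode count."""
    return regular_tokens / reduction


regular = 10_000  # hypothetical number of tokens generated in regular mode
medium = medium_effort_tokens(regular)
saved = 1 - medium / regular
print(f"regular: {regular} tokens, medium-effort: {medium:.0f} tokens ({saved:.0%} fewer)")
```

A 2.5x reduction works out to 60% fewer generated tokens, which translates fairly directly into lower latency and cost for output-heavy agentic loops.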

Availability

Nemotron 3 Ultra is released under the OpenMDW-1.1 license, with weights, training data, and recipes publicly available. It can be accessed via Hugging Face, NVIDIA NIM, and OpenRouter. NVIDIA also released an Agent Toolkit alongside the model, including NemoClaw and OpenShell components for building agentic pipelines.

The technical report on arXiv covers the architecture and training in detail. The NVIDIA Research page has benchmark comparisons and deployment guidance.

What This Means in Practice

The throughput advantage comes from the architectural choices working together: Mamba layers reduce the per-step cost of processing long sequences, the small KV cache from the 2-key-value-head attention configuration reduces memory pressure, and MTP enables speculative decoding without a separate draft model. None of these is new individually, but combining them at 550B scale with stable NVFP4 training is a meaningful engineering result.

For developers building agentic systems that need to process long contexts — code repositories, document collections, extended tool-call histories — the 1M-token window and the throughput numbers make Nemotron 3 Ultra worth evaluating, particularly if you're already running on NVIDIA Blackwell hardware where native FP4 support gives an additional efficiency boost.
