
Jovan Chan

Posted on • Originally published at runaihome.com

NVIDIA Nemotron 3 Ultra for Local AI in 2026: 550B/55B-Active MoE, 1M Context, NVFP4 — Which Consumer GPU Can Actually Run It

This article was originally published on runaihome.com

TL;DR: Nemotron 3 Ultra is NVIDIA's June 4, 2026 open-weight flagship — a 550B-parameter Mixture-of-Experts model with 55B active per token, a 1M-token context window, and a native NVFP4 4-bit checkpoint. The catch: NVFP4 still weighs ~275GB, so it's a datacenter model, not a home-lab one. The right local move is to run the smaller Nemotron 3 family members — Nano 30B-A3B fits a single 24GB card — and reach for Ultra through the API.

| | Nemotron 3 Ultra (550B) | Nemotron 3 Nano (30B-A3B) | Ultra via API/cloud |
|---|---|---|---|
| Best for | Datacenter agents, 8×H100/H200 | Single-GPU home labs | Trying Ultra without the rig |
| Smallest footprint | ~275GB NVFP4 / 189GB 1-bit GGUF | ~20.7GB Q4_K_M | None (managed) |
| Runs on | 8× H100 80GB (640GB) min | RTX 3090 / 4090 (24GB) | Any device |
| Speed | ~40 tok/s (4× B200) to 300+ (cloud) | 30B-class tok/s on one card | ~140 tok/s output |
| The catch | No consumer GPU holds it | Not the 550B brain | Prompts leave your machine |

Honest take: Nemotron 3 Ultra is a genuinely strong open model — but for a home lab it's an API model, full stop. If you want NVIDIA's reasoning quality on hardware you own, run Nemotron 3 Nano 30B-A3B on a single 24GB GPU and call the Ultra endpoint only for the hard agentic runs.

What Nemotron 3 Ultra actually is

NVIDIA announced Nemotron 3 Ultra at Computex 2026 on June 1 and published the weights on June 4. It's the top of a three-model family — Nano (30B-A3B), Super (120B-A12B), and Ultra (550B-A55B) — and it's the first one NVIDIA positions as an open frontier model rather than a distillation target.

The headline numbers: 550 billion total parameters, 55 billion active per token, a 1M-token context window, and a license that's unusually generous for a model this size — OpenMDW-1.1, the Linux Foundation's open-weights license, which releases the weights, the training datasets (including 173 billion tokens of code), and the recipes. That's a real differentiator. Kimi K2.7 ships under a modified MIT license and GLM 5.2 under MIT, but neither publishes its training data the way NVIDIA does here.

Architecturally it's not a vanilla transformer MoE. Nemotron 3 Ultra uses a hybrid "LatentMoE" design: interleaved Mamba-2 state-space layers and MoE layers, with select attention layers, plus Multi-Token Prediction (MTP) heads with a shared-weight design. The MTP heads enable native speculative decoding — the model drafts its own next tokens — which is a big part of why NVIDIA can claim the throughput numbers it does. (If the phrase "speculative decoding" is new to you, we broke down why it matters in why local LLMs got good in mid-2026.)

The NVFP4 trick — and why it doesn't save you

The interesting engineering story is NVFP4. NVIDIA quantized the model to its 4-bit floating-point format for weights, activations, and gradients, keeping a few sensitive layers (latent projections, MTP heads, QKV/attention projections, embeddings) in BF16 or MXFP8 for stability. The clever part: the same NVFP4 checkpoint runs on Ampere, Hopper, and Blackwell GPUs thanks to specialized quantization kernels. One file, three architectures.

NVFP4 makes the model dramatically smaller than its BF16 form, but "smaller" is relative when you start at 550 billion parameters. The NVFP4 checkpoint is roughly 275GB. For comparison, the BF16 checkpoint lands around 1.1–1.7TB depending on configuration.
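As a back-of-the-envelope check on those sizes, weight storage is roughly parameter count times bits per weight. The sketch below is a simplification, not NVIDIA's accounting: `weight_gb` is a hypothetical helper, and it ignores quantization scale factors and the layers kept in BF16 or MXFP8, so real checkpoints land somewhat off these round numbers.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8."""
    return params_billion * bits_per_weight / 8


nvfp4_gb = weight_gb(550, 4)   # 4-bit weights
bf16_gb = weight_gb(550, 16)   # 16-bit weights

print(f"NVFP4: ~{nvfp4_gb:.0f} GB")         # NVFP4: ~275 GB
print(f"BF16:  ~{bf16_gb / 1000:.1f} TB")   # BF16:  ~1.1 TB
```

The 1.1TB result lines up with the low end of the BF16 range quoted above.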

275GB is the number that ends the home-lab dream. To put it in perspective:

  • A used RTX 3090 or RTX 4090 gives you 24GB of VRAM per card.
  • You'd need roughly 12× RTX 3090 just to hold the NVFP4 weights — before any KV cache for that 1M context.
  • NVIDIA's own recommended deployment is 8× H100 80GB (640GB total, comfortably above 275GB) or 8× H200 SXM5 (1,128GB total).
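The card counts in that list follow from simple division. A minimal sketch, assuming the ~275GB checkpoint size quoted in this article and counting weights only (no KV cache, activations, or framework overhead); `cards_needed` is a hypothetical helper, and the 141GB per-card H200 figure is inferred from the 1,128GB total:

```python
import math


def cards_needed(weights_gb: float, vram_per_card_gb: float) -> int:
    """Minimum number of cards whose combined VRAM covers the weights alone."""
    return math.ceil(weights_gb / vram_per_card_gb)


NVFP4_WEIGHTS_GB = 275  # checkpoint size quoted in the article

print(cards_needed(NVFP4_WEIGHTS_GB, 24))  # 24GB RTX 3090/4090 -> 12
print(8 * 80)    # 8x H100 80GB  -> 640 GB total
print(8 * 141)   # 8x H200 141GB -> 1128 GB total
```

Twelve cards is a floor, not a working configuration: any real deployment also needs room for the KV cache.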

No single consumer card, and no realistic stack of them, runs the full Ultra at a sane speed. This is the same wall we hit with Kimi K2.7 Code and GLM 5.2: the open-weights frontier has moved decisively past 24GB consumer hardware.

The GGUF / CPU-offload path (for the stubborn)

If you absolutely must run Ultra on hardware you own, the community route is Unsloth's dynamic GGUF quants run through llama.cpp with CPU offload. Here's the memory reality from Unsloth's own guide:

| Quant | Approx. memory needed | Notes |
|---|---|---|
| Dynamic 1-bit (UD) | ~189GB disk | Smallest; surprising accuracy retention |
| 3-bit (UD-IQ3_XXS) | ~256GB RAM | Unsloth's recommended balance |
| 4-bit | ~300GB RAM | |
| 8-bit | ~600GB RAM | |

So the cheapest "it technically runs" build is a workstation with 256GB of DDR5 running the 3-bit quant, paired with a 24GB GPU that takes the attention and active-expert layers. Because only 55B of the 550B parameters are active per token, the compute per token is closer to a 55B dense model than a 550B one, and that's what makes CPU offload even thinkable. But every token still has to pull its active experts' weights, tens of gigabytes at these quants, out of system RAM, so don't expect speed. NVIDIA's own llama.cpp reference shows around 40 tokens/second on 4× B200, which is datacenter Blackwell silicon. On a DDR5 CPU build you're realistically looking at single-digit-to-low-teens tok/s once prefill on long prompts is factored in, the same ballpark we measured for 1T-class MoE models on big-RAM rigs.
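The "closer to a 55B dense model" point can be turned into a rough ceiling on decode speed: if generation is limited by memory bandwidth, tokens per second cannot exceed bandwidth divided by the bytes of active weights read per token. This is a sketch under stated assumptions, not a benchmark. The bandwidth values are hypothetical placeholders, the 3 bits per weight is a nominal figure for the 3-bit quant, and the model ignores GPU-resident layers, caching, and speculative decoding.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float,
                         mem_bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token reads all active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return mem_bandwidth_gb_s / gb_read_per_token


# Hypothetical bandwidth figures for illustration; measure your own system.
for label, bandwidth in [("assumed 100 GB/s", 100.0), ("assumed 300 GB/s", 300.0)]:
    ceiling = decode_ceiling_tok_s(55, 3, bandwidth)
    print(f"{label}: <= {ceiling:.1f} tok/s")
```

With those assumed inputs the ceiling falls in the single-digit to low-teens range, which is consistent with the estimate above. Real throughput depends on how much of the model sits in VRAM.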

A 256GB DDR5 workstation is roughly a $3,500–$4,500 build in mid-2026, and that's before accounting for the DDR5 price surge that's still squeezing home builds (we tracked it in the DDR5/SSD price guide). Compared to the API, the math rarely favors building. Which brings us to the real recommendation.

What you should actually run at home

The good news is that Nemotron 3 is a family, and the two smaller members are built for exactly the hardware most readers have.

Nemotron 3 Nano 30B-A3B — the single-GPU pick

This is the one to run. The Nano is a 30B-total / 3B-active MoE that lands at ~20.7GB at Q4_K_M, which fits on any 24GB card — an RTX 3090, 3090 Ti, or 4090. NVIDIA reports the Nano hitting roughly 3.3× the throughput of Qwen3-30B-A3B on identical hardware (a single H200), and on a 24GB consumer card you can expect the same 30B-MoE-class speeds we've measured elsewhere — think the 100+ tok/s range that Nemotron-Cascade 2 hit on an RTX 3090. For everyday reasoning, coding, and agent loops, the Nano gives you NVIDIA's training quality without the 275GB problem.

```bash
# Pull and run the Nano locally via Ollama
ollama pull nemotron-3-nano
ollama run nemotron-3-nano "Refactor this function and explain the change."
```
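For agent loops you will usually call the model from code, not the CLI. Here is a minimal sketch against Ollama's local REST endpoint (`/api/generate` on the default port 11434), using only the standard library. The model tag is the one from the command above and is assumed to be available in your Ollama install; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "nemotron-3-nano") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "nemotron-3-nano") -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama server running and the model pulled, `generate("Refactor this function and explain the change.")` should return the completion as a single string. Nothing leaves your machine on this path.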

Nemotron 3 Super 120B-A12B — the multi-GPU step-up

The Super is a 120B-total / 12B-active MoE. At 4-bit it needs roughly 60–80GB, which puts it beyond a single 24GB card and into multi-GPU or workstation-card territory (think 2× 24GB GPUs with offload, an A100 80GB, or an RTX PRO 6000 96GB). It's the right pick if you've outgrown the Nano's quality but can't justify a datacenter node for Ultra.
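To see why the two-card option comes "with offload", compare the 4-bit weight size against each option's total VRAM. This is a rough sketch that counts weights only at a flat 4 bits per parameter (the low end of the 60–80GB range, with no KV cache or runtime overhead); `placement` is a hypothetical helper.

```python
SUPER_PARAMS_BILLION = 120
weights_gb = SUPER_PARAMS_BILLION * 4 / 8  # 4-bit weights only -> 60.0 GB

# VRAM totals for the hardware options named in the text.
options_gb = {
    "2x 24GB GPUs": 2 * 24,
    "A100 80GB": 80,
    "RTX PRO 6000 96GB": 96,
}


def placement(vram_gb: float, needed_gb: float) -> str:
    """Say whether the weights fit in VRAM or must spill to system RAM."""
    return "fits in VRAM" if vram_gb >= needed_gb else "needs CPU offload"


for name, vram_gb in options_gb.items():
    print(f"{name}: {vram_gb} GB -> {placement(vram_gb, weights_gb)}")
```

Two 24GB cards give 48GB, short of the ~60GB floor, so part of the model has to live in system RAM. The single 80GB and 96GB cards clear the floor, though the 80GB card has little margin at the top of the quoted range.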

Ultra — through the API or a rented GPU

For the actual 550B brain, use it the way it's meant to be used at this scale: managed. Nemotron 3 Ultra is available on Ollama's cloud (nemotron-3-ultra:cloud), through NVIDIA's NIM endpoints, and on third-party hosts. Artificial Analysis estimates a blended cost around $0.52 per million tokens with output speed near 140 tokens/second, and clocked 300+ output tok/s on a pre-release DeepInfra endpoint. If you need to own the inference for privacy reasons, renting an 8×H100 node by the hour on RunPod is far cheaper than buying one — the same rent-vs-buy logic from our [RunPod vs local GPU breakdown](/blog/runpod-vs-lo
