Nemotron-Cascade 2 for Local AI in 2026: 187 tok/s on RTX 3090 and What 30B Total / 3B Active Really Means for Your GPU

#nvidia #nemotron #localllm #gpu

This article was originally published on runaihome.com

TL;DR: Nemotron-Cascade 2 30B-A3B is NVIDIA's open MoE coding specialist: 3B active parameters per token means RTX 3090 generation speeds of 187 tok/s, while all 30B weights must fit in VRAM simultaneously. The 24GB floor is firm. 16GB cards technically run it, but at 10–11 tok/s it's unusable for interactive coding.

	RTX 3090 24GB (used)	RTX 4090 24GB	RTX 5090 32GB
Best for	Best-value entry	Top single-GPU speed	NVFP4 + 32GB headroom
Quantization	IQ4_XS (18.2 GB)	Q4_K_M (Ollama default)	NVFP4 or Q4_K_M
Generation speed	187 tok/s	~196 tok/s	229 tok/s
Street price (Jun 2026)	$480–$550 used	$1,500–$1,700	$1,999+

Honest take: If you own a 24GB GPU and write code, Nemotron-Cascade 2 is the right model to be running right now. It beats Qwen 3.6 35B-A3B on LiveCodeBench by 12 points and is faster. On a used RTX 3090, IQ4_XS gives you 187 tok/s for under $550.
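To put the value claim in numbers, here is a minimal sketch that divides each card's generation speed by its street price, using the figures from the table above. Collapsing each price range to its midpoint and treating "$1,999+" as exactly $1,999 are assumptions made for illustration, as is the `tok_s_per_100_usd` helper name.

```python
# Generation speed and street price copied from the comparison table.
# Assumption: price ranges are collapsed to their midpoint, and "$1,999+"
# is treated as exactly $1,999 (a lower bound that flatters the 5090).
cards = {
    "RTX 3090 (used)": {"tok_s": 187.0, "price": (480 + 550) / 2},
    "RTX 4090": {"tok_s": 196.0, "price": (1500 + 1700) / 2},
    "RTX 5090": {"tok_s": 229.0, "price": 1999.0},
}


def tok_s_per_100_usd(tok_s: float, price: float) -> float:
    """Generation throughput bought per 100 USD of GPU."""
    return tok_s / price * 100


value = {
    name: tok_s_per_100_usd(card["tok_s"], card["price"])
    for name, card in cards.items()
}

# Print the cards from best to worst value.
for name, v in sorted(value.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {v:5.1f} tok/s per $100")
```

On these figures the used RTX 3090 buys roughly three times the throughput per dollar of either newer card.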

NVIDIA published Nemotron-Cascade 2 in March 2026 under open weights (arXiv 2603.19220). The training approach, Cascade Reinforcement Learning with multi-domain on-policy distillation, produced a model that scores at gold-medal level on IMO 2025 (35 points) and IOI 2025 (439.3 points). The benchmark that got the local AI community paying attention: Nemotron-Cascade 2 beats Nemotron-3-Super-120B-A12B on math and coding despite requiring roughly 4× less VRAM. That's the headline. But before running ollama pull, you need to understand the memory trap that comes with every MoE model in this class.

The memory math no one explains clearly

"30B total parameters, 3B active" sounds like great news. It is, but not in the way most people assume.

When Nemotron-Cascade 2 processes a token, its routing network selects roughly 3B parameters to activate. The remaining 27B sit idle for that token, loaded but contributing nothing. That's what makes generation fast: compute scales with active parameters, and 3B-class compute on modern NVIDIA hardware is genuinely quick.

The constraint is that idle does not mean absent. The router picks a different subset of experts for each token, so any expert may be needed at any step and every expert layer must already be loaded. The entire 30B parameter set is resident in VRAM at inference time. You cannot defer experts to system RAM without incurring a PCIe round-trip per token that kills throughput.

The practical result: Nemotron-Cascade 2 has roughly the inference speed of a dense 3B-class model and the VRAM footprint of a dense 30B model. The same constraint applies to Qwen 3.6 35B-A3B and every other A3B MoE in this tier. What distinguishes Nemotron-Cascade 2 is where it puts the training budget it saves: into coding and math, not general knowledge. That's a choice you either care about or you don't.
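The split between resident and active weights can be sketched as back-of-envelope arithmetic. This is an illustration under stated assumptions, not a measurement: it applies one average bit width to every weight (derived from the 18.2 GB IQ4_XS size quoted in this article), it treats decode speed as limited only by how fast the active weights can be read, and it assumes a PCIe 4.0 x16 link at its theoretical peak for the hypothetical case where active weights are fetched from system RAM on every token. The helper names are made up for this example.

```python
# Back-of-envelope MoE memory math. Sizes are in GB (1e9 bytes).
TOTAL_PARAMS = 30e9   # every expert; all of it must be resident
ACTIVE_PARAMS = 3e9   # what the router selects for one token

# Assumption: one average bit width for all weights, implied by the
# 18.2 GB IQ4_XS size quoted in this article (about 4.85 bits).
BITS_PER_WEIGHT = 18.2e9 * 8 / TOTAL_PARAMS


def weights_gb(params: float, bits: float = BITS_PER_WEIGHT) -> float:
    """Storage needed for `params` weights at `bits` bits each."""
    return params * bits / 8 / 1e9


def tok_s_ceiling(gb_per_token: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads gb_per_token."""
    return bandwidth_gb_s / gb_per_token


resident_gb = weights_gb(TOTAL_PARAMS)   # must fit in VRAM
touched_gb = weights_gb(ACTIVE_PARAMS)   # read per generated token

VRAM_GB_S = 936.0   # RTX 3090 memory bandwidth, as quoted in this article
PCIE_GB_S = 31.5    # assumption: PCIe 4.0 x16 theoretical peak, one way

vram_ceiling = tok_s_ceiling(touched_gb, VRAM_GB_S)
pcie_ceiling = tok_s_ceiling(touched_gb, PCIE_GB_S)

print(f"resident weights: {resident_gb:.1f} GB")
print(f"read per token:   {touched_gb:.2f} GB")
print(f"ceiling from VRAM: {vram_ceiling:.0f} tok/s")
print(f"ceiling over PCIe: {pcie_ceiling:.0f} tok/s")
```

Under these assumptions the VRAM-resident ceiling lands around 514 tok/s, comfortably above the measured 187 tok/s, while the PCIe-streaming ceiling is around 17 tok/s. Real runtimes behave differently in detail, but the order-of-magnitude gap is the point.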

VRAM requirements by quantization

Weights only, before KV cache:

Quantization	VRAM (weights)	Fits on	Notes
IQ4_XS	18.2 GB	RTX 3090, RTX 4090, RTX 5090	Verified 187 tok/s on RTX 3090
Q4_K_M	~24 GB	RTX 4090, RTX 5090	Ollama default; tight on RTX 3090
Q2_K	~16.9 GB	RTX 4060 Ti 16GB, RTX 5060 Ti	~10 tok/s; quality degraded
NVFP4	~14–16 GB	RTX 50-series only	229 tok/s on RTX 5090; RTX 40 unsupported
BF16 full	~63 GB	H100 80GB	Not a consumer discussion

KV cache adds on top of model weights. At 8K context: ~1.5 GB. At 16K: ~3 GB. At 32K: ~6 GB. The IQ4_XS quant (18.2 GB) leaves the most headroom on 24GB cards — 5–6 GB free for KV cache, enough for 16K context without tuning. Q4_K_M pushes to ~24 GB, leaving less than 1 GB free on a 24GB card, which means Ollama will aggressively limit context to fit.
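The budget above can be checked with a small calculator. This is a sketch, assuming the KV cache grows linearly with context at the ~1.5 GB per 8K rate quoted above, taking "~24 GB" for Q4_K_M at face value, and ignoring runtime overhead such as the CUDA context and activation buffers, so real headroom will be somewhat smaller. The function names are illustrative.

```python
# Weight sizes (GB) from the quantization table; "~" values are rounded.
WEIGHTS_GB = {"IQ4_XS": 18.2, "Q4_K_M": 24.0, "Q2_K": 16.9}

# KV cache estimate from the text: ~1.5 GB at 8K, ~3 GB at 16K, ~6 GB at 32K.
KV_GB_PER_8K = 1.5
TOKENS_PER_8K = 8192


def kv_cache_gb(context_tokens: int) -> float:
    """Assumption: KV cache scales linearly with context length."""
    return KV_GB_PER_8K * context_tokens / TOKENS_PER_8K


def fits(quant: str, context_tokens: int, vram_gb: float) -> bool:
    """True if weights plus KV cache fit, ignoring runtime overhead."""
    return WEIGHTS_GB[quant] + kv_cache_gb(context_tokens) <= vram_gb


def max_context(quant: str, vram_gb: float) -> int:
    """Largest context (tokens) whose KV cache fits beside the weights."""
    free_gb = vram_gb - WEIGHTS_GB[quant]
    return max(0, int(free_gb / KV_GB_PER_8K * TOKENS_PER_8K))


for quant, ctx, vram in [
    ("IQ4_XS", 16384, 24),   # RTX 3090 / 4090 at 16K
    ("Q4_K_M", 8192, 24),    # Ollama default on a 24GB card
    ("Q4_K_M", 32768, 32),   # RTX 5090 at 32K
    ("Q2_K", 8192, 16),      # 16GB card
]:
    verdict = "fits" if fits(quant, ctx, vram) else "does not fit"
    print(f"{quant:7s} ctx={ctx:6d} on {vram}GB: {verdict}")

print("IQ4_XS max context on 24GB:", max_context("IQ4_XS", 24))
```

The estimate puts the IQ4_XS ceiling on a 24GB card just under 32K tokens, which is why 16K is comfortable and 32K is marginal.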

See the quantization quality tradeoffs guide for how much perplexity you give up going from Q4 to Q2 on models like this.

GPU compatibility by tier

RTX 3090 — 187 tok/s at IQ4_XS, the best $/tok deal right now

A used RTX 3090 runs $480–$550 on eBay as of June 2026. That's the cheapest path to Nemotron-Cascade 2 at full quality.

The benchmark is concrete: 187 tok/s with IQ4_XS quantization, tested at 625K context, posted in the official NVIDIA model discussion thread on Hugging Face. IQ4_XS weighs 18.2 GB, leaving 5–6 GB of VRAM clear for KV cache. At 16K context (roughly 1,500 lines of code at about 10 tokens per line) you stay well within that headroom.

The RTX 3090's 936 GB/s bandwidth does not bottleneck this model at IQ4_XS. Generation speed at 187 tok/s already exceeds comfortable reading pace. The only real drawback is power: the 3090 draws ~285W under full LLM load, which works out to $0.034/hour at $0.12/kWh. Over a full 8-hour coding day, that's less than $0.30. Full RTX 3090 value analysis here.
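The electricity figure is simple enough to verify directly. A minimal sketch using the 285 W draw and $0.12/kWh rate quoted above; swap in your own tariff to see what it costs you.

```python
def hourly_cost_usd(watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of one hour at a constant power draw."""
    return watts / 1000 * usd_per_kwh


hourly = hourly_cost_usd(285, 0.12)   # RTX 3090 under LLM load
daily = hourly * 8                    # one 8-hour coding day

print(f"per hour: ${hourly:.4f}")
print(f"per 8-hour day: ${daily:.2f}")
```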

One practical note: the Ollama default for this model is Q4_K_M (~24 GB). On a 24GB card, that's tight. Pull the IQ4_XS variant explicitly (see the setup section below) for more comfortable headroom and the verified benchmark speeds.

RTX 4090 — Q4_K_M out of the box, ~196 tok/s

RTX 4090 ($1,500–$1,700) runs the Ollama default without any quant selection. The model reports 24 GB in Ollama and loads cleanly into 24GB VRAM because Ollama manages the context window to avoid overflow.

Tested generation speed: approximately 196 tok/s — around 5% faster than RTX 3090 at IQ4_XS, driven by the 4090's 1,008 GB/s bandwidth vs the 3090's 936 GB/s. The gap widens at longer context windows where the KV cache actively stresses bandwidth.

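The RTX 4090 has the same 24 GB as the 3090, so it gains no VRAM headroom: at Q4_K_M (~24 GB) less than 1 GB remains for KV cache, and Ollama fits the model only by limiting context. For agentic coding workflows with 32K–64K context windows, expect to use a smaller quant such as IQ4_XS and cap context on the 4090 as well. If you're running automated agents that spawn many parallel sessions, what the 4090 adds is bandwidth and compute, not memory.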

NVFP4 is not an option on RTX 40-series: this model's NVFP4 variant targets Blackwell (RTX 50-series) only. For RTX 4090, Q4_K_M or IQ4_XS are the practical formats. Full RTX 5090 vs RTX 4090 comparison here.

RTX 5090 — 229 tok/s with NVFP4, 32GB headroom

RTX 5090 ($1,999+) provides two advantages over 24GB cards: 32GB of GDDR7 VRAM and Blackwell's native NVFP4 support.

The Hugging Face benchmark for Nemotron-Cascade 2 NVFP4 on RTX 5090 shows 229.52 tok/s in text generation (tg128 mode), about 17% faster than the RTX 4090 at Q4_K_M (~196 tok/s) and about 23% faster than the RTX 3090 at IQ4_XS (187 tok/s). More practically, 32GB means you can load Q4_K_M (~24 GB) and still have 8 GB free for KV cache, enabling 32K+ context without any configuration adjustments. Ollama just works.

A practical caution: as of early June 2026, community reports indicate vLLM has unresolved compatibility issues with NVFP4 on sm12x (RTX 5090 Blackwell) for this specific model. Ollama with Q4_K_M is fully stable. If NVFP4 matters for your use case, check the vLLM issue tracker before switching. Details on NVFP4 formats and RTX 50-series support here.

16GB cards — the hard wall

Every current 16GB consumer GPU (RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 5070 Ti, RTX 5080, RX 9070 XT) hits the same constraint: the Q4_K_M quant needs ~24 GB, and 16 GB < 24 GB.

Q2_K (~16.9 GB) is the workaround. On RTX 4060 Ti 16GB: approximately 10–11 tok/s decode speed with a time-to-first-token around 17 seconds. That's not a typo: the speed drops from 187 tok/s on an RTX 3090 at IQ4_XS to under 11 tok/s on Q2_K. Even at Q2_K the weights (~16.9 GB) slightly exceed 16 GB, so part of the model likely spills out of VRAM. MoE routing doesn't rescue you from VRAM pressure; the model still needs all 30B parameters loaded.

For 16GB cards, Qwen 3.6 27B dense is the honest recommendation. It fits at Q4_K_M (~16 GB), scores 77.2% on SWE-bench Verified, and runs at 80+ tok/s on