NVIDIA Rubin CPX for Local AI Inference in 2026: What the New Context-Optimized Rubin GPU Means for Home Labs vs Consumer Cards

#nvidia #gpu #localai #inference

This article was originally published on runaihome.com

TL;DR: NVIDIA's Rubin CPX is a 30-petaFLOP, 128GB GDDR7 inference chip built for enterprise-scale million-token context workloads, arriving in data-center rack configurations in H2 2026. It's not consumer hardware, and no home-lab version has been announced. The consumer Rubin generation (likely an RTX 6090) isn't expected until 2027–2028 at the earliest. An RTX 5060 Ti 16GB or used RTX 3090 is still the right call today.

	Rubin CPX	RTX 5090	RTX 5060 Ti 16GB
Best for	Enterprise prefill, 1M+ token context	Enthusiast local AI, 30B–70B models	Budget home lab, up to 30B models
VRAM	128GB GDDR7	32GB GDDR7	16GB GDDR7
Memory bandwidth	~2 TB/s	1.79 TB/s	448 GB/s
Price (June 2026)	Enterprise rack only — no public price	~$2,000 MSRP	~$429–499 MSRP
The catch	Not for home labs; prefill-only in disaggregated deployments	70B+ models spill into system RAM	16GB caps you at 13B–27B models in practice

Honest take: The Rubin CPX is a data-center signal, not a home-lab buying decision. The key thing it confirms: memory bandwidth governs local inference below 100K context, and your RTX 5060 Ti or used RTX 3090 is still the right hardware to buy in June 2026.

What the Rubin CPX Actually Is

NVIDIA announced the Rubin CPX in September 2025 as a specialized inference accelerator designed for the prefill phase of large-language-model inference — the step where the model processes your entire input prompt before generating its first output token.

The specs are unusual compared to anything in a consumer build: 30 petaFLOPS of NVFP4 compute, 128GB of GDDR7 memory, and approximately 2 TB/s of memory bandwidth. The bandwidth figure looks modest against the H100 SXM's 3.35 TB/s of HBM3, but GDDR7 costs a fraction of what HBM does per gigabyte, which makes 128GB practical on a chip aimed at compute-heavy prefill. NVIDIA claims 3× faster attention processing compared to the GB300 NVL72, and that is the performance gap disaggregated inference is designed to exploit.

The chip ships inside the Vera Rubin NVL144 CPX rack: 144 Rubin CPX chips for prefill paired with 144 standard Rubin GPUs (HBM4, optimized for decode) and 36 Vera ARM CPUs. The complete rack delivers 8 exaFLOPS of AI compute across 100TB of fast memory. NVIDIA targets H2 2026 availability for enterprise customers. No consumer product has been announced, and no standalone pricing has been disclosed.

The Rubin CPX connects over PCIe Gen 6, notably without NVLink: its prefill-only role in the rack doesn't require the tight chip-to-chip interconnect that a symmetric multi-GPU cluster depends on.

The Disaggregated Architecture: Prefill vs. Decode

Understanding why the CPX exists requires separating the two distinct phases of LLM inference, because they demand completely different hardware.

Prefill is compute-bound. When you send a 100,000-token prompt, the model runs attention across every token before generating word one. The attention computation scales quadratically with context length: a 1-million-token prompt is 100× longer than a 10,000-token prompt, so it needs roughly 100² = 10,000× more attention compute. Raw FLOPS matter here; bandwidth matters far less, because each weight loaded from memory is reused across the many prompt tokens being processed in parallel.
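As a rough worked example of that scaling, the sketch below compares the quadratic attention term at two prompt lengths. It models only the attention-score work, which grows with the square of the token count; the projection and feed-forward work grows linearly and is left out. The function name is illustrative and not taken from any library.

```python
def attention_compute_ratio(long_tokens: int, short_tokens: int) -> float:
    """Ratio of quadratic attention work between two prompt lengths."""
    return (long_tokens / short_tokens) ** 2


# 1M-token prompt vs 10K-token prompt: 100x longer, 100^2 more attention work.
ratio = attention_compute_ratio(1_000_000, 10_000)
print(f"{ratio:,.0f}x more attention compute")  # prints "10,000x more attention compute"
```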

Decode is bandwidth-bound. After the prefill completes, the model generates tokens one at a time. Every single output token requires reading the model weights plus the full KV-cache from memory, where the KV-cache is the accumulated context from all previous tokens. If your KV-cache spans 128K tokens, every new token forces a full read of that cache. At ~2 TB/s, the Rubin CPX is too slow for efficient decode at scale, which is exactly why it's paired with HBM4-equipped standard Rubin GPUs in the same rack.
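To put a number on that cache read, here is a minimal sketch of the usual KV-cache size estimate. The model shape is an assumption (a Llama-3-70B-style layout with 80 layers, 8 KV heads, and a head dimension of 128, with an FP16 cache), and the function name is hypothetical; real runtimes may quantize or page the cache differently.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of keys plus values stored for `tokens` positions."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return tokens * per_token


# Assumed shape: Llama-3-70B-style model, FP16 cache, 128K-token context.
cache = kv_cache_bytes(tokens=128 * 1024, layers=80, kv_heads=8, head_dim=128)
print(f"KV cache at 128K tokens: {cache / 2**30:.1f} GiB")  # 40.0 GiB

# Time just to read that cache once per generated token at ~2 TB/s.
bandwidth_bytes_per_s = 2e12
print(f"Cache read per token: {cache / bandwidth_bytes_per_s * 1000:.1f} ms")  # 21.5 ms
```

Under these assumptions the cache read alone caps a single long-context stream at under 50 tokens per second on a ~2 TB/s part, before counting the weights.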

Research on disaggregated inference (the academic work underlying systems like Splitwise and DistServe) showed this split delivers up to 1.4× higher throughput at 20% lower cost compared to running both phases on identical hardware. NVIDIA is formalizing what the research community proved: specialized hardware for each phase beats general-purpose hardware for both.

Home labs don't run disaggregated inference. Ollama, llama.cpp, and vLLM on a single card handle both phases on the same chip. That's not a problem at the scale home labs operate — but it does explain the performance ceiling you hit when a model partially fits in VRAM and you're doing long-context inference.

Why Your RTX Slows Down at Long Context

The VRAM math is brutal, and worth knowing before you hit the wall.

An RTX 5090 has 32GB of GDDR7. Llama 3.3 70B at Q4_K_M quantization requires approximately 38–42GB of VRAM depending on context length and KV-cache size. The model doesn't fit. Ollama and llama.cpp handle the overflow by leaving the layers that don't fit in system RAM, where they run on the CPU at system-memory speed. Anything that has to cross the bus moves over PCIe Gen 5 x16 at about 64 GB/s per direction, roughly 28× slower than the card's GDDR7.
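The overflow can be estimated before you download anything. The sketch below assumes the weights are spread evenly across layers and that a fixed slice of VRAM is reserved for the KV-cache and runtime buffers; the 40 GB, 80-layer, and 4 GB figures are illustrative assumptions, and `layers_on_gpu` is a hypothetical helper, not part of Ollama or llama.cpp.

```python
import math


def layers_on_gpu(model_gb: float, total_layers: int,
                  vram_gb: float, reserved_gb: float) -> int:
    """Estimate how many equally sized layers fit in VRAM after reserving
    space for the KV cache, runtime context, and scratch buffers."""
    per_layer_gb = model_gb / total_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(total_layers, math.floor(usable_gb / per_layer_gb))


# Assumptions: 40 GB of weights in 80 equal layers, 32 GB card, 4 GB reserved.
print(layers_on_gpu(model_gb=40.0, total_layers=80, vram_gb=32.0, reserved_gb=4.0))  # 56

# Bandwidth gap quoted above: RTX 5090 GDDR7 (~1,790 GB/s) vs PCIe Gen 5 (64 GB/s).
print(round(1790 / 64))  # 28
```

Every layer that lands outside the returned count is served from the slow side of that 28× gap, which is why a longer context (a larger reserved slice) can tip a borderline model into heavy offloading.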

The result: on an RTX 5090 with Llama 3.3 70B Q4_K_M and VRAM overflow, you see 14–22 tok/s — reasonable for a single user, but nowhere near what the GPU could achieve on a model that fits cleanly.

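# Load the model, then check how many layers the Ollama server put on the GPU (systemd install on Linux assumed)
ollama run llama3.3:70b-instruct-q4_K_M "Explain quantization"
journalctl -u ollama --no-pager | grep -i offload
ollama ps  # the PROCESSOR column shows the CPU/GPU split for the loaded model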

Expected output on a GPU that's overflowing VRAM:

llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/81 layers to GPU
llm_load_tensors: CPU model size = 23.34 GiB

Those lines mean only 16 of the model's 81 layers are on the GPU; the other 65 are served from system RAM, and that slow path sets the pace of your decode loop. The fix is either a model that fits your VRAM (try 30B Q4_K_M at ~19GB on the RTX 5090, which runs at 45+ tok/s) or a dual-GPU configuration. For 70B without any offloading, you need 48GB minimum: two RTX 3090s with the layers split across both cards, or Apple unified memory at 64GB+.

The Rubin CPX's 128GB GDDR7 solves this for enterprise inference clusters. It doesn't solve it for your home lab, because the chip isn't available to buy.

Spec Comparison: Rubin CPX vs. What's Actually in Home Labs

Spec	Rubin CPX	RTX 5090	RTX 5060 Ti 16GB	Used RTX 3090
Architecture	Rubin (post-Blackwell)	Blackwell	Blackwell	Ampere
VRAM	128GB GDDR7	32GB GDDR7	16GB GDDR7	24GB GDDR6X
Memory bandwidth	~2 TB/s	1.79 TB/s	448 GB/s	936 GB/s
NVFP4 compute	30 PFLOPS	~3.4 PFLOPS	~0.76 PFLOPS	N/A (Ampere)
PCIe interface	Gen 6	Gen 5	Gen 5	Gen 4
TDP	Not disclosed	~575W	~180W	350W
Price (June 2026)	Enterprise only	~$2,000	~$429–499	~$450–550 eBay
Home lab viable?	No	Yes	Yes	Yes

The bandwidth numbers between the used RTX 3090 (936 GB/s) and RTX 5060 Ti 16GB (448 GB/s) are worth staring at. For decode-heavy single-user inference on models that fit cleanly in VRAM, bandwidth is the dominant variable — which is why the RTX 3090 still competes with much newer hardware on models in the 7B–24B range. That full picture is in Used RTX 3090 in 2026: Still the AI Value King?
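A quick way to see why bandwidth dominates is the standard back-of-envelope ceiling for single-stream decode: each generated token reads all the weights once, so tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that bound to the two cards; the 8 GB model size is an assumption (roughly a 13B-class model at 4-bit quantization), the function name is illustrative, and real throughput lands below the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed: every generated token
    reads all the weights once, so speed <= bandwidth / model size."""
    return bandwidth_gb_s / model_gb


cards = {"Used RTX 3090": 936, "RTX 5060 Ti 16GB": 448}  # GB/s, from the table above
model_gb = 8.0  # assumed: roughly a 13B-class model at 4-bit quantization

for name, bandwidth in cards.items():
    print(f"{name}: <= {decode_ceiling_tok_s(bandwidth, model_gb):.0f} tok/s")
# Used RTX 3090: <= 117 tok/s
# RTX 5060 Ti 16GB: <= 56 tok/s
```

The ratio between the two ceilings is the same as the bandwidth ratio, about 2.1×, regardless of which model size you plug in, as long as the model fits in VRAM on both cards.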

The RTX 6090 Speculation: What's Real

A die-shot analysis of the Rubin CPX silicon revealed something unexpected: the chip contains graphics-specific hardware blocks, namely 256 raster operations pipelines (ROPs) and four display output pipes, that have no function in a pure AI inference accelerator. This prompted widespread speculation that the CPX die could serve as the foundation for a future RTX 6090.

NVIDIA hasn't confirmed anything. Industry sources cited by Moore's Law Is Dead noted the shipping Rubin CPX is "highly specialized for Prefill/Inference" and the consumer graphics pathway would require enabling those graphics blocks and shipping them w