pueding

Posted on Jun 14 • Originally published at learnaivisually.com

NVIDIA RTX Spark Superchip: Unified CPU–GPU Memory

#ai #machinelearning #llm #agents

What: NVIDIA's RTX Spark "superchip" (unveiled around Computex / Build 2026) pairs a 20-core Grace CPU with a Blackwell RTX GPU; together they address one 128GB unified memory pool over NVLink-C2C. The idea this page explains is unified, coherent CPU–GPU memory.

Why: On an ordinary discrete GPU, any data the GPU touches must first be copied from CPU system RAM into GPU VRAM across the PCIe bus — a copy that dominates the moment a model is too big to fit in VRAM. A shared pool lets the GPU read the bytes where they already sit, deleting that copy.

vs prior: A discrete GPU walls its VRAM off behind PCIe and shuttles data both ways with explicit host↔device copies (cudaMemcpy); RTX Spark's coherent unified pool removes the wall, so CPU and GPU see the same physical addresses — no staging copy, no PCIe round-trip.

Think of it as

Two chefs sharing one counter instead of passing plates through a hatch.

                      THE DATA TO COOK
                              │
              ┌───────────────┴───────────────┐
              │                               │
      ┌───────▼───────┐               ┌───────▼───────┐
      │ DISCRETE GPU  │               │   RTX SPARK   │
      │  (the hatch)  │               │ (one counter) │
      └───────┬───────┘               └───────┬───────┘
              │                               │
     slide each plate                reach across to the
     through one hatch                same shared counter
     (a PCIe copy)                    (NVLink-C2C, in place)
              │                               │
              ▼                               ▼
   ✗ chefs wait on the              ✓ no copy: grab it
     hatch, not cooking               where it already sits

CPU = the prep chef who gathers and stages the ingredients
GPU = the line chef who does the fast cooking
PCIe copy = sliding every plate through one narrow serving hatch
unified memory pool = one shared counter both chefs reach across
NVLink-C2C = the wide-open pass-through that replaces the hatch

Quick glossary

Unified (coherent) memory — A single physical memory pool that both the CPU and GPU address directly. "Coherent" means a write by one processor is visible to the other without an explicit transfer — so there is no host→device copy step.

PCIe — Peripheral Component Interconnect Express, the bus a discrete GPU sits on. A PCIe 5.0 ×16 link tops out near 64 GB/s per direction, glacial next to a GPU's on-package bandwidth, which is on the order of TB/s (order-of-magnitude figure). See GPU & CUDA → Memory Hierarchy → NVLink.

VRAM — The GPU's own high-bandwidth memory (GDDR or HBM), physically separate from CPU system RAM on a discrete card. Once a model's working set exceeds VRAM, data must be streamed in from elsewhere.

NVLink-C2C — NVIDIA's chip-to-chip coherent interconnect that bonds the Grace CPU and the GPU into one memory domain — far wider than PCIe and cache-coherent, which is what makes the shared pool possible.

Grace CPU — NVIDIA's Arm-based server/desktop CPU, designed to sit next to a GPU over NVLink-C2C and share a memory pool rather than talk across PCIe.

Host & Device — CUDA's names for the two sides of the copy: the host is the CPU (and its RAM), the device is the GPU (and its VRAM). The classic pattern allocates device memory, copies host→device, launches the kernel, then copies device→host.

FP4 Tensor Core — A 4-bit floating-point matrix unit (fifth-generation on Blackwell). RTX Spark leans on FP4 to fit large models on-device — quantization shrinks the bytes; unified memory removes the copy of those bytes.

The news. On June 2, 2026, around Computex / Build 2026, NVIDIA unveiled RTX Spark, a consumer "superchip" aimed at on-device AI agents. It combines a Blackwell RTX GPU (6,144 CUDA cores, fifth-generation FP4 Tensor Cores) with a 20-core Grace CPU over NVLink-C2C, delivering up to 1 petaflop of AI compute and 128GB of unified memory. RTX Spark laptops and compact desktops ship this fall from ASUS, Dell, HP, Lenovo, Microsoft Surface, and MSI.

Picture the two chefs for a second. The prep chef chops and stages every ingredient on his bench; the line chef does the fast searing under the heat. On a normal setup they work in separate rooms joined by one narrow serving hatch — every tray of mise en place has to be slid through that slot before the line chef can touch it, and finished plates slid back. For a two-cover lunch the hatch is fine. For a 200-cover banquet, the hatch is the bottleneck: both chefs spend more time shoving trays through the slot than actually cooking. RTX Spark knocks out the wall. Now both chefs work at one long shared counter — the line chef reaches over and grabs the mise en place exactly where the prep chef left it. No hatch, no sliding, no copy.

In CUDA terms, the hatch is the PCIe bus and the trays are cudaMemcpy. A discrete GPU keeps its fast VRAM physically separate from the CPU's system RAM; before a kernel can run, the input is copied host→device across PCIe, and the result copied back. The classic four-step dance: allocate device memory, copy the input host→device across PCIe, launch the kernel, then copy the result device→host.

A PCIe 5.0 ×16 link tops out around 64 GB/s per direction. That is quick in isolation but glacial next to a GPU's on-package bandwidth, which is on the order of TB/s (order-of-magnitude figure). For a model that fits in VRAM you pay the copy once and amortize it. For a model bigger than VRAM, you stream weights across PCIe layer by layer on every forward pass, and the copy, not the matmul, sets your token rate. In that regime decode is bound by transfer bandwidth, and the GPU's compute cores sit idle waiting for bytes.
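The difference between "pay once and amortize" and "pay on every pass" can be sketched as a toy cost model. This is a minimal illustration, not a benchmark: the function names, the 8 GB model, and the 1,000-token run are hypothetical, and it follows the text's simplification that an over-VRAM model re-streams all of its weights on each pass.

```python
# Toy cost model: PCIe copy time charged to each generated token.
# All numbers are illustrative assumptions, not measurements.

PCIE_GBPS = 64.0  # approximate PCIe 5.0 x16 bandwidth in GB/s (figure from the text)


def copy_seconds(gigabytes: float, link_gbps: float = PCIE_GBPS) -> float:
    """Time to move `gigabytes` across a link running at `link_gbps` GB/s."""
    return gigabytes / link_gbps


def copy_per_token_fits(model_gb: float, tokens: int) -> float:
    """Model fits in VRAM: weights cross PCIe once, spread over all tokens."""
    return copy_seconds(model_gb) / tokens


def copy_per_token_streamed(model_gb: float) -> float:
    """Model exceeds VRAM: weights are streamed again for every token."""
    return copy_seconds(model_gb)


if __name__ == "__main__":
    # Hypothetical 8 GB model that fits in VRAM, generating 1,000 tokens.
    fits_ms = copy_per_token_fits(8.0, 1000) * 1e3
    # Hypothetical 34 GB model that does not fit.
    streamed_ms = copy_per_token_streamed(34.0) * 1e3
    print(f"fits in VRAM: {fits_ms:.3f} ms of copy per token")
    print(f"streamed:     {streamed_ms:.1f} ms of copy per token")
```

In the first case the copy cost shrinks as the run gets longer; in the second it is a fixed charge on every token, which is why it ends up setting the token rate.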

RTX Spark deletes the staging copy outright. A Grace CPU and a Blackwell GPU are bonded over NVLink-C2C into a single 128GB coherent pool. Coherent is the load-bearing word: both processors see the same bytes at the same addresses, and a write by one is visible to the other with no explicit transfer. The GPU stops being a walled-off device you ship data to and becomes a peer that reads the data in place — the same shift that the memory ladder work frames as moving the bottleneck back toward on-package bandwidth, where it belongs.
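As a loose analogy in plain Python (not CUDA, and not how the hardware is programmed), the four-step discrete pattern and the shared-pool pattern can be contrasted with a `bytearray` standing in for memory and a `memoryview` standing in for a second processor addressing the same bytes. The names `kernel`, `discrete_path`, and `unified_path` are hypothetical.

```python
# Toy analogy: explicit staging copies vs. one shared buffer.


def kernel(buf) -> None:
    """Stand-in for GPU work: increment every byte in place (wraps at 256)."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


def discrete_path(host: bytearray) -> int:
    """Four-step pattern; returns the number of bytes copied."""
    device = bytearray(len(host))  # 1. allocate "device" memory
    device[:] = host               # 2. copy host -> device
    kernel(device)                 # 3. launch the kernel
    host[:] = device               # 4. copy device -> host
    return 2 * len(host)


def unified_path(shared: bytearray) -> int:
    """Shared pool: the 'GPU' view aliases the same bytes; nothing is copied."""
    with memoryview(shared) as gpu_view:  # same underlying memory, no copy
        kernel(gpu_view)
    return 0


if __name__ == "__main__":
    a = bytearray([1, 2, 3])
    b = bytearray([1, 2, 3])
    print(discrete_path(a), list(a))  # 6 [2, 3, 4]
    print(unified_path(b), list(b))   # 0 [2, 3, 4]
```

Both paths produce the same result; the only difference is how many bytes had to move first. The write made through `gpu_view` is visible in `shared` with no transfer step, which is the property "coherent" names in the text.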

This is why NVIDIA pitches RTX Spark as an on-device agent machine. Local agents juggle big context windows, KV caches, and sometimes several models at once — state that is awkward to shuttle across PCIe but trivial to share in a unified pool. A 70B-class model at 4-bit weights needs ~35GB; it won't fit in a typical discrete laptop GPU's 8–16GB of VRAM, so today it either spills to system RAM over PCIe (slow) or simply won't run. With 128GB of unified memory, the same model just lives in the pool and the GPU addresses all of it. (NVIDIA has not published the consumer part's exact NVLink-C2C bandwidth, so treat the on-package figures below as illustrative.)
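The ~35GB figure follows from simple arithmetic, sketched below. The helper names are hypothetical, and the estimate covers weights only: it ignores KV cache, activations, and any per-block quantization metadata, all of which add to the real footprint.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int, capacity_gb: float) -> bool:
    """True if the weights alone fit in a memory of `capacity_gb` GB."""
    return weight_gb(params_billion, bits_per_weight) <= capacity_gb


if __name__ == "__main__":
    print(weight_gb(70, 4))   # 35.0
    print(fits(70, 4, 16))    # False: a 16GB laptop GPU
    print(fits(70, 4, 128))   # True: the 128GB unified pool
```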

Where the copy time actually goes

A back-of-envelope walk-through (illustrative numbers; substitute your own workload). Take a 34GB 4-bit model that does not fit in a 16GB discrete GPU. On the discrete path, running one forward pass means streaming all 34GB of weights across PCIe 5.0 at ~64 GB/s: 34 / 64 ≈ 0.53 s of pure copy per pass. During decode that's roughly one pass per token, so the copy alone caps you near 1.9 tokens/s before a single multiply happens, and the GPU cores idle the whole time. On the unified path, the GPU addresses all 34GB in the shared pool directly at on-package bandwidth, call it ~0.5 TB/s for a consumer part (illustrative): reading the same 34GB takes about 0.07 s, roughly 8× less wait, and the bottleneck moves off the PCIe link and back to on-package bandwidth and compute. The model size didn't change; the copy disappeared.
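The same walk-through as a few lines of arithmetic, so the inputs are easy to swap. The constants are the illustrative figures from the paragraph above; the 500 GB/s unified bandwidth in particular is an assumption, since the consumer part's figure is undisclosed.

```python
# Back-of-envelope: time to read the weights once, discrete vs. unified.
MODEL_GB = 34.0        # 4-bit model from the walk-through
PCIE_GBPS = 64.0       # approximate PCIe 5.0 x16 bandwidth
UNIFIED_GBPS = 500.0   # illustrative on-package bandwidth (assumption)


def seconds_per_pass(model_gb: float, bandwidth_gbps: float) -> float:
    """Time to read every weight once at the given bandwidth."""
    return model_gb / bandwidth_gbps


def tokens_per_second_cap(model_gb: float, bandwidth_gbps: float) -> float:
    """Upper bound on decode rate at one full weight read per token."""
    return bandwidth_gbps / model_gb


if __name__ == "__main__":
    discrete = seconds_per_pass(MODEL_GB, PCIE_GBPS)
    unified = seconds_per_pass(MODEL_GB, UNIFIED_GBPS)
    print(f"discrete: {discrete:.2f} s/pass, "
          f"cap {tokens_per_second_cap(MODEL_GB, PCIE_GBPS):.1f} tok/s")
    print(f"unified:  {unified:.2f} s/pass, "
          f"cap {tokens_per_second_cap(MODEL_GB, UNIFIED_GBPS):.1f} tok/s")
    print(f"ratio:    {discrete / unified:.1f}x less wait")
```

The ratio is just the ratio of the two bandwidths (500 / 64), so it does not depend on model size.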

How systems connect CPU and GPU memory

System	CPU ↔ GPU memory	Interconnect	Host→device copy?
Discrete GPU (PCIe card)	separate VRAM + system RAM	PCIe 5.0 ~64 GB/s (setup-dependent)	Yes — both ways
Integrated GPU (iGPU)	shared system RAM	on-die	No, but low bandwidth
Apple Silicon (UMA)	unified system memory	on-package fabric	No
NVIDIA Grace Hopper (GH200)	unified, coherent	NVLink-C2C ~900 GB/s (GH200 figure)	No
NVIDIA RTX Spark (2026)	unified 128GB, coherent	NVLink-C2C (consumer bandwidth undisclosed)	No — zero-copy

A caveat worth attaching to the headline: unified memory removes the copy, not the bandwidth wall. The pool is still finite-bandwidth memory, so a model that's memory-bandwidth-bound on a discrete card is still bandwidth-bound on RTX Spark — it just stops paying the PCIe tax on top. And NVIDIA quotes "1 petaflop" as a low-precision (FP4) peak, not a sustained number. The structural win is real and narrow: the host↔device copy goes away, which is exactly the tax that makes over-VRAM models painful on today's laptops.
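To see why the bandwidth wall remains, one can compare two rough ceilings on single-stream decode: one from memory bandwidth, one from peak compute. This is a sketch under loud assumptions: about 2 FLOPs per parameter per token for a dense model, one full weight read per token, the illustrative 500 GB/s pool bandwidth, and the quoted 1 petaflop taken at face value as a peak. The function name is hypothetical.

```python
def decode_ceilings(model_gb: float, bandwidth_gbps: float,
                    peak_flops: float, bits_per_weight: int = 4):
    """Return (bandwidth_bound, compute_bound) in tokens/s for batch-1 decode.

    Assumes every weight is read once per token and ~2 FLOPs per parameter
    per token (a common dense-model approximation, not a measured figure).
    """
    params = model_gb * 1e9 * 8 / bits_per_weight
    flops_per_token = 2 * params
    bandwidth_bound = bandwidth_gbps / model_gb
    compute_bound = peak_flops / flops_per_token
    return bandwidth_bound, compute_bound


if __name__ == "__main__":
    bw, comp = decode_ceilings(34.0, 500.0, 1e15)
    print(f"bandwidth ceiling: {bw:.1f} tok/s")
    print(f"compute ceiling:   {comp:.0f} tok/s")
    print("limited by:", "bandwidth" if bw < comp else "compute")
```

Under these assumptions the bandwidth ceiling sits far below the compute ceiling, so removing the PCIe copy raises the limit without making single-stream decode compute-bound.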

Goes deeper in: GPU & CUDA → Memory Hierarchy → NVLink

Related explainers

Jetson Thor — Edge Blackwell vs datacenter Blackwell — the robotics cousin: the same Blackwell silicon, also on a unified-memory SoC
Vera Rubin NVL72 — rack-scale NVLink domain — NVLink at the other extreme: 72 GPUs as one fabric (GPU↔GPU), versus RTX Spark's CPU↔GPU NVLink-C2C
MobileMoE — DRAM-aware MoE scaling — the algorithmic side of fitting big models in tight on-device memory

FAQ

What is unified CPU–GPU memory, in one paragraph?

Unified memory is a single physical memory pool that both the CPU and the GPU address directly. On a discrete GPU, the CPU's system RAM and the GPU's VRAM are separate, so data must be copied across the PCIe bus before the GPU can use it (host→device) and copied back afterward. A unified, coherent pool — like the 128GB pool RTX Spark shares over NVLink-C2C — lets the GPU read the bytes exactly where they sit. No staging copy, no PCIe round-trip.

Why does eliminating the PCIe copy matter for on-device AI?

Because the copy, not the math, is often the bottleneck. A PCIe 5.0 ×16 link moves data at roughly 64 GB/s. When a model is larger than the GPU's VRAM, the weights must stream across PCIe on every forward pass, and the GPU's compute cores idle while they wait. For a 34GB 4-bit model on a 16GB discrete GPU, that copy alone can cap throughput near 1.9 tokens/s (illustrative). Sharing one 128GB pool lets the model live in memory and the GPU read it in place, moving the bottleneck back to compute and on-package bandwidth.

How is RTX Spark's unified memory different from a discrete GPU or from Apple Silicon?

A discrete GPU has separate VRAM behind PCIe and needs explicit host↔device copies. Integrated GPUs and Apple Silicon already share one memory pool between CPU and GPU; integrated GPUs typically do so at standard system-memory bandwidth. RTX Spark's approach bonds a Grace CPU and a Blackwell GPU over NVLink-C2C, a wide, cache-coherent chip-to-chip link, into a 128GB coherent pool, so it gets the no-copy benefit of unified memory while keeping a discrete-class GPU on the other end of the link. NVIDIA's Grace Hopper (GH200) datacenter parts use the same NVLink-C2C idea.

Originally posted on Learn AI Visually.

Top comments (1)

Alex Shev • Jun 14

Unified memory is one of those changes that sounds like a hardware detail until you think about agentic workloads. The bottleneck is often moving state around: documents, embeddings, intermediate tensors, tool outputs, and model context. Reducing that copy tax can change which local workflows feel practical.