DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at runaihome.com

EXO Framework in 2026: Can You Pool RTX 3090s to Beat a DGX Spark? The Honest Distributed-Inference Reality

This article was originally published on runaihome.com

TL;DR: EXO turns a pile of computers into one big memory pool so you can load models that won't fit on a single card. That's real and useful. But the viral claim — "three used RTX 3090s beat a $3,999 DGX Spark at 3× the throughput" — does not survive contact with the data. EXO's GPU support is strongest on Apple Silicon; on NVIDIA it's a community fork, and distributed inference adds capacity, not speed.

EXO on Apple Silicon EXO on NVIDIA (exo-cuda fork) DGX Spark (single box)
Best for Pooling Mac unified memory to run 400B–671B models Experimenters who already own multiple NVIDIA cards One-box 120B inference with zero networking
Cost (June 2026) 4× M3 Ultra Mac Studio = $40k+ 3× used RTX 3090 ≈ $3,200–$3,600 + platform $3,999–$4,699
Real throughput DeepSeek V3.1 671B: 32.5 tok/s on 4 nodes Lower than a single native-CUDA GPU (tinygrad backend) gpt-oss-120b: ~38.5 tok/s
The catch Costs more than most home labs will ever spend GPU support not in the official build Capped at ~128GB; no expansion

Honest take: If you want to run a model that doesn't fit on one card, EXO is a genuinely clever tool — and it's at its best clustering Macs. If your real goal is more tokens per second on NVIDIA hardware, pool your 3090s with llama.cpp or vLLM instead, and don't expect distributed inference to multiply your speed.


What EXO actually is

EXO (from exo-explore) is an open-source framework that connects multiple devices on your network into a single inference cluster. Instead of needing one GPU big enough to hold an entire model, EXO splits the model into shards and spreads those shards across whatever hardware you have — Mac Studios, desktops, laptops, even phones — pooling their memory into one virtual machine. It exposes an OpenAI-compatible API, so tools like Open WebUI and Continue talk to it without modification.

The pitch writes itself: you're "one GPU short" of running a 70B model, so instead of buying a $2,000 card, you network two machines you already own and run it across both. The project is actively maintained — the latest release is v1.0.71 (April 23, 2026), built on 2,300+ commits.

That part is all true. The problem is what happens when the marketing turns "you can run bigger models" into "you can beat a dedicated AI box on speed for cheap." Those are different claims, and only one of them holds up.

The viral claim, line by line

The version of this story making the rounds on r/LocalLLaMA goes something like: "Three used RTX 3090s — 48GB pooled — beat a $3,999 DGX Spark at 3× the throughput, 124 tok/s vs 38.5 tok/s on a 120B model." Let's take it apart.

"48GB pooled" from three RTX 3090s. Three 24GB cards is 72GB, not 48GB. 48GB is two cards. Small thing, but it tells you the number wasn't checked.

"$2,400 for three used 3090s." Not anymore. The 2026 memory shortage that doubled DDR5 and SSD prices dragged used GPU prices up with it. As of June 9, 2026, used RTX 3090 listings on eBay average $1,050–$1,210, with a typical range of roughly $900–$1,500 across hundreds of completed sales. Three of them is $3,150–$3,630before you add a motherboard with three usable PCIe slots, a 1,500W power supply to feed roughly 1,050W of GPU draw, and risers or a mining frame to physically fit them. The all-in cost lands above a DGX Spark, not below it.

"3× the throughput, 124 tok/s." This is the claim that matters, and it's where the whole thing falls apart. Distributed inference does not work the way the number implies.

Why distributed inference adds capacity, not speed

Here's the part the clickbait skips. There are two ways to split a model across devices:

  • Pipeline parallelism assigns each device a contiguous block of layers. A token flows through device 1's layers, then gets passed over the network to device 2, and so on. For a single request, only one device is busy at a time — the others wait. You get the memory of all of them, but roughly the speed of one, minus network overhead.
  • Tensor parallelism splits each layer across devices so they work simultaneously. This can speed things up, but it hammers the interconnect with synchronization traffic on every layer. EXO's own numbers put tensor parallelism at "up to 1.8× on 2 devices and 3.2× on 4 devices" — and that's a best case on a fast link, not a guarantee.

EXO's own published benchmarks make the point better than any argument. On a 4-node cluster of M3 Ultra Mac Studios connected with RDMA over Thunderbolt 5 — one of the fastest consumer interconnects that exists — Jeff Geerling measured DeepSeek V3.1 671B at:

Nodes DeepSeek V3.1 671B (8-bit)
1 node 21.1 tok/s
2 nodes 27.8 tok/s
4 nodes 32.5 tok/s

Four nodes deliver 1.5× the throughput of one — not 4×, not even 2×. And that's on Thunderbolt 5 RDMA, which Apple says cuts inter-device latency by ~99% versus regular networking. The reason four Macs help at all here is that a single 256GB node can barely hold 671B in 8-bit; the cluster's real job is fitting the model, and the modest speed bump is a bonus.

Now imagine doing that over gigabit or even 10GbE Ethernet between three desktops, with EXO's NVIDIA path running on a less-optimized backend. The idea that this configuration produces 124 tok/s — more than 3× what a purpose-built, tightly-integrated DGX Spark does on a similar-sized model — isn't supported by anything EXO has published. The honest expectation for a 3× 3090 EXO cluster is throughput in the neighborhood of a single 3090, with the upside being the 72GB of combined VRAM.

The NVIDIA asterisk nobody mentions

There's a bigger problem for the 3090 fantasy: EXO's official builds don't run on NVIDIA GPUs.

Per EXO's own README, on Linux the framework currently "runs on CPU," with GPU support listed as under development. Its first-class GPU backend is MLX — Apple Silicon. That's why every headline EXO benchmark is a stack of Mac Studios, not a rack of 3090s.

NVIDIA acceleration exists only through a community fork, exo-cuda by developer Scottcjn, which restores CUDA inference via the tinygrad backend (tinygrad was removed from mainline EXO during the v1 rewrite). It's been confirmed working on older data-center cards like the Tesla V100 and M40. It's a legitimate project, but it's a fork — you're installing unofficial code, and tinygrad's CUDA kernels are not as optimized as the native kernels in vLLM or llama.cpp. EXO's NVIDIA throughput is therefore lower than running the same model on a single GPU that has enough VRAM. The framework's advantage on NVIDIA is strictly about enabling models that won't otherwise fit.

If you want to actually pool VRAM across NVIDIA cards today, you don't need EXO at all.

What you should use to pool NVIDIA GPUs instead

For multiple cards in one box — the realistic home-lab setup — the mature tools are llama.cpp and vLLM.

llama.cpp splits a model's layers across GPUs with the --tensor-split flag, putting some layers on each card. A 70B Q4 model needing ~40GB simply spreads across two 24GB cards. It's pipeline-style, so single-request speed is roughly that of one card, but it's rock-solid and supports mixed hardware. llama.cpp also has an RPC mode for spreading across separate machines on a LAN, much like EXO — but with native CUDA kernels. (For the deeper trade-offs between layer-splitting and NVLink, see our multi-GPU NVLink vs PCIe guide.)

vLLM is the choice when you want throughput from multiple cards, not just capacity. Its tensor-parallel implementation keeps all GPUs working at once and is built for batched, concurrent servi

Top comments (0)