EXO Framework Setup Guide 2026: Pool Devices for Big LLMs

This article was originally published on aifoss.dev

TL;DR: EXO pools the memory of several devices into one cluster so you can run models bigger than any single machine holds. In mid-2026 it's a genuinely good Apple Silicon tool and a rough one for NVIDIA on Linux, where it still defaults to CPU. Set expectations accordingly.

	EXO	llama.cpp RPC	vLLM + Ray
Best for	Mixing Macs/devices over LAN	Splitting one model across a few nodes	Multi-GPU production serving
Setup effort	Low (auto-discovery)	Medium (manual node list)	High (cluster config)
GPU on Linux	CPU by default; NVIDIA via community fork	Full CUDA (Metal on macOS)	Full CUDA
The catch	Network latency tax; Linux GPU support is still on the roadmap	You wire up every node	Needs real GPUs, not laptops

Honest take: If you have two or more Apple Silicon Macs sitting around, EXO is the fastest way to run a 70B+ model across them. If you have NVIDIA cards on Linux, use vLLM or llama.cpp RPC instead — EXO isn't there yet.

What EXO actually is

EXO (the exo-explore/exo project) connects every device on your network into a single AI cluster. The pitch is simple: you probably don't own one machine with 128GB of unified memory, but you might own three machines with 48GB each. EXO shards a model across them so the cluster can hold what no single node can.

It's licensed Apache 2.0, which matters — you can use it commercially without the license asterisks attached to "open weights" model releases. The repo is active (latest tagged release v1.0.71, April 23 2026, with commits landing through late June 2026), and it's a full rewrite of the original project, which is now archived under exo-explore/ex-exo. If you find an old tutorial referencing the archived repo, ignore it.

Two architectural choices make EXO different from a typical inference server:

No master-worker. Devices connect peer-to-peer. There's no head node to babysit — any device that's reachable on the network can join the ring and contribute memory.
Ring memory-weighted partitioning. EXO splits the model into layers and assigns each device a number of layers proportional to its memory. A 64GB Mac Studio carries more of the model than a 16GB MacBook Air, automatically.
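
To make the memory-weighted idea concrete, here is a minimal Python sketch of proportional layer assignment. It is not EXO's actual partitioner: the function name partition_layers, the device names, and the largest-remainder rounding rule are all assumptions made for illustration.

```python
def partition_layers(num_layers: int, memory_gb: dict[str, float]) -> dict[str, range]:
    """Give each device a contiguous block of layers proportional to its memory.

    Hypothetical helper for illustration only; EXO's real partitioner may
    round, order, or weight devices differently.
    """
    total = sum(memory_gb.values())
    if num_layers <= 0 or total <= 0:
        raise ValueError("need at least one layer and some memory")

    # Ideal (fractional) share of layers for each device.
    shares = {name: num_layers * mem / total for name, mem in memory_gb.items()}

    # Start from the rounded-down share, then hand leftover layers to the
    # devices with the largest fractional remainder.
    counts = {name: int(share) for name, share in shares.items()}
    leftover = num_layers - sum(counts.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1

    # Lay the blocks out back to back, in the order the devices were listed.
    ranges: dict[str, range] = {}
    start = 0
    for name in memory_gb:
        ranges[name] = range(start, start + counts[name])
        start += counts[name]
    return ranges


# An 80-layer model on a 64GB and a 16GB machine: a 64/16 split.
plan = partition_layers(80, {"mac-studio": 64, "macbook-air": 16})
for device, layers in plan.items():
    print(f"{device}: layers {layers.start}..{layers.stop - 1} ({len(layers)} total)")
```

With those inputs the larger machine takes 64 of the 80 layers and the smaller one takes 16, which is the "carries more of the model, automatically" behavior described above.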

Devices discover each other with no manual config. Start EXO on two machines on the same LAN and they find each other. The cluster exposes a web UI and API at http://localhost:52415, and the API speaks three dialects: OpenAI Chat Completions, Anthropic's Claude Messages format, and the Ollama API. That last one is the headline for self-hosters — anything you've already wired to talk to Ollama can point at EXO instead with a URL change.

The hardware reality check (read this before you buy anything)

Most EXO write-ups skip the single most important sentence in the documentation. Here it is, verbatim from the README:

"On macOS, exo uses the GPU. On Linux, exo currently runs on CPU. We are working on extending hardware accelerator support."

Read that twice. On macOS, EXO uses the Metal GPU through Apple's MLX framework — this is the path the project optimizes for. On Linux, the default backend (tinygrad) runs on CPU. GPU acceleration on Linux is a roadmap item, not a shipped feature in the upstream project.

That single fact reshapes the whole "build a home cluster" story. The viral benchmarks you've seen (pooling three RTX 3090s for a frontier model) are not something stock EXO on Linux delivers out of the box today. The showcase runs are all Apple Silicon: community demos that pool 4× M3 Ultra Mac Studios to run Qwen3-235B, DeepSeek v3.1, and Kimi K2-class models. That's where EXO is real in 2026.

If you have NVIDIA hardware, there's a path, but it's a fork-and-fiddle path — covered below.

Installing EXO

EXO is no longer a pip-installable package: the v1.0 rewrite builds a Node.js dashboard and runs through the uv Python toolchain. You need uv, Node.js, and Rust installed first.

macOS — the easy path:

brew install --cask exo

That installs the prebuilt app. Prefer source? Clone and run it:

git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
uv run exo

Linux (the same source build as on macOS):

git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
uv run exo

Either way, you should see the node come up and print the dashboard URL:

$ uv run exo
exo node started
dashboard + API: http://localhost:52415
discovering peers on local network...

Open http://localhost:52415 in a browser and you get a chat UI plus a topology view of the cluster.

A real problem you'll hit: Python version. EXO is happiest on Python 3.12. Installs on 3.13 have failed for users on Apple Silicon (tracked in GitHub issue #446 and the tinygrad version-incompatibility issue #867). If uv run exo dies during dependency resolution, pin the interpreter:

uv venv --python 3.12
uv run exo

Building a cluster

This is where EXO earns its keep. Run the same uv run exo command on a second machine on the same network. No config file, no IP list, no head node. The two nodes discover each other and the topology view in the dashboard updates to show both, with their combined memory.

To run a model, request it through the API. Because EXO speaks the OpenAI format, the call is ordinary:

curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Explain ring partitioning in one sentence."}]
  }'
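
The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the helper names build_chat_request and ask are hypothetical, the timeout is arbitrary, and the response parsing assumes EXO returns the standard OpenAI Chat Completions shape, as the compatibility claim implies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:52415"  # the dashboard/API address used in this guide


def build_chat_request(prompt: str, model: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str, base_url: str = BASE_URL, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes the standard OpenAI response layout (choices[0].message.content).
    The generous timeout is a guess: the first call may also download weights.
    """
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with a node running locally:
#   print(ask("Explain ring partitioning in one sentence.",
#             "mlx-community/Llama-3.2-1B-Instruct-4bit"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything touches the cluster.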

EXO downloads the weights from Hugging Face on first use and shards them across the ring based on each device's available memory. You don't pick which layers go where — the partitioner does. Pull a bigger model than any single node can hold, and EXO spreads it; that's the entire point.

On the interconnect: EXO ships day-0 support for RDMA over Thunderbolt 5, which the project claims cuts inter-device latency dramatically versus Wi-Fi. If you're chaining Macs, a Thunderbolt bridge or 10GbE is a meaningful upgrade over Wi-Fi 6 — the network hop is the tax you pay for pooling, so the faster the link, the less you lose.

What performance actually looks like

Honest numbers matter more than hype. These are community-reported figures, not official benchmarks, so treat them as ballpark:

Setup	Model	Throughput (reported)
Single M2 Ultra 192GB	Llama 3.1 70B	~12–18 tok/s
2× M3 Max over Wi-Fi 6	Llama 3.1 70B	~6–10 tok/s
4× M3 Ultra Mac Studio	Qwen3-235B / DeepSeek v3.1	demoed, usable for chat

The pattern is the lesson: a single machine that can hold the model is faster than two that split it, because there's no network hop. Pooling doesn't make inference faster; it makes inference possible for models that wouldn't otherwise fit. You reach for EXO when getting a model to run at all matters more than losing a few tokens per second. If a single machine already fits your model, a plain Ollama or MLX setup will be faster. See our Ollama MLX backend guide for the single-Mac path.
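
A quick back-of-the-envelope check helps decide which side of that trade-off you are on. The sketch below estimates weight size as parameters × bits per weight ÷ 8 and compares it against single-node and pooled memory. The function names and the 0.8 headroom factor (room for KV cache, the OS, and other apps) are assumptions, not EXO behavior, and real memory use depends on context length and quantization format.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    return params_billions * bits_per_weight / 8


def placement(params_billions: float, bits_per_weight: float,
              node_memory_gb: list[float], headroom: float = 0.8) -> str:
    """Classify a model as fitting on one node, only on the pool, or not at all.

    headroom is an assumed usable fraction of each node's memory; tune it
    for your own machines.
    """
    need = weights_gb(params_billions, bits_per_weight)
    if any(mem * headroom >= need for mem in node_memory_gb):
        return "single"        # one box holds it: skip the network hop
    if sum(node_memory_gb) * headroom >= need:
        return "pooled"        # only the cluster holds it: EXO territory
    return "does not fit"


# Three 48GB machines, as in the example earlier in this guide.
cluster = [48, 48, 48]
print(placement(70, 4, cluster))   # 70B at 4-bit needs about 35 GB
print(placement(70, 8, cluster))   # 70B at 8-bit needs about 70 GB
print(placement(235, 4, cluster))  # 235B at 4-bit needs about 117.5 GB
```

Under these assumptions the 4-bit 70B model fits on a single 48GB node, the 8-bit version needs the pool, and the 235B model does not fit even when pooled.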

NVIDIA on Linux: the fork situation

If you searched for EXO because you want to pool consumer NVIDIA GPUs, here's the unvarnished state in mid-2026:

Upstream EXO on Linux defaults to tinygrad on CPU. GPU users have hit the "GPU detected but showing 0.0 TFLOPS" wall (issue #821).
A community fork, Scottcjn/exo-cuda, restores NVIDIA CUDA inference through tinygrad and reports confirmed runs on Tesla V100 and M40 cards. You'll need the NVIDIA driver, CUDA toolkit, a