I connected an NVIDIA DGX Spark to a Mac Studio with a direct 10-gigabit Ethernet cable and split a large language model across both GPUs. Here's what actually happened.
## The Problem
I have two machines that are excellent at different things:
- NVIDIA DGX Spark (GB10 Blackwell, 120 GB unified memory) — screaming fast tensor cores, CUDA 13
- Mac Studio (M2 Ultra, 128 GB unified memory) — great Metal GPU, massive memory bandwidth
Combined: 248 GB of GPU-accessible memory. Enough to run models that don't fit on either machine alone — 100B+ parameter models at reasonable quantization levels.
The question: can you actually get useful performance by splitting a model across heterogeneous GPUs over a network link?
## The Physical Setup
I connected both machines with a direct 10GbE cable — no switch, no router. Just a CAT6A cable between:
- DGX: Realtek 10GbE NIC (`enP7s7`) → `192.168.100.2/24`
- Mac Studio: 10GbE port (`en0`) → `192.168.100.1/24`
Measured throughput: 9.41 Gbps. Both machines keep WiFi for LAN/internet access — the direct cable is a dedicated inference-only link.
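For intuition about what 9.41 Gbps can and can't do, here's a quick back-of-the-envelope sketch (plain Python, no dependencies; the 8 KB activation size assumes a 7B-class hidden dimension of 4096 in fp16, which is an illustration, not a measured value):

```python
# Back-of-the-envelope: how long does it take to move data over the
# measured 9.41 Gbps link? Useful for reasoning about per-token costs.

LINK_GBPS = 9.41                      # measured throughput from the direct cable
BYTES_PER_SEC = LINK_GBPS * 1e9 / 8   # gigabits/s -> bytes/s

def transfer_ms(num_bytes: float) -> float:
    """Milliseconds to move num_bytes at full link speed (ignores latency)."""
    return num_bytes / BYTES_PER_SEC * 1e3

# One hidden-state activation for a 7B-class model (hidden size ~4096, fp16):
activation_bytes = 4096 * 2
print(f"{transfer_ms(activation_bytes) * 1e3:.1f} us per 8 KB activation")

# One-time cost of moving a whole 44.2 GB model file over the link:
print(f"{transfer_ms(44.2e9) / 1000:.1f} s to transfer 44.2 GB")
```

Tiny per-token activations are essentially free on this link; it's the per-hop latency, not bandwidth, that hurts decode (more on that below).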
## Why llama.cpp RPC (and Why Not Exo)
I tried two approaches:
### Exo (MLX Ring) — Failed
Exo is a distributed inference framework that uses MLX on both Metal and CUDA backends. I got peer discovery working and placed a 128 GB MiniMax M2.5 model across both nodes, but hit a wall: `mx.distributed.init(backend="ring")` hangs indefinitely on the CUDA backend. The MLX CUDA ring implementation simply doesn't work yet (as of MLX 0.31.1). Even single-node ring init hangs on the DGX.
I fixed several other bugs along the way (election instability, edge oscillation, model path mismatches, Linux interface detection) and submitted a P2P model distribution PR, but the core distributed inference path is blocked until Apple adds CUDA ring support to MLX.
### llama.cpp RPC — Works
llama.cpp's RPC backend takes a different approach. Instead of requiring the same ML framework on both ends, it exposes a simple RPC server that provides raw compute. The host machine (Mac Studio) runs llama-server, loads the model, and offloads layers to remote RPC servers (DGX) as needed.
```bash
# DGX — start RPC server
cd /home/kjaiswal/llama.cpp
LD_LIBRARY_PATH=build/bin build/bin/rpc-server -H 192.168.100.2 -p 50052
```

```bash
# Mac Studio — start llama-server with RPC
cd /Users/chimpoo/llama.cpp
build/bin/llama-server \
  -m /path/to/model.gguf \
  --rpc 192.168.100.2:50052 \
  -ngl 99 \
  --host 0.0.0.0 --port 9999 \
  -c 4096
```
Both were built from the same commit (b0f0dd3e5) with their respective GPU backends:
- Mac Studio: `GGML_METAL=ON GGML_RPC=ON`
- DGX: `GGML_CUDA=ON GGML_RPC=ON`
The model file only needs to exist on the Mac Studio. llama.cpp automatically splits layers across available compute based on memory.
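Once llama-server is up, the whole two-machine cluster looks like a single OpenAI-compatible HTTP endpoint on the Mac Studio. A minimal client sketch (Python stdlib only; the host and port match the `--host 0.0.0.0 --port 9999` flags above, and the timeout value is an arbitrary choice, not a llama.cpp requirement):

```python
import json
import urllib.request

SERVER = "http://192.168.100.1:9999"  # Mac Studio address from the setup above

def build_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completion payload for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    """POST to llama-server's OpenAI-compatible endpoint, return the reply text."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Summarize what RPC offloading does in llama.cpp."))
```

The client neither knows nor cares that some layers are running on a CUDA box across the room; the RPC split is invisible at the API level.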
## Benchmarks
### Qwen2.5-7B Q4_K_M (4.4 GB) — Fits one machine easily
| Mode | Prompt Processing | Token Generation |
|---|---|---|
| Local Metal only | 76 tok/s | 92 tok/s |
| RPC (Metal + CUDA) | 318 tok/s | 53 tok/s |
### Qwen2.5-72B Q4_K_M (44.2 GB) — Fits Mac Studio alone
| Mode | Prompt Processing | Token Generation | Model Split |
|---|---|---|---|
| Local Metal only | 28 tok/s | 11 tok/s | 44 GB on Metal |
| RPC (Metal + CUDA) | 30 tok/s | 6 tok/s | 31 GB Metal + 14 GB CUDA |
## What the Numbers Mean
Prompt processing (prefill) benefits from RPC. The DGX Blackwell tensor cores accelerate the matrix multiplications needed to process input tokens. For the 7B model, prefill was 4.2x faster with RPC. Even the 72B model saw a slight improvement.
Token generation (decode) is slower with RPC. Each generated token requires round-trips over the network to synchronize activations and KV-cache state. At 10 Gbps this adds roughly 0.2 ms per layer per token; with 80 layers, that's ~16 ms of transfer overhead per token, and fixed per-round-trip latency pushes the real penalty higher still, enough to roughly halve generation speed.
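The decode-side arithmetic can be packaged as a tiny model (plain Python; the 0.2 ms/layer and 80-layer figures come from the text, while the fixed-latency parameter is an illustrative free variable, not a measured value):

```python
# Sketch: estimated decode speed once per-token network overhead is added
# to a local baseline. Assumptions from the text: ~0.2 ms of transfer per
# layer per token over the 10 GbE link, 80 transformer layers.

def rpc_decode_tok_s(local_tok_s: float,
                     layers: int = 80,
                     ms_per_layer: float = 0.2,
                     fixed_latency_ms: float = 0.0) -> float:
    """Decode tok/s after adding per-token network overhead to local speed."""
    local_ms_per_tok = 1000.0 / local_tok_s
    overhead_ms = layers * ms_per_layer + fixed_latency_ms
    return 1000.0 / (local_ms_per_tok + overhead_ms)

# Transfer overhead alone (80 * 0.2 ms = 16 ms per token), starting from
# the measured 11 tok/s Metal-only baseline for the 72B model:
print(f"{rpc_decode_tok_s(11.0):.1f} tok/s with transfer overhead only")

# Hypothetical extra fixed latency per token (60 ms is a guess chosen to
# match the measured 6 tok/s, not a measurement):
print(f"{rpc_decode_tok_s(11.0, fixed_latency_ms=60.0):.1f} tok/s with 60 ms extra latency")
```

The gap between the bandwidth-only estimate and the measured number is the point: decode is latency-bound, not bandwidth-bound.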
For models that fit one machine, local is faster. The 72B model runs at 11 tok/s locally vs 6 tok/s over RPC. The network overhead isn't worth it.
The real value is models that DON'T fit one machine. With 248 GB combined, I can run:
- MiniMax M2.5 Q4_K_M (138 GB) — 230B parameters, 10B active MoE
- Qwen3-235B Q4_K_M (132 GB) — 235B parameters, 22B active MoE
- DeepSeek-R1 at higher quantization than either machine could handle alone
At Q4 quantization, a 200B+ MoE model should generate at ~4–8 tok/s across both machines. Not fast, but usable for batch processing, code review, and complex reasoning tasks.
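A quick script makes the fit argument concrete (plain Python; the per-machine budgets are the raw unified-memory totals from the spec list, and real headroom is lower once the OS, KV cache, and compute buffers take their share):

```python
# Which quantized models fit locally, and which need the combined pool?
# Budgets are raw unified-memory sizes; usable GPU memory is noticeably
# lower in practice (OS, KV cache, and compute buffers all take a cut).

DGX_GB = 120.0
MAC_GB = 128.0

MODELS_GB = {
    "Qwen2.5-7B Q4_K_M": 4.4,
    "Qwen2.5-72B Q4_K_M": 44.2,
    "Qwen3-235B Q4_K_M": 132.0,
    "MiniMax M2.5 Q4_K_M": 138.0,
}

def placement(size_gb: float) -> str:
    """Classify where a model of this size can live."""
    if size_gb <= min(DGX_GB, MAC_GB):
        return "fits either machine"
    if size_gb <= max(DGX_GB, MAC_GB):
        return "fits Mac Studio only"
    if size_gb <= DGX_GB + MAC_GB:
        return "needs both machines (RPC split)"
    return "does not fit"

for name, gb in MODELS_GB.items():
    print(f"{name:22s} {gb:6.1f} GB  ->  {placement(gb)}")
```

Everything in the benchmark section falls in the first two buckets; the 130+ GB MoE models are the first entries that genuinely require the RPC split.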
## Lessons Learned
Direct cables beat switches. A direct 10GbE link has lower latency and jitter than going through a network switch. For latency-sensitive distributed inference, every microsecond matters.
Prefill and decode have opposite scaling characteristics. Prefill is embarrassingly parallel and benefits from more compute. Decode is sequential and bottlenecked by network latency. This suggests a potential disaggregated architecture: use the DGX for prefill, Mac Studio for decode.
GGUF is the universal format. Ollama GGUFs have custom metadata that upstream llama.cpp can't read (e.g., `rope.dimension_sections` with the wrong array length). Always use HuggingFace community GGUFs (bartowski, etc.) for llama.cpp.

Heterogeneous distributed inference works today — but only with frameworks that abstract the GPU backend behind a network protocol (like llama.cpp RPC). Frameworks that require the same ML runtime on all nodes (like Exo with MLX) are blocked on backend parity.
## What's Next
- Benchmark MiniMax M2.5 (138 GB) split across both machines — the first model that actually needs distributed inference
- Test disaggregated prefill (DGX) + decode (Mac Studio) once both run the same framework
- Explore vLLM's distributed serving for production workloads
The full setup — including the Exo debugging saga and the 5 bugs I fixed — is documented in my infrastructure notes on GitHub. If you're working on similar multi-GPU setups, I'd love to hear what's working for you.