Multi-GPU for Local AI in 2026: NVLink vs PCIe and When a Second Card Actually Helps

This article was originally published on runaihome.com

If you are researching multi-GPU setups for local AI and NVLink keeps coming up, here is the short version first: among the consumer GPUs you would realistically buy for local AI today, only the RTX 3090 has NVLink. The RTX 4090, the RTX 5090, and every other Ada or Blackwell GeForce card lack it. If you are running any of those, you are doing multi-GPU over PCIe whether you know it or not.

That is not necessarily a problem. But it does change what you should expect, how you should configure your software, and whether adding a second card is worth it at all. This guide covers all three questions with verified numbers.

NVLink on consumer GPUs: the short and definitive history

NVLink is NVIDIA's proprietary high-bandwidth GPU-to-GPU interconnect. On data center hardware it provides extraordinary bandwidth: 600 GB/s on A100s, 900 GB/s on H100s. On consumer hardware the story is much simpler: NVLink last appeared in the Ampere generation (2020–2021), after an earlier appearance on Turing cards such as the RTX 2080 Ti, and NVIDIA then removed it entirely.

Here is the NVLink support table for the cards discussed in this guide:

GPU	Architecture	NVLink support	Bandwidth
RTX 2080 Ti	Turing	Yes (NVLink 2.0)	100 GB/s
RTX 3090	Ampere	Yes (NVLink 3.0)	112.5 GB/s
RTX 3090 Ti	Ampere	No	—
RTX 4070 Ti / 4080 / 4090	Ada Lovelace	No	—
RTX 5060 Ti / 5070 / 5080 / 5090	Blackwell	No	—
RTX PRO 6000 Blackwell	Blackwell (workstation)	Yes (NVLink 5.0)	1,800 GB/s

The RTX 3090 Ti, announced the same generation, did not include the NVLink connector — making the base RTX 3090 the last consumer card with it. The RTX 4090 dropped NVLink entirely; NVIDIA stated it used the freed space for additional AI processing circuitry. The RTX 5090 and the rest of the 50-series continue that pattern.

What this means practically: if you want NVLink in a home lab, your only realistic option is a pair of used RTX 3090s with an NVLink bridge. Everything else is PCIe.

The bandwidth reality

To understand what this costs in performance, the numbers:

Interconnect	Bandwidth (bidirectional)	Typical home-lab hardware
PCIe 4.0 x16	64 GB/s	Most AMD and Intel desktop platforms
PCIe 5.0 x16	128 GB/s	Z790, X670E, AM5 with Gen 5 slot
NVLink 3.0 (RTX 3090 pair)	112.5 GB/s	RTX 3090 + NVLink bridge
NVLink 3.0 (A100 pair)	600 GB/s	Data center, out of home-lab budget
NVLink 4.0 (H100 pair)	900 GB/s	Data center

One important detail for dual-GPU desktop builds: when you install two cards in a typical consumer motherboard, each card usually runs at x8 rather than x16, because the CPU's graphics lanes are split between the two slots. On PCIe 4.0, x8 is 32 GB/s bidirectional. On PCIe 5.0, x8 is 64 GB/s bidirectional.
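The lane-split arithmetic can be sketched in a few lines. This is a rough illustration that reuses the rounded bidirectional figures from the table above rather than measured numbers; the function and constant names are made up for this example.

```python
# Bidirectional x16 bandwidth per PCIe generation, in GB/s.
# Rounded figures taken from the table above, not measurements.
X16_BIDIRECTIONAL_GBPS = {4: 64.0, 5: 128.0}


def pcie_bandwidth_gbps(generation: int, lanes: int) -> float:
    """Approximate bidirectional bandwidth of a PCIe link.

    Bandwidth scales linearly with lane count, so x8 is half of x16.
    """
    if lanes not in (1, 2, 4, 8, 16):
        raise ValueError(f"unsupported lane count: {lanes}")
    return X16_BIDIRECTIONAL_GBPS[generation] * lanes / 16


if __name__ == "__main__":
    for gen in (4, 5):
        for lanes in (16, 8):
            bw = pcie_bandwidth_gbps(gen, lanes)
            print(f"PCIe {gen}.0 x{lanes}: {bw:.0f} GB/s")
```

Checking which link width your cards actually negotiated is still worth doing on the real machine, since some boards wire the second slot to fewer lanes.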

GPU-to-GPU communication over PCIe also typically routes through the CPU (data moves from GPU 0 → CPU → GPU 1), which adds latency that a direct NVLink connection avoids. The RTX 3090's NVLink bridge is a direct GPU-to-GPU connection at 112.5 GB/s with no CPU hop.

For tensor-parallel inference, where every token processed requires all-reduce operations between GPUs, that bandwidth gap translates directly into throughput. Benchmarks from a 4× RTX 3090 cluster found that NVLink improved inference throughput by approximately 50% for 2-GPU tensor-parallel pairs, and by around 10% for 4-GPU setups where only half of the GPU pairs are bridged and the rest communicate over PCIe.
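To get a feel for why the link matters, here is a deliberately simplified, bandwidth-only estimate of one ring all-reduce. It assumes the standard ring pattern, in which each GPU sends 2 × (N − 1) / N times the message size, and it ignores latency and protocol overhead, so it is optimistic for small messages. The 64 MiB message size is an arbitrary illustration, and the bandwidth figures are the quoted bidirectional numbers from this article, so treat the ratio as more meaningful than the absolute times.

```python
def allreduce_seconds(message_bytes: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    Each GPU sends 2 * (N - 1) / N times the message size over its link.
    Latency and protocol overhead are ignored.
    """
    if n_gpus < 2:
        return 0.0  # nothing to exchange on a single GPU
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return bytes_on_wire / (link_gb_per_s * 1e9)


if __name__ == "__main__":
    message = 64 * 1024 * 1024  # hypothetical 64 MiB all-reduce payload
    pcie = allreduce_seconds(message, 2, 32.0)      # PCIe 4.0 x8
    nvlink = allreduce_seconds(message, 2, 112.5)   # RTX 3090 NVLink bridge
    print(f"PCIe 4.0 x8: {pcie * 1e3:.2f} ms")
    print(f"NVLink 3.0:  {nvlink * 1e3:.2f} ms")
    print(f"ratio:       {pcie / nvlink:.2f}x")
```

Real workloads overlap communication with compute, which is one reason the measured gain (about 50%) is far smaller than the raw bandwidth ratio.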

When a second GPU actually helps — and when it makes things worse

Adding a second GPU is not always an upgrade. The outcome depends entirely on the relationship between model size and your GPU's VRAM.

Scenario 1: Model doesn't fit on one card. If you are trying to run Llama 3.3 70B Q4 (requires ~42 GB) on a single RTX 4090 (24 GB), the model simply cannot load. A second 4090 brings you to 48 GB total and the model runs. In this case, the second card is not optional — it is a requirement.
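The fit check behind this scenario is simple arithmetic. The sketch below is a rough estimate only: the 4.8 bits per weight is an assumed average chosen to reproduce the ~42 GB figure above, and the 2 GB per-GPU overhead for runtime and cache is a placeholder, not a measured value.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(model_gb: float, vram_per_gpu_gb: float,
                 n_gpus: int = 1, overhead_per_gpu_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed per-GPU overhead fit in pooled VRAM."""
    total_vram = vram_per_gpu_gb * n_gpus
    return model_gb + overhead_per_gpu_gb * n_gpus <= total_vram


if __name__ == "__main__":
    size = model_size_gb(70, 4.8)  # assumed average bits per weight for Q4
    print(f"70B at ~Q4: about {size:.0f} GB")
    print("one 24 GB card: ", fits_in_vram(size, 24, n_gpus=1))
    print("two 24 GB cards:", fits_in_vram(size, 24, n_gpus=2))
```

Long contexts grow the KV cache well past the placeholder overhead, so a model that barely fits on paper may still need a smaller context window.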

Scenario 2: Model fits on one card, you add a second anyway. This is where people get surprised. If you are running Ollama with a 14B model that fits comfortably in 24 GB of VRAM, the second GPU has nothing useful to do. Worse, if the model ends up split across both cards (which depends on your Ollama version and scheduler settings), inference gets slower, because every token now requires PCIe transfers between cards that were not necessary when the model lived on one GPU. Ollama's official documentation is consistent with this: a second GPU helps large models that need pooled VRAM, while a model that fits entirely on one GPU performs best when it is loaded on that GPU alone.
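One way to guarantee a small model stays on a single card is to hide the second GPU from Ollama. The sketch below assumes an NVIDIA system where Ollama honors the CUDA_VISIBLE_DEVICES environment variable (its GPU documentation describes this for limiting which GPUs are used) and that the `ollama` binary is on PATH. The helper names are hypothetical.

```python
import os
import subprocess


def single_gpu_env(gpu_index: int, base_env=None) -> dict:
    """Return a copy of the environment that exposes only one NVIDIA GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def start_ollama_on_gpu(gpu_index: int = 0) -> subprocess.Popen:
    """Start `ollama serve` so that it can only see the chosen GPU.

    Assumes `ollama` is on PATH and no other Ollama server is running.
    """
    return subprocess.Popen(["ollama", "serve"], env=single_gpu_env(gpu_index))


if __name__ == "__main__":
    server = start_ollama_on_gpu(0)
    print(f"ollama serve started with PID {server.pid}, limited to GPU 0")
```

If Ollama runs as a system service instead, the same variable would need to be set in the service's environment rather than in a wrapper script.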

Scenario 3: High-concurrency serving. If you are running vLLM and serving 10+ simultaneous users, tensor parallelism across two GPUs can roughly double throughput compared to a single-GPU setup, because both GPUs work on each request in parallel. The PCIe overhead is amortized across many concurrent requests. This is the use case where PCIe multi-GPU genuinely earns its keep even without NVLink.
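In vLLM, tensor parallelism is a launch-time setting. The following is a minimal sketch that only builds the launch command, assuming vLLM is installed and provides the `vllm serve` entry point with its `--tensor-parallel-size` option. The model ID is a placeholder and the helper name is hypothetical.

```python
import shlex


def vllm_serve_command(model: str, n_gpus: int) -> list[str]:
    """Build a `vllm serve` command that shards the model across n_gpus."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return ["vllm", "serve", model, "--tensor-parallel-size", str(n_gpus)]


if __name__ == "__main__":
    # "your-org/your-70b-model" is a placeholder, not a real model ID.
    cmd = vllm_serve_command("your-org/your-70b-model", n_gpus=2)
    print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.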

The decision matrix:

Situation	Add second GPU?	Reasoning
70B+ model, single GPU too small	Yes, required	VRAM pooling is the only path
Personal use, <14B models	No — makes it slower	PCIe overhead > compute gain
vLLM serving, 10+ concurrent users	Yes	Throughput scales well
Fine-tuning / QLoRA	Cloud instead	See cloud GPU math
Ollama, model fits on one card	No	Ollama adds overhead, not speed
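The matrix above can be read as a short sequence of checks. This sketch encodes only the inference rows (fine-tuning is left out), uses the 10-user threshold from the table, and is meant as an illustration of the reasoning rather than a sizing tool. The function name and parameters are made up for this example.

```python
def second_gpu_advice(model_gb: float, vram_gb: float,
                      engine: str = "ollama",
                      concurrent_users: int = 1) -> tuple[bool, str]:
    """Return (add_second_gpu, reason) following the decision matrix."""
    if model_gb > vram_gb:
        return True, "model does not fit on one card; VRAM pooling is required"
    if engine == "vllm" and concurrent_users >= 10:
        return True, "high-concurrency serving scales with tensor parallelism"
    return False, "model fits on one card; splitting adds PCIe overhead"


if __name__ == "__main__":
    cases = [
        ("70B Q4 on a 24 GB card", 42, 24, "ollama", 1),
        ("14B on a 24 GB card", 9, 24, "ollama", 1),
        ("14B served to 16 users", 9, 24, "vllm", 16),
    ]
    for label, model_gb, vram_gb, engine, users in cases:
        add, reason = second_gpu_advice(model_gb, vram_gb, engine, users)
        print(f"{label}: {'yes' if add else 'no'} ({reason})")
```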

The RTX 3090 NVLink setup: what it actually buys you

For home-lab users who specifically want NVLink, this is the only practical path. Two used RTX 3090s connected with an NVLink bridge give you:

48 GB combined VRAM — enough for Llama 3.3 70B at Q4_K_M with context headroom
112.5 GB/s GPU-to-GPU bandwidth, about 3.5× the 32 GB/s of PCIe 4.0 x8
Roughly 50% higher throughput than the same two 3090s without NVLink in 2-GPU tensor-parallel configurations

Hardware required:

Two RTX 3090 cards (NOT 3090 Ti — that card has no NVLink connector)
One NVIDIA NVLink Bridge 4-slot (ASIN B08S1RYPP6 on Amazon, also available from Newegg). Originally $79 MSRP; as of May 2026, available on Amazon and eBay in the $50–80 range
A motherboard with two PCIe x16/x8 slots with sufficient slot spacing for the 4-slot bridge

The thermal reality: Two RTX 3090s at full inference load draw approximately 350 W each, putting the combined GPU power draw at ~700 W. The NVLink bridge sits between the cards, blocking airflow between them. A dual-3090 NVLink rig almost always requires aftermarket solutions: open-air cases, additional case fans directly above the GPU stack, or liquid cooling. The dual RTX 3090 cooling problem is well documented, and addressing it is not optional. Plan the power supply accordingly: a 1200 W or larger PSU is prudent.
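The PSU figure follows from a simple power budget. In this sketch the 350 W per GPU comes from the paragraph above, while the 250 W for the rest of the system and the 25% headroom are assumptions picked for illustration. Adjust both for your own CPU and for transient spikes.

```python
import math


def recommended_psu_watts(gpu_watts: float, n_gpus: int,
                          rest_of_system_watts: float = 250,
                          headroom: float = 1.25,
                          step: int = 50) -> int:
    """Sustained load times a headroom factor, rounded up to a PSU size step."""
    load = gpu_watts * n_gpus + rest_of_system_watts
    return math.ceil(load * headroom / step) * step


if __name__ == "__main__":
    # (2 * 350 + 250) * 1.25 = 1187.5, rounded up to the next 50 W step
    print(recommended_psu_watts(350, 2), "W")
```

Power-limiting the cards lowers the budget, at some cost in throughput.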

For more context on the RTX 3090's value proposition as a single card, see Used RTX 3090 in 2026: Still the AI Value King?

Multi-GPU over PCIe: dual RTX 4090 and beyond

For the majority of multi-GPU home-lab builds in 2026 — dual RTX 4090, dual RTX 5090, any combination without NVLink — PCIe is the interconnect. Here is what to expect.

Dual RTX 4090 running Llama 3.3 70B Q4: approximately 25–30 tokens/sec generation speed with vLLM tensor parallelism. A single RTX 4090 cannot run this model at all (insufficient VRAM), so the comparison i