RoCEv2 (RDMA over Converged Ethernet version 2) has quietly become the dominant GPU interconnect for AI training clusters — and most network engineers haven't noticed yet. For deployments up to ~10K GPUs, properly tuned Ethernet with RoCEv2 delivers 85-95% of InfiniBand's training throughput at a fraction of the cost, using switches and skills you already have.
InfiniBand still wins at the absolute largest scale, but the gap is closing fast. Here's the technical breakdown.
## Why RDMA Matters for AI Training
RDMA (Remote Direct Memory Access) lets one server read or write another server's memory without involving either CPU. Traditional TCP/IP requires multiple CPU interrupts, kernel context switches, and memory copies on both ends. RDMA bypasses all of that, cutting end-to-end latency from tens or hundreds of microseconds to single-digit microseconds.
Distributed AI training makes this essential. When training an LLM across thousands of GPUs, gradient synchronization (the math that makes the model learn) generates terabytes of east-west traffic per step, and every collective must finish before the next compute phase can start. Any latency or packet loss = idle GPUs = burning money.
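To get a feel for the traffic volumes involved, here is a hypothetical back-of-envelope sketch of per-GPU all-reduce traffic. It assumes fp16 gradients and a ring all-reduce (where each GPU sends roughly 2·(n−1)/n times the model size per step); real jobs shard this across tensor/pipeline parallel groups, so treat the numbers as an upper bound, not a measurement.

```python
# Back-of-envelope: bytes each GPU sends per training step in a ring
# all-reduce of the full gradient. Illustrative only; parallelism
# strategy and gradient compression change the real number.

def allreduce_bytes_per_gpu(params: float, n_gpus: int, bytes_per_grad: int = 2) -> float:
    model_bytes = params * bytes_per_grad          # fp16 => 2 bytes per gradient
    return 2 * (n_gpus - 1) / n_gpus * model_bytes # ring all-reduce send volume

# A 70B-parameter model across 1,024 GPUs:
traffic = allreduce_bytes_per_gpu(70e9, 1024)
print(f"{traffic / 1e9:.0f} GB sent per GPU per step")  # ~280 GB
```

At 400 Gbps (50 GB/s) per NIC, moving that much data dominates step time unless the fabric runs near line rate with no loss, which is exactly the problem RDMA fabrics are built to solve.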
Three RDMA implementations exist:
| Technology | Transport | Ecosystem | Best For |
|---|---|---|---|
| InfiniBand | Native IB | NVIDIA proprietary | Largest clusters (10K+ GPUs) |
| RoCEv2 | UDP/IP over Ethernet | Open (Cisco, Arista, Broadcom) | Most AI deployments (256-10K GPUs) |
| iWARP | TCP/IP | Limited adoption | Legacy HPC, declining |
## Performance: How Close Is RoCEv2 Really?
Properly configured Ethernet RoCE delivers 85-95% of InfiniBand's throughput for tier 2/3 deployments (256-1,024 GPUs). The remaining gap comes from:
- Congestion management: InfiniBand uses credit-based flow control that's inherently lossless. RoCEv2 relies on PFC + ECN — effective but needs tuning
- Adaptive routing: InfiniBand handles congestion at the fabric level. Ethernet uses ECMP and flowlet switching, which can create hotspots
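The ECMP hotspot problem above is easy to see with a toy model: a handful of long-lived elephant flows (exactly what RDMA produces) hashed across four uplinks will, by pigeonhole, pile several flows onto one link while others sit idle. The hash below is an illustrative stand-in; real switches hash the 5-tuple in the ASIC.

```python
# Toy ECMP placement: 8 elephant flows hashed onto 4 uplinks.
# RoCEv2 rides UDP destination port 4791, so the 5-tuple varies
# mostly by source/destination IP here.
import zlib
from collections import Counter

LINKS = 4
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 4791) for i in range(8)]
placement = Counter(zlib.crc32(repr(f).encode()) % LINKS for f in flows)
print(dict(placement))  # uneven: some uplinks carry several flows, others may carry none
```

With few, large flows the law of large numbers never kicks in, which is why AI fabrics lean on flowlet switching, per-packet spraying, or topology-aware rank placement instead of plain ECMP.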
IBM Research has demonstrated that careful network design with H100 GPUs and 400G ConnectX-7 NICs closes most of the remaining gap.
## Meta's 24,000-GPU Proof Point
The strongest evidence for RoCEv2 at scale: Meta built two parallel 24K-GPU clusters — one RoCEv2 on Arista 7800 switches, one InfiniBand on NVIDIA Quantum-2. Both at 400 Gbps.
Key findings from their SIGCOMM 2024 paper:
- RoCEv2 fabric successfully trained LLaMA 3.1 405B and other hundred-billion-parameter models
- Required NIC PCIe credit tuning, relaxed ordering, and topology-aware rank assignment
- The Ethernet cluster matched training requirements despite conventional wisdom
The important part: Meta's RoCE fabric runs on standard Ethernet protocols — spine-leaf topology, ECMP, QoS, standard switching.
## Cost and Ecosystem: Where Ethernet Destroys InfiniBand
| Factor | InfiniBand | RoCEv2 |
|---|---|---|
| Switch cost | 2-3x Ethernet | Standard pricing |
| NIC cost | ConnectX (IB mode) | Same NIC, Ethernet mode |
| Cabling | Proprietary IB | Standard Ethernet/fiber |
| Vendor choice | NVIDIA only | Cisco, Arista, Broadcom, etc. |
| Engineering talent | Scarce IB experts | Abundant Ethernet engineers |
| Multi-tenancy | Limited | Full VXLAN EVPN support |
| Reuse existing infra | No | Yes |
## Making Ethernet Lossless: The Three Pillars
Standard Ethernet drops packets when buffers fill. RoCEv2 can't tolerate drops: its go-back-N loss recovery retransmits everything from the lost packet forward, so even a small loss rate collapses throughput. Three technologies make Ethernet effectively lossless:
### 1. Priority Flow Control (PFC) — IEEE 802.1Qbb
PFC sends pause frames for a specific traffic class when buffers approach capacity. Unlike legacy 802.3x PAUSE (which stops everything), PFC only pauses the RDMA priority while other traffic flows normally.
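Pause frames take time to act: after the switch signals XOFF, in-flight bytes keep arriving for a round trip plus the sender's response time, so each no-drop queue needs reserved headroom. The sketch below is an illustrative estimate with assumed numbers (5 ns/m propagation, 1 µs response, 9216-byte jumbo frames), not a vendor sizing formula.

```python
# Rough PFC headroom estimate per no-drop queue. After a PAUSE is sent,
# the peer keeps transmitting for one cable round trip plus its response
# time; the switch must absorb those bytes without dropping.

def pfc_headroom_bytes(rate_gbps: float, cable_m: float,
                       response_us: float = 1.0, mtu: int = 9216) -> float:
    bytes_per_us = rate_gbps * 125.0     # 1 Gbps = 125 bytes/us
    prop_us = cable_m * 0.005            # ~5 ns per meter of fiber
    in_flight = bytes_per_us * (2 * prop_us + response_us)
    return in_flight + 2 * mtu           # plus one max-size frame each direction

# 400G link over a 100 m run:
print(f"{pfc_headroom_bytes(400, 100) / 1024:.0f} KiB per no-drop queue")  # ~116 KiB
```

Multiply that by every no-drop queue on every port and it's clear why headroom budgeting, not just enabling PFC, is the real configuration work.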
```
! Cisco Nexus 9000 — enable PFC on RDMA priority class
interface Ethernet1/1
  priority-flow-control mode on
  priority-flow-control priority 3 no-drop
```
⚠️ Critical pitfall: PFC can cause deadlocks. A pause cascade can create circular dependencies that freeze traffic. Prevention requires careful buffer allocation and limiting PFC to a single priority class.
### 2. Explicit Congestion Notification (ECN)
ECN marks packets instead of dropping them. The receiver sees the marking and sends a Congestion Notification Packet (CNP) back to the sender, which throttles its rate. This is the basis of DCQCN (Data Center Quantized Congestion Notification) — the standard congestion control for RoCEv2.
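A heavily simplified sketch of DCQCN's sender side makes the feedback loop concrete: each CNP cuts the rate multiplicatively, and quiet periods let it recover toward the pre-cut rate. Real implementations add byte counters, timers, and hyper-increase stages; the class below is an illustration, not the full algorithm.

```python
# Minimal DCQCN-style sender rate control (simplified).
# alpha is an EWMA estimate of congestion; a CNP raises it and cuts
# the rate, CNP-free intervals decay it and recover the rate.

class DcqcnSender:
    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.rate = self.target = line_rate_gbps
        self.alpha = 1.0
        self.g = g                                   # EWMA gain

    def on_cnp(self):
        self.target = self.rate                      # remember rate before the cut
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rate *= (1 - self.alpha / 2)            # multiplicative decrease

    def on_timer_no_cnp(self):
        self.alpha *= (1 - self.g)                   # congestion estimate decays
        self.rate = (self.rate + self.target) / 2    # fast recovery toward target

s = DcqcnSender(400.0)
s.on_cnp()
print(round(s.rate, 1))   # 200.0: the first CNP halves the rate (alpha starts at 1)
```

The point of the multiplicative cut plus gradual recovery is to drain queues quickly after a mark while converging back to line rate when congestion clears.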
How PFC and ECN work together:
- ECN = early warning → sender throttles before buffers fill
- PFC = safety net → pauses only when ECN wasn't enough
- Together = lossless delivery without PFC storms
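The ordering in those bullets can be sketched as a single egress-queue decision: the ECN mark threshold sits well below the PFC pause threshold, so senders are asked to slow down long before the switch has to pause anyone. Thresholds here are illustrative (matching the example config below).

```python
# Toy egress-queue policy: ECN marks fire early, PFC pause is the
# last resort. Threshold values are illustrative, in KB of queue depth.
ECN_MIN_KB, PFC_XOFF_KB = 500, 1500

def queue_action(depth_kb: float) -> str:
    if depth_kb >= PFC_XOFF_KB:
        return "send PFC pause (safety net)"
    if depth_kb >= ECN_MIN_KB:
        return "mark ECN (early warning)"
    return "forward unmarked"

for depth in (100, 800, 1600):
    print(depth, "->", queue_action(depth))
```

Tuning is mostly about keeping that gap wide enough that DCQCN reacts before the queue ever reaches XOFF.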
```
! Arista 7800 — ECN at egress queue for AI fabric
interface Ethernet6/1/1
  tx-queue 6
    random-detect ecn minimum-threshold 500 kbytes maximum-threshold 1500 kbytes
```
### 3. Deep Buffers
AI switches need significantly more packet buffer than traditional DC switches. Deep-buffer switches (32-64 MB) handle the bursty all-reduce patterns where thousands of GPUs synchronize simultaneously.
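The buffer requirement falls out of simple incast math: if k senders burst a synchronized chunk into one egress port, arrivals run at k times line rate while the port drains at one times line rate, so the queue backlog peaks at roughly (k−1) times the per-sender burst. Numbers below are illustrative.

```python
# Incast backlog estimate: k ranks burst B bytes each into one port.
# While the burst arrives at k x line rate and drains at 1 x, the
# queue peaks at about (k - 1) * B.

def incast_backlog_mb(k_senders: int, burst_kb_each: float) -> float:
    return (k_senders - 1) * burst_kb_each / 1024

# 64 ranks bursting 512 KB each into one port:
print(f"{incast_backlog_mb(64, 512):.1f} MB peak backlog")  # 31.5 MB
```

A single 64-way incast with half-megabyte bursts already fills a 32 MB shared buffer, which is why burst absorption, not average utilization, drives switch selection for AI fabrics.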
## What Vendors Are Shipping for AI Fabrics
Cisco:
- Nexus N9364E-GX2A: 64-port 800G (Silicon One G300), PFC/ECN/deep buffers
- Nexus N9100: Co-developed with NVIDIA (Spectrum-4 ASIC), 64-port 800G for AI/HPC
- Nexus HyperFabric: Turnkey AI infrastructure with integrated GPUs
Arista:
- 7800R Series: Chassis-based 800G with Etherlink AI suite, DCQCN, PFC watchdog
- 7060X Series: Fixed-form 400G/800G leaf switches for AI pods
## Where AI Networking Is Heading
- Ultra Ethernet Consortium (UEC): Building next-gen Ethernet with built-in reliability that eliminates PFC entirely — aiming to match InfiniBand's native RDMA capabilities while keeping Ethernet's open ecosystem
- 800G → 1.6T optics: Silicon One G300 and Spectrum-X are 800G today, 1.6T on the roadmap
- Distributed AI clusters: GPU clusters are spanning facilities, requiring deep networking expertise for inter-site RDMA
## TL;DR Decision Matrix
| Your Situation | Choose |
|---|---|
| < 10K GPUs, enterprise | RoCEv2 — lower cost, existing skills, multi-tenant capable |
| 10K+ GPUs, pure AI training | InfiniBand — still lowest latency at extreme scale |
| Mixed AI + traditional workloads | RoCEv2 — VXLAN EVPN overlay handles both |
| Budget constrained | RoCEv2 — reuse existing Ethernet fabric and talent |
The bottom line: unless you're building a frontier-model training cluster, RoCEv2 on standard Ethernet is the pragmatic choice — and the skills to build it (lossless Ethernet, VXLAN EVPN, QoS at scale) are standard network engineering.
Originally published at FirstPassLab. For more deep dives on data center networking and network automation, check out firstpasslab.com.
🤖 AI Disclosure: This article was adapted from the original with AI assistance. Technical content has been reviewed for accuracy.