FirstPassLab

Posted on • Originally published at firstpasslab.com

RoCE vs InfiniBand: Why Ethernet Is Winning the AI Data Center Networking War

RoCEv2 (RDMA over Converged Ethernet version 2) has quietly become the dominant GPU interconnect for AI training clusters — and most network engineers haven't noticed yet. For deployments up to ~10K GPUs, properly tuned Ethernet with RoCEv2 delivers 85-95% of InfiniBand's training throughput at a fraction of the cost, using switches and skills you already have.

InfiniBand still wins at the absolute largest scale, but the gap is closing fast. Here's the technical breakdown.

Why RDMA Matters for AI Training

RDMA (Remote Direct Memory Access) lets one server read or write another server's memory without involving either CPU in the data path. Traditional TCP/IP requires multiple CPU interrupts, kernel context switches, and memory copies per transfer. RDMA eliminates all of that, cutting end-to-end latency from tens or hundreds of microseconds down to single-digit microseconds.

Distributed AI training makes this essential. When training an LLM across thousands of GPUs, gradient synchronization (the all-reduce step that makes the model learn) generates terabytes of east-west traffic that must complete in milliseconds. Any latency or packet loss means idle GPUs, and idle GPUs mean burning money.
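For a sense of scale, here is a back-of-the-envelope sketch of per-GPU all-reduce traffic. The parameter count, gradient precision, and GPU count are illustrative assumptions, not measurements from any specific cluster:

```python
# Back-of-the-envelope: ring all-reduce moves about 2*(N-1)/N times
# the gradient buffer per GPU. Model size, precision, and GPU count
# below are illustrative assumptions.

def allreduce_bytes_per_gpu(params, bytes_per_param, n_gpus):
    """Per-GPU bytes on the wire for one full ring all-reduce."""
    buffer_bytes = params * bytes_per_param
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

# Example: 405B parameters, fp16 gradients (2 bytes/param), 1,024 GPUs
traffic = allreduce_bytes_per_gpu(405e9, 2, 1024)
print(f"{traffic / 1e12:.2f} TB per GPU per full all-reduce")  # ~1.62 TB
```

Real frameworks overlap many smaller all-reduces with compute, but the aggregate volume is why lossless, full-bisection fabrics matter.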

Three RDMA implementations exist:

| Technology | Transport | Ecosystem | Best For |
| --- | --- | --- | --- |
| InfiniBand | Native IB | NVIDIA proprietary | Largest clusters (10K+ GPUs) |
| RoCEv2 | UDP/IP over Ethernet | Open (Cisco, Arista, Broadcom) | Most AI deployments (256-10K GPUs) |
| iWARP | TCP/IP | Limited adoption | Legacy HPC, declining |

Performance: How Close Is RoCEv2 Really?

Properly configured RoCEv2 delivers 85-95% of InfiniBand's throughput for mid-size deployments (256-1,024 GPUs). The remaining gap comes from:

  • Congestion management: InfiniBand uses credit-based flow control that's inherently lossless. RoCEv2 relies on PFC + ECN — effective but needs tuning
  • Adaptive routing: InfiniBand handles congestion at the fabric level. Ethernet uses ECMP and flowlet switching, which can create hotspots
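The hotspot problem is easy to quantify: ECMP hashes each flow to a single uplink, and AI training produces a few long-lived elephant flows, so link collisions follow birthday-problem odds. A small sketch, where the flow and uplink counts are illustrative assumptions:

```python
# Birthday-problem odds that at least two of k elephant flows hash to
# the same uplink out of m equal-cost links (assumes a uniform hash;
# k and m below are illustrative).

def collision_probability(k, m):
    """P(at least one pair of k flows shares one of m links)."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (m - i) / m
    return 1 - p_all_distinct

# Example: 8 elephant flows, 16 spine uplinks
print(f"{collision_probability(8, 16):.0%} chance of a hotspot")  # 88%
```

This is why flowlet switching and adaptive load balancing matter: per-flow hashing alone leaves a large residual collision probability with elephant-heavy traffic.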

IBM Research has demonstrated that careful network design with H100 GPUs and 400G ConnectX-7 NICs closes most of the remaining gap.

Meta's 24,000-GPU Proof Point

The strongest evidence for RoCEv2 at scale: Meta built two parallel 24K-GPU clusters — one RoCEv2 on Arista 7800 switches, one InfiniBand on NVIDIA Quantum-2. Both at 400 Gbps.

Key findings from their SIGCOMM 2024 paper:

  • RoCEv2 fabric successfully trained Llama 3.1 405B and other hundred-billion-parameter models
  • Required NIC PCIe credit tuning, relaxed ordering, and topology-aware rank assignment
  • The Ethernet cluster matched training requirements despite conventional wisdom

The important part: Meta's RoCE fabric runs on standard Ethernet protocols — spine-leaf topology, ECMP, QoS, standard switching.

Cost and Ecosystem: Where Ethernet Destroys InfiniBand

| Factor | InfiniBand | RoCEv2 |
| --- | --- | --- |
| Switch cost | 2-3x Ethernet | Standard pricing |
| NIC cost | ConnectX (IB mode) | Same NIC, Ethernet mode |
| Cabling | Proprietary IB | Standard Ethernet/fiber |
| Vendor choice | NVIDIA only | Cisco, Arista, Broadcom, etc. |
| Engineering talent | Scarce IB experts | Abundant Ethernet engineers |
| Multi-tenancy | Limited | Full VXLAN EVPN support |
| Reuse existing infra | No | Yes |

Making Ethernet Lossless: The Three Pillars

Standard Ethernet drops packets when buffers fill. RoCEv2 tolerates drops very poorly: a single lost packet triggers go-back-N recovery on most RDMA NICs, stalling the entire queue pair. Three technologies make it work:

1. Priority Flow Control (PFC) — IEEE 802.1Qbb

PFC sends pause frames for a specific traffic class when buffers approach capacity. Unlike legacy 802.3x PAUSE (which stops everything), PFC only pauses the RDMA priority while other traffic flows normally.

```
! Cisco Nexus 9000 — enable PFC on RDMA priority class
interface Ethernet1/1
  priority-flow-control mode on
  priority-flow-control priority 3 no-drop
```

⚠️ Critical pitfall: PFC can cause deadlocks. A pause cascade can create circular dependencies that freeze traffic. Prevention requires careful buffer allocation and limiting PFC to a single priority class.

2. Explicit Congestion Notification (ECN)

ECN marks packets instead of dropping them. The receiver sees the marking and sends a Congestion Notification Packet (CNP) back to the sender, which throttles its rate. This is the basis of DCQCN (Data Center Quantized Congestion Notification) — the standard congestion control for RoCEv2.

How PFC and ECN work together:

  • ECN = early warning → sender throttles before buffers fill
  • PFC = safety net → pauses only when ECN wasn't enough
  • Together = lossless delivery without PFC storms

```
! Arista 7800 — ECN at egress queue for AI fabric
interface Ethernet6/1/1
  tx-queue 6
    random-detect ecn minimum-threshold 500 kbytes maximum-threshold 1500 kbytes
```
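The DCQCN control loop described above can be sketched as sender-side logic. This is a simplified model of the published DCQCN algorithm; the gain constant `g` and the recovery step are illustrative choices, not vendor defaults:

```python
# Simplified DCQCN sender model: rate is cut on each CNP, alpha tracks
# recent congestion, and rate recovers toward the pre-cut target during
# quiet periods. g and the recovery step are illustrative assumptions.

class DcqcnSender:
    def __init__(self, line_rate_gbps, g=1 / 256):
        self.rate = line_rate_gbps    # current send rate (Rc)
        self.target = line_rate_gbps  # target rate to recover toward (Rt)
        self.alpha = 1.0              # congestion estimate
        self.g = g

    def on_cnp(self):
        """CNP arrived: remember the old rate, cut proportionally to alpha."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self):
        """No CNP for a timer period: decay alpha, recover halfway to target."""
        self.alpha *= 1 - self.g
        self.rate = (self.rate + self.target) / 2

nic = DcqcnSender(400.0)
nic.on_cnp()
print(f"after first CNP: {nic.rate:.0f} Gbps")      # 400 * (1 - 1/2) = 200
nic.on_quiet_interval()
print(f"after recovery step: {nic.rate:.0f} Gbps")  # (200 + 400) / 2 = 300
```

The point of the sketch: the rate cut is proportional to how much congestion was recently seen, so a single stray ECN mark barely slows a flow, while sustained marking throttles it hard.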

3. Deep Buffers

AI switches need significantly more packet buffer than traditional DC switches. Deep buffer switches (32-64MB) handle the bursty all-reduce patterns where thousands of GPUs synchronize simultaneously.
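Why so much buffer? A rough incast estimate: when N senders burst at line rate toward one egress port, the switch must absorb the excess until ECN/PFC reacts. The reaction time below is an illustrative assumption on the order of a fabric control-loop delay:

```python
# Rough incast buffer demand: n senders at line rate converge on one
# egress port; the buffer absorbs the excess until congestion control
# reacts. The reaction time is an illustrative assumption.

def buffer_needed_mb(n_senders, link_gbps, reaction_us):
    """Excess arrival rate times reaction time = burst the buffer must hold."""
    excess_gbps = (n_senders - 1) * link_gbps  # arrivals beyond egress drain rate
    bits = excess_gbps * 1e9 * reaction_us * 1e-6
    return bits / 8 / 1e6  # bits -> bytes -> MB

# Example: 8 ports bursting at 400G, ~10 us control-loop delay
print(f"{buffer_needed_mb(8, 400, 10):.1f} MB to ride out the burst")  # 3.5 MB
```

Even a ~10 microsecond control loop at 400G demands megabytes per congested port, which is why deep-buffer ASICs are common in AI fabrics.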

What Vendors Are Shipping for AI Fabrics

Cisco:

  • Nexus N9364E-GX2A: 64-port 800G (Silicon One G300), PFC/ECN/deep buffers
  • Nexus N9100: Co-developed with NVIDIA (Spectrum-4 ASIC), 64-port 800G for AI/HPC
  • Nexus HyperFabric: Turnkey AI infrastructure with integrated GPUs

Arista:

  • 7800R Series: Chassis-based 800G with Etherlink AI suite, DCQCN, PFC watchdog
  • 7060X Series: Fixed-form 400G/800G leaf switches for AI pods

Where AI Networking Is Heading

  • Ultra Ethernet Consortium (UEC): Building next-gen Ethernet with built-in reliability that eliminates PFC entirely — aiming to match InfiniBand's native RDMA capabilities while keeping Ethernet's open ecosystem
  • 800G → 1.6T optics: Silicon One G300 and Spectrum-X are 800G today, 1.6T on the roadmap
  • Distributed AI clusters: GPU clusters are spanning facilities, requiring deep networking expertise for inter-site RDMA

TL;DR Decision Matrix

| Your Situation | Choose |
| --- | --- |
| < 10K GPUs, enterprise | RoCEv2 — lower cost, existing skills, multi-tenant capable |
| 10K+ GPUs, pure AI training | InfiniBand — still lowest latency at extreme scale |
| Mixed AI + traditional workloads | RoCEv2 — VXLAN EVPN overlay handles both |
| Budget constrained | RoCEv2 — reuse existing Ethernet fabric and talent |

The bottom line: unless you're building a frontier-model training cluster, RoCEv2 on standard Ethernet is the pragmatic choice — and the skills to build it (lossless Ethernet, VXLAN EVPN, QoS at scale) are standard network engineering.


Originally published at FirstPassLab. For more deep dives on data center networking and network automation, check out firstpasslab.com.


🤖 AI Disclosure: This article was adapted from the original with AI assistance. Technical content has been reviewed for accuracy.
