NVIDIA B200/B300/GB200/GB300 Cluster Interconnect Architecture Analysis

NVIDIA's latest AI platforms—including B200, B300, GB200, and GB300—introduce cluster interconnect designs that combine NVLink fabrics, high-performance NICs, and large-scale switching networks. This article explores how these technologies work together, from node-level GPU communication to rack-scale NVL72 systems and large-scale SuperPod cluster architectures.

DGX and NVL72 Infrastructure Explained

DGX B200 and DGX B300 Single-Node Architecture

In most enterprise and hyperscale AI deployments, GPUs are organized into standardized compute nodes. NVIDIA B200 and B300 platforms typically follow the same design pattern used in DGX or HGX systems, where a single node integrates eight GPUs within a unified architecture. Inside the node, the eight GPUs are fully interconnected through NVLink and NVSwitch, providing high-bandwidth, all-to-all communication between GPUs.

To connect GPU nodes to the cluster network, each system integrates multiple high-speed network interface cards (NICs). These NICs provide the external connectivity required for multi-node training workloads where thousands of GPUs must communicate across racks and data center fabrics. In B200-based systems, high-performance 400Gb/s network adapters (ConnectX-7 SuperNICs) are commonly deployed. B300 platforms are expected to adopt newer 800Gb/s-class adapters (ConnectX-8 SuperNICs), significantly increasing network bandwidth for AI clusters.
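
For a rough sense of per-node external bandwidth, here is a minimal Python sketch. The assumption of eight compute-fabric NICs per 8-GPU node is for illustration only (actual NIC counts vary by system); the per-NIC speeds are those quoted above.

```python
# Rough per-node external bandwidth estimate.
# Assumption: 8 compute-fabric NICs per 8-GPU node (illustrative; actual counts vary by system).
NICS_PER_NODE = 8

for platform, nic_gbps in {"B200 (ConnectX-7)": 400, "B300 (ConnectX-8)": 800}.items():
    total_tbps = NICS_PER_NODE * nic_gbps / 1000
    print(f"{platform}: {NICS_PER_NODE} x {nic_gbps} Gb/s = {total_tbps:.1f} Tb/s per node")
```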

Cooling solutions for these systems vary depending on deployment density. While air cooling remains possible in certain configurations, large-scale AI clusters increasingly adopt liquid cooling to support higher power density and improved thermal efficiency.

DGX B300 Single-Node System
Figure 1: DGX B300 Single-Node System (Source: NVIDIA)

Rack-Scale Architecture: GB200 and GB300 NVL72

While DGX systems represent node-level building blocks, NVIDIA's GB200 and GB300 platforms introduce a much denser rack-scale architecture designed for hyperscale AI infrastructure. The NVL72 system integrates 72 GPUs within a single rack, creating one of the highest-density GPU computing platforms available today. This design significantly reduces communication distance between GPUs while maximizing compute density inside the data center.

Within the NVL72 architecture, GPUs are distributed across multiple compute trays and interconnected through a dedicated NVLink switching domain. A total of 18 NVSwitch chips form the switching fabric that connects all 72 GPUs within the rack, enabling extremely high internal bandwidth. This NVLink domain allows GPUs to communicate at speeds far exceeding traditional cluster networking, which is particularly beneficial for large AI training jobs that require frequent data exchange.
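
A quick port-accounting sketch shows how an 18-chip switching fabric matches 72 GPUs. The per-GPU link count and per-chip port count below are commonly cited NVLink 5 figures but are assumptions here, not numbers stated in this article:

```python
# NVL72 NVLink port accounting (a sketch; per-GPU link and per-chip port counts are assumptions).
GPUS = 72
NVSWITCH_CHIPS = 18          # stated above
LINKS_PER_GPU = 18           # assumed NVLink 5 links per Blackwell GPU
PORTS_PER_NVSWITCH = 72      # assumed NVLink ports per NVSwitch chip

gpu_side_links = GPUS * LINKS_PER_GPU                     # 1296 links leaving the GPUs
switch_side_ports = NVSWITCH_CHIPS * PORTS_PER_NVSWITCH   # 1296 ports on the switch fabric
assert gpu_side_links == switch_side_ports
print(f"{gpu_side_links} GPU NVLink links terminate on {switch_side_ports} NVSwitch ports")
```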

Each compute tray typically integrates multiple GPU modules together with CPUs and system memory, forming the core building blocks of the rack-level system.

Because of the extremely high compute density, NVL72 racks operate at very high power levels—often exceeding 100 kW per rack. As a result, liquid cooling is generally required to maintain stable operation and improve energy efficiency.

External cluster connectivity is provided through high-speed NICs installed within the compute trays. Earlier deployments such as GB200 systems typically use 400Gb/s (ConnectX-7) networking, while next-generation GB300 platforms are expected to move toward 800Gb/s (ConnectX-8) cluster networking.

GB200 and GB300 NVL72 Rack System View
Figure 2: GB200 and GB300 NVL72 Rack System View (Source: NVIDIA)

Cluster Interconnect Hardware: NICs and Switches

Large-scale AI clusters rely on specialized networking hardware designed to deliver extremely high throughput and low latency. NVIDIA has launched multiple generations of specialized hardware for the B/GB series, forming a complete system from NICs to Ethernet and InfiniBand (IB) switches:

Dedicated NICs: CX8/CX9 SuperNIC

ConnectX-8 SuperNIC: The standard network adapter for B300 servers and the core networking hardware of current new-generation compute clusters. Key features:

  • Integration: Includes an integrated PCIe switch with native support for PCIe Gen6 ports. All current B300 servers adopt this integrated design rather than a standalone PCIe Gen6 switch, and it is expected to remain the mainstream solution for the foreseeable future.

  • Port Modes: Supports 1 x 800Gb/s port or 2 x 400Gb/s ports in InfiniBand mode. In Ethernet mode, it does not support 800Gb/s ports and can only use 2 x 400Gb/s ports.

CX9 SuperNIC: NVIDIA's next-generation dedicated NIC.

  • Core Upgrade: Expected to add the 800Gb/s Ethernet support that the CX-8 lacks, removing the current Ethernet bandwidth ceiling and making it easier for large-scale GPU clusters to integrate with standard data center networking infrastructure.

Cluster Switching Infrastructure: InfiniBand and Ethernet

AI clusters require powerful switching platforms capable of handling massive east-west traffic between GPUs. NVIDIA provides both InfiniBand and Ethernet switches to adapt to different cluster needs:

Quantum-2 InfiniBand Switch (QM9700): Quantum-2 switches provide 64 ports operating at 400Gb/s, for a total bidirectional bandwidth of 51.2 Tb/s (64 * 400Gb/s * 2 = 51.2 Tb/s). These switches form the backbone of many B200 and GB200 clusters that rely on InfiniBand networking.

Spectrum-X800 SN5600 Ethernet Switch: The Spectrum-X SN5600 is designed for high-performance AI Ethernet networks. It supports up to 64 ports operating at 800Gb/s or 128 ports at 400Gb/s. In a two-tier non-blocking network, it supports up to 2,048 GPUs (64*64/2=2048) at 800Gb/s or 8,192 GPUs (128*128/2=8192) at 400Gb/s. It can be used for the B300 cluster reference architecture.

Quantum-X800 Q3400 InfiniBand Switch: Core supporting hardware for the GB300 cluster, providing 144 ports operating at 800Gb/s. It supports up to 10,368 GPUs (144*144/2=10368) in a two-tier non-blocking network, making it the highest-scale dedicated InfiniBand switch currently available.
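
The two-tier GPU counts quoted above all follow the same non-blocking leaf/spine arithmetic: with R-port switches, each Leaf splits its ports evenly between downlinks and uplinks, so the fabric tops out at R*R/2 endpoints. A minimal sketch reproducing the figures:

```python
# Maximum endpoints in a two-tier non-blocking leaf/spine fabric built from R-port switches:
# up to R leaf switches, each with R/2 downlinks, gives R * R / 2 endpoints.
def max_gpus_two_tier(radix: int) -> int:
    return radix * radix // 2

switches = {
    "Quantum-2 QM9700 (64 x 400Gb/s)": 64,
    "Spectrum-X800 SN5600 (128 x 400Gb/s mode)": 128,
    "Quantum-X800 Q3400 (144 x 800Gb/s)": 144,
}
for name, radix in switches.items():
    print(f"{name}: up to {max_gpus_two_tier(radix)} GPUs")
# -> 2048, 8192, and 10368, matching the figures quoted above.
```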

NVIDIA SuperPod GPU Cluster Reference Architectures

NVIDIA's SuperPod architecture provides standardized deployment models for hyperscale GPU clusters. These reference designs combine compute nodes, networking infrastructure, and optimized topology layouts to simplify cluster deployment. Different SuperPod architectures exist for B200, B300, GB200, and GB300 systems, with differences mainly in networking technology and scalability.

B200 SuperPod Reference Architecture

B200 SuperPods typically use Quantum-2 QM9700 InfiniBand switches operating at 64 x 400Gb/s. These clusters can be deployed using either two-tier or three-tier network topologies depending on the desired cluster size.

Two-Tier Non-Blocking Network (4 SUs, 127 nodes): Theoretically supports up to 2,048 GPUs (64*64/2=2048). The actual deployment includes 4 Scalable Units (SUs) with 32 nodes per SU. Because the Leaf Switches of the last SU must also connect to the UFM (Unified Fabric Manager), one node is dropped, so the deployment tops out at 127 nodes and the actual number of GPUs is slightly lower than the theoretical value.
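
A small sketch of the node and GPU counts just described, assuming the 8-GPU node configuration discussed earlier in the article:

```python
# DGX B200 SuperPod, two-tier deployment (figures as quoted above).
SUS = 4
NODES_PER_SU = 32
GPUS_PER_NODE = 8                         # 8-GPU DGX/HGX B200 nodes

theoretical_nodes = SUS * NODES_PER_SU    # 128 nodes
deployed_nodes = theoretical_nodes - 1    # one node gives up its ports for the UFM connection
print(f"{deployed_nodes} nodes, {deployed_nodes * GPUS_PER_NODE} GPUs "
      f"(vs. the 2048-GPU theoretical two-tier ceiling)")
```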

Compute fabric for full 127-node DGX B200 SuperPOD
Figure 3: Compute fabric for full 127-node DGX B200 SuperPOD (Source: NVIDIA)

Three-Tier Network: Supports ultra-large-scale clusters (the same approach used for H100-based SuperPods). 64 SUs can support 2,048 nodes and 16,384 B200 GPUs, requiring 1,280 QM9700 IB switches (256 + 512 + 512 = 1280).

Larger DGX B200 SuperPOD component counts
Figure 4: Larger DGX B200 SuperPOD component counts (Source: NVIDIA)

Alternative: Using SN5600 Ethernet switches in a two-tier network can support up to 8,192 B200 GPUs (128*128/2 = 8192).

B300 SuperPod Reference Architecture

B300 SuperPods introduce a stronger focus on high-performance Ethernet networking. NVIDIA adopts the Spectrum-X800 SN5600 Ethernet switch for the back-end (compute) network of the DGX B300 SuperPod. In its 64 x 800Gb/s port mode, a two-layer non-blocking architecture supports a maximum of 2,048 GPUs.

However, the CX-8 does not support 800Gb/s Ethernet ports. To support more GPUs, NVIDIA adopts a multi-plane design with two planes: each 800Gb/s NIC is split into 2 x 400Gb/s ports, each port belonging to its own communication plane, so the back-end network can be regarded as two parallel, independent 400Gb/s networks. The core deployment details are as follows:

Compute fabric for full 512-node DGX B300 SuperPOD
Figure 5: Compute fabric for full 512-node DGX B300 SuperPOD (Source: NVIDIA)

Single-node Configuration: A single B300 node contains 8 B300 GPUs and 16 x 400Gb/s ports, with 8 ports forming each communication plane; the two planes run independently.

Single-SU configuration: Each SU contains 64 B300 nodes (512 B300 GPUs) connected to Leaf Switches. Each SU is equipped with 16 SN5600 Leaf Switches running in 128 x 400Gb/s port mode, with 8 switches per plane. That corresponds to 8*128=1024 400Gb/s ports per plane, half connected to GPU network adapters and half to Spine Switches.

Scale expansion: Multiple SUs are interconnected via Spine Switches, and 16 SUs can support 8,192 B300 GPUs. Across both planes this requires 256 Leaf Switches and 128 Spine Switches, all SN5600 (each plane uses 8*16=128 Leaf Switches and 64 Spine Switches, so the two planes need 64*2=128 Spine Switches; see the accounting sketch after this list).

Two-layer non-blocking network: When running in 800Gbps Port mode, it theoretically supports a maximum of 2048 GPUs.
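
Here is a minimal accounting sketch for the dual-plane figures above; all inputs are the numbers quoted in this section:

```python
# DGX B300 SuperPod dual-plane accounting (inputs are the figures quoted above).
SUS = 16
NODES_PER_SU = 64
GPUS_PER_NODE = 8
PORTS_400G_PER_NODE = 16        # 8 x CX-8, each exposed as 2 x 400Gb/s ports
PLANES = 2
LEAF_PORTS = 128                # SN5600 in 128 x 400Gb/s mode

downlinks_per_plane_per_su = NODES_PER_SU * PORTS_400G_PER_NODE // PLANES   # 512
leaf_per_plane_per_su = downlinks_per_plane_per_su // (LEAF_PORTS // 2)     # 8 (half down, half up)
leaf_total = leaf_per_plane_per_su * PLANES * SUS                           # 256 leaf switches
uplinks_per_plane = (leaf_total // PLANES) * (LEAF_PORTS // 2)              # 128 * 64 = 8192
spine_total = (uplinks_per_plane // LEAF_PORTS) * PLANES                    # 64 per plane -> 128
gpus = SUS * NODES_PER_SU * GPUS_PER_NODE                                   # 8192
print(f"{gpus} GPUs, {leaf_total} leaf + {spine_total} spine SN5600 switches")
```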

Larger DGX SuperPOD component counts
Figure 6: Larger DGX SuperPOD component counts (Source: NVIDIA)

GB200 SuperPod Reference Architecture

The back-end network in NVIDIA's GB200 SuperPod reference architecture also adopts the QM9700 InfiniBand switch, whose 64 x 400Gb/s ports place significant limits on the achievable interconnect scale. A two-layer network wastes a large number of ports and supports only a modest GPU count, so a three-layer network is required for ultra-large-scale expansion.

Two-layer non-blocking network: It supports only 576 GPUs and is equipped with 32 Leaf Switches, organized as 4 Rails of 8 switches each. Each Leaf Switch in a Rail connects to one rack via 18 x 400Gb/s ports, so each rack's 72 ports are spread across the 4 Rails. A large number of ports on each Leaf Switch go unused: 18 ports for downlink, 18 ports for uplink to stay non-blocking (2 ports to each Spine), and 28 ports idle. The 9 Spine Switches provide exactly 64*9=576 ports for a non-blocking connection. (Note: in theory only 18 Leaf Switches would be needed, but 32 are actually used.)
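
The port arithmetic behind these figures, as a short sketch (all inputs are the numbers quoted above):

```python
# GB200 SuperPod two-tier port accounting (inputs are the figures quoted above).
RACKS, GPUS_PER_RACK = 8, 72
LEAF_SWITCHES, LEAF_PORTS = 32, 64      # QM9700, 64 x 400Gb/s
SPINE_SWITCHES = 9
DOWN_PER_LEAF = 18                      # 18 ports from one rack to each leaf
UP_PER_LEAF = 18                        # 2 ports from each leaf to each of the 9 spines

gpus = RACKS * GPUS_PER_RACK                                        # 576
assert LEAF_SWITCHES * DOWN_PER_LEAF == gpus                        # every GPU NIC lands on a leaf
assert LEAF_SWITCHES * UP_PER_LEAF == SPINE_SWITCHES * LEAF_PORTS   # 576 uplinks fill the 9 spines
unused_per_leaf = LEAF_PORTS - DOWN_PER_LEAF - UP_PER_LEAF          # 28 idle ports per leaf
print(f"{gpus} GPUs, {unused_per_leaf} unused ports on each of the {LEAF_SWITCHES} leaf switches")
```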

Compute fabric for full 576 GPUs DGX SuperPOD
Figure 7: Compute fabric for full 576 GPUs DGX SuperPOD (Source: NVIDIA)

Three-layer network: A three-layer network architecture is the only option to support larger scales:

Each SU includes 8 GPU racks with 576 GPUs and is still equipped with 32 Leaf Switches, connected to the GPU racks in the same way as above.

24 Spine Switches are configured per SU, with 6 Spine Switches in each Rail connecting to the 8 Leaf Switches of that Rail. Therefore, 8*18/6=24 ports on each Spine Switch are used for downlink connections to Leaf Switches.

There are 6 Core Groups, and the number of Core Switches in each Core Group scales with the number of SUs (every 2 SUs add 3 Core Switches per group). Taking 16 SUs as an example:
A total of 24 * 16 = 384 Spine Switches are needed, with each Spine Switch having 24 uplink Ports, resulting in a total of 24 * 384 = 9216 uplink Ports.

Each Core Group contains 24 Core Switches, for a total of 6*24=144 Core Switches, corresponding to 144*64=9216 ports, i.e., 9216 GPUs.

The 24 uplink ports of each Spine Switch all go to a single Core Group, one port to each of that group's 24 Core Switches. Within each Rail, the 6 Spine Switches map one-to-one onto the 6 Core Groups.

Compute Fabric for Scale Out of up to 16 SUs
Figure 8: Compute Fabric for Scale Out of up to 16 SUs (Source: NVIDIA)

A cluster with 9216 GPUs requires 144+512+384=1040 QM9700 Switches (with a total of 1040*64=66560 400 Gbps Ports).
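
A short sketch tallying the 16-SU totals just described:

```python
# GB200 SuperPod three-tier totals for 16 SUs (inputs are the figures quoted above).
SUS = 16
GPUS_PER_SU = 576
LEAF_PER_SU, SPINE_PER_SU = 32, 24
CORE_GROUPS, CORE_PER_GROUP = 6, 24
SPINE_UPLINKS = 24                      # uplink ports per spine switch
QM9700_PORTS = 64

gpus = SUS * GPUS_PER_SU                              # 9216
leaf, spine = SUS * LEAF_PER_SU, SUS * SPINE_PER_SU   # 512 and 384
core = CORE_GROUPS * CORE_PER_GROUP                   # 144
assert spine * SPINE_UPLINKS == core * QM9700_PORTS   # 9216 spine uplinks fill the core layer
print(f"{gpus} GPUs: {leaf} leaf + {spine} spine + {core} core = {leaf + spine + core} QM9700 switches")
```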

Larger SuperPOD component counts
Figure 9: Larger SuperPOD component counts (Source: NVIDIA)

GB300 SuperPod Reference Architecture

The back-end network of the GB300 SuperPod cluster adopts the latest Quantum-X800 Q3400 switches to form an InfiniBand network with 144 x 800Gb/s ports. The topology is simpler and port utilization is far higher, making it the strongest option for current high-density, ultra-large-scale compute clusters. The core deployment details are as follows:

Single-SU configuration: Each SU includes 8 NVL72 racks with 576 GPUs and is equipped with 8 Q3400 Leaf Switches (144 x 800Gb/s ports per Leaf Switch). A single Leaf Switch connects to 4 racks, occupying 72 (4 x 18 = 72) 800Gb/s ports, with the remaining 72 ports used for uplink interconnection and no ports wasted. Every 2 Leaf Switches form a Rail, and each Rail spans all 8 racks.

Scale expansion: The SuperPod supports a maximum of 16 SUs for a total of 9216 GPUs (72*8*16=9216), equipped with 128 Leaf Switches (8 * 16 SUs = 128). Each Spine Switch connects to all 128 Leaf Switches, and each Leaf Switch has 72 uplink ports remaining, so only 72 Spine Switches are needed.

Compute fabric for full 576 GPUs DGX SuperPOD
Figure 10: Compute fabric for full 576 GPUs DGX SuperPOD (Source: NVIDIA)

A cluster with 9216 GPUs requires only 128+72=200 Q3400 switches (with a total of 200*144=28800 800Gb/s ports).
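
And the corresponding tally for the GB300 SuperPod, again using only the figures quoted above:

```python
# GB300 SuperPod two-tier totals for 16 SUs (inputs are the figures quoted above).
SUS, RACKS_PER_SU, GPUS_PER_RACK = 16, 8, 72
LEAF_PER_SU, Q3400_PORTS = 8, 144

gpus = SUS * RACKS_PER_SU * GPUS_PER_RACK   # 9216
leaf = SUS * LEAF_PER_SU                    # 128 leaf switches
uplinks = leaf * (Q3400_PORTS // 2)         # 72 uplink ports per leaf -> 9216 uplinks
spine = uplinks // leaf                     # each spine takes one link from every leaf -> 72
print(f"{gpus} GPUs, {leaf} leaf + {spine} spine = {leaf + spine} Q3400 switches")
```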

Larger SuperPOD component counts
Figure 11: Larger SuperPOD component counts (Source: NVIDIA)

Comparison of NVIDIA AI Cluster Architectures

Summarizing the reference architectures above:

  • B200 SuperPod: 8-GPU nodes with 400Gb/s ConnectX-7 NICs and QM9700 InfiniBand switches; up to 2,048 GPUs (theoretical) in two tiers and 16,384 GPUs in three tiers.
  • B300 SuperPod: 8-GPU nodes with ConnectX-8 SuperNICs and SN5600 Spectrum-X Ethernet switches in a dual 400Gb/s plane design; up to 8,192 GPUs in a two-tier network.
  • GB200 SuperPod: NVL72 racks with 400Gb/s ConnectX-7 NICs and QM9700 InfiniBand switches; 576 GPUs in two tiers, or 9,216 GPUs in the 16-SU three-tier example.
  • GB300 SuperPod: NVL72 racks with 800Gb/s ConnectX-8 NICs and Q3400 InfiniBand switches; up to 9,216 GPUs in a two-tier network with far fewer switches.

Conclusion

The evolution from B200 to B300 and from GB200 to GB300 reflects a broader shift in AI infrastructure design. Modern GPU clusters increasingly rely on higher network bandwidth, improved switch density, and more efficient topology designs to support large-scale AI training workloads.

From 400Gb/s InfiniBand fabrics to 800Gb/s networking technologies, each new generation of NVIDIA platforms introduces improvements in bandwidth, scalability, and deployment efficiency. At the same time, rack-scale architectures such as NVL72 significantly increase compute density, allowing hyperscale data centers to deploy more GPUs within a smaller physical footprint.

Together, these innovations form a complete interconnect ecosystem that enables modern AI clusters to scale from individual nodes to thousands of GPUs while maintaining high-performance communication across the entire system.

Recommended Reading:

B300 Architecture and InfiniBand XDR Networking Explained
NVLink vs. NVSwitch: The Backbone of Scalable AI GPU Interconnect

