
AI - The AI Supply Chain - Data Center Networks - Part 2a - Different types of AI Data Center Networks


AI's Multiple Networks: Understanding the Layers of Connectivity


Modern AI infrastructure uses multiple specialized networks, each with distinct performance, reliability, and cost requirements. These networks are typically organized into four or five tiers, ranked below from lowest to highest by speed, performance demands, and connectivity needs.


Because each network tier has different requirements, they use different cabling and hardware at varying price points. Within each tier, multiple logical functions can share the same physical infrastructure while remaining logically separated.


Each AI infrastructure network is detailed below, including its typical purpose and connectivity examples:


  1. Out-of-Band (OOB) Networking

  2. Front End Network (FENW) 'Ethernet Fabric'

    2a. In-band Management

    2b. Storage

  3. Back End Network (BENW) - Compute Fabric

  4. Accelerator Interconnect


1. Out-of-Band (OOB) Networking


Out-of-band networks provide control, management, and monitoring for all IT systems. This includes:


  • Server and SuperNIC Baseboard Management Controller (BMC) access

  • Power Distribution Units (PDUs) and Cooling Distribution Units (CDUs)

  • Network Management & Control (M&C) interfaces on network and other devices

  • OS re-imaging, remote server health monitoring (fan speeds, temperatures, power usage, etc.)


OOB traffic is typically low bandwidth but mission-critical for operational oversight, so it demands reliable connectivity. Because bandwidth needs are modest, this network is usually built with lower-cost copper cabling and Ethernet switches.
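To make this concrete, here is a minimal sketch of the kind of query that rides the OOB network: polling a server BMC's Redfish interface for fan and temperature readings. The BMC address, credentials, and chassis ID are hypothetical placeholders, and a production tool would use proper certificate verification and error handling.

```python
# Minimal sketch: poll a BMC's Redfish API over the OOB network for thermal data.
# The BMC address, credentials, and chassis ID are hypothetical placeholders.
import requests

BMC = "https://10.0.0.50"        # OOB address of the server's BMC (example only)
AUTH = ("admin", "changeme")     # replace with real credentials or token auth
URL = f"{BMC}/redfish/v1/Chassis/1/Thermal"

resp = requests.get(URL, auth=AUTH, verify=False, timeout=5)
resp.raise_for_status()
thermal = resp.json()

# Print the fan speeds and temperature sensors the BMC exposes.
for fan in thermal.get("Fans", []):
    print(f"{fan.get('Name')}: {fan.get('Reading')} {fan.get('ReadingUnits', 'RPM')}")
for temp in thermal.get("Temperatures", []):
    print(f"{temp.get('Name')}: {temp.get('ReadingCelsius')} C")
```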


2a. Front End Network (FENW) 'Ethernet Fabric' - In-band Management


Frontend networking supports general-purpose data operations and orchestration across the AI environment. Common functions include:


  • Data loading and ingestion

  • Model checkpointing

  • Network-attached storage access

  • Integration with Simple Linux Utility for Resource Management (SLURM), Kubernetes, and other orchestration frameworks


This network often spans clusters, data halls, or entire campuses, and may interface with the Internet for tasks like centralized dataset distribution. While speed is important, it is not as latency-sensitive as the backend or accelerator interconnects because it is not used for model training synchronization. Typical interface speeds for this connectivity today reach 200 Gbps (200 GbE).
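To illustrate the orchestration side, the sketch below uses the Kubernetes Python client (assumed to be installed, with the NVIDIA device plugin advertising GPUs as the `nvidia.com/gpu` resource) to inventory GPU capacity across the cluster; this kind of API traffic travels over the front-end network.

```python
# Minimal sketch: list worker nodes and their advertised GPU counts via the
# Kubernetes API, reached over the front-end network. Assumes the `kubernetes`
# Python client and an NVIDIA device plugin exposing "nvidia.com/gpu".
from kubernetes import client, config

config.load_kube_config()     # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPUs")
```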



Note:


In a typical front-end Ethernet switch fabric supporting in-band management and storage traffic, a pair of leaf switches can provide sufficient connectivity for four racks of servers. Depending on the data center design, these leaf switches may be installed at the top of each adjacent rack or centralized within a dedicated network rack positioned near the server block.


The two spine switches above them can aggregate multiple leaf pairs, effectively serving as the core spine layer for multiple Scalable Units (SUs). Each SU consists of eight server racks, and a full SuperPod is composed of eight SUs, totaling:


  • 64 server racks

  • 32 leaf switches (two per four racks)

  • Fully redundant uplinks and paths across the fabric


This hierarchical design ensures predictable performance, high availability, and linear scalability as additional SUs and SuperPods are deployed.
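The counts above follow directly from the ratios in this example design; the short sketch below reproduces the arithmetic (the ratios are those quoted in this post and will differ between vendors and generations).

```python
# Back-of-envelope sizing for the front-end fabric described above.
# Ratios are taken from this post's example and vary between designs.
RACKS_PER_SU = 8          # server racks per Scalable Unit
SUS_PER_SUPERPOD = 8      # Scalable Units per SuperPod
RACKS_PER_LEAF_PAIR = 4   # one redundant leaf pair serves four racks

racks = RACKS_PER_SU * SUS_PER_SUPERPOD        # 64 server racks
leaf_pairs = racks // RACKS_PER_LEAF_PAIR      # 16 redundant leaf pairs
leaf_switches = leaf_pairs * 2                 # 32 leaf switches

print(f"{racks} racks, {leaf_switches} leaf switches ({leaf_pairs} pairs)")
```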


2b. Front End Network (FENW) 'Ethernet Fabric' - Storage


The storage fabric delivers high-bandwidth connectivity to shared storage systems and leverages the flexibility of the front-end Ethernet network by allowing switch ports to be configured for varying speeds and services. This adaptability ensures that storage traffic can be optimized without redesigning the underlying infrastructure. Modern deployments commonly use 400 Gbps Ethernet interfaces, enabling the throughput required for accelerated data loading.


A key architectural characteristic is that the storage fabric operates independently from the compute fabric, allowing both storage throughput and application performance to be maximized without introducing contention or cross-fabric bottlenecks.
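For a rough sense of what 400 Gbps per node buys, the sketch below estimates how long one node would take to read a dataset shard from shared storage; the shard size and efficiency factor are illustrative assumptions, and real results depend on the storage backend keeping pace.

```python
# Rough estimate: time for one node to read a dataset shard over a 400 Gbps
# storage-fabric link. Shard size and link efficiency are assumptions.
LINK_GBPS = 400        # storage-fabric interface speed per node
EFFICIENCY = 0.8       # assume ~80% of line rate is achievable end to end
SHARD_TB = 2.0         # hypothetical dataset shard per node, in terabytes

usable_gbps = LINK_GBPS * EFFICIENCY
shard_gigabits = SHARD_TB * 1000 * 8          # TB -> gigabits (decimal units)
seconds = shard_gigabits / usable_gbps

print(f"~{seconds:.0f} s to read {SHARD_TB} TB at {usable_gbps:.0f} Gbps usable")
```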



3. Back End Network (BENW) - Compute Fabric


The backend network handles high-speed GPU-to-GPU communication across racks, clusters, or buildings. It is built with state-of-the-art, high-performance, low-latency switches that represent the current leading edge of networking technology, while also supporting seamless upgrades to next-generation hardware as it becomes available. Design choices for this network include:


  • High throughput and low latency, with interface speeds up to 800 Gbps

  • Certain architectures allow for a higher port radix, i.e., the number of high-speed ports a switch provides (see the sketch after this list).

  • In a fat-tree fabric, this number dictates:

    • How many downlinks you can provide to servers or leaf switches

    • How many uplinks you can provide to spine switches

    • How deep and wide the topology can scale. Non-blocking (1:1 bandwidth) fabrics will have the highest performance.

    • The oversubscription ratio (if any)

  • Rail-optimized connectivity is extended all the way to the top tier of the fabric, ensuring predictable bandwidth and consistent performance across the compute network.

  • The compute fabric is architected as a balanced, full–fat-tree topology, providing uniform latency, deterministic paths, and non-blocking bandwidth between all endpoints.

  • Managed NDR switches are deployed throughout the design to enhance operational visibility, centralized management, and lifecycle control of the fabric.

  • The architecture is fully aligned to support SHARPv3 (Scalable Hierarchical Aggregation and Reduction Protocol v3), enabling in-network compute acceleration, optimized collective operations, and improved end-to-end performance for large-scale AI workloads.
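As noted in the port-radix bullet above, switch radix and oversubscription together bound how large a two-tier leaf-spine fat tree can grow. The sketch below works through that bound; the 64-port radix is an illustrative figure rather than a specific product.

```python
# Sketch: how port radix and oversubscription bound a two-tier leaf-spine fabric.
# The 64-port radix used in the examples is illustrative, not a specific switch.
def two_tier_capacity(radix: int, oversubscription: float = 1.0) -> dict:
    """Endpoint capacity of a leaf-spine fabric built from `radix`-port switches.

    oversubscription = downlink bandwidth / uplink bandwidth per leaf
    (1.0 means non-blocking).
    """
    uplinks = round(radix / (1 + oversubscription))   # leaf ports toward spines
    downlinks = radix - uplinks                       # leaf ports toward servers
    max_leaves = radix                                # each spine port feeds one leaf
    return {
        "downlinks_per_leaf": downlinks,
        "uplinks_per_leaf": uplinks,
        "max_spines": uplinks,
        "max_leaves": max_leaves,
        "max_endpoints": downlinks * max_leaves,
    }

print(two_tier_capacity(64, oversubscription=1.0))   # non-blocking (1:1)
print(two_tier_capacity(64, oversubscription=2.0))   # 2:1 oversubscribed
```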



Network Speed: Understanding "Rails" in High-Performance Networking


In networking, the term "rail" borrows its origin from power distribution systems in computer hardware, where power lines are laid out like railway tracks on motherboards and power supplies. In modern high-performance computing (HPC) and AI infrastructure, the term has evolved to describe parallel network paths designed for optimized throughput and redundancy.


A rail-optimized GPU interconnect refers to designing GPU nodes with multiple high-speed interfaces, such as InfiniBand (IB), where each interface (or “rail”) connects to a different leaf switch in a spine-leaf network topology. This approach is commonly seen in large-scale AI clusters, which often feature 8-rail InfiniBand architectures, such as NVIDIA Scalable Units in which each rail lands on its own dedicated leaf switch.
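A minimal sketch of what that wiring pattern implies, assuming a hypothetical 8-rail layout in which NIC N on every node cables to leaf switch N of its Scalable Unit:

```python
# Sketch of rail-optimized cabling: rail (NIC) i on every node lands on leaf i,
# so rail-aligned GPU traffic stays one leaf hop away. 8 rails and 32 nodes
# are assumed here purely for illustration.
RAILS = 8
NODES = 32

cabling = {
    (f"node{n:02d}", f"nic{r}"): f"leaf{r}"
    for n in range(NODES)
    for r in range(RAILS)
}

print(cabling[("node00", "nic3")])   # -> leaf3
print(cabling[("node31", "nic3")])   # -> leaf3 (same rail, same leaf switch)
```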


4. Accelerator Interconnect


The fastest layer in AI networking, accelerator interconnects enable direct GPU-to-GPU communication within a node or tightly coupled system. The most prominent example is:


NVIDIA NVLink & NVSwitch

  • Built using NVIDIA’s proprietary NVHS signaling

  • Provides direct memory access between GPUs, bypassing the CPU and system memory

  • Offers significantly higher bandwidth than PCIe and even InfiniBand in local scenarios


First, the NVLink-C2C feature bridges the two NVIDIA Superchips on each compute tray and enables them to act as a single logical unit under one OS instance.

The rack contains 18 compute trays and nine NVLink switch trays. Each GPU features 18 NVLink 5 (NVL5) links, with one dedicated link to each of the 18 NVLink switch chips (two per switch tray). Together, those switch chips provide the full-mesh interconnect for all 72 GPUs within a single DGX GB300 rack, delivering 1.8 TB/s of low-latency NVLink bandwidth per GPU.
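The 1.8 TB/s figure follows from the per-link arithmetic; the sketch below reproduces it, assuming the commonly cited 100 GB/s of bidirectional bandwidth per NVLink 5 link.

```python
# Reproduce the per-GPU NVLink bandwidth figure from the link-level numbers.
# 100 GB/s per NVLink 5 link (bidirectional) is the assumed per-link value.
LINKS_PER_GPU = 18      # one NVL5 link to each of the 18 switch chips
GB_PER_LINK = 100       # GB/s per NVLink 5 link (assumed)
GPUS_PER_RACK = 72

per_gpu_tb = LINKS_PER_GPU * GB_PER_LINK / 1000       # 1.8 TB/s per GPU
rack_tb = per_gpu_tb * GPUS_PER_RACK                  # ~130 TB/s across the rack

print(f"Per GPU: {per_gpu_tb:.1f} TB/s, rack aggregate: {rack_tb:.0f} TB/s")
```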



Other accelerator interconnect technologies include:


  • Google TPU Interconnects

  • AMD Infinity Fabric/UALink


These interconnects are essential for synchronous AI training, where GPUs must stay in constant communication to share and update model parameters. Even small delays can reduce training efficiency, making this the most performance-sensitive network in the AI stack.


