
AI - The AI Supply Chain - Data Center Networks - Part 2a - Different types of AI Data Center Networks


AI's Multiple Networks: Understanding the Layers of Connectivity


Modern AI infrastructure uses multiple specialized networks, each with distinct performance, reliability, and cost requirements. These networks are typically organized into four or five tiers, ranked below from lowest to highest by speed, performance demands, and connectivity needs.


Because each network tier has different requirements, they use different cabling and hardware at varying price points. Within each tier, multiple logical functions can share the same physical infrastructure while remaining logically separated.


Each AI infrastructure network is detailed below, including its typical purpose and connectivity examples:


  1. Out-of-Band (OOB) Networking

  2. Front End Network (FENW) 'Ethernet Fabric'

    2a. In-band Management

    2b. Storage

  3. Back End Network (BENW) - Compute Fabric

  4. Accelerator Interconnect


1. Out-of-Band (OOB) Networking


Out-of-band networks provide control, management, and monitoring for all IT systems. This includes:


  • Server and SuperNIC Baseboard Management Controller (BMC) access

  • Power Distribution Units (PDUs) and Cooling Distribution Units (CDUs)

  • Network Management & Control (M&C) interfaces on network and other devices

  • OS re-imaging, remote server health monitoring (fan speeds, temperatures, power usage, etc.)


OOB traffic is typically low bandwidth but mission-critical for operational oversight, so it demands reliable connectivity. Because bandwidth needs are modest, this network is usually built with lower-cost copper cabling and Ethernet switches.
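To make this concrete, here is a minimal sketch of the kind of query that rides the OOB network: polling a server BMC's Redfish interface for fan and temperature readings. The BMC address, credentials, and chassis ID are hypothetical placeholders, and a production tool would use proper certificate verification and error handling.

```python
# Minimal sketch: poll a BMC's Redfish API over the OOB network for thermal data.
# The BMC address, credentials, and chassis ID are hypothetical placeholders.
import requests

BMC = "https://10.0.0.50"        # OOB address of the server's BMC (example only)
AUTH = ("admin", "changeme")     # replace with real credentials or token auth
URL = f"{BMC}/redfish/v1/Chassis/1/Thermal"

resp = requests.get(URL, auth=AUTH, verify=False, timeout=5)
resp.raise_for_status()
thermal = resp.json()

# Print the fan speeds and temperature sensors the BMC exposes.
for fan in thermal.get("Fans", []):
    print(f"{fan.get('Name')}: {fan.get('Reading')} {fan.get('ReadingUnits', 'RPM')}")
for temp in thermal.get("Temperatures", []):
    print(f"{temp.get('Name')}: {temp.get('ReadingCelsius')} C")
```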


2a. Front End Network (FENW) 'Ethernet Fabric' - In-band Management


Frontend networking supports general-purpose data operations and orchestration across the AI environment. Common functions include:


  • Data loading and ingestion

  • Model checkpointing

  • Network-attached storage access

  • Integration with Simple Linux Utility for Resource Management (SLURM), Kubernetes, and other orchestration frameworks


This network often spans clusters, data halls, or entire campuses, and may interface with the Internet for tasks like centralized dataset distribution. While speed is important, it is not as latency-sensitive as the backend or accelerator interconnects because it is not used for model training synchronization. Typical interface speeds for this connectivity today reach 200 Gbps (200 GbE).
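To illustrate the orchestration side, the sketch below uses the Kubernetes Python client (assumed to be installed, with the NVIDIA device plugin advertising GPUs as the `nvidia.com/gpu` resource) to inventory GPU capacity across the cluster; this kind of API traffic travels over the front-end network.

```python
# Minimal sketch: list worker nodes and their advertised GPU counts via the
# Kubernetes API, reached over the front-end network. Assumes the `kubernetes`
# Python client and an NVIDIA device plugin exposing "nvidia.com/gpu".
from kubernetes import client, config

config.load_kube_config()     # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPUs")
```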



Note:


In a typical front-end Ethernet switch fabric supporting in-band management and storage traffic, a pair of leaf switches can provide sufficient connectivity for four racks of servers. Depending on the data center design, these leaf switches may be installed at the top of each adjacent rack or centralized within a dedicated network rack positioned near the server block.


The two spine switches above them can aggregate multiple leaf pairs, effectively serving as the core spine layer for multiple Scalable Units (SUs). Each SU consists of eight server racks, and a full SuperPod is composed of eight SUs, totaling:


  • 64 server racks

  • 32 leaf switches (two per four racks)

  • Fully redundant uplinks and paths across the fabric


This hierarchical design ensures predictable performance, high availability, and linear scalability as additional SUs and SuperPods are deployed.
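The counts above follow directly from the ratios in this example design; the short sketch below reproduces the arithmetic (the ratios are those quoted in this post and will differ between vendors and generations).

```python
# Back-of-envelope sizing for the front-end fabric described above.
# Ratios are taken from this post's example and vary between designs.
RACKS_PER_SU = 8          # server racks per Scalable Unit
SUS_PER_SUPERPOD = 8      # Scalable Units per SuperPod
RACKS_PER_LEAF_PAIR = 4   # one redundant leaf pair serves four racks

racks = RACKS_PER_SU * SUS_PER_SUPERPOD        # 64 server racks
leaf_pairs = racks // RACKS_PER_LEAF_PAIR      # 16 redundant leaf pairs
leaf_switches = leaf_pairs * 2                 # 32 leaf switches

print(f"{racks} racks, {leaf_switches} leaf switches ({leaf_pairs} pairs)")
```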


2b. Front End Network (FENW) 'Ethernet Fabric' - Storage


The storage fabric delivers high-bandwidth connectivity to shared storage systems and leverages the flexibility of the front-end Ethernet network by allowing switch ports to be configured for varying speeds and services. This adaptability ensures that storage traffic can be optimized without redesigning the underlying infrastructure. Modern deployments commonly use 400 Gbps Ethernet interfaces, enabling the throughput required for accelerated data loading.


A key architectural characteristic is that the storage fabric operates independently from the compute fabric, allowing both storage throughput and application performance to be maximized without introducing contention or cross-fabric bottlenecks.
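For a rough sense of what 400 Gbps per node buys, the sketch below estimates how long one node would take to read a dataset shard from shared storage; the shard size and efficiency factor are illustrative assumptions, and real results depend on the storage backend keeping pace.

```python
# Rough estimate: time for one node to read a dataset shard over a 400 Gbps
# storage-fabric link. Shard size and link efficiency are assumptions.
LINK_GBPS = 400        # storage-fabric interface speed per node
EFFICIENCY = 0.8       # assume ~80% of line rate is achievable end to end
SHARD_TB = 2.0         # hypothetical dataset shard per node, in terabytes

usable_gbps = LINK_GBPS * EFFICIENCY
shard_gigabits = SHARD_TB * 1000 * 8          # TB -> gigabits (decimal units)
seconds = shard_gigabits / usable_gbps

print(f"~{seconds:.0f} s to read {SHARD_TB} TB at {usable_gbps:.0f} Gbps usable")
```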



3. Back End Network (BENW) - Compute Fabric


The backend network handles high-speed GPU-to-GPU communication across racks, clusters, or buildings. It is built with state-of-the-art, high-performance, low-latency switches that represent the current leading edge of networking technology, while also supporting seamless upgrades to next-generation hardware as it becomes available. Design choices for this network include:


  • High throughput and low latency, with interface speeds up to 800 Gbps

  • Certain architectures allow for a higher port radix, i.e., the number of high-speed ports a switch provides (see the sketch after this list).

  • In a fat-tree fabric, this number dictates:

    • How many downlinks you can provide to servers or leaf switches

    • How many uplinks you can provide to spine switches

    • How deep and wide the topology can scale. Non-blocking (1:1 bandwidth) fabrics will have the highest performance.

    • The oversubscription ratio (if any)

  • Rail-optimized connectivity is extended all the way to the top tier of the fabric, ensuring predictable bandwidth and consistent performance across the compute network.

  • The compute fabric is architected as a balanced, full–fat-tree topology, providing uniform latency, deterministic paths, and non-blocking bandwidth between all endpoints.

  • Managed NDR switches are deployed throughout the design to enhance operational visibility, centralized management, and lifecycle control of the fabric.

  • The architecture is fully aligned to support SHARPv3 (Scalable Hierarchical Aggregation and Reduction Protocol v3), enabling in-network compute acceleration, optimized collective operations, and improved end-to-end performance for large-scale AI workloads.
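As noted in the port-radix bullet above, switch radix and oversubscription together bound how large a two-tier leaf-spine fat tree can grow. The sketch below works through that bound; the 64-port radix is an illustrative figure rather than a specific product.

```python
# Sketch: how port radix and oversubscription bound a two-tier leaf-spine fabric.
# The 64-port radix used in the examples is illustrative, not a specific switch.
def two_tier_capacity(radix: int, oversubscription: float = 1.0) -> dict:
    """Endpoint capacity of a leaf-spine fabric built from `radix`-port switches.

    oversubscription = downlink bandwidth / uplink bandwidth per leaf
    (1.0 means non-blocking).
    """
    uplinks = round(radix / (1 + oversubscription))   # leaf ports toward spines
    downlinks = radix - uplinks                       # leaf ports toward servers
    max_leaves = radix                                # each spine port feeds one leaf
    return {
        "downlinks_per_leaf": downlinks,
        "uplinks_per_leaf": uplinks,
        "max_spines": uplinks,
        "max_leaves": max_leaves,
        "max_endpoints": downlinks * max_leaves,
    }

print(two_tier_capacity(64, oversubscription=1.0))   # non-blocking (1:1)
print(two_tier_capacity(64, oversubscription=2.0))   # 2:1 oversubscribed
```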



Network Speed: Understanding "Rails" in High-Performance Networking


In networking, the term "rail" borrows its origin from power distribution systems in computer hardware, where power lines are laid out like railway tracks on motherboards and power supplies. In modern high-performance computing (HPC) and AI infrastructure, the term has evolved to describe parallel network paths designed for optimized throughput and redundancy.


A rail-optimized GPU interconnect refers to designing GPU nodes with multiple high-speed interfaces, such as InfiniBand (IB), where each interface (or “rail”) connects to a different leaf switch in a spine-leaf network topology. This approach is commonly seen in large-scale AI clusters, which often feature 8-rail InfiniBand architectures, such as NVIDIA Scalable Units in which each rail lands on its own dedicated leaf switch.
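A minimal sketch of what that wiring pattern implies, assuming a hypothetical 8-rail layout in which NIC N on every node cables to leaf switch N of its Scalable Unit:

```python
# Sketch of rail-optimized cabling: rail (NIC) i on every node lands on leaf i,
# so rail-aligned GPU traffic stays one leaf hop away. 8 rails and 32 nodes
# are assumed here purely for illustration.
RAILS = 8
NODES = 32

cabling = {
    (f"node{n:02d}", f"nic{r}"): f"leaf{r}"
    for n in range(NODES)
    for r in range(RAILS)
}

print(cabling[("node00", "nic3")])   # -> leaf3
print(cabling[("node31", "nic3")])   # -> leaf3 (same rail, same leaf switch)
```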


4. Accelerator Interconnect


The fastest layer in AI networking, accelerator interconnects enable direct GPU-to-GPU communication within a node or tightly coupled system. The most prominent example is:


NVIDIA NVLink & NVSwitch

  • Built using NVIDIA’s proprietary NVHS signaling

  • Provides direct memory access between GPUs, bypassing the CPU and system memory

  • Offers significantly higher bandwidth than PCIe and even InfiniBand in local scenarios


First, the NVLink-C2C feature bridges the two NVIDIA Superchips on each compute tray and enables them to act as a single logical unit under one OS instance.

The rack contains 18 compute trays and nine NVLink switch trays. Each GPU features 18 NVLink 5 (NVL5) links, with one dedicated link to each of the 18 NVLink switch chips (two per switch tray). Together, those switch chips provide the full-mesh interconnect for all 72 GPUs within a single DGX GB300 rack, delivering 1.8 TB/s of low-latency NVLink bandwidth per GPU.
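The 1.8 TB/s figure follows from the per-link arithmetic; the sketch below reproduces it, assuming the commonly cited 100 GB/s of bidirectional bandwidth per NVLink 5 link.

```python
# Reproduce the per-GPU NVLink bandwidth figure from the link-level numbers.
# 100 GB/s per NVLink 5 link (bidirectional) is the assumed per-link value.
LINKS_PER_GPU = 18      # one NVL5 link to each of the 18 switch chips
GB_PER_LINK = 100       # GB/s per NVLink 5 link (assumed)
GPUS_PER_RACK = 72

per_gpu_tb = LINKS_PER_GPU * GB_PER_LINK / 1000       # 1.8 TB/s per GPU
rack_tb = per_gpu_tb * GPUS_PER_RACK                  # ~130 TB/s across the rack

print(f"Per GPU: {per_gpu_tb:.1f} TB/s, rack aggregate: {rack_tb:.0f} TB/s")
```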



Other accelerator interconnect technologies include:


  • Google TPU Interconnects

  • AMD Infinity Fabric/UALink


These interconnects are essential for synchronous AI training, where GPUs must stay in constant communication to share and update model parameters. Even small delays can reduce training efficiency, making this the most performance-sensitive network in the AI stack.


