
AI - The AI Supply Chain - AI Data Center Networks - Part 2 - Data Center Network Switch Fabrics

  • brencronin

Today’s AI Systems and the Need for High-Speed Interconnects


Modern AI systems resemble supercomputers far more than traditional web-scale architectures. This is because AI workloads, especially distributed training using data-parallel, model-parallel, or tensor-parallel techniques, place enormous demands on east-west bandwidth between GPUs. These GPU-to-GPU exchanges include:


  • Gradient sharing

  • Parameter synchronization

  • Checkpointing and state transfers

  • Collective operations such as All-Reduce, All-Gather, and Reduce-Scatter (an All-Reduce traffic estimate is sketched just after this list)
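
To make the bandwidth demand concrete, here is a minimal Python sketch estimating the per-GPU traffic generated by a single All-Reduce using the standard ring algorithm, in which each of N workers transmits roughly 2 × (N − 1)/N times the gradient size. The model size, gradient precision, and NIC speed are illustrative assumptions, not measurements.

```python
# Estimate per-GPU traffic for one ring All-Reduce of a gradient buffer.
# Assumptions (illustrative): a 7B-parameter model with FP16 gradients and
# a 400 Gb/s NIC per GPU; the ring algorithm moves ~2*(N-1)/N of the buffer.

def ring_allreduce_traffic(num_gpus: int, grad_bytes: float) -> float:
    """Bytes each GPU must send (and receive) for one ring All-Reduce."""
    return 2 * (num_gpus - 1) / num_gpus * grad_bytes

params = 7e9                 # 7B parameters (assumed)
grad_bytes = params * 2      # FP16 gradients -> 2 bytes per parameter
link_gbps = 400              # per-GPU NIC speed in Gb/s (assumed)

for n in (8, 64, 1024):
    sent = ring_allreduce_traffic(n, grad_bytes)
    seconds = sent * 8 / (link_gbps * 1e9)  # transfer time if the NIC is the only limit
    print(f"{n:5d} GPUs: ~{sent / 1e9:5.1f} GB per GPU per All-Reduce, "
          f"~{seconds * 1e3:6.1f} ms at {link_gbps} Gb/s")
```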


Like classical HPC systems, today’s AI clusters cannot exist as a single monolithic server. Instead, they are built from thousands of accelerators working in tight proximity. In these massively parallel environments, the performance of the interconnect becomes just as important as the performance of the GPUs themselves.


One of the core metrics that describes the capability of this interconnect is bisection bandwidth. When a system provides full bisection bandwidth, every processing element (GPU, CPU, or node) can communicate with every other processing element at full line rate, simultaneously, without oversubscription or congestion. It is the gold standard for performance in large-scale distributed AI training.



Why Bisection Bandwidth Matters


In AI networking, bisection bandwidth is the aggregate throughput available between the two equal halves of a cluster. Full bisection bandwidth means the fabric can sustain that traffic at line rate even when every node in one half is talking to the other half at once.


If you divide the cluster into two sides, full bisection bandwidth guarantees:


  • Every node on Side A can talk to every node on Side B

  • At the network’s full rated speed

  • At the same time

  • Without collisions, drops, queues, or slowdowns


This capability is essential for training stability, model convergence speed, and overall cluster efficiency.
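
As a quick back-of-the-envelope check, the sketch below computes the aggregate bandwidth a cluster needs across its bisection to meet this guarantee; the per-node link rate of 400 Gb/s is an assumption for illustration.

```python
# Full bisection bandwidth: with the cluster split into two equal halves,
# every node in one half can drive its full line rate toward the other half.
# The 400 Gb/s per-node link rate below is an illustrative assumption.

def full_bisection_tbps(num_nodes: int, link_gbps: float) -> float:
    """Aggregate bandwidth across the bisection when every node runs at line rate."""
    return (num_nodes / 2) * link_gbps / 1000  # Gb/s -> Tb/s

for nodes in (256, 1024, 4096):
    print(f"{nodes:5d} nodes @ 400 Gb/s -> "
          f"{full_bisection_tbps(nodes, 400):7.1f} Tb/s across the bisection")
```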


A Visual Analogy


Imagine a large sheet with holes in it holding dozens of marbles:


  • Marbles = GPU workloads (data transfers, gradients, tensors)

  • The sheet = the cluster’s network fabric

  • The holes = available bandwidth paths (bisection bandwidth)



With only a few holes, marbles pile up, waiting to pass through, just like GPU traffic waiting on congested links. Latency rises, throughput collapses, and the entire training job slows down.


Add more holes (more bandwidth and more parallel network paths) and the marbles flow freely. Congestion disappears, GPUs stay synchronized, and the cluster performs as designed.


Methods for Achieving Full Bisection Bandwidth


Achieving full bisection bandwidth requires more than simply using high-speed links. In theory, the most direct method is to give every processor a full-rate connection to every other processor, a true any-to-any interconnect.


However, two major challenges arise as soon as the system begins to scale:


1. The Explosive Growth of Connections


One of the most underestimated factors in AI infrastructure design is the combinatorial explosion of interconnects required for a full mesh. As node counts increase, the number of required links grows quadratically.


The total number of connections in a full mesh is:


n × (n − 1) / 2, where n is the number of nodes

Examples:

  • 8 nodes → 28 connections

  • 16 nodes → 120 connections

  • 32 nodes → 496 connections


Now imagine scaling this to hundreds or thousands of GPUs. Even when partitioned into smaller pools, the infrastructure required to maintain ultra-low-latency, high-bandwidth connectivity becomes enormous.


Cabling, optics, switch ports, rack space, cooling, and operational power all grow quickly, and nonlinearly.
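
The growth is easy to verify with the n × (n − 1) / 2 formula above; this short sketch prints the link count for a few cluster sizes.

```python
# Total point-to-point links in a full mesh of n nodes: n * (n - 1) / 2.

def full_mesh_links(n: int) -> int:
    return n * (n - 1) // 2

for n in (8, 16, 32, 256, 1024, 16384):
    print(f"{n:6d} nodes -> {full_mesh_links(n):12,d} direct links")
```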


2. Physical Interconnect Limitations Per Node


Beyond cost and complexity, each node has a finite number of high-speed links it can physically support. This imposes a hard limit on how many direct peer connections are possible.


To overcome this constraint, large AI fabrics rely on hierarchical network topologies:


  • Nodes are grouped into local subgroups with high-bandwidth internal connectivity.

  • Each subgroup connects to a parent tier of switches, which provide non-blocking connectivity to other subgroups.

  • As the cluster grows, these parent tiers may themselves connect into higher-order grandparent tiers, each with full interconnect bandwidth at their level.


This structured hierarchy allows line-rate communication across the entire cluster without requiring an impossible number of direct links from each node.
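
A small sketch of the payoff: in a full mesh each node needs n − 1 direct links, while in a hierarchical fabric each node only needs its fixed set of NIC uplinks no matter how large the cluster grows (8 NICs per node is an assumed figure).

```python
# Per-node link requirements: full mesh vs. a hierarchical (switched) fabric.
# In a full mesh, every node needs a direct link to every other node.
# In a hierarchy, every node only needs its fixed NIC uplinks (8 assumed here),
# regardless of cluster size.

NICS_PER_NODE = 8  # assumption for illustration

for nodes in (32, 256, 4096):
    print(f"{nodes:5d} nodes: full mesh needs {nodes - 1:5d} links per node, "
          f"a hierarchy needs {NICS_PER_NODE} uplinks per node")
```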


Example High-Speed Network Topologies


Modern large-scale AI and HPC environments rely on specialized network topologies designed to maximize throughput, minimize hop count, and deliver predictable, high-bisection bandwidth. Below are several prominent architectures and why they matter for high performance computing.


Dragonfly


The Dragonfly topology is named for its structural resemblance to a dragonfly:


  • Large, highly interconnected local groups form the “wide body.”

  • A smaller number of long-reach links connect these groups to others, forming the “wings.”


This design dramatically reduces global network diameter (often to just 3 hops), enabling high performance at scale while minimizing cabling complexity and cost.
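
As a rough sizing illustration, the sketch below uses the commonly cited balanced dragonfly configuration (a = 2p = 2h, where p is terminals per router, a is routers per group, and h is global links per router) to estimate the maximum endpoint count for a given router radix; real deployments deviate from this idealized rule of thumb.

```python
# Rough dragonfly sizing using the balanced configuration a = 2p = 2h:
#   router radix k = p + (a - 1) + h, groups <= a*h + 1,
#   maximum endpoints = a * p * (a*h + 1). Idealized rule of thumb only.

def dragonfly_size(radix: int):
    p = (radix + 1) // 4   # balanced split: roughly a quarter of ports face hosts
    a, h = 2 * p, p        # a = 2p routers per group, h = p global links per router
    groups = a * h + 1
    endpoints = a * p * groups
    return p, groups, endpoints

for radix in (32, 64, 128):
    p, groups, endpoints = dragonfly_size(radix)
    print(f"radix {radix:3d}: {p:2d} hosts/router, up to {groups:5d} groups, "
          f"~{endpoints:10,d} endpoints")
```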



Torus


A Torus network derives its name from the geometric torus—a donut-shaped, continuous loop. In networking terms, nodes are arranged in a grid where edges wrap around in each dimension, forming a multi-dimensional ring-like structure.


This topology provides:


  • Deterministic, predictable paths

  • High fault tolerance through multiple wrap-around routes

  • Uniform connectivity that scales well with increases in dimensionality

  • A 3D torus offers short, direct neighbor-to-neighbor links and requires far fewer switches than a Clos fabric, which can reduce infrastructure costs.


Torus designs have historically been used in supercomputers that require consistent local communication patterns.
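
For a sense of scale, the sketch below computes node count and worst-case hop count (network diameter) for a torus with k nodes in each of three dimensions; the specific sizes are illustrative.

```python
# A torus with n dimensions and k nodes per dimension (a "k-ary n-cube")
# contains k**n nodes, and its worst-case hop count (diameter) is
# n * floor(k / 2), since traffic can wrap around each ring in either direction.

def torus_nodes(k: int, dims: int) -> int:
    return k ** dims

def torus_diameter(k: int, dims: int) -> int:
    return dims * (k // 2)

for k in (8, 16, 32):
    print(f"{k}x{k}x{k} torus: {torus_nodes(k, 3):6,d} nodes, "
          f"diameter {torus_diameter(k, 3):2d} hops")
```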



Clos / Leaf–Spine


The Clos topology, commonly implemented as a Leaf–Spine architecture, has long been the benchmark for high-performance, non-blocking networks. In this model:


  • Every leaf switch connects to every spine switch, ensuring an even distribution of traffic and low hop counts.

  • Spine-leaf architectures expand cleanly by adding leaf switches without requiring major topology restructuring.

  • Clos designs provide multiple high-quality alternate paths, enabling stronger redundancy and more effective traffic distribution than a 3D torus.


While Leaf–Spine works exceptionally well for east–west data center traffic, AI training introduces synchronized, collective traffic patterns that can saturate conventional oversubscribed spine layers.
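
As a sizing sketch, assuming identical radix-R switches at both tiers and leaf ports split evenly between hosts and uplinks (a 1:1, non-blocking design), a two-tier leaf-spine fabric tops out at roughly R²/2 hosts:

```python
# Two-tier non-blocking leaf-spine sizing sketch.
# Assumptions: every switch has the same radix R; each leaf uses half its ports
# for hosts and half for uplinks (no oversubscription); each leaf connects once
# to every spine, so spines = R/2 and the spine port count caps leaves at R.

def leaf_spine_capacity(radix: int) -> dict:
    hosts_per_leaf = radix // 2
    spines = radix // 2
    max_leaves = radix                       # limited by ports per spine
    return {"spines": spines,
            "max_leaves": max_leaves,
            "max_hosts": max_leaves * hosts_per_leaf}   # = R^2 / 2

for radix in (32, 64, 128):
    cap = leaf_spine_capacity(radix)
    print(f"radix {radix:3d}: {cap['spines']:3d} spines, up to {cap['max_leaves']:3d} leaves, "
          f"{cap['max_hosts']:6,d} hosts at full bisection")
```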



Fat-Tree Network Architecture


To overcome these limitations, AI and HPC environments often deploy Fat-Tree architectures, an extension of the Clos model that “fattens” the bandwidth near the top of the topology.


Note: Modern leaf-spine deployments typically provision full bandwidth between all layers, making the terms "leaf-spine" and "fat tree" functionally equivalent in current practice.


Key attributes include:


  • Increased link capacity between upper tiers

  • Symmetrical bandwidth across the entire fabric

  • Consistent, non-blocking throughput that provides full-bandwidth connectivity without contention

  • High redundancy through multiple parallel paths


These properties make Fat-Tree designs particularly well-suited for AI training clusters, where predictable, high-throughput GPU-to-GPU communication is essential for performance and scalability.
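
For a sense of how far this scales, the sketch below sizes the classic three-tier "k-ary" fat-tree built entirely from k-port switches, a common textbook construction rather than any specific vendor design: it supports k³/4 hosts at full bisection bandwidth using 5k²/4 switches.

```python
# Classic three-tier "k-ary" fat-tree built from identical k-port switches
# (a textbook construction; real fabrics vary): k pods, each with k/2 edge
# and k/2 aggregation switches, (k/2)^2 core switches, and k^3/4 hosts,
# all at full bisection bandwidth.

def fat_tree(k: int) -> dict:
    assert k % 2 == 0, "k must be even"
    return {"hosts": k ** 3 // 4,
            "edge_and_agg": k * k,          # k pods * (k/2 edge + k/2 agg)
            "core": (k // 2) ** 2,
            "total_switches": 5 * k * k // 4}

for k in (16, 32, 64):
    ft = fat_tree(k)
    print(f"k={k:2d}: {ft['hosts']:7,d} hosts, {ft['total_switches']:5,d} switches "
          f"({ft['core']:5,d} of them core)")
```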



Rail Optimized


A rail-optimized topology is a network architecture that organizes GPUs into separate, parallel communication domains called "rails." In networking, the term "rail" borrows from power distribution in computer hardware, where power lines are laid out like railway tracks across motherboards and power supplies. In modern high-performance computing (HPC) and AI infrastructure, the term has evolved to describe parallel network paths designed for optimized throughput and redundancy. Each GPU server has multiple high-speed network interface cards (NICs), and each NIC connects to its own dedicated switch or set of switches; that path is one "rail." For example, a server with 8 NICs might connect to 8 separate rails, each rail having its own independent switching fabric.


The diagram below shows how rail optimization creates a flatter network by connecting all servers directly to high-speed leaf switches, reducing hops between GPUs and maximizing performance.


Note: HCA (Host Channel Adapter) refers to the specialized network interface card that connects GPU servers to high-speed networks like InfiniBand or high-performance Ethernet.


[Figure: rail-optimized topology, with each server NIC connected to its own rail switch]
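
The mapping itself is simple, as the sketch below shows: NIC i of every server attaches to rail i, so GPUs that share a local index reach each other through a single rail switch. The server and NIC counts are illustrative.

```python
# Rail-optimized NIC-to-switch mapping sketch (illustrative counts).
# NIC i of every server attaches to rail i, so GPUs that share a local
# index communicate through the same rail switch in one hop.

NUM_SERVERS = 4      # assumed for illustration
NICS_PER_SERVER = 8  # e.g., one NIC per GPU

rails = {rail: [] for rail in range(NICS_PER_SERVER)}
for server in range(NUM_SERVERS):
    for nic in range(NICS_PER_SERVER):
        rails[nic].append(f"server{server}-nic{nic}")

for rail, members in rails.items():
    print(f"rail {rail}: {', '.join(members)}")
```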

Switch Radix


In data center networking, radix refers to the total number of ports on a switch - essentially, it's the switch's port count.


For example:


  • A 32-port switch has a radix of 32

  • A 64-port switch has a radix of 64

  • Modern high-radix switches can have 128, 256, or even more ports


Why High Radix Improves Performance


High-radix switches provide several significant benefits for data center architecture:


  • Reduced Network Hops: With more ports per switch, you can connect more servers or other switches directly to a single device. This means packets traveling between endpoints need to pass through fewer intermediate switches (fewer "hops"), which reduces latency and improves throughput.

  • Simplified Network Topology: Higher radix allows you to build flatter network architectures. Instead of needing multiple tiers of switches (like a traditional three-tier design with access, aggregation, and core layers), you might only need two tiers or even a single-tier topology for smaller deployments. Simpler topologies are easier to manage and troubleshoot.

  • Lower Oversubscription: With more ports available, you have more flexibility in how you allocate bandwidth. You can dedicate more ports to uplinks, bringing the oversubscription ratio (total server-facing bandwidth divided by uplink bandwidth) closer to 1:1, which means less congestion when multiple servers communicate simultaneously (a small calculation follows this list).

  • Reduced Equipment and Cabling: Fewer switches are needed overall to connect the same number of servers, which translates to lower capital costs, reduced power consumption, less rack space, and significantly less cabling complexity. This also improves reliability since there are fewer potential failure points.
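
Here is the oversubscription calculation from the list above for a single leaf switch; the port counts and speeds are assumptions chosen only to show the arithmetic.

```python
# Oversubscription ratio for one leaf switch:
#   (total downlink bandwidth to servers) / (total uplink bandwidth to spines)
# Port counts and speeds below are illustrative assumptions.

def oversubscription(down_ports: int, down_gbps: int, up_ports: int, up_gbps: int) -> float:
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 48 x 400G server ports with 16 x 800G uplinks -> 19.2 Tb/s : 12.8 Tb/s
ratio = oversubscription(down_ports=48, down_gbps=400, up_ports=16, up_gbps=800)
print(f"Oversubscribed leaf:  {ratio:.2f}:1  (1.00:1 means non-blocking)")

# A 1:1, non-blocking layout: 32 x 400G down, 16 x 800G up
print(f"Non-blocking leaf:    {oversubscription(32, 400, 16, 800):.2f}:1")
```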


Balancing Network Investment in AI Infrastructure


The cost of connectivity in large-scale AI clusters extends far beyond hardware, cabling, and components. It also includes power consumption, physical footprint, and operational complexity, factors that can significantly affect the total cost of ownership (TCO) and system performance.


When organizations invest heavily in high-performance GPUs, it makes little sense to cut corners on lower-cost components like network switches or cabling. Doing so creates bottlenecks that diminish the return on investment in the more expensive compute hardware. It's akin to outfitting a factory with state-of-the-art machinery but neglecting the conveyor systems that move parts between them, resulting in wasted potential and underutilized capacity.


ree

At the same time, cost awareness is critical. Networking costs can quickly scale with cluster size, and not all traffic within an AI environment requires the same performance characteristics. AI clusters typically operate multiple distinct network types, each with different latency, bandwidth, and reliability requirements. Understanding these AI-specific network tiers (e.g., training fabric, storage fabric, management/control planes) is essential for making smart architectural decisions that prioritize performance where it's most needed while controlling costs where feasible.

