
AI - The AI Supply Chain - AI Data Center Networks - Part 1a - History of Data Center Networks

  • brencronin
  • 2 days ago
  • 6 min read

The Critical Role of Networking in AI


Beyond power and cooling, and beyond the specialized servers and AI accelerators themselves, one of the most essential, and often underestimated, components of any AI infrastructure is the network fabric that interconnects those systems. In NVIDIA’s AI Factory reference model, the network is not a supporting element; it is a core pillar that determines how efficiently AI workloads scale, communicate, and perform.


[diagram]

Just as AI chips continue to push the boundaries of performance and speed, data center networks have undergone and continue their own rapid evolution to deliver higher throughput, lower latency, and more intelligent traffic handling. These advancements are crucial because bottlenecks in the network become bottlenecks in the entire AI pipeline.


While the past should not dictate the future, it provides valuable context for understanding why today’s AI networking architectures look the way they do. This first installment in our series on AI data center networking will begin by exploring the history and foundational principles that shaped the networks powering modern AI systems.


History and Key Terminology of Computer Network Connectivity


Modern data center networking has evolved over decades, primarily driven by the need for faster and more reliable server interconnectivity. While many of the core technologies have been around for years, understanding their origins and key design principles helps make sense of today’s high-speed infrastructure, especially as it relates to the AI supply chain of network connectivity.


The Basics: Ethernet and Local Area Networks (LANs)


At its foundation, computer-to-computer connectivity within a location is built on Ethernet, a protocol developed in 1973 by Bob Metcalfe at Xerox PARC. Ethernet has been around a long time! Ethernet networks operate by connecting computers to switches, which then connect to other switches to expand network coverage within a building or campus. When interconnected at a single geographic location, this network of switches is commonly referred to as a Local Area Network (LAN).


[diagram]

Classic Network Switch Design: Access, Distribution, and Core


As organizations adopted widespread Internet access in the 1990s, network designs became more tiered to support scalability and manageability. This led to a common three-tier architecture:


  • Access Layer: Switches near end users (e.g., at cubicles or office floors).

  • Distribution Layer: Aggregates multiple access switches and connects to the core.

  • Core Layer: High-speed switches in the data center interconnecting major parts of the network.
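The tiered hierarchy above can be sketched as a simple graph. This is an illustrative toy model (the switch names are made up), but it shows the key traffic consequence of the design: traffic between access switches on different distribution switches must climb to the core and back down.

```python
from collections import deque

# Hypothetical three-tier campus topology as an adjacency map.
# Switch names are invented for illustration.
topology = {
    "core1":   ["dist1", "dist2"],
    "dist1":   ["core1", "access1", "access2"],
    "dist2":   ["core1", "access3", "access4"],
    "access1": ["dist1"],
    "access2": ["dist1"],
    "access3": ["dist2"],
    "access4": ["dist2"],
}

def hop_count(topo, src, dst):
    """Breadth-first search returning the number of switch-to-switch hops."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nbr in topo[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

# access1 -> dist1 -> core1 -> dist2 -> access3
print(hop_count(topology, "access1", "access3"))  # 4 hops
```

Note that two hosts on the same access switch are one hop apart, while hosts on opposite sides of the campus pay four hops, a variability that leaf-spine designs (discussed later) were created to flatten.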


This design was influenced heavily by hardware costs and signal transmission limits. Fiber optics, needed for high-speed, long-distance connections, were more expensive than copper, both in cables and switch components. The diagram below illustrates a typical business switch network, where cubicle devices connect to access switches housed in local wiring closets. These access switches link to distribution switches via high-speed trunk connections, typically located in a central building closet. Distribution switches then connect to core switches in the main data center, creating a fully interconnected network across the building or campus.


[diagram]

Over time, fiber became more affordable and high-speed switch components became more common, which allowed for variations on this three-tier design. There is no one-size-fits-all design for switch networks; their implementation continues to be shaped by available technology and cost-performance tradeoffs. One example is connecting lower-tier access switches directly to the core layer, made practical by higher-speed optics appearing in lower-end switches.


[diagram]


Redundancy and the Challenge of Network Loops


Network reliability is a foundational concept in network connectivity. In a business environment, losing an entire floor's network access is far more disruptive than losing a single device, making redundancy essential. Redundant switch links, however, can create switching loops, where frames circulate endlessly. To address this, Spanning Tree Protocol (STP) was introduced in the 1980s, using graph theory to build a loop-free logical topology by electing a root switch and blocking redundant paths. I told you these network concepts were old! While AI cluster networks typically use more modern alternatives to STP, the core need for redundancy and loop prevention continues to shape network architecture today.
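The loop-prevention idea is easy to see in code. The sketch below is a simplification of what STP actually does (real STP exchanges BPDUs and compares bridge IDs and path costs), but it captures the essence: elect a root, keep one loop-free path to every switch, and "block" every other link.

```python
from collections import deque

# Hypothetical redundant topology: a triangle of switches plus one leaf.
links = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}

def adjacency(link_set):
    adj = {}
    for u, v in link_set:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def spanning_tree(link_set):
    """BFS from the elected root (lowest ID, mimicking STP's root election)
    keeps one loop-free path to every switch; remaining links are 'blocked'."""
    adj = adjacency(link_set)
    root = min(adj)  # lowest bridge ID wins the root election
    tree, seen, queue = set(), {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                tree.add(tuple(sorted((u, v))))
                queue.append(v)
    blocked = {tuple(sorted(l)) for l in link_set} - tree
    return tree, blocked

tree, blocked = spanning_tree(links)
print(sorted(tree))     # forwarding links
print(sorted(blocked))  # redundant link placed in blocking state
```

The blocked link is not wasted: if a forwarding link fails, STP reconverges and unblocks it, which is exactly the redundancy-without-loops tradeoff described above.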


[diagram]

Layer 2 vs. Layer 3 Switching


Traditional network switches operated at Layer 2 (Ethernet), forwarding traffic based on MAC addresses stored in a Content Addressable Memory (CAM) table. When an Ethernet frame arrives, the switch's ASICs quickly look up the destination MAC and forward the frame to the correct port. If the switch has not yet learned the destination MAC, it floods the frame out all ports except the one it arrived on; hosts themselves use the Address Resolution Protocol (ARP) to resolve an IP address to a MAC address. When traffic needs to cross network boundaries (broadcast domains), it is handed off to a router operating at Layer 3. A major evolution in networking has been the integration of Layer 3 capabilities into switches, allowing them to perform both Ethernet switching and IP routing. This convergence enables modern switches to support routing protocols like Border Gateway Protocol (BGP), facilitating the design of scalable, high-performance, loop-free networks with redundancy.
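The learn-and-forward behavior can be sketched in a few lines of Python. This is illustrative only; real switches perform this lookup in hardware CAM tables at line rate, not in software.

```python
# Minimal sketch of Layer 2 MAC learning and forwarding.
class L2Switch:
    def __init__(self):
        self.cam = {}  # MAC address -> port (the CAM table)

    def receive(self, frame_src, frame_dst, in_port, all_ports):
        # Learn: remember which port the source MAC lives on.
        self.cam[frame_src] = in_port
        # Forward: a known destination goes out one port;
        # unknown unicast is flooded out every other port.
        if frame_dst in self.cam:
            return [self.cam[frame_dst]]
        return [p for p in all_ports if p != in_port]

sw = L2Switch()
ports = [1, 2, 3, 4]
print(sw.receive("aa:aa", "bb:bb", in_port=1, all_ports=ports))  # flood: [2, 3, 4]
print(sw.receive("bb:bb", "aa:aa", in_port=2, all_ports=ports))  # known: [1]
```

Notice how the second frame is delivered to exactly one port: the first frame taught the switch where "aa:aa" lives, so no flooding is needed on the return path.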


[diagram]

The Need for Speed: Evolution of High-Performance Network Fabrics


In the 2000s, the demand for greater speed and efficiency in network design surged as the industry pivoted from connecting office users to interconnecting data center servers at massive scale. This shift was driven by the explosion of online services and data-intensive applications, including high-frequency trading (HFT), that required low-latency, high-throughput server-to-server connectivity. To meet these demands, network architects adopted a high-speed switching architecture for data centers known as the Clos network, also referred to as the leaf-spine or switch fabric design.


The Clos network concept originated in the 1950s with Charles Clos, who designed multistage switching systems for telecommunications to efficiently connect many inputs to many outputs. This approach has since been adapted for modern data centers to support dense server-to-server communication.
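Clos's original 1953 result gives a concrete sizing rule that still underlies fabric design: a three-stage Clos network is strictly non-blocking when the number of middle-stage switches m satisfies m ≥ 2n − 1, where n is the number of inputs per ingress switch. A trivial check, with example numbers chosen for illustration:

```python
def strictly_non_blocking(n_inputs_per_ingress: int, m_middle_switches: int) -> bool:
    """Clos's 1953 condition: a three-stage Clos network is strictly
    non-blocking when m >= 2n - 1, where n is the number of inputs per
    ingress-stage switch and m is the number of middle-stage switches."""
    return m_middle_switches >= 2 * n_inputs_per_ingress - 1

print(strictly_non_blocking(4, 7))  # True  (7 >= 2*4 - 1)
print(strictly_non_blocking(4, 6))  # False
```

The intuition: with 2n − 1 middle switches, even in the worst case where n − 1 existing connections block middle switches on the input side and another n − 1 block them on the output side, one middle switch always remains free.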


In leaf-spine architectures, the leaf switches, typically the top-of-rack (ToR) switches that connect directly to servers, are connected to spine switches, which form the high-speed core of the network. Every leaf connects to every spine, creating a non-blocking fabric that ensures minimal latency and maximum throughput. This design excels at handling East-West traffic, which refers to internal server-to-server communication, in contrast to North-South traffic, which flows in and out of the data center.
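The "every leaf connects to every spine" rule makes the topology trivially generatable, and it guarantees that any server-to-server path is exactly two switch hops (leaf → spine → leaf). A small sketch, with the fabric dimensions (4 leaves, 2 spines) chosen arbitrarily for illustration:

```python
import itertools

# Hypothetical 4-leaf, 2-spine fabric: every leaf links to every spine.
leaves = [f"leaf{i}" for i in range(1, 5)]
spines = [f"spine{j}" for j in range(1, 3)]
links = list(itertools.product(leaves, spines))

print(len(links))  # 8 links: len(leaves) * len(spines)

# Any leaf can reach any other leaf via any spine in exactly two hops,
# giving uniform, predictable East-West latency across the fabric.
for src, dst in itertools.combinations(leaves, 2):
    paths = [(src, spine, dst) for spine in spines]
    assert all(len(path) - 1 == 2 for path in paths)
```

Because every leaf-to-leaf pair has one path per spine, adding spines adds both bandwidth and redundancy without changing the hop count, which is one reason the design scales so cleanly.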


[diagram]

What does fabric in switching mean?


The term fabric in networking refers to a highly interconnected topology that provides seamless, scalable communication across nodes. It is often visualized as a woven matrix, symbolizing dense interconnectivity. In the AI and cloud computing landscape, "fabric" has multiple contexts. These posts on AI networking are focused on network switch fabrics.


  1. Chip Fabric – The internal interconnect within a System-on-Chip (SoC) that links CPU cores, memory controllers, and I/O, enabling efficient intra-chip data movement.

  2. Fabric Controller – In cloud platforms like Microsoft Azure, the fabric controller manages clusters of physical and virtual resources, orchestrating compute, storage, networking, and load balancing via a central management plane.

  3. Network Switch Fabric – A dynamic, often virtualized mesh of switches, routers, and links that enables high-performance, low-latency communication across the data center. It abstracts physical constraints while supporting scalable, automated networking.


Understanding North-South vs. East-West Traffic in Modern Data Center Architectures


As server architectures continue to evolve, so too must the way we design and optimize network infrastructure, especially when considering the flow of traffic both within the data center and in and out of it.


One of the foundational concepts in organizational and data center networking is the distinction between North-South and East-West traffic:


  • North-South traffic refers to data entering or leaving the data center. This could include client-to-server traffic over the internet or communication between geographically separate data centers.

  • East-West traffic refers to server-to-server communication within the data center, including traffic between compute, storage, and application components.
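The distinction is easy to operationalize: a flow whose endpoints both sit inside the data center's address space is East-West, and anything crossing the boundary is North-South. A minimal classifier sketch, where the 10.0.0.0/8 internal prefix is an assumption made for this example:

```python
import ipaddress

# Assumed internal prefix for illustration; real data centers may use
# several prefixes (or non-IP criteria) to define "inside".
DC_PREFIX = ipaddress.ip_network("10.0.0.0/8")

def classify(src: str, dst: str) -> str:
    """East-West if both endpoints are inside the data center prefix,
    otherwise North-South."""
    def inside(ip: str) -> bool:
        return ipaddress.ip_address(ip) in DC_PREFIX
    return "East-West" if inside(src) and inside(dst) else "North-South"

print(classify("10.1.1.5", "10.2.3.9"))     # East-West (server to server)
print(classify("203.0.113.7", "10.1.1.5"))  # North-South (client to server)
```

Monitoring tools often apply exactly this kind of prefix test to break down traffic mix, which is how operators noticed East-West volume overtaking North-South in the first place.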


[diagram]

Historically, network optimization focused heavily on North-South traffic to support the rise of web-based, client-server architectures, optimizing user-to-server communication for speed and reliability. While this remains important, particularly for AI inference workloads (e.g., a user interacting with ChatGPT), the rise of AI training has significantly shifted the focus toward East-West traffic optimization.


The AI Effect on East-West Traffic


AI training workloads involve distributing large datasets across multiple systems, with constant communication to synchronize model weights and gradients. This demand for low-latency, high-throughput East-West communication has pushed data center design to new levels of performance. To be clear, East-West traffic had been gaining importance well before the AI boom, particularly with the rise of virtualization and hyperconverged infrastructure, but AI has accelerated the need for server-to-server networking optimizations.
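To give a feel for the East-West volumes involved, here is a back-of-envelope sketch using the standard bandwidth cost of ring all-reduce, a common gradient-synchronization pattern: each of N workers transmits roughly 2(N − 1)/N times the gradient size per synchronization step. The model size and worker count below are assumptions chosen purely for illustration.

```python
def ring_allreduce_bytes_per_worker(gradient_bytes: float, workers: int) -> float:
    """Per-worker bytes sent in one ring all-reduce step:
    2 * (N - 1) / N times the gradient size."""
    return 2 * (workers - 1) / workers * gradient_bytes

# Example assumption: a 7B-parameter model with fp16 gradients (~14 GB),
# synchronized across 64 workers.
grad_bytes = 7e9 * 2
per_worker = ring_allreduce_bytes_per_worker(grad_bytes, 64)
print(f"{per_worker / 1e9:.1f} GB sent per worker per step")  # ~27.6 GB
```

With synchronization happening every training step, traffic at this scale flows continuously between servers, which is why fabric bandwidth and latency, not just accelerator speed, bound training throughput.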


Future articles in this series will explore the networking architectures and solutions that have emerged, and continue to evolve, to meet the ever-increasing performance and speed demands of AI.


References


AI Factories Are Redefining Data Centers and Enabling the Next Era of AI


CLOS Networks:
