
AI cybersecurity from the ground up - Data Centers - Part 5 'Data Center Cooling & Thermal Attacks'

  • brencronin
  • May 5
  • 10 min read

Updated: Jul 25

While power is the foundation of any data center, cooling and thermal management are equally critical, especially in AI-intensive environments where compute loads generate extreme heat. In these high-performance data centers, inadequate cooling can quickly lead to system instability, hardware degradation, or complete failure.


What’s more, targeting cooling infrastructure is an increasingly effective method for disrupting AI data center operations. In some cases, thermal attacks can be more devastating than power-based attacks, making cooling not just an operational concern, but a strategic vulnerability in the AI infrastructure supply chain.


The Physics of Heat in Confined Spaces


Understanding the non-linear progression of heat in data centers is key. For example:


  • A failure that breaks a steady 70°F cooling setpoint may take 4 hours to reach 80°F

  • But only 90 minutes to hit 90°F

  • Then just 30 minutes to reach 100°F

  • And finally, as little as 5–10 minutes to spike to 110°F or higher


This thermal runaway effect leaves responders very little time to act. In severe cases, the most effective mitigation is to shut down servers immediately and prioritize fixing the thermal system to avoid permanent component damage.
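
The accelerating timeline above can be sketched as a simple piecewise model. This is a hypothetical illustration using only the article's example figures, not measurements from any specific facility:

```python
# Thermal runaway timeline sketch. Stage durations are the article's
# example figures (minutes to climb between temperature bands); the
# linear interpolation within each band is an illustrative assumption.
STAGES = [
    (70, 80, 240),    # 4 hours from 70°F to 80°F
    (80, 90, 90),     # 90 minutes to 90°F
    (90, 100, 30),    # 30 minutes to 100°F
    (100, 110, 7.5),  # midpoint of the 5-10 minute estimate
]

def minutes_until(temp_f: float) -> float:
    """Total minutes after a cooling failure until temp_f is reached."""
    elapsed = 0.0
    for lo, hi, duration in STAGES:
        if temp_f <= lo:
            break
        if temp_f >= hi:
            elapsed += duration
        else:
            elapsed += duration * (temp_f - lo) / (hi - lo)  # partial stage
    return elapsed

print(minutes_until(90))   # 330.0 -> 5.5 hours to reach 90°F...
print(minutes_until(110))  # 367.5 -> ...but only ~37 more minutes to 110°F
```

The takeaway the model makes concrete: most of the response window is consumed before the situation looks urgent, which is why immediate server shutdown is often the only safe mitigation.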


Thermal Attacks: A New Vector in AI Infrastructure


Sophisticated threat actors targeting AI environments may exploit cooling systems by:


  • Testing thresholds, slowly adjusting setpoints to observe operator responses

  • Injecting false positive thermal signals to desensitize monitoring

  • Manipulating cooling telemetry or disabling alerts

  • Launching coordinated attacks during holidays or low-staff periods


By exploiting the gaps in thermal monitoring and response, attackers can cause maximum damage with minimal direct interaction with compute systems.
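
The "slow setpoint creep" tactic above evades per-sample alarm thresholds, but cumulative-drift detection can catch it. The sketch below is a minimal illustration with assumed names and thresholds, not a real BMS/DCIM API; a CUSUM-style accumulator flags gradual drift that any single reading would not:

```python
# Minimal CUSUM-style drift detector for inlet-temperature telemetry.
# baseline/slack/limit values are illustrative assumptions.
def detect_setpoint_drift(readings, baseline=70.0, slack=0.5, limit=3.0):
    """Return the index where cumulative upward drift exceeds `limit`
    degrees, or None. `slack` absorbs normal jitter around baseline."""
    cusum = 0.0
    for i, temp in enumerate(readings):
        cusum = max(0.0, cusum + (temp - baseline - slack))
        if cusum > limit:
            return i
    return None

# Each reading is barely 1°F above baseline -- under any single-sample
# alarm -- yet the accumulated drift trips the detector at index 5.
telemetry = [70.2, 70.9, 71.1, 71.3, 71.4, 71.6, 71.8]
print(detect_setpoint_drift(telemetry))  # 5
```

Pairing a detector like this with out-of-band telemetry (sensors the cooling controller cannot rewrite) also counters the telemetry-manipulation vector listed above.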


The Evolving Role of Cooling Technology


This article also explores the evolution of data center cooling, from legacy air-based systems to modern direct-to-chip liquid cooling (DLC). DLC allows higher power densities and tighter rack spacing, which improves performance, but also introduces new cybersecurity and physical attack surfaces.


In an earlier article in this series, we referenced the 10 kW per rack limit, a common standard in colocation environments. However, this threshold is less a power constraint and more a byproduct of traditional air-cooling limitations. In reality, it’s the cooling capacity, not the power supply, that often dictates this limit.


Today, advanced cooling technologies like DLC are redefining what's possible, enabling higher power densities and more compact rack configurations. But with these innovations come new security considerations. As cooling systems become more integrated and intelligent, they introduce additional attack surfaces, requiring modernized threat models and proactive defenses to protect AI and high-performance data center environments.


Traditional Data Center Cooling


At the most basic level, servers are cooled by air. Cold air is drawn into the front of the server chassis, circulated by internal fans across critical components like CPUs and memory, and exhausted out the rear as hot air. This airflow cycle forms the foundation of air-based data center cooling.


To scale this up across rooms full of servers, data centers deploy Computer Room Air Conditioners (CRACs) and Computer Room Air Handlers (CRAHs). These systems generate and distribute cold air into the data halls while simultaneously extracting hot air. The core mechanism behind this heat exchange is relatively straightforward: hot air is passed over chilled liquid-filled coils, transferring the heat to the liquid.


This now-warmed liquid is then pumped to chillers or cooling towers, the large units often found on the rooftops or perimeters of data center buildings. These chillers remove the heat using a process known as adiabatic expansion cooling and return the now-cooled liquid to the CRAC/CRAH systems, completing the cycle.


A compelling example of how even minor changes to thermal management systems can severely disrupt data center operations is the well-known Singapore data center outage. Initially suspected to be the result of a cyberattack, the incident was later traced to a contractor error during a scheduled system upgrade. The contractor mistakenly sent a signal that closed the valves on the chilled water buffer tanks, cutting off the flow of chilled water to the cooling system and causing a significant service disruption.


While ultimately deemed accidental, the event underscores a critical point: whether through human error, insider threat, or a compromised system, small changes in cooling system control logic can have disproportionately large consequences. This scenario illustrates how easily malicious actors, especially those with access or remote footholds, could exploit similar vulnerabilities to achieve destructive outcomes.


Cooling and the Rack


At the heart of server cooling is the challenge of managing heat generated by individual components, particularly high-performance chips like CPUs and GPUs. These components are equipped with heat sinks, specialized metal structures designed to draw heat away from the chip and disperse it into the surrounding air. This heated air is then expelled from the rear of the server by internal fans.


As chip performance increases, so does heat output. This is why GPU-based servers, which consume significantly more power than standard CPU-based servers, often require larger heat sinks and more airflow, making them physically larger and consuming more rack units (RUs) despite having a similar number of chips.


A key thermal management concept here is ΔT (Delta T), or the temperature differential between the cool air entering the front of the server and the hot air exiting the rear. This differential serves as a measure of cooling effectiveness. The lower the inlet air temperature, the less work the server fans have to do to move enough air to dissipate heat. Conversely, higher inlet temperatures require greater airflow to maintain safe operating conditions. Understanding and optimizing this Delta T is crucial for efficient cooling, especially as densely packed, high-performance racks become more common in AI and High-Performance Compute (HPC) environments.
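
The airflow-versus-ΔT tradeoff can be put into numbers with the common sea-level rule of thumb CFM ≈ 3.16 × watts / ΔT(°F). The server wattage below is an illustrative assumption:

```python
# Back-of-the-envelope Delta T / airflow relationship, using the
# standard approximation CFM ~= 3.16 * watts / delta_T(°F) for air
# at sea level. The 800 W server figure is a hypothetical example.
def required_airflow_cfm(heat_watts: float, delta_t_f: float) -> float:
    """Airflow (cubic feet per minute) needed to carry away heat_watts
    at a given inlet-to-outlet temperature rise."""
    return 3.16 * heat_watts / delta_t_f

server_watts = 800
for dt in (20, 30, 40):
    print(f"Delta T {dt}°F -> {required_airflow_cfm(server_watts, dt):.0f} CFM")
```

The inverse relationship is the point: halving the achievable ΔT doubles the airflow the fans must move, which is why cooler inlet air directly reduces fan work.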


Controlling Cold and Hot Air in the Data Center


Computer Room Air Conditioners (CRACs) are central to data center cooling, pushing chilled air beneath a raised floor. Perforated floor tiles placed in front of server racks allow pressurized cold air to flow upward into server inlets. Server fans then pull this air across hot components and exhaust the heated air out the back. Some racks also include integrated fans to assist airflow. This cooling approach, illustrated earlier, historically limited power density to around 10 kW per rack, a constraint still common in many U.S. colocation data centers, despite some improvements in airflow design.


One foundational enhancement was the introduction of cold aisle/hot aisle orientation. By alternating the direction of racks, cold air is directed into server inlets from one aisle (cold aisle), and hot air is exhausted into the opposite aisle (hot aisle). This prevents hot exhaust from one server becoming the intake for another.


To further optimize this layout, cold aisle containment systems were introduced. These use doors and panels to enclose the cold aisle, trapping chilled air and preventing it from mixing with warmer ambient air, ensuring servers receive the coolest possible air.


Advanced designs also incorporate hot aisle containment, which encloses the hot aisle to isolate and remove warm exhaust more efficiently. However, hot aisle containment typically requires a custom ceiling plenum and is generally only feasible in newer data center builds.


Fan Walls


An alternative to traditional CRAC/CRAH systems that push cold air through raised floors is the fan wall cooling design. In this approach, wall-mounted air-handling units (AHUs) are integrated into the perimeter or interior walls of the data hall. These AHUs deliver cool air directly into the cold aisles, effectively "flooding" them with chilled air.


After passing through the server racks and absorbing heat, the hot exhaust air is isolated in the hot aisles and either recirculated back to the cooling coils or vented out of the facility, depending on the system configuration.


Fan wall designs are particularly well-suited for slab floor data centers, which are becoming more common in modern builds. Unlike raised floor environments, slab floors offer several advantages:


  • Improved durability: They support heavier and newer IT equipment without structural concerns.

  • Lower construction complexity and cost: No need for underfloor cabling or airflow management systems.

  • Enhanced safety in seismic zones: Slab foundations are inherently more stable during earthquakes.


Rising Server Power Demands Are Reshaping Data Center Cooling


As servers have grown more powerful, driven by denser chips and high-performance workloads like AI training, the demand for cooling has increased dramatically. Modern AI servers often run at full power continuously, generating immense amounts of heat. Today, it's not uncommon for a single server to draw up to 10 kW+ of power, pushing traditional data center designs to their limits.


To manage the heat output, data centers have started adopting new rack configurations, such as 1U to 4U rack designs, often with additional spacing between racks to improve airflow. These changes, while necessary for thermal management, introduce several downstream challenges:


  • Reduced server density: More space per server means fewer servers per data hall footprint.

  • Increased latency: As GPUs are spaced farther apart, GPU-to-GPU communication takes longer, potentially degrading performance for tightly coupled workloads.

  • Fiber limitations: Maintaining high-speed interconnects over longer distances requires more expensive fiber types optimized for longer-range and high-bandwidth performance.


The diagram below illustrates this concept. For instance, consider a medium-power AI server equipped with two 3.3 kW power supplies, drawing a total of 6.6 kW. In a data center where each rack is limited to 10 kW, only one such server can be installed per rack, as shown on the left. Under this rack power limitation, deploying four of these servers requires spreading them across four separate racks, wasting floor space, increasing cabling costs, and potentially introducing performance delays due to longer interconnect distances. This example highlights how rising server power draw inflates per-rack power demands and, in turn, the heat output the data center must handle.
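
The packing arithmetic in this example is easy to generalize. A minimal sketch, using the article's 6.6 kW server and 10 kW cap:

```python
# How a per-rack power cap -- not physical space -- dictates placement.
import math

def racks_needed(num_servers: int, server_kw: float, rack_limit_kw: float) -> int:
    """Racks required when no rack may exceed rack_limit_kw."""
    per_rack = max(1, int(rack_limit_kw // server_kw))  # servers that fit per rack
    return math.ceil(num_servers / per_rack)

# The article's example: 6.6 kW servers under a 10 kW/rack cap.
print(racks_needed(4, 6.6, 10))  # 4 -> one server per rack, four racks
# Raise the cap (e.g. via liquid cooling) and the same fleet consolidates:
print(racks_needed(4, 6.6, 30))  # 1 -> four servers fit in a single rack
```

The second call shows why higher cooling capacity, rather than more floor space, is the lever that recovers density.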


Rear-Door Heat Exchanger (RDHx)


One of the most effective innovations in data center cooling is the Rear-Door Heat Exchanger (RDHx). This solution places a radiator-style heat exchanger directly on the back of each server rack, where it absorbs the hot exhaust air as it leaves the servers. RDHx systems can operate with chilled water or other coolants, removing heat right at the source and eliminating the inefficiencies of traditional CRAC/CRAH systems that rely on distant cooling units and complex ducting.


Key advantages of RDHx systems include:


  • High thermal efficiency: Cooling is applied exactly where it's needed, improving energy use and reducing loss.

  • Neutral room temperature: RDHx systems often eliminate the need for cold or hot aisle containment by keeping the air temperature in the room balanced.

  • Support for higher rack power densities: RDHx systems can easily cool racks consuming 30–40 kW, and with the addition of rear-door fans, they can exceed 50 kW per rack.


Direct-to-Chip Liquid Cooling (DLC)


Direct-to-Chip Liquid Cooling (DLC) is a cutting-edge thermal management technology enabling data centers to support rack power densities of 100 kW or more. Unlike traditional air-based cooling, DLC uses cold plates, metal plates in direct contact with high-heat components like CPUs or GPUs, to draw heat away through a liquid coolant.


The process works as follows:


  • Coolant flows through the cold plate, absorbing heat directly from the chip.

  • The now-heated liquid is routed to a Coolant Distribution Unit (CDU).

  • At the CDU, a heat exchanger transfers the heat to a secondary medium (air or another liquid) for external heat rejection.
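
The first step of that loop is governed by Q = ṁ·cp·ΔT. A first-order sketch of the coolant flow a cold-plate loop needs for a given heat load, using textbook water properties; the load and loop ΔT are illustrative assumptions:

```python
# Coolant flow needed to absorb a given heat load at a given coolant
# temperature rise, for a water-based DLC loop.
WATER_CP = 4186.0    # J/(kg*K), specific heat of water
WATER_DENSITY = 1.0  # kg/L (approximate)

def coolant_flow_lpm(heat_kw: float, delta_t_c: float) -> float:
    """Liters per minute of water-based coolant to absorb heat_kw
    with a coolant temperature rise of delta_t_c (°C)."""
    kg_per_s = (heat_kw * 1000.0) / (WATER_CP * delta_t_c)
    return kg_per_s / WATER_DENSITY * 60.0

# A hypothetical 100 kW rack with a 10°C loop rise:
print(f"{coolant_flow_lpm(100, 10):.0f} L/min")  # ~143 L/min
```

Water's high specific heat is why a modest liquid flow can replace enormous volumes of air, enabling the 100 kW+ rack densities described above.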


There are two primary types of CDUs:


  • Liquid-to-Air (L2A) CDU: Transfers heat from the liquid to air, which is then expelled.

  • Liquid-to-Liquid (L2L) CDU: Transfers heat to another liquid loop, often tied to a facility’s chilled water system.


The example below shows Supermicro liquid-cooled racks used in the xAI data center. Each rack houses eight 4U servers, with each server containing eight NVIDIA H100 GPUs, for a total of 64 GPUs per rack. Each server draws approximately 10 kW, meaning the entire rack consumes around 80 kW. Thanks to liquid cooling, this high-power density is achieved without relying on traditional cooling methods such as raised floors or hot/cold aisle containment.
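
As a quick sanity check on those figures, the rack-level totals follow directly:

```python
# Rack totals for the Supermicro/xAI configuration described above,
# using the article's per-server figures.
servers_per_rack = 8
gpus_per_server = 8
server_kw = 10

gpus_per_rack = servers_per_rack * gpus_per_server  # 8 * 8 = 64 GPUs
rack_kw = servers_per_rack * server_kw              # 8 * 10 kW = 80 kW
print(gpus_per_rack, "GPUs,", rack_kw, "kW per rack")
```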


Rack Cooling Summary by Power Draw


The diagram below, sourced from Vertiv’s article 'Understanding Direct-to-Chip Cooling in HPC Infrastructure: A Deep Dive into Liquid Cooling', provides a clear overview of various cooling technologies and the power levels they support. The top x-axis represents power draw, while the center rectangle bars categorize the different cooling methods and the power levels they support. Each method is color-coded: black shows the typical supported range, orange indicates extended capabilities with modifications, and purple represents the upper operational limits of the technology.


The Data Center as a Chip – Cooling and Increased Attack Surfaces


In 1965, Gordon Moore foresaw a future where computational power would scale exponentially through transistor density, an observation that held for decades. But today, with the limits of miniaturization in sight, the industry is redefining scale, not at the chip level, but across entire data centers. Through System Technology Co-Optimization (STCO), modern data centers now function as unified, high-density compute engines, where performance gains come from tightly integrated infrastructure spanning servers, racks, and beyond.


At the heart of this transformation lies one critical enabler: advanced thermal management. Technologies like Rear Door Heat Exchangers (RDHx) and Direct-to-Chip Liquid Cooling (DLC) make high-density compute possible, but they also introduce new cybersecurity and physical attack surfaces. These cooling systems are no longer passive hardware, they are networked, programmable, and increasingly intelligent, making them viable targets for cyber threats.


As data centers evolve into highly integrated, system-level compute platforms, cooling must be viewed not just as an operational necessity but as infrastructure that carries security risks of its own. Protecting these thermal management systems is essential to maintaining both performance and uptime.




References


Adiabatic cooling

The Case for Air-Cooled Data Centers

The 4 Delta T’s of Data Center Cooling: What You’re Missing

CoolShield Containment: Aisle Containment Systems

Move to a Hot Aisle/Cold Aisle Layout

Report to Congress on Server and Data Center Energy Efficiency

Differential air pressure in your data centre

Hot aisle containment (HAC) solutions used to maximize efficiency by cooling and removing the heat produced by data storage and processing equipment

A numerical investigation of fan wall cooling system for modular air-cooled data center

Rear door vs. traditional cooling for data centers

Data Center Cold Wars - Part 4. Rear Door Heat Exchanger

Direct-to-Chip Cooling: The Future Of The Data Center

Understanding direct-to-chip cooling in HPC infrastructure: A deep dive into liquid cooling

Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk

Supermicro 4U Universal GPU System for Liquid Cooled NVIDIA HGX H100 and HGX H200

Fabricated Knowledge: The Data Center is the New Compute Unit: Nvidia's Vision for System-Level Scaling

Cramming More Components onto Integrated Circuits: Moore’s Law

Microsoft data centers sustainability

Energy demand from AI

SemiAnalysis: Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure

Microsoft: Modern data center cooling

Data Center Cooling Continues to Evolve for Efficiency and Density

Direct-to-Chip Cooling: Everything Data Center Operators Should Know

Microfluidics: Cooling inside the chip

How Data Centers Use Water, and How We’re Working to Use Water Responsibly

Pushing AI System Cooling To The Limits Without Immersion

Investigating Security Vulnerabilities in a Hot Data Center with Reduced Cooling Redundancy

Equinix data center outage causes banking issues across Singapore
