INTRODUCTION
Current high-performance computing (HPC) and Telecom trends show that the number of transistors per chip has continued to grow in recent years, and data center cabinets have already surpassed 30 kW per cabinet (or 40.4 kW/m²) [1]. It is not unreasonable to expect that, in accordance with Moore’s Law [2], power could double within the next few years. However, while the capability of CPUs has steadily increased, the technology behind data center cooling systems has stagnated, and the average power per square meter in data centers has not kept pace with CPU advances because of cooling limitations. With cooling systems representing up to ~50% of the total electric power bill for data centers [3], the rising power requirements of HPC and Telecom systems translate directly into higher operating expenses (OpEx). Brick-and-mortar and (especially) mobile, container-based data centers cannot be physically expanded to compensate for the limitations of conventional air cooling methods.
In the near future, for data centers to continue increasing in power density, alternative cooling methods, namely liquid cooling, must be implemented at the data center level in place of standard air cooling. Although microprocessor-level liquid cooling has seen recent innovation, cooling at the blade, cabinet, and data-center level has emerged as a critical technical, economic, and environmental issue.
In this article, three cooling solutions summarized in Table 1 are assessed for their ability to cool a hypothetical, near-future computing cluster. Cooling Option 1 is an air-cooled system with large, high-efficiency, turbine-blade fans pushing air through finned, heat-pipe-equipped copper heat sinks on the blade. Rear-door air-to-liquid heat exchangers on each cabinet cool the exiting air back to room temperature so that no additional air conditioning strain is placed on facility air handlers. The water and propylene glycol (water/PG) mixture from the heat exchangers is pumped to the roof of the facility, where the heat absorbed from the CPUs is dissipated to the environment via a rooftop compressor-enabled chiller.
Cooling Option 2 uses water-based touch cooling on the CPUs via a copper cold plate loop on each board. These loops are connected to an in-rack manifold that feeds a water/PG mixture in and out of each blade. A coolant distribution unit (CDU) collects heated water via overhead manifolds from each cabinet, cools the water through an internal liquid-to-liquid brazed-plate heat exchanger, and pumps it back through the overhead manifolds to the cabinets for re-circulation. On the other side of the heat exchanger, a closed water loop runs from the CDU to a rooftop dry cooler, where the heat from the CPUs is ultimately dissipated into the atmosphere.
Cooling Option 3, which considers two approaches, removes water from the server cabinets and instead uses refrigerant (R134a) as the heat transfer fluid in the server room. This approach uses a cold plate and manifold system similar to the water-cooling approach, but may or may not include a refrigerant distribution unit in the building. In Cooling Option 3a, pumps mounted in the cabinets push the refrigerant through a water-cooled, brazed-plate heat exchanger in a refrigerant distribution unit (RDU), which condenses the refrigerant after it has absorbed the heat from the CPUs. In Cooling Option 3b, pumps move the refrigerant straight to the roof, where a rooftop condenser dissipates the heat to the environment.
Table 1 – Overview of cooling options
The goal of this article is to provide an “apples-to-apples” comparison of these three cooling systems by specifying a hypothetical, high-power, near-future data center for each method to cool. When the options are compared side by side, differences in fluid dynamics and heat transfer translate into differences in efficiency, and the distinctions between the cooling methods become readily apparent. This article is not a complete guide to installing or selecting equipment for each of these cooling systems, but rather a general overview of the power usage advantages each system offers.
HYPOTHETICAL COMPUTER CLUSTER SPECIFICATION AND THERMAL ASSESSMENT
To bound the comparison in a meaningful way, a set of specifications was developed through extensive discussions with the Liquid Cooling (LC) forum of the LinkedIn professional network [4], an association of motivated, multidisciplinary professionals from the HPC, Telecom, and electronics cooling industries.
After several weeks of discussion, the LC forum agreed on the following system configuration and operating conditions for analysis: the hypothetical computer cluster under consideration should produce ~1 MW of IT power. The distribution of power is shown in Figure 1.
Figure 1 – Hypothetical data center module specifications
The cabinet architecture in this specification assumes each horizontal card is inserted from the front, with no cards inserted from the back of the cabinet. Alternative architectures exist, and can be cooled by a variety of means, but for this analysis, a simple, easily-relatable architecture was desired, so only front-facing, horizontal cards were considered.
A hypothetical data center could be equipped with either a dry cooler or a compressor-equipped chiller located on the roof of the building, up to 60 m (≈200 ft) above the HPC cluster floor level. To maintain the 65°C case temperature limit, the air-cooling method requires a compressor-equipped chiller, whereas a dry cooler is preferred for the liquid cooling methods.
COOLING OPTIONS VIEWED VIA EXPLICIT COMPARISON
Option 1: Advanced Air Cooling
Cooling 60 kW in a single rack required a staggered, heat-sinked 10-CPU layout, heat pipes, rear-door cooling heat exchangers, and powerful, high-pressure turbine-blade fans. In addition, the air in the data center still needed to be significantly colder than room temperature (4°C) to achieve the desired case temperature. Even though air cooling may not be an economically feasible solution at this power density, and even though it would clearly not meet the NEBS GR-63 acoustic noise level standard [5], we still devised an air cooling option for comparison with the LC options.
In this approach, each cabinet was supplied with a rear-door cold water/PG cooler with three “hurricane” turbine-blade fans (with an operating point of ~3.7 m³/s at a ~3.7 kPa pressure difference [6]). The cold water/PG solution circulates through the system to the roof chiller, where heat from the cabinets is dissipated to the ambient air. The system would be monitored with on-board temperature sensors and would either increase the fan speed or throttle CPU performance if the CPU approached the 65°C case temperature threshold. This approach is illustrated in Figure 2.
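To put that fan operating point in perspective, a minimal sketch of the air-side arithmetic is shown below: the airflow needed to carry 60 kW at a given air temperature rise, and the electrical power to drive a fan at the quoted flow and pressure. The air temperature rise and the wire-to-air fan efficiency are assumed illustrative values, not figures from the study, and the fan-power result should be read as an upper bound for a fan running continuously at its quoted operating point.

```python
# Illustrative air-side estimate for Option 1 (assumed values, not results from the study).

RHO_AIR = 1.2     # kg/m^3, air density near room temperature
CP_AIR = 1005.0   # J/(kg*K), specific heat of air

def airflow_for_heat_load(q_w, delta_t_k):
    """Volumetric airflow (m^3/s) needed to carry q_w watts with a delta_t_k air temperature rise."""
    return q_w / (RHO_AIR * CP_AIR * delta_t_k)

def fan_electrical_power(flow_m3s, dp_pa, efficiency):
    """Electrical power (W) to move flow_m3s against dp_pa at the given wire-to-air efficiency."""
    return flow_m3s * dp_pa / efficiency

cabinet_power = 60_000.0  # W per cabinet, from the specification

# Assumed 10 K air temperature rise across the cabinet.
flow_needed = airflow_for_heat_load(cabinet_power, delta_t_k=10.0)

# Quoted operating point (~3.7 m^3/s at ~3.7 kPa); 55% wire-to-air efficiency is an assumption.
fan_power = fan_electrical_power(3.7, 3.7e3, efficiency=0.55)

print(f"Airflow needed per cabinet: {flow_needed:.1f} m^3/s")
print(f"Electrical power per fan at the quoted operating point: {fan_power / 1000:.1f} kW")
```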
Figure 2 – Option 1: Air cooling with rear-door water/PG to air heat exchangers
Option 2: Water/PG Touch Cooling
In Option 2a (Figure 3, top board layout), the 10 CPUs per blade were arranged in two parallel groups of 5 serially connected cold plates. Twenty horizontal blades were plugged into vertical supply/return manifolds, and these cabinet manifolds were supplied with water/PG from the CDU via overhead manifolds. A pump and a water/PG-to-facility-water heat exchanger were needed to overcome the pressure losses at the cold plates. A separate loop carries the heat from the CDU to a rooftop dry cooler unit, where it is rejected into the ambient air. This method would include a control system in the CDU that would increase the water flow rate in case of an increased load. Either passive or active flow regulators would also be placed at the inlet of each blade to ensure even flow distribution across the whole cabinet.
After the first pass of the water/PG system simulation, it was discovered that the water/PG velocities in the blade exceeded the allowable ASHRAE [1] velocity limits, so an additional LC option (Option 2b) was added, in which all 10 cold plates were connected in parallel. This required additional onboard manifolds and piping to allow for a compliant cold plate/CPU interface.
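The serial-versus-parallel decision follows from a simple continuity check: for a given heat load and coolant temperature rise, the mass flow per branch fixes the tube velocity. The sketch below illustrates that check; the coolant properties, temperature rise, tube diameter, and velocity limit are assumed values for illustration only (the applicable ASHRAE limit depends on tube material and size), and the blade power is simply the 60 kW cabinet load divided across 20 blades.

```python
import math

# Illustrative velocity check for the blade-level cold plate branches (assumed values throughout).
RHO_WATER_PG = 1030.0   # kg/m^3, approximate density of a water/PG mixture
CP_WATER_PG = 3700.0    # J/(kg*K), approximate specific heat of a water/PG mixture

def branch_velocity(q_branch_w, delta_t_k, tube_id_m):
    """Coolant velocity (m/s) in one branch absorbing q_branch_w with a delta_t_k coolant rise."""
    m_dot = q_branch_w / (CP_WATER_PG * delta_t_k)   # kg/s through the branch
    area = math.pi * (tube_id_m / 2.0) ** 2          # tube cross-sectional area, m^2
    return m_dot / (RHO_WATER_PG * area)

blade_power = 60_000.0 / 20   # W per blade (60 kW cabinet, 20 blades)
delta_t = 8.0                 # K, assumed coolant temperature rise
tube_id = 0.005               # m, assumed 5 mm cold plate tube inner diameter
v_limit = 1.5                 # m/s, assumed velocity limit for illustration

# Option 2a: two parallel groups of five serial plates -> each branch carries half the blade flow.
v_2a = branch_velocity(blade_power / 2, delta_t, tube_id)
# Option 2b: all ten plates in parallel -> each branch carries one tenth of the blade flow.
v_2b = branch_velocity(blade_power / 10, delta_t, tube_id)

print(f"Option 2a branch velocity: {v_2a:.2f} m/s (assumed limit {v_limit} m/s)")
print(f"Option 2b branch velocity: {v_2b:.2f} m/s (assumed limit {v_limit} m/s)")
```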
Figure 3 – Option 2: Water/PG touch cooling
Option 3: R134a Touch Cooling
The high heat of vaporization of the refrigerant allows refrigerant systems to use roughly one-fifth of the flow rate required by a water system handling the same power. Because of this, the cold plates in Option 3 do not need to be connected in parallel as in Option 2b, and can use an arrangement similar to Option 2a. As before, the blades were connected to vertical supply/return (refrigerant) manifolds, but, because of the lower flow rate, a refrigerant pump can fit into each cabinet to pump refrigerant through the blades and manifolds.
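The flow-rate advantage follows from comparing sensible heat absorption in a water/PG loop with latent heat absorption in a refrigerant loop, as in the minimal sketch below. The property values, allowed temperature rise, and exit quality are typical assumed numbers, not values from the study; the exact ratio depends on those assumptions, but it comes out on the order of the factor of five cited above.

```python
# Illustrative comparison of coolant mass flow per blade: water/PG (sensible) vs. R134a (latent).
# Property values and operating assumptions are approximate and for illustration only.

CP_WATER_PG = 3700.0     # J/(kg*K), approximate specific heat of a water/PG mixture
H_FG_R134A = 163_000.0   # J/kg, approximate latent heat of R134a near 40 degC

def water_flow(q_w, delta_t_k):
    """Water/PG mass flow (kg/s) for sensible heat absorption with a delta_t_k coolant rise."""
    return q_w / (CP_WATER_PG * delta_t_k)

def refrigerant_flow(q_w, exit_quality):
    """R134a mass flow (kg/s) when only a fraction exit_quality of the flow evaporates."""
    return q_w / (H_FG_R134A * exit_quality)

blade_power = 60_000.0 / 20                                # W per blade (60 kW cabinet, 20 blades)
m_water = water_flow(blade_power, delta_t_k=8.0)           # assumed 8 K coolant temperature rise
m_r134a = refrigerant_flow(blade_power, exit_quality=0.8)  # 80% exit quality design limit

print(f"Water/PG flow per blade: {m_water * 1000:.0f} g/s")
print(f"R134a flow per blade:    {m_r134a * 1000:.0f} g/s")
print(f"Flow ratio (water/refrigerant): {m_water / m_r134a:.1f}x")
```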
In Option 3a, the manifolds transport the refrigerant to a refrigerant distribution unit (RDU), where the heat is transferred to a closed water loop feeding into a rooftop dry cooler. In Option 3b, the refrigerant is sent straight to the roof, where a rooftop condenser dissipates the heat to the atmosphere. The layouts of these two options are shown schematically in Figure 4.
Figure 4 – Option 3: R134a touch cooling
The control system for a refrigerant cooling system includes built-in headroom in refrigerant capacity. The system is designed so that, under full load, the refrigerant quality (the fraction of refrigerant that is vapor, by mass) does not exceed 80%, so a 20% margin is already built in for the worst-case scenario. Flow regulators at the inlet of each blade ensure even flow distribution across the entire height of the cabinet. In the event of an increase in CPU power, either the RDU would increase the water flow rate or the rooftop condenser would increase its fan speed to fully condense the refrigerant.
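A minimal sketch of the kind of check such a control scheme could perform is shown below: estimate the exit quality from the blade power and pump flow, and flag when it reaches the 80% design limit. The flow value and the control response are illustrative assumptions, not the actual controller used in the study.

```python
# Illustrative quality-headroom check for the refrigerant loop (not the study's controller).

H_FG_R134A = 163_000.0   # J/kg, approximate latent heat of R134a near 40 degC
QUALITY_LIMIT = 0.80     # design limit: at most 80% of the refrigerant may leave as vapor

def exit_quality(q_w, m_dot_kg_s):
    """Estimated vapor quality at the blade outlet for heat load q_w and refrigerant flow m_dot_kg_s."""
    return q_w / (m_dot_kg_s * H_FG_R134A)

def cooling_action(q_w, m_dot_kg_s):
    """Decide whether to increase heat rejection (RDU water flow or rooftop condenser fan speed)."""
    x = exit_quality(q_w, m_dot_kg_s)
    if x >= QUALITY_LIMIT:
        return f"quality {x:.2f} at/above limit -> raise RDU water flow or condenser fan speed"
    return f"quality {x:.2f} within limit -> no action"

# A 3 kW blade fed with an assumed 24 g/s of R134a stays just under the 80% design limit...
print(cooling_action(3_000.0, 0.024))
# ...while a 20% CPU power increase on the same flow would exceed it, triggering a response.
print(cooling_action(3_600.0, 0.024))
```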
Capital and Operating Expenditures
With identical performance specifications (maximum case temperature and environmental ambient temperature), the differences between cooling systems can be easily compared in terms of capital and operating expenses (CapEx and OpEx, respectively). It is important to note that this first-pass analysis was not intended to produce fully optimized designs for each cooling option. For equipment selection, computational fluid dynamics (CFD) analysis [7], flow network analysis [8], a two-phase analysis software suite, and vendors’ product selection software [9, 10] were used to analyze pressure drops and heat transfer across the different system components.
For each cooling option, capital expenses were determined by obtaining manufacturer quotes for only the main cooling-hardware components (fans, heat exchangers, pumps, cold plates, refrigerant quick disconnects, etc.), with an additional 10% of that cost assessed for installation, piping, etc. The cost of the electric power supply, controls, hose and pipe fittings, UPS, etc. was not included. This estimate certainly understates the total installed cost, but it is representative of the main cost drivers associated with each cooling system.
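In code form, that CapEx roll-up is just a sum of the major-component quotes plus a flat 10% allowance, as in the sketch below; the component names and prices are hypothetical placeholders, not the quotes used in the study.

```python
# Illustrative CapEx roll-up for one cooling option (component names and prices are placeholders).

INSTALLATION_FACTOR = 0.10   # flat 10% of hardware cost assessed for installation, piping, etc.

def capex_estimate(component_quotes):
    """Sum the major-component quotes and add the flat installation allowance."""
    hardware_cost = sum(component_quotes.values())
    return hardware_cost * (1.0 + INSTALLATION_FACTOR)

# Hypothetical quotes (US$) for a water-cooled option; values are made up for illustration.
option_2_quotes = {
    "cold_plates_and_blade_loops": 120_000.0,
    "manifolds_and_quick_disconnects": 40_000.0,
    "cdu": 60_000.0,
    "rooftop_dry_cooler": 50_000.0,
}

print(f"Estimated CapEx: ${capex_estimate(option_2_quotes):,.0f}")
```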
One important note about the CapEx estimates is that the water and refrigerant cold plates are assumed to be of equal cost. In reality, water cold plates require higher flow rates and therefore larger tube diameters to cool the same power, but refrigerant cold plates must withstand higher pressures. The actual cost of manufacturing depends more on the manufacturing technique than it does on the fluid used, so the two cold plates were assumed to be of similar cost.
Figure 5 – Estimated capital expenses for three cooling strategies
Operating expenses for all cases were calculated by determining the cost of electricity needed to pump the coolant around the loop and to run the fans. To do this, CFD and flow network analysis were used to calculate the pressure drop and flow rate of each fluid through the system, and an average operational efficiency was then used to determine the total power draw. This analysis assumed an electricity rate of $0.10/kWh, without demand charges, and 8,760 operating hours per year for all methods. Since the refrigerant-based option does not require periodic flushing and replacement, the cost of R134a was added only to CapEx in our analysis. With water cooling, the water/PG coolant mixture requires regular flushing (every 2-3 years) to keep the electro-galvanic corrosion inhibitors and microbiological growth suppressants active, so the cost of the water-cooling additives was added to OpEx as well as to the initial CapEx estimate.
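The OpEx calculation therefore reduces to converting hydraulic (or air) power into electrical power through an assumed efficiency and multiplying by the electricity rate and annual hours, as sketched below. The loop pressure drop, flow rate, pump efficiency, and flush cost are illustrative placeholders, not results of the CFD and flow-network analyses.

```python
# Illustrative annual operating-cost estimate for one cooling option.
# Pressure drop, flow rate, efficiency, and flush cost are placeholders, not results from the study.

ELECTRICITY_RATE = 0.10   # $/kWh, per the stated analysis assumption
HOURS_PER_YEAR = 8_760    # continuous, year-round operation

def pumping_power_w(flow_m3s, dp_pa, efficiency):
    """Electrical power (W) to move a fluid at flow_m3s against dp_pa at a wire-to-fluid efficiency."""
    return flow_m3s * dp_pa / efficiency

def annual_energy_cost(power_w):
    """Annual electricity cost ($) of a continuously running load."""
    return (power_w / 1000.0) * HOURS_PER_YEAR * ELECTRICITY_RATE

# Hypothetical water/PG loop for the 1 MW module: 30 L/s against 250 kPa at 50% pump efficiency.
pump_power = pumping_power_w(0.030, 250e3, efficiency=0.50)

# Assumed coolant additive/flush cost, amortized over a 2-3 year flushing interval.
flush_cost_per_year = 3_000.0 / 2.5

print(f"Pump electrical power: {pump_power / 1000:.1f} kW")
print(f"Estimated annual OpEx: ${annual_energy_cost(pump_power) + flush_cost_per_year:,.0f}")
```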
Figure 6 – Estimated operating expenses for three cooling strategies
Figure 6 shows that, even with a direct rear-door air-cooling approach, a low data center air temperature, heat sinks with embedded heat pipes, and high-efficiency turbine-blade fans, air cooling will cost far more to operate than any liquid-cooling option. Again, this best-effort air cooling option was presented only for the sake of comparison.
Figure 6 also indicates that switching from air cooling to any of the four liquid-cooling options will cut the operating expense (to run the cooling system, not to run the entire data center) by at least a factor of 5. In addition, the direct refrigerant cooling option (3b) shows the lowest operating cost of all the cooling options, at less than 1/30th of the cost of the air-cooled option. With this operating cost, an existing air-cooled data center (per Option 1) would greatly benefit from switching to a refrigerant-cooled data center (Option 3b), and, assuming no additional retrofit expenses, would recover the switching cost within the first year of operation.
CONCLUSION
Although the comparison in this paper is a preliminary, predictive analysis of several different cooling systems, the differences in power consumption revealed here show that a data center outfitted with liquid cooling has a tremendous advantage over air cooling at the specified power level (60 kW per cabinet). As HPC and Telecom equipment continues toward higher power densities, the inevitable shift to liquid cooling will force designers to choose between water and refrigerant cooling. It is the author’s belief that the industry will eventually choose direct refrigerant cooling because of its operating-cost advantage over the other cooling systems (at least 2.5 times lower) at similar capital cost, its minimal space requirements on the board, and the absence of microbiological growth, electro-galvanic corrosion, and the corresponding need to periodically flush the system.
ACKNOWLEDGEMENTS
This study was inspired, promoted, and scrutinized by dedicated professionals representing all cooling approaches from the Liquid Cooling forum on LinkedIn, and was carried out by Thermal Form & Function, Inc.
REFERENCES
[1] ASHRAE TC 9.9, “Thermal Guidelines for Liquid Cooled Data Processing Environments,” American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., 2011.
[2] Brock, D., Understanding Moore’s Law: Four Decades of Innovation, Chemical Heritage Foundation, pp. 67–84, ISBN 0-941901-41-6. http://www.chemheritage.org/community/store/books-and-catalogs/understanding-moores-law.aspx, Retrieved March 15, 2015.
[3] Hannemann, R., and Chu, H., “Analysis of Alternative Data Center Cooling Approaches,” Proceedings of InterPACK ’07, Vancouver, BC, Canada, July 8-12, 2007, Paper No. IPACK2007-33176, pp. 743-750.
[4] LinkedIn Liquid Cooling Forum. https://www.linkedin.com/grp/post/2265768-5924941519957037060, Retrieved May 11, 2015.
[5] Network Equipment-Building System (NEBS) Requirements: Physical Protection (A Module of LSSGR, FR-64; TSGR, FR-440; and NEBSFR, FR-2063), GR-63-CORE, Issue 1, October 1995.
[6] Xcelaero Corporation, “Hurricane 200,” A457D600-200 data-sheet, 2010. http://www.xcelaero.com/content/documents/hurricane-brochure_r5.indd.pdf, Retrieved May 11, 2015.
[7] ANSYS Icepak. Ver. 16. Computer Software. Ansys, Inc., 2015.
[8] AFT Fathom. Ver. 7.0. Computer Software. Applied Flow Technology, 2011.
[9] LuvataSelect. Ver. 2.00.48. Computer Software. Luvata, 2015. http://www.luvataselectna.com, Retrieved May 11, 2015.
[10] Enterprise Coil Selection Program. Ver. 3.1.6.0. Computer Software. Super Radiator Coils, 2015. http://www.srcoils.com/resources-support/coil-sizing-software/, Retrieved May 11, 2015.