Readers of Electronics Cooling will no doubt recall our blog "Of Deepmind, DCIM, and Data Center Cooling," published two years ago, in which we highlighted some of the good design practices Google uses to improve (i.e., decrease) the Power Usage Effectiveness (PUE) metric in its data centers. In that blog we cited an average PUE that steadily decreased from 1.20 in 2008 to 1.12 in 2016. Google continues to publish PUE data from its data centers, and the metric still hovers around 1.12. For comparison, the best reported industry-wide average PUE is 1.17.
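For readers less familiar with the metric, PUE is simply the ratio of total facility energy to the energy delivered to the IT equipment, so a PUE of 1.12 means only 12% overhead for cooling, power distribution, and everything else. A minimal sketch, using hypothetical monthly figures:

```python
def pue(total_facility_energy_kwh: float, it_equipment_energy_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy.
    An ideal data center approaches 1.0 (all energy goes to the IT gear)."""
    return total_facility_energy_kwh / it_equipment_energy_kwh

# Hypothetical monthly figures: 1,120 MWh total, of which 1,000 MWh went to IT gear.
print(pue(1_120_000, 1_000_000))  # -> 1.12, matching Google's reported average
```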
Some of the good design practices we highlighted in the earlier blog included: a) using servers designed for optimal power consumption and hence minimal thermal dissipation, b) optimizing fan operating points by running the server fans at the minimum RPM needed to maintain a given temperature (sketched below), c) raising the cold-aisle temperature from 64°F to 80°F, and d) ducting for hot-air containment. However, Google found that each data center has behavioral idiosyncrasies that are best understood, managed, and controlled using neural networks. This is what led Google to put its Deepmind artificial intelligence (AI) engine to effective use in cooling applications to reduce the energy consumption of its data centers.
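Item (b) above is, at its core, a feedback control problem: spin the fans only as fast as the thermal load requires. The sketch below is a hypothetical proportional controller, not Google's implementation; all names, gains, and limits are illustrative.

```python
def fan_rpm(inlet_temp_c: float,
            target_temp_c: float = 27.0,   # roughly the 80 degF cold-aisle setpoint
            min_rpm: int = 2000,
            max_rpm: int = 10000,
            gain: float = 800.0) -> int:
    """Proportional controller: run the fan only as fast as needed to hold
    the target temperature, clamped to the fan's operating range."""
    error = inlet_temp_c - target_temp_c    # positive when running hot
    rpm = min_rpm + gain * max(error, 0.0)  # idle at min_rpm when at/below target
    return int(min(max(rpm, min_rpm), max_rpm))

print(fan_rpm(25.0))  # at/below setpoint -> 2000 (the minimal RPM)
print(fan_rpm(30.0))  # 3 degC over setpoint -> 4400
```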
In the two years since that blog, it appears that Google has carried out more rigorous studies using its AI engine, though much of that data remains unpublished. Close on its heels comes word this week that Google is letting Deepmind fully control cooling in its data centers, an industry first in trusting an AI engine to take charge of an operational facility. A recent MIT Technology Review article on the topic states that Google has effectively ceded control of its data center cooling to the Deepmind algorithm to manage all by itself! Although the AI engine runs the cooling independently, an engineer still supervises it and can intervene if something too risky appears imminent; this addresses some of the apprehensions stakeholders may have and shows how advanced AI systems can work in collaboration with humans. Furthermore, Google claims that its data centers on average consume 50% less energy than most other data centers, and it has announced multi-site ISO 50001 certification for energy management.
These days are rife with news of self-driving cars and robotic personal assistants, so a headline like "Google just gave control over data center cooling to an AI" may go unnoticed in the mainstream media. However, readers of Electronics Cooling and other technical publications will certainly be interested in applications of data-driven analytics to cooling and may want to learn more.
The algorithm controlling cooling in Google's data centers is based on reinforcement learning, a category of machine learning that does not require a precise mathematical model of the system; instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
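A common illustration of that balance is the epsilon-greedy rule: with a small probability the agent tries a random action (exploration), and otherwise it picks the action currently believed best (exploitation). The sketch below is a generic illustration, not Deepmind's actual algorithm; the cooling setpoints and value estimates are hypothetical.

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1):
    """With probability epsilon explore a random action; otherwise exploit
    the action with the highest estimated value so far."""
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore
    return max(q_values, key=q_values.get)    # exploit

# Hypothetical value estimates for three candidate cooling setpoints.
q = {"setpoint_75F": 0.42, "setpoint_78F": 0.61, "setpoint_80F": 0.55}
print(epsilon_greedy(q))  # usually "setpoint_78F", occasionally a random probe
```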
A typical reinforcement learning agent interacts with its environment in discrete time steps, which in this case could be the programmed polling intervals of the infrastructure monitoring sensors. In each loop, the agent receives an observation of the environment's state and chooses an action from the set of available actions. That action is applied to the environment, which moves to a new state, and the reward associated with the transition is returned to the agent. The agent may choose any action as a function of the history of past observations, actions, and rewards; its goal is to collect as much cumulative reward as possible.
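The loop just described can be sketched in a few lines. Everything here is illustrative: the environment stub stands in for the data center's monitoring and actuation layer, and the placeholder agent learns nothing; a real system would poll live telemetry and update a policy.

```python
import random

class CoolingEnvironment:
    """Stub environment: one 'time step' per sensor polling interval."""
    def observe(self):
        # Canned sensor reading; a real system would poll the DCIM layer.
        return {"cold_aisle_temp_c": 26.5, "it_load_kw": 950.0}

    def step(self, action):
        # Apply the action (e.g., a new setpoint) and return the reward for
        # the transition, here the interval's energy use negated so less is better.
        return -1120.0  # hypothetical facility kWh over the interval, negated

class RandomAgent:
    """Trivial placeholder agent: random setpoints, no learning."""
    def act(self, observation):
        return random.choice([75, 78, 80])  # candidate cold-aisle setpoints, degF
    def learn(self, observation, action, reward):
        pass  # a real agent would update its policy from (obs, action, reward)

def run(agent, env, steps):
    total_reward = 0.0
    for _ in range(steps):               # one iteration per polling interval
        observation = env.observe()      # agent receives the current state
        action = agent.act(observation)  # choose from the available actions
        reward = env.step(action)        # environment transitions; reward comes back
        agent.learn(observation, action, reward)
        total_reward += reward           # objective: maximize cumulative reward
    return total_reward

print(run(RandomAgent(), CoolingEnvironment(), steps=10))  # -> -11200.0
```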
In theory, the approach is also applicable to optimizing chassis-level cooling of data center servers and switches. As more vendors of these appliances expose energy management to network management systems, it is realistic to expect AI engines like Deepmind to operate data center cooling as a cooperative game in which each server or switch unilaterally declares its 'intentions', i.e., its compute load or traffic demand.
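As a purely hypothetical illustration of such declared 'intentions', each appliance could publish its expected load and a coordinator could budget cooling for the aggregate demand; no real vendor or network-management API is assumed below.

```python
# Hypothetical sketch: each appliance declares its expected load, and a
# coordinator budgets facility overhead for the aggregate demand.
def provision_cooling(declared_loads_kw: dict, overhead: float = 0.12) -> float:
    """Budget facility overhead as a fixed fraction of the declared IT load
    (12% here, echoing a PUE of 1.12)."""
    total_it_kw = sum(declared_loads_kw.values())
    return total_it_kw * overhead  # kW of cooling/facility overhead to budget

intentions = {"server-42": 0.45, "switch-07": 0.20, "server-13": 0.60}  # kW each
print(provision_cooling(intentions))  # -> 0.15 kW of overhead for 1.25 kW of IT load
```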
Beyond achieving the best possible PUE metric, applying AI engines to operate a facility has larger implications for professionals such as facility managers, network administrators, and data center technicians. That will be a topic for a future blog in Electronics Cooling. Your comments are welcome!