High-Performance Computing Is Facing an Energy-Efficiency Crisis



Margaret Henschel, one of the more than 50,000 Intel employees in the United States, moves through Fab 32, a high-volume manufacturing facility in Chandler, Arizona. Intel Corporation’s U.S. manufacturing and research and development facilities are in Oregon, Arizona and New Mexico. They operate 24 hours a day, 365 days a year. (Credit: Tim Herman/Intel Corporation)

Fortunately, Data Center Management Solutions Can Help Unlock HPC’s Full Potential, writes Rami Radi, Senior Application Engineer of Intel® Data Center Management Solutions.

Over the years, the prevailing approach to high-performance computing (HPC) environments has been to throw money at the problem, procuring more and more systems while ignoring the challenges of power, space and cooling.

Yet today, we know this trend is not sustainable, and in the context of large-scale HPC environments, many efforts have recently emerged proposing energy-efficient solutions. Yes, HPC holds the promise of solving the most difficult business, industrial, societal, medical and even existential quandaries troubling humanity. But the energy-efficiency crisis facing this wondrous, next-gen technology must first be addressed to unlock its full potential.


While no widely agreed-upon definition has yet emerged, generally speaking, HPC describes the ability to process data and perform highly complex calculations at amazingly high speeds, by which we mean not billions but quadrillions of calculations per second.

Another element that distinguishes HPC is its architecture. Instead of a monolithic, single-box design, compute servers are networked together into a cluster, and software programs and algorithms run simultaneously on the servers in the cluster, which is then networked to data storage. An HPC cluster can comprise hundreds or even thousands of compute servers.
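
To make those speeds concrete, a cluster's theoretical peak throughput can be estimated from its node count and per-core capabilities. The short sketch below is purely illustrative; every hardware figure in it is an assumed value, not a measurement of any particular system.

```python
# Back-of-the-envelope estimate of a cluster's theoretical peak throughput.
# Every hardware figure below is an illustrative assumption, not a benchmark.

NODES = 1000             # compute servers in the cluster
SOCKETS_PER_NODE = 2     # CPUs per server
CORES_PER_SOCKET = 32
CLOCK_GHZ = 2.5          # assumed sustained clock frequency
FLOPS_PER_CYCLE = 32     # e.g. two 512-bit FMA units on double-precision data

peak_flops = (NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
              * CLOCK_GHZ * 1e9 * FLOPS_PER_CYCLE)

print(f"Theoretical peak: {peak_flops / 1e15:.1f} petaFLOPS")
# Prints "Theoretical peak: 5.1 petaFLOPS" -- quadrillions of calculations per second.
```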

HPC solutions can be deployed on-premises, in the cloud, or at the edge, and are used across a wide range of industries, including financial services, healthcare, manufacturing, oil and gas, and research and educational institutions. According to a study by Grand View Research, the global HPC market is expected to reach $59.65 billion by 2025, expanding at a compound annual growth rate (CAGR) of 7.2% from $34.62 billion in 2017.

Often integrating artificial intelligence (AI) and machine learning technologies, HPC is already helping to detect credit card fraud, track stock trends in real time, and automate trading. It is enabling faster, more accurate patient diagnosis and assisting in the development of cures for cancer and diabetes. And it is helping scientists find sources of renewable energy and understand the evolution of our universe.

However, as companies build data science applications, recommendation engines, large-scale analytics, and other new applications driven by HPC, they are finding that legacy data centers, traditional computing platforms and network architectures are not equipped to handle these new demands.



The Challenges of HPC Environments Are Hardly Unsolvable

Because the performance demands are staggeringly high, HPC requires highly specialized IT infrastructure and data center designs. Among the most significant data center challenges associated with HPC are power requirements, which translate to major energy costs. Because HPC aggregates computing power in a way not typically associated with standard server infrastructure, it also requires denser banks of compute resources to increase capacity and reduce latency while minimizing floor space. To avoid the potential for unplanned downtime, careful consideration must therefore be given to future-proofing as it relates to power availability. Running a high-power-density HPC deployment also generates significant heat, which brings to the fore the problem of ineffective or insufficient cooling capacity.
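
To illustrate why power provisioning deserves that careful consideration, the quick check below compares a rack's demand against its provisioned capacity. All of the figures are assumed for the sake of the example.

```python
# Quick rack power-budget check. All figures are illustrative assumptions.

SERVERS_PER_RACK = 40
WATTS_PER_SERVER = 750          # assumed peak draw of a dense HPC node
PROVISIONED_KW_PER_RACK = 25.0  # assumed power delivered to the rack

demand_kw = SERVERS_PER_RACK * WATTS_PER_SERVER / 1000
headroom_kw = PROVISIONED_KW_PER_RACK - demand_kw

print(f"Rack demand:   {demand_kw:.1f} kW")
print(f"Rack headroom: {headroom_kw:.1f} kW")
if headroom_kw < 0:
    # Negative headroom means tripped breakers or throttling under full load.
    print("Over budget: this rack is a candidate for unplanned downtime")
```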

The cooling systems used in many older data centers date back to an era of significantly lower power densities. As a result, legacy facilities often struggle to accommodate the intense heat generated by HPC deployments, or do have sufficient cooling capacity but are unable to distribute it where needed. Frequently, a facility may not be operating at the capacity for which it was originally designed. The unfortunate response is to resort to overcooling, wasting electrical energy and expanding the data center’s carbon footprint in the process. Cooling alone can account for 30 to 40 percent of the power costs for the data center.
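
As a rough illustration of what that 30 to 40 percent share means in dollars, the sketch below estimates annual cooling spend for a hypothetical facility; the load, electricity rate, and cooling fraction are assumed values chosen only to make the arithmetic concrete.

```python
# Rough annual cooling-cost estimate. All inputs are illustrative assumptions.

FACILITY_LOAD_KW = 500    # assumed average total facility power draw
RATE_PER_KWH = 0.10       # assumed electricity price in $/kWh
COOLING_FRACTION = 0.35   # midpoint of the 30-40% range cited above

hours_per_year = 24 * 365
annual_cost = FACILITY_LOAD_KW * hours_per_year * RATE_PER_KWH
cooling_cost = annual_cost * COOLING_FRACTION

print(f"Total annual power cost: ${annual_cost:,.0f}")
print(f"Cooling share (35%):     ${cooling_cost:,.0f}")
# Roughly $438,000 in total, of which about $153,000 goes to cooling.
```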

Faced with these and other challenges, data center management solutions can help IT managers gain a better understanding of the power consumption and thermal status of servers in HPC deployments. These tools provide real-time and historical thermal maps and cooling analysis, monitoring not only servers and racks but also storage and networking equipment.

Most significantly, in an HPC environment where servers can run approximately 30 percent hotter due to the size and density of the compute workloads, data center management solutions enable IT staff to detect hotspots and cooling anomalies before they cause critical incidents. Moreover, these tools empower IT administrators to reduce cooling costs and improve Power Usage Effectiveness (PUE) by safely raising the temperature of the server room while continuously monitoring data center devices for thermal issues.
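
A minimal sketch of the kind of anomaly check such tools automate is shown below. The temperature readings here are simulated; in a real deployment they would come out of band from each server's baseboard management controller (e.g. via IPMI) or from a management tool's API.

```python
import statistics

# Simulated inlet-temperature readings in degrees C, standing in for
# out-of-band telemetry gathered from each server's management controller.
inlet_temps_c = {
    "node01": 24.5, "node02": 25.0, "node03": 31.5,  # node03 looks suspicious
    "node04": 24.0, "node05": 25.5,
}

def find_hotspots(temps: dict[str, float], threshold_c: float = 5.0) -> dict[str, float]:
    """Flag servers whose inlet temperature sits well above the rack median."""
    median = statistics.median(temps.values())
    return {h: t for h, t in temps.items() if t - median > threshold_c}

for host, temp in find_hotspots(inlet_temps_c).items():
    print(f"{host}: inlet {temp:.1f} C -- investigate before it causes an incident")
```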

Armed with accurate real-time power and thermal data provided by software solutions such as Intel® Data Center Manager (Intel® DCM), IT staff gain the insight needed to manage, plan and forecast power usage, increase rack density, and prolong operation during outages. In fact, monitoring the temperature and power consumption of each server in an HPC test environment was found to reduce power consumption by 5 to 8 percent by allocating work to nodes with high power efficiency. Intel® DCM also provides granular sub-component failure analysis and monitors real-time utilization data out of band, including CPU, disk, and memory. These capabilities provide predictive component-level health management, reduce mean time to repair (MTTR), and ultimately increase uptime.
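
That 5 to 8 percent figure comes from steering work toward the most power-efficient nodes. A minimal sketch of the ranking idea behind it, assuming per-node throughput and power figures gathered by the kind of telemetry described above, might look like this:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    throughput: float   # useful work per second, from job accounting (assumed)
    avg_power_w: float  # average draw, from per-server power telemetry (assumed)

def rank_by_efficiency(nodes: list[Node]) -> list[Node]:
    """Order nodes by performance per watt, best first."""
    return sorted(nodes, key=lambda n: n.throughput / n.avg_power_w, reverse=True)

nodes = [Node("node01", 950.0, 410.0),
         Node("node02", 900.0, 350.0),
         Node("node03", 980.0, 480.0)]

# A scheduler would fill the most efficient nodes first and idle the rest.
for n in rank_by_efficiency(nodes):
    print(f"{n.name}: {n.throughput / n.avg_power_w:.2f} work units per watt")
```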

Additionally, data center management solutions enable IT staff to quickly detect and analyze underutilized systems by monitoring their CPU utilization and power consumption over time, providing power statistics for every rack and server model with no additional hardware or software required. It is essential to understand how a workload is utilizing the resources of all the systems in the HPC clusters it runs on, and a data center management solution makes that visible. Armed with this knowledge, data center administrators can better plan and manage capacity and utilization in racks, increase their rack densities, and delay adding new racks.
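
A simple version of that underutilization check, assuming hourly CPU-utilization samples pulled from the historical telemetry described above, might look like the sketch below.

```python
def is_underutilized(samples: list[float], cpu_threshold: float = 10.0,
                     idle_fraction: float = 0.9) -> bool:
    """True if CPU utilization stayed below cpu_threshold percent for at
    least idle_fraction of the sampled interval."""
    if not samples:
        return False
    idle = sum(1 for s in samples if s < cpu_threshold)
    return idle / len(samples) >= idle_fraction

# Simulated hourly CPU-utilization history per server (percent).
history = {
    "node01": [3, 2, 4, 5, 1, 2, 3, 2, 4, 3],            # mostly idle
    "node02": [55, 60, 48, 70, 65, 58, 62, 66, 59, 61],  # busy
}

for host, samples in history.items():
    if is_underutilized(samples):
        print(f"{host}: candidate for consolidation or decommissioning")
```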

Make no mistake, the challenges of large-scale HPC environments are hardly unsolvable. Rather, data center management solutions can help ensure that the powerful computational capabilities and insights these deployments provide will continue unencumbered and at a rapid pace, whether on-prem, in the cloud, or at the edge.
