Forget Me Not! Server Memory Failures Can Adversely Affect Data Center Uptime, But Fortunately There’s a Solution

by Jeff Klaus, General Manager of Intel® Data Center Management Solutions

Cloud computing has become a foundation of digital business, if not the cornerstone of organizational digital transformation, and represents one of the most valuable innovations in current IT and corporate strategies. In fact, according to Gartner, more than 75 percent of the companies currently using cloud services indicate they have a cloud-first strategy. China, the world’s largest market for mobile payments and eCommerce, both of which drive cloud computing, is increasing its use of the technology. Over the past four years, Chinese officials have set their sights on more than doubling the scale of China’s cloud computing industry. McKinsey & Company forecasts that public cloud usage rates in China could grow more than 20 percent annually from 2018 to 2021.

The convergence of emerging cloud technology trends and China’s increasing demand for cloud services has opened up transnational business opportunities. Take, for example, Tencent, a leading global cloud solutions provider based in China, with operations in Asia-Pacific, Europe, and North America. Like other cloud service and online providers, Tencent relies heavily on server reliability, availability and serviceability (RAS) across its data centers.

Memory failures are among the most critical hardware failures that occur in data centers today. For this reason, Tencent set up a test deployment of Intel® Memory Failure Prediction (Intel® MFP) on thousands of servers based on Intel® Xeon® Scalable Processors, with the goal of reducing downtime caused by server memory failures. Intel® MFP is vendor-agnostic and works in conjunction with other data center management solutions, including Intel® Data Center Manager (Intel® DCM).

Intel® MFP analyzes historical data to predict memory failure events, so that potentially catastrophic failures can be addressed before they happen. In the test deployment, Intel® MFP monitored the health of the servers’ Dynamic Random Access Memory (DRAM) modules and provided Tencent IT administrators with critical information about them, including a health score based on their historical data.


Maintaining Server Availability and Uptime

Intel® MFP leverages online machine learning to analyze the historical data collected on server memory down to the Dual Inline Memory Module (DIMM), bank, column, row, and cell levels, providing a memory health score to predict potential future failures. The resulting analysis and health scores indicated a large number of potential memory issues within Tencent’s test environment, including both Correctable Errors (CE) and Uncorrectable Errors (UE). Why is this information of mission-critical importance to Tencent’s IT staff?

A burst in the number of CEs could result in the performance degradation of a server and even denial of service, while UEs can lead to catastrophic failures, typically resulting in system crashes. Using the results from Intel® MFP, memory failure locations were predicted at the micro-level, allowing Tencent to decide how to migrate critical tasks from servers with identified memory issues to other servers, and to mitigate the potential impact of UE events that could reduce server availability and uptime.
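To make the idea of a CE burst concrete, the following is a minimal sketch of sliding-window burst detection. It is not how Intel® MFP works internally, which the article does not describe. The `CorrectableError` record, the DIMM labels, the one-hour window, and the threshold of 10 errors are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from typing import NamedTuple


class CorrectableError(NamedTuple):
    """One logged CE. Field names are hypothetical, not an Intel MFP format."""
    timestamp: float  # seconds since an arbitrary epoch
    dimm: str         # illustrative DIMM label
    bank: int
    row: int
    column: int


def find_ce_bursts(events, window_seconds=3600.0, threshold=10):
    """Return the DIMM labels that logged at least `threshold` CEs
    within any sliding window of `window_seconds`."""
    recent = defaultdict(deque)  # per-DIMM timestamps inside the current window
    bursting = set()
    for event in sorted(events, key=lambda e: e.timestamp):
        window = recent[event.dimm]
        window.append(event.timestamp)
        # Drop timestamps that have fallen out of the window.
        while event.timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            bursting.add(event.dimm)
    return bursting
```

The sketch groups errors by DIMM only. Grouping by bank, row, or column instead would follow the same pattern with a finer-grained dictionary key.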

The Intel® MFP deployment resulted in improved memory reliability due to predictions based on the capture of micro-level memory failure information from the operating system’s Error Detection and Correction (EDAC) driver, which stores historical memory error logs. Additionally, by predicting potential memory failures before they happen, Intel® MFP can help improve DIMM purchasing decisions. As a result, Tencent was able to reduce annual DIMM purchases by replacing only DIMMs with a high likelihood of causing server crashes.
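For readers unfamiliar with EDAC, the Linux kernel exposes per-memory-controller error counters through sysfs. The sketch below reads the `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`. It only illustrates the kind of raw data an EDAC driver provides; it is not Intel® MFP’s collection mechanism, and the function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path

# Standard Linux sysfs location for EDAC memory controllers.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: {"ce": int, "ue": int}} for each mc* directory."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux host
    for controller in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((controller / "ce_count").read_text().strip())
            ue = int((controller / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[controller.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

These counters are running totals, so a collector built on them would need to sample them periodically to build up a history.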

Because Intel® MFP is able to predict issues at the memory cell level, that information can be used to avoid using certain cells or pages, a feature known as page offlining, which has become very important for large-scale data center operations. Tencent was therefore able to improve its page offlining policies based on Intel® MFP’s results.
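As an illustration of what a page offlining policy can look like on Linux, the sketch below picks pages that have accumulated repeated correctable errors and asks the kernel to soft-offline them through the `/sys/devices/system/memory/soft_offline_page` interface. This is an assumption-laden example, not Tencent’s policy: the threshold of three errors per page and the function names are hypothetical, and the sysfs file exists only on kernels built with memory-failure support and is writable only by root.

```python
import os
from collections import Counter

# Linux sysfs interface for soft-offlining a page by physical address.
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def pages_to_offline(error_addresses, min_errors=3, page_size=None):
    """Return sorted page-aligned physical addresses that saw at least
    `min_errors` correctable errors (hypothetical policy threshold)."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    # Round each error address down to the start of its page and count.
    per_page = Counter(addr - (addr % page_size) for addr in error_addresses)
    return sorted(page for page, n in per_page.items() if n >= min_errors)


def soft_offline(page_address, path=SOFT_OFFLINE_PATH):
    """Ask the kernel to migrate data off one page and stop using it."""
    with open(path, "w") as f:
        f.write(f"{page_address:#x}")
```

Keeping the selection step separate from the sysfs write makes the policy easy to review or dry-run before any page is actually taken out of service.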

Using Intel® MFP, server memory health was analyzed and given scores based on cell-level EDAC data. These scores allowed Tencent to make informed decisions on page offlining, replacing DIMMs, and migrating critical workloads away from servers with problematic DIMMs, all of which helped significantly reduce UE failures and server downtime.

The test deployment of Intel® MFP revealed that if Tencent employed the solution across all its data centers, it would yield a significant benefit by substantially improving operational efficiency and the overall reliability, availability and serviceability of its cloud services.
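The three kinds of decisions mentioned above (page offlining, workload migration, and DIMM replacement) could be tied to a health score along the following lines. Everything in this sketch is hypothetical: the article does not state the scale of the Intel® MFP health score or the thresholds Tencent used, so the 0 to 100 range (higher meaning healthier) and the cut-offs of 80, 50, and 20 are placeholders.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    OFFLINE_PAGES = "offline affected pages"
    MIGRATE_WORKLOADS = "migrate critical workloads"
    REPLACE_DIMM = "replace DIMM"


def triage(health_score, runs_critical_workload):
    """Map a hypothetical 0-100 health score to a list of actions.
    All thresholds are illustrative placeholders."""
    if not 0 <= health_score <= 100:
        raise ValueError("health_score must be between 0 and 100")
    if health_score >= 80:
        return [Action.NONE]
    actions = [Action.OFFLINE_PAGES]  # cheapest mitigation first
    if health_score < 50 and runs_critical_workload:
        actions.append(Action.MIGRATE_WORKLOADS)
    if health_score < 20:
        actions.append(Action.REPLACE_DIMM)  # only the least healthy DIMMs
    return actions
```

Reserving replacement for the lowest scores reflects the point made earlier about replacing only the DIMMs most likely to cause crashes.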

According to an Information Technology Intelligence Consulting (ITIC) survey, more than 80 percent of businesses now require a guaranteed uptime of 99.99 percent from their cloud service vendors. It’s self-evident that some systems are so critical to an organization’s business operations that they must be monitored constantly, and cloud solution providers are no exception. For this reason, cloud vendors striving to maintain the highest levels of uptime across their data centers would do well to use tools such as Intel® MFP, which predicts problems before they manifest and points to preemptive measures that keep system failures from occurring in the first place.
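To put the 99.99 percent figure in perspective, it leaves roughly 52.6 minutes of allowable downtime over a 365-day year. The short calculation below shows the arithmetic; the function name is just a label for this example.

```python
def allowed_downtime_minutes(uptime_percent, days=365):
    """Minutes of downtime permitted over `days` at the given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# Compare a few common availability targets.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

With so little room for downtime, a single lengthy outage can use up a large part of the annual budget.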

