ScaleFlux ECC Technology Set to Change DRAM Reliability
Advancement in DRAM fault tolerance, addressing challenges faced by data centers while unlocking new possibilities for innovation at system level
This is a Press Release edited by StorageNewsletter.com on June 20, 2024 at 2:02 pm
ScaleFlux, Inc. is set to transform AI and data center RAS (reliability, availability, and serviceability) with its ECC (Error Correction Coding) technology.
As conventional ECC methods buckle under the pressure of climbing memory error rates, undermining system reliability, the company’s innovative approach using list decoding shatters those limitations, offering rapid and efficient correction of complex errors. This technology boosts DRAM reliability and security and also slashes costs by making the use of lower-cost DRAM chips viable, paving the way for a more resilient and cost-effective computing infrastructure. The firm is not just meeting the demands of the future; it is redefining them.
With the global AI market set to add $15.7 trillion to the world economy by 2030 (1), and with data centers and AI technology evolving rapidly, reliable, high-performance memory solutions are more critical than ever for avoiding the increasing costs of downtime from component and system failures.
“Under the combination of growing DRAM densities and increasing challenges with DRAM error rates, we saw the need for a new style of error correction coding,” said Tong Zhang, chief scientist, ScaleFlux. He continued, “the hyperbolic growth of AI infrastructure and the emergence of memory expansion with Compute Express Link (CXL) has the cascading effect of driving up DRAM capacities and traffic, exacerbating the need for innovation in ECC.”
As DRAM technology advances towards 10nm and beyond and error rates in the media grow, DRAM fault tolerance becomes increasingly important. This is where ECC steps in, playing a vital role in ensuring the reliability of the data and, subsequently, the data centers. Without this crucial error correction, data can readily become corrupted, resulting in ‘garbage in, garbage out’ calculations and even costly system crashes. Traditional ECC methods struggle to handle these increased error rates while meeting stringent latency constraints. A new solution is needed.
Error rates on rise
Four key trends are multiplying the frequency of memory errors:
- Increasing memory capacity density: Current server and GPU systems can support 1TB or more of DRAM! This capacity expands even further with Compute Express Link (CXL) memory modules.
- Increasing fault-caused blast radius: As cloud computing infrastructure continues to scale out, the crash of one server caused by memory errors can compromise more and more connected servers, amplifying the fatal impact of memory bit errors many times over.
- Increasing memory access speeds: As the industry has progressed from DDR3 through DDR4 to DDR5, transfer rates have quadrupled from 1,600Mb/s to 6,400Mb/s with further accelerations on the way.
- Increasing vulnerability to soft errors and defects in the memory media: As memory manufacturers move to newer manufacturing lithography, the memory bit cells shrink, introducing more susceptibility (2) to soft errors and defects.
Considering that errors/second are a function of the memory capacity, the blast radius, the rate of access to the memory, and the inherent rate of errors in the memory media, the increases in all these factors clearly make for a significant challenge to reliability.
Throw on top of that the combination of increasing complexity of error detection with the system-level performance hits from uncorrectable errors and it’s a real nightmare situation for maintaining system reliability and avoiding costly downtime.
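To give a feel for how the four trends compound, here is a minimal sketch. It rests on a deliberately crude assumption that is not stated in the release: that the error exposure grows as the simple product of the growth in each factor. The function name and the sample factors are hypothetical; only the 4x access-rate figure echoes the DDR3-to-DDR5 numbers above.

```python
def relative_error_exposure(capacity_x, blast_radius_x, access_rate_x, media_error_x):
    """Combined growth factor under an ASSUMED purely multiplicative model.

    Each argument is the growth of one factor relative to a baseline
    (1.0 means unchanged). Real systems are unlikely to be this simple.
    """
    return capacity_x * blast_radius_x * access_rate_x * media_error_x


# Hypothetical scenario: capacity doubles, access rate quadruples,
# blast radius and per-bit media error rate stay flat.
print(relative_error_exposure(2.0, 1.0, 4.0, 1.0))  # 8.0
```

Even with two of the four factors held constant, the assumed model yields an eightfold increase, which is the kind of compounding the paragraph above describes.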
Conventional ECC can no longer cut it
Conventional ECC methods typically follow minimum distance bounded decoding, capable of correcting up to ‘t’ symbol errors using ECC with a minimum distance of ‘2t+1’. To meet stringent latency constraints, many DRAM ECC design solutions protect each data access unit (e.g., 64-byte cache line) by interleaving multiple short-length ECC codewords that each correct only one or a few symbols at very low latency (1~3 clock cycles).
However, when it comes to tolerating more errors from DRAM devices, conventional methods will become largely inadequate, leading to uncorrectable errors and hence catastrophic failures in data centers.
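The ‘t’ versus ‘2t+1’ limit can be seen in a few lines of code. The sketch below is a toy illustration, not any vendor's DRAM ECC: it uses the classic binary Hamming(7,4) code, whose minimum distance is 3, so t = 1. It works on single bits, whereas DRAM ECC typically works on multi-bit symbols, and all function names are chosen here for illustration.

```python
# Toy illustration only: Hamming(7,4) has minimum distance 3 = 2t + 1, so t = 1.

def hamming74_encode(data):
    """Encode 4 data bits into 7 bits (parity at 1-based positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    return [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]


def syndrome(word):
    """XOR of the 1-based positions holding a 1; zero for a valid codeword."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def bounded_decode(word):
    """Bounded-distance decoding: flip the one position the syndrome names."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return fixed


def flip(word, *positions):
    """Return a copy of word with the given 1-based positions inverted."""
    out = list(word)
    for p in positions:
        out[p - 1] ^= 1
    return out


codeword = hamming74_encode([1, 0, 1, 1])
after_one_error = bounded_decode(flip(codeword, 3))
after_two_errors = bounded_decode(flip(codeword, 3, 6))

print(after_one_error == codeword)      # True: one error is within t = 1
print(after_two_errors == codeword)     # False: two errors exceed t
print(syndrome(after_two_errors) == 0)  # True: the wrong result is a valid codeword
```

The last line is the worrying case: with two errors the decoder does not merely fail, it silently returns a different valid codeword.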
In layman’s terms, think of a computer’s memory like a document. ECC functions as a spell checker, detecting and correcting errors like typos that may occur when saving or retrieving data. However, conventional ECC, like a basic spell checker, has its limitations. It can only catch and fix certain types of errors, like single-letter mistakes. If multiple errors occur or the errors are too complex, conventional ECC may struggle to correct them effectively, leaving your data vulnerable to inaccuracies.
Enter innovative ECC methodology
ScaleFlux, a fabless semiconductor company innovating in the field of data center storage and memory technology, has developed an ECC solution that will change DRAM fault tolerance. It presented this solution at the IEEE RAS in Data Centers Summit on June 12, 2024. Unlike conventional ECC approaches, the firm’s solution leverages list decoding, a branch of coding theory with origins dating back to the 1950s but largely forgotten in modern applications due to its computational complexity.
Central to this innovation is the application of list decoding, which aims at correcting more-than-‘t’ errors. The company’s solution departs from the conventional ECC paradigm by protecting each 64-byte cache line with a single codeword while ensuring decoding latency as low as 1~3 clock cycles. This approach enables the correction of more-than-‘t’ errors from any combination of 2 DRAM devices at very high speed and with low computational complexity. Moreover, list decoding offers the added benefit of avoiding mis-correction by detecting decoding errors when the list contains two codewords of similar likelihood.
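The general idea of list decoding, and of flagging an ambiguous list instead of guessing, can be sketched on a toy code. This is not ScaleFlux's algorithm or architecture, which the release does not detail. The assumptions here are: an extended Hamming [8,4] code (minimum distance 4, so t = 1), eight bits spread over four hypothetical two-bit "devices", errors confined to a single device (the release speaks of two), and a brute-force search over all 16 codewords in place of a low-latency decoder. A candidate counts as plausible if it differs from the received word only inside one device.

```python
from itertools import product


def ext_hamming_encode(data):
    """Extended Hamming [8,4]: Hamming(7,4) plus one overall parity bit."""
    d1, d2, d3, d4 = data
    word = [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]
    return tuple(word + [sum(word) % 2])


CODEBOOK = [ext_hamming_encode(d) for d in product((0, 1), repeat=4)]

# Hypothetical bit-to-device mappings (0-based bit indices), two bits per device.
INTERLEAVED = [{0, 1}, {2, 3}, {4, 6}, {5, 7}]
SEQUENTIAL = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]


def list_decode(received, devices):
    """Return every codeword whose difference from received fits in one device."""
    candidates = []
    for cw in CODEBOOK:
        diff = {i for i in range(8) if cw[i] != received[i]}
        if any(diff <= dev for dev in devices):
            candidates.append(cw)
    return candidates


def decode_or_flag(received, devices):
    """Return the unique candidate, or None when the list is empty or ambiguous."""
    candidates = list_decode(received, devices)
    return candidates[0] if len(candidates) == 1 else None


def corrupt(word, bits):
    """Return a copy of word with the given 0-based bit indices inverted."""
    return tuple(b ^ 1 if i in bits else b for i, b in enumerate(word))


sent = ext_hamming_encode((1, 0, 1, 1))

# Two bit errors (more than t = 1) inside one device: the list has one entry.
print(decode_or_flag(corrupt(sent, {4, 6}), INTERLEAVED) == sent)  # True

# Same code, different mapping: two equally plausible candidates, so flag it.
print(len(list_decode(corrupt(sent, {0, 1}), SEQUENTIAL)))         # 2
print(decode_or_flag(corrupt(sent, {0, 1}), SEQUENTIAL))           # None
```

With the interleaved mapping, every error pattern confined to one device decodes uniquely in this toy, including two-bit patterns that a bounded-distance decoder with t = 1 could not correct. With the sequential mapping, some patterns produce two candidates, and the decoder reports an uncorrectable error rather than risk a mis-correction. Errors spanning more than one device are outside what this sketch handles.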
To realize this innovative ECC technology, the firm developed a robust mathematical framework to analyze correction, detection, and mis-correction probabilities. It also engineered a highly parallel VLSI-friendly architecture for ultra-low-latency decoding. Successful demonstrations on FPGA platforms verified more-than-‘t’ error correction with extremely low mis-correction probability. As a part of the development effort, ScaleFlux collaborated with key ecosystem partners including memory suppliers, CPU vendors, and hyperscalers.
“Innovations like ScaleFlux’s decoding ECC methodology may be capable of enabling high-reliability CXL solutions,” said Jim Pappas, chairperson, CXL Consortium. “CXL technology-based advancements ultimately enable the industry to meet the demands of expanding memory capacity and bandwidth.”
Remember, it’s not just about memory
The impact of ScaleFlux’s ECC technology extends beyond improving DRAM reliability. By accommodating less reliable, lower-cost DRAM chips, it can reduce the TCO for data center operators. Furthermore, it enhances immunity to security risks associated with malicious DRAM RowHammer attacks, bolstering data center security.
The company’s innovative ECC design solution is set to change DRAM fault tolerance, offering new levels of reliability and performance benefits for computing infrastructure. With its potential to lower costs, enhance security, and push the boundaries of RAS (Reliability, Availability, and Serviceability), this technology is poised to shape the future of data center and AI computing infrastructure.
Summing it up
ScaleFlux’s ECC technology represents an advancement in DRAM fault tolerance, addressing the challenges faced by data centers while unlocking new possibilities for innovation at the system level.
“As we embark on this journey towards more resilient and efficient computing infrastructure, ScaleFlux’s innovations are blazing the trail for progress in the ever-evolving landscape of AI and data center technology,” says JB Baker, VP products, ScaleFlux.
(1) Downie, Chris. How Data Centers Can Simultaneously Enable AI Growth and ESG Progress. May 13, 2024, accessed June 7, 2024.
(2) Meixner, Anne. DRAM Test and Inspection Just Gets Tougher. Semiconductor Engineering, November 7, 2023, accessed June 7, 2024.