Drive Density Kills RAID
Answer is vRAID according to StorONE.
This is a Press Release edited by StorageNewsletter.com on October 7, 2020 at 2:25 pm. This article was written on October 5, 2020 by George Crump, CMO, StorONE, Inc., formerly founder and lead analyst, Storage Switzerland.
While it lowers the cost of storing data, drive density kills RAID as a means of protecting against media failure.
Today, both flash and HDD media are available in ~16TB capacities. These high-capacity devices expose weaknesses in traditional RAID algorithms, making RAID suspect today and impractical in the near future.
Drive density kills RAID hot sparing
The good thing about RAID is that it continues to provide access to data when a drive fails. The challenge, however, is quickly getting the storage system back into a protected state before another drive fails. Most enterprise storage systems use global hot spares to rebuild the RAID group automatically. Hot spares are dedicated drives that sit idle, waiting for a failure. When a drive fails, a hot spare is logically assigned to the affected RAID group, so the customer doesn't have to do anything to start the rebuild process.
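To make the mechanism concrete, here is a minimal Python sketch of global hot sparing; it is a generic illustrative model, not any vendor's implementation, and all names in it are hypothetical:

```python
# Minimal illustrative model of global hot sparing (not any vendor's code).
class RaidGroup:
    def __init__(self, name, drives):
        self.name = name
        self.drives = set(drives)

class StoragePool:
    def __init__(self, groups, hot_spares):
        self.groups = groups                # list of RaidGroup
        self.hot_spares = list(hot_spares)  # idle drives waiting for a failure

    def on_drive_failure(self, group, failed_drive):
        group.drives.discard(failed_drive)
        if not self.hot_spares:
            raise RuntimeError("no hot spare available -- group stays degraded")
        spare = self.hot_spares.pop()       # logically assign the spare
        group.drives.add(spare)
        print(f"rebuilding {group.name} onto spare {spare}")

pool = StoragePool([RaidGroup("rg0", ["d0", "d1", "d2", "d3"])], ["spare0"])
pool.on_drive_failure(pool.groups[0], "d1")  # rebuild starts with no admin action
```

Note that once `spare0` is consumed, a second failure in any group leaves the pool with no spare at all, which is exactly the first problem listed below.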
There are 3 fundamental problems with hot sparing:
- First, the hot spare needs to be replaced relatively quickly after the first failure; otherwise, the system is running without a hot spare. Replacing hot spares is particularly problematic during the current pandemic because access to data centers is no longer routine.
- Second, because traditional RAID can't mix drive sizes within a RAID group, the customer needs a hot spare for each type of drive in the storage system. The inability to mix drive sizes also means that drives must be of the same capacity when a customer wants to expand the system; the alternative is to create a new RAID group from higher-capacity drives and manually migrate data to it. Hot spares are also inefficient: each one sets aside what can now be 16TB of capacity, waiting for something to go wrong. It is not uncommon for customers to have 50TB or 100TB of idle capacity in hot spares, which in most cases eliminates any capacity gains from deduplication or compression (see the back-of-envelope calculation after this list).
- The third problem is that hot spares contribute to the slow RAID rebuilds, which we address next.
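To put the inefficiency into numbers, a quick back-of-envelope calculation; the drive counts and spare policy here are assumptions chosen only to illustrate the scale of the article's 50TB-100TB figure:

```python
# Idle capacity tied up in hot spares (assumed drive counts, for illustration).
drive_tb = 16           # capacity of one high-density drive
spares_per_type = 2     # spares kept per drive type in the system
drive_types = 3         # drive types that can't share spares (sizes/media)

idle_tb = drive_tb * spares_per_type * drive_types
print(f"{idle_tb} TB sitting idle in hot spares")   # 96 TB

# If deduplication/compression saves ~2:1 on 100 TB of data, the ~50-100 TB
# of idle spare capacity can cancel out that entire gain.
```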
Drive density kills RAID because of slow rebuilds
One of the biggest challenges facing data centers is the time it takes to rebuild a RAID group. Even AFAs are seeing rebuild times creep up as capacities increase, but the biggest challenge is the rebuild time on HDDs. A RAID group built from 12TB or 16TB HDDs typically measures rebuild times in days, not hours. Many potential customers who come to us looking for rebuild-time relief report RAID rebuild times of weeks for these high-capacity drives. Slow rebuild times cause many problems, some obvious, others less so.
The most obvious and pressing problem with slow rebuilds is the exposure window itself. If your system is set to single parity, you cross your fingers for the entire rebuild, hoping that another drive doesn't fail.
Why are RAID rebuilds so slow?
Part of the problem points back to the hot spare concept: when a hot spare is logically moved into position, all the other drives write simultaneously to that one drive, which becomes the bottleneck. Also, most RAID implementations do a sector-by-sector rebuild, so even if a 16TB drive held only 5TB of data, it rebuilds no faster than one holding 15.5TB. As both flash and HDD capacities continue to increase, the problem worsens.
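A rough estimate shows why days-long rebuilds are plausible; the throughput and bandwidth-share figures below are assumptions for illustration, not measurements:

```python
# Rough rebuild-time estimate for a sector-by-sector rebuild (assumed rates).
capacity_tb = 16
sustained_write_mbps = 180   # optimistic sustained HDD write rate, MB/s
rebuild_share = 0.25         # bandwidth left for rebuild after production I/O

seconds = (capacity_tb * 1e12) / (sustained_write_mbps * 1e6 * rebuild_share)
print(f"~{seconds / 86400:.1f} days")   # ~4.1 days

# Because the rebuild is sector by sector, the same time applies whether the
# failed drive held 5 TB of data or 15.5 TB.
```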
All this simultaneous writing, on top of regular production I/O, puts a heavy burden on the drives. Consequently, the chances of another failure increase dramatically. With a single-parity design and high-density drives, rebuilds can take hours or days, and concerns about a double drive failure are legitimate.
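One way to see why the concern is legitimate is a simple exposure calculation; the failure rate, group size, and rebuild window below are assumed values, and real drives under rebuild stress may fare worse:

```python
# Rough odds of a second failure during the rebuild window (assumed numbers).
afr = 0.02             # annualized failure rate, elevated by rebuild stress
surviving_drives = 11  # remaining members of a 12-drive group
rebuild_days = 4

p_one_drive = afr * rebuild_days / 365
p_double = 1 - (1 - p_one_drive) ** surviving_drives
print(f"{p_double:.2%} chance of a double failure per rebuild")  # ~0.24%

# Small per event, but multiplied across many groups, many rebuilds, and
# week-long rebuild windows, double failures stop being hypothetical.
```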
The only solution storage vendors that count on RAID offer customers is dual-parity RAID. At the expense of capacity efficiency and write I/O performance, these RAID groups can survive two drive failures. But as capacities and RAID rebuild times continue to increase, the likelihood of multiple, near-simultaneous drive failures increases as well. Triple-parity RAID designs remain rare in the market today, and even where triple parity is available, the impact on write performance is devastating.
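The tradeoff can be made explicit with textbook read-modify-write arithmetic; the group size is an assumption, and the I/O counts are the standard small-write penalty, not any specific vendor's numbers:

```python
# Capacity efficiency and small-write penalty by parity level (textbook values).
group_size = 12   # drives per RAID group (assumed)
for parity in (1, 2, 3):             # single, dual, triple parity
    efficiency = (group_size - parity) / group_size
    # each small write reads old data + each parity, then writes them all back
    write_ios = 2 * (1 + parity)
    print(f"parity={parity}: {efficiency:.0%} usable, {write_ios} I/Os per small write")
```

Each added parity level costs a drive's worth of capacity per group and two more I/Os per small write, which is why simply piling on parity is not a sustainable answer to density.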
Answer is vRAID
The answer to the challenges of increasing drive density is vRAID. When we founded StorONE, we had 2 choices for protecting customers from media failure: RAID or erasure coding. We rejected RAID for the reasons described above. Erasure coding was conceptually ideal, but using it off the shelf would have impacted our software's high-performance I/O capabilities.
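vRAID's internals are not public, so as a generic illustration of the reconstruction idea that erasure coding builds on, here is the simplest possible scheme: k data shards plus one XOR parity shard, tolerating the loss of any one shard. Real erasure codes (e.g., Reed-Solomon) generalize this to multiple parity shards:

```python
# Simplest erasure-coding idea: k data shards + 1 XOR parity shard.
# Generic illustration only; vRAID's actual algorithm is not public.
def encode(shards: list[bytes]) -> bytes:
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(survivors: list[bytes]) -> bytes:
    # XOR of all surviving shards (including parity) recovers the lost one.
    return encode(survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(data)
lost = data[1]
recovered = reconstruct([data[0], data[2], parity])
assert recovered == lost
```

Note a key property this illustrates: reconstruction only needs the data that actually exists on the surviving shards, which is part of why erasure-coding-based designs can avoid the sector-by-sector rebuilds described above.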
Our answer to the high-density challenge? Rewrite erasure coding from scratch, removing its inefficiencies and optimizing it for high-performance storage. Every performance benchmark you'll see from us is run on volumes with vRAID active. With vRAID, you can set redundancy on a per-volume basis, mix drive sizes within a RAID group, and get the fastest rebuilds in the industry, regardless of media type.