How to Store Petabytes of Machine-Generated Data?

This article was authored by Rainer W. Kaese, senior manager business development, storage products division, Toshiba Electronics Europe GmbH.

The amount of data worldwide grows by several billion terabytes every year because more and more machines and devices are generating data. But where will we put it all? Even in this age of IoT, HDDs remain indispensable.

Data volumes have multiplied in recent decades, but the real data explosion is yet to come. In the past, data such as photos, videos and documents was created mainly by people. With the advent of the IoT age, machines, devices and sensors are becoming the biggest data producers. There are already far more of them than there are people, and they generate data much faster than we do. A single autonomous car, for example, creates several terabytes per day.

Then there is the particle accelerator at CERN, which generates 1PB/s, although “only” around 10PB/month is retained for later analysis.

In addition to autonomous driving and research, video surveillance and industry are the key contributors to this data flood. The market research company IDC assumes that the global data volume will grow from 45ZB in 2019 to 175ZB in 2025 (1). This means that, within 6 years, an additional 130ZB will be generated, roughly 3x as much data as existed in total in 2019.
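The arithmetic behind these figures can be checked with a short sketch. It follows the article's reading of the IDC numbers as totals for 2019 and 2025; the function name and the compound annual growth rate are illustrative additions, not part of the IDC whitepaper.

```python
# Figures quoted in the article (IDC estimates), in zettabytes.
VOLUME_2019_ZB = 45
VOLUME_2025_ZB = 175
YEARS = 2025 - 2019


def growth_summary(start_zb, end_zb, years):
    """Return (added volume, added volume relative to start, implied annual growth)."""
    added = end_zb - start_zb
    ratio_to_start = added / start_zb
    # Compound annual growth rate: an assumption of smooth year-on-year growth.
    annual_growth = (end_zb / start_zb) ** (1 / years) - 1
    return added, ratio_to_start, annual_growth


added_zb, ratio, annual_growth = growth_summary(VOLUME_2019_ZB, VOLUME_2025_ZB, YEARS)
print(f"Added volume: {added_zb} ZB")
print(f"Relative to 2019: {ratio:.2f}x")
print(f"Implied annual growth: {annual_growth:.1%}")
```

The added volume comes to 130ZB, a little under three times the 2019 figure, which is where the "3x" in the text comes from.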

Much of this data will be evaluated at the point of creation, for example in the sensors feeding an autonomous vehicle or production facility, an approach known as edge computing. Here, fast results and reactions in real time are essential, so the time required for data transmission and central analysis is unacceptable. However, on-site storage space and computing power are limited, so sooner or later most data ends up in a data centre. There it can be post-processed, merged with data from other sources, analysed further and archived.

This poses enormous challenges for the storage infrastructures of companies and research institutions. They must be able to absorb a constant influx of large amounts of data and store it reliably. This is only possible with scale-out architectures that provide storage capacities of several dozen petabytes and can be continuously expanded. And they need reliable suppliers of storage hardware who can satisfy this continuous and growing storage demand. After all, we cannot afford for the data to end up flowing into a void. The public cloud is often touted as a suitable solution. Still, the reality is that the bandwidth for the data volumes being discussed is insufficient and the costs are not economically viable.
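To give a sense of the bandwidth involved, the following sketch estimates the sustained network rate needed to move a given volume off-site within a given period. It uses the 10PB/month that CERN retains as an example input. Decimal units, a 30-day month and the absence of any protocol overhead are simplifying assumptions, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def sustained_gbit_per_s(petabytes, days):
    """Average link rate in Gbit/s needed to move `petabytes` within `days`.

    Assumes decimal petabytes (10**15 bytes) and ignores protocol overhead,
    retransmissions and any idle time on the link.
    """
    bits = petabytes * 1e15 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


# Example input taken from the article: about 10PB retained per month.
rate = sustained_gbit_per_s(petabytes=10, days=30)
print(f"Sustained rate: {rate:.1f} Gbit/s")
```

Under these assumptions the link would have to run at roughly 31 Gbit/s around the clock. The sketch only illustrates the scale; whether that is affordable depends on the connectivity and pricing available to a given organisation.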

For organisations that store IoT data, storage becomes, in a sense, a commodity. It is not consumed in the true sense of the word but, like other consumer goods, it is purchased regularly and requires continuing investment. Research institutions such as CERN, which already process and store vast amounts of data, provide a blueprint for how storage infrastructures and storage procurement models can look in the IoT age. The European research centre for particle physics is continuously adding new storage expansion units to its data centre, each of which contains several hundred HDDs of the most recent generation. Together, its 100,000 HDDs provide a storage capacity of 350PB.

Price decides the storage medium
The CERN example demonstrates that there is no way around HDDs when it comes to storing such enormous amounts of data. They remain the cheapest medium that meets the dual requirements of storage space and easy access. By comparison, tape is very inexpensive but, as an offline medium, is only appropriate for archiving data.

Flash memory, on the other hand, is currently still 8 to 10x more expensive per unit capacity than HDDs. Although the prices for SSDs are falling, they are doing so at a similar rate to those of HDDs. Moreover, HDDs are very well suited to meet the performance requirements of high-capacity storage environments. A single HDD may be inferior to a single SSD, but a combination of several fast-spinning HDDs achieves very high IOPS values that can reliably supply analytics applications with the data they require.

In the end, price alone is the decisive criterion, especially since the data volumes to be stored in the IoT world can only be compressed minimally to save valuable storage space. Where compression is possible at all, it typically takes place within the endpoint or at the edge to reduce the amount of data to be transmitted. The data thus arrives at the data centre already compressed and must be stored without further compression. Furthermore, deduplication offers little potential savings because, unlike on typical corporate file shares or backups, there is hardly any identical data.

Because of the flood of data in IoT and the resultant large quantity of drives required, the reliability of the HDDs used is of great importance. This has less to do with possible data losses, which can be handled using appropriate backup mechanisms, and more to do with maintenance of the hardware. With an Annualised Failure Rate (AFR) of 0.7%, instead of the 0.35% achieved by CERN with Toshiba HDDs, a storage solution using 100,000 HDDs would require 700 rather than 350 drive replacements per year. That is 350 additional replacements annually, or on average almost one more per day.
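The failure-rate comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch that treats the AFR as a simple expected fraction of the fleet failing per year; the function name is illustrative.

```python
def expected_annual_failures(drive_count, afr_percent):
    """Expected number of drive failures per year for a fleet with the given AFR."""
    return drive_count * afr_percent / 100


FLEET_SIZE = 100_000  # number of HDDs, as in the article

failures_low = expected_annual_failures(FLEET_SIZE, 0.35)   # AFR reported for CERN
failures_high = expected_annual_failures(FLEET_SIZE, 0.7)   # comparison AFR
extra_per_year = failures_high - failures_low
extra_per_day = extra_per_year / 365

print(f"Failures per year at 0.35%: {failures_low:.0f}")
print(f"Failures per year at 0.7%:  {failures_high:.0f}")
print(f"Additional replacements: {extra_per_year:.0f} per year, {extra_per_day:.2f} per day")
```

Doubling the AFR from 0.35% to 0.7% doubles the expected replacements from 350 to 700 per year, and the difference of 350 works out to just under one additional drive swap per day.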

HDDs irreplaceable for years to come
In the coming years, little will change: HDDs will continue to bear the main burden of IoT storage. Flash production capacities will simply remain too low for SSDs to outstrip HDDs. To cover the current storage demand with SSDs alone, flash production would have to increase substantially. Bearing in mind that the construction costs for a single flash fabrication facility run to several billion euros, this is an undertaking that is challenging to finance. Moreover, it would result in higher flash output only after around 2 years, and that output would cover only the demand of 2020, not that of 2022.

The production of HDDs, on the other hand, can be increased much more easily because it requires less clean room production than semiconductor manufacturing. Additionally, the development of HDDs is progressing continuously, and new technologies such as HAMR and MAMR are continuing to deliver capacity increases. Experts assume that HDDs’ storage capacity will continue to increase at a rate of around 2TB per year for a few more years at constant cost. Thus, IDC predicts that by the end of 2025, more than 80% of the capacity required in the enterprise sector for core and edge data centres will still be provided by HDDs and less than 20% by SSDs and other flash media (1).

(1) IDC Data Age 2025 whitepaper, update from May 2020