Komprise: What You Need to Know Before Jumping Into Cloud Tiering Pool?

Blog written by Kumar Goswami, co-founder and CEO, Komprise, Inc., published on April 8, 2021

Cloud tiering is now a critical capability in today’s increasingly hybrid, multi-cloud world of enterprise storage.

Cloud tiering and archiving can offer cost savings, a path to the cloud, and a zero-disruption solution that leverages existing investments. But not all cloud tiering and cloud archiving solutions are the same: by picking the wrong strategy, you may end up paying 75%+ more in cloud egress and 300% more in ongoing storage licensing costs.

The cloud tiering approach you pick affects not only the short-, medium-, and long-term cost savings of migrating unstructured data to the cloud, but also the benefits your organization can achieve from its cloud data migration strategy.

In this series of posts, we will review:

But first, I want to talk about what you need to know before jumping into the cloud tiering pools. This first post covers the differences between tiering to the cloud through your storage vendor vs. a data management file tiering and archiving solution like Komprise.

Storage array vendors, while providing insight into the array’s operations and performance, have historically done little to provide insight into the data stored on their arrays. This need for analytics-driven data management across storage silos is where Komprise comes in. However, to meet customer demand, storage array vendors have repackaged the tiering solutions used inside the array to tier data externally to the cloud. These so-called ‘Pool’ tiering solutions, for example NetApp FabricPool and EMC Isilon CloudPools, may reduce the cost of today’s fast, expensive flash-based storage by blending in lower-cost cloud storage. Even so, it’s important to understand when they are a good choice, and when they are not, as part of your overall intelligent data management strategy.

Cloud tiering: blocks vs. files
File storage-based cloud tiering provides limited data analytics and limited policies by which you can select data to tier. However, it provides continuous, transparent tiering, which enables IT to systematically roll out cold data tiering to the cloud without disrupting users. File storage arrays use an efficient block-based storage system to store files. Each file is represented by a set of equally sized blocks. As a file grows, more blocks are allocated to store the file’s content. To reduce the cost of the storage arrays, vendors provide multiple tiers of storage. The highest tier, which uses flash storage, is the fastest and the most expensive. Then come tiers that store data on SAS drives and finally SATA drives. Some storage array vendors use pure flash tiers with varying performance and costs.

These storage arrays use a tiering system whereby the file metadata and the frequently accessed blocks (from any file) are stored in the highest tier and less accessed blocks are downgraded to lower, less expensive (and higher capacity) tiers. This automated storage tiering approach allows the vendor to reduce costs by using smaller, faster tiers while still providing good performance.
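To make the block-level placement described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the tier names, the age cutoffs and the `place_block` helper are all hypothetical, and real arrays use richer heat tracking than a single "days since last access" value.

```python
from dataclasses import dataclass

# Hypothetical tiers, fastest first, with an assumed maximum block age in days.
# None means "no upper limit" (the lowest, cheapest tier).
TIERS = [("flash", 7), ("sas", 30), ("sata", None)]


@dataclass
class Block:
    file_id: str            # file the block belongs to
    index: int              # position of the block within that file
    days_since_access: int  # simplistic stand-in for block "heat"


def place_block(block: Block) -> str:
    """Return the fastest tier whose age cutoff the block still meets."""
    for name, max_age in TIERS:
        if max_age is None or block.days_since_access <= max_age:
            return name
    return TIERS[-1][0]


blocks = [
    Block("report.docx", 0, 1),   # recently read block
    Block("report.docx", 1, 45),  # cold block of the same file
    Block("video.mp4", 0, 12),
]

# Placement is decided per block, so one file can span several tiers.
placement = {(b.file_id, b.index): place_block(b) for b in blocks}
```

Note that the two blocks of `report.docx` land in different tiers. That is the point of block-level tiering inside the array, and it is also why the unit that later moves to the cloud is a block rather than a file.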

With the demand to leverage the cloud, these array vendors are now using their tiering system, designed to work efficiently with internal storage tiers, to tier data to the cloud. The tiering system tiers cold blocks rather than files to the cloud. Multiple blocks are stored within a single object stored in the cloud. Metadata resides on the storage vendor filesystem and all data access needs to occur through the storage filesystem. While storage tiering solutions are good for tiering snapshots to the cloud, they result in unnecessary costs and lock-in when tiering and archiving files.
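The packing of cold blocks into cloud objects can be sketched as follows. This is a simplified model, assuming a fixed packing factor and a naming scheme (`BLOCKS_PER_OBJECT`, `obj-000000`) that are invented for illustration; actual object layouts are proprietary and differ by vendor.

```python
from typing import Dict, List, Tuple

BLOCKS_PER_OBJECT = 4  # hypothetical packing factor

BlockRef = Tuple[str, int]  # (file name, block index within that file)


def pack_cold_blocks(cold_blocks: List[BlockRef]):
    """Group cold blocks into cloud objects and build the array-side block map."""
    objects: Dict[str, List[BlockRef]] = {}
    block_map: Dict[BlockRef, Tuple[str, int]] = {}
    for i, ref in enumerate(cold_blocks):
        object_key = f"obj-{i // BLOCKS_PER_OBJECT:06d}"
        slot = i % BLOCKS_PER_OBJECT
        objects.setdefault(object_key, []).append(ref)
        # The map stays on the array's filesystem, not in the cloud.
        block_map[ref] = (object_key, slot)
    return objects, block_map


cold = [("a.log", 3), ("b.csv", 0), ("a.log", 7), ("c.mp4", 12), ("b.csv", 1)]
objects, block_map = pack_cold_blocks(cold)
```

In this sketch a single object holds blocks from three different files, and only `block_map` says which slot belongs to which file. Without that array-side metadata, the object in the cloud is an opaque bag of blocks.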

Array block-level tiering is a mismatch for cloud

This approach of tiering blocks rather than entire files has the following ramifications:

Limited policies result in more data access from the cloud. Policies to specify the blocks to be tiered are limited. For instance, one popular storage array vendor can only keep blocks hot if they are less than 183 days old; older blocks must be tiered to the cloud. This results in a much higher access rate to the cloud, and therefore higher cloud API and egress costs. Many customers prefer to tier data older than one or two years to the cloud but cannot do so with the limited policies provided by storage arrays. Policies to exclude certain files, types of files or even directories are generally not possible with block-based storage array tiering.
Defragmentation of blocks leads to higher cloud costs. Accessing blocks in the cloud fragments the object in which these blocks are stored. Once some percentage, say 20%, of the blocks in an object have been read, the entire object is brought back into the array, coalesced with other fragmented objects, and then written back to the cloud. While this reduces the storage used in the cloud, the continuous defragmentation process results in continuous egress and API costs.
Sequential reads lead to higher cloud costs and lower performance. Sequential reads caused by applications such as virus scanners or third-party backup can increase the cost of cloud storage. The storage array can detect sequential reads from these applications and avoid re-hydrating the blocks. However, all reads are still served from the cloud, resulting in higher API and egress costs as well as lower performance across high-latency channels.
Data tiered to the cloud cannot be accessed from the cloud without licensing a storage filesystem. Since blocks from many files, as opposed to entire files, are tiered to the cloud, the data can only be accessed from the storage array. What is stored in the cloud has no meaning to any application other than the storage array. This data lock-in eliminates the ability to access and process the cold data independently from the storage array.
Tiering blocks impacts performance of the storage array. Block tiering to the cloud can reduce the performance of the storage array. The mechanism to maintain block tiering to the cloud causes continuous, on-going traffic between the array and the cloud across a high latency channel that ultimately impacts the performance of the overall array. For these reasons, the storage array vendors strongly recommend limiting the tiering to only 200 or 300 terabytes to the cloud. Given the vast quantities of data most enterprises are dealing with today, block tiering is not suited for general data tiering to a public cloud across high latency channels. Block tiering is better suited for private clouds, which unfortunately lack the benefits that are attracting more and more enterprises to adopt public clouds.
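The defragmentation behavior described in the list above can be illustrated with a small simulation. Treat it as a sketch under assumptions: the 20% figure is the article's own "say 20%" example, while the `CloudObject` class, the `simulate` helper and the way egress is counted (one unit per block read, plus the full object on recall) are simplifications invented here.

```python
READ_THRESHOLD = 0.20  # the article's example figure; real thresholds may differ


class CloudObject:
    def __init__(self, key, block_count):
        self.key = key
        self.block_count = block_count
        self.read_blocks = set()

    def read(self, block_index):
        """Record a block read; return True once the read share hits the threshold."""
        self.read_blocks.add(block_index)
        return len(self.read_blocks) / self.block_count >= READ_THRESHOLD


def simulate(obj, reads):
    """Return egress in blocks: individual reads plus any whole-object recall."""
    egress = 0
    for idx in reads:
        egress += 1  # each block read is served from the cloud
        if obj.read(idx):
            # Assumed model: the entire object is pulled back to the array
            # so it can be coalesced and rewritten.
            egress += obj.block_count
            break
    return egress


obj = CloudObject("obj-000000", block_count=10)
egress_blocks = simulate(obj, reads=[0, 5])
```

In this model, reading just two of ten blocks triggers a recall of the whole object, so the egress is several times larger than the data the application actually asked for. The rewrite back to the cloud would add further API calls that the sketch does not count.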

Data access results in re-hydration. Block tiering re-hydrates any data accessed from the cloud. This requires that there be space to accommodate some percent of cold data. This in turn reduces the potential cost savings.
Block tiering does not reduce backup costs. Third party backup applications read and store the hot and cold blocks of each file on the storage array. As a result, the backup window and the backup storage footprint are not reduced. Tiering cold blocks does not provide sufficient storage savings.
Block tiering locks you into your storage vendor. Since the cold data is tiered to the cloud in a proprietary format, when it is time to decommission your storage array and replace it with a new one, you must stay with the same vendor. If you elect to change vendors, you will have to re-hydrate all of the data back to the original storage array, migrate that data to the new storage array, and then tier it again using some other tiering solution. You will have to do this iteratively many, many times, since the cold data will be several multiples of the capacity of the storage array. In short, you will be locked in to this vendor.
Proprietary lock-in and cloud file storage licensing costs. You cannot directly use native cloud services to access your data in the cloud; access has to go through the proprietary storage filesystem itself. This creates unnecessary licensing costs that customers must pay forever to access their data, and it creates undesirable lock-in because you cannot use native tools without relying on the filesystem for access.
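By contrast with the limited block policies described in the list above, a file-level policy can combine an age threshold with exclusions. The following Python sketch shows the idea only; it is not Komprise's implementation or API. The function name `select_files_to_tier`, its parameters and the example patterns are hypothetical, and it relies on last-access times being tracked by the filesystem.

```python
import fnmatch
import os
import time


def select_files_to_tier(root, min_age_days, exclude_patterns=(), now=None):
    """Return files under root last accessed more than min_age_days ago.

    exclude_patterns are shell-style patterns matched against each file's
    path relative to root (for example "*.log" or "scratch/*").
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            # Skip anything the policy explicitly excludes.
            if any(fnmatch.fnmatch(rel, pat) for pat in exclude_patterns):
                continue
            if os.stat(path).st_atime < cutoff:
                selected.append(path)
    return sorted(selected)


# Example call (hypothetical path and patterns): tier files untouched for two
# years, but never tier mail archives or anything under scratch/.
# select_files_to_tier("/mnt/share", 730, exclude_patterns=("*.pst", "scratch/*"))
```

Because the unit of selection is a whole file, the age threshold can be one or two years rather than a fixed block cooling period, and exclusions by file type or directory are a simple pattern match.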

Here is a summary of the differences between storage cloud tiering vs. the open and storage-agnostic file-level cloud tiering and archiving approach from Komprise:

	Storage Cloud Tiering (e.g. NetApp FabricPool, EMC CloudPools)	File-Level Cloud Tiering and Archiving – Komprise
Approach	Block-level	File-level
Leverages Existing Infrastructure	YES	YES
Users Access Moved Data without Disruption	YES	YES
Eliminates Lock-in	NO	YES
Works Across Object/Clouds	NO	YES
Works Across File storage vendors	NO	YES
Flexible Policies	NO	YES
Native Access in the Cloud	NO	YES
No 3rd-party filesystem cloud licensing costs	NO – Requires ongoing cloud license of filesystem to access data	YES
Ideal Use cases	Tier snapshots	Tier and archive files

But will I lose storage efficiencies such as de-dupe by not using the storage tiering solution?
You may wonder whether you lose storage efficiencies such as de-dupe by not using the storage vendor’s tiering solution to go to the cloud. The overhead of keeping blocks in the cloud, in the form of egress, re-hydration and defragmentation costs, significantly overshadows any potential de-dupe savings. Also, when data is moved to the cloud at the block level, you do not save on third-party backups and other applications, because block tiering is a proprietary solution (read this white paper for more background on block vs. file tiering). So once you consider the additional backup licensing costs, cloud egress costs and cloud retrieval costs, plus the fact that you are now locked in and must pay filesystem costs in the cloud forever to access your data, the small savings you may get from de-dupe are far outweighed by the overall costs and the loss of flexibility.

When should I use block tiering provided by my storage vendor?
Tiering provided by a storage vendor, such as NetApp FabricPool and Dell EMC CloudPools, is suited for tiering snapshots, certain log files and other data from flash storage: data that is proprietary and deleted in short order. Such temporal data is typically not backed up or virus scanned and is only accessed in the event of an error or disaster, yet it is generally large in size, resulting in notable storage efficiency when tiered. Tiering this specific type of data reduces storage costs without incurring most of the shortcomings above. Block tiering techniques such as NetApp FabricPool and Dell EMC CloudPools are not well suited for tiering general user data for the reasons mentioned above. Pools solutions create 75% higher cloud egress and retrieval costs on file data, and their ongoing cloud licensing costs add 300%+ ongoing cloud costs.

In my next post I’ll dive deeper into file-based vs. block-level tiering to the cloud and the benefits of file-based cloud tiering from Komprise.

Resources:
White paper: Cloud Tiering – Storage-Based vs. Storage Gateways vs. File-Based – Which is Better and Why? (registration required)
About Intelligent Data Management
Sign up for a Free Trial for Cloud Data Management