Hadoop Distributed File System Storage Subsystem Gets Boost
With discardable distributed memory
This is a Press Release edited by StorageNewsletter.com on June 11, 2014, at 2:55 pm. Here is an article published on the blog of Hortonworks:
Discardable Distributed Memory: Supporting Memory Storage in HDFS
HDFS’s storage subsystem gets a boost with discardable distributed memory
By Sanjay Radia
Traditionally, HDFS, Hadoop’s storage subsystem, has focused on one kind of storage medium, namely spindle-based disks. However, a Hadoop cluster can contain significant amounts of memory, and with the continued drop in memory prices, customers are willing to add memory targeted at caching in the storage layer to speed up processing.
Recently, HDFS generalized its architecture to include other kinds of storage media, including SSDs and memory [1]. We also added support for caching hot files in memory [2]. Initially, this caching mechanism lets an admin tag hot files for caching; in the future, this will be automated based on usage patterns and on advice from the query engines running on Hadoop.
This blog focuses on our vision and plans to take advantage of large amounts of memory to speed up processing for a variety of workloads and use cases beyond hot input files. In particular, we introduce a new concept called Discardable Distributed Memory (DDM) for intermediate data and its predictive caching. DDMs provide support for storing materialized queries, as described by Julian Hyde in [7]. DDMs add memory storage as a first-class abstraction to the core of Hadoop so that it is accessible to all frameworks and applications.
The work described here complements other work planned for Heterogeneous storage including policies for SSDs. We will also add mechanisms for the application layers to provide information to HDFS to allow richer policies whereby data can be moved and/or copied across storage tiers to match the needs of the workloads that access Hadoop data.
Summary of Heterogeneous Storage
Originally, HDFS focused on one kind of storage medium, namely spindle-based disks. Recently, we have generalized HDFS so that it distinguishes between different kinds of storage (we have called this heterogeneous storage [1]).
This has involved some changes to HDFS:
- DataNodes distinguish the kind of media on which they store HDFS blocks and report the media type for each block in the block reports sent to the NameNode.
- NameNode keeps track of not just the location of each block but also the media type for each location.
- Clients are informed of the media type along with the location. Hence a client or the scheduler can choose a location based on both closeness and media type, as illustrated in the sketch below.
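As a rough illustration of this bookkeeping, the sketch below models block replica locations that carry a media type and a reader that prefers the fastest medium among otherwise equivalent replicas. The class and method names are hypothetical and do not correspond to the actual HDFS classes.

```java
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative model only (not the actual HDFS classes) of the
 * heterogeneous-storage bookkeeping: each block replica carries the media
 * type reported by its DataNode, and a reader or scheduler can prefer a
 * faster medium among otherwise equivalent replicas.
 */
public class HeterogeneousStorageSketch {

    /** Media types a DataNode might report for a stored block replica. */
    enum MediaType { MEMORY, SSD, DISK }

    /** One replica location as the NameNode could track it: node plus medium. */
    static class ReplicaLocation {
        final String dataNode;
        final MediaType media;

        ReplicaLocation(String dataNode, MediaType media) {
            this.dataNode = dataNode;
            this.media = media;
        }
    }

    /** Prefer the replica stored on the fastest medium (MEMORY, then SSD, then DISK). */
    static ReplicaLocation choosePreferred(List<ReplicaLocation> replicas) {
        return replicas.stream()
                .min(Comparator.comparingInt(r -> r.media.ordinal()))
                .orElseThrow(() -> new IllegalArgumentException("no replicas"));
    }
}
```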
Use Cases
Below are the common use cases where memory’s lower latency could provide significantly improved performance for a variety of Hadoop workloads. We use the term “application” to roughly mean a Hadoop job or a user session for interactive analytic queries.
1. Input files. HDFS has recently added support for caching HDFS files.
2. Final output of an application:
   a) Output that needs to be reliably persisted (i.e. 3 replicas’ worth of reliability)
   b) Output that is streamed to the user and hence does not need to be persistent
3. Intermediate data within an application. Examples of these are:
   a) Hive intermediate files
   b) Intermediate outputs for iterative applications, e.g. RDDs [9]
4. Predictive caching of 2b and 3 for later use by other applications. Julian covers these in his recent blog [7]. Caching of final output (2a) is covered by use case 1.
Discardable Distributed Memory for Transient Data
HDFS already supports Use Case 1, and improvements are planned to allow more policy-based automatic caching, so that an admin does not need to choose and manage which files are cached. For Use Cases 2b, 3 and 4, we introduce the concept of Discardable Distributed Memory (DDM), discussed in more detail below.
Discardable Distributed Memory (DDM)
DDMs have the following salient properties:
- Data can be discarded to deal with resource shortage. It is assumed that the application can recompute the data; in the case of Use Case 3 (intermediate data), the application is still running and will recompute (Hadoop’s MapReduce and Tez already support such recomputations in cases of failures or data loss, similar to RDDs). In the case of Use Case 4 (predictive caching), the data was added to the DDM as an optimization to avoid recomputation, but the application can recompute it.
- The data is lazily written to and backed by an HDFS file. This has the following advantages:
- Provides the low latency of memory: the write is considered complete as soon as it hits memory; the write-back to an HDFS file is an asynchronous background activity.
- Narrows the window in which recomputation is needed: recovery is required only if there is a failure or a DDM discard before the in-memory data has been written to disk. Once the data has hit one or more disks, it can be refetched from storage in the face of DDM discards from memory or node failures.
- Improves management of the scarce memory resource: the HDFS file system acts as a large backing store, allowing greater flexibility in discarding in-memory data under competing demands on memory.
Our choice of implementation for lazy asynchronous writing is to use a memory-mapped local replica file and then lazily write to two other replicas (see below for further details on the implementation).
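A minimal sketch of this lazy write-back idea follows, assuming illustrative class names rather than the actual DataNode code from HDFS-5851: the write completes once the bytes land in a memory-mapped local replica file, and the flush to durable storage runs in the background (replication to the other replicas is elided).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Hypothetical sketch of lazy asynchronous write-back: the caller's write is
 * complete as soon as the data is in the memory-mapped local replica; forcing
 * the bytes to the backing file (and, in HDFS, replicating them to other
 * DataNodes) is background work.
 */
public class LazyWriteBackSketch {

    private final MappedByteBuffer localReplica;
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    public LazyWriteBackSketch(String localReplicaPath, int sizeBytes) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(localReplicaPath, "rw");
             FileChannel channel = file.getChannel()) {
            // The mapping stays valid after the channel is closed.
            this.localReplica = channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
        }
    }

    /** Completes once the bytes are in the mapped memory region. */
    public void write(byte[] data) {
        localReplica.put(data);
        // Durable flush happens asynchronously; the writer does not wait for it.
        flusher.submit(() -> { localReplica.force(); });
    }
}
```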
The DDM data can be accessed from any computer in the cluster – either via the memory on the original computer where it was created or from another replica.
Our choice of the term discardable was deliberate: it indicates that the memory is managed by the system and can be reallocated as needed, though with hints and directives from the application using it.
Managing the Memory-Storage Resource in a Multi-tenant Environment with YARN
The memory-storage used by DDM is a scarce resource and it needs to be managed:
- The memory-storage resource should be managed within the YARN resource model so that an application can be scheduled for its CPU, its normal memory, and its DDM memory-storage.
- The CapacityScheduler can be used to guarantee memory-storage resources across tenants using its queues.
- The tougher problem is memory allocation amongst users within the same tenant queue. Compare this memory-storage with the other resources that YARN currently manages:
- The CPU resource is fungible across tasks, while memory-storage is not.
- Regular memory is fully allocated to a container and cannot be taken away short of killing the container; memory-storage is not like regular memory in this respect.
The good news is that, while this resource is scarce, it is backed by file storage, which gives the system a greater degree of freedom in managing it; in particular, if a DDM is discarded from memory after it has been written to one or more file replicas, the data is not lost.
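To make that degree of freedom concrete, here is a hypothetical sketch of a discard policy (the names are illustrative, not HDFS code): segments already persisted to file replicas can be dropped from memory without any loss, while discarding an unflushed segment is what forces the application framework to recompute.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical DDM discard-policy sketch: when memory-storage runs short,
 * prefer to discard segments already persisted to one or more HDFS file
 * replicas (they can be refetched); only discarding unflushed segments forces
 * the application framework to recompute the data.
 */
public class DdmDiscardPolicySketch {

    static class Segment {
        final String id;
        final long bytes;
        final boolean persistedToFile;  // at least one HDFS file replica written

        Segment(String id, long bytes, boolean persistedToFile) {
            this.id = id;
            this.bytes = bytes;
            this.persistedToFile = persistedToFile;
        }
    }

    /** Choose segments to evict until at least bytesNeeded are freed. */
    static List<Segment> chooseVictims(List<Segment> inMemory, long bytesNeeded) {
        List<Segment> victims = new ArrayList<>();
        long freed = 0;
        // First pass: segments safely backed by file replicas (no data loss).
        for (Segment s : inMemory) {
            if (freed >= bytesNeeded) {
                return victims;
            }
            if (s.persistedToFile) {
                victims.add(s);
                freed += s.bytes;
            }
        }
        // Second pass: unflushed segments; evicting these means recomputation.
        for (Segment s : inMemory) {
            if (freed >= bytesNeeded) {
                break;
            }
            if (!s.persistedToFile) {
                victims.add(s);
                freed += s.bytes;
            }
        }
        return victims;
    }
}
```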
Comparison of DDMs
The use of memory to speed up processing is not new. More recently it has received a lot of attention in the Hadoop space as a way to speed up processing. A number of benchmarks have shown that if you can fit the data in memory and avoid reading it from disk, performance improves (hardly surprising given the orders-of-magnitude difference between disk and memory latencies). Other memory-based storage systems have benchmarks showing that HDFS with a large OS-level buffer cache is also slower than their memory-based file systems. Again, this should not be surprising, since the OS buffer cache gets pressure not just from HDFS usage but also from MapReduce temp storage and other applications running on Hadoop. Further, HDFS’s use of fadvise [8] might work against caching in certain circumstances, and HDFS has not been tuned to take advantage of a large buffer cache.
We did consider using the OS buffer cache for our solution but dismissed it for several reasons. First, the OS buffer cache may not give the fine-grained control we need. Second, as mentioned above, it gets pressure from other applications and MapReduce’s tmp storage. Third, a minor point: HDFS does rely on the buffer cache for read-ahead, and we would have to re-engineer that separately or mix it in with the caching. Fourth, the notion of caching can solve only one of our use cases, and DDM is fundamentally different from caching.
Is our solution just a buffer cache built inside HDFS? A storage system buffer cache thinks of memory as a transient location for data that fundamentally resides on disks. We think of DDM data as residing primarily in memory, so the term cache is not appropriate. Our solution is closer to paging or swapping in virtual memory, and also to the “anti-caching” work [5], where one views data as sitting primarily in memory and considers disks as extended backing for the memory. In the case of DDM, the data is viewed as sitting in memory and backed up by disks, with the key distinction that the DDM data can be discarded (from memory or even from disk) to meet competing resource demands, because it can be reconstructed by the application’s framework; virtual memory and anti-caching have no notion of discardability.
We envision, and plan to work with, the Apache Spark community so that the Spark framework can use DDMs to store RDDs in memory, similar to the approach they have tried with Tachyon [4]. This would allow many Spark applications to share RDDs, since the RDDs would reside outside the address space of any single application.
Conclusion
Work on the DDM implementation has started; please see Apache JIRA HDFS-5851 [6] for more details on the underlying mechanism. Our goal in HDFS-5851 is to expose the underlying mechanisms needed to support DDMs. DDMs will add much-needed support for memory storage as a first-class abstraction to the core of Hadoop so that it is accessible to all frameworks and applications built on Hadoop.
References
[1] Heterogeneous Storage Phase 1 design doc (HDFS-2832)
[2] HDFS Centralized Cache Management
[3] Matei Zaharia, et al., Spark: Cluster Computing with Working Sets
[4] H. Li, et al., Tachyon: Memory Throughput I/O for Cluster Computing Frameworks
[5] Justin DeBrabant, et al., Anti-Caching: A New Approach to Database Management System Architecture
[6] Support memory as a storage media, Apache HDFS JIRA, HDFS-5851
[7] Julian Hyde, Discardable, In-Memory, Materialized Query (DIMMQ)
[8] fadvise
[9] Matei Zaharia, et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing