R&D: TeraHeap – Exploiting Flash Storage for Mitigating DRAM Pressure in Managed Big Data Frameworks
Authors propose TeraHeap, a system that eliminates serialization/deserialization (S/D) overhead and expensive GC scans for a large portion of objects in analytics frameworks.
This is a Press Release edited by StorageNewsletter.com on February 13, 2025 at 2:00 pm
ACM Transactions on Programming Languages and Systems has published an article written by Iacovos G. Kolokasis, Giannos Evdorou, Foundation for Research and Technology – Hellas (FORTH), Institute of Computer Science (ICS), Heraklion, Greece and Department of Computer Science, University of Crete, Heraklion, Greece, Shoaib Akram, Australian National University, Canberra, Australia, Christos Kozanitis, Foundation for Research and Technology – Hellas (FORTH), Institute of Computer Science (ICS), Heraklion, Greece, Anastasios Papagiannis, Isovalent, Inc., Cupertino, CA, USA, Foivos S. Zakkak, Red Hat Ltd., Manchester, UK, Polyvios Pratikakis, and Angelos Bilas, Foundation for Research and Technology – Hellas (FORTH), Institute of Computer Science (ICS), Heraklion, Greece and Department of Computer Science, University of Crete, Heraklion, Greece.
Abstract: “Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive datasets that do not always fit on the managed heap. Therefore, frameworks temporarily move long-lived objects outside the heap (off-heap) on a fast storage device. However, this practice results in (1) high serialization/deserialization (S/D) cost and (2) high memory pressure when off-heap objects are moved back for processing.”
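To make the S/D cost mentioned above concrete, here is a minimal Java sketch of the round trip an off-heap cache pays each time a cached partition leaves and re-enters the managed heap. It uses only standard `java.io` serialization. The class name `SerdeRoundTrip`, the helper `makePartition`, and the partition size are illustrative assumptions, not code from Spark, Giraph, or the article.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

// Hypothetical illustration: the cost of moving a cached partition off-heap and back.
public class SerdeRoundTrip {
    // Build a stand-in for a cached dataset partition (assumed shape: a list of boxed longs).
    static ArrayList<Long> makePartition(int n) {
        ArrayList<Long> partition = new ArrayList<>(n);
        for (long i = 0; i < n; i++) {
            partition.add(i);
        }
        return partition;
    }

    // Serialize: walk the object graph and flatten it into bytes (what would be written to storage).
    static byte[] serialize(ArrayList<Long> partition) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(partition);
        }
        return bytes.toByteArray();
    }

    // Deserialize: rebuild a fresh copy of every object on the managed heap.
    @SuppressWarnings("unchecked")
    static ArrayList<Long> deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (ArrayList<Long>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<Long> partition = makePartition(100_000);

        long start = System.nanoTime();
        byte[] offHeap = serialize(partition);
        ArrayList<Long> restored = deserialize(offHeap);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        System.out.println("bytes written: " + offHeap.length);
        System.out.println("round trip (us): " + elapsedMicros);
        System.out.println("equal contents: " + partition.equals(restored));
        System.out.println("same object: " + (partition == restored));
    }
}
```

The restored list is a new copy, so every access to off-heap data pays both the CPU cost of the conversion and the heap space for the rebuilt objects. These correspond to the two problems the abstract lists.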
“In this article, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of objects in analytics frameworks. TeraHeap relies on three concepts: (1) It eliminates S/D by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It offers a simple hint-based interface, allowing analytics frameworks to leverage object knowledge to populate H2. (3) It reduces GC cost by fencing the collector from scanning H2 objects while maintaining the illusion of a single managed heap, ensuring memory safety.”
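The three concepts can be illustrated with a toy model in plain Java. This is a sketch under stated assumptions, not TeraHeap's implementation or its actual interface: the class `TwoHeapModel` and the methods `hint`, `moveHintedToH2`, `addRef`, and `gc` are hypothetical names, and both "heaps" are ordinary in-memory sets rather than DRAM and a storage device. It shows a framework hinting a long-lived root, the runtime moving that root's transitive closure to H2, and a mark phase that stops at H2 objects while still keeping alive any H1 object that H2 points back to.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of a two-heap runtime; all names are illustrative assumptions.
public class TwoHeapModel {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        boolean hinted; // the framework marked this object as long-lived
        boolean inH2;   // the object now lives in the second heap
        Obj(String name) { this.name = name; }
    }

    final List<Obj> roots = new ArrayList<>();
    final Set<Obj> h1 = new HashSet<>();       // regular heap, scanned by the collector
    final Set<Obj> h2 = new HashSet<>();       // second heap, never scanned
    final Set<Obj> backRefs = new HashSet<>(); // H1 objects referenced from H2

    Obj allocate(String name) {
        Obj o = new Obj(name);
        h1.add(o);
        return o;
    }

    // Hint: the framework uses its own knowledge to mark a long-lived root.
    void hint(Obj root) { root.hinted = true; }

    // All reference stores go through here so that H2-to-H1 pointers are remembered.
    void addRef(Obj from, Obj to) {
        from.refs.add(to);
        if (from.inH2 && !to.inH2) backRefs.add(to);
    }

    // Move every hinted object and everything reachable from it into H2, with no serialization.
    void moveHintedToH2() {
        Deque<Obj> stack = new ArrayDeque<>();
        for (Obj o : h1) if (o.hinted) stack.push(o);
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (o.inH2) continue;
            o.inH2 = true;
            h1.remove(o);
            h2.add(o);
            backRefs.remove(o);
            stack.addAll(o.refs);
        }
    }

    // Mark from the roots and remembered back-references, but never walk into H2 (the fence).
    // Unmarked H1 objects are reclaimed. Returns the number of objects the collector visited.
    int gc() {
        Set<Obj> marked = new HashSet<>();
        Deque<Obj> stack = new ArrayDeque<>(roots);
        stack.addAll(backRefs);
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (o.inH2 || !marked.add(o)) continue;
            stack.addAll(o.refs);
        }
        h1.retainAll(marked);
        return marked.size();
    }

    public static void main(String[] args) {
        TwoHeapModel heap = new TwoHeapModel();
        Obj cache = heap.allocate("cachedPartition");
        heap.roots.add(cache);
        for (int i = 0; i < 1000; i++) heap.addRef(cache, heap.allocate("row" + i));
        heap.roots.add(heap.allocate("shortLived"));

        System.out.println("visited before move: " + heap.gc());
        heap.hint(cache);
        heap.moveHintedToH2();
        System.out.println("visited after move: " + heap.gc());
        System.out.println("objects in H2: " + heap.h2.size());
    }
}
```

In this model the application keeps using the same `Obj` references before and after the move, which stands in for the single-heap illusion. The `backRefs` set stands in for whatever mechanism a real runtime would use to track references from H2 into H1, and the toy never prunes it, so it is conservative.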
“We implement TeraHeap in OpenJDK8 and OpenJDK17 and evaluate it with fifteen widely used applications in two real-world big data frameworks, Spark and Giraph. We find that for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph. Also, it can still provide better performance by consuming up to and less DRAM than native Spark and Giraph, respectively. TeraHeap can also be used for in-memory frameworks and applying it to the Neo4j Graph Data Science library improves its performance by up to 26%. Finally, it outperforms Panthera, a state-of-the-art garbage collector for hybrid DRAM-NVM memories, by up to 69%.”