MemVerge and Open Source Community in Partnership
To protect distributed HPC apps with DMTCP.
This is a Press Release edited by StorageNewsletter.com on November 23, 2021 at 2:01 pm
MemVerge, Inc. and the DMTCP Project announced a partnership to accelerate development and adoption of long-awaited Distributed MultiThreaded Checkpointing (DMTCP) technology.
Enterprise apps commonly use checkpointing to minimize downtime, but it is almost impossible for complex distributed HPC apps with massive data sets. Under development for over a decade, DMTCP has recently made the impossible possible for several workloads including VLSI circuit simulators, circuit verification, formalization of mathematics, bioinformatics, network simulators, high energy physics, cybersecurity, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and HPC. DMTCP stands ready for commercialization and wider deployment.
The collaboration will facilitate DMTCP’s move into the market. The partnership includes MemVerge developers joining the DMTCP Project and contributing to open-source development; MemVerge providing commercial support for the open-source DMTCP software; and MemVerge integrating the fully tested and supported version into application-specific big memory solutions.
MemVerge has also begun collaboration with the National Energy Research Scientific Computing Center (NERSC) to optimize MPI-Agnostic Network-Agnostic (MANA), a plug-in on top of DMTCP that has been used for transparent checkpointing of MPI on the Cori and Perlmutter HPCs.
“Robust, performant checkpointing offers us flexibility in scheduling jobs for system maintenances and real-time data processing for experimental facilities. This feature also allows us to better backfill jobs, which ultimately leads to increased system utilization and improved job throughput for our nearly 8,000 scientific users,” said Rebecca Hartman-Baker, user engagement group lead, NERSC, Lawrence Berkeley National Laboratory.
Gene Cooperman, professor at Northeastern University, has led the open-source DMTCP Project for almost 20 years. He is especially excited about the recent three-way collaboration to support MANA for MPI.
According to him: “The collaboration among NERSC/LBNL, MemVerge, and the DMTCP open-source community will bring reliable and efficient transparent checkpointing to MPI (and later to CUDA) for the production market. While DMTCP and MANA will always remain free and open source, the use of MemVerge technology for rapid writing of memory to stable storage will bring an important enhancement to this technology.”
“Distributed checkpointing is a perfect complement to ZeroIO In-Memory Snapshot technology that MemVerge has pioneered,” said Charles Fan, CEO, MemVerge. “We look forward to collaborating with the DMTCP community on future technology and market development.”
“Being able to seamlessly and graciously recover from system failures during complex simulation runs is critical to optimize efficiency for completing jobs with long run-times,” said Mark Nossokoff, senior research analyst, Hyperion Research. “Checkpointing is a well-understood technique for saving the states of independent node memory during a failure mode and restoring that state when the machine is back up and running. Bringing checkpointing capability to big memory architectures with pooled, distributed memory across multiple nodes operating on large datasets should further enable adoption of in-memory computing techniques within the HPC and AI communities. Kudos to MemVerge for stepping up to provide the industry stewardship to make DMTCP a commercial reality.”
About DMTCP and DMTCP/MANA Project
DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints a single-host or distributed computation in user-space – with no modifications to user code or to the OS. It works on most Linux applications, including Python, MATLAB, R, GUI desktops, and MPI. It is robust and widely used (on SourceForge since 2007). MANA is an implementation of transparent checkpointing for MPI. MANA is under continuing development, but has already demonstrated robust, transparent checkpointing for computations with 1,000 MPI processes.