IBM, Nvidia Team on Supercomputing Scalability for AI

Also bringing DGX SuperPOD solution with ESS 3200 by end of 3Q21

Author of this article, published on June 28, 2021, is Douglas O’Flaherty, global ecosystems leader, IBM Storage.

 

 


For years, IBM Storage has worked with NVIDIA Corp. to deliver the unique, highly scalable data solutions needed for demanding high-performance computing, AI, and analytics workloads.

This week we are advancing that work to new heights. IBM Storage is showcasing the latest Magnum IO GPUDirect Storage (GDS) performance with new benchmark results and announcing an updated reference architecture (RA) developed for 2-, 4-, and 8-node NVIDIA DGX POD configurations.

Both companies are also committed to bringing a DGX SuperPOD solution with IBM Elastic Storage System 3200 (ESS 3200) by the end of 3Q21.


The commitment to DGX SuperPOD support, the new GDS benchmarks and the updated DGX POD RA are all designed to help organizations simplify AI infrastructure adoption. Whether companies are deploying a single DGX system or deploying hundreds, these enhancements are designed to help them streamline their AI journey.

Let’s look at each of these new developments more closely …

Benchmarking Magnum IO GPUDirect Storage
To support faster data science with faster storage performance, NVIDIA GPUDirect Storage bypasses the CPU and reads data directly from storage into GPU memory. This is designed to reduce latency, improve throughput, and offload work from the CPU for greater data efficiency. Since the release of Spectrum Scale 5.1.1, IBM, our partners, and clients have been testing this technology across different implementations and data sizes, with promising results.
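
To make that data path concrete, here is a minimal, hypothetical sketch of a GDS-style read using NVIDIA's kvikio Python bindings over the cuFile API. The file path and buffer size are made up, and it assumes a GDS-capable file system mount (such as Spectrum Scale 5.1.1 or later) with the cupy and kvikio packages installed; without GDS support, kvikio falls back to a conventional host-memory path.

```python
# Minimal sketch of a GPUDirect Storage read via NVIDIA's kvikio bindings
# (a Python wrapper over the cuFile API). Path and size are illustrative only.
import cupy
import kvikio

PATH = "/gpfs/fs1/dataset.bin"          # hypothetical Spectrum Scale (GPFS) path
buf = cupy.empty(1 << 20, dtype="u1")   # 1MB destination buffer in GPU memory

f = kvikio.CuFile(PATH, "r")
nbytes = f.read(buf)   # data is DMA'd from storage straight into GPU memory,
                       # bypassing the CPU bounce buffer when GDS is available
f.close()
print(f"read {nbytes} bytes directly into GPU memory")
```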

Using the latest GA version of GDS and a standard GDS configuration on a DGX A100 system connected to ESS 3200 storage through the two storage-fabric (north-south) InfiniBand (IB) network adapters, we achieved 43GB/s across the eight GPUs. GDS delivered a 1.9x improvement over standard data transfers, achieving up to 86% of physical bandwidth.

To test the system at scale, IBM devised an experimental benchmark designed to stress shared storage across the DGX servers, the NVIDIA IB network, and the ESS storage. In this benchmark, a pair of all-flash ESS 3200s running Spectrum Scale and the NVIDIA GPUDirect Storage beta delivered 191.3GB/s to 16 GPUs, effectively saturating the 8x HDR (200GB/s) IB network. This was accomplished by connecting directly to the GPU compute fabric and using a GDS-enabled version of the common FIO read benchmark, which places data directly in GPU memory and avoids a common CPU bottleneck.
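
As a rough sanity check on those figures, the sketch below works through the link arithmetic, assuming roughly 25GB/s of bandwidth per HDR 200Gb/s link (the same figure used in the footnotes); the bandwidth numbers themselves are the published results, not new measurements.

```python
# Back-of-the-envelope check of the quoted efficiency figures,
# assuming ~25GB/s of usable bandwidth per HDR 200Gb/s link.
HDR_GBPS = 25

# Single DGX A100: 2 storage-fabric (north-south) HDR NICs
north_south_physical = 2 * HDR_GBPS      # 50 GB/s
print(43 / north_south_physical)         # ~0.86 -> "up to 86% of physical bandwidth"

# Dual-ESS 3200 benchmark: 8 HDR links on the GPU compute fabric
compute_fabric_physical = 8 * HDR_GBPS   # 200 GB/s
print(191.3 / compute_fabric_physical)   # ~0.96 -> effectively saturated
```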

Reference Architecture
IBM today also issued updated IBM Storage Reference Architectures with DGX A100 systems. In the practical configurations of these architectures, separate storage and compute networks allow for scalable performance. The ESS 3200 doubles the read performance of the previous generation to 80GB/s, so a single ESS 3200 can deliver more throughput to more systems. For two DGX systems, that is over 75GB/s when using GDS and almost 40GB/s without GDS.

IBM also performed first-of-its-kind GDS testing on GPU-enabled systems in the IBM labs. Using the latest GDS beta, we demonstrated lower latency and greater bandwidth at various data and I/O sizes, with efficiency gains from offloading the CPU. As organizations scale up their GPU-accelerated computing, data processing, imaging, and AI efforts, CUDA developers now have more control over their data with direct copies from storage using GPUDirect Storage.
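
The kind of comparison behind those observations can be sketched as a small micro-benchmark: one path reads into host memory and then copies to the GPU, the other reads straight into GPU memory via GDS. Everything below (the path, transfer size, and use of kvikio) is illustrative and is not IBM's actual test harness.

```python
# Illustrative micro-benchmark comparing a GDS read (via kvikio/cuFile)
# with a conventional read through host memory plus a host-to-device copy.
import time
import cupy
import numpy
import kvikio

PATH = "/gpfs/fs1/bench/testfile.bin"    # hypothetical Spectrum Scale path
SIZE = 256 * 1024 * 1024                 # 256MB per read

def gds_read():
    buf = cupy.empty(SIZE, dtype="u1")
    f = kvikio.CuFile(PATH, "r")
    t0 = time.perf_counter()
    f.read(buf)                          # storage -> GPU memory (DMA)
    cupy.cuda.runtime.deviceSynchronize()
    elapsed = time.perf_counter() - t0
    f.close()
    return elapsed

def bounce_read():
    host = numpy.empty(SIZE, dtype="u1")
    dev = cupy.empty(SIZE, dtype="u1")
    with open(PATH, "rb") as f:
        t0 = time.perf_counter()
        f.readinto(host)                 # storage -> host buffer (CPU involved)
        dev.set(host)                    # host -> GPU copy over PCIe
        cupy.cuda.runtime.deviceSynchronize()
        return time.perf_counter() - t0

for name, fn in (("GDS", gds_read), ("bounce buffer", bounce_read)):
    sec = fn()
    print(f"{name}: {SIZE / sec / 1e9:.1f} GB/s")
```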

The architecture provides a scalable unit for growing adoption of DGX systems and shared data services. The flexibility of Spectrum Scale SDS is engineered to enable the enterprise features, hybrid cloud, and multi-site support required by many organizations and enterprises.

DGX SuperPOD Support of ESS 3200
Additionally, the two companies announced their commitment to supporting the ESS 3200 for use with DGX SuperPOD by the end of 3Q21. DGX SuperPOD is NVIDIA’s large-scale architecture that starts at 20 DGX A100 systems and scales to 140 systems. Future integration of the scalable ESS 3200 into NVIDIA Base Command Manager, along with support for NVIDIA BlueField DPUs, will simplify networking and multi-tenancy.

“AI requires powerful performance, which makes it important to ensure that compute and storage are tightly integrated,” said Charlie Boyle, VP and GM of DGX Systems, NVIDIA. “The collaboration between NVIDIA and IBM is expanding customer choice for DGX SuperPODs and DGX systems featuring technologies designed for world-leading AI development.”

Whether your company is just starting its AI journey or building the largest configurations, the ability to deploy NVIDIA systems with ESS and Spectrum Scale helps deliver training, inference, and analytics faster.

NVIDIA is part of IBM’s partner ecosystem, an initiative to support partners of all types – whether they build on, service or resell IBM technologies and platforms – to help clients manage and modernize workloads.

  1. GPUDirect Storage read measurements performed by IBM Labs: 1M block size, 16GB file, 16 threads, averaged (s=0.0229); 2 ESS 3200s running Spectrum Scale 5.1.1 connected via IB HDR to 1 DGX A100 server with 2 storage-fabric HDR IB NICs.
  2. GPUDirect Storage read measurements performed by IBM Labs: 2 ESS 3200s running Spectrum Scale 5.1.1, each with 4 IB HDR links (8x25GB/s = 200GB/s) to 2 DGX A100 servers using all 8 GPU compute-fabric HDR ports.
  3. IBM Storage Reference Architecture with DGX A100 Systems
  4. https://newsroom.ibm.com/2021-04-27-IBM-Launches-Advanced-Storage-Solutions-Designed-to-Simplify-Data-Accessibility-Availability-Across-Hybrid-Clouds
  5. GPUDirect Storage read measurements performed by IBM Labs: one ESS 3200 running Spectrum Scale 5.1.1, with 4 IB HDR links to 2 DGX A100 servers using the 2 storage-fabric HDR IB network connections.
  6. IBM lab running a single A100 GPU Lenovo server and ESS 5000 across I/O sizes from 4KB to 8MB and from 4 to 32 threads. GDS improvement averages: 36.6% more bandwidth, 23.5% lower latency, and 53% less CPU utilization.