The MVAPICH2-DPU MPI library is a derivative of the MVAPICH2 MPI library and is optimized to harness the full potential of NVIDIA BlueField Data Processing Units (DPUs) with InfiniBand networking to accelerate HPC applications.

Features

The MVAPICH2-DPU 2021.06 release has the following features:

  • Based on MVAPICH2 2.3.6, conforming to the MPI 3.1 standard
  • Supports all features available with the MVAPICH2 2.3.6 release
  • Novel framework to offload non-blocking collectives to DPU
  • Offloads non-blocking Alltoall (MPI_Ialltoall) to DPU
  • 100% overlap of computation with MPI_Ialltoall non-blocking collective
  • Accelerates scientific applications using the MPI_Ialltoall non-blocking collective (a usage sketch follows this list)
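
The usage sketch below (written for this page, not taken from the MVAPICH2-DPU distribution) illustrates the application-level pattern these features target: a rank posts MPI_Ialltoall, performs useful computation while the collective progresses, and then waits for completion. Buffer sizes and the compute kernel are illustrative placeholders; because the offload happens inside the MPI library, the calling code is the same as for any MPI 3.1 implementation.

    /* Sketch: overlap host computation with an in-flight MPI_Ialltoall.
     * Placeholder sizes and compute kernel; error checking omitted. */
    #include <mpi.h>
    #include <stdlib.h>

    static void application_compute(double *work, int n)
    {
        /* Stand-in for the host-side computation that overlaps with the
         * non-blocking collective. */
        for (int i = 0; i < n; i++)
            work[i] = work[i] * 1.000001 + 1.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int count = 4096;   /* elements sent to each rank (placeholder) */
        double *sendbuf = calloc((size_t)count * nprocs, sizeof *sendbuf);
        double *recvbuf = calloc((size_t)count * nprocs, sizeof *recvbuf);
        double *work    = calloc((size_t)count, sizeof *work);

        MPI_Request req;
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

        application_compute(work, count);   /* runs while the collective progresses */

        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the Alltoall completes here */

        free(sendbuf); free(recvbuf); free(work);
        MPI_Finalize();
        return 0;
    }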

Performance

Figure 1: Capability of the MVAPICH2-DPU library to extract peak overlap between computation happening at the host and MPI_Ialltoall communication.

Figure 1 shows results of the MPI_Ialltoall non-blocking collective benchmark running with 512 MPI processes (32 nodes with 16 processes per node (PPN)) and 1,024 MPI processes (32 nodes with 32 PPN). As the message size increases, the MVAPICH2-DPU library delivers peak (100%) overlap between computation and the MPI_Ialltoall non-blocking collective. In contrast, the default MVAPICH2 library, which lacks this DPU offloading capability, provides very little overlap between computation and the MPI_Ialltoall non-blocking collective.
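
A common way to quantify this kind of overlap (shown here as a simplified sketch written for this page, not benchmark source code) is to time the pure collective, the computation alone, and the overlapped run, and then report how much of the communication cost was hidden by the computation:

    /* Simplified sketch of measuring computation/communication overlap for
     * MPI_Ialltoall. application_compute() is the hypothetical compute kernel
     * from the sketch above; warm-up runs and averaging over iterations, as a
     * real benchmark would do, are omitted for brevity. */
    #include <mpi.h>

    void application_compute(double *work, int n);   /* hypothetical kernel */

    double overlap_percent(double *sendbuf, double *recvbuf,
                           double *work, int count, int n)
    {
        MPI_Request req;
        double t0, pure_comm, compute, overall;

        /* 1. Pure communication: post the collective and wait immediately. */
        t0 = MPI_Wtime();
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        pure_comm = MPI_Wtime() - t0;

        /* 2. Computation alone. */
        t0 = MPI_Wtime();
        application_compute(work, n);
        compute = MPI_Wtime() - t0;

        /* 3. Overlapped run: computation while the collective is in flight. */
        t0 = MPI_Wtime();
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);
        application_compute(work, n);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        overall = MPI_Wtime() - t0;

        /* Communication time still exposed in the overlapped run, relative to
         * the pure communication time; 100% means it was fully hidden. */
        double exposed = (overall - compute) / pure_comm;
        double overlap = 100.0 * (1.0 - exposed);
        return overlap < 0.0 ? 0.0 : overlap;
    }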

Figure 2: Capability of the MVAPICH2-DPU library to reduce the overall execution time of an MPI application when computation steps are used in conjunction with the MPI_Ialltoall non-blocking collective operation in an overlapped manner.

When computation steps in an MPI application are used in conjunction with the MPI_Ialltoall non-blocking collective operation in an overlapped manner, the MVAPICH2-DPU MPI library has the unique capability to provide significant benefits in overall program execution time. This is possible because the Arm cores in the DPUs carry out the non-blocking Alltoall operations while the Xeon cores on the host perform computation with peak overlap, as illustrated in Figure 1. As shown in Figure 2, the MVAPICH2-DPU MPI library can deliver up to 23% benefit in overall execution time compared to the default MVAPICH2 MPI library across message sizes and PPNs in a 32-node experiment with the OMB MPI_Ialltoall benchmark.

Figure 3: Capability of the MVAPICH2-DPU library to reduce the overall execution time of the P3DFFT application.

The enhanced version of the P3DFFT MPI kernel was evaluated on the 32-node HPC-AI cluster with the MVAPICH2-DPU MPI library. As illustrated in Figure 3, the MVAPICH2-DPU MPI library reduces the overall execution time of the P3DFFT application kernel by up to 21% for various grid sizes and PPNs.
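
A distributed FFT such as P3DFFT spends much of its time in all-to-all transposes, which is why the MPI_Ialltoall offload helps. The sketch below (hypothetical helper names, written for this page and not taken from the P3DFFT source) shows one way such a transpose can be pipelined with MPI_Ialltoall, using double buffering so the exchange for the next data block overlaps the FFT computation on the current block:

    /* Schematic double-buffered transpose pipeline (not P3DFFT source).
     * pack_block() and fft_block() are hypothetical placeholders for packing
     * the send buffer and computing FFTs on a received block. */
    #include <mpi.h>

    void pack_block(double *sendbuf, int block, int count, int nprocs);
    void fft_block(const double *recvbuf, int block, int count, int nprocs);

    void transpose_and_fft(double *sendbuf[2], double *recvbuf[2],
                           int nblocks, int count, int nprocs)
    {
        MPI_Request req;
        int cur = 0;

        /* Start the exchange for the first block. */
        pack_block(sendbuf[cur], 0, count, nprocs);
        MPI_Ialltoall(sendbuf[cur], count, MPI_DOUBLE,
                      recvbuf[cur], count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

        for (int b = 0; b < nblocks; b++) {
            /* Block b has arrived in recvbuf[cur]. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            /* Post the exchange for block b+1 into the other buffer pair... */
            int nxt = 1 - cur;
            if (b + 1 < nblocks) {
                pack_block(sendbuf[nxt], b + 1, count, nprocs);
                MPI_Ialltoall(sendbuf[nxt], count, MPI_DOUBLE,
                              recvbuf[nxt], count, MPI_DOUBLE,
                              MPI_COMM_WORLD, &req);
            }

            /* ...and compute FFTs on block b while that exchange is in flight. */
            fft_block(recvbuf[cur], b, count, nprocs);
            cur = nxt;
        }
    }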

NVIDIA Developer Blog

Additional details are available in the NVIDIA Developer Blog post “Accelerating Scientific Applications in HPC Clusters with NVIDIA DPUs Using the MVAPICH2-DPU MPI Library”.

Live Demo with NVIDIA at the ISC ’21 Conference

Several team members of X-ScaleSolutions worked with NVIDIA to present live demos of the MVAPICH2-DPU package at the ISC ’21 conference. A link to the live demo is available here.

Contact

Interested in a free trial license? Please email us at contactus@x-scalesolutions.com or fill out the contact form at x-scalesolutions.com/contact/ for support.