The MVAPICH2-DPU MPI library is a derivative of the MVAPICH2 MPI library. It is optimized to harness the full potential of NVIDIA BlueField Data Processing Units (DPUs) with InfiniBand networking to accelerate HPC applications.

Features

The MVAPICH2-DPU 2021.08 release has the following features:

  • Based on MVAPICH2 2.3.6, conforming to the MPI 3.1 standard
  • Supports all features available with the MVAPICH2 2.3.6 release
  • Novel frameworks to offload nonblocking collectives to the DPU
  • Offloads the following nonblocking collectives to the DPU:
    • Alltoall (MPI_Ialltoall)
    • Allgather (MPI_Iallgather)
    • Bcast (MPI_Ibcast)
  • Up to 100% overlap of communication and computation with nonblocking collectives
  • Accelerates scientific applications that use the MPI_Ialltoall nonblocking collective

Performance

Figure 1: Capability of the MVAPICH2-DPU library to extract peak overlap between computation happening at the host and MPI_Ialltoall communication

Figure 1 illustrates performance results of the MPI_Ialltoall nonblocking collective benchmark running with 512 MPI processes (32 nodes with 16 processes per node (PPN)) and 1,024 MPI processes (32 nodes with 32 PPN). As message size increases, the MVAPICH2-DPU library demonstrates peak (100%) overlap between computation and the MPI_Ialltoall nonblocking collective. In contrast, the default MVAPICH2 library, which lacks this DPU offloading capability, provides very little overlap between computation and the MPI_Ialltoall nonblocking collective.
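
An overlap percentage of this kind can be derived from three timings: the overall time of the overlapped run, the time spent in computation, and the time of the communication alone. The sketch below assumes the definition commonly used by nonblocking collective benchmarks, overlap = max(0, 100 - (overall - compute) / pure_comm * 100). The function name and the sample timings are hypothetical and are not taken from Figure 1.

```python
def overlap_percent(overall_us: float, compute_us: float, pure_comm_us: float) -> float:
    """Return the share of communication time hidden behind computation, in percent.

    Assumed definition (not taken from the MVAPICH2-DPU documentation):
        overlap = max(0, 100 - (overall - compute) / pure_comm * 100)
    """
    if pure_comm_us <= 0:
        raise ValueError("pure_comm_us must be positive")
    # Communication time that was NOT hidden by computation.
    exposed_comm_us = overall_us - compute_us
    return max(0.0, 100.0 - exposed_comm_us / pure_comm_us * 100.0)


# Hypothetical timings in microseconds, chosen only to illustrate the two extremes.
full_overlap = overlap_percent(overall_us=1000.0, compute_us=1000.0, pure_comm_us=800.0)
no_overlap = overlap_percent(overall_us=1800.0, compute_us=1000.0, pure_comm_us=800.0)
print(f"fully hidden communication: {full_overlap:.0f}%")
print(f"fully exposed communication: {no_overlap:.0f}%")
```

With this definition, a run whose overall time equals its computation time scores 100%, and a run that pays the full communication cost on top of computation scores 0%.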

Figure 2: Capability of the MVAPICH2-DPU library to reduce overall execution time of an MPI application when computation steps are used in conjunction with the MPI_Ialltoall non-blocking collective operation in an overlapped manner.

When computation steps in an MPI application are overlapped with the MPI_Ialltoall non-blocking collective operation, the MVAPICH2-DPU MPI library can significantly reduce overall program execution time. This is possible because the Arm cores in the DPUs carry out the non-blocking alltoall operations while the Xeon cores on the host perform computation, yielding the peak overlap illustrated in Figure 1. As indicated in Figure 2, the MVAPICH2-DPU MPI library delivers up to a 23% performance benefit over the basic MVAPICH2 MPI library across message sizes and PPNs in a 32-node experiment with the OMB MPI_Ialltoall benchmark.
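
A rough way to see where such a gain comes from is an idealized timing model. The sketch below assumes that, per iteration, a host-progressed collective exposes all of its communication time (total = compute + communication), while a fully offloaded collective hides it behind computation (total = the longer of the two). Both assumptions, the function names, and all numbers are illustrative simplifications, not measurements from Figure 2.

```python
def host_progressed_time(compute_us: float, comm_us: float) -> float:
    """Idealized total when communication is not overlapped with computation."""
    return compute_us + comm_us


def offloaded_time(compute_us: float, comm_us: float) -> float:
    """Idealized total when communication proceeds fully in parallel with computation."""
    return max(compute_us, comm_us)


def reduction_percent(baseline_us: float, improved_us: float) -> float:
    """Relative reduction in execution time, in percent of the baseline."""
    return (baseline_us - improved_us) / baseline_us * 100.0


# Hypothetical per-iteration costs in microseconds.
compute_us, comm_us = 3000.0, 1000.0
baseline = host_progressed_time(compute_us, comm_us)  # 4000.0
offloaded = offloaded_time(compute_us, comm_us)       # 3000.0
print(f"reduction: {reduction_percent(baseline, offloaded):.1f}%")
```

In this model the reduction depends only on the ratio of communication time to computation time. Measured results also depend on how efficiently each library progresses the collective, so they will not match the model exactly.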

MVAPICH2-DPU can also reduce the total execution time of the osu_iallgather microbenchmark by offloading the MPI_Iallgather collective to the DPU, which overlaps communication with computation. Figure 3 demonstrates a 48% reduction in the total execution time of the benchmark for a 256-process job.

Figure 3: Capability of MVAPICH2-DPU to reduce Total Execution Time of osu_iallgather

Figure 4 illustrates the overlap of communication and computation in the osu_iallgather microbenchmark with MVAPICH2-DPU compared to default (no offloading) MVAPICH2. The subfigures show that the purely host-based algorithms are not able to provide any overlap. In contrast, using the offload mechanism of MVAPICH2-DPU to progress communication provides up to 98% overlap at 16 nodes with 1 PPN and up to 51% overlap at 16 nodes with 16 PPN.

Figure 4: Overlap of Computation with Communication of osu_iallgather

In addition to MPI_Ialltoall and MPI_Iallgather, MVAPICH2-DPU can offload the MPI_Ibcast nonblocking collective. MVAPICH2-DPU can reduce the total execution time of the osu_ibcast microbenchmark by up to 59% for a 256-process job at a 16 MB message size, as shown in Figure 5.

Figure 5: Overall Execution Time of osu_ibcast

Figure 6 illustrates the overlap of MPI_Ibcast that MVAPICH2-DPU can provide compared to host-based MVAPICH2. As shown in the figure, MVAPICH2-DPU can achieve up to 77% overlap, 23% more than host-based MVAPICH2.

Figure 6: Overlap of osu_ibcast

Scientific Application Evaluation with P3DFFT

An enhanced version of the P3DFFT MPI kernel was evaluated on the 32-node HPC-AI cluster with the MVAPICH2-DPU MPI library. As illustrated in Figure 7, the MVAPICH2-DPU MPI library reduces the overall execution time of the P3DFFT application kernel by up to 21% across various grid sizes and PPNs.

Figure 7: Capability of the MVAPICH2-DPU library to reduce overall execution time of the P3DFFT application.

NVIDIA Developer Blog

Additional details are available in the NVIDIA Developer Blog post “Accelerating Scientific Applications in HPC Clusters with NVIDIA DPUs Using the MVAPICH2-DPU MPI Library”.

Live Demo at the MUG ’21 Conference

Several X-ScaleSolutions team members will present a live demo of MVAPICH2-DPU at the MVAPICH User Group (MUG) ’21 conference. Register here.

Contact

Interested in a free trial license? Please email us at contactus@x-scalesolutions.com.