The MVAPICH2-DPU MPI library is a derivative of the MVAPICH2 MPI library that is optimized to harness the full potential of NVIDIA BlueField Data Processing Units (DPUs) with InfiniBand networking and to accelerate HPC applications.
Features
The MVAPICH2-DPU 2022.02 release has the following features:
- Based on MVAPICH2 2.3.6, conforming to the MPI 3.1 standard
- Supports all features available with the MVAPICH2 2.3.6 release
- Novel framework to offload nonblocking collectives to the DPU
- Offloads the following nonblocking collectives to the DPU:
- Alltoall (MPI_Ialltoall)
- Allgather (MPI_Iallgather)
- Bcast (MPI_Ibcast)
- Up to 100% overlap of communication and computation with nonblocking collectives (see the usage sketch after this list)
- Accelerates scientific applications that use the MPI_Ialltoall nonblocking collective
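The features above center on overlapping host computation with collectives that are progressed on the DPU. The minimal C sketch below shows the standard application-side pattern that benefits from this offload; the buffer sizes and the dummy compute loop are illustrative placeholders, and no MVAPICH2-DPU-specific API is assumed beyond standard MPI calls.

```c
/* Minimal sketch of the overlap pattern accelerated by DPU offload.
 * Buffer sizes and the dummy compute loop are illustrative placeholders. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1 << 16;   /* ints exchanged with each peer (example size) */
    int *sendbuf = calloc((size_t)count * nprocs, sizeof(int));
    int *recvbuf = calloc((size_t)count * nprocs, sizeof(int));

    MPI_Request req;
    /* Start the nonblocking alltoall; with DPU offload the exchange is
     * progressed by the DPU's Arm cores while the host keeps computing. */
    MPI_Ialltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                  MPI_COMM_WORLD, &req);

    /* Stand-in for application computation overlapped with communication */
    volatile double acc = 0.0;
    for (long i = 0; i < 100000000L; i++)
        acc += 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the collective */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```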
Performance
Figure 1: Capability of the MVAPICH2-DPU library to extract peak overlap between computation happening at the host and MPI_Ialltoall communication
Figure 1 illustrates performance results of the MPI_Ialltoall nonblocking collective benchmark running with 512 MPI processes (32 nodes with 16 processes per node (PPN)) and 1,024 MPI processes (32 nodes with 32 PPN). As the message size increases, the MVAPICH2-DPU library demonstrates peak (100%) overlap between computation and the MPI_Ialltoall nonblocking collective. In contrast, the default MVAPICH2 library, which lacks this DPU offloading capability, provides very little overlap between computation and the MPI_Ialltoall nonblocking collective.
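For reference, the overlap percentage reported by OSU-style nonblocking collective benchmarks is typically estimated by comparing the overlapped run against pure-communication and pure-computation timings. The helper below sketches that general approach; it is not the exact OSU Micro-Benchmarks source.

```c
/* Sketch of the usual overlap estimate for nonblocking collectives:
 * the communication time left exposed after overlapping with computation,
 * expressed as a percentage of the pure communication time. */
#include <math.h>

double overlap_percent(double t_overlapped,  /* Ialltoall + compute + Wait       */
                       double t_compute,     /* the computation alone            */
                       double t_pure_comm)   /* Ialltoall + Wait, no computation */
{
    double exposed = t_overlapped - t_compute;        /* un-hidden communication */
    double overlap = 100.0 * (1.0 - exposed / t_pure_comm);
    return fmax(0.0, fmin(100.0, overlap));           /* clamp to [0, 100] */
}
```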
Figure 2: Capability of the MVAPICH2-DPU library to reduce the overall execution time of an MPI application when computation steps are overlapped with the MPI_Ialltoall nonblocking collective operation.
When computation steps in an MPI application are overlapped with the MPI_Ialltoall nonblocking collective operation, the MVAPICH2-DPU MPI library provides significant reductions in overall program execution time. This is possible because the Arm cores in the DPUs progress the nonblocking alltoall operations while the Xeon cores on the host perform computation with peak overlap, as illustrated in Figure 1. As indicated in Figure 2, the MVAPICH2-DPU MPI library delivers up to 23% improvement in overall execution time compared to the default MVAPICH2 MPI library across message sizes and PPNs in a 32-node experiment with the OMB MPI_Ialltoall benchmark.
MVAPICH2-DPU can also reduce the total execution time of the osu_iallgather microbenchmark by offloading the MPI_Iallgather collective to the DPU to overlap communication and computation. Figure 3 shows an 84% reduction in the total execution time of the benchmark for a 512-process job.
Figure 3: Capability of MVAPICH2-DPU to reduce Total Execution Time of osu_iallgather
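The application-side pattern for MPI_Iallgather mirrors the alltoall case above; the short sketch below is illustrative, with a placeholder compute loop rather than the actual osu_iallgather code.

```c
/* Sketch of overlapping MPI_Iallgather with host computation.
 * The contribution size and the dummy compute loop are placeholders. */
#include <mpi.h>
#include <stdlib.h>

void iallgather_overlapped(const double *contrib, int count, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    /* Every rank receives 'count' doubles from every other rank */
    double *gathered = calloc((size_t)count * nprocs, sizeof(double));

    MPI_Request req;
    MPI_Iallgather(contrib, count, MPI_DOUBLE,
                   gathered, count, MPI_DOUBLE, comm, &req);

    /* Stand-in for application computation overlapped with the collective */
    volatile double acc = 0.0;
    for (long i = 0; i < 50000000L; i++)
        acc += 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(gathered);
}
```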
Figure 4 illustrates the overlap of communication and computation achieved by MVAPICH2-DPU with the osu_iallgather microbenchmark, compared to the default (no offloading) MVAPICH2. The subfigures show that, for most message sizes, the purely host-based algorithms do not provide any overlap. Using the offload mechanism of MVAPICH2-DPU to progress communication instead provides near 100% overlap with 16 nodes and 1 PPN, and up to 73% overlap with 16 nodes and 32 PPN.
Figure 4: Overlap of Computation with Communication of osu_iallgather
In addition to MPI_Ialltoall and MPI_Iallgather, MVAPICH2-DPU can offload the MPI_Ibcast nonblocking collective. MVAPICH2-DPU reduces the total execution time of the osu_ibcast microbenchmark by up to 48% for a 512-process job at a 16 MB message size, and by 58% for a 32-node, 1-PPN job, as shown in Figure 5.
Figure 5: Overall Execution Time of osu_ibcast
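From the application's point of view, the broadcast follows the same start/compute/wait pattern as the other collectives; the sketch below is illustrative, with the 16 MB payload chosen to mirror the figure and a placeholder compute loop.

```c
/* Sketch of overlapping MPI_Ibcast with host computation.
 * The 16 MB payload mirrors the benchmark message size in Figure 5. */
#include <mpi.h>
#include <stdlib.h>

void ibcast_overlapped(MPI_Comm comm)
{
    const int nbytes = 16 * 1024 * 1024;   /* 16 MB broadcast payload */
    char *buf = calloc((size_t)nbytes, 1);

    MPI_Request req;
    /* Rank 0 broadcasts; all ranks keep computing while the message moves */
    MPI_Ibcast(buf, nbytes, MPI_CHAR, 0, comm, &req);

    volatile double acc = 0.0;             /* overlapped dummy computation */
    for (long i = 0; i < 50000000L; i++)
        acc += 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(buf);
}
```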
Figure 6 illustrates the overlap of MPI_Ibcast communication and computation that MVAPICH2-DPU can provide compared to host-based MVAPICH2. As shown in the figure, MVAPICH2-DPU achieves up to 98% overlap, 38% more than host-based MVAPICH2.
Figure 6: Overlap of osu_ibcast
Scientific Application Evaluation with P3DFFT
An enhanced version of the P3DFFT MPI kernel was evaluated on the 32-node HPC-AI cluster with the MVAPICH2-DPU MPI library. As illustrated in Figure 7, the MVAPICH2-DPU MPI library reduces the overall execution time of the P3DFFT application kernel by up to 21% across various grid sizes and PPNs.
Figure 7: Capability of the MVAPICH2-DPU library to reduce overall execution time of the P3DFFT application.
NVIDIA Developer Blog
Additional details are available in the NVIDIA Developer Blog post “Accelerating Scientific Applications in HPC Clusters with NVIDIA DPUs Using the MVAPICH2-DPU MPI Library”.
Video of Live Demo at the MUG ’21 Conference
Contact
Interested in a free trial license? Please email us at contactus@x-scalesolutions.com for support.