X-ScaleAI-DPU is a high-performance solution that accelerates CPU-based distributed DNN training by utilizing the capabilities of data processing units (DPUs).

Features

  • Exploits HPC technologies for CPU-based deep learning
  • Offloads DNN training tasks to the DPU
  • Support for DNN checkpointing with DPUs (New)
  • Support for BlueField-3 DPUs (New)
  • User-friendly Python interface to run DL applications on the CPU and DPU
  • Fine-tuned MPI library for CPU and DPU systems
  • Distributed training with PyTorch using Horovod
  • “Out of the box” optimal performance on CPU+DPU platforms
  • Tested on several DNNs and datasets, with up to 19% improvement in DNN training performance without checkpointing (New)
  • Up to 33% improvement in epoch time with checkpointing (New)
  • Simple installation and execution in one command
  • Coming soon: support for more system configurations

Installation

X-ScaleAI-DPU offers a one-command installation process.

Sample Run

X-ScaleAI-DPU also offers a simple run command.

Performance Evaluation

System Configuration

  • Two 16-core Intel(R) Xeon(R) E5-2697A V4 CPUs @ 2.60 GHz (32 cores total)
  • NVIDIA BlueField-3 SoC, NDR200 200Gb/s InfiniBand adapters
  • NVIDIA BlueField-2 SoC, HDR100 100Gb/s InfiniBand adapters
  • Memory: 256GB DDR4 2400MHz RDIMMs per node
  • 1TB 7.2K RPM SSD 2.5″ hard drive per node
  • NVIDIA ConnectX-6 HDR/HDR100 200/100Gb/s InfiniBand/VPI adapters with Socket Direct

1. X-ScaleAI-DPU improvement for DNN training

Performance improvement using X-ScaleAI-DPU over CPU-only training on the ResNet-20v1 model on the CIFAR10 dataset (BlueField-3)
Performance improvement using X-ScaleAI-DPU over CPU-only training on the ResNet-20v1 model on the CIFAR10 dataset (BlueField-2)
  • Up to 19% improvement in training performance using X-ScaleAI-DPU
  • Support for both BlueField-2 and BlueField-3
  • Performance improvement across different DL models and datasets

2. X-ScaleAI-DPU improvement for DNN training with checkpointing

Performance improvement for checkpointing using X-ScaleAI-DPU over CPU-only training on the ResNet-34 model on the CIFAR10 dataset (BlueField-3)
  • Up to 33% improvement in epoch time on the ResNet-34 model using X-ScaleAI-DPU compared to CPU-only training
  • The percentage improvement from using X-ScaleAI-DPU for checkpointing increases as the number of nodes increases
  • Performance improvement across different DL models and datasets
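
The page does not state how the improvement percentages above are computed. A minimal sketch of the usual arithmetic follows, assuming that an "improvement in epoch time" means the relative reduction in time against the CPU-only baseline and that an "improvement in training performance" means the relative gain in throughput. The function names and the sample numbers are hypothetical and serve only to illustrate the calculation; they are not measurements from this evaluation.

```python
def epoch_time_improvement(baseline_s: float, accelerated_s: float) -> float:
    """Percent reduction in epoch time relative to the CPU-only baseline.

    Assumed definition: 100 * (baseline - accelerated) / baseline.
    """
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return 100.0 * (baseline_s - accelerated_s) / baseline_s


def throughput_improvement(baseline_ips: float, accelerated_ips: float) -> float:
    """Percent increase in throughput (e.g. images/second) over the baseline.

    Assumed definition: 100 * (accelerated - baseline) / baseline.
    """
    if baseline_ips <= 0:
        raise ValueError("baseline_ips must be positive")
    return 100.0 * (accelerated_ips - baseline_ips) / baseline_ips


if __name__ == "__main__":
    # Hypothetical values chosen only to show the arithmetic.
    print(f"epoch time: {epoch_time_improvement(120.0, 90.0):.1f}% faster")
    print(f"throughput: {throughput_improvement(1000.0, 1150.0):.1f}% higher")
```

The two definitions are not interchangeable: a 25% reduction in epoch time corresponds to roughly a 33% gain in throughput, so it matters which metric a given figure reports.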