X-ScaleAI-DPU is a high-performance solution that accelerates CPU-based distributed DNN training by using the capabilities of data processing units (DPUs).


  • Exploits HPC technologies for CPU-based deep learning
  • Offloads DNN training tasks to the DPU
  • User-friendly Python interface to run DL applications on the CPU and DPU
  • Fine-tuned MPI library for CPU and DPU systems
  • Distributed training with PyTorch using Horovod
  • “Out of the box” optimal performance on CPU+DPU platforms
  • Tested on several DNNs and datasets with up to 17% improvement in performance
  • Simple installation and execution in one command
  • Coming Soon: support for more system configurations
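
For readers unfamiliar with the "distributed training with PyTorch using Horovod" item above, here is a minimal sketch of what a generic Horovod data-parallel training script looks like. It uses only the public Horovod and PyTorch APIs. It does not show the X-ScaleAI-DPU Python interface or how work is offloaded to the DPU, neither of which is described here. The helper `scaled_learning_rate`, the function `train`, the synthetic data, and the tiny linear model are all illustrative assumptions.

```python
def scaled_learning_rate(base_lr: float, world_size: int) -> float:
    """Linear scaling rule often used in data-parallel training (an assumption here)."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return base_lr * world_size


def train(epochs: int = 1, base_lr: float = 0.01) -> None:
    """Generic Horovod + PyTorch training loop on synthetic data (CPU only)."""
    # Imported inside the function so the helper above stays usable
    # on machines where PyTorch or Horovod is not installed.
    import torch
    import horovod.torch as hvd
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    hvd.init()            # one process per worker, started by the launcher
    torch.manual_seed(0)  # every rank builds the same synthetic dataset

    # Synthetic stand-in for a real dataset such as CIFAR10.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    # Each rank reads a disjoint shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = torch.nn.Linear(32, 10)  # placeholder for a real DNN
    optimizer = torch.optim.SGD(
        model.parameters(), lr=scaled_learning_rate(base_lr, hvd.size())
    )
    # Wrap the optimizer so gradients are averaged across ranks.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Start every rank from identical weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        if hvd.rank() == 0:
            print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```

With stock Horovod, a script that calls `train()` would typically be launched with something like `horovodrun -np 4 python train.py`. The X-ScaleAI-DPU run command may differ.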


X-ScaleAI-DPU offers a one-command installation process.

Sample Run

X-ScaleAI-DPU also offers a simple run command.


System Configuration

  • Two 16-core Intel(R) Xeon(R) E5-2697A V4 CPUs @ 2.60 GHz (32 cores total)
  • NVIDIA BlueField-2 SoC, HDR100 100Gb/s InfiniBand/VPI adapters
  • Memory: 256GB DDR4 2400MHz RDIMMs per node
  • 1TB 7.2K RPM SSD 2.5″ hard drive per node
  • NVIDIA ConnectX-6 HDR/HDR100 200/100Gb/s InfiniBand/VPI adapters with Socket Direct

Performance improvement using X-ScaleAI-DPU over CPU-only training on the ResNet-20v1 model on the CIFAR10 dataset

Performance improvement using X-ScaleAI-DPU over CPU-only training on the ShuffleNet model on the TinyImageNet dataset

  • Up to 17% improvement in training performance using X-ScaleAI-DPU
  • Consistent improvement with scaling up to 32 nodes
  • Performance improvement across different DL models and datasets
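
A figure such as "up to 17%" is a relative improvement over the CPU-only baseline. The sketch below shows one common way to compute it. The metric behind the reported numbers is not stated here, so the function handles both a throughput-style metric (higher is better) and a time-style metric (lower is better). The function name and the input values are hypothetical placeholders, not measurements.

```python
def percent_improvement(baseline: float, accelerated: float,
                        higher_is_better: bool = True) -> float:
    """Relative improvement of `accelerated` over `baseline`, in percent."""
    if baseline <= 0 or accelerated <= 0:
        raise ValueError("measurements must be positive")
    if higher_is_better:
        # e.g. throughput in images/sec
        return 100.0 * (accelerated - baseline) / baseline
    # e.g. time per epoch in seconds
    return 100.0 * (baseline - accelerated) / baseline


# Hypothetical throughput values, chosen only to illustrate the arithmetic.
cpu_only = 1000.0
cpu_plus_dpu = 1170.0
print(f"{percent_improvement(cpu_only, cpu_plus_dpu):.1f}% improvement")
```

When comparing runs at different node counts, apply the same formula per node count so that each CPU+DPU result is measured against the CPU-only run at the same scale.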