
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI

Read our paper, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation,” on arXiv.org.

Synopsis

In this work, minds.ai, together with the MVAPICH team at Ohio State University led by Prof. DK Panda, carried out extensive benchmarking and testing of distributed TensorFlow training methods. The goal was to determine which method performs best on various high-performance computing infrastructures. The systems tested ranged from university clusters to the Piz Daint supercomputer and its 5000 GPU-powered compute nodes. The results showed that the best choice of distribution method and communication libraries depends strongly on the workload. The work also identified several bottlenecks in MVAPICH and proposed improvements that have since been incorporated into the stable MVAPICH distribution.
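To illustrate what MPI-driven distributed TensorFlow training looks like in practice, below is a minimal sketch of allreduce-style data parallelism using Horovod. This is not the benchmark code from the paper; it assumes TensorFlow and Horovod are installed, with Horovod built against a CUDA-aware MPI library such as MVAPICH2-GDR, and it uses placeholder data and a toy model.

```python
# Minimal sketch of data-parallel training with Horovod over MPI.
# Assumes Horovod was built against a CUDA-aware MPI (e.g. MVAPICH2-GDR).
# Launch with one rank per GPU, e.g.:  mpirun -np 4 python train.py
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # initialize Horovod on top of the MPI runtime

# Pin each MPI rank to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Synthetic stand-in data so the sketch is self-contained.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of ranks and wrap the optimizer
# so gradients are averaged across ranks with allreduce each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Broadcast rank 0's initial weights so all ranks start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Whether the allreduce behind `DistributedOptimizer` actually moves GPU buffers directly over the interconnect depends on the underlying MPI library being CUDA-aware, which is exactly the kind of library-level behavior the paper characterizes.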
