Read our paper, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation,” on arXiv.org.
In this work, minds.ai, together with the MVAPICH team at Ohio State University led by Prof. DK Panda, carried out extensive benchmarking and characterization of distributed TensorFlow training methods. The goal was to determine which method performs best on various high-performance computing infrastructures, ranging from university clusters to the Piz Daint supercomputer with its roughly 5,000 GPU-powered compute nodes. The results show that the best choice of distribution method and communication library depends on the workload, so both must be selected with care. The work also identified several bottlenecks in MVAPICH; the proposed improvements have since been incorporated into the stable MVAPICH distribution.
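The distribution methods compared in the paper share a common data-parallel pattern: each worker computes gradients on its own shard of the batch, then an allreduce averages the gradients so every worker applies the same update. As a rough illustration only, here is a minimal pure-Python stand-in for that pattern; the function names are hypothetical and this is not the paper's code, which uses TensorFlow with CUDA-Aware MPI.

```python
# Hypothetical sketch of synchronous data-parallel SGD, the pattern
# underlying MPI-based distributed TensorFlow training. A real run
# would use MPI_Allreduce over GPUs; here plain Python lists stand in.

def allreduce_average(worker_grads):
    """Average per-worker gradient vectors (simulates allreduce + divide)."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    summed = [sum(g[i] for g in worker_grads) for i in range(dim)]
    return [s / n for s in summed]

def sgd_step(weights, worker_grads, lr=0.1):
    """One synchronous step: average gradients, then update every replica."""
    avg = allreduce_average(worker_grads)
    return [w - lr * g for w, g in zip(weights, avg)]

weights = [1.0, 2.0]
grads = [[0.2, 0.4], [0.6, 0.0]]  # gradients from two workers' shards
weights = sgd_step(weights, grads)
print(weights)  # averaged gradients [0.4, 0.2] -> [0.96, 1.98]
```

Because every worker sees the same averaged gradient, all replicas stay in sync; the cost and scalability of that averaging step is exactly where the communication library (e.g. MVAPICH) matters.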