HPC Technique Propels Deep Learning at Scale
By HPC Wire
Baidu's Silicon Valley Artificial Intelligence Lab (SVAIL) has released a modified implementation of OpenMPI's ring all-reduce algorithm to the deep-learning community, which will enable faster training of neural networks across graphics processing unit (GPU) nodes.
Unlike the OpenMPI version, the SVAIL modification avoids making extraneous copies between the central processing unit (CPU) and the GPU.
Although commonplace in high-performance computing, the technique has been underused within AI and deep learning, according to Baidu. Compared with using a single GPU, the ring all-reduce algorithm is about 31 times faster at 40 GPUs.
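For readers unfamiliar with the technique, the sketch below simulates a ring all-reduce in plain Python. It is an illustration of the general algorithm only, not SVAIL's code: the function name ring_allreduce, the list-of-lists representation of per-worker buffers, and the chunking scheme are assumptions made for this example, and real implementations move data between GPUs rather than between Python lists.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over workers arranged in a ring.

    `buffers` holds one equal-length list of numbers per simulated worker
    (a hypothetical stand-in for per-GPU gradient buffers). Returns a new
    list of buffers in which every worker holds the element-wise sum.
    """
    n = len(buffers)
    if n == 0:
        return []
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all workers must hold buffers of the same length")

    data = [list(b) for b in buffers]  # work on copies
    # Split each buffer into n contiguous chunks (some may be empty).
    bounds = [(c * length) // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1, scatter-reduce: in each of n - 1 steps, every worker sends
    # one chunk to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        sends = []
        for rank in range(n):
            c = (rank - step) % n
            sends.append((c, [data[rank][i] for i in chunk(c)]))
        for rank, (c, payload) in enumerate(sends):
            dst = (rank + 1) % n
            for i, value in zip(chunk(c), payload):
                data[dst][i] += value

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: circulate the finished chunks around the ring,
    # overwriting instead of adding.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            sends.append((c, [data[rank][i] for i in chunk(c)]))
        for rank, (c, payload) in enumerate(sends):
            dst = (rank + 1) % n
            for i, value in zip(chunk(c), payload):
                data[dst][i] = value

    return data
```

In this scheme each worker sends 2(N - 1) chunks of roughly 1/N of the buffer, so the data sent per worker stays close to twice the buffer size regardless of how many workers join the ring. That property is why the approach can scale well as GPUs are added.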
The algorithm has enabled the SVAIL team to achieve linear scaling up to 128 GPUs and to parallelize the training of Deep Speech 2, its speech-recognition model.
Two years after the approach was initially developed, the researchers have issued two non-proprietary implementations, one for TensorFlow and one for more general applications. ...