Accelerating Big Data Processing and Machine/Deep Learning Middleware on Modern HPC Clusters

Authors: Mr. Gilad Shainer (Mellanox Technologies)

Abstract: The convergence of HPC, Big Data, and Machine/Deep Learning is the next game-changing business opportunity. Machine/Deep Learning is a pillar of today’s technological world and enables better decisions based on the large amounts of data being collected. This BoF will involve all the speakers and the audience in identifying the most critical challenges facing the community and in developing a roadmap for the next 5-10 years for accelerating Big Data processing and Machine/Deep Learning middleware (e.g., Hadoop/Spark/TensorFlow/Caffe) on modern HPC clusters. Recent examples from organizations such as Baidu, Tencent, NVIDIA, Stanford, OSU, and more will be discussed.

Long Description: Data analytics has become an essential function within many high-performance and enterprise data centers, clouds, and hyperscale platforms. Machine Learning is a pillar of today’s technological world, offering solutions that enable better and more accurate decisions based on the large amounts of data being collected. Machine Learning encompasses a wide range of applications, from security, finance, and image and voice recognition to self-driving cars and smart cities.

Training a deep neural network requires complex computations. Advances in CPUs and GPUs are not keeping pace with the growing demands of deep neural network training. As an example, training a GoogLeNet deep network on ImageNet-1K takes 21 days on a single GPU. Distributing the learning phase across a large cluster of CPUs / GPUs can dramatically reduce the training time, but requires a scalable distributed implementation of the learning flow.

Multiple software implementations are now available, including reduction trees and other algorithms, and many papers have been published on this topic. All the existing implementations move data from one node to another to calculate the summation, which limits their performance. Moving the summation operation to smart network elements can dramatically improve its performance: it reduces the amount of data moving from one node to another, reduces the time to complete the sum operation, and fully offloads the operation from the CPU / GPU to the network, allowing better utilization of the GPU resource.
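
To make the reduction-tree idea concrete, here is a minimal sketch in plain Python that simulates summing per-worker vectors over a binary tree and counts the communication rounds and node-to-node transfers involved. It is an illustration only, not any framework's or vendor's implementation: the function name tree_reduce_sum and the list-of-lists representation of workers are assumptions made for this example, and the step that broadcasts the result back to all workers is omitted.

```python
def tree_reduce_sum(vectors):
    """Sum equal-length vectors held by simulated workers over a binary tree.

    Hypothetical helper for illustration. Returns a tuple
    (summed_vector, rounds, messages), where `messages` counts the
    node-to-node transfers that a software implementation would perform.
    """
    if not vectors:
        raise ValueError("need at least one worker")
    # Copy the inputs so the caller's lists are not modified.
    bufs = [list(v) for v in vectors]
    n = len(bufs)
    rounds = 0
    messages = 0
    stride = 1
    while stride < n:
        # In each round, worker i + stride sends its partial sum to worker i.
        for i in range(0, n - stride, 2 * stride):
            src = bufs[i + stride]
            dst = bufs[i]
            for k in range(len(dst)):
                dst[k] += src[k]
            messages += 1
        stride *= 2
        rounds += 1
    # Worker 0 now holds the full sum.
    return bufs[0], rounds, messages


if __name__ == "__main__":
    workers = [[1, 2], [3, 4], [5, 6], [7, 8]]
    total, rounds, messages = tree_reduce_sum(workers)
    print(total, rounds, messages)
```

In this simulation every partial sum crosses the network between nodes and is added on the receiving host, which is the data movement that the paragraph above proposes moving into the network elements.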

The session will cover the latest developments in accelerating machine learning algorithms and frameworks such as TensorFlow, Paddle, Caffe-2, and Apache Spark. Recent examples from organizations such as Baidu, Berkeley, Tencent, NVIDIA, Stanford, and more will be reviewed and discussed.

Conference Presentation: pdf
