SC17 Denver, CO

Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications


Authors: Prof. Dhabaleswar Panda (Ohio State University)

Abstract: Cloud Computing platforms have become the desired environments for running HPC, Big Data, and Deep Learning workloads. The community is seeing both the opportunities and the challenges of designing high-performance HPC/Big Data/Deep Learning runtimes over clouds. This BoF will involve all the speakers and the audience in identifying the most critical challenges facing the community and in drawing up a roadmap for the next 5-10 years for building efficient clouds with virtual machines and containers for running HPC/Big Data/Deep Learning workloads. In-depth overviews of virtualization system software, high-performance communication and I/O mechanisms, and example applications on HPC clouds will be presented.

Long Description: Significant growth has been witnessed during the last few years in HPC clusters with multi-/many-core processors, accelerators, and high-performance interconnects (e.g., InfiniBand/Omni-Path/RoCE). To alleviate the cost burden, sharing HPC cluster resources with end users through virtualization is becoming more attractive. The recently introduced Single Root I/O Virtualization (SR-IOV) technique for InfiniBand and High-Speed Ethernet provides native I/O virtualization capabilities and is changing the landscape of HPC virtualization. However, SR-IOV lacks support for locality-aware communication and live migration, which limits its use for efficiently running HPC/Big Data/Deep Learning workloads. In this context, the proposed BoF will organize several talks and discussions with the audience on the following aspects.

Goal: Involve all the speakers and the audience in identifying the most critical challenges facing the community in building efficient clouds with virtual machines and containers for running HPC, Big Data, and Deep Learning middleware and applications on modern HPC and data center architectures with high-performance interconnects, SR-IOV, accelerators, and storage technologies.

Topic: In-depth overviews of popular virtualization system software (e.g., hypervisors, containers, OpenStack, Slurm) and of high-performance communication and I/O mechanisms on HPC clouds. Discussions of the opportunities and technical challenges of designing high-performance HPC/Big Data/Deep Learning runtimes over cloud environments. Demonstrations of how high-performance solutions can be designed to run HPC, Big Data, and Deep Learning workloads (like MPI, Hadoop, Spark, TensorFlow) in clouds. Soliciting feedback from the audience to draw up a roadmap for the next 5-10 years on how to efficiently handle the grand challenges associated with building efficient HPC clouds.

A sequence of short presentations from prominent people (stakeholders) working in this area, followed by a panel discussion with the audience.

Intended Audience: Various categories of people working in the areas of Cloud Computing, Virtualization, HPC, Big Data, and Deep Learning, including scientists, engineers, researchers, students, managers, and newcomers.

Relevance to the Expected HPC Audience: This proposed BoF will discuss the opportunities and technical challenges of designing high-performance runtimes over cloud environments that deliver near-native performance to HPC, Big Data, and Deep Learning workloads. The SC conference is the leading forum for discussing these topics. This BoF will help attendees learn about and rethink multiple aspects of running HPC, Big Data, and Deep Learning middleware and applications efficiently over cloud environments.

Previous Organization: This is the first time this BoF is being organized.

Outcome: We will write a report describing the results of a survey of BoF attendees on specific questions. The tentative survey questions are: 1) Which types of cloud environment are you using in your work? Virtual Machine, Container, or Nested? 2) Have you ever encountered any performance or scalability issues with those cloud environments? How did you finally solve those issues? 3) What kinds of opportunities and challenges have you seen for building efficient clouds with modern software and hardware technologies? 4) What kinds of technologies do you think are important for the next 5-10 years to efficiently handle the grand challenges associated with building efficient HPC clouds?

Conference Presentation: pdf

