Event Type: Tutorial

Networks
Time: Monday, November 13th, 1:30pm - 5pm
Location: 205
Description: As InfiniBand (IB), Omni-Path, and High-Speed Ethernet
(HSE) technologies mature, they are being used to design
and deploy various High-End Computing (HEC) systems: HPC
clusters with GPGPUs and Xeon Phis supporting MPI,
Storage and Parallel File Systems, Cloud Computing
systems with SR-IOV Virtualization, Grid Computing
systems, and Deep Learning systems. These systems are
bringing new challenges in terms of performance,
scalability, portability, reliability and network
congestion. Many scientists, engineers, researchers,
managers and system administrators are becoming
interested in learning about these challenges, the
approaches being used to address them, and the
associated impact on performance and scalability.
This tutorial will start with an overview of these systems. It will emphasize the advanced hardware and software features of IB, Omni-Path, HSE, and RoCE and their capabilities to address these challenges. Next, we will focus on OpenFabrics RDMA and Libfabric programming, and on the network management infrastructure and tools needed to use these systems effectively. A common set of challenges faced while designing these systems will be presented. Finally, case studies focusing on domain-specific challenges in designing these systems (including the associated software stacks), their solutions, and sample performance numbers will be presented.