P47: Understanding Congestion on Omni-Path Fabrics

Authors: Lauren Gillespie (Southwestern University), Christopher Leap (University of New Mexico), Dan Cassidy (Los Alamos National Laboratory)

Abstract: High-performance computing systems require high-speed interconnects, such as InfiniBand (IB), to transmit data efficiently. Intel’s Omni-Path Architecture (OPA) is a new interconnect, similar to IB, that is deployed on some of Los Alamos National Laboratory’s recent clusters. Both interconnects suffer degraded performance under heavy network traffic, which results in packet discards. Unlike IB, however, OPA reports these drops explicitly through a performance counter, congestion discards. Because OPA fabric technology is relatively immature, the correlation between performance degradation and congestion discards has not yet been fully evaluated. This research aims to improve understanding of how congestion affects cluster performance by injecting data into the OPA fabric at a load high enough to induce performance degradation, so that the cause of the degradation can be evaluated. LA-UR-17-26341
