Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

Authors: Dr. Christian Engelmann (Oak Ridge National Laboratory)

Abstract: This session brings together a group of international experts from the Accelerated Data Analytics and Computing Institute to present their efforts in characterizing faults, errors, and failures in extreme-scale systems and to discuss practical experiences with software tools and infrastructures, including operational aspects. The ADAC Institute is a collaboration between Oak Ridge National Laboratory, the Swiss Federal Institute of Technology Zurich, Tokyo Institute of Technology, Lawrence Livermore National Laboratory, Juelich Research Centre, the University of Tokyo, Cray, Nvidia, and Intel. The session includes short presentations and a discussion that focuses on future research and development, collaboration opportunities, and vendor interactions.

Long Description: Today's high-performance computing (HPC) systems are heavily instrumented for system health monitoring. System administrators typically use this capability to identify faulty components through manual root cause analysis. The health data is captured in several ways. System logs contain information about abnormal events, such as critical conditions, faults, errors, and failures. Job logs maintain a history of application runs and their exit statuses, i.e., successful vs. failed. Reliability, availability, and serviceability (RAS) monitoring systems provide data from hardware and software sensors, such as temperatures, memory errors, and processor usage. I/O and storage systems maintain health and performance databases. Data on application health beyond exit statuses is not collected, as applications are usually not instrumented. This data provides opportunities for root cause analysis, efficient failure detection, error propagation tracking, and system reliability evaluation, but it also presents a large-scale data analytics challenge. Selecting and correlating the right data sources is essential, as the entire volume of collected data cannot be stored for analysis.

Today's methods and tools for characterizing faults in extreme-scale HPC systems are fraught with challenges: they often do not take advantage of the full spectrum of available instrumentation, lack advanced data analytics, and do not consider application health. Fault models are generally simplistic and ignore important aspects, such as the impact of utilization history on reliability statements (i.e., a component or subsystem that is not used very much is statistically very reliable -- no utilization, no faults). Easy-to-use tools and efficient frameworks are needed for characterizing extreme-scale systems.
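
The utilization effect can be made concrete with a small sketch. The record type, the node names, and all numbers below are hypothetical and chosen only for illustration; they are not measurements from any ADAC system. The sketch compares a mean time between failures computed over wall-clock hours with one computed over utilized hours.

```python
from dataclasses import dataclass


@dataclass
class ComponentRecord:
    """Hypothetical one-year summary for a single component."""
    name: str
    wall_clock_hours: float  # time the component was installed
    utilized_hours: float    # time the component was actually running jobs
    failures: int            # failures observed in that period


def mtbf_wall_clock(rec: ComponentRecord) -> float:
    """Mean time between failures, ignoring utilization."""
    return rec.wall_clock_hours / rec.failures if rec.failures else float("inf")


def mtbf_utilized(rec: ComponentRecord) -> float:
    """Mean time between failures per hour of actual use."""
    return rec.utilized_hours / rec.failures if rec.failures else float("inf")


# Illustrative values only: one heavily used node, one mostly idle node.
busy = ComponentRecord("node-a", wall_clock_hours=8760.0, utilized_hours=7884.0, failures=4)
idle = ComponentRecord("node-b", wall_clock_hours=8760.0, utilized_hours=438.0, failures=1)

for rec in (busy, idle):
    print(rec.name, mtbf_wall_clock(rec), mtbf_utilized(rec))
```

With these made-up numbers, the idle node appears four times more reliable when measured against wall-clock time, yet it fails more often per hour of actual use. A fault model that ignores utilization history would rank the two nodes the wrong way around.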

Several HPC centers have recently focused on the development of new software tools and hardware and software infrastructures for system and application monitoring and analysis. The primary goal is to improve resilience through reliable fault detection at an early stage with actionable information for efficient mitigation during system design and runtime. The approaches range from analyzing logs offline with software tools to analyzing instrumentation data streams in near real time with a more complex software infrastructure. These technologies are required to handle high-volume data gathering and processing tasks, such as event classification, event statistics, temporal and spatial correlation, and root cause analyses. This opens up new possibilities for applying modern data analytics approaches that tackle Big Data problems to the characterization of operational faults, errors, and failures in extreme-scale systems.
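
As a rough illustration of temporal and spatial correlation, the sketch below groups log events into fixed time windows per location and reports the windows in which more than one event category occurred together. The event tuple layout, the category names, the cabinet identifiers, and the 60-second window are assumptions made for this example; they do not describe any particular center's log format or tooling.

```python
from collections import Counter, defaultdict


def bucket_events(events, window_seconds=60):
    """Count event categories per (time window, location).

    events: iterable of (timestamp_seconds, location, category) tuples,
    a hypothetical, already-parsed form of system log entries.
    """
    buckets = defaultdict(Counter)
    for ts, location, category in events:
        # Align the timestamp to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        buckets[(window_start, location)][category] += 1
    return buckets


def co_occurring(buckets, min_categories=2):
    """Keep only buckets where several distinct categories appear together."""
    return {
        key: dict(counts)
        for key, counts in buckets.items()
        if len(counts) >= min_categories
    }


# Illustrative events only.
events = [
    (5, "cab0", "memory_error"),
    (42, "cab0", "node_down"),
    (70, "cab1", "memory_error"),
    (3700, "cab0", "memory_error"),
]

buckets = bucket_events(events)
print(co_occurring(buckets))
```

Fixed windows are the simplest choice and can split related events that straddle a window boundary. A sliding window or a streaming framework would be more appropriate for near-real-time analysis.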

This Birds of a Feather session brings together a group of international experts on this topic as part of the Accelerated Data Analytics and Computing (ADAC) Institute. The ADAC Institute is a collaboration between Oak Ridge National Laboratory, the Swiss Federal Institute of Technology (ETH) Zurich, Tokyo Institute of Technology, Lawrence Livermore National Laboratory, Juelich Research Centre, the University of Tokyo, Cray, Nvidia, and Intel that, among other aspects, focuses on sharing best practices regarding the operation, management, and procurement of HPC resources. As part of this effort, this BoF session will feature representatives from the ADAC Institute in the form of short presentations and a discussion panel.

Conference Presentation: pdf
