Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications
Event Type
Paper

State of the Practice
Time: Wednesday, November 15th, 3:30pm - 4pm
Location: 405-406-407
Description: Resilience is one of the key challenges in maintaining the high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. While the complexity of managing system reliability has increased, the number of studies offering comprehensive quantification and deep analysis of failure characteristics in large-scale systems has not increased in the same proportion. To bridge this gap, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings that remain valid, report new findings, and discuss the implications of those new findings.