P89: Desh: Deep Learning for HPC System Health Resilience

Authors: Anwesha Das (North Carolina State University), Abhinav Vishnu (Pacific Northwest National Laboratory), Charles Siegel (Pacific Northwest National Laboratory), Frank Mueller (North Carolina State University)

Abstract: HPC systems are well known to endure service downtime due to increasing failures. With enhancements in HPC architectures and design, enabling resilience is extremely challenging due to component scaling and the absence of well-defined failure indicators. HPC system logs are notoriously complex and unstructured. Efficient fault prediction that enables proactive recovery mechanisms is needed to make such systems more robust and reliable. This work addresses such faults in computing systems using a recurrent neural network technique called LSTM (long short-term memory).

We present our framework Desh: Deep Learning for HPC System Health, which entails a procedure to diagnose and predict failures with acceptable lead times. Desh shows promise in identifying failure indicators, with enhanced training and classification for generic applicability to other systems. This deep-learning-based framework gives interesting insights for further work on HPC system reliability.
Award: Best Poster Finalist (BP): no

Poster: pdf
Two-page extended abstract: pdf