Parallel Programming Languages, Libraries, Models and Notations
Time: Thursday, November 16th, 4pm - 4:30pm
Description: While program hangs on large parallel systems can be detected via the widely used timeout mechanism, it is difficult to set an optimal timeout threshold when users have limited knowledge of a program. A timeout that is too small leads to a high false alarm rate, and one that is too large wastes valuable computing resources. This paper presents a highly efficient hang detection tool, ParaStack, that does not rely on a timeout. We have adapted ParaStack to work with the Torque and Slurm parallel job schedulers and validated both its functionality and performance on Stampede, currently the world's tenth-fastest supercomputer. Experimental results demonstrate that ParaStack can detect hangs accurately, in a timely manner, and at negligible runtime cost. ParaStack also pinpoints the faulty processes with high accuracy when the hang is caused by errors in the computation phase.