Resilient N-Body Tree Computations with Algorithm-Based Focused Recovery: Model and Performance Analysis
Workshop: The 8th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS17)
Authors: Aurelien Cavelan (University of Basel)
Abstract: This presentation gives a model and performance study of Algorithm-Based Focused Recovery (ABFR) applied to N-body computations subject to latent errors. We compare ABFR in detail with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the inherent difficulty of the analysis. With ABFR, the crucial parameter is the detection interval, which bounds the error latency. We show that the detection interval has a dramatic impact on the overhead, and that choosing its value optimally leads to significant gains over the CR approach.