Analyzing the Criticality of Transient Faults-Induced SDCs on GPU Applications
Workshop: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Abstract: In this paper, we compare the soft-error sensitivity of parallel applications on modern GPUs as measured through architecture-level fault injection and high-energy particle beam radiation experiments. Fault-injection and beam experiments provide different information and use different transient-fault sensitivity metrics, which are hard to combine. We show how correlating beam and fault-injection data can provide a deeper understanding of GPU behavior in the presence of transient faults. In particular, we demonstrate that commonly used architecture-level fault models (and fast injection tools) can be used to identify critical kernels and to associate some experimentally observed output errors with their causes. Additionally, we show how register-file and instruction-level injections can be used to evaluate the effectiveness of ECC in reducing the radiation-induced error rate.