Description
This panel will explore silent errors in HPC applications, which are expected to become significantly more frequent as semiconductor technology reaches its feature size limits. The panel will address the following questions:
Can we characterize silent errors in a way that distinguishes them from other types of errors? Specifically, when can we say that a program has suffered a silent error?
What solid evidence is there that silent errors are happening in real systems, and how are they affecting the correctness or running time of programs?
How do the propagation and detectability of errors depend on application characteristics? How does one restructure a program (including the use of minimal verification code) to enforce bounds on error impact and propagation?
Is it possible to develop general compiler, middleware, and hardware techniques that detect silent errors or mask their impact without significant performance degradation or increases in cost and energy?