Moderator
Event Type
Panel
Reliability
Reproducibility
Resiliency
Time: Friday, November 17th, 8:30am - 10am
Location: 201-203
Description: This panel will explore silent errors in HPC applications, which are expected to increase significantly as semiconductor technology reaches its feature size limits. The panel will address the following questions:
Can we characterize silent errors in a way that they can be distinguished from other types of errors? Specifically, when do we say whether or not a program has suffered from silent errors?
What solid evidence is there that silent errors are happening in real systems, and how are they affecting the correctness or running time of programs?
How does the propagation of errors and their detectability depend on application characteristics? How does one restructure a program (including use of minimal verification code) to enforce bounds on error impact and propagation?
Is it possible to develop general compiler, middleware, and hardware techniques to detect silent errors or mask their impact without significant performance degradation or increases in cost and energy?