Author/Presenters
Event Type
Workshop
Accelerators
Deep Learning
Exascale
GPU
Parallel Application Frameworks
Parallel Programming Languages, Libraries, Models and Notations
SIGHPC Workshop
System Software
Time
Sunday, November 12th, 2pm - 2:30pm
Location
505
Description
Silent data corruption (SDC) and fail-stop errors are
the most hazardous error types in high-performance
computing (HPC) systems. In this study, we present an
automatic, efficient and lightweight redundancy
mechanism to mitigate both error types. We propose
partial task-replication and checkpointing for
task-parallel HPC applications to mitigate silent and
fail-stop errors. To avoid the prohibitive costs of
complete replication, we introduce a lightweight
selective replication mechanism. Using a fully automatic
and transparent heuristic, we identify and selectively
replicate only the reliability-critical tasks based on a
risk metric. Our approach detects and corrects around
70% of silent errors with only 5% average performance
overhead. Additionally, the performance overhead of the
heuristic itself is negligible.
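
The idea of replicating only reliability-critical tasks can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the risk metric or the recovery protocol, so everything here is an assumption made for illustration: the `Task` type, its `risk` score, the `budget_fraction` parameter, and the re-execute-and-vote recovery step. Checkpointing for fail-stop errors is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Task:
    name: str
    run: Callable[[], int]  # the task body; returns a comparable result
    risk: float             # hypothetical risk score (not defined in the abstract)


def select_critical(tasks: List[Task], budget_fraction: float) -> Set[str]:
    """Return the names of the highest-risk tasks, up to a fraction of all tasks."""
    k = int(len(tasks) * budget_fraction)
    ranked = sorted(tasks, key=lambda t: t.risk, reverse=True)
    return {t.name for t in ranked[:k]}


def run_with_replication(task: Task) -> int:
    """Run a task twice and compare; on mismatch, run once more and vote."""
    first = task.run()
    second = task.run()
    if first == second:
        return first  # replicas agree: no silent error detected
    third = task.run()  # assumed recovery step: tie-breaking re-execution
    if third == first or third == second:
        return third
    raise RuntimeError(f"task {task.name}: no two replicas agree")


def execute(tasks: List[Task], budget_fraction: float) -> Dict[str, int]:
    """Replicate only the selected tasks; run all others once."""
    critical = select_critical(tasks, budget_fraction)
    results = {}
    for t in tasks:
        if t.name in critical:
            results[t.name] = run_with_replication(t)
        else:
            results[t.name] = t.run()
    return results
```

In this sketch, a smaller `budget_fraction` lowers the replication overhead but leaves more tasks unprotected. The abstract reports the same kind of trade-off: around 70% of silent errors detected and corrected at 5% average overhead.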