SC17 Denver, CO

P96: Correcting Detectable Uncorrectable Errors in Memory

Authors: Grzegorz Pawelczak (University of Bristol), Simon McIntosh-Smith (University of Bristol)

Abstract: With the expected decrease in Mean Time Between Failures, Fault Tolerance has been identified as one of the major challenges for exascale computing. One source of faults is soft errors caused by cosmic rays, which can corrupt bits of the data held in memory. Current protection against these errors includes Error Correcting Codes, which can detect and/or correct them. When an error occurs that can be detected but not corrected, a Detectable Uncorrectable Error (DUE) results, and unless checkpoint-restart is used, the system will usually fail. In our work we present a probabilistic method of correcting DUEs that occur in the part of memory where the program instructions are stored. We devise a DUE correction technique for the ARM A64 instruction set that combines an extended Hamming code with a Cyclic Redundancy Check code to provide a near-100% Successful Correction Rate.
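
The idea of pairing an extended Hamming code with a CRC can be illustrated with a minimal sketch. It is not the authors' implementation, and every detail here is an assumption made for illustration: a 39-bit extended Hamming (SECDED) codeword per 32-bit instruction word, a CRC-32 (Python's zlib.crc32) stored separately over a small block of words, and the hypothetical function names encode, check, due_candidates and correct_due. A double-bit error leaves a non-zero syndrome with even overall parity. The sketch lists every pair of bit flips that would explain that syndrome and keeps only the candidates whose block CRC matches the stored one.

```python
import zlib
from typing import List, Tuple

N = 38                                   # Hamming positions 1..38 (assumed layout)
PARITY_POS = [1, 2, 4, 8, 16, 32]        # check bits sit at powers of two
DATA_POS = [p for p in range(1, N + 1) if p not in PARITY_POS]  # 32 data bits

def encode(word: int) -> List[int]:
    """Encode a 32-bit word; index 0 holds the overall parity bit."""
    code = [0] * (N + 1)
    for k, pos in enumerate(DATA_POS):
        code[pos] = (word >> k) & 1
    syn = 0
    for pos in DATA_POS:
        if code[pos]:
            syn ^= pos
    for k, pos in enumerate(PARITY_POS):  # choose check bits so the syndrome is 0
        code[pos] = (syn >> k) & 1
    code[0] = sum(code) & 1               # make the total parity even
    return code

def extract(code: List[int]) -> int:
    """Read the 32 data bits back out of a codeword."""
    return sum(code[pos] << k for k, pos in enumerate(DATA_POS))

def check(code: List[int]) -> Tuple[int, int]:
    """Return (syndrome, overall parity). (0, 0) means no error detected."""
    syn = 0
    for pos in range(1, N + 1):
        if code[pos]:
            syn ^= pos
    return syn, sum(code) & 1

def due_candidates(code: List[int]) -> List[int]:
    """List the data words reachable by flipping any bit pair that explains the syndrome."""
    syn, parity = check(code)
    if syn == 0 or parity == 1:
        raise ValueError("not a double-bit DUE")
    out = []
    for i in range(N + 1):
        j = i ^ syn                       # the pair (i, j) XORs to the syndrome
        if i < j <= N:
            fixed = list(code)
            fixed[i] ^= 1
            fixed[j] ^= 1
            out.append(extract(fixed))
    return out

def block_crc(words: List[int]) -> int:
    """CRC-32 over a block of 32-bit words (assumed little-endian serialization)."""
    return zlib.crc32(b"".join(w.to_bytes(4, "little") for w in words))

def correct_due(block: List[int], idx: int, code: List[int], stored_crc: int) -> List[int]:
    """Keep the candidates for block[idx] whose block CRC matches the stored CRC."""
    matches = []
    for cand in due_candidates(code):
        trial = block[:idx] + [cand] + block[idx + 1:]
        if block_crc(trial) == stored_crc:
            matches.append(cand)
    return matches
```

For example, with block = [0xD503201F, 0x91000400, 0xD65F03C0] and stored_crc = block_crc(block), flipping bits 5 and 17 of encode(block[1]) produces a DUE, and correct_due(block, 1, code, stored_crc) narrows the candidates back to the original word. The sketch only covers a double-bit error in a single word of an otherwise intact block. It does not model the probabilistic aspects of the method described in the abstract.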
Award: Best Poster Finalist (BP): no

Poster: pdf
Two-page extended abstract: pdf
