SC17 Denver, CO

Resilient Programming Environments

Authors: Dr. George Bosilca (University of Tennessee)

Abstract: Mechanisms for dealing with process faults were introduced early and heartily embraced by the parallel programming paradigms used in industry, where resilience has always been a core component of the toolbox offered to the user. As resilience becomes more critical in HPC, user communities are left searching for alternative extensions to their traditional programming paradigms.

This BoF is intended as a forum between experts in different programming paradigms and users looking for solutions to circumvent the pitfalls of coping with failures. Join and become a voice that shapes the resilience support provided by parallel programming paradigms and their supporting software infrastructure.

Long Description: The User Level Failure Mitigation (ULFM) fault-tolerant MPI implementation has reached a level of maturity that allows an increasing number of users to prototype and evaluate innovative application-driven fault tolerance strategies over MPI. This effort is not solitary in the HPC field: most parallel programming paradigms have either proposed basic solutions or are actively investigating potential avenues for allowing users to develop resilient applications. Most of the efforts to provide resilience, both in industry and academia, build on a few concepts (reliable broadcast, consensus, and dynamic launching) that, used together, are the basic building blocks for a range of potential application-driven or science-domain-driven solutions.

This BoF will serve as a meeting point for developers and users of HPC solutions to share, discuss, and compare their views on applications' needs and on the opportunities for programming paradigms to provide some level of support. No particular parallel programming paradigm will be targeted, but traditionally mainstream HPC programming paradigms (MPI, PGAS) will be of interest to a wide audience. Due to time restrictions, we plan to address only a few of the parallel programming paradigms in use: MPI, X10, CoArray, GPI, and DataSpaces.

Based on past experience with BoFs at SC (not on this particular subject, but the authors are leading other BoF submissions), we have found that the BoF format makes it difficult to solicit questions live, and that discussions rarely arise spontaneously among a varied group of participants. We therefore plan to address this issue prior to the BoF by polling, online, the different communities that are currently affected by the lack of resilience in HPC programming paradigms, and to use the results as a basis for fostering discussion during the BoF. During the BoF, we plan to first have a panel of experts succinctly describe how some programming paradigms have chosen to address this issue and why the experts consider each approach valid for a particular scenario. The questions gathered online will then be addressed by the experts and discussed with the participants. The discussion will continue with questions from the public and/or more specialized topics from the panelists.

Conference Presentation: pdf
