Regression Testing and Monitoring Tools
Authors: Dr. Bilel Hadri (King Abdullah University of Science and Technology)
BP
Abstract: Supercomputers are becoming larger, more complex, and more tightly integrated systems consisting of many different hardware components: tens of thousands of processors and memory chips, kilometers of networking cables, large numbers of disks, and hundreds of applications and libraries. To increase scientific productivity and ensure that applications efficiently and effectively exploit a system’s full potential, all the components must deliver reliable, stable, and performant service. This BoF discusses best practices from supercomputing centers that use different strategies for assessing system performance, and it invites those interested to share their experiences in detecting issues related to the performance and functionality of HPC systems.
Long Description: ===Motivation:
For many years, regression testing has been an essential step of any software development or integration cycle. For HPC systems, however, regression testing is typically performed in a more ad hoc fashion. It focuses on the basic functionality of the various hardware components, either so that the system can be released back to users as soon as possible after maintenance or in response to user complaints about functionality and performance issues. Usually, the performance of each component is monitored and measured independently, but this does not capture the overall behavior of the HPC system that users and their parallel applications actually experience.
The goal of this BoF is to bring together the efforts of multiple leadership-class supercomputing centers and ideas from the HPC community to discuss regression testing, the different strategies adopted, and the lessons learned. In addition, we will consider feedback on how these efforts can be merged and made available and easy for the HPC community to implement. This BoF will bring together those with experience and interest in regression testing and those who want to explore the topic more deeply. The target audience includes computational scientists, user support and system administration staff, and members of the research community involved in benchmarking and monitoring. This will be the first meeting of this special interest group at SC; however, two BoFs were organized successfully at the Cray User Group in 2015 and 2016 with 25 attendees.
===Audience Interaction: The BoF will be in three parts.
Part 1 (10 min): The BoF will be interactive, starting with a live survey on regression testing so that the speakers get an idea of the audience, their interest in the topic, the tests they perform, and the guidance they need. The survey questions are: Are you aware of any regression testing tools? Does your HPC center perform regression testing? What tests are performed? Is it an automatic process? How long does it last?
Then, representatives of three different supercomputing centers will each give a presentation with a short demo, giving the audience an opportunity to learn about the different strategies and approaches put in place at their centers.
Part 2 (30 min): The proposers bring passion and experience in improving user satisfaction by detecting early the issues users might encounter and by providing them with the best environment for using HPC systems. Dr. Reuben Budiardja (Oak Ridge National Laboratory) will provide an overview of the Application-Level Regression Testing Framework Using Jenkins, which is implemented on NCSA and NICS systems. He was awarded Runner-Up Best Paper at the Cray User Group Meeting in 2017. Dr. Vasileios Karakasis (Swiss Supercomputing Center) will present a new framework for writing regression tests for HPC systems, called ReFrame. Dr. Bilel Hadri (King Abdullah University of Science and Technology) will share the design and implementation of the regression testing methodology used on Shaheen2 XC40 to detect and track issues, along with the lessons learned.
Part 3 (20 min): The last part is dedicated to questions from the audience and hands-on demos using the tools presented.
Conference Presentation: pdf