HPC Systems Monitoring Data in Action
Authors: Jim Brandt (Sandia National Laboratories)
Abstract: New opportunities in the use of monitoring data include experimental measurement of science throughput, data-driven approaches to systems architectural design decisions, real-time analysis of system/application state and function, and long-term trend analysis. We invite application users and developers, system architects, and system administrators to provide new perspectives. Panelists using extensive monitoring data will present new insights from their systems. They will interact with the audience to share experiences and foster collaborations on tools and designs. We will identify areas of overlap or potential joint activity to steer future analysis development and use of data.
Long Description: Currently HPC monitoring is largely administrator focused, seeking to provide notification of failed components, down services, system load, and environmental issues. However, the abundance of system information, increased density of components, and diversity of application and middleware design provide new opportunities for the use of monitoring data.
This BoF will explore new opportunities in the use of HPC monitoring data. These include experimental measurement of science throughput, data-driven approaches to systems architectural design decisions, real-time analysis of system/application state and function, long-term trend analysis, and resource-aware adaptivity of application and system operations.
We encourage attendance of a cross-section of disciplines including a) users and developers for potential insights on application performance and resource mapping from monitoring data, b) system architects for insight on the use of detailed resource utilization and performance data to drive architectural decisions, c) middleware and system software developers for insight on how system information could enable more intelligent decisions and adaptive software, and d) system administrators for information on the needs and gaps in current large-scale system information.
The BoF will begin with short presentations from projects that are using detailed monitoring data to provide insights into application performance, systems operation, and future planning. Panelists have been chosen because of their progress in specific projects addressing new areas in the use of HPC monitoring data. They represent diverse aspects of the problem space and provide a variety of perspectives:
1) Greg Bauer (NCSA) -- Application performance analysis with systems’ understanding.
2) Martins Innus (University at Buffalo) -- Integrating monitoring data into SUPReMM and workload analysis of Blue Waters
3) Ayse Coskun (Boston University) -- Early identification of anomalous application behavior
4) Joe Greenseid (MPO) -- Runtime system assessments
5) Mike Showerman (NCSA) -- Troubleshooting applications at a system level
We will then transition to an interactive segment that engages the larger group in sharing experiences, with the aim of fostering new collaborations on tools and approaches. We will use input from the audience to help identify areas of interest, overlap, or potential joint activity, as well as mechanisms to steer future software designs. We will also discuss barriers to entry into the collection, storage, and use of large-scale monitoring data.
An artifact of this BoF will be a written report detailing outcomes and opinions from discussions among the panelists and audience in the following areas:
1) Novel uses of HPC monitoring data currently in production or development
2) Potential paths for impacting system software, architecture design, monitoring data sources, and collection mechanisms
3) Collaboration on tool and analysis development and testing
4) Barriers to collection, storage, and use of large-scale monitoring data
The organizers actively promote community building around HPC monitoring. Their roles include:
1) Administrators of the “Monitoring Large Scale HPC Systems” community web site and mailing list: https://sites.google.com/site/monitoringlargescalehpcsystems/home
2) Organizing Committee of the “Monitoring and Analysis of HPC Systems Plus Operations (HPCMASPA)” Workshop Series at IEEE Cluster: https://sites.google.com/site/hpcmaspa/
3) Lead of the Cray System Monitoring Working Group
4) Organizers of SIAM Minisymposia on HPC Monitoring and Analysis
5) Organizers of multiple successful BoFs at SC, CUG, etc.
Conference Presentation: pdf