SC17 Denver, CO

Machine Learning for Parallel Performance Analytics

Authors: Mr. Hans-Christian Hoppe (Intel Corporation)

Abstract: Parallel performance analysis is a mature field, with several highly scalable and capable tool families available. The objective of automating the analysis process and creating insights from the sea of raw data has not yet been fully achieved. Modern techniques in machine learning and AI (such as deep learning networks) might provide the means to develop highly automated, easy-to-use performance tools. This BoF assembles experts from the performance tools community and discusses how AI/ML techniques could be taken up to achieve this goal and usher in the era of automatic parallel performance analytics.

Long Description: Tools for thorough analysis and characterisation of HPC and high performance data analytics (HPDA) applications have been a traditional research area for many years, and several families of extremely capable tools have emerged, some of them able to scale to the largest HPC systems and to correlate a wide variety of metrics.

These tools create vast amounts of data for real-world applications and systems. Advances in data visualization notwithstanding, it still requires deep competency and dedication on the side of the tool user, or the assistance of an experienced tools specialist, to sift through this sea of data, drive the analysis process and generate insights that help in application or system optimization.

Different approaches to automating the analysis process have been tried in the past, and statistics, clustering and rule-based techniques are included in several tools. The spectacular advances in machine learning and artificial intelligence, in particular in deep neural networks (DNNs), raise the question of whether their application can bring significant benefits and maybe even make automatic performance analytics a realistic proposition. Examples of using such techniques include the automatic detection of complex patterns across many performance metrics which indicate a specific performance bottleneck, and guiding a tool user through the analysis process based on successful analysis steps from previous users.
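As a purely illustrative sketch of the first example (detecting a pattern across several metrics that points to a bottleneck), the Python snippet below uses a nearest-centroid classifier as a simple stand-in for a learned model; it is not a DNN and not a method taken from any of the tools mentioned here. The metric names, the training values and the bottleneck labels are all hypothetical and hand-made.

```python
import math
from collections import defaultdict

# Hypothetical, hand-made training data (not from any real tool or run):
# each vector is (ipc, cache_miss_rate, mpi_time_fraction) with an
# illustrative bottleneck label.
TRAINING = [
    ((0.4, 0.30, 0.10), "memory_bound"),
    ((0.5, 0.35, 0.15), "memory_bound"),
    ((1.6, 0.05, 0.60), "communication_bound"),
    ((1.4, 0.04, 0.55), "communication_bound"),
    ((1.8, 0.03, 0.08), "compute_bound"),
    ((2.0, 0.02, 0.05), "compute_bound"),
]


def fit_centroids(samples):
    """Average the metric vectors of each label into one centroid."""
    groups = defaultdict(list)
    for vector, label in samples:
        groups[label].append(vector)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in groups.items()
    }


def classify(vector, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


centroids = fit_centroids(TRAINING)
print(classify((0.45, 0.32, 0.12), centroids))
```

The metrics are left unscaled to keep the sketch short; with real measurements they would likely need normalisation first, since otherwise the metric with the largest numeric range dominates the distance.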

This BoF assembles a distinguished panel of speakers from leading performance tools efforts (including Judit Gimenez/BSC, Brian Wylie/JSC, Wolfgang Nagel/TU Dresden, Felix Wolf/TU Darmstadt, Allen Malony/University of Oregon and Martin Schulz/LLNL) and invites experts from the machine learning area.

Short presentations will highlight the ways ML can be taken up in this field, and show success stories of using ML in the adjacent fields of system monitoring and resource management. A panel-style discussion, led and minuted by the BoF organisers, will follow the presentations.

One of the most important goals of the BoF is to get a fruitful and sustainable discussion going between the expert parallel performance analysis teams, AI/ML specialists and end users of parallel performance analytics. This could take the form of a community of practice, or of consortia that draft an R&D agenda with an eye towards starting academic or industrial projects.

Conference Presentation: pdf
