Dagstuhl Seminar 23171: Driving HPC Operations With Holistic Monitoring and Operational Data Analytics

( Apr 23 – Apr 28, 2023 )

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/23171

Organizers

Florina M. Ciorba (Universität Basel, CH)
Ann Gentile (Sandia National Labs - Albuquerque, US)
Michael Ott (LRZ - München, DE)
Torsten Wilde (HPE - Böblingen, DE)

Contact

Andreas Dolzmann (for scientific matters)
Jutka Gasiorowski (for administrative matters)

Summary

The Dagstuhl Seminar 23171 (April 23-28, 2023) brought together 35 practitioners and researchers in the areas of HPC system management and monitoring, data analytics, and computer science to collaboratively work on developing community solutions for revolutionizing HPC system operations. Autonomous operations have long been the vision for efficient HPC system operations due to the size and complexity of current and evolving HPC systems and the need for pervasive, low-latency response. Autonomous operations are a complex topic encompassing monitoring, analysis, feedback, and response. The seminar goals were to make substantial progress on the technical, community, and funding challenges necessary for the community to move forward and reach this vision.

The seminar schedule comprised a mix of keynote presentations, seed and position talks, and enlisted and ad-hoc lightning talks, interleaved with plenary and working group discussions held both in the seminar rooms and outdoors.

These program elements and the active participation of the attendees led to many fruitful discussions on the following topics:

Center-specific urgent use cases that drive data collection, analysis, and response requirements across the variety of institutions represented.
Types of available data, including sources, semantics, and fidelity, to support continuous analyses.
Requirements for actionable analytics. What is needed to convert raw data into information upon which action can be taken (e.g., confidence measures, explainability requirements not inherent in AI approaches, representation of results, latency, etc.)?
Applicability of existing analytics and informatics approaches to the domain-specifics of HPC operations. While there are many promising ML/AI approaches in other domains (e.g., image/speech processing, autonomous vehicles), it is not yet clear how many and which of those apply to the HPC operations and research domains (e.g., the occurrence of rare fault events, discontinuity of inertia-less measurements).
Opportunities for response involving infrastructure, hardware, system software, and applications. Identification of feedback hooks that would need to be added to existing and evolving system components (e.g., hardware, firmware, system software, application software) to support automated response.
Exploration of formalism and architectural design patterns from the field of Self-Adaptive Systems to facilitate common, interoperable, and interchangeable design and development paths forward.

The technical presentations and engaging discussions reinforced the urgent need and desire for a community approach to advance the state and practice of HPC Monitoring and Operational Data Analytics, with the goal of revolutionizing HPC operations and research, in order to deliver efficient and sustainable HPC systems and applications.

The fundamental results of this community discussion are given in Section 10 of the full report. These include assessments of the state of autonomous loops in HPC operations and assessments of challenges and opportunities. The community agreed to continue meeting monthly and at upcoming community-relevant events. They further agreed to develop proofs of concept for concrete use cases that will showcase both the need for holistic monitoring and analysis and their benefits for more efficient HPC operations. These proofs of concept will also serve as a basis for technical design decisions and prototype solutions to be deployed in various HPC systems. The final goals of our effort are to continue building a community collaborative technical path forward, a community interaction path forward, and a community collaborative funding path forward, as described in Section 11 of the full report, to fulfill the vision of autonomous and efficient HPC system operations.

Creative Commons BY 4.0

Jim Brandt, Florina M. Ciorba, Ann Gentile, Michael Ott, and Torsten Wilde

Motivation

Advances in analytic approaches have brought the vision of efficient HPC operations enabled by dynamic analysis and automated feedback/adaptation within reach. Many HPC centers have started the development and deployment of frameworks to enable continuous and holistic monitoring, archiving, and analysis of performance data from their production machines and related infrastructures. The impact of such frameworks rests upon the ability to effectively analyze such data. Analytic techniques have been successfully developed and applied in other domains but their features may not apply directly to HPC Operations data and situations. Leveraging, adapting, and extending such techniques would open up new avenues for research and development of actionable analytics that can drive more intelligent operations through both manual and automated response to conditions of interest.

This Dagstuhl Seminar will bring together HPC practitioners in the areas of system management and monitoring and computer science research to collaboratively work on developing community solutions for efficient HPC system operations. The topics to be discussed in this seminar range from use cases to data and analytic approaches required to address them, to ultimately using the results of analyses to improve performance, operations, and research, both with and without human-in-the-loop. Specifically, the topics to be discussed will include:

Center-specific urgent use cases that drive data collection, analysis, and response requirements across the variety of institutions represented.
Types of available data, including sources, meanings, and fidelities, to support continuous analyses.
Requirements for actionable analytics: What do we need to convert raw data into information upon which we can act (e.g., confidence measures, explainability requirements not inherent in AI approaches, representation of results, latency, etc.)?
Applicability of existing analytics and informatics approaches to the domain-specifics of HPC operations. While there are many promising ML/AI approaches in other domains (e.g., image/speech processing, autonomous vehicles), it is not yet clear how many of those apply to the HPC operations and research domains (e.g., the occurrence of rare fault events, discontinuity of inertia-less measurements).
Opportunities for response involving infrastructure, hardware, system software, and applications. Discussion will include identifying which hooks would need to be added to existing system components (e.g., hardware, firmware, system software, application software) to support automated response.

Creative Commons BY 4.0

Jim M. Brandt, Florina M. Ciorba, Ann Gentile, Michael Ott, Martin Schulz and Torsten Wilde

Participants

On-site

Francieli Boito (INRIA - Bordeaux, FR) [dblp]
Jim Brandt (Sandia National Labs - Albuquerque, US) [dblp]
Valeria Cardellini (University of Rome "Tor Vergata", IT) [dblp]
Philip Carns (Argonne National Laboratory, US) [dblp]
Florina M. Ciorba (Universität Basel, CH) [dblp]
Isaías Alberto Comprés Ureña (TU München - Garching, DE) [dblp]
Thaleia Dimitra Doudali (IMDEA Software Institute - Madrid, ES) [dblp]
Hilary Egan (NREL - Golden, US)
Ahmed Eleliemy (Universität Basel, CH)
Ann Gentile (Sandia National Labs - Albuquerque, US) [dblp]
Taylor Groves (Lawrence Berkeley National Laboratory, US) [dblp]
Thomas Gruber (Universität Erlangen-Nürnberg, DE)
Jeff Hanson (HPE - Lakewood, US)
Utz-Uwe Haus (HPE HPC/AI EMEA Research Lab - Wallisellen, CH) [dblp]
Esa Heiskanen (CSC Ltd. - Kajaani, FI)
Kevin A Huck (University of Oregon - Eugene, US) [dblp]
Thomas Ilsche (TU Dresden, DE) [dblp]
Thomas Jakobsche (Universität Basel, CH) [dblp]
Terry Jones (Oak Ridge National Laboratory, US) [dblp]
Sven Karlsson (Technical University of Denmark - Lyngby, DK) [dblp]
Allen D. Malony (University of Oregon - Eugene, US) [dblp]
Henrique Mendonça (CSCS - Lugano, CH)
Abdullah Mueen (University of New Mexico, US) [dblp]
Michael Ott (LRZ - München, DE) [dblp]
Tapasya Patki (LLNL - Livermore, US) [dblp]
Ivy Bo Peng (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
Krishnan Raghavan (Argonne National Laboratory, US) [dblp]
David Schibeci (Pawsey Supercomputing Centre - Kensington, AU)
Kathleen Shoga (LLNL - Livermore, US) [dblp]
Michael Showerman (University of Illinois at Urbana-Champaign, US) [dblp]
Frédéric Suter (Oak Ridge National Laboratory, US) [dblp]
Oriol Vidal (Barcelona Supercomputing Center, ES) [dblp]
Torsten Wilde (HPE - Böblingen, DE) [dblp]
Keiji Yamamoto (RIKEN - Hyogo, JP) [dblp]

Remote

Devesh Tiwari (Northeastern University - Boston, US) [dblp]

Classification

Distributed, Parallel, and Cluster Computing
Machine Learning
Performance

Keywords

Exascale system monitoring
Data center monitoring
Operational Data Analytics
HPC operations and research
