https://www.dagstuhl.de/23171

April 23 – 28, 2023, Dagstuhl Seminar 23171

Driving HPC Operations With Holistic Monitoring and Operational Data Analytics

Organizers

Florina M. Ciorba (Universität Basel, CH)
Ann Gentile (Sandia National Labs – Albuquerque, US)
Michael Ott (LRZ – München, DE)
Torsten Wilde (HPE – Böblingen, DE)

For information about this Dagstuhl Seminar, please contact

Jutka Gasiorowski for administrative matters

Andreas Dolzmann for scientific matters

Motivation

Advances in analytic approaches have brought the vision of efficient HPC operations enabled by dynamic analysis and automated feedback/adaptation within reach. Many HPC centers have started the development and deployment of frameworks to enable continuous and holistic monitoring, archiving, and analysis of performance data from their production machines and related infrastructures. The impact of such frameworks rests upon the ability to effectively analyze such data. Analytic techniques have been successfully developed and applied in other domains, but their features may not apply directly to HPC operations data and situations. Leveraging, adapting, and extending such techniques would open up new avenues for research and development of actionable analytics that can drive more intelligent operations through both manual and automated responses to conditions of interest.

This Dagstuhl Seminar will bring together HPC practitioners in the areas of system management and monitoring and computer science research to collaboratively work on developing community solutions for efficient HPC system operations. The topics to be discussed in this seminar range from use cases to data and analytic approaches required to address them, to ultimately using the results of analyses to improve performance, operations, and research, both with and without human-in-the-loop. Specifically, the topics to be discussed will include:

  • Center-specific urgent use cases that drive data collection, analysis, and response requirements across the variety of institutions represented.
  • Types of available data, including sources, meanings, and fidelities, to support continuous analyses.
  • Requirements for actionable analytics: What do we need to convert raw data into information upon which we can act (e.g., confidence measures, explainability requirements not inherent in AI approaches, representation of results, latency, etc.)?
  • Applicability of existing analytics and informatics approaches to the domain-specifics of HPC operations. While there are many promising ML/AI approaches in other domains (e.g., image/speech processing, autonomous vehicles), it is not yet clear how many of those apply to the HPC operations and research domains (e.g., the occurrence of rare fault events, discontinuity of inertia-less measurements).
  • Opportunities for response involving infrastructure, hardware, system software, and applications. Discussion will include identifying what hooks would need to be added to existing system components (e.g., hardware, firmware, system software, application software) to support automated response.

Motivation text license
  Creative Commons BY 4.0
  Jim M. Brandt, Florina M. Ciorba, Ann Gentile, Michael Ott, Martin Schulz and Torsten Wilde

Classification

  • Distributed / Parallel / and Cluster Computing
  • Machine Learning
  • Performance

Keywords

  • Exascale system monitoring
  • Data center monitoring
  • Operational Data Analytics
  • HPC operations and research

Documentation

All Dagstuhl Seminars and Dagstuhl Perspectives Workshops are documented in the Dagstuhl Reports series. Together with the seminar's collector, the organizers compile a report that summarizes the authors' contributions and adds an overall summary.

Download overview flyer (PDF).

Dagstuhl's Impact

Please inform us when a publication results from your seminar. Such publications are listed separately in the Dagstuhl's Impact category and are presented on the ground floor of the library.

Publications

It also remains possible to publish a comprehensive collection of peer-reviewed papers in the Dagstuhl Follow-Ups series.