Dagstuhl Seminar 20291

Adaptive Resource Management for HPC Systems (Postponed)

(Jul 12 – Jul 17, 2020)

Permalink
Please use the following short URL to link to this page: https://www.dagstuhl.de/20291

Replaced by
Dagstuhl Seminar 21441: Adaptive Resource Management for HPC Systems (2021-11-01 - 2021-11-05)

Motivation

Today’s supercomputers have very static resource management. Jobs are submitted via batch scripts to the resource manager and then scheduled on the machine with a fixed set of nodes. Other resources, such as power, network bandwidth, and storage, are not actively managed and are provided only on a best-effort basis. This inflexible, node-focused, and static resource management will have to change for several reasons, some of which are listed below.

First, applications are becoming increasingly dynamic. Techniques such as adaptive mesh refinement, used for example in tsunami simulations, cause an application’s scalability to change during its execution. Furthermore, only some application phases might profit from specialized accelerators, and I/O phases might even run best with a limited number of compute resources.

Additionally, the execution environment of applications is also becoming dynamic. Modern processors change the clock frequency according to the instruction mix as well as power and thermal envelopes. For example, heavy use of the vector units can lead to a lower clock frequency in order to stay within the thermal power budget.

As an independent concern, the sheer number of components is expected to increase failure rates, slowing down computation or even leading to more node failures.

Finally, upcoming machines will be power constrained, which means that power will have to be carefully distributed among all running applications. The resulting power capping will affect application performance, both through adaptation of the clock frequency and through manufacturing variability.

These challenges in HPC will only be solvable with a more adaptive resource management approach. For example, compute nodes need to be redistributed among running applications to adapt to changes in an application’s resource requirements, caused either by a varying number of grid points or by interspersed algorithmic phases that profit from certain accelerators; network and I/O bandwidth will have to be assigned to applications to avoid interference caused by contention between concurrent communication and I/O phases; and power needs to be dynamically redistributed both within an application and across applications to increase efficiency. Dynamic redistribution of resources will also give the resource manager more flexibility to schedule jobs on the available resources, and thus reduce idle times and efficiency-lowering contention scenarios, e.g., when big jobs are waiting for execution.

This Dagstuhl Seminar will investigate a holistic, layered approach to adaptive resource management. It starts with the resource management layer, which is responsible for scheduling applications on the machine and dynamically allocating resources to the running applications. At the programming level, applications need to be written in a resource-aware style so that they can adapt to resource changes and make the most efficient use of the resources. On top of the programming interfaces, programming tools have to be available that allow application developers to analyze and tune their applications for the varying amount of available resources. At the application level, applications have to be redesigned to enable significant gains in efficiency and throughput; adaptive mesh refinement, approximate computing, and power-aware algorithms are a few of the relevant aspects.

The outcomes of this seminar will be a list of challenges and a roadmap that identifies the next steps for implementing adaptive resource management of HPC systems, covering languages, message passing libraries, resource managers, tools, and runtimes. A report will be published after the seminar.

Copyright Hans Michael Gerndt, Masaaki Kondo, Barton P. Miller, and Tapasya Patki

Participants
  • Michael Gerndt (TU München, DE)
  • Masaaki Kondo (University of Tokyo, JP)
  • Barton P. Miller (University of Wisconsin - Madison, US)
  • Tapasya Patki (LLNL - Livermore, US)

Classification
  • modelling / simulation
  • operating systems
  • optimization / scheduling

Keywords
  • High Performance Computing
  • Programming Tools
  • Power Management
  • Resource Management