

Dagstuhl Seminar 21441

Adaptive Resource Management for HPC Systems

( Nov 01 – Nov 05, 2021 )


Permalink
Please use the following short url to reference this page: https://www.dagstuhl.de/21441

Motivation

Today’s supercomputers still use very static resource management: jobs are submitted via batch scripts to the resource manager and then scheduled on the machine with a fixed set of nodes. Other resources, such as power, network bandwidth, and storage, are not actively managed and are provided only on a best-effort basis. This inflexible, node-centric, static resource management will have to change for several reasons, outlined below.
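The static model is visible in a typical Slurm batch script: the node count and wall time are fixed at submission and cannot change while the job runs. The solver name and input file below are placeholders for illustration only.

```shell
#!/bin/bash
#SBATCH --job-name=amr_run       # job name (illustrative)
#SBATCH --nodes=4                # node count is fixed for the job's lifetime
#SBATCH --time=00:30:00          # wall-time limit, also fixed at submission

# Launch the application on the statically allocated nodes.
srun ./solver input.cfg
```

Once such a job starts, the allocation cannot shrink or grow in response to the application’s changing needs, which is precisely the limitation the seminar addresses.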

First, applications are becoming increasingly dynamic. Techniques such as adaptive mesh refinement, as used, for example, in tsunami simulations, change an application’s scalability over the course of its execution. Furthermore, only some application phases may profit from specialized accelerators, and I/O phases may even run best with a limited number of compute resources.

Additionally, the execution environment of applications is also becoming dynamic. Modern processors change their clock frequency according to the instruction mix as well as power and thermal envelopes. Heavy use of the vector units, for example, can force a lower clock frequency to stay within the thermal power budget.

As an independent concern, the sheer number of components means that failure rates are expected to increase, slowing down computation or even leading to more frequent node failures.

Finally, upcoming machines will be power-constrained, which means that power will have to be carefully distributed among all running applications. The resulting power capping will impact application performance through clock-frequency adaptation and manufacturing variability. These challenges will only be solvable with a more adaptive approach to resource management. For example, compute nodes need to be redistributed among running applications to adapt to changes in their resource requirements, whether due to a varying number of grid points or to interspersed algorithmic phases that profit from certain accelerators. Network and I/O bandwidth will have to be assigned to applications to avoid interference caused by contention between concurrent communication and I/O phases. Power needs to be dynamically redistributed both within an application and across applications to increase efficiency. Dynamic redistribution of resources will also give the resource manager more flexibility to schedule jobs on the available resources and thus reduce idle times and efficiency-lowering contention, e.g., when large jobs are waiting for execution.
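To make the idea of dynamic power redistribution concrete, the following minimal sketch divides a cluster-wide power budget among running jobs: each job receives a guaranteed floor, and the remainder is split in proportion to demand above that floor. All names here (`Job`, `redistribute_power`, the floor value) are illustrative assumptions, not any real resource manager’s API.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    demand_w: float     # measured or estimated power demand in watts
    cap_w: float = 0.0  # power cap assigned by the allocator

def redistribute_power(jobs, budget_w, floor_w=50.0):
    """Give every job a minimum floor, then split the remaining budget
    in proportion to each job's demand above that floor. A job never
    receives more than it demands."""
    remaining = budget_w - floor_w * len(jobs)
    total_excess = sum(max(j.demand_w - floor_w, 0.0) for j in jobs)
    for j in jobs:
        share = 0.0
        if total_excess > 0:
            share = remaining * max(j.demand_w - floor_w, 0.0) / total_excess
        j.cap_w = min(j.demand_w, floor_w + share)
    return jobs
```

A real allocator would run this in a loop, reacting to telemetry as application phases change; the proportional split is only one of many possible policies.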

This Dagstuhl Seminar will investigate a holistic, layered approach to adaptive resource management. It starts with the resource management layer, which is responsible for scheduling applications on the machine and dynamically allocating resources to the running applications. At the programming level, applications need to be written in a resource-aware style so that they can adapt to resource changes and make the most efficient use of the resources. On top of the programming interfaces, programming tools have to be available that allow application developers to analyze and tune their applications for a varying amount of available resources. At the application level, applications have to be redesigned to enable significant gains in efficiency and throughput; adaptive mesh refinement, approximate computing, and power-aware algorithms are a few examples.
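A resource-aware ("malleable") application in the sense described above can be sketched as a job that repartitions its work whenever the resource manager grows or shrinks its allocation. The interface below (`MalleableJob`, `on_resource_change`) is invented for illustration; real systems would use, e.g., an extended MPI or a scheduler-specific protocol.

```python
def partition(n_cells, n_nodes):
    """Split n_cells as evenly as possible across n_nodes."""
    base, extra = divmod(n_cells, n_nodes)
    return [base + (1 if i < extra else 0) for i in range(n_nodes)]

class MalleableJob:
    """A toy grid-based job that redistributes its cells when the
    resource manager changes its node allocation."""

    def __init__(self, n_cells, n_nodes):
        self.n_cells = n_cells
        self.blocks = partition(n_cells, n_nodes)

    def on_resource_change(self, new_n_nodes):
        # Called (hypothetically) by the resource manager between
        # iterations; the job repartitions its grid over the new nodes.
        self.blocks = partition(self.n_cells, new_n_nodes)
```

The essential point is that the repartitioning hook runs at well-defined points between iterations, so the application stays correct while its node count varies.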

The outcomes of this seminar will be a list of challenges and a roadmap identifying the next steps toward adaptive resource management for HPC systems, including languages, message-passing libraries, resource managers, tools, and runtimes. A report will be published after the seminar.

Copyright Hans Michael Gerndt, Masaaki Kondo, Barton P. Miller, and Tapasya Patki

Summary


This Dagstuhl Seminar investigated a holistic, layered approach to adaptive resource management. It started with the resource management layer, which is responsible for scheduling applications on the machine and dynamically allocating resources to the running applications. At the programming level, applications need to be written in a resource-aware style so that they can adapt to resource changes and make the most efficient use of the resources. On top of the programming interfaces, programming tools have to be available that allow application developers to analyze and tune their applications for a varying amount of available resources. At the application level, applications have to be redesigned to enable significant gains in efficiency and throughput; adaptive mesh refinement, approximate computing, and power-aware algorithms are a few examples.

The discussions led to a joint summary presenting the state of the art, the techniques required at these layers of HPC systems, and the foreseen advantages of adaptive resource management.

Copyright Michael Gerndt, Masaaki Kondo, Barton P. Miller, and Tapasya Patki

Participants
On-site
  • Eishi Arima (TU München, DE) [dblp]
  • Eduardo César (Autonomous University of Barcelona, ES) [dblp]
  • Isaías Alberto Comprés Ureña (TU München, DE) [dblp]
  • Michael Gerndt (TU München, DE) [dblp]
  • Jophin John (TU München, DE) [dblp]
  • Matthias Maiterth (TU München, DE) [dblp]
  • Barton P. Miller (University of Wisconsin-Madison, US) [dblp]
  • Bernd Mohr (Jülich Supercomputing Centre, DE) [dblp]
  • Frank Mueller (North Carolina State University - Raleigh, US) [dblp]
  • Santiago Narvaez Rivas (TU München, DE)
  • Mirko Rahn (Fraunhofer ITWM - Kaiserslautern, DE) [dblp]
  • Lubomir Riha (VSB-Technical University of Ostrava, CZ) [dblp]
  • Martin Schulz (TU München, DE) [dblp]
  • Anna Sikora (Autonomous University of Barcelona, ES) [dblp]
  • Ondrej Vysocky (VSB-Technical University of Ostrava, CZ)
  • Felix Wolf (TU Darmstadt, DE) [dblp]
Remote
  • Dong Ahn (LLNL - Livermore, US)
  • Andrea Bartolini (University of Bologna, IT) [dblp]
  • Pete Beckman (Argonne National Laboratory - Lemont, US) [dblp]
  • Mohak Chadha (TU München, DE)
  • Julita Corbalan (Barcelona Supercomputing Center, ES)
  • Balazs Gerofi (RIKEN - Kobe, JP) [dblp]
  • Toshihiro Hanawa (University of Tokyo, JP) [dblp]
  • Shantenu Jha (Rutgers University - Piscataway, US) [dblp]
  • Rashawn Knapp (Intel - Hillsboro, US) [dblp]
  • Masaaki Kondo (Keio University - Yokohama, JP) [dblp]
  • Daniel John Milroy (LLNL - Livermore, US)
  • Tapasya Patki (LLNL - Livermore, US) [dblp]
  • Barry L. Rountree (LLNL - Livermore, US) [dblp]
  • Roxana Rusitoru (Arm - Cambridge, GB) [dblp]
  • Ryuichi Sakamoto (Tokyo Institute of Technology, JP) [dblp]
  • Wolfgang Schröder-Preikschat (Universität Erlangen-Nürnberg, DE) [dblp]

Classification
  • modelling / simulation
  • operating systems
  • optimization / scheduling

Keywords
  • High Performance Computing
  • Programming Tools
  • Power Management
  • Resource Management