Dagstuhl Seminar 20101: Resiliency in Numerical Algorithm Design for Extreme Scale Simulations

( Mar 01 – Mar 06, 2020 )

Permalink

Please use the following short URL to reference this page: https://www.dagstuhl.de/20101

Organizers

Luc Giraud (INRIA - Bordeaux, FR)
Ulrich Rüde (Universität Erlangen-Nürnberg, DE)
Linda Stals (Australian National University - Canberra, AU)

Contact

Shida Kunz (for scientific matters)
Annette Beyer (for administrative matters)

Publications

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101). Luc Giraud, Ulrich Rüde, and Linda Stals. In Dagstuhl Reports, Volume 10, Issue 3, pp. 1-57, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)

Motivation

Advanced supercomputing is characterized by the enormous resources and cost involved. A typical large-scale computation may run for 48 hours on a future exascale system. For such a system, 20 MW is a quite optimistic estimate of the power consumption. The computation will therefore consume about a million kWh, which amounts to on the order of 100 000 Euro in energy cost alone. It is clearly unacceptable for the whole computation to be lost if any one of the several million parallel processes fails during execution. Similarly, the system will perform more than 10²³ floating point operations within this computation. If a single operation suffers from a bit-flip error, must the whole computation then be declared invalid? And what about the notion of reproducibility itself: must this core paradigm of science be revised and refined for results that are obtained by large-scale simulation?
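The figures quoted above can be checked with a few lines of arithmetic. The following is only a back-of-the-envelope sketch: the runtime and power draw come from the text, while the sustained rate of 10^18 floating point operations per second and the electricity price of 0.10 Euro per kWh are illustrative assumptions that the text does not state.

```python
# Back-of-the-envelope check of the energy and operation counts.
RUNTIME_HOURS = 48.0        # from the text
POWER_MW = 20.0             # from the text
FLOP_PER_SECOND = 1e18      # assumption: "exascale" read as 10^18 flop/s sustained
PRICE_EUR_PER_KWH = 0.10    # assumption: illustrative electricity price


def energy_kwh(hours, power_mw):
    """Energy in kWh for a constant power draw given in MW."""
    return hours * power_mw * 1000.0


def energy_cost_eur(hours, power_mw, price_eur_per_kwh):
    """Energy cost in Euro at a flat price per kWh."""
    return energy_kwh(hours, power_mw) * price_eur_per_kwh


def total_flops(hours, flop_per_second):
    """Total floating point operations at a constant sustained rate."""
    return hours * 3600.0 * flop_per_second


if __name__ == "__main__":
    print("energy [kWh]:", energy_kwh(RUNTIME_HOURS, POWER_MW))
    print("cost [EUR]:", energy_cost_eur(RUNTIME_HOURS, POWER_MW, PRICE_EUR_PER_KWH))
    print("operations:", total_flops(RUNTIME_HOURS, FLOP_PER_SECOND))
```

Under these assumptions the energy comes to just under a million kWh, the cost to roughly 100 000 Euro, and the operation count to somewhat more than 10²³, in line with the orders of magnitude given in the text.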

Conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, writing checkpoint data to background storage at frequent intervals can create unacceptable overhead in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from a checkpoint, so that large calculations at scale might not make any progress unless robust alternatives are investigated.
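Why a short mean time between failures stalls progress can be illustrated with a standard textbook model of periodic checkpointing. This model is not taken from the seminar material. The sketch below assumes exponentially distributed failures, no extra downtime, and failures that may also strike during checkpointing and recovery; the function names and all numerical values are hypothetical.

```python
import math


def expected_segment_time(work, checkpoint, restart, mtbf):
    """Expected wall-clock seconds to complete one segment of `work` seconds
    followed by a checkpoint of `checkpoint` seconds, when failures arrive
    with exponentially distributed gaps (mean `mtbf`) and every failure
    forces a recovery of `restart` seconds before the segment is retried."""
    return mtbf * math.exp(restart / mtbf) * math.expm1((work + checkpoint) / mtbf)


def efficiency(work, checkpoint, restart, mtbf):
    """Fraction of wall-clock time spent on useful work."""
    return work / expected_segment_time(work, checkpoint, restart, mtbf)


if __name__ == "__main__":
    # Hypothetical values: one hour of work per segment,
    # 30 minutes to write a checkpoint, 30 minutes to recover from one.
    work, checkpoint, restart = 3600.0, 1800.0, 1800.0
    for mtbf in (86400.0, 900.0):  # one day vs. 15 minutes (shorter than recovery)
        print("MTBF", mtbf, "s -> efficiency", efficiency(work, checkpoint, restart, mtbf))
```

With a mean time between failures of one day, this model still spends most of the wall-clock time on useful work. Once the mean time between failures drops below the recovery time, the efficiency falls below one percent, which matches the qualitative concern raised in the text.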

More advanced techniques must be devised, and the key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) What are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve system- or application-level checkpointing and rollback strategies for the case that an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas will be an essential topic of the seminar.

The goal of this Dagstuhl Seminar is to bring a diverse group of scientists together who conduct research to make future exascale computing applications resilient against soft and hard faults. In particular, we will explore the role that algorithms and applications play in the holistic approach needed to tackle this challenge.

Creative Commons BY 3.0 DE

Luc Giraud, Ulrich Rüde, and Linda Stals

Summary

On the path to extreme scale computing, the hardware design must meet stringent requirements to keep the energy consumption of parallel computers at acceptable levels. This technological challenge is tackled by shrinking the electronic devices and reducing the voltage while simultaneously increasing the number of components. Recent studies indicate that such computer systems will become less reliable, and some forecasts show that the mean time between failures could be lower than the time to recover from a classical checkpoint, so that large calculations at scale might not make any progress unless robust alternatives are investigated.

The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.

As a major result from the seminar, all of the participants contributed to the following white paper.

Creative Commons BY 3.0 Unported license

Luc Giraud, Ulrich Rüde, and Linda Stals

Participants

Emmanuel Agullo (INRIA - Bordeaux, FR)
Mirco Altenbernd (Universität Stuttgart, DE)
Hartwig Anzt (KIT - Karlsruher Institut für Technologie, DE)
Leonardo Bautista-Gomez (Barcelona Supercomputing Center, ES)
Tommaso Benacchio (Polytechnic University of Milan, IT)
Luca Bonaventura (Polytechnic University of Milan, IT)
Hans-Joachim Bungartz (TU München, DE)
Florina M. Ciorba (Universität Basel, CH)
Nathan DeBardeleben (Los Alamos National Laboratory, US)
Daniel Drzisga (TU München, DE)
Sebastian Eibl (Universität Erlangen-Nürnberg, DE)
Christian Engelmann (Oak Ridge National Laboratory, US)
Luc Giraud (INRIA - Bordeaux, FR)
Dominik Göddeke (Universität Stuttgart, DE)
Marco Heisig (Universität Erlangen-Nürnberg, DE)
Fabienne Jézéquel (Sorbonne University - Paris, FR)
Nils Kohl (Universität Erlangen-Nürnberg, DE)
Xiaoye Sherry Li (Lawrence Berkeley National Laboratory, US)
Michael Obersteiner (TU München, DE)
Enrique S. Quintana-Ortí (Technical University of Valencia, ES)
Ulrich Rüde (Universität Erlangen-Nürnberg, DE)
Miriam Schulte (Universität Stuttgart, DE)
Martin Schulz (TU München, DE)
Feng Shilu (Australian National University - Canberra, AU)
Robert Speck (Jülich Supercomputing Centre, DE)
Linda Stals (Australian National University - Canberra, AU)
Keita Teranishi (Sandia National Labs - Livermore, US)
Dominik Thönnes (Universität Erlangen-Nürnberg, DE)
Andreas Wagner (TU München, DE)
Barbara Wohlmuth (TU München, DE)

Classification

data structures / algorithms / complexity
modelling / simulation

Keywords

parallel computer architecture
fault tolerance
checkpointing
supercomputer applications
