Dagstuhl Seminar 20101

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations

(Mar 01 – Mar 06, 2020)


Permalink
Please use the following short URL to reference this page: https://www.dagstuhl.de/20101

Organizers
  • Luc Giraud (INRIA - Bordeaux, FR)
  • Ulrich Rüde (Universität Erlangen-Nürnberg, DE)
  • Linda Stals (Australian National University - Canberra, AU)


Motivation

Advanced supercomputing is characterized by the enormous resources and cost involved. A typical large-scale computation may run for 48 hours on a future exascale system. For such a system, 20 MW is an optimistic estimate of the power consumption, so the computation will consume about a million kWh, i.e., on the order of 100 000 Euro in energy cost alone. It is clearly unacceptable for the whole computation to be lost if any one of the several million parallel processes fails during the execution. Similarly, the system will perform more than 10²³ floating point operations within this computation. If a single operation suffers a bitflip error, must the whole computation then be declared invalid? And what about the notion of reproducibility itself: must this core paradigm of science be revised and refined for results that are obtained by large scale simulation?

Conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, writing checkpoint data to background storage at frequent intervals creates unacceptable overhead in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated.
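
To make this scaling argument concrete, the following back-of-the-envelope sketch applies the first-order Young/Daly model for the optimal checkpoint interval. All numbers in it are our own illustrative assumptions, not measurements of any particular machine; the point is only that useful progress collapses as the mean time between failures approaches the checkpoint and recovery cost.

    # Back-of-the-envelope checkpoint/restart efficiency, following the
    # first-order Young/Daly model. All numbers are illustrative
    # assumptions, not measurements of any particular machine.
    import math

    def daly_efficiency(checkpoint_cost_s, mtbf_s, restart_cost_s):
        """Approximate fraction of wall-clock time spent on useful work.

        Uses the first-order optimal checkpoint interval
        tau = sqrt(2 * C * M) and a simple expected-overhead model.
        """
        tau = math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)  # optimal interval
        # Cost per interval of useful work: one checkpoint write, plus
        # (interval / MTBF) expected failures, each costing a restart
        # and, on average, half an interval of lost (re-done) work.
        overhead = checkpoint_cost_s + (tau / mtbf_s) * (restart_cost_s + tau / 2.0)
        return tau / (tau + overhead)

    # Assumed exascale-like numbers: writing tens of petabytes takes
    # ~30 min, restart is comparable, and the MTBF shrinks as the
    # number of components grows.
    for mtbf_h in (24.0, 4.0, 1.0, 0.5):
        eff = daly_efficiency(checkpoint_cost_s=1800.0,
                              mtbf_s=mtbf_h * 3600.0,
                              restart_cost_s=1800.0)
        print(f"MTBF {mtbf_h:5.1f} h -> useful-work fraction {eff:.2f}")

Under these assumed costs the useful-work fraction drops from roughly 80% at a 24-hour MTBF to well under half at a 1-hour MTBF, which is the sense in which pure checkpoint/restart stops making progress.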

More advanced techniques must be devised, and the key may lie in exploiting both advanced system features and specific application knowledge. Research faces two essential questions: (1) what are the reliability requirements of a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help us understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue is to refine and improve system- or application-level checkpointing and rollback strategies for the case that an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors; one such algorithm-level remedy is sketched below. These ideas will be an essential topic of the seminar.
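
As one concrete instance of such algorithm-level remedies (our illustration, not a technique prescribed by the seminar), the classical Huang–Abraham checksum idea behind algorithm-based fault tolerance (ABFT) detects a corrupted result of a matrix–vector product from an invariant that is cheap to verify. A minimal Python sketch:

    # Minimal sketch of algorithm-based fault tolerance (ABFT) in the
    # style of Huang-Abraham checksums: detect a corrupted entry in a
    # matrix-vector product without any checkpointing. Illustrative
    # only; a real implementation must also locate and correct the
    # error, and tune the floating-point tolerance.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 512
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)

    # Checksum vector: column sums of A, maintained alongside A.
    checksum = A.sum(axis=0)

    y = A @ x

    # Inject a soft error: perturb one result entry, as a bitflip might.
    y_faulty = y.copy()
    y_faulty[137] += 1.0

    def check(y_vec, tol=1e-8):
        # Invariant: sum_i y_i == (column sums of A) . x
        expected = checksum @ x
        return abs(y_vec.sum() - expected) <= tol * max(1.0, abs(expected))

    print("clean result passes check: ", check(y))         # True
    print("faulty result passes check:", check(y_faulty))  # False

The invariant holds because sum_i (A x)_i = sum_j (sum_i A_ij) x_j, so one extra dot product per product suffices to flag a silent corruption.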

The goal of this Dagstuhl Seminar is to bring together a diverse group of scientists who conduct research on making future exascale computing applications resilient against soft and hard faults. In particular, we will explore the role that algorithms and applications play in the holistic approach needed to tackle this challenge.

Copyright Luc Giraud, Ulrich Rüde, and Linda Stals

Summary

On the path to extreme scale computing, the hardware design must meet stringent requirements to keep the energy consumption of parallel computers at acceptable levels. This technological challenge is tackled by shrinking the electronic devices and reducing the voltage while simultaneously increasing the number of components. Recent studies indicate that such computer systems will become less reliable, and some forecasts show that the mean time between failures could be lower than the time to recover from a classical checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated.

The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.

As a major result of the seminar, all of the participants contributed to the following white paper.

Copyright Luc Giraud, Ulrich Rüde, and Linda Stals

Participants
  • Emmanuel Agullo (INRIA - Bordeaux, FR)
  • Mirco Altenbernd (Universität Stuttgart, DE)
  • Hartwig Anzt (KIT - Karlsruher Institut für Technologie, DE)
  • Leonardo Bautista-Gomez (Barcelona Supercomputing Center, ES) [dblp]
  • Tommaso Benacchio (Polytechnic University of Milan, IT)
  • Luca Bonaventura (Polytechnic University of Milan, IT)
  • Hans-Joachim Bungartz (TU München, DE) [dblp]
  • Florina M. Ciorba (Universität Basel, CH) [dblp]
  • Nathan DeBardeleben (Los Alamos National Laboratory, US) [dblp]
  • Daniel Drzisga (TU München, DE)
  • Sebastian Eibl (Universität Erlangen-Nürnberg, DE)
  • Christian Engelmann (Oak Ridge National Laboratory, US) [dblp]
  • Luc Giraud (INRIA - Bordeaux, FR) [dblp]
  • Dominik Göddeke (Universität Stuttgart, DE) [dblp]
  • Marco Heisig (Universität Erlangen-Nürnberg, DE)
  • Fabienne Jézéquel (Sorbonne University - Paris, FR)
  • Nils Kohl (Universität Erlangen-Nürnberg, DE)
  • Xiaoye Sherry Li (Lawrence Berkeley National Laboratory, US) [dblp]
  • Michael Obersteiner (TU München, DE)
  • Enrique S. Quintana-Ortí (Technical University of Valencia, ES)
  • Ulrich Rüde (Universität Erlangen-Nürnberg, DE) [dblp]
  • Miriam Schulte (Universität Stuttgart, DE) [dblp]
  • Martin Schulz (TU München, DE) [dblp]
  • Feng Shilu (Australian National University - Canberra, AU)
  • Robert Speck (Jülich Supercomputing Centre, DE)
  • Linda Stals (Australian National University - Canberra, AU)
  • Keita Teranishi (Sandia National Labs - Livermore, US)
  • Dominik Thönnes (Universität Erlangen-Nürnberg, DE)
  • Andreas Wagner (TU München, DE)
  • Barbara Wohlmuth (TU München, DE) [dblp]

Classification
  • data structures / algorithms / complexity
  • modelling / simulation

Keywords
  • parallel computer architecture
  • fault tolerance
  • checkpointing
  • supercomputer applications