March 1 – 6, 2020, Dagstuhl Seminar 20101

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations


Luc Giraud (INRIA – Bordeaux, FR)
Ulrich Rüde (Universität Erlangen-Nürnberg, DE)
Linda Stals (Australian National University – Canberra, AU)

For support, please contact

Annette Beyer for administrative matters

Shida Kunz for scientific matters


Advanced supercomputing is characterized by the enormous resources and cost involved. A typical large-scale computation may run for 48 hours on a future exascale system. For such a system, 20 MW is a rather optimistic estimate of the power consumption. The computation will therefore consume about a million kWh, which amounts to on the order of 100 000 Euro in energy cost alone. It is clearly unacceptable that the whole computation would be lost if any one of the several million parallel processes fails during execution. Similarly, the system will perform more than 10^23 floating point operations within this computation. If a single operation suffers from a bitflip error, must the whole computation then be declared invalid? How about the notion of reproducibility itself: must this core paradigm of science be revised and refined for results that are obtained by large-scale simulation?
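The figures quoted above can be checked with a short back-of-envelope calculation. The sketch below is only illustrative: the 20 MW and 48 hours come from the text, while the electricity price of 0.10 Euro per kWh and the sustained rate of 10^18 floating point operations per second are assumptions chosen to match the orders of magnitude given.

```python
# Back-of-envelope check of the energy and operation counts quoted in the text.
POWER_MW = 20.0            # from the text: optimistic power estimate
RUNTIME_H = 48.0           # from the text: duration of one large run
PRICE_EUR_PER_KWH = 0.10   # assumption: illustrative electricity price
FLOP_PER_S = 1e18          # assumption: sustained exascale rate


def energy_kwh(power_mw, hours):
    """Energy in kWh for a constant power draw given in MW."""
    return power_mw * 1000.0 * hours


def energy_cost_eur(power_mw, hours, price_eur_per_kwh):
    """Energy cost in Euro for a constant power draw."""
    return energy_kwh(power_mw, hours) * price_eur_per_kwh


def total_operations(flop_per_s, hours):
    """Total floating point operations at a constant sustained rate."""
    return flop_per_s * hours * 3600.0


energy = energy_kwh(POWER_MW, RUNTIME_H)
cost = energy_cost_eur(POWER_MW, RUNTIME_H, PRICE_EUR_PER_KWH)
operations = total_operations(FLOP_PER_S, RUNTIME_H)

print(f"energy:     {energy:.3g} kWh")
print(f"cost:       {cost:.3g} Euro")
print(f"operations: {operations:.3g}")
```

Under these assumptions the run consumes just under a million kWh, costs just under 100 000 Euro, and executes somewhat more than 10^23 operations, consistent with the estimates in the text.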

Conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, writing checkpoint data to background storage at frequent intervals can create unacceptable overhead in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated.
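Why a short mean time between failures can stall a computation is easiest to see in a crude model. The sketch below is not taken from the seminar description: it assumes periodic checkpointing with Young's classical first-order approximation for the checkpoint interval, and a first-order estimate of the time lost to checkpoints, recovery, and recomputation. The function names and the sample values (600 s per checkpoint and per recovery) are hypothetical.

```python
import math


def young_interval(checkpoint_s, mtbf_s):
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)


def useful_fraction(checkpoint_s, recovery_s, mtbf_s):
    """Crude first-order estimate of the fraction of time spent on useful work.

    Waste per unit time = checkpoint overhead (C / tau)
                        + expected loss per failure ((R + tau / 2) / MTBF).
    The model is only meaningful while the waste stays well below 1;
    the result is clamped to 0 to signal that no progress is made.
    """
    tau = young_interval(checkpoint_s, mtbf_s)
    waste = checkpoint_s / tau + (recovery_s + tau / 2.0) / mtbf_s
    return max(0.0, 1.0 - waste)


# Hypothetical values: 10 minutes to write or restore a checkpoint.
CHECKPOINT_S = 600.0
RECOVERY_S = 600.0

for mtbf_s in (86400.0, 3600.0, 300.0):  # one day, one hour, five minutes
    fraction = useful_fraction(CHECKPOINT_S, RECOVERY_S, mtbf_s)
    print(f"MTBF {mtbf_s:8.0f} s -> useful fraction {fraction:.2f}")
```

With a mean time between failures of one day, most of the machine time is still useful in this model. Once the mean time between failures drops below the recovery time, the estimated useful fraction reaches zero, which is the situation the forecasts above warn about.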

More advanced techniques must be devised, and the key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) What are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help to understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve system- or application-level checkpointing and rollback strategies for the case that an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas will be an essential topic of the seminar.

The goal of this Dagstuhl Seminar is to bring a diverse group of scientists together who conduct research to make future exascale computing applications resilient against soft and hard faults. In particular, we will explore the role that algorithms and applications play in the holistic approach needed to tackle this challenge.

  Creative Commons BY 3.0 DE
  Luc Giraud, Ulrich Rüde, and Linda Stals


  • Data Structures / Algorithms / Complexity
  • Modelling / Simulation


  • Parallel computer architecture
  • Fault tolerance
  • Checkpointing
  • Supercomputer applications


In the series Dagstuhl Reports each Dagstuhl Seminar and Dagstuhl Perspectives Workshop is documented. The seminar organizers, in cooperation with the collector, prepare a report that includes contributions from the participants' talks together with a summary of the seminar.




Furthermore, a comprehensive peer-reviewed collection of research papers can be published in the series Dagstuhl Follow-Ups.

Dagstuhl's Impact

Please inform us when a publication results from your seminar. These publications are listed in the category Dagstuhl's Impact and are presented on a special shelf on the ground floor of the library.
