https://www.dagstuhl.de/20101

March 1 – 6, 2020, Dagstuhl Seminar 20101

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations

Organizers

Luc Giraud (INRIA – Bordeaux, FR)
Ulrich Rüde (Universität Erlangen-Nürnberg, DE)
Linda Stals (Australian National University – Canberra, AU)

For information about this Dagstuhl Seminar, please contact

Dagstuhl Service Team

Documents

List of Participants
Shared Documents
Dagstuhl's Impact: Documents available

Motivation

Advanced supercomputing is characterized by the enormous resources and cost involved. A typical large-scale computation may run for 48 hours on a future exascale system. For such a system, 20 MW is quite an optimistic estimate of the power consumption. The computation will therefore consume about a million kWh, which amounts to on the order of 100 000 Euro in energy cost alone. It is clearly unacceptable that the whole computation would be lost if any one of the several million parallel processes fails during execution. Similarly, the system will perform more than 10^23 floating point operations within this computation. If a single operation suffers from a bitflip error, must the whole computation then be declared invalid? How about the notion of reproducibility itself: must this core paradigm of science be revised and refined for results that are obtained by large-scale simulation?
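The figures in this paragraph can be checked with a short back-of-the-envelope calculation. The sketch below is only illustrative: the 48 hours and 20 MW come from the text, while the sustained rate of exactly 10^18 floating point operations per second and the electricity price of 0.10 Euro per kWh are assumptions chosen to match the orders of magnitude quoted above.

```python
# Back-of-the-envelope check of the energy, cost, and operation counts.
# From the text: 48 hours of runtime at 20 MW.
# Assumptions (not from the text): exactly 1e18 flop/s sustained and
# an illustrative electricity price of 0.10 EUR per kWh.

RUNTIME_H = 48
POWER_MW = 20
FLOPS_PER_S = 1e18          # assumption: one exaflop/s sustained
PRICE_EUR_PER_KWH = 0.10    # assumption: illustrative price


def run_estimates(runtime_h=RUNTIME_H, power_mw=POWER_MW,
                  flops_per_s=FLOPS_PER_S, price=PRICE_EUR_PER_KWH):
    """Return (energy in kWh, energy cost in EUR, total floating point operations)."""
    energy_kwh = power_mw * 1000 * runtime_h       # MW -> kW, times hours
    cost_eur = energy_kwh * price
    total_flops = flops_per_s * runtime_h * 3600   # hours -> seconds
    return energy_kwh, cost_eur, total_flops


energy_kwh, cost_eur, total_flops = run_estimates()
print(f"energy: {energy_kwh:,.0f} kWh")
print(f"cost:   {cost_eur:,.0f} EUR")
print(f"flops:  {total_flops:.2e}")
```

Under these assumptions the run uses 960 000 kWh, costs roughly 96 000 Euro, and executes about 1.7 × 10^23 operations, which is consistent with the rounded values in the text.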

Conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, writing checkpoint data to background storage at frequent intervals can create unacceptable overhead in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated.
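Why progress can stall is easiest to see in a simple cost model. The following sketch is not taken from the seminar text: it uses Young's classical first-order estimate for the checkpoint interval, sqrt(2 · C · M), and a rough waste model. The function names and the sample values (10 minutes to write or restore a checkpoint) are hypothetical, and the model is only a good approximation when the checkpoint time is much smaller than the mean time between failures.

```python
import math


def young_interval(checkpoint_s, mtbf_s):
    """Young's first-order estimate of the checkpoint interval: sqrt(2*C*M)."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)


def wasted_fraction(checkpoint_s, mtbf_s, restart_s):
    """Rough fraction of wall-clock time not spent on useful work.

    checkpoint_s / tau is the overhead of writing checkpoints;
    (tau / 2 + restart_s) / mtbf_s is the expected rework plus restart
    time per failure, spread over the mean time between failures.
    The result is capped at 1.0, which means no net progress in this model.
    """
    tau = young_interval(checkpoint_s, mtbf_s)
    waste = checkpoint_s / tau + (tau / 2.0 + restart_s) / mtbf_s
    return min(waste, 1.0)


# Hypothetical values: 600 s to write or restore a checkpoint.
for label, mtbf_s in (("24 h", 86400), ("1 h", 3600), ("5 min", 300)):
    waste = wasted_fraction(checkpoint_s=600, mtbf_s=mtbf_s, restart_s=600)
    print(f"MTBF {label}: about {waste:.0%} of the runtime is wasted")
```

In this model the waste grows quickly as the mean time between failures shrinks, and it reaches the cap once a failure is expected before a restart can complete, which is the situation the forecasts warn about.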

More advanced techniques must be devised, and the key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) What are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help us understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve system-level or application-level checkpointing and rollback strategies for when an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas will be an essential topic of the seminar.
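One reason the reliability requirements depend on the computation is that the effect of a soft error varies enormously with the bit it hits. The sketch below is an illustration only, not a method from the seminar: the helper `flip_bit` is a hypothetical name, and it assumes values stored as IEEE 754 binary64 numbers, which is what Python's `float` uses on common platforms.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52..62 hold the
    exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


# Simulate a single bitflip in the value 1.0 at several positions.
for bit in (0, 51, 52, 62, 63):
    print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

A flip in the lowest mantissa bit changes the value by about one part in 10^16 and may be absorbed by an iterative solver, whereas a flip in the highest exponent bit turns 1.0 into infinity. An algorithm that tolerates the former may still need explicit detection for the latter.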

The goal of this Dagstuhl Seminar is to bring together a diverse group of scientists whose research aims to make future exascale computing applications resilient against soft and hard faults. In particular, we will explore the role that algorithms and applications play in the holistic approach needed to tackle this challenge.

Motivation text license
  Creative Commons BY 3.0 DE
  Luc Giraud, Ulrich Rüde, and Linda Stals

Classification

  • Data Structures / Algorithms / Complexity
  • Modelling / Simulation

Keywords

  • Parallel computer architecture
  • Fault tolerance
  • Checkpointing
  • Supercomputer applications

Documentation

All Dagstuhl Seminars and Dagstuhl Perspectives Workshops are documented in the series Dagstuhl Reports. Together with the seminar's collector, the organizers compile a report that summarizes the authors' contributions and adds an overall summary.

Download overview flyer (PDF).

Publications

In addition, a comprehensive collection of peer-reviewed papers can be published in the series Dagstuhl Follow-Ups.

Dagstuhl's Impact

Please let us know when a publication results from your seminar. Such publications are listed separately in the Dagstuhl's Impact section and displayed on the ground floor of the library.