http://www.dagstuhl.de/14402

September 28 – October 1, 2014, Dagstuhl Seminar 14402

Resilience in Exascale Computing

Organizers

Hermann Härtig (TU Dresden, DE)
Satoshi Matsuoka (Tokyo Institute of Technology, JP)
Frank Mueller (North Carolina State University – Raleigh, US)
Alexander Reinefeld (Konrad-Zuse-Zentrum – Berlin, DE)

Documents

Dagstuhl Report, Volume 4, Issue 9
Aims & Scope
List of Participants
Shared Documents
Dagstuhl Seminar Schedule [pdf]

Summary

Motivation

The upcoming transition from petascale to exascale computers requires the development of radically new methods of computing. Massive parallelism, delivered by manycore processors assembled into systems with more than 10^7 processing units, will open the way to extreme computing with more than 10^18 floating-point operations per second. The large number of functional components (computing cores, memory chips, network interfaces) will greatly increase the probability of partial failures. Already today, each of the four fastest supercomputers in the TOP500 list comprises more than half a million CPU cores, and this tendency towards massive parallelism is expected to accelerate in the future. In such large and complex systems, component failures are the norm rather than the exception. Applications must be able to handle dynamic reconfigurations at runtime, and system software is needed to provide fault tolerance (FT) at the system level. For example, Jaguar reportedly experienced 20 faults per hour in production mode, some of which could be mitigated while others could not.

To prevent valuable computation from being lost to failures, checkpoint/restart (C/R) has become a requirement for long-running jobs. However, current C/R mechanisms are insufficient, because the communication channels between main memory and the parallel file system are far too slow to save (and restore) a complete memory dump to disk farms. As an alternative, the memory of neighboring compute nodes may be used to keep partial checkpoints, but then erasure coding must be used to protect against data loss in the case of node failures. To make things worse, precious communication bandwidth is needed for writing/reading checkpoints, which slows down the application. Techniques for data compression or application-specific checkpointing (with a reduced memory footprint) have been proposed as a solution, but they only alleviate the problem to a certain extent.
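
As an illustration of this buddy-checkpointing idea, the sketch below shows the simplest form of erasure coding: an XOR parity block computed over the checkpoint buffers of k neighboring nodes, from which any single lost checkpoint can be rebuilt. The function names (parity_encode, parity_recover) are hypothetical placeholders and do not correspond to any existing checkpointing library; real systems apply the same principle with more elaborate codes (e.g., Reed-Solomon) to tolerate multiple simultaneous failures.

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* XOR parity over k equally sized checkpoint chunks. */
    static void parity_encode(const unsigned char *const chunks[], size_t k,
                              size_t len, unsigned char *parity)
    {
        memset(parity, 0, len);
        for (size_t i = 0; i < k; ++i)
            for (size_t j = 0; j < len; ++j)
                parity[j] ^= chunks[i][j];
    }

    /* Rebuild one missing chunk from the parity block and the survivors. */
    static void parity_recover(const unsigned char *const survivors[],
                               size_t n_survivors, const unsigned char *parity,
                               size_t len, unsigned char *lost)
    {
        memcpy(lost, parity, len);
        for (size_t i = 0; i < n_survivors; ++i)
            for (size_t j = 0; j < len; ++j)
                lost[j] ^= survivors[i][j];
    }

    int main(void)
    {
        unsigned char a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8},
                      c[4] = {9, 10, 11, 12};
        unsigned char parity[4], rebuilt[4];
        const unsigned char *const chunks[] = {a, b, c};

        parity_encode(chunks, 3, sizeof a, parity);

        /* Assume the node holding b failed: rebuild b from a, c and parity. */
        const unsigned char *const survivors[] = {a, c};
        parity_recover(survivors, 2, parity, sizeof b, rebuilt);
        assert(memcmp(rebuilt, b, sizeof b) == 0);
        return 0;
    }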

We expect exascale hardware architectures to consist of a heterogeneous set of computational units (ranging from general-purpose CPUs to specialized units such as today's GPUs), memory chips (RAM, flash, phase-change memory), and various kinds of interconnects. The operating system and its load-balancing mechanisms need to adapt to the hardware's properties as well as to workload characteristics. Given the co-existence of legacy and new applications, exascale systems must be capable of executing a broad range of parallel programming paradigms such as MPI, OpenMP, PGAS, or MapReduce. These will not always require the functionality of a fully fledged operating system. We furthermore expect applications to become more complex and dynamic. Hence, developers cannot be expected to handle load balancing and reliability continuously. It is the operating system's task to find a sweet spot that provides generic means for load management and checkpointing on the one hand, while on the other hand giving application developers full control over the performance-relevant functionality when required.

Objectives and Expected Results

The objective of this seminar is to bring together researchers and developers with a background in HPC system software (OS, network, storage, management tools) to discuss medium- to long-term approaches towards resilience in exascale computers. Two concrete outcomes are (a) outlines of alternatives for resilience at extreme scale, with their trade-offs and dependencies on hardware/technology advances, and (b) the initiation of a standardization process for a resilience API. The latter is driven by the current trend in resilience libraries of letting users specify the important data regions required for tolerating faults and for potential recovery. Berkeley Lab's BLCR, Livermore's SCR, and Cappello's FTI feature such region specification in their APIs, and so do many in-house, application-specific solutions (a sketch of such an interface is given after the list below). A standardized resilience API would allow application programmers to be agnostic of the future underlying resilience mechanisms and policies, so that resilience libraries can be exchanged at will (and might even become interoperable). The focus is on the practical systems side and should reach beyond currently established solutions. Examples of areas of interest are:

  • What is the "smallest denominator" that defines a resilience API? How can the standardization of a resilience API be realized?
  • How can reactive FT schemes that respond to failures be enhanced to reduce system overhead, ensure progress in computation, and cope with ever-shorter MTBFs?
  • How should low-energy and/or persistent memory (for example, PCM) be included on nodes for checkpointing and used by applications and the OS?
  • Can a significant number of faults be predicted with exact locations ahead of time so that proactive FT may provide complementary capabilities to move computation away from nodes that are about to fail?
  • Can message logging, incremental checkpointing and similar techniques contribute to lower checkpointing overhead?
  • Is redundant execution a viable alternative at exascale? How can partly redundant execution contribute to increased resilience in exascale algorithms?
  • Can algorithm-based fault tolerance be generalized to entire classes of algorithms? Can results continuously be checked?
  • What is the impact of silent data corruption (SDC) on HPC today? Which solvers can tolerate SDCs, and which need to be enhanced (and how)?
  • How do current/novel network architectures interact with the OS (e.g., how does migration interact with RDMA)?
  • How can execution jitter be reduced or tolerated on exascale systems, particularly in the presence of failures?
  • Can an interface be designed that allows the application to give "hints" to the OS in terms of execution steering for resilience handling? How does this approach interact with scalability mechanisms and policies, e.g., load balancing, and with programming models, e.g., to define fault handlers?
  • Do distributed communication protocols offer better resilience? How do they support coordination between node-local and inter-node scheduling?
  • Does "dark silicon" offer new opportunities for resilience?
  • How can I/O at exascale be made efficient and resilient (e.g., in situ analysis of simulation results)?
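
To make the region-specification idea behind a standardized resilience API more concrete, the following is a minimal, library-agnostic sketch. All names (res_init, res_protect, res_checkpoint, res_restarted, res_finalize) and the stub bodies are hypothetical placeholders for illustration only; they are not the API of BLCR, SCR, FTI, or of any proposed standard.

    #include <stddef.h>
    #include <stdio.h>

    typedef int res_status_t;   /* 0 = success, nonzero = error (placeholder) */

    /* Stub implementations; a real library would back these with buddy
     * checkpointing, burst buffers, or a parallel file system. */
    static res_status_t res_init(void) { return 0; }
    static res_status_t res_protect(int id, void *base, size_t bytes)
    {
        (void)id; (void)base; (void)bytes;
        return 0;
    }
    static res_status_t res_checkpoint(void) { return 0; }
    static int res_restarted(void) { return 0; }   /* was state restored? */
    static res_status_t res_finalize(void) { return 0; }

    int main(void)
    {
        enum { N = 1 << 20 };
        static double field[N];   /* application state that must survive a failure */
        long step = 0;

        res_init();
        res_protect(1, field, sizeof field);   /* ids let the application        */
        res_protect(2, &step, sizeof step);    /* re-associate regions on restart */

        if (res_restarted())                   /* after a failure, protected      */
            printf("resuming from step %ld\n", step);   /* regions are restored   */

        for (; step < 1000; ++step) {
            /* ... compute on field ... */
            if (step % 100 == 0)
                res_checkpoint();   /* where/how to store is library policy */
        }
        return res_finalize();
    }

The value of standardizing such an interface would be that the same application code runs unchanged on top of whichever backend a site provides, and backends could be exchanged transparently.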

As a result of the seminar, we expect this list of objectives to be refined and extended, and approaches to address each of these problems to be formulated. We anticipate that participants will engage in increased coordination and collaboration between the currently (mostly) separate communities of HPC system software and application development.

Furthermore, the standardization process will be kicked off. One challenge is to find the most promising context for standardization. Current HPC-related standards (MPI, OpenMP, OpenACC) do not seem suitable, since resilience cuts across concrete runtime environments and may also extend beyond HPC to Clouds and data centers, involving industry participants from these areas (in future standardization meetings beyond the scope of this meeting).

Overall, the objective of the workshop is to spark research and standardization activities in a coordinated manner that can pave the way for tomorrow's exascale computers, to the benefit of application developers. Thus, we expect not only HPC system developers to benefit from the seminar, but also the community of scientific computing at large, well beyond computer science. Due to the wide range of participants (researchers and industry practitioners from the U.S., Europe, and Asia), forthcoming research work may significantly help enhance the FT properties of exascale systems, and technology transfer is likely to also reach general-purpose computing with many-core parallelism and server-style computing. Specifically, the work should sow the seeds for increased collaboration between institutes in Europe and the U.S./Asia.

Relation to Previous Dagstuhl Seminars

Two of the proposers, Frank Mueller and Alexander Reinefeld, previously co-organized a Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids in 2009. It provided a forum for exchanging research ideas on FT within the high-performance and grid computing communities. Since then, the state of the art has advanced greatly, and it has become clear that exascale computing will not be possible without adequate means for resilience. Hence, the new seminar is more concrete in that the pressing problems of FT for exascale computing and of standardization must be tackled and solved through the joint forces of system researchers and developers.

The proposed seminar also builds on the Dagstuhl Perspectives Workshop 12212 Co-Design of Systems and Applications for Exascale, which also relates to the DFG-funded project FFMK (http://ffmk.tudos.org, "A Fast and Fault-tolerant Microkernel-based System for Exascale Computing", DFG priority program 1648). Compared to the Perspectives Workshop, our proposed seminar is much more focused on a single, pressing topic of exascale computing, namely resilience.

License
  Creative Commons BY 3.0 Unported license
  Hermann Härtig, Satoshi Matsuoka, Frank Mueller, and Alexander Reinefeld

Classification

  • Operating Systems

Keywords

  • Exascale computing
  • Resilience
  • Fault tolerance
  • Manycore computers
  • Operating systems
  • Micro kernels
  • Work-load balancing
  • Checkpointing


Documentation

Each Dagstuhl Seminar and Dagstuhl Perspectives Workshop is documented in the series Dagstuhl Reports. The seminar organizers, in cooperation with the collector, prepare a report that includes contributions from the participants' talks together with a summary of the seminar.

 


Publications

Furthermore, a comprehensive peer-reviewed collection of research papers can be published in the series Dagstuhl Follow-Ups.
