Dagstuhl Seminar 09191: Fault Tolerance in High-Performance Computing and Grids

( May 03 – May 08, 2009 )

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/09191

Organizers

Franck Cappello (University of Paris South XI, FR)
Laxmikant Kale (University of Illinois - Urbana-Champaign, US)
Frank Mueller (North Carolina State University - Raleigh, US)
Keshav Pingali (University of Texas - Austin, US)
Alexander Reinefeld (Konrad-Zuse-Zentrum - Berlin, DE)

Contact

Dagstuhl Service Team

Publications

Fault Tolerance in High-Performance Computing and Grids. Franck Cappello, Laxmikant Kale, Frank Mueller, Keshav Pingali, and Alexander Reinefeld (Eds.). Dagstuhl Seminar Proceedings, Volume 9191. June 26, 2009

Summary

The objective of this seminar was to bring together researchers and practitioners from the HPC and Grid communities to discuss medium- to long-term approaches to fault tolerance (FT). The focus was on practical, system-side solutions, with the intent to reach beyond established ones.

Overall, the objective of the workshop is to spark coordinated research activities that can significantly enhance the FT capabilities of today's and tomorrow's HPC systems and Grids. The benefits of this work extend to the scientific computing community at large, well beyond computer science. Given the wide range of participants (researchers and industry practitioners from the U.S., Europe, and Asia), forthcoming research may significantly enhance the FT properties of large-scale systems. Technology transfer is likely to eventually reach general-purpose computing, given the increasing trend toward multi-core parallelism and server-style computing such as that operated by Google. Specifically, the work should sow the seeds for increased collaboration between institutes in Europe and the U.S./Asia. If successful, a follow-up seminar may be organized in the following year.

This meeting was the first of its kind at Dagstuhl and provided a foundation for a community platform with a cohesive outlook on FT in HPC and Grids. The participants' presentations concentrated on fundamental issues related to FT in HPC applications, runtime systems, operating systems, networking, I/O and schedulers. The program consisted of an introductory session for all participants and 22 presentations, as well as four "open mic" sessions where time was set aside for spontaneous discussions, brainstorming and community-building plans. The seminar brought together a total of 31 researchers and developers working in areas related to fault tolerance, from universities, national research laboratories and computer vendors. The goals were to increase the exchange of ideas, promote knowledge transfer, and foster a multidisciplinary approach to this very important research problem, which directly affects how we design and use parallel systems to make applications resilient to faults in hardware or software.

Through the lively engagement of all participants, the seminar was very successful and was conducted in a professional, friendly and collegial atmosphere supported by the kind and helpful staff at Schloss Dagstuhl. Lively discussions continued every day well beyond meeting times. The four-day group meeting format, combined with the cosy confinement of Dagstuhl, provided an umbrella for thoughtful discussion that conferences or workshops cannot provide. This helped create a community feeling that could become a building block for a concerted effort to coordinate future research activities, cooperate in outreach efforts, and maximize everyone's productivity and impact by pulling together each participant's unique expertise to solve the grand challenges of FT in HPC and Grids. During the meeting, follow-up action items of a community-building character were identified, as detailed in the online discussion notes of the open mic sessions. They include (a) creation of a mailing list to coordinate activities and disseminate information on FT in high-performance computing and related areas, (b) provision of a Wiki dedicated to collecting information on active projects and existing solutions and to coordinating future research activities, and (c) organization of follow-on meetings for the community. Within one month of the seminar, action items (a) and (b) have been realized and a follow-up meeting (c) is in the planning stages.

Participants

Artur Andrzejak (Konrad-Zuse-Zentrum - Berlin, DE) [dblp]
Gabriel Antoniu (INRIA Rennes - Bretagne Atlantique, FR) [dblp]
Henri E. Bal (VU University Amsterdam, NL) [dblp]
George Bosilca (University of Tennessee, US) [dblp]
Franck Cappello (University of Paris South XI, FR) [dblp]
Christian Engelmann (Oak Ridge National Laboratory, US) [dblp]
Dick H.J. Epema (TU Delft, NL) [dblp]
Wolfgang Frings (Jülich Supercomputing Centre, DE) [dblp]
Richard L. Graham (Oak Ridge National Laboratory, US)
Amina Guermouche (Université Paris Sud, FR) [dblp]
Paul Hargrove (Lawrence Berkeley National Laboratory, US)
Hermann Härtig (TU Dresden, DE) [dblp]
Thomas Herault (Université Paris Sud, FR) [dblp]
Matthias Hovestadt (TU Berlin, DE)
Laxmikant Kale (University of Illinois - Urbana-Champaign, US) [dblp]
Rainer Keller (Oak Ridge National Laboratory, US)
Chokchai Leangsuksun (Louisiana Tech University, US)
Volker Lindenstruth (Universität Heidelberg, DE) [dblp]
Barry Linnert (TU Berlin, DE)
Xiaosong Ma (North Carolina State University - Raleigh, US) [dblp]
Frank Mueller (North Carolina State University - Raleigh, US) [dblp]
Dhabaleswar K. Panda (Ohio State University - Columbus, US) [dblp]
Alexander Reinefeld (Konrad-Zuse-Zentrum - Berlin, DE) [dblp]
Florian Schintke (Konrad-Zuse-Zentrum - Berlin, DE) [dblp]
Jörg Schneider (TU Berlin, DE)
Michael Schöttner (Heinrich-Heine-Universität Düsseldorf, DE)
Stephen L. Scott (Oak Ridge National Laboratory, US) [dblp]
Eugen Staab (University of Luxembourg, LU) [dblp]
Jesper Larsson Traff (NEC Europe - St. Augustin, DE)
Paolo Trunfio (University of Calabria, IT)
Geoffroy Vallee (Oak Ridge National Laboratory, US)

Related Seminars

Dagstuhl Seminar 14402: Resilience in Exascale Computing (2014-09-28 - 2014-10-01)

Classification

Operating systems

Keywords

High-Performance Computing
Grids
Fault-Tolerance
Applications
Runtime Systems
Operating Systems
Middleware
Peer-to-Peer
Overlay Networks
