https://www.dagstuhl.de/09191

May 3–8, 2009, Dagstuhl Seminar 09191

Fault Tolerance in High-Performance Computing and Grids

Organizers

Franck Cappello (University of Paris South XI, FR)
Laxmikant Kale (University of Illinois – Urbana-Champaign, US)
Frank Mueller (North Carolina State University – Raleigh, US)
Keshav Pingali (University of Texas – Austin, US)
Alexander Reinefeld (Konrad-Zuse-Zentrum – Berlin, DE)

For information about this Dagstuhl Seminar, contact

Dagstuhl Service Team

Documents

Dagstuhl Seminar Proceedings DROPS
List of participants

Summary

The objective of this seminar was to bring together researchers and practitioners from the HPC and Grid communities to discuss medium- to long-term approaches to fault tolerance (FT). The focus was on practical, system-side solutions, with the intent to reach beyond established approaches.

Overall, the objective of the workshop is to spark coordinated research activities that can significantly enhance the FT capabilities of today's and tomorrow's HPC systems and Grids. The benefits of this work extend to the scientific computing community at large, well beyond computer science. Given the wide range of participants (researchers and industry practitioners from the U.S., Europe, and Asia), forthcoming research may significantly enhance the FT properties of large-scale systems, and technology transfer is likely to eventually reach general-purpose computing, given the increasing trend toward multi-core parallelism and server-style computing such as that operated by Google. Specifically, the work should sow the seeds for increased collaboration between institutes in Europe and the U.S./Asia. If successful, a follow-up seminar may be organized in the following year.

This meeting was the first of its kind at Dagstuhl and provided a foundation for a community platform with a cohesive outlook on FT in HPC and Grids. The participants' presentations concentrated on fundamental issues related to FT in HPC applications, runtime systems, operating systems, networking, I/O, and schedulers. The program consisted of an introductory session for all participants and 22 presentations, as well as four "open mic" sessions where time was set aside for spontaneous discussions, brainstorming, and community-building plans. The seminar brought together a total of 31 researchers and developers working in areas related to fault tolerance from universities, national research laboratories, and computer vendors. The goals were to increase the exchange of ideas, promote knowledge transfer, and foster a multidisciplinary approach to attacking this very important research problem, which has a direct impact on the way we design and use parallel systems to make applications resilient to faults in hardware or software.

Through the lively engagement of all participants, the seminar was very successful and was conducted in a professional, friendly, and collegial atmosphere, supported by the kind and helpful staff at Schloss Dagstuhl. Lively discussions continued every day well beyond meeting times. The four-day group meeting format, combined with the cosy confinement of Dagstuhl, provided an umbrella for thoughtful discussion that conferences or workshops cannot provide. This helped create a sense of community that could become a building block for a concerted effort to coordinate future research activities, cooperate in outreach efforts, and maximize everyone's productivity and impact by pulling together each participant's unique expertise to solve the grand challenges of FT in HPC and Grids. During the meeting, follow-up action items with a community-building character were identified, as detailed in the online discussion notes of the open mic sessions. They include (a) creation of a mailing list to coordinate activities and disseminate information on FT in high-performance computing and related areas, (b) provision of a Wiki dedicated to collecting information on active projects and existing solutions and to coordinating future research activities, and (c) organization of follow-on meetings for the community. Within one month of the seminar, action items (a) and (b) have been realized and a follow-up meeting (c) is in the planning stages.

Classification

  • Operating Systems

Keywords

  • High-Performance Computing
  • Grids
  • Fault-Tolerance
  • Applications
  • Runtime Systems
  • Operating Systems
  • Middleware
  • Peer-to-Peer
  • Overlay Networks
