May 3 – 8, 2009, Dagstuhl Seminar 09191
Fault Tolerance in High-Performance Computing and Grids
Franck Cappello (INRIA Saclay – Île-de-France – Orsay, FR)
Laxmikant Kale (University of Illinois – Urbana Champaign, US)
Frank Mueller (North Carolina State University, US)
Keshav Pingali (University of Texas – Austin, US)
Alexander Reinefeld (K. Zuse Zentrum Berlin, DE)
The objective of this seminar was to bring together researchers and practitioners from the HPC and Grid communities to discuss medium- to long-term approaches to fault tolerance (FT). The focus was on practical, system-side solutions, with the intent to reach beyond established approaches.
Overall, the objective of the workshop was to spark coordinated research activities that can significantly enhance the FT capabilities of today's and tomorrow's HPC systems and Grids. The benefits of this work extend to the scientific computing community at large, well beyond computer science. Given the wide range of participants (researchers and industry practitioners from the U.S., Europe, and Asia), forthcoming research may significantly enhance the FT properties of large-scale systems. Technology transfer is likely to eventually reach general-purpose computing, given the increasing trend toward multi-core parallelism and server-style computing such as that operated by Google. Specifically, the work should sow the seeds for increased collaboration between institutes in Europe and those in the U.S. and Asia. If successful, a follow-up seminar may be organized in the following year.
This meeting was the first of its kind at Dagstuhl and provided a foundation for a community platform with a cohesive outlook on FT in HPC and Grids. The participants' presentations concentrated on fundamental issues related to FT in HPC applications, runtime systems, operating systems, networking, I/O and schedulers. The program consisted of an introductory session for all participants and 22 presentations, as well as four "open mic" sessions where time was set aside for spontaneous discussions, brainstorming and community-building plans. The seminar brought together a total of 31 researchers and developers working in areas related to fault tolerance, from universities, national research laboratories and computer vendors. The goals were to increase the exchange of ideas, promote knowledge transfer, and foster a multidisciplinary approach to this very important research problem, which has a direct impact on the way we design and use parallel systems to make applications resilient to faults in hardware or software.
Through the lively engagement of all participants, the seminar was very successful and was conducted in a professional, friendly and collegial atmosphere, supported by the kind and helpful staff at Schloss Dagstuhl. Lively discussions continued every day well beyond meeting times. The four-day format of the group meeting, combined with the cosy confinement of Dagstuhl, provided an umbrella for thoughtful discussion that conferences or workshops cannot provide. This helped create a community feeling that could become a building block for a concerted effort to coordinate future research activities, cooperate in outreach efforts, and maximize everyone's productivity and impact by pulling together each participant's unique expertise to solve the grand challenges of FT in HPC and Grids.
During the meeting, follow-up action items with a community-building character were identified, as detailed in the online discussion notes of the open mic sessions. They include (a) creation of a mailing list to coordinate activities and disseminate information on FT in high-performance computing and related areas, (b) provision of a Wiki dedicated to collecting information on active projects and existing solutions and to coordinating future research activities, and (c) organization of follow-on meetings for the community. Within one month of the seminar, action items (a) and (b) had been realized, and a follow-up meeting (c) was in the planning stages.
Related Dagstuhl Seminar
- 14402: "Resilience in Exascale Computing" (2014)
- Operating Systems
- High-Performance Computing
- Runtime Systems
- Operating Systems
- Overlay Networks