Fault-Tolerant Distributed Algorithms on VLSI Chips
(07 Sep – 10 Sep, 2008)
- Bernadette Charron-Bost (Ecole Polytechnique - Palaiseau, FR)
- Shlomi Dolev (Ben Gurion University - Beer Sheva, IL)
- Jo Ebergen (Sun Microsystems - Menlo Park, US)
- Ulrich Schmid (TU Wien, AT)
- Annette Beyer (for administrative matters)
The Dagstuhl seminar 08371 on Fault-Tolerant Distributed Algorithms on VLSI Chips was devoted to exploring whether the wealth of existing fault-tolerant distributed algorithms research can be utilized for meeting the challenges of future-generation VLSI chips. Participants from both the distributed fault-tolerant algorithms community, interested in this emerging application domain, and from the VLSI systems-on-chip and digital design community, interested in well-founded system-level approaches to fault-tolerance, surveyed the current state of the art and tried to identify possibilities to work together. The seminar clearly achieved its purpose: It became apparent that most existing research in distributed algorithms is too heavyweight to be applied immediately in the "core" VLSI design context, where power, area, and similar resources are scarce. At the same time, however, it was recognized that emerging trends like large multicore chips and increasingly critical applications create new and promising application domains for fault-tolerant distributed algorithms. We are convinced that the very fruitful cross-community interactions that took place during the Dagstuhl seminar will contribute to new research activities in those areas.
Shrinking feature sizes and increasing clock speeds are the most visible signs of the tremendous advances in VLSI design, which will accommodate billions of transistors on a single chip in the near future. This comes, however, at the price of increased system-level complexity: In today’s deep submicron technology with GHz clock speeds, wiring delays dominate transistor switching delays, and signals can no longer traverse the whole die within a single clock cycle. In fact, a modern VLSI chip can no longer be viewed as a monolithic block of synchronous hardware, where all state transitions occur simultaneously. Rather, VLSI chips are nowadays considered as systems of interacting subsystems, as reflected in the advent of Systems-on-Chip (SoC) and Networks-on-Chip (NoC).
In addition, ever-increasing manufacturing variabilities increase the defect ratio, and the reduced voltage swing needed for high clock speeds and low power consumption also increases the adverse effects of α-particle and neutron hits during operation, as well as cross-talk and ground-bouncing sensitivity. The resulting increase of the transient failure rate (soft-error rate), which was negligible in most former-generation chips, has hence raised general concerns about the dependability of future generation VLSI chips. Consequently, suitable fault-tolerance mechanisms with respect to timing errors or value errors are vital for such devices: Fine-grained fault-tolerance like radiation-hardening, fault masking at transistor or gate level, error-correcting codes or error detection and recovery are the primary methods of choice here.
Due to the above trends, however, modern VLSI chips have much in common with the loosely-coupled distributed systems that have been studied by the fault-tolerant distributed algorithms community for decades. System-level fault tolerance based on replication and distributed agreement is the dominant approach here, and a wealth of different computing and failure models, algorithms & protocols, and theoretical results regarding solvability of problems and achievable performance have been established in the past.
The purpose of our Dagstuhl seminar was to explore whether fault-tolerant distributed algorithms research can indeed be utilized for meeting the challenges of future-generation VLSI chips: Just as Temporal Logic, established in the distributed computing scope decades ago, found its way to the VLSI domain, other radically new solutions and methods may also find their way. And indeed, some recent research suggested a positive answer to this question: For example, it has been demonstrated that distributed fault-tolerant clock generation algorithms can be adapted to the very special requirements of VLSI chips, and that self-stabilization is a very promising approach for designing robust VLSI chips.
Fifteen participants from the distributed fault-tolerant algorithms community (and related fields, like verification), interested in the new application domain of VLSI chips, and twelve participants from the VLSI community, interested in system-level approaches to fault-tolerance, joined at Dagstuhl in order to survey the current state-of-the-art and identify possibilities to work together.
The presentations and the unique setting of Dagstuhl, with its relaxed and stimulating atmosphere, fully achieved their purpose: They stimulated long discussions during the official seminar sessions and many fruitful cross-community interactions during the free time, which even exceeded the time available.
- Janusz Brzozowski (University of Waterloo, CA)
- Bernadette Charron-Bost (Ecole Polytechnique - Palaiseau, FR)
- Shlomi Dolev (Ben Gurion University - Beer Sheva, IL)
- Jo Ebergen (Sun Microsystems - Menlo Park, US)
- Sergey Frenkel (Russian Academy of Sciences - Moscow, RU)
- Gottfried Fuchs (TU Wien, AT)
- Matthias Függer (TU Wien, AT)
- Mike Gerdes (Universität Augsburg, DE)
- Leslie Lamport (Microsoft Corp. - Mountain View, US)
- Rajit Manohar (Cornell University, US)
- Alain Martin (CalTech - Pasadena, US)
- Philippe Matherat (Télécom ParisTech, FR)
- Chris J. Myers (Univ. of Utah, US)
- Lirida Naviner (ENST - Paris, FR)
- Tim Nieberg (Universität Bonn, DE)
- Dhiraj Pradhan (University of Bristol, GB)
- Rüdiger Reischuk (Universität Lübeck, DE)
- André Schiper (EPFL - Lausanne, CH)
- Ulrich Schmid (TU Wien, AT)
- Daniel J. Sorin (Duke University - Durham, US)
- Andreas Steininger (TU Wien, AT)
- Oliver Theel (Universität Oldenburg, DE)
- Philippas Tsigas (Chalmers UT - Göteborg, SE)
- Helmut Veith (TU Darmstadt, DE)
- Jennifer L. Welch (Texas A&M University - College Station, US)
- Josef Widder (Ecole Polytechnique - Palaiseau, FR)
- Alex Yakovlev (Newcastle University, GB)
- data structures / algorithms / complexity
- Fault-tolerant distributed algorithms
- system-level fault tolerance
- VLSI systems-on-chip
- digital logic
- formal specification