July 5–10, 2015, Dagstuhl Seminar 15281
Algorithms and Scheduling Techniques to Manage Resilience and Power Consumption in Distributed Systems
Many computer applications are executed on large-scale systems that comprise many hardware components, such as clusters that can be federated into distributed cloud computing or grid computing platforms. The owners/managers of these systems face two main challenges: failure management and energy management.
Failure management, the goal of which is to achieve resilience, is necessary because a large number of hardware resources implies a large number of failures during the execution of an application. While hardware resources can be made more reliable via the use of redundancy, this redundancy increases component cost. As a result, systems deployed within budget constraints must be built from unreliable components that have a finite Mean Time Between Failures (MTBF), i.e., commercial off-the-shelf components. For instance, a failure would occur roughly every 50 minutes in a system with one million components, even if the MTBF of a single component is as large as 100 years.
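The 50-minute figure can be checked with a short calculation. The sketch below is illustrative only: it assumes that component failures are independent and occur at a constant rate, so that the MTBF of the whole platform is the component MTBF divided by the number of components. The function name and the 365.25-day year are choices made for this example, not part of the seminar report.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length in minutes


def platform_mtbf_minutes(component_mtbf_years: float, num_components: int) -> float:
    """Return the platform MTBF in minutes.

    Assumes independent components with constant failure rates, in which
    case failure rates add up and the platform MTBF is the component MTBF
    divided by the number of components.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mtbf_years * MINUTES_PER_YEAR / num_components


# One million components, each with an MTBF of 100 years.
mtbf = platform_mtbf_minutes(100, 1_000_000)
print(f"Platform MTBF: {mtbf:.1f} minutes")
```

Under these assumptions the result is a little under 53 minutes, which is consistent with the order of magnitude quoted above.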
Energy management, the goal of which is to optimize power consumption and to handle thermal issues, is also necessary due to both monetary and environmental constraints. While in today's systems, processors are the most power-consuming components, it is anticipated that in future distributed systems, the power dissipated to perform communications and I/O transfers will make up a much larger share of the overall energy consumption. In fact, the relative cost of communication is expected to increase dramatically, both in terms of latency/overhead and of consumed energy. Consequently, the computation and communication workloads of typical applications executed in HPC and/or cloud environments will lead to large power consumption and heat dissipation.
These two challenges, resilience and energy efficiency, are currently being studied by many researchers. Some of these researchers come from a "systems" culture, and investigate in particular systems design and management strategies that enhance resilience and energy efficiency. These strategies include high-level resource-provisioning policies, pragmatic resource allocation and scheduling heuristics, novel approaches for designing and deploying systems software infrastructures, and tools for monitoring/measuring the state of the system. Other researchers come from an "algorithms" culture. They investigate formal definitions of resilience and energy efficiency problems, relying on system models of various degrees of accuracy and sophistication, and aiming to obtain strong complexity results and algorithmic solutions for solving these problems. These two communities are quite often separated in the scientific literature and in the field. Some of the pragmatic solutions developed in the former community appear algorithmically weak to the latter community, while some of the algorithmic solutions developed by the latter community appear impractical to the former community. Furthermore, the separation of application and system platform caused by ubiquitous resource virtualization layers hinders effective cooperation between algorithmic and system management methods, in particular when handling resilience and energy efficiency. To move forward, more interaction and collaboration are needed between the systems and the algorithms communities, an observation that was made very clear during the discussions in the predecessor Dagstuhl seminar.
The broader challenge faced by systems and algorithms designers is that the optimization metrics of interest (resilience, power consumption, heat distribution, performance) are intimately related. For instance, high volatility in power consumption due to the use of dynamic voltage and frequency scaling (DVFS) is known to lead to thermal hotspots in a datacenter. Therefore, the datacenter must increase the safety margin of its cooling system to handle these hotspots. As a result, the power consumed by the cooling system is increased, possibly increasing the overall power consumption of the whole system, even though the motivation for using DVFS in the first place was to reduce power consumption! When resilience is thrown into the mix, the trade-offs between the conflicting resilience, performance, and energy goals become even more intertwined. Adding fault-tolerance to a system, for instance by using redundant computation or by periodically saving the state of the system to secondary storage, can decrease performance and almost always increases hardware resource requirements and thus power consumption. The field is rife with such conundrums, which must be addressed via systems and algorithms techniques used in conjunction. In this seminar, we have brought together researchers and practitioners from both the systems and the algorithms communities, so as to foster fruitful discussions of these conundrums, many of which were touched upon in the predecessor seminar but by no means resolved.
To provide a clear context, the seminar focused on workflow applications. Workflows correspond to a broad and popular model of computation in which diverse computation tasks (which may themselves follow arbitrary models of computation) are interconnected via control and data dependencies. They have become very popular in many domains, ranging from scientific to datacenter applications, and share similar sets of challenges and current solutions. Part of the motivation for using workflows, and thus for developing workflow management systems and algorithms, is that they make it possible to describe complex and large computations succinctly and portably. Most of the invited seminar participants have worked and are currently working on issues related to the efficient, resilient, and energy-efficient execution of workflows on distributed platforms. Workflows thus provide an ideal focal and unifying theme for the seminar.
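As a minimal illustration of this model, a workflow can be represented as a directed acyclic graph in which each task lists the tasks it depends on. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to group tasks into successive "waves" that could run in parallel once their dependencies have completed. The task names and the `execution_waves` helper are hypothetical and chosen only for this example; real workflow management systems also handle data movement, failures, and resource allocation.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "preprocess": set(),
    "simulate_a": {"preprocess"},
    "simulate_b": {"preprocess"},
    "merge": {"simulate_a", "simulate_b"},
}


def execution_waves(dependencies):
    """Group tasks into waves; all tasks in a wave have their dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose predecessors are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as completed
    return waves


waves = execution_waves(workflow)
print(waves)
# [['preprocess'], ['simulate_a', 'simulate_b'], ['merge']]
```

Here the two simulation tasks are independent of each other, so a scheduler is free to map them to different resources.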
A number of workflow tools are available to aid users in defining and executing workflow applications. While these tools are designed primarily to support the end user, they are in fact ideal proving grounds for implementing novel systems and algorithms techniques that aim at optimizing performance, resilience, and energy efficiency. Therefore, these tools provide a great opportunity to enhance both the application and the software infrastructure to meet the needs of both the end users and the systems owners/managers. These goals are very diverse and, as we have seen above, intertwined, so that re-designing algorithms and systems to meet these goals is a difficult proposition (again, higher resilience often calls for redundant computations and/or redundant communication, which in turn consumes extra power and can reduce performance). In a broad sense, we are facing complex multi-criteria optimization problems that must be (i) formalized in a way that is cognizant of the practical systems constraints and hardware considerations; and (ii) solved by novel algorithms that are both fast (so that they can be used in an on-line manner) and robust (so that they can tolerate wide ranges of scenarios with possibly inaccurate information).
The goal of this seminar was to foster discussions on, and articulate novel and promising directions for, addressing the challenges highlighted above. International experts in the field have investigated how to approach (and hopefully at least partially address) the challenges that algorithms and systems designers face due to frequent failures and energy usage constraints. More specifically, the seminar has addressed the following topics:
- Multi-criteria optimization problems as applicable to fault-tolerance / energy management
- Resilience techniques for HPC and cloud systems
- Robust and energy-aware distributed algorithms for resource scheduling and allocation in large distributed systems
- Application-specific approaches for fault-tolerance and energy management, with a focus on workflow-based applications
Although the presentations at the seminar were very diverse in scope, ranging from practice to theory, an interesting observation is that many works do establish strong links between practice (e.g., particular applications, programming models) and theory (e.g., abstract scheduling problems and results). In particular, it was found that workflow applications, far from being well-understood, in fact give rise to a range of interrelated and interesting practical and theoretical problems that must be solved conjointly to achieve efficiency at large scale. Estimating task weights, scheduling with uncertainties, mapping at scale, remapping after failures, and trading performance for energy are a few of the challenges that were discussed at length during the seminar. Such observations make it plain that forums that blend practice and theory, as is the case with this seminar, are very much needed.
The seminar brought together 41 researchers from Austria, France, Germany, Japan, Netherlands, New Zealand, Poland, Portugal, Spain, Sweden, Switzerland, UK and USA, with interests and expertise in different aspects of parallel and distributed computing. Among the participants there was a good mix of senior researchers, junior researchers, postdoctoral researchers, and Ph.D. students. Altogether there were 29 presentations over the 5 days of the seminar, organized in morning and late-afternoon sessions. The program was, as usual, a compromise between allowing sufficient time for participants to present their work and providing unstructured periods that were used by participants to pursue ongoing collaborations as well as to foster new ones. The feedback provided by the participants shows that the goals of the seminar, namely to circulate new ideas and create new collaborations, were met to a large extent.
The organizers and participants wish to thank the staff and the management of Schloss Dagstuhl for their assistance and support in the arrangement of a very successful and productive event.
Creative Commons BY 3.0 Unported license
Henri Casanova, Ewa Deelman, Yves Robert, and Uwe Schwiegelshohn
- Data Structures / Algorithms / Complexity
- Operating Systems
- Optimization / Scheduling
- Fault tolerance
- Energy efficiency
- Distributed and high performance computing