05.07.15 - 10.07.15, Seminar 15281

Algorithms and Scheduling Techniques to Manage Resilience and Power Consumption in Distributed Systems

Computer applications that are executed on large-scale platforms with many hardware components, ranging from traditional High Performance Computing (HPC) systems to Clouds, face two pressing challenges: failure management and energy management. Failure management, the goal of which is to achieve resilience, is necessary because large platforms are built from large numbers of unreliable components with finite Mean Time Between Failures (MTBF). Energy management, the goal of which is energy efficiency, is also necessary due to both monetary and environmental constraints. Although in today's platforms processors consume most of the energy, communication and I/O operations are expected to account for a much larger share of the energy consumption in upcoming years. This Dagstuhl Seminar will focus on algorithms, system design, and resource management techniques for addressing these two challenges in current and future platforms.

Resilience and energy questions are currently being studied by many researchers. Some of these researchers come from a “systems” culture and investigate particular system design and management strategies (resource-provisioning policies, resource allocation and scheduling heuristics, innovative system designs, etc.). Other researchers, instead, come from an “algorithms” culture and rely on (often idealized) system models to obtain their results (complexity/approximability results, optimal or guaranteed algorithms, sophisticated heuristics, etc.). These two communities are often quite separate, and yet, to move forward, they need to collaborate and cross-pollinate.

To provide a clear context for the seminar, many of the discussions will be centered around “workflow applications.” The workflow model is general and has become very popular in many domains, ranging from scientific to datacenter applications. Many of the invited participants have worked, and are currently working, on issues related to the efficient, resilient, and energy-efficient execution of workflows on distributed platforms. Workflows thus provide an ideal main theme for the seminar, even though departures from this theme are expected and will be useful. Current tools for workflow execution are ideal proving grounds for implementing both novel systems techniques and novel algorithmic techniques for optimizing performance, resilience, and energy efficiency.

This Dagstuhl Seminar will foster discussion and articulate novel and promising directions for addressing the challenges highlighted above, thereby attempting to bridge the divide between theoretical and practical research in the field. To this end, the seminar will allow plenty of time for informal discussions, as well as “open questions” sessions. The ultimate objective is that, by the end of the seminar, a few roadmaps for advancing the state of the field will have been identified, along with many potential collaborations among seminar participants.