- Michael Glaß (Universität Erlangen-Nürnberg, DE)
- Annette Beyer (for administrative matters)
The semiconductor industry is hitting the utilization wall and is shifting its focus to parallel and heterogeneous many-core architectures. While continuous technological improvements in the chip manufacturing process enable the dense integration of more and more processing cores and, thus, processing capabilities, the resulting power consumption per area (the power density) increases enormously. With this density, the problem of dark silicon will become more prevalent in the future: It will be impossible to power up all the components on the chip at once due to thermal constraints. This is not only an emerging threat for SoC and MPSoC designers; HPC faces the same problem: The power supplied by the energy companies and the available cooling capacity no longer allow the entire machine to run at highest performance. The goal of this workshop is to raise awareness of these similarities in the research communities and to explore solutions based on more flexible resource management schemes, including run-time, design-time, and hybrid solutions.
Recent research work on power management for Dark Silicon aims at efficiently utilizing the TDP (Thermal Design Power) budget to maximize performance, or at allocating the full power budget to boost single-application performance by running a single core at the maximum voltage or multiple cores at nominal level for a very short time period. Control-based frameworks are proposed to find the optimal trade-off between power and performance of many-core systems under a given power budget. The work on near-threshold computing (NTC) enables operating multiple cores at a voltage close to the threshold voltage. Though this approach favors applications with thread-level parallelism at low power, it severely suffers from errors or inefficiency due to process variations and voltage fluctuations. On the other hand, the computational sprinting approach leverages Dark Silicon to power on many extra cores for a very short time period (hundreds of milliseconds) to facilitate sub-second bursts of parallel computations through multi-threading, thereby wasting a significant amount of energy due to leakage current.
The energy consumption of HPC systems is steadily growing. The energy costs over the five-year lifetime of large-scale supercomputers already almost equal the cost of the machine. It is necessary to carefully tune systems, infrastructure, and applications to reduce the overall energy consumption. In addition, the computing centers running very big systems face the problem of limited power provided by the energy providers and of the requirement for an almost constant power draw from the grid. These challenges also require careful and flexible power and resource management for HPC systems.
Modern applications have to exploit the available parallelism and heterogeneity of – non-darkened – cores to meet their functional and non-functional requirements and to gain performance improvements. A main challenge originates from many-cores promoting (a) highly dynamic usage scenarios as already observable in today's "smart devices", where multiple and varying numbers of applications are running at different points in time, while (b) the available cores are subject to change due to dark silicon. As a consequence, a mapping or pinning of applications to processor cores that is optimal and predictable with respect to performance, timing, energy consumption, etc. cannot be guaranteed by static design-time optimization alone. At the same time, pure run-time resource management may result in unpredictable and non-optimal system states. A research direction that addresses this field of tension is invasive computing, where design-time analysis and optimization of the applications is combined with run-time resource management approaches that try to balance the requirements of the individual applications with the system’s requirements, e.g., to respect a maximum power density.
The goal of this Dagstuhl Seminar is to bring together experts from the different domains and to discuss the state-of-the-art and identify future collaboration topics based on common research interests. We will have three main parts on the topics Dark Silicon, Power and Energy Usage in HPC, and Hybrid Approaches to Resource Management with longer overview presentations by invited speakers and research presentations by the attendees. Each part will close with a discussion slot. After these three parts we plan for group discussion to identify future collaborative research directions.
The semiconductor industry is hitting the utilization wall and is shifting its focus to parallel and heterogeneous many-core architectures. While continuous technology scaling enables the integration of hundreds to thousands of cores and, thus, enormous processing capabilities, the resulting power consumption per area (the power density) increases in an unsustainable way. With this density, the problem of Dark Silicon will become prevalent in future technology nodes: It will be infeasible to operate all on-chip components at full performance at the same time due to thermal constraints (peak temperature, spatial and temporal thermal gradients, etc.).
Recent research work on power management for Dark Silicon aims at efficiently utilizing the TDP (Thermal Design Power) budget to maximize performance, or at allocating the full power budget to boost single-application performance by running a single core at the maximum voltage or multiple cores at nominal level for a very short time period. Control-based frameworks are proposed to find the optimal trade-off between power and performance of many-core systems under a given power budget. The controllers are coordinated to throttle down the power when the system exceeds the TDP and to assign each task to the most suitable core to get the optimal performance. The work on near-threshold computing (NTC) enables operating multiple cores at a voltage close to the threshold voltage. Though this approach favors applications with thread-level parallelism at low power, it severely suffers from errors or inefficiency due to process variations and voltage fluctuations. On the other hand, the computational sprinting approach leverages Dark Silicon to power on many extra cores for a very short time period (hundreds of milliseconds) to facilitate sub-second bursts of parallel computations through multi-threading, but thereby wastes a significant amount of energy due to leakage current. While sprinting, it consumes power that significantly exceeds the sustainable TDP budget. Therefore, these cores are power-gated after the computational sprint. Alternative methods are Intel’s Turbo Boost and AMD’s Turbo CORE technologies, which leverage the temperature headroom to favor high-ILP applications by increasing the voltage/frequency of a core while power-gating other cores. These techniques violate the TDP constraint for a short period (typically tens of seconds) until the critical temperature is reached and then switch to nominal operation. However, in case of dependent workloads, boosting one core may throttle another due to thermal coupling (i.e., heat exchange between different cores sharing the same die). Therefore, these boosting techniques lack efficiency when dependent tasks of an application are mapped to two different cores or, in general, for multiple concurrently executing applications with distinctive/dependent workloads.
State-of-the-art boosting techniques assume a chip with only 10-20 cores (typically 16) and, accordingly, a full-chip temperature violation for a short time. However, in a large-scale system (with hundreds to thousands of cores), temperature hot spots may occur on certain chip portions long before the full chip’s average temperature exceeds the critical temperature. Therefore, either a chip may get damaged before reaching the full-chip critical temperature, or the TDP needs to be designed pessimistically. Advanced power management techniques are required to overcome these challenges in large-scale environments.
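The gap between average and local temperature can be illustrated with a minimal sketch. The function name `thermal_status`, the per-core temperature list, and all numeric values below are assumptions made for illustration only; they do not come from any presented system.

```python
def thermal_status(core_temps_c, critical_temp_c):
    """Compare a full-chip average check with a per-core hot-spot check."""
    average_c = sum(core_temps_c) / len(core_temps_c)
    hottest_c = max(core_temps_c)
    return {
        "average_c": average_c,
        "hottest_c": hottest_c,
        # Check used when the whole chip is treated as one thermal unit.
        "average_violation": average_c > critical_temp_c,
        # Check that also catches a small cluster of overheated cores.
        "hotspot_violation": hottest_c > critical_temp_c,
    }


# Hypothetical 100-core chip: 96 cool cores and a 4-core hot spot.
temps = [60.0] * 96 + [95.0] * 4
status = thermal_status(temps, critical_temp_c=90.0)
print(status)
```

In this made-up example the average stays far below the critical temperature while four cores already exceed it, which is the situation a purely average-based boosting policy would miss.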
HPC - Dark Power
The energy consumption of HPC systems is steadily growing. The energy costs over the five-year lifetime of large-scale supercomputers already almost equal the cost of the machine. It is necessary to carefully tune systems, infrastructure, and applications to reduce the overall energy consumption. In addition, the computing centers running very big systems face the problem of limited power provided by the energy providers and of the requirement for an almost constant power draw from the grid. When all components run at highest performance, big machines, especially future exascale systems, can draw more power than the energy company can provide. Thus, a carefully optimized power distribution is necessary to make the most efficient use of the provided power. The second aspect is the requirement of an almost constant power draw: Sudden changes, for example from 20 MW to 10 MW, are dangerous for the components of the power grid. In addition, the contracts with the energy companies force the centers to use the same power all the time by charging more if the draw drops below or exceeds certain limits. These challenges also require careful and flexible power and resource management for HPC systems.
For a certain class of high-end supercomputers, there is a standard pattern of power consumption: During burn-in (and perhaps while getting a result to go onto the TOP500 list) the machine will run dozens or hundreds of instances of Linpack. This code is quite simple and often hand-optimized, resulting in an unusually well-balanced execution that manages to keep vector units, cache lines, and DRAM busy simultaneously. The percentage of allocated power used often reaches 95% or greater, with one instance in recent memory exceeding 100% and tripping circuit breakers. After these initial runs, however, the mission-critical simulation codes begin to execute, and they rarely exceed 60% of allocated power. The remaining 40% of electrical capacity is dark: just as unused and just as inaccessible as dark silicon. While we would like to increase the power consumption (and thus performance) of these simulation codes, a more realistic solution in the exascale timeframe is hardware overprovisioning. This solution requires buying more compute resources than can be run at maximum power draw simultaneously. For example, if most codes are expected to use 50% of allocated power, the optimal cluster would have twice as many nodes.
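The arithmetic behind the "twice as many nodes" example can be written down as a short sketch. The function names, the 1 MW budget, and the 500 W node peak power are illustrative assumptions, not figures from the seminar.

```python
import math


def worst_case_node_count(power_budget_w, node_peak_power_w):
    """Nodes that fit the budget if every node may draw its peak power."""
    return math.floor(power_budget_w / node_peak_power_w)


def overprovisioned_node_count(power_budget_w, node_peak_power_w,
                               expected_utilization):
    """Nodes that fit the budget if each node is expected to draw only a
    fraction (expected_utilization) of its peak power."""
    if not 0 < expected_utilization <= 1:
        raise ValueError("expected_utilization must be in (0, 1]")
    expected_node_power_w = node_peak_power_w * expected_utilization
    return math.floor(power_budget_w / expected_node_power_w)


# Assumed numbers: 1 MW facility budget, 500 W peak per node,
# codes expected to use 50% of the allocated power.
budget_w = 1_000_000
peak_w = 500
baseline = worst_case_node_count(budget_w, peak_w)
overprovisioned = overprovisioned_node_count(budget_w, peak_w, 0.5)
per_node_cap_w = budget_w / overprovisioned
print(baseline, overprovisioned, per_node_cap_w)
```

With these assumed numbers the overprovisioned cluster has twice the node count of the worst-case design, and the budget is only respected if each node is held to the resulting per-node cap.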
Making this a feasible design requires management of power as a first-class resource at the level of the scheduler, the run-time system, and on individual nodes. Hardware power capping must be present. Given this, we can theoretically move power within and across jobs, using all allocated power to maximize throughput. The purpose of this seminar is to find this optimal level.
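One simple way to picture moving power across jobs is a water-filling allocation: no job receives more than it can use, and whatever a job leaves unused is shared among the jobs that are still power-bound. This is an illustrative policy chosen for the sketch, not a scheduler presented at the seminar; the function `distribute_power`, the job names, and the wattages are hypothetical.

```python
def distribute_power(total_budget_w, demands_w):
    """Water-filling: split the budget equally among unsatisfied jobs,
    never granting a job more than its demand, and repeat with the rest."""
    allocation = {job: 0.0 for job in demands_w}
    unsatisfied = set(demands_w)
    remaining = float(total_budget_w)
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        satisfied_now = set()
        for job in unsatisfied:
            need = demands_w[job] - allocation[job]
            grant = min(share, need)
            allocation[job] += grant
            remaining -= grant
            if demands_w[job] - allocation[job] <= 1e-9:
                satisfied_now.add(job)
        if not satisfied_now:
            break  # every job took a full share, so the budget is used up
        unsatisfied -= satisfied_now
    return allocation


# Hypothetical jobs and the power (in watts) each could usefully draw.
demands = {"A": 60.0, "B": 100.0, "C": 200.0}
allocation = distribute_power(300.0, demands)
print(allocation)
```

A static equal split would give each job 100 W and strand 40 W at job A; here that power moves to job C, the only job that can still use it. In practice the caps would have to be enforced by hardware power capping on the nodes, as noted above.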
Hybrid (Design-time & Run-time) Resource Management
Today’s complex applications need to exploit the available parallelism and heterogeneity of -- non-darkened -- cores to meet their functional and non-functional requirements and to gain performance improvements. From a resource management’s point of view, modern many-core systems come with significant challenges: (a) Highly dynamic usage scenarios as already observable in today's "smart devices" result in a varying number of applications with different characteristics that are running concurrently at different points in time on the system. (b) Due to the constraints imposed by the power density, the frequency at which cores can be operated as well as their availability as a whole, are subject to change. Thus, resource management techniques are required that enable a resource assignment to applications that satisfies their requirements but at the same time can consider the challenging dynamics of modern many-cores as a result of Dark Silicon.
Traditional techniques to provide a binding or pinning of applications to processors that is optimal and predictable with respect to performance, timing, energy consumption, etc. are typically applied at design time and result in a kind of static system design. Such a static design may, on the one hand, be too optimistic by assuming that all assigned resources are always available, or it may, on the other hand, require a kind of over-allocation of cores to compensate for worst-case scenarios, e.g., a frequent unavailability of cores due to Dark Silicon. Hence, the dynamic effects imposed by Dark Silicon require novel modeling techniques already at design time.
Approaches that focus on pure run-time resource management are typically designed with flexibility in mind and should inherently be able to dynamically react to changing applications as well as to the described effects of Dark Silicon. But, future run-time resource management should not only react to a possible violation of a maximum power-density constraint, but also be able to proactively avoid such situations. The latter is an important aspect of the system’s dependability as well. At the same time, such dynamic resource management is also required to regard the applications’ requirements. Here, a careful consideration on whether pure run-time management strategies enable the amount of predictability of execution qualities required by some applications becomes necessary.
A recent research direction focuses on hybrid (design-time and run-time) approaches that explore this field of tension between the high predictability of design-time approaches and the dynamic adaptivity of run-time resource management. In such approaches, design-time analysis and optimization of the individual applications is carried out to capture information like core allocation, task binding, or message routing and to predict resulting quality numbers like timeliness, energy consumption, or throughput. This information is then passed to the run-time resource management, which dynamically selects between the pre-optimized application embeddings. Such strategies may not only be able to achieve application requirements even in such highly dynamic scenarios, but could even balance the requirements of the individual applications with the system’s requirements -- in particular the maximum power density. On the other hand, coarse-grained resource management as required for core allocation etc. may be considered to happen on a longer time scale. The effects of Dark Silicon instead occur on a smaller time scale, with temperature almost immediately following changing workloads, thus requiring an intervention of the resource-management infrastructure. Therefore, novel concepts are required that enable fine-grained resource management in the presence of Dark Silicon -- in terms of both abstraction layer and time scale -- without sacrificing the required efficiency or the predictable realization of application requirements via coarse-grained resource management.
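The division of labor in such a hybrid scheme can be sketched as follows. The design-time step is represented only by its result, a small table of pre-optimized operating points per application; the run-time step picks one point per application under the currently available power and cores. All names (`OperatingPoint`, `select_operating_points`), the two applications, and every number are hypothetical, and the exhaustive search is a stand-in for whatever selection heuristic a real run-time manager would use.

```python
from itertools import product
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    cores: int         # cores the embedding occupies
    power_w: float     # predicted power at design time
    throughput: float  # predicted quality number at design time


def select_operating_points(design_time_points, power_budget_w,
                            available_cores):
    """Pick one pre-optimized point per application that maximizes total
    throughput without exceeding the power budget or the core count."""
    apps = list(design_time_points)
    best_choice, best_throughput = None, -1.0
    for combo in product(*(design_time_points[app] for app in apps)):
        if sum(p.power_w for p in combo) > power_budget_w:
            continue
        if sum(p.cores for p in combo) > available_cores:
            continue
        total = sum(p.throughput for p in combo)
        if total > best_throughput:
            best_choice, best_throughput = dict(zip(apps, combo)), total
    return best_choice  # None if no feasible combination exists


# Assumed design-time results for two applications.
points = {
    "video": [OperatingPoint(1, 2.0, 10.0), OperatingPoint(2, 3.5, 18.0),
              OperatingPoint(4, 6.0, 30.0)],
    "radio": [OperatingPoint(1, 1.5, 8.0), OperatingPoint(2, 2.5, 14.0)],
}
relaxed = select_operating_points(points, power_budget_w=8.0,
                                  available_cores=6)
constrained = select_operating_points(points, power_budget_w=5.0,
                                      available_cores=6)
print(relaxed["video"].cores, constrained["video"].cores)
```

When the assumed power budget drops from 8 W to 5 W, the run-time step falls back from the 4-core to the 2-core embedding of the hypothetical video application. Because every candidate was analyzed at design time, the predicted quality numbers remain known after the switch.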
Traditionally, resource management techniques play an important role in both domains -- targeting very different systems. But, as outlined before, resource management may be the key to tackling the problem of dark silicon that both communities face. The aim of this seminar is to give an overview of the state of the art in both the embedded and the HPC area. It will make both groups aware of similarities and differences. Here, the competences, experiences, and existing solutions of both communities shall stimulate discussions and co-operations that hopefully manifest in innovative research directions for many-core resource management in the dark silicon era.
Overview of Contributions
This seminar featured presentations on the state of the art in power and energy management in HPC and on techniques mitigating the Dark Silicon problem in embedded systems. In a joint session, commonalities and differences as well as collaboration potential in the area of Dark Silicon were explored. This subsection gives an overview of the topics covered by the individual speakers in the seminar. Please refer to the included abstracts to learn more about individual presentations.
The HPC-related presentations started with an overview presentation by Barry Rountree from the Lawrence Livermore National Laboratory. He introduced the field of HPC and of exascale systems. The new challenge is that these systems will be power-limited and the hardware will be overprovisioned. Techniques that increase the efficient usage of the available power need to be developed. Exascale systems will be heterogeneous; even systems with homogeneous cores become heterogeneous due to production variability, which takes effect under power limits. Careful distribution of power among jobs and within jobs, as well as application and system configurations for jobs, will be important techniques for these power-limited and overprovisioned systems.
Axel Auweter added to this introduction deep insights into the electricity market in Germany, its complex price structure, and the challenges for German compute centers to act successfully on that market.
An introduction from the embedded field to Dark Silicon was given by Sri Parameswaran from the University of New South Wales. The continuous decrease in feature size without an appropriate decrease in the threshold voltage leads to increased power density. Between 50% and 90% of dark silicon is expected in future chips. Mitigation techniques are energy reduction techniques as well as spatial and temporal dimming of cores. Considerable energy reduction can be achieved from heterogeneity on various levels, e.g., heterogeneous cores and the DarkNoC approach.
Dark Silicon due to Power Density
Several techniques were presented to mitigate the effect of power density. Santiago Pagani presented spatial and temporal dimming of cores to make best use of the thermal distribution on the chip. He and Andrey Semin also talked about boosting the core frequency to exceed the power limit for a short time period to speed up computation. Sergio Bampi presented near-threshold computing as a potential solution based on lowering the supply voltage to a level close to the threshold voltage. Michael Niemier explored the potential of new transistor technology to mitigate the Dark Silicon effect.
Dark Silicon due to Limited Power
Mitigation techniques in this field are quite similar in mobile computing and HPC, although the overall objective is a bit different. While in mobile computing the goal is to use the minimal power required to meet the QoS requirements of applications, in HPC it is to go as fast as possible with the available power, possibly considering energy efficiency and system throughput as well.
The following approaches relevant for mobile computing and HPC were presented: Heterogeneity in various hardware aspects can be used to reduce the energy consumption of computations. Siddharth Garg and Tulika Mitra covered performance heterogeneity in scheduling tasks for big/little core combinations. Tulika Mitra and Andrea Bartolini talked about using functional heterogeneity, e.g., accelerators, in mobile computing and HPC to increase energy efficiency. The Heterogeneous Tile Architecture was introduced in the presentations of Sri Parameswaran and Santiago Pagani as a general architecture enabling exploitation of heterogeneity to mitigate the Dark Silicon effect.
Another approach is to determine the most efficient application and system configuration. Static tuning of parameters, such as the power budget of an application, was presented by Michael Knobloch and Tapasya Patki. Dynamic tuning techniques were covered in the presentations of Michael Gerndt, Martin Schulz, and Per Gunnar Kjeldsberg. Jonathan Eastep introduced the GEO run-time infrastructure for distributed machine-learning based power and performance management.
Kirk Cameron highlighted the unexpected effects of changing the core frequency due to non-linear dependencies. Jürgen Teich talked about Invasive Computing providing dynamic resource management not only for improving certain non-functional application aspects but also for increasing the predictability of those aspects.
Wolfgang Nagel and Sri Parameswaran presented energy efficient network architectures. They covered heterogeneous on-chip network architectures and wireless communication within compute clusters.
Approximate computing was presented by Sergio Bampi. It allows trading off accuracy against energy. In his presentation, Pietro Cicotti covered data movement optimization within a CPU to save energy.
Application and system monitoring is a prerequisite for many of the above techniques. Michael Knobloch, Wolfgang Nagel, and Kathleen Shoga presented application and system monitoring techniques based on software as well as hardware instrumentation. Many compute centers are installing infrastructures to gather sensor values from the whole facility to enable future analysis. In addition to performance and energy measurements for applications, higher-level information about the application characteristics is useful in making tuning decisions. Tapasya Patki presented application workflows as a means to gather such information.
Besides these generally applicable techniques, some presentations covered also techniques that are specific to HPC installations with their batch processing approach and large compute systems.
In his presentation, Andrea Bartolini highlighted the holistic multiscale aspect of power-limited HPC. The application, the compute system, and the cooling infrastructure have to be seen as a complex integrated system. Power-aware scheduling, presented by Tapasya Patki and Andrea Bartolini, can significantly improve the throughput of power-limited HPC systems, and moldable jobs can significantly increase the effect of power-aware scheduling. Isaias Compres presented Invasive MPI, an extension of MPI for programming moldable applications.
At the end of the seminar a list of takeaway messages was collected based on working-group discussions followed by an extensive discussion of all participants:
- Dark silicon is a thermal problem in embedded and a power problem in HPC. HPC can cool its systems down, whereas in the embedded world this is not possible. Therefore, HPC can power up everything if enough power is available. But the costs for providing enough power for rare use cases have to be justified.
- Better tools are required on both sides to understand and optimize applications.
- Better support for optimizations is required through the whole stack from high level languages down to the hardware.
- In both communities run-time systems will get more important. Applications will have to be written in a way that run-time systems can work effectively.
- Task migration is of interest to both groups in combination with appropriate run-time management techniques.
- Embedded also looks at specialized hardware designs while HPC has to use COTS. In HPC, the machine architecture might be tailored towards the application areas. Centers are specialized for certain customers.
- Heterogeneity on architecture level is important to both groups for energy reduction.
- Better analyzable programming models are required, providing composable performance models.
- HPC will have to live with variability. The whole tuning step has to change since reproducibility will no longer be given.
- Hardware-software co-design will get more important for both groups.
- Both areas will see accelerator-rich architectures. Some silicon has to be switched off anyway, thus these can be accelerators that might not be useful for the current applications.
- Axel Auweter (LRZ - München, DE)
- Sergio Bampi (Federal University of Rio Grande do Sul, BR)
- Andrea Bartolini (University of Bologna, IT & ETH Zürich, CH)
- Kirk Cameron (Virginia Polytechnic Institute - Blacksburg, US)
- Pietro Cicotti (San Diego Supercomputer Center, US)
- Isaías Alberto Comprés Ureña (TU München, DE)
- Jonathan Eastep (Intel - Hillsboro, US)
- Siddharth Garg (New York University, US)
- Michael Gerndt (TU München, DE)
- Michael Glaß (Universität Erlangen-Nürnberg, DE)
- Per Gunnar Kjeldsberg (NTNU - Trondheim, NO)
- Michael Knobloch (Jülich Supercomputing Centre, DE)
- Tulika Mitra (National University of Singapore, SG)
- David Montoya (Los Alamos National Lab., US)
- Wolfgang E. Nagel (TU Dresden, DE)
- Michael Niemier (University of Notre Dame, US)
- Santiago Pagani (KIT - Karlsruher Institut für Technologie, DE)
- Sri Parameswaran (UNSW - Sydney, AU)
- Tapasya Patki (LLNL - Livermore, US)
- Barry L. Rountree (LLNL - Livermore, US)
- Martin Schulz (LLNL - Livermore, US)
- Andrey Semin (Intel GmbH - Feldkirchen, DE)
- Kathleen Shoga (LLNL - Livermore, US)
- Jürgen Teich (Universität Erlangen-Nürnberg, DE)
- modelling / simulation
- optimization / scheduling
- Parallel Computing
- Programming Tools
- Performance Analysis and Tuning
- Dark Silicon
- Power Density
- Power Modelling
- Resource Management