Dagstuhl Seminar 13401
Automatic Application Tuning for HPC Architectures
(Sep 29 – Oct 04, 2013)
- Siegfried Benkner (Universität Wien, AT)
- Franz Franchetti (Carnegie Mellon University, US)
- Michael Gerndt (TU München, DE)
- Jeffrey K. Hollingsworth (University of Maryland - College Park, US)
Parallel computer systems, especially those for High Performance Computing (HPC), are becoming increasingly complex, and the reasons are manifold. HPC systems today with a peak performance of several petaflops have hundreds of thousands of cores that have to work together efficiently. These machines have a deep memory and interconnect hierarchy, which programmers have to understand in order to tune their programs to exploit the faster levels of that hierarchy. In addition, to reduce the power consumption of those systems, advanced hardware and software techniques are applied, such as the use of GPUs and the reduction of processor clock frequencies. This turns a homogeneous system into a heterogeneous one, which complicates programming tasks such as load balancing and efficient communication.
The complexity of today's parallel architectures has a significant impact on the performance of parallel applications and their energy consumption. Because low processor utilization wastes both energy and money, application developers now invest significant time in tuning their codes for current and emerging systems. This tuning is a cyclic process of gathering data, identifying code regions that can be improved, and tuning those code regions.
This seminar will bring together people working on performance and energy autotuning with people working on performance analysis tools. It will cover performance analysis techniques and tools based on profiling and tracing for large-scale parallel systems as well as their extensions for energy measurement of applications. On the autotuning side, it will cover self-tuning libraries, tools to automatically apply compiler optimizations, autotuners for application-level parameters, as well as frameworks combining ideas from all the other areas.
While analysis tools can pinpoint performance problems, coupling them with performance tuning might make those tools even more successful. Presentations by experts in both areas will increase interest in and knowledge of the techniques applied in the other area, steer future collaborations, and might also lead to concrete ideas for coupling performance analysis and performance tuning tools.
The seminar will be organized in the very successful style of many Dagstuhl seminars: presentations by the participants as well as discussion sessions to explore possible connections between the performance tool developers and the tuning experts.
Parallel computer systems, especially those for High Performance Computing, are becoming increasingly complex, and the reasons are manifold. HPC systems today with a peak performance of several petaflops have hundreds of thousands of cores that have to work together efficiently. These machines have a deep memory and interconnect hierarchy, which programmers have to understand in order to tune their programs to exploit the faster levels of that hierarchy. In addition, to reduce the power consumption of those systems, advanced hardware and software techniques are applied, such as the use of GPUs, which are highly specialized for regular data-parallel computations through simple compute cores and high bandwidth to the graphics memory. Another technique is to reduce the clock frequency of processors when appropriate, e.g. when the application, or a phase of its execution, is memory bound. This turns a homogeneous system into a heterogeneous one, which complicates programming tasks such as load balancing and efficient communication.
There are a growing number of autotuning researchers in Europe, the United States, and Asia. However, there are relatively few opportunities for these researchers to meet. The unique format of a Dagstuhl seminar provides the opportunity to bring together researchers from around the world who are using different approaches to autotuning.
This workshop brought together people working on autotuning with people working on performance analysis tools. While analysis tools can pinpoint performance problems, coupling them with performance tuning might make those tools even more successful. The presentations by experts in both areas increased interest in and knowledge of the techniques applied in the other area, and should steer future collaborations and might also lead to concrete ideas for coupling performance analysis and performance tuning tools.
The workshop was driven by the European FP7 project AutoTune, which started on October 15, 2011. The goal of AutoTune is to implement the Periscope Tuning Framework, based on the automatic performance analysis tool Periscope, coupling Periscope's performance analysis with performance and energy-efficiency tuning in an online approach.
Performance Analysis. Performance analysis tools support the programmer in the first two tasks of the tuning cycle. Performance data are gathered by monitoring the application's execution. They are either summarized and stored as profile data, or recorded in full detail in so-called trace files. In addition to monitoring, performance analysis tools also provide means to analyze and interpret the gathered performance data and thus to detect performance problems. The analysis is supported either by graphical displays or by annotations of the source code.
State-of-the-art performance analysis tools fall into two major classes depending on their monitoring approach: profiling tools and tracing tools. Profiling tools summarize performance data over the whole execution and provide information such as the execution time of code regions, the number of cache misses, the time spent in MPI routines, and the synchronization overhead of OpenMP synchronization constructs. Tracing tools record individual events, typically generate huge trace files, and provide means to visually analyze those data to identify bottlenecks in the execution.
Representatives of these two classes are gprof, ompP, and Vampir. Gprof is the GNU profiler. It provides a flat profile and a call-path profile of the program's functions; the measurements are done by instrumenting the application. OmpP is a profiling tool for OpenMP developed at TU München and the University of Tennessee. It is based on instrumentation with Opari and determines certain overhead categories of parallel regions. In contrast to the previous two tools, Vampir is a commercial trace-based performance analysis tool from Technische Universität Dresden. It provides a powerful visualization of traces and scales to thousands of processors thanks to a parallel visualization server.
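The kind of flat profile such tools produce for compiled code can be illustrated, on a much smaller scale, with Python's built-in cProfile module. This is only a sketch of the concept with an invented toy workload, not of gprof's own instrumentation mechanics:

```python
import cProfile
import io
import pstats

def inner(n):
    # Simulated hot inner kernel: sum of squares.
    return sum(i * i for i in range(n))

def outer():
    # Calls the kernel repeatedly, as a compute loop would.
    return [inner(10_000) for _ in range(50)]

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

# Print a flat profile: per-function call counts and cumulative times,
# analogous to the per-function view a profiler gives for compiled code.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```

The output attributes time to functions (`outer`, `inner`) rather than to individual events, which is exactly the summarization/tracing distinction drawn above.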
The major research challenges in the development of performance analysis tools are to automate the analysis and to improve the scalability of the tools. Automation of the analysis is important to ease the application developer's task. Starting from the formalization of performance properties in the European-American working group APART (http://www.fz-juelich.de/apart), automatic performance analysis tools were developed. Paradyn from the University of Wisconsin was the first automatic online analysis tool; its performance consultant guided the search for performance bottlenecks while the application was executing. Today, the most important representatives are SCALASCA and Periscope. SCALASCA is an automatic performance analysis tool developed at Forschungszentrum Jülich and the German Research School on Simulation Sciences. It is based on performance profiles as well as on traces. The automatic trace analysis determines MPI wait times via a parallel trace replay on the application's processors after the application's execution has terminated.
Periscope is an automatic performance analysis tool for highly parallel applications written in MPI and/or OpenMP, currently under development at Technische Universität München. It is a representative of the class of automatic performance analysis tools that automate the whole analysis procedure. Unique to Periscope is that it is an online tool and that it works in a distributed fashion: the analysis is done while the application is executing (online), by a set of analysis agents, each searching for performance problems in a subset of the application's processes (distributed). The properties found by Periscope point to code regions that might benefit from further tuning.
Performance Autotuning. The central part of the tuning process is the search for the best combination of code transformations and parameter settings of the execution environment. This creates an enormous search space, which further complicates the whole tuning task. As a result, much research has been dedicated to autotuning in recent years and many different ideas have been gathered. These can be grouped into four categories:
- self-tuning libraries for linear algebra and signal processing like ATLAS, FFTW, OSKI and SPIRAL;
- tools that automatically analyze alternative compiler optimizations and search for their optimal combination;
- autotuners that search a space of application-level parameters that are believed to impact the performance of an application;
- frameworks that try to combine ideas from all the other groups.
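Across all four categories, the core of an empirical autotuner is the same loop: generate a candidate configuration, measure it, and keep the best. A minimal sketch of that loop, in which the parameter names (`block_size`, `unroll`) and the timed toy kernel are invented for illustration:

```python
import itertools
import time

def run_candidate(block_size, unroll):
    # Stand-in for compiling and running a code variant; here we just
    # time a toy computation whose cost depends on the parameters.
    start = time.perf_counter()
    total = 0
    for i in range(0, 100_000, unroll):
        total += i % block_size
    return time.perf_counter() - start

# The search space: all combinations of the tuning parameters.
space = itertools.product([16, 32, 64, 128], [1, 2, 4, 8])

best_config, best_time = None, float("inf")
for block_size, unroll in space:
    elapsed = run_candidate(block_size, unroll)
    if elapsed < best_time:
        best_config, best_time = (block_size, unroll), elapsed

print("best configuration:", best_config)
```

Exhaustive enumeration like this is only feasible for tiny spaces; the approaches below differ mainly in how they prune or model this search.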
The first category contains special-purpose libraries that are highly optimized for one specific area. The Automatically Tuned Linear Algebra Software (ATLAS) supports developers in creating numerical programs. It automatically generates and optimizes the popular Basic Linear Algebra Subprograms (BLAS) kernels for the architecture in use. Similarly, FFTW is a library for computing the discrete Fourier transform on different systems. Due to FFTW's design, an application using it will perform well on most architectures without modification.
However, the growing diversity of parallel application areas requires a more general autotuning strategy. Thus, substantial research has been devoted to an application-independent approach to autotuning: the automatic search for the right compiler optimizations on a specific platform. Such tools fall into two groups according to their methodology: iterative search tools and tools using machine-learning techniques. There has been much work in the first category. All these tools share the idea of iteratively enabling certain optimizations: they run the compiled program, monitor its performance, and based on the outcome decide on the next combination to try. Due to the huge size of the search space, these tools are relatively slow. The combined elimination (CE) algorithm greatly improves on earlier search-based methods.
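Combined elimination can be sketched roughly as follows: start with all optimizations enabled and repeatedly disable the flag whose removal yields the largest measured improvement, re-measuring after each elimination. This is a simplified, greedy rendering of the idea, not the published algorithm in full; the `measure` function below is a synthetic stand-in for compiling and timing the program:

```python
def measure(enabled_flags):
    # Synthetic stand-in for compile-and-run time: in this toy model,
    # flag "b" hurts performance and the other flags help a little.
    cost = 10.0
    penalties = {"a": -0.5, "b": +1.5, "c": -0.3, "d": -0.2}
    for flag in enabled_flags:
        cost += penalties[flag]
    return cost

def combined_elimination(flags):
    enabled = set(flags)
    baseline = measure(enabled)
    while True:
        # Measure the effect of disabling each remaining flag.
        trials = {f: measure(enabled - {f}) for f in enabled}
        best_flag, best_cost = min(trials.items(), key=lambda kv: kv[1])
        if best_cost >= baseline:
            break  # no single elimination improves performance further
        enabled.remove(best_flag)
        baseline = best_cost
    return enabled

# The harmful flag "b" is eliminated, the helpful ones are kept.
print(combined_elimination(["a", "b", "c", "d"]))  # -> {'a', 'c', 'd'}
```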
The second branch of compiler-based autotuners applies a different strategy to look for the best optimization settings. They use knowledge about the program's behavior and machine learning techniques to select the optimal combination. This approach is based on an automatically built per-system model, which maps performance counters to good optimization options. This model can then be used with different applications to guide their tuning. Current research work is also targeting the creation of a self-optimizing compiler that automatically learns the best optimization heuristics based on the behavior of the underlying platform.
Among the tools in the third category is the Active Harmony system, a runtime parameter optimization tool that focuses on the application-dependent parameters that are performance critical. The system tries to improve performance during a single execution based on observed historical performance data. It can be used to tune parameters such as the size of a read-ahead buffer or the choice of algorithm (e.g., heap sort vs. quick sort). In contrast to Active Harmony, the work of Nelson interacts with the programmer to obtain high-level models of the impact of parameter values. These models are then used by the system to guide the search for optimization parameters. This approach, in which models and empirical techniques are combined, is called model-guided empirical optimization.
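A runtime parameter search of this kind can be sketched as a simple hill climb over a single parameter. The cost model below, a read-ahead buffer that performs worst when far from an assumed sweet spot of 256 KB, is invented for illustration and is not Active Harmony's actual search algorithm:

```python
def cost(buffer_kb):
    # Invented cost model: too small -> too many requests,
    # too large -> cache pollution; minimum is at 256 KB.
    return (buffer_kb - 256) ** 2 / 1000 + 1.0

def hill_climb(start, step=32, max_iters=100):
    # Move to the cheapest neighbor until no neighbor improves.
    current = start
    for _ in range(max_iters):
        neighbors = [n for n in (current - step, current + step) if n > 0]
        best = min([current] + neighbors, key=cost)
        if best == current:
            return current  # local minimum reached
        current = best
    return current

print(hill_climb(start=64))  # -> 256
```

A real runtime tuner faces noisy measurements and multiple interacting parameters, which is why production systems use more robust search strategies than this one-dimensional climb.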
Popular examples of the last group of autotuning tools are the newly released Parallel Active Harmony and the Autopilot framework. Parallel Active Harmony is a combination of the Harmony system and the CHiLL compiler framework. It is an autotuner for scientific codes that applies a search-based autotuning approach. While monitoring the program's performance, the system investigates multiple dynamically generated versions of the detected hot loop nests. The performance of these code segments is evaluated in parallel on the target architecture and the results are processed by a parallel search algorithm. The best candidate is integrated into the application. The second popular example in this group is Autopilot, an integrated toolkit for performance monitoring and dynamic tuning of heterogeneous computational grids based on closed-loop control. It uses distributed sensors to extract qualitative and quantitative performance data from the executing applications. These data are processed by distributed actuators, and the preliminary performance benchmark is reported to the application developer.
Energy efficiency autotuning. Multi-petascale supercomputers consist of more than one hundred thousand processing cores and consume many megawatts of electrical power. Energy efficiency will be crucial for both cost and environmental reasons, and may soon become as important as pure peak performance. This is exemplified by the fact that for some years now the TOP500 list (http://www.top500.org/) has also reported power consumption values. Current procurements for high-end supercomputers show that the cost of electricity and cooling is nearly as high as that of the hardware, particularly in countries with high energy prices such as Germany. Power consumption is considered one of the greatest challenges on the road to exascale systems.
Dynamic voltage and frequency scaling provides a mechanism to operate modern processors across a broad range of clock frequencies and voltage levels, allowing performance to be traded off against energy consumption. Frequency scaling is generally based on the Advanced Configuration and Power Interface (ACPI, http://www.acpi.info/) specification, implemented as Intel's SpeedStep and AMD's Cool'n'Quiet, respectively. Processors like Intel's Sandy Bridge are fully compliant with ACPI. Sets of utilities to exploit these techniques are available, and ideas for using them for complete jobs in multi-user HPC clusters have already been described.
Whereas dynamic frequency scaling is commonly used in laptops, its application in HPC is still quite challenging. For applications using hundreds or thousands of cores, uncoordinated manipulation of the frequency by some background daemon would introduce a new source of OS jitter. Moreover, changing the processor frequency takes on the order of milliseconds and only yields a benefit if a major part of an application can be run in a given mode continuously. Typically, lowering the CPU frequency can yield a 10% decrease in power consumption while increasing the application's runtime by less than 1%. However, the impact of lowering the frequency and voltage on application performance depends on whether the application is CPU, memory, cache, or I/O bound. Code regions that are CPU or cache bound can take advantage of higher frequencies, whereas regions that are memory or I/O bound experience only minor performance impacts when the frequency is reduced. It is therefore essential to identify applications, and the parts of them, that are suited to running within a specific power envelope without sacrificing too much performance.
Different metrics for performance, cost, energy, power, cooling, and thermal conditions may apply to different usage and optimization scenarios, e.g.:
- minimizing the energy consumption while reducing the performance of an application by no more than a given percentage
- considering outside temperature conditions, i.e., if it is cold outside and free cooling is applied, increased power consumption by the compute nodes might be tolerated
- optimizing the total cost of ownership (including baseline investment, power and cooling) for given throughput requirements.
It is quite cumbersome to investigate all these conditions and the various frequency settings manually. Automatic tools are therefore required that identify suitable applications and particular code regions, and then tune the frequency and power settings to yield optimal results for the desired objectives.
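The decision such a tool must automate can be sketched with a toy model: for each code region, select the lowest frequency whose predicted slowdown stays within a user-given budget. The frequency list, the slowdown model, and the memory-boundness factors below are all invented for illustration:

```python
# Available core frequencies in GHz (invented values).
FREQUENCIES = [1.2, 1.6, 2.0, 2.4]
F_MAX = max(FREQUENCIES)

def predicted_slowdown(freq, memory_boundness):
    # Toy model: only the CPU-bound fraction of a region scales with
    # frequency; memory-bound work is unaffected by slowing the core.
    cpu_fraction = 1.0 - memory_boundness
    return cpu_fraction * (F_MAX / freq) + memory_boundness

def pick_frequency(memory_boundness, max_slowdown=1.05):
    # Lowest frequency whose predicted slowdown stays within the budget.
    for freq in sorted(FREQUENCIES):
        if predicted_slowdown(freq, memory_boundness) <= max_slowdown:
            return freq
    return F_MAX

# A strongly memory-bound region tolerates a much lower frequency than
# a CPU-bound one under the same 5% slowdown budget.
print(pick_frequency(memory_boundness=0.95))  # -> 1.2
print(pick_frequency(memory_boundness=0.1))   # -> 2.4
```

Under this model the memory-bound region can run at the lowest frequency within budget, while the CPU-bound region stays at full speed, which is precisely the per-region distinction the text argues must be identified automatically.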
The seminar was organized as a series of thematic sessions. An initial session comprised two overview presentations about performance analysis and measurement tools as well as a general introduction to autotuning, setting the overall context for the seminar. A session on support tools covered code restructuring techniques, testing environments, and performance repositories for autotuning. Two sessions on infrastructures provided insights into frameworks and environments, language support for autotuning, as well as challenges and requirements in the context of very large-scale systems. A session on energy efficiency tuning gave insight into the challenges and recent developments in optimizing HPC systems and applications with respect to energy consumption. A session on accelerator tuning covered various issues in tuning for GPUs and accelerated parallel systems. A session on techniques covered various topics related to performance-guided tuning, modeling, and scalability. A session on tools covered recent developments in empirical autotuning, semantics support for performance tools and autotuners, as well as synthesis of libraries. Various topics related to the tuning of message-passing applications and I/O-related autotuning were covered in a session on MPI and I/O tuning. The session on compiler transformations covered compiler transformations for multi-objective tuning, techniques for tuning irregular applications, as well as language and compilation support for the analysis of semantic graphs.
- Enes Bajrovic (Universität Wien, AT) [dblp]
- Shajulin Benedict (St. Xavier's Catholic College of Engineering, IN) [dblp]
- Siegfried Benkner (Universität Wien, AT) [dblp]
- Aydin Buluc (Lawrence Berkeley National Laboratory, US) [dblp]
- Milind Chabbi (Rice University - Houston, US) [dblp]
- I-hsin Chung (IBM TJ Watson Research Center - Yorktown Heights, US) [dblp]
- Isaías Alberto Comprés Ureña (TU München, DE) [dblp]
- Guojing Cong (IBM TJ Watson Research Center - Yorktown Heights, US) [dblp]
- Thomas Fahringer (Universität Innsbruck, AT) [dblp]
- Franz Franchetti (Carnegie Mellon University, US) [dblp]
- Grigori Fursin (University of Paris South XI, FR) [dblp]
- Michael Gerndt (TU München, DE) [dblp]
- Carla Guillen (LRZ - München, DE) [dblp]
- Torsten Hoefler (ETH Zürich, CH) [dblp]
- Jeffrey K. Hollingsworth (University of Maryland - College Park, US) [dblp]
- Paul D. Hovland (Argonne National Laboratory, US) [dblp]
- Toshiyuki Imamura (RIKEN - Kobe, JP) [dblp]
- Thomas Karcher (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Takahiro Katagiri (University of Tokyo, JP) [dblp]
- Michael Knobloch (Jülich Supercomputing Centre, DE) [dblp]
- Andreas Knüpfer (TU Dresden, DE) [dblp]
- Jakub Kurzak (University of Tennessee, US) [dblp]
- Allen D. Malony (University of Oregon - Eugene, US) [dblp]
- Andrea Martinez (Autonomous University of Barcelona, ES) [dblp]
- Renato Miceli Costa Ribeiro (ICHEC - Galway, IE) [dblp]
- Robert Mijakovic (TU München, DE) [dblp]
- Bernd Mohr (Jülich Supercomputing Centre, DE) [dblp]
- Shirley V. Moore (University of Texas - El Paso, US) [dblp]
- Carmen Navarrete (LRZ - München, DE) [dblp]
- Georg Ofenbeck (ETH Zürich, CH) [dblp]
- Antonio Pimenta (Autonomous University of Barcelona, ES) [dblp]
- Barry L. Rountree (LLNL - Livermore, US) [dblp]
- Martin Sandrieser (Universität Wien, AT) [dblp]
- Robert Schoene (TU Dresden, DE) [dblp]
- Martin Schulz (LLNL - Livermore, US) [dblp]
- Armando Solar-Lezama (MIT - Cambridge, US) [dblp]
- Walter F. Tichy (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Jesper Larsson Träff (TU Wien, AT) [dblp]
- Richard M. Veras (Carnegie Mellon University, US) [dblp]
- Richard Vuduc (Georgia Institute of Technology - Atlanta, US) [dblp]
- Felix Wolf (GRS for Simulation Sciences - Aachen, DE) [dblp]
- optimization / scheduling
- software engineering
- Parallel Computing
- Programming Tools
- Performance Analysis and Tuning