Dagstuhl Seminar 16201
Synergies among Testing, Verification, and Repair for Concurrent Programs
( May 16 – May 20, 2016 )
- Julian Dolby (IBM TJ Watson Research Center - Yorktown Heights, US)
- Orna Grumberg (Technion - Haifa, IL)
- Peter Müller (ETH Zürich, CH)
- Omer Tripp (IBM TJ Watson Research Center - Yorktown Heights, US)
- Annette Beyer (for administrative matters)
Concurrency bugs are notoriously hard to reproduce, localize and repair. This is because of the explosion in behavior space of concurrent software systems compared to sequential codes. This presents a challenge for bug-finding, verification, and bug-fixing tools, which are all required to disambiguate true problems from spurious defects and represent the defects in an effective and actionable manner.
As an example, a concrete-level data-race situation, whereby two threads access a shared memory location without proper synchronization, and at least one of these threads writes to that location, is not necessarily a bug. If such a race occurs within a lock-free data structure, then it may be admissible as part of some higher-level transactional behavior enforced by the data-structure operation. In other situations, where a race indeed proves to be a bug, there is the challenging question of linking the race back to its cause such as insufficient synchronization or failure to check a precondition.
Other questions that arise are whether the bug can be reproduced with only a small number of threads (or more generally, what the minimal reproduction scenario is for the bug); whether it is inherently concurrent or may also arise in a sequential context; whether it arises due to erroneous application code, library code, or a violation of the contract between the client and library; etc.
The inherent difficulties of reasoning about concurrency bugs complicate testing, verification, and repair. For a testing algorithm, the challenge is not only to come up with effective inputs, but also to induce thread interleavings that are likely to expose concurrency defects, and beyond that, the testing algorithm must also be able to characterize the bugs it observes to output actionable and reproducible bug reports. For a verification technique, there are for instance the challenges to support specifications that permit thread-modular and procedure-modular verification (even if threads share resources), to reason about thread interactions more abstractly than by exploring all possible interleavings to achieve scalability, and to capture the subtle semantics of weak memory models. For a repair algorithm, it is pertinent to introduce the “minimal” amount of synchronization to eliminate a problem, as otherwise the fix becomes a performance bottleneck, obviating the benefit of parallelism. The threat of introducing other bugs through the fix also exists (a classic example being the introduction of a deadlock when fixing an atomicity violation).
The testing, verification and repair communities have largely worked independently on the problems of concurrency, which remain extremely difficult for all three of these communities. What is missing is more sharing of techniques and algorithms between the communities to create “hybrid” approaches for tackling concurrency bugs. One example would be to guide the formulation and refinement of verification models by dynamic testing. Another example is to derive dynamic performance metrics from testing to inform the repair algorithm. Yet another possibility is to instantiate exhaustive testing as a form of bounded model checking or utilize testing within verification to confirm whether an alarm is correct.
A main goal of our proposed seminar is to bring together researchers from the bug-finding, verification and repair communities to think through these and other ideas, present their techniques and algorithms, and examine together hybrid analysis models. We hope to also benefit from industry practitioners, who have developed their own techniques and practices for detecting, isolating and fixing concurrency bugs, and could orient the discussion toward practical challenges and solutions.
Context and Motivation
Major trends in computing infrastructure, such as multicore processors and data centers, increase the demand for concurrent software that utilizes the available resources. However, concurrent programs are notoriously difficult to develop. They are susceptible to a number of specific errors that do not occur in sequential code, such as data races, deadlock, atomicity violations, starvation, and violations of consistency models. These errors typically manifest themselves only in certain executions (for instance, under certain thread schedules), which makes them extremely difficult to detect, reproduce, localize, and repair. Established techniques for testing, verifying and repairing sequential programs are insufficient to handle concurrent software. In particular, they do not address the following challenges:
- State space explosion: The execution of a concurrent program depends not only on the inputs but also on the thread schedule and optimizations, such as memory reordering. This results in an state space that is orders of magnitude larger than for sequential programs. Bug-finding techniques, such as testing and bounded model checking, require effective ways of pruning the state space. Static verification techniques, such as deductive verification and abstract interpretation, require suitable abstractions that allow one to reason about all possible program behaviors. Finally, program repair requires techniques to predict the impact of a program change on the set of possible executions.
- Modularity: Modular techniques, such as unit testing or compositional verification, scale to large applications. However, for many properties of concurrent programs there are no modular techniques, or they require a large annotation overhead, for instance to denote the locations protected by a lock or to specify a locking order (or discipline) that ensures deadlock freedom. It is crucial to develop techniques that allow programs to be checked and repaired modularly, for instance to fix an atomicity violation by adding more thread synchronization, but without introducing a deadlock globally.
- Specifications: Testing, verification and repair may rely on specifications that express the intended program behavior, for instance in the form of test oracles or program invariants. In addition to functional properties, specifications for concurrent programs also have to express how threads cooperate, for instance via a global locking strategy. While various specification approaches exist for concurrent programs, there is no uniform formalism that handles the full range of concurrency idioms and that supports testing, verification and repair.
- Error reporting: Testing, verification and repair techniques need to disambiguate true problems from spurious defects, which is often difficult in concurrent programs. For instance, a data race is not necessarily a bug. If a race occurs within a lock-free data structure, then it may be admissible as part of some higher-level transactional behavior enforced by the data-structure operation. Moreover, it is important to present bugs in an understandable manner, for instance by providing reports with only a small number of threads and by determining whether a bug is inherently concurrent or may also arise in a sequential context.
- Liveness: Whereas for most sequential programs, termination is the only relevant liveness property, liveness (such as fairness or the absence of livelocks) is often more prevalent in concurrent programs. It is, therefore, important to develop techniques to check and enforce progress.
Program testing, verification, and repair each offer partial solutions to these challenges. This seminar was conceived with the goal of bringing together these three communities in order to develop a common understanding of the issues as well as to enable collaboration at the level of techniques and tools.
The first step toward exposing, and enabling, synergies between the three main threads of research on correctness and reliability of concurrent programs -- verification, testing and repair -- is to analyze the challenges and contributions pertaining to each of these areas in isolation. We survey work that has been done in each of these communities, based on the available literature and presentations given in the seminar, to summarize the current state of the three communities.
A main challenge in verification of concurrency properties is the prohibitive state space unfolded by thread interleavings. A hybrid solution to this problem is to specialize the static abstraction according to necessary proof conditions, arising during dynamic runs, such that the verification algorithm can scale with fine-grained abstractions (Naik, Yang). Another approach is to retain correlations among local thread states as well as the shared program state (Sagiv, Segalov). In this way, useful invariants can be proved and exploited by the verifier even if an unbounded number of threads is assumed. Refinement techniques are useful when little information is required about the environment to prove a property (Gupta). A useful idea in error reporting is to pinpoint concurrency-specific bugs (differentiating them from sequential bugs) by also running a sequential verifier and performing delta analysis (Joshi). Much like other techniques, verification greatly benefits from user specifications. For example, a parallelizing compiler is more likely to prove disjointness between loop iterations if relevant data structures (or operations) are specified as linearizable (Rinard, Diniz). This also provides a measure of modularity, enabling the separation between library linearizability checking and client verification. Modern program logics (O’Hearn, Parkinson, Gardner) provide a way of constructing correctness proofs for concurrent programs, though in general modular verification of concurrent software remains a hard problem.
Similarly to verification, testing techniques are also challenged by the state-space problem. Several ideas have been proposed in response to this problem. Open-world testing, whereby data structures or libraries referenced by an application are tested in isolation for concurrency bugs (e.g., atomicity violations), reduces the scope of testing considerably (Shacham). Interestingly, even open-world issues that cannot be recreated within the client application are often fixed by developers, which encourages further research into modular consistency properties (e.g., linearizability) (Shacham). Predictive analysis is a recent form of testing that holds the promise of high coverage at an affordable cost (Smaragdakis). Starting from a concrete trace, predictive analysis applies feasibility-preserving transformations (reordering trace events, typically through constraint solving) to detect concurrency bugs, such that soundness is guaranteed (Dolby, Huang). Another source of state-space reduction is to exploit high-level semantic guarantees, like atomicity, to abstract away intermediate trace transitions (Shacham, Tripp). This also relates to error reporting, where certain read/write conflicts give rise to spurious conflicts that can be eliminated with a higher-level view of conflict as lack of commutativity between atomic operations (Koskinen, Kulkarni). Contrary to memory-level conflict detection, commutativity-based testing requires a specification (Shacham, Tripp). Another form of specification refers to consistency relaxations, e.g. permitting certain types of read/write conflict (Thies) or specifying a computation as nondeterministic (Burnim, Tripp).
In program repair, error reporting (or localization) plays a key role, deciding the effective scope and nature of the fix. Pinpointing the exact conditions that give rise to a concurrency bug is thus critical, emphasizing the need for better testing and verification tools. Importantly, incorrect fixing may introduce concurrency bugs (e.g., a deadlock resulting from additional synchronization to fix an atomicity violation), which again highlights the need for better synergy between repair and testing/verification (Liu). Incorrect fixing also turns liveness into a concrete concern: Assuming the program previously terminated, does it also terminate after the fix? Existing solutions that ensure termination rely on iterative transformation methods as well as specialized models like Petri nets (Liu, Zhang). A common assumption in the repair community, to hold back the state-space challenge, is that concurrency bugs involve a small number of threads (typically 2) (Liblit, Liu). The hope is that better synergy with testing and verification can work toward relaxing this assumption. Semantic lifting of the concrete code, exploiting e.g. linearizability, has recently been demonstrated as a useful means to apply bottom-up/top-down fixing: First, the code is lifted into an abstract workflow, and then the workflow is concretized into a correct reimplementation (Liu, Tripp). This motivates further exploration of useful specification media for repair of concurrency defects.
Goals of the Seminar
The goal of the seminar was to promote cross fertilization among the verification, testing and repair communities, as they seem to be running into the same challenges, thereby solving increasingly similar problems. At the extreme, verification is about all possible program behaviors, testing is about running the program to see what it does, and repair is about generating new code. However, many techniques in all communities now blur the distinction. Use of dynamic information to guide abstractions in verification is one example; another is how predictive testing looks for bugs in possible executions close to a dynamic one, leading to a form of verification; finally, program repair increasingly uses solvers to synthesize new programs and test them, which overlaps with techniques from the other areas. We intended for the seminar to bring out further areas in which these fields are closely related, and inspire further techniques that fuse these areas, which was fulfilled by some of the discussions throughout the seminar.
Below are concrete examples of connections that we meant to expose, some of which were discussed throughout the seminar:
Each area has a variety of benchmarks and competitions, and many of them ultimately focus on concurrency-specific challenges like interleavings. It seems likely that the different communities could benefit from sharing. For instance, predictive testing and verification could surely share many benchmarks, and a more standard set of benchmarks could make evaluations easier. At the same time, potential users could help ensure that any benchmarks actually measure what they care about.
Much progress in both testing and verification has been made possible by progress in solver technology, and a variety of solvers are now common in both areas. There is room to share the infrastructure itself and the common remaining challenges.
The path-specific focus of testing and the global focus of verification can aid each other, e.g. current work such as CLAP using a control flow from a specific execution to make model checking more scalable.
Though the seminar touched on techniques and approaches that generalize beyond analysis and repair of concurrent software, we feel that the overall focus on challenges posed by concurrency was justified. With this focus, we were able to stir concrete discussion and tightly connected talks.
- Mike Dodds (University of York, GB) [dblp]
- Julian Dolby (IBM TJ Watson Research Center - Yorktown Heights, US) [dblp]
- Derek Dreyer (MPI-SWS - Saarbrücken, DE) [dblp]
- Philippa Gardner (Imperial College London, GB) [dblp]
- Orna Grumberg (Technion - Haifa, IL) [dblp]
- Arie Gurfinkel (Carnegie Mellon University - Pittsburgh, US) [dblp]
- Cliff B. Jones (University of Newcastle, GB) [dblp]
- K. Rustan M. Leino (Microsoft Corporation - Redmond, US) [dblp]
- Ben Liblit (University of Wisconsin - Madison, US) [dblp]
- Andreas Lochbihler (ETH Zürich, CH) [dblp]
- Anders Møller (Aarhus University, DK) [dblp]
- Peter Müller (ETH Zürich, CH) [dblp]
- Wytse Oortwijn (University of Twente, NL) [dblp]
- Corina Pasareanu (NASA - Moffett Field, US) [dblp]
- Wolfgang J. Paul (Universität des Saarlandes, DE) [dblp]
- Arnd Poetzsch-Heffter (TU Kaiserslautern, DE) [dblp]
- Murali Krishna Ramanathan (Indian Institute of Science - Bangalore, IN) [dblp]
- Malavika Samak (Indian Institute of Science - Bangalore, IN) [dblp]
- Ilya Sergey (University College London, GB) [dblp]
- Natasha Sharygina (University of Lugano, CH) [dblp]
- Sharon Shoham Buchbinder (Tel Aviv University, IL) [dblp]
- Alexander J. Summers (ETH Zürich, CH) [dblp]
- Michael Tautschnig (Queen Mary University of London, GB) [dblp]
- Omer Tripp (IBM TJ Watson Research Center - Yorktown Heights, US) [dblp]
- Caterina Urban (ETH Zürich, CH) [dblp]
- Yakir Vizel (Princeton University, US) [dblp]
- Thomas Wahl (Northeastern University - Boston, US) [dblp]
- programming languages / compiler
- semantics / formal methods
- verification / logic
- program analysis
- program verification
- model checking
- program repair
- fault localization