21. – 25. April 2014, Dagstuhl-Seminar 14171

Evaluating Software Verification Systems: Benchmarks and Competitions


Dirk Beyer (Universität Passau, DE)
Marieke Huisman (University of Twente, NL)
Vladimir Klebanov (KIT – Karlsruher Institut für Technologie, DE)
Rosemary Monahan (NUI Maynooth, IE)

Auskunft zu diesem Dagstuhl-Seminar erteilt

Dagstuhl Service Team


Dagstuhl Report, Volume 4, Issue 4 Dagstuhl Report


The seminar aimed to advance comparative empirical evaluation of software verification systems by bringing together current and future competition organizers and participants, benchmark maintainers, as well as practitioners and researchers interested in the topic.

The objectives of the seminar were to (1) advance the technical state of comparative empirical evaluation of verification tools, (2) achieve cross-fertilization between verification communities on common/related issues such as selection of relevant problems, notions of correctness, questions of programming language semantics, etc., (3) explore common evaluation of different kinds of verification tools, its appropriate scope and techniques, (4) raise mutual awareness between verification communities concerning terminology, techniques and trends, and (5) promote comparative empirical evaluation in the larger formal methods community.

Altogether, 43 researchers and practitioners have attended the seminar. A vast majority of the attendees (almost 90%) have participated either in the SV-COMP or in the VerifyThis (and related, e.,g., VSComp/VSTTE) verification competitions. For lack of better terms, we tend to refer to these communities as the "automatic verification" and "deductive verification" community respectively, though, as Section 1 discusses in more detail, these labels are based on pragmatics and history rather than on technical aspects.

The presentations, hands-on sessions, and discussions provided valuable feedback that will help competition organizers improve future installments. To continue the effort, a task force will compile a map of the--in the meantime very diverse--competition landscape to identify and promote useful evaluation techniques. It was agreed that evaluation involving both automatic and deductive verification tools would be beneficial to both communities as it would demonstrate the strengths and weaknesses of each approach. Both SV-COMP and VerifyThis will be associated with the ETAPS conference in 2015.

A call to the public: It has been reported that competition-verification challenges have been used as homework for students. The seminar organisers would appreciate feedback and experience reports from such exercises.

Seminar Structure

The seminar was structured as a highly interactive event, rather than a sequence of talks, compared to workshops or conferences. The seminar opened with a session of lightning talks that gave every participant two minutes to introduce themselves and mark their activities and interests in verification and in comparative evaluation in particular.

In order to give participants insight into different verification techniques and practical capabilities of some existing verification tools, the seminar featured short overviews of the state of the art in deductive verification resp. automatic verification, as well as several tutorials, hands-on sessions, and accompanying discussions. These included a longer tutorial on deductive verification with the Dafny system together with a hands-on session, as well as short mini-tutorials on the automatic verifiers CPAchecker and CBMC, and deductive verifiers KIV and VeriFast. Another hands-on session concluded the seminar.

Discussions on evaluation techniques and setups were initiated with presentations by competition organizers and benchmark collectors. The presented evaluation vehicles included: VerifyThis competition (deductive verification), SV-COMP (automatic verification), VSTTE competition (deductive verification), SMT-COMP (Satisfiability Modulo Theories), Run-time Verification competition, RERS challenge (Rigorous Examination of Reactive Systems), INTS benchmarks (Integer Numerical Transition Systems).

Since evaluation must be grounded with the requirements of current and prospective users of verification technology, the seminar incorporated contributions from industrial participants. Among them were a talk on the use of the SPARK deductive verification tool-set as a central tool for the development of high-integrity systems at Altran UK, a talk on the use of automatic verification tools in the Linux Driver Verification project, an accompanying discussion, as well as statements by other industry representatives (from GrammaTech, Galois, LLNL, and Microsoft Research).

Verification Communities: Remarks on Commonalities, Differences, and Terminology

Important goals of the seminar were to raise mutual awareness and to foster cross-fertilization between verification communities. We may habitually refer to communities as "automatic verification" or "deductive verification", but as time passes, these labels become less adequate.

A trend is apparent that different types of tools are slowly converging, both technically and pragmatically. Instances of both automatic and deductive verifiers may use symbolic execution or SMT solvers. Automatic verifiers can synthesize (potentially quantified) invariants, verify infinite-state systems, or systems that are heap-centric.

The pace of development is high and the surveys are costly (the last comprehensive survey on automatic verification appeared in 2008). As a consequence, community outsiders typically have an outdated--sometimes by decades--view on verification technology that does not reflect the state of the art. We expect publications from competitions to fill the void between more comprehensive surveys.

One of the terminological pitfalls concerns the use of the attribute "automatic". For instance, model checking and data-flow analysis are typically advertised as "automatic". This is indeed true in the sense that the model-checking user does not have to supply proof hints such as loop invariants to the tool. On the other hand, using a model checker in practical situations may as well require user interaction for the purpose of creating test drivers, choosing parameters, tuning abstractions, or interpreting error paths (which can be quite complex). These aspects are typically abstracted away during evaluation of automatic verifiers, which allows better standardization but does not capture all aspects that are relevant in practice.

The situation is further confused by the fact that some deductive verifiers are also advertised as "automatic", even though all established deductive verification systems require user interaction and the amount of interaction that is needed with different tools is not radically different. The main meaningful differences are rather

  1. whether user interaction happens only at the beginning of a single proof attempt or whether the user can/has to intervene during proof construction, and
  2. whether user interaction happens in a purely textual manner or whether non-textual interaction is possible/required.

The seminar has confirmed the need for improved terminology, as well as made an attempt to eliminate misconceptions and communication pitfalls. Unfortunately, there is still no widely-accepted and usable terminology to communicate these distinctions.

Summary text license
  Creative Commons BY 3.0 Unported license
  Dirk Beyer, Marieke Huisman, Vladimir Klebanov, and Rosemary Monahan


  • Semantics / Formal Methods
  • Verification / Logic


  • Formal Verification
  • Model Checking
  • Deductive Verification
  • Competitions and Benchmarks


In der Reihe Dagstuhl Reports werden alle Dagstuhl-Seminare und Dagstuhl-Perspektiven-Workshops dokumentiert. Die Organisatoren stellen zusammen mit dem Collector des Seminars einen Bericht zusammen, der die Beiträge der Autoren zusammenfasst und um eine Zusammenfassung ergänzt.


Download Übersichtsflyer (PDF).

Dagstuhl's Impact

Bitte informieren Sie uns, wenn eine Veröffentlichung ausgehend von Ihrem Seminar entsteht. Derartige Veröffentlichungen werden von uns in der Rubrik Dagstuhl's Impact separat aufgelistet  und im Erdgeschoss der Bibliothek präsentiert.


Es besteht weiterhin die Möglichkeit, eine umfassende Kollektion begutachteter Arbeiten in der Reihe Dagstuhl Follow-Ups zu publizieren.