
Dagstuhl Seminar 15472

Programming with "Big Code"

( Nov 15 – Nov 18, 2015 )





This seminar aims to bring together researchers working in machine learning, natural language processing, programming languages, and software engineering in order to develop new interdisciplinary techniques and tools for learning from massive code bases (aka “Big Code”).

The proliferation of freely available codebases in repositories such as GitHub has created a unique opportunity for learning and extracting valuable information from these codebases. Hidden within them is implicit information about the correct usage of languages and libraries, common optimizations and bugs, and common programming styles. Statistical analysis of source code has the potential to unlock this information by finding common patterns in large amounts of code and integrating them with traditional development environments and program analysis tools. As with statistical engines in other domains (e.g., Google Translate), the statistical information extracted from these repositories can in turn drive a new generation of probabilistic and approximate programming tools that solve problems and programming challenges that are difficult or impossible to address with today’s rule-based techniques.

This difficult challenge requires an interdisciplinary approach spanning several areas of computer science: building statistical models inherently requires techniques from machine learning and natural language processing, extracting useful information from programs requires techniques from programming languages and in particular program analysis, while evaluating working end-to-end systems requires approaches from software engineering.

While there have been several encouraging works in this direction in the last few years, many open questions remain, including:

  • Which current statistical models have proven successful, and why?
  • What tasks in the design, development, understanding, and maintenance of programs can be improved by leveraging statistical inference?
  • Which programming challenges lend themselves to which statistical models?
  • How can the output of traditional program analysis techniques be fed into statistical analysis, e.g., as a feature in the classifier?
  • What representations of the program are most effective for which kinds of statistical analysis?
  • How do different families of programming languages — functional, declarative, statically/dynamically typed, etc. — affect which statistical models are most appropriate?
  • How do we meaningfully evaluate statistical reasoning systems?
  • How can statistical analysis of programs be used to automatically tutor students who are learning to program?
  • Can statistical analysis feed back into the design of the programming languages themselves?

Addressing these questions cannot be done in isolation and truly requires a joint effort from experts in different communities. To that end, this seminar aims to bring together researchers working on programming languages, software engineering, machine learning, and natural language processing.


The main objective of the seminar was to bring together several research communities which have so far been working separately on the emerging topic of "Big Code" and to foster a new community around the topic. Over the last 4-5 years there have been several developments and interesting results involving "Big Code", spanning a wide range of fields and conferences. The seminar brought these communities together and enabled them to interact for the first time.

The program was structured as a series of talks interspersed with discussion. Almost all seminar participants gave a talk on their latest research. Although the initial plan was to include special discussion sessions, each talk triggered so much discussion, both during the talk itself and afterwards, that there was no need for dedicated discussion slots. We believe the seminar was successful in setting the right atmosphere for open-ended discussion and achieved the desired effect of triggering much organic interaction.

Only the morning of the last day included a short wrap-up discussion session, which focused on the future of the area, on defining common data sets, and on future challenges the community can address. That discussion is summarized in the working group report.

The seminar was highly interdisciplinary, involving experts from programming languages, software engineering, machine learning, and natural language processing. Further, it brought together research groups from Europe, Asia, and the U.S., all working on the topic of "Big Code", and raised awareness of and familiarity with what different research groups are working on.

The talks and discussions spanned several topics, including the kinds of statistical methods used (e.g., n-gram models, recurrent neural networks, graphical models, and probabilistic grammars), new programming applications that can benefit from these models (e.g., code completion, code search, code similarity, and translating natural language to code), and the interaction between the two. Some of the presentations were introductory or overview in nature, while others focused on the more technical aspects of particular programming tools and machine learning models.
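
To make the pairing of a statistical method with a programming application concrete, the following is a minimal sketch of a bigram (2-gram) token model used for code completion. It is an illustration only and not a method presented at the seminar: the tokenizer, the function names (tokenize, train_bigrams, suggest), and the three-line corpus are all hypothetical, and a realistic system would use a proper lexer, longer contexts, and smoothing.

```python
from collections import Counter, defaultdict
import re


def tokenize(code):
    # Crude illustrative lexer: identifiers, integers, or any single
    # non-space character. A real tool would use the language's own lexer.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def train_bigrams(snippets):
    # For every token, count which tokens follow it in the corpus.
    counts = defaultdict(Counter)
    for snippet in snippets:
        tokens = tokenize(snippet)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prev_token, k=3):
    # Return up to k tokens most frequently seen after prev_token.
    followers = counts.get(prev_token, Counter())
    return [tok for tok, _ in followers.most_common(k)]


# Hypothetical toy corpus standing in for a large code base.
corpus = [
    "for i in range(n): total += i",
    "for x in items: print(x)",
    "for key in table: print(key)",
]
model = train_bigrams(corpus)

print(suggest(model, ":", k=1))  # most frequent token after ":"
print(suggest(model, "print"))   # tokens seen after "print"
```

On this toy corpus, the token seen most often after ":" is "print", so that is what the model proposes first. The same counting scheme extends to longer contexts (3-grams and beyond) at the cost of sparser counts.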

After two days of presentations and discussions, we used the last day of the seminar (before lunch) to summarize the discussions and to outline a future research direction. A suggestion enthusiastically embraced by everyone was to create a web site which lists the current data sets, challenges, tools, and research groups working on the topic. The view was that this would not only enable existing groups to compare their tools on common problems and data sets but would also make it much easier for other research groups and graduate students to get into the area and start contributing. It would also serve as a useful instrument for raising awareness about the topic.

We have now created this web site and made it publicly available.

In a short time, several groups have started contributing by uploading links to tools, data sets and challenges.

Overall, the seminar was successful both in stimulating new and fruitful interaction between research communities that were working in the area but had so far been separated, and in setting a common agenda moving forward. Due to the high interest in and feedback from this seminar, we anticipate that in a year or two we will be ready to propose a larger seminar on the topic.

Copyright William W. Cohen, Charles Sutton, and Martin Vechev

  • Miltos Allamanis (University of Edinburgh, GB) [dblp]
  • Earl T. Barr (University College London, GB) [dblp]
  • Jason Breck (University of Wisconsin - Madison, US) [dblp]
  • Swarat Chaudhuri (Rice University - Houston, US) [dblp]
  • William W. Cohen (Carnegie Mellon University, US) [dblp]
  • Premkumar T. Devanbu (University of California - Davis, US) [dblp]
  • Shi Han (Microsoft Research - Beijing, CN) [dblp]
  • Kenneth Heafield (University of Edinburgh, GB) [dblp]
  • Abram Hindle (University of Alberta - Edmonton, CA) [dblp]
  • Suresh Jagannathan (Purdue University - West Lafayette, US) [dblp]
  • Christopher M. Jermaine (Rice University - Houston, US) [dblp]
  • Dongsun Kim (University of Luxembourg, LU) [dblp]
  • Dana Movshovitz-Attias (Google Inc. - Mountain View, US) [dblp]
  • Tien N. Nguyen (Iowa State University, US) [dblp]
  • Sebastian Proksch (TU Darmstadt, DE) [dblp]
  • Christopher Quirk (Microsoft Corporation - Redmond, US) [dblp]
  • Veselin Raychev (ETH Zürich, CH) [dblp]
  • Armando Solar-Lezama (MIT - Cambridge, US) [dblp]
  • Charles Sutton (University of Edinburgh, GB) [dblp]
  • Daniel Tarlow (Microsoft Research UK - Cambridge, GB) [dblp]
  • Martin Vechev (ETH Zürich, CH) [dblp]
  • Nicolas Voirol (EPFL - Lausanne, CH) [dblp]
  • Eran Yahav (Technion - Haifa, IL) [dblp]
  • Andreas Zeller (Universität des Saarlandes, DE) [dblp]
  • Xin Zhang (Georgia Institute of Technology - Atlanta, US) [dblp]

  • artificial intelligence / robotics
  • programming languages / compiler
  • software engineering

  • statistical programming tools
  • machine learning
  • natural language processing
  • programming languages
  • software engineering