15.11.15 - 18.11.15, Seminar 15472

Programming with "Big Code"

The following text appeared on our web pages prior to the seminar, and was included as part of the invitation.


This seminar aims to bring together researchers working in machine learning, natural language processing, programming languages, and software engineering to develop new interdisciplinary techniques and tools for learning from massive code bases (also known as "Big Code").

The proliferation of freely available code in repositories such as GitHub has created a unique opportunity to learn from and extract valuable information from these codebases. Hidden within them is implicit information about the correct usage of languages and libraries, about common optimizations and bugs, and about prevailing programming styles. Statistical analysis of source code has the potential to unlock this information by finding common patterns across large amounts of code and integrating them into traditional development environments and program analysis tools. As with statistical engines in other domains (e.g., Google Translate), the statistical information extracted from these repositories can in turn drive a new generation of probabilistic and approximate programming tools that solve problems and programming challenges difficult or impossible to address with today's rule-based techniques.
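To make the idea of "finding common patterns in large amounts of code" concrete, here is a minimal, hypothetical sketch (not a system from the seminar): a token-level bigram model trained on a tiny corpus of code snippets, which can then suggest the statistically most likely next token, the basic mechanism behind statistical code completion.

```python
import re
from collections import Counter, defaultdict

def tokenize(code):
    # Crude lexer: identifiers/keywords, numbers, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

class BigramModel:
    """Token-level bigram model over source code."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, corpus):
        for snippet in corpus:
            tokens = ["<s>"] + tokenize(snippet)
            for prev, cur in zip(tokens, tokens[1:]):
                self.counts[prev][cur] += 1

    def suggest(self, prev_token, k=3):
        # The k most frequent tokens observed after prev_token.
        return [tok for tok, _ in self.counts[prev_token].most_common(k)]

# Toy "Big Code" corpus; real systems train on millions of files.
corpus = [
    "for i in range(10): total += i",
    "for x in range(len(items)): print(items[x])",
    "for j in range(n): s += a[j]",
]
model = BigramModel()
model.train(corpus)
print(model.suggest("in"))     # "range" dominates after "in" in this corpus
print(model.suggest("range"))  # "(" always follows "range"
```

Even on three snippets, the model recovers the idiom "for ... in range(...)"; scaled to millions of repositories, the same counting idea yields surprisingly strong code-completion baselines.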

This difficult challenge requires an interdisciplinary approach spanning several areas of computer science: building statistical models requires techniques from machine learning and natural language processing; extracting useful information from programs requires techniques from programming languages, in particular program analysis; and evaluating working end-to-end systems requires approaches from software engineering.

While there has been encouraging work in this direction in the last few years, many open questions remain, including:

  • Which statistical models have proven successful so far, and why?
  • What tasks in the design, development, understanding, and maintenance of programs can be improved by leveraging statistical inference?
  • Which programming challenges lend themselves to which statistical models?
  • How can the output of traditional program analysis techniques be fed into statistical analysis, e.g., as features for a classifier?
  • What representations of the program are most effective for which kinds of statistical analysis?
  • How do different families of programming languages — functional, declarative, statically/dynamically typed, etc. — affect which statistical models are most appropriate?
  • How do we meaningfully evaluate statistical reasoning systems?
  • How can statistical analysis of programs be used to automatically tutor students who are learning to program?
  • Can statistical analysis feed back into the design of the programming languages themselves?
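One of these questions, feeding program-analysis output into statistical models, can be illustrated with a deliberately simplified sketch (all feature choices and labels here are hypothetical): shallow static-analysis facts extracted from Python ASTs serve as the feature vector for a nearest-neighbour classifier over hand-labelled snippets.

```python
import ast
from math import dist

def analysis_features(code):
    """Simple static-analysis features for a Python snippet:
    (number of loops, number of function calls, number of try blocks)."""
    tree = ast.parse(code)
    loops = sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))
    calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))
    trys = sum(isinstance(n, ast.Try) for n in ast.walk(tree))
    return (loops, calls, trys)

def nearest_label(features, examples):
    # 1-nearest-neighbour over labelled (feature-vector, label) pairs.
    return min(examples, key=lambda e: dist(features, e[0]))[1]

# Tiny hand-labelled training set; the labels are purely illustrative.
training = [
    (analysis_features("for i in range(10):\n    f(i)"), "iterative"),
    (analysis_features("try:\n    g()\nexcept ValueError:\n    pass"), "defensive"),
]

snippet = "while not done:\n    step()"
print(nearest_label(analysis_features(snippet), training))
```

The point is the pipeline shape, not the toy classifier: a program analysis (here, AST traversal) produces the features, and a statistical learner makes the final judgement, so richer analyses (types, data flow, alias information) can be plugged in as additional feature dimensions.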

Addressing these questions cannot be done in isolation and truly requires a joint effort from experts in different communities. Towards that end, this seminar brings together researchers working on programming languages, software engineering, machine learning, and natural language processing.