Dagstuhl Seminar 14511
Programming Languages for Big Data (PlanBig)
( Dec 14 – Dec 19, 2014 )
- James Cheney (University of Edinburgh, GB)
- Torsten Grust (Universität Tübingen, DE)
- Dimitrios Vytiniotis (Microsoft Research UK - Cambridge, GB)
- Annette Beyer (for administrative matters)
We have all witnessed a dramatic increase in the number of domain-specific languages or libraries for interfacing with other computing paradigms (data-parallelism, sensor networks, MapReduce-style fault-tolerant parallelism, distributed programming, Bayesian inference engines, SAT or SMT solvers, or multi-tier Web programming), as well as techniques for language-integrated querying or processing data over other data models (XML, RDF, JSON). Much of this activity is spurred by the opportunities offered by so-called "Big Data" — that is, large-scale, data-intensive computing on massive amounts of data. Such techniques have already benefited from concepts from programming languages. For example, MapReduce’s "map" and "reduce" operators are based on classical list-manipulation primitives introduced in LISP.
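To make the connection between MapReduce and list-manipulation primitives concrete, here is a minimal single-machine sketch in Python. It is not the MapReduce system itself; the function names `map_phase` and `reduce_phase` and the sample documents are illustrative assumptions, and a real deployment would distribute both phases across many machines.

```python
from functools import reduce

def map_phase(documents):
    # "map": emit one (word, 1) pair for every word in every document
    return [(word, 1) for doc in documents for word in doc.split()]

def reduce_phase(pairs):
    # "reduce": fold the emitted pairs into a dictionary of per-word totals
    def combine(counts, pair):
        word, n = pair
        counts[word] = counts.get(word, 0) + n
        return counts
    return reduce(combine, pairs, {})

# Hypothetical input documents, for illustration only
docs = ["big data", "big languages"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts)
```

Both phases are ordinary list operations (a comprehension and a fold), which is what allows a runtime to partition and parallelize them.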
Programming systems that manipulate big data pose many challenges. As the amount of data being processed grows beyond the capabilities of any one computer system, the problem of effectively programming multiple computers, each possibly with multiple CPUs, GPUs, or software subsystems, becomes unavoidable; issues such as security, trust, and provenance become increasingly entangled with classical efficiency and correctness concerns. Some of these problems have been studied for decades: for example, integration of relational database capabilities and general-purpose programming languages has been a long-standing challenge, with some approaches now in mainstream use (such as Microsoft’s LINQ). Other problems may require advances in the foundations of programming languages.
Programming that crosses multiple execution models is increasingly required for modern applications, using paradigms both established (e.g., database, dataflow, or data-parallel computing models) and emerging (e.g., multicore, GPU, or software-defined networking). Cross-model programs that execute in multiple (possibly heterogeneous) environments pose much more challenging security, debugging, validation, and optimization problems than conventional programs. Both big data and massively parallel systems currently rely on systems-based methods and testing regimes that cannot offer guarantees of safety, security, correctness, and evolvability. In a purely systems-based approach these problems are hard to even enunciate, let alone solve. Language-based techniques, particularly formalization, verification, abstraction, and representation independence, are badly needed to reconcile the performance benefits of advanced computational paradigms with the advantages of modern programming languages.
These problems are currently being addressed in a variety of different communities, often using methods that have a great deal in common: for example, the use of comprehensions to structure database queries, data-parallelism, or MapReduce/Hadoop jobs; the use of semantics to clarify the meaning of new languages and the correctness of optimizations; the use of static analyses for effectively optimizing large-scale jobs; and the need for increased security and assurance, including new techniques for provenance and trust. This Dagstuhl seminar on "Programming Languages for Big Data" seeks to identify and develop these common foundations in order to reap the full benefits of Big Data and associated data-intensive computing resources.
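As an illustration of how a comprehension can structure a database-style query, the following Python sketch expresses a join with a filter over two in-memory collections. The tables, field names, and threshold are hypothetical, and the SQL in the comment is only a rough equivalent; language-integrated query systems would translate such an expression into a query run by the database rather than evaluate it in memory.

```python
# Hypothetical in-memory "tables", for illustration only
employees = [
    {"name": "Ada", "dept": 1, "salary": 120},
    {"name": "Bob", "dept": 2, "salary": 90},
    {"name": "Cy", "dept": 1, "salary": 80},
]
departments = [
    {"id": 1, "title": "Research"},
    {"id": 2, "title": "Sales"},
]

# Roughly: SELECT e.name, d.title
#          FROM employees e, departments d
#          WHERE e.dept = d.id AND e.salary > 85
well_paid = [
    (e["name"], d["title"])
    for e in employees
    for d in departments
    if e["dept"] == d["id"] and e["salary"] > 85
]
print(well_paid)
```

The generators play the role of the FROM clause, the condition plays the role of WHERE, and the head expression plays the role of SELECT.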
Four more specific topics are proposed to focus the seminar, although we anticipate that other topics may emerge due to future research developments or interactions at the seminar itself:
- Static analysis and types for the performance/power optimization and reliability of big data programming
- Language abstractions for cross-model programming
- Language design principles for distribution, heterogeneity, and preservation
- Trust, security, and provenance for high-confidence big data programming
Large-scale data-intensive computing, commonly referred to as "Big Data", has been influenced by, and can further benefit from, ideas from programming languages. The MapReduce programming model is an example of an idea from functional programming that has directly influenced the way distributed big data applications are written. As the volume of data has grown to require distributed processing, potentially on heterogeneous hardware, there is a need for effective programming models, compilation techniques or static analyses, and specialized language runtimes. The motivation for this seminar was to bring together researchers working on foundational and applied research in programming languages, as well as in data-intensive computing and databases, in order to identify research problems and opportunities for improving data-intensive computing.
To this end, on the database side, the seminar included participants who work on databases, query languages and relational calculi, query compilation, execution engines, distributed processing systems and networks, and foundations of databases. On the programming languages side, the seminar included participants who work on language design, integrated query languages and meta-programming, compilation, and semantics. There was a mix of applied and foundational talks, and the participants included people from universities as well as industrial labs and incubation projects.
The work that has been presented can be grouped in the following broad categories:
- Programming models and domain-specific programming abstractions (Cheney, Alexandrov, Vitek, Ulrich). How can data processing and query languages be integrated in general-purpose languages, in type-safe ways and in ways that enable traditional optimizations and compilation techniques from database research? How can functional programming ideas such as monads and comprehensions improve the programmability of big data systems? What are some language design issues for data-intensive computations for statistics?
- Incremental data-intensive computation (Acar, Koch, Green). Programming language support and query compilation techniques for efficient incremental computation for data set or query updates. Efficient view maintenance.
- Interactive and live programming (Green, Vaz Salles, Stevenson, Binnig, Suciu). What are some challenges and techniques for interactive applications? How to improve the live programming experience of data scientists? Ways to offer data management and analytics as cloud services.
- Query compilation (Neumann, Henglein, Rompf, Ulrich). Compilation of data processing languages to finite state automata and efficient execution. Programming language techniques, such as staging, for enabling implementors to concisely write novel compilation schemes.
- Data programming languages and semantics (Wisnesky, Vansummeren). Functorial semantics for data programming languages, but also foundations for languages for information extraction.
- Foundations of (parallel) query processing (Suciu, Neven, Hidders). Communication complexity results, program equivalence problems in relational calculi.
- Big data in/for science (Teubner, Stoyanovich, Ré). Challenges that arise in particle physics due to the volume of generated data. How can we use data to speed up the discovery and engineering of new materials? How to use big data systems for scientific extraction and integration from many different data sources?
- Other topics: architecture and runtimes (Ahmad), coordination (Foster), language runtimes (Vytiniotis), weak consistency (Gotsman).
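As a small illustration of the incremental computation category listed above, the following Python sketch maintains a grouped-sum view by applying insertions and deletions as deltas instead of recomputing from scratch. It is a toy example under stated assumptions, not a technique presented at the seminar: the class `SumByKeyView`, its `apply` method, and the sample updates are all hypothetical.

```python
class SumByKeyView:
    """Materialized view of SELECT key, SUM(amount) ... GROUP BY key."""

    def __init__(self):
        self.sums = {}
        self.counts = {}  # rows per key, so empty groups can be dropped

    def apply(self, key, amount, multiplicity):
        # multiplicity is +1 for an inserted row and -1 for a deleted row
        self.sums[key] = self.sums.get(key, 0) + multiplicity * amount
        self.counts[key] = self.counts.get(key, 0) + multiplicity
        if self.counts[key] == 0:
            del self.sums[key]
            del self.counts[key]

def recompute(rows):
    # Non-incremental baseline: scan every row again
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    return totals

view = SumByKeyView()
rows = []
# Hypothetical update stream: (key, amount, multiplicity)
updates = [("a", 5, +1), ("b", 3, +1), ("a", 2, +1), ("b", 3, -1)]
for key, amount, multiplicity in updates:
    view.apply(key, amount, multiplicity)
    if multiplicity > 0:
        rows.append((key, amount))
    else:
        rows.remove((key, amount))

print(view.sums)
print(view.sums == recompute(rows))
```

Each update costs constant time regardless of the size of the base data, whereas `recompute` scans all rows; the row count per key is kept so that a group disappears only when its last row is deleted, not merely when its sum happens to be zero.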
The seminar schedule involved three days of scheduled talks, followed by two days of free-form discussions, demos, and working groups. This report collects the abstracts of talks and demos, summaries of the group discussion sessions, and a list of outcomes resulting from the seminar.
- Umut A. Acar (Carnegie Mellon University, US)
- Yanif Ahmad (Johns Hopkins University - Baltimore, US)
- Alexander Alexandrov (TU Berlin, DE)
- Carsten Binnig (DHBW - Mannheim, DE)
- Giuseppe Castagna (University Paris-Diderot, FR)
- James Cheney (University of Edinburgh, GB)
- Laurent Daynès (Oracle Corporation, FR)
- Nate Foster (Cornell University, US)
- Pierre Geneves (INRIA - Grenoble, FR)
- Alexey Gotsman (IMDEA Software - Madrid, ES)
- Todd J. Green (LogicBlox - Atlanta, US)
- Torsten Grust (Universität Tübingen, DE)
- Fritz Henglein (University of Copenhagen, DK)
- Jan Hidders (TU Delft, NL)
- Christoph Koch (EPFL - Lausanne, CH)
- Tim Kraska (Brown University - Providence, US)
- Sam Lindley (University of Edinburgh, GB)
- Todd Mytkowicz (Microsoft Corporation - Redmond, US)
- Thomas Neumann (TU München, DE)
- Frank Neven (Hasselt University - Diepenbeek, BE)
- Ryan R. Newton (Indiana University - Bloomington, US)
- Kim Nguyen (University Paris-Sud - Gif sur Yvette, FR)
- Klaus Ostermann (Universität Tübingen, DE)
- Christopher Ré (Stanford University, US)
- Tiark Rompf (Purdue University - West Lafayette, US)
- Andrew Stevenson (Queen's University - Kingston, CA)
- Julia Stoyanovich (Drexel Univ. - Philadelphia, US)
- Dan Suciu (University of Washington - Seattle, US)
- Jens Teubner (TU Dortmund, DE)
- Alexander Ulrich (Universität Tübingen, DE)
- Jan Van den Bussche (Hasselt University - Diepenbeek, BE)
- Stijn Vansummeren (University of Brussels, BE)
- Marcos Vaz Salles (University of Copenhagen, DK)
- Jan Vitek (Northeastern University - Boston, US)
- Dimitrios Vytiniotis (Microsoft Research UK - Cambridge, GB)
- Ryan Wisnesky (MIT - Cambridge, US)
- data bases / information retrieval
- programming languages / compiler
- security / cryptology
- high-performance computing
- data-intensive research
- language-integrated query
- language-based security