14.12.14 - 19.12.14, Seminar 14511

Programming Languages for Big Data (PlanBig)

This seminar description was published on our web pages before the seminar and was used in the invitation to the seminar.

Motivation

We have all witnessed a dramatic increase in the number of domain-specific languages or libraries for interfacing with other computing paradigms (data-parallelism, sensor networks, MapReduce-style fault-tolerant parallelism, distributed programming, Bayesian inference engines, SAT or SMT solvers, or multi-tier Web programming), as well as techniques for language-integrated querying or processing of data over other data models (XML, RDF, JSON). Much of this activity is spurred by the opportunities offered by so-called "Big Data", that is, large-scale, data-intensive computing. Such techniques have already benefited from programming language concepts. For example, MapReduce’s "map" and "reduce" operators are based on classical list-manipulation primitives introduced in LISP.
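As a rough illustration of that lineage, the following sketch counts words using only Python's built-in `map` and `functools.reduce`, which are direct descendants of those list primitives. The function name `word_count` and the sample input are hypothetical, chosen for this example only; the code runs on a single machine and leaves out the partitioning, shuffling, and fault tolerance that an actual MapReduce system provides.

```python
from functools import reduce


def word_count(documents):
    """Count word occurrences with a map step followed by a reduce step."""
    # "map" step: turn each document into a list of (word, 1) pairs
    mapped = map(lambda doc: [(w.lower(), 1) for w in doc.split()], documents)

    # "reduce" step: fold the per-document pairs into one dictionary of totals
    def merge(counts, pairs):
        for word, n in pairs:
            counts[word] = counts.get(word, 0) + n
        return counts

    return reduce(merge, mapped, {})


if __name__ == "__main__":
    print(word_count(["big data", "Big languages"]))
```

In a distributed setting the map step would run independently on each partition of the input, which is what makes the pattern attractive for large data sets.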

Programming systems that manipulate big data pose many challenges. As the amount of data being processed grows beyond the capabilities of any one computer system, the problem of effectively programming multiple computers, each possibly with multiple CPUs, GPUs, or software subsystems, becomes unavoidable; issues such as security, trust, and provenance become increasingly entangled with classical efficiency and correctness concerns. Some of these problems have been studied for decades: for example, integration of relational database capabilities and general-purpose programming languages has been a long-standing challenge, with some approaches now in mainstream use (such as Microsoft’s LINQ). Other problems may require advances in the foundations of programming languages.

Programming that crosses multiple execution models is increasingly required for modern applications, using paradigms both established (e.g., database, dataflow, or data-parallel computing models) and emerging (e.g., multicore, GPU, or software-defined networking). Cross-model programs that execute in multiple (possibly heterogeneous) environments pose much more challenging security, debugging, validation, and optimization problems than programs written in conventional programming languages. Both big data and massively parallel systems currently rely on systems-based methods and testing regimes that cannot offer guarantees of safety, security, correctness, and evolvability. In a purely systems-based approach these problems are hard to even enunciate, let alone solve. Language-based techniques, particularly formalization, verification, abstraction, and representation independence, are badly needed to reconcile the performance benefits of advanced computational paradigms with the advantages of modern programming languages.

These problems are currently being addressed in a variety of communities, often with methods that have a great deal in common: the use of comprehensions to structure database queries, data-parallel computations, or MapReduce/Hadoop jobs; the use of semantics to clarify the meaning of new languages and the correctness of optimizations; the use of static analyses to optimize large-scale jobs effectively; and the need for increased security and assurance, including new techniques for provenance and trust. This Dagstuhl seminar on "Programming Languages for Big Data" seeks to identify and develop these common foundations in order to reap the full benefits of Big Data and associated data-intensive computing resources.

Four more specific topics are proposed to focus the seminar, although we anticipate that other topics may emerge due to future research developments or interactions at the seminar itself:

  • Static analysis and types for the performance/power optimization and reliability of big data programming
  • Language abstractions for cross-model programming
  • Language design principles for distribution, heterogeneity, and preservation
  • Trust, security, and provenance for high-confidence big data programming