The goal of database theory is to formalize the theoretical underpinnings of databases and then analyze them with mathematical tools. One central question in the area is query evaluation, that is the computational problem of computing the answer to a query in a database. Of particular importance in this area is understanding the complexity of query evaluation, i.e., which resources, in particular, how much time and space, are necessary and sufficient to compute the answers to a given query. This Dagstuhl Seminar will focus on three key aspects of query evaluation: representation, provenance, and explanations. One common theme of these three directions is the use of circuits, which have a long history in complexity theory. In recent years, the theory of query processing has witnessed the use of different types of circuits, sometimes over Boolean inputs and operations, but also over more general structures, which often can be formalized as different semirings. We hope to see use of circuits in different representations of inputs for expressing and computing provenance, as well as for computing explanations using scores such as Shapley values. We hope to strengthen our understandings of these topics within database theory and logic. The topics to be discussed in the seminar are the following:
(1)Representation: One of the most important basics of algorithms is the use of good data-structures. There is a general tradeoff between expressivity, compactness, and efficient usability. Finding the right compromise in this tradeoff is often crucial for having efficient algorithms. These general considerations also apply to query evaluation: since in certain situations query results can be very large when materialized extensionally, and therefore, for many efficient query answering algorithms it is important to represent them in a compact fashion. For example, this is often the key underlying elements of enumeration algorithms, sampling, access to query results under access patterns, or algorithms providing so-called direct access to query results.
(2) Provenance: It is known that the query semantics for First Order Logic admits an interpretation in an arbitrary semiring. The most specialized case of Boolean semirings denote how an output tuple has been obtained from the inputs with joint usage (joins – translate to conjunctions "∧"), and alternative usages (projections or unions – translate to disjunctions "∨"). Such semirings can be used to understand compactly how outputs are generated from inputs, and have applications in query evaluation in uncertain data and in deletion propagation – to understand how the outputs change if one or more inputs are deleted, without re-computing the query. There are more advanced semirings like tropical semirings that can capture minimum cost in network flows. Compact representations of provenance semirings for recursive queries using circuits have also been studied, although the complexity of recursive queries under semiring semantics is open (despite its major importance in modern systems for machine learning (ML) where tensor algebra is extended with recursion). Compact and efficient knowledge compilation of provenance circuits into ordered and free binary decision diagrams, and more generally as decomposable deterministic negation normal forms (OBDD, FBDD, DNNF), are also important questions in database theory with applications in probabilistic databases.
(3) Explanations: As of late, explanations based on the widely known Shapley values from co-operative game theory have been used in database theory to measure the relevance of a certain database fact to a query answer, and to measure the relevance of inputs to the outcome of an ML classifier. Since the naive computation of Shapley values is intractable, as it includes a summation over exponentially many subsets, one of the main themes behind this investigation has been the identification of practically relevant classes of database queries for which such explanations can be computed in polynomial time. As it has been recently noted, all such positive results can be seen as particular instances of the fact that for a certain well-behaved class of circuits, namely deterministic and decomposable circuits, Shapley-values can be computed efficiently. The study of the complexity of computing Shapley values on different classes of circuits is an active area of research in AI, and we strongly believe that establishing bridges between this work and the one carried out in database theory can be strongly beneficial for both communities. There are other interesting notions of explanations for databases, ML, and aspects of responsible data science like fairness, which we will also explore in this Dagstuhl Seminar.
- Data Structures and Algorithms
- Logic in Computer Science
- Factorized Databases
- Shapley values
- Database Theory
- Computational Complexity