Modern algorithms for data analysis require the use of advanced probabilistic methods to achieve desirable scalability and accuracy guarantees. At the same time, modern data-analysis tasks require the use of advanced statistics to handle challenges such as testing for multiple hypotheses or identifying dependencies among data points, such as in time series or graphs. Probabilistic methods are also at the core theoretical computer-science areas, such as sublinear algorithms and average-case analysis. To obtain efficient data-analysis algorithms, probabilistic methods require careful balancing of theoretical and practical aspects. This Dagstuhl Seminar brings together researchers interested in statistical and probabilistic methods to design and analyze scalable algorithms for discovering knowledge in large and rich datasets. We plan to cover the following topics, among others.
Approximation Algorithms for Data Analysis: Exploring data quickly is important to guide further analysis. Very large datasets will not fit in the main memory or even the disk of a single machine, making exploratory analysis excessively slow and cumbersome. Approximation algorithms trade off the accuracy of the answer for the speed in which the approximate answer is obtained. Such solutions are acceptable when they offer stringent theoretical guarantees on their quality.
Data Stream and Sketches: In many applications new data arrive at all times: e.g., in online social networks new edges, representing interactions between users, arrive continuously. Data deletions can also be considered. Such dynamic settings are modeled nicely by data streams. The computational challenge is to use limited memory to maintain different statistics (e.g., subgraph counts, patterns, etc.) always up to date as the stream evolves with insertions and deletions. The best solutions for these problems use sketches, such as samples and/or coresets, i.e., a succinct representations of the data seen so far (or of the relevant data, if the computation is to be performed over a sliding window), to be used for computing a good approximation of the quantities of interest. Their development usually requires the analysis of statistical properties by studying the probability distribution of the content of the sketch.
Mining Temporal Data and Time Series: There is a growing interest in mining temporal data, i.e., data that is generated by a process that varies in time. This type of data is not limited to “traditional” time series, but also includes more complex types of data, such as temporal networks. In contrast to the extensive work on model-based time-series analysis, in data mining the focus is on non-parametric, model-agnostic, analysis. For instance, to estimate the current demand for products based on their past demand, it is often unrealistic to assume that the underlying process is time invariant. Instead, we search for patterns and trends in the data with minimal prior assumptions on the stochastic data-generation process, and taking into account the possibility of unknown distribution drift. The challenge, in this case, would be to bound the error in the computation of patterns or trends in terms of minimal conditions on the drift. In temporal graphs, the focus is on computing graph characteristics and/or vertex or edge scores (such as centralities) defined over multiple snapshots of a graph.
Statistically-Sound Knowledge Discovery from Data: In statistically-sound Knowledge Discovery from Data (KDD), results obtained from the data are considered as hypotheses, and they must undergo statistical testing, before being deemed significant, i.e., informative about the random, partially unknown process that generated the data. The challenges in developing methods for Statistically-sound KDD include (1) understanding how to subject the hypotheses to severe testing to make it hard for them to be deemed significant; (2) considering the simultaneous testing of multiple hypotheses as the default setting, not as an afterthought; (3) offering flexible statistical guarantees at different stages of the discovery process; and (4) achieving scalability along multiple axes, from the dataset size to the number and complexity of hypotheses.
- Data Structures and Algorithms
- Social and Information Networks
- Data Mining
- Knowledge Discovery from Data
- Probabilistic Analysis
- Randomized Algorithms
- Statistical Hypothesis Testing