Dagstuhl Seminar 24391

Statistical and Probabilistic Methods in Algorithmic Data Analysis

(Sep 22 – Sep 27, 2024)


Permalink
Please use the following short url to reference this page: https://www.dagstuhl.de/24391

Organizers

  • Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE)
  • Matteo Riondato (Amherst College, US)
  • Eli Upfal (Brown University - Providence, US)

Summary

The development of efficient methods for the analysis of data, including graphs, time series, and transactional datasets, is a key challenge in computer science. The availability of ever larger and richer datasets about all aspects of human life, society, and nature, in a variety of forms (e.g., graphs, time series, tabular data), has created the need for such methods, which find applications in areas such as biology, finance, and the computational social sciences. It has also created challenges that must be tackled in order to develop efficient methods, including the following.

  • Trade-offs between data, computation, and information: As datasets with trillions of points become available, how much data do we really need to (approximately) answer thousands or more exploratory analysis questions on them? We need new algorithms that carefully balance the minimum amount of data needed to obtain the desired information at the required accuracy, with the computation needed to extract it.
  • Balancing theory and practice in data science: How can we satisfy the needs for theoretical and statistical guarantees on one side and practical efficiency on the other? How do we connect concepts from different disciplines to exploit their power in useful algorithms? Developments from all areas of computer science and statistics should be used to obtain data analysis methods that are efficient from both a computational and a statistical point of view.
  • Enabling efficient and statistically-sound scientific discoveries: The practice of science requires testing multiple hypotheses on the same, often large, data. For example, the Human Protein Reference Database graph has ≈ 19k proteins and ≈ 37k interactions between them. Scientists want to understand the significance of small connected subgraphs in this network, representing pathways in cancer cells. There are more than 10^13 subgraphs of size 8, each corresponding to a hypothesis. We need computationally-efficient methods that scale to trillions of hypotheses, while controlling different measures of statistical guarantee on false discoveries, e.g., the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR); a minimal sketch contrasting these two measures appears after this list.
  • Enhancing responsible and unbiased data analysis: There are numerous techniques for analyzing and processing datasets. While these techniques have overall been effective at extracting patterns from datasets, the choice of an appropriate technique and of its parameters for analyzing a given dataset is often more art than science. Different techniques, each implementing a variety of heuristics, may give different results when applied to the same data. Moreover, many of these techniques do not provide an interpretable mechanistic explanation for their performance, and most do not provide a rigorous measure of the statistical significance or robustness of the analysis. A principled approach to data analysis, based on well-founded mathematical and statistical concepts, enforces objectivity and unbiased results. Such an approach enhances the effectiveness of evidence-based medicine, policy, and social applications of data analysis.
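
As promised in the third item above, here is a minimal Python sketch contrasting the two error measures mentioned there: the Bonferroni correction, which controls the FWER by testing each of the m hypotheses at level alpha/m, and the Benjamini-Hochberg step-up procedure, which controls the FDR. The p-values, the significance level, and the function names are illustrative assumptions, not part of the seminar material.

```python
# A minimal sketch contrasting FWER and FDR control on a list of p-values.
# The p-values and alpha below are made up for illustration.
from typing import List


def bonferroni_rejections(pvalues: List[float], alpha: float) -> List[int]:
    """Bonferroni correction: controls the FWER at level alpha by testing
    each of the m hypotheses at level alpha / m."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg_rejections(pvalues: List[float], alpha: float) -> List[int]:
    """Benjamini-Hochberg step-up procedure: controls the FDR at level alpha
    (under independence or positive dependence of the p-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value clears the BH threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.76]
    print("Bonferroni:", bonferroni_rejections(pvals, alpha=0.05))
    print("BH:        ", benjamini_hochberg_rejections(pvals, alpha=0.05))
```

With the example p-values, Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects the first two: FDR control is less conservative, which is what makes it attractive when the number of hypotheses is very large.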

During the Dagstuhl Seminar, the participants engaged in lively presentations of recent results on these topics, ranging from the theory of machine learning, to bias reduction, to pattern mining, to approximation and online algorithms for problems arising in very different settings and with many different applications. An Open Problem session gave the participants a chance to share interesting problems with the community. While the research presentations took up the majority of the schedule, there were plenty of chances for the participants to network, socialize, and engage in new and ongoing collaborations, which is, in our opinion, the real advantage of attending a Dagstuhl Seminar in person. Especially positive were the interactions between participants at different stages of their careers, from institutions in different countries, and from both academia and industry.

Copyright Aristides Gionis, Matteo Riondato, and Eli Upfal

Motivation

Modern algorithms for data analysis require the use of advanced probabilistic methods to achieve desirable scalability and accuracy guarantees. At the same time, modern data-analysis tasks require the use of advanced statistics to handle challenges such as testing multiple hypotheses or identifying dependencies among data points, as in time series or graphs. Probabilistic methods are also at the core of theoretical computer-science areas such as sublinear algorithms and average-case analysis. To obtain efficient data-analysis algorithms, probabilistic methods require careful balancing of theoretical and practical aspects. This Dagstuhl Seminar brings together researchers interested in statistical and probabilistic methods to design and analyze scalable algorithms for discovering knowledge in large and rich datasets. We plan to cover the following topics, among others.

Approximation Algorithms for Data Analysis: Exploring data quickly is important to guide further analysis. Very large datasets will not fit in the main memory, or even on the disk, of a single machine, making exploratory analysis excessively slow and cumbersome. Approximation algorithms trade off the accuracy of the answer for the speed with which the approximate answer is obtained. Such solutions are acceptable when they offer stringent theoretical guarantees on their quality.
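
As an illustration of this accuracy-for-speed trade-off, the following sketch estimates the mean of a large dataset from a uniform sample and reports a Hoeffding-style error bound. The dataset, the sample size, and the assumption that values lie in [0, 1] are choices made purely for the example, not a recommendation from the seminar.

```python
# A minimal sketch of a sampling-based approximation with a worst-case guarantee.
# The data, sample size, and the assumption that values lie in [0, 1] are
# illustrative; real exploratory workloads would tune these per query.
import math
import random


def approximate_mean(data, sample_size, delta=0.05):
    """Estimate the mean of `data` (values in [0, 1]) from a uniform sample.

    By Hoeffding's inequality, the returned error bound eps satisfies
    P(|estimate - true mean| > eps) <= delta.
    """
    sample = random.choices(data, k=sample_size)  # uniform sampling with replacement
    estimate = sum(sample) / sample_size
    eps = math.sqrt(math.log(2 / delta) / (2 * sample_size))
    return estimate, eps


if __name__ == "__main__":
    data = [random.random() for _ in range(10_000_000)]
    est, eps = approximate_mean(data, sample_size=10_000)
    print(f"estimate = {est:.4f} +/- {eps:.4f} (with prob. >= 0.95)")
```

Note that the error bound depends only on the sample size and the failure probability, not on the size of the full dataset, which is exactly why such estimators remain fast as the data grows.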

Data Streams and Sketches: In many applications new data arrive at all times: e.g., in online social networks, new edges, representing interactions between users, arrive continuously. Data deletions can also be considered. Such dynamic settings are modeled nicely by data streams. The computational challenge is to use limited memory to keep different statistics (e.g., subgraph counts or pattern frequencies) up to date as the stream evolves with insertions and deletions. The best solutions for these problems use sketches, such as samples and/or coresets, i.e., succinct representations of the data seen so far (or of the relevant data, if the computation is to be performed over a sliding window), from which good approximations of the quantities of interest can be computed. Their development usually requires analyzing the statistical properties of the sketch, by studying the probability distribution of its contents.
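
One of the simplest sketches of this kind is a uniform random sample maintained with reservoir sampling (Vitter's Algorithm R) over an insertion-only stream. The sketch below is a hedged illustration: the stream, the reservoir size, and the statistic estimated from the sample are assumptions for the example, and handling deletions or sliding windows would require heavier machinery.

```python
# A minimal reservoir-sampling sketch (Algorithm R) for an insertion-only
# stream: at every point, `reservoir` is a uniform sample (without
# replacement) of the items seen so far.
import random


class ReservoirSample:
    def __init__(self, size, seed=None):
        self.size = size
        self.reservoir = []
        self.count = 0  # number of stream items seen so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.count += 1
        if len(self.reservoir) < self.size:
            self.reservoir.append(item)
        else:
            # Keep the new item with probability size / count, replacing a
            # uniformly chosen resident; this preserves uniformity.
            j = self.rng.randrange(self.count)
            if j < self.size:
                self.reservoir[j] = item


if __name__ == "__main__":
    sketch = ReservoirSample(size=1_000, seed=42)
    for x in range(1_000_000):  # stand-in for an unbounded stream
        sketch.add(x)
    # Estimate, e.g., the fraction of even items from the sample alone.
    frac_even = sum(1 for x in sketch.reservoir if x % 2 == 0) / len(sketch.reservoir)
    print(f"estimated fraction of even items: {frac_even:.3f}")
```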

Mining Temporal Data and Time Series: There is a growing interest in mining temporal data, i.e., data generated by a process that varies over time. This type of data is not limited to “traditional” time series, but also includes more complex types of data, such as temporal networks. In contrast to the extensive work on model-based time-series analysis, in data mining the focus is on non-parametric, model-agnostic analysis. For instance, to estimate the current demand for products based on their past demand, it is often unrealistic to assume that the underlying process is time-invariant. Instead, we search for patterns and trends in the data with minimal prior assumptions on the stochastic data-generation process, while taking into account the possibility of unknown distribution drift. The challenge, in this case, is to bound the error in the computation of patterns or trends in terms of minimal conditions on the drift. In temporal graphs, the focus is on computing graph characteristics and/or vertex or edge scores (such as centralities) defined over multiple snapshots of a graph.
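
A toy illustration of the drift trade-off mentioned above: estimating the current mean of a slowly drifting signal from a sliding window, where a larger window averages out noise but lags behind the drift. The synthetic signal, the drift rate, and the window sizes are all assumptions made for the example, not the seminar's methodology.

```python
# A toy illustration of the bias/variance trade-off under distribution drift:
# estimate the current mean of a drifting signal from a sliding window.
# Larger windows reduce variance but increase the bias caused by drift.
import random


def sliding_window_mean(stream, window):
    """Estimate the current mean using only the last `window` observations."""
    tail = stream[-window:]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    rng = random.Random(0)
    T, drift_per_step, noise = 10_000, 0.001, 1.0
    # Observation at time t: drifting mean (drift_per_step * t) plus Gaussian noise.
    stream = [drift_per_step * t + rng.gauss(0.0, noise) for t in range(T)]
    true_current_mean = drift_per_step * (T - 1)
    for window in (50, 500, 5_000):
        est = sliding_window_mean(stream, window)
        print(f"window={window:5d}  estimate={est:7.3f}  "
              f"error={abs(est - true_current_mean):.3f}")
```

Running this shows that neither the smallest nor the largest window wins: the best window size depends on the (unknown) drift rate, which is exactly the kind of condition the error bounds mentioned above must capture.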

Statistically-Sound Knowledge Discovery from Data: In statistically-sound Knowledge Discovery from Data (KDD), results obtained from the data are considered as hypotheses, and they must undergo statistical testing, before being deemed significant, i.e., informative about the random, partially unknown process that generated the data. The challenges in developing methods for Statistically-sound KDD include (1) understanding how to subject the hypotheses to severe testing to make it hard for them to be deemed significant; (2) considering the simultaneous testing of multiple hypotheses as the default setting, not as an afterthought; (3) offering flexible statistical guarantees at different stages of the discovery process; and (4) achieving scalability along multiple axes, from the dataset size to the number and complexity of hypotheses.
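
To make the "results as hypotheses" view concrete, here is a minimal Monte Carlo permutation test asking whether the observed co-occurrence of two items in a transactional dataset is higher than expected under a null model that shuffles one item's occurrences. The toy dataset, the null model, and the number of permutations are illustrative assumptions, not the seminar's method; in a real statistically-sound KDD pipeline this test would be one of many hypotheses subjected to FWER or FDR control as discussed above.

```python
# A minimal Monte Carlo permutation test for the significance of a pattern:
# is the observed co-occurrence count of items A and B higher than expected
# under a null model that permutes B's occurrences across transactions?
import random


def cooccurrence(a_flags, b_flags):
    """Number of transactions containing both A and B."""
    return sum(1 for a, b in zip(a_flags, b_flags) if a and b)


def permutation_pvalue(a_flags, b_flags, num_perms=10_000, seed=0):
    rng = random.Random(seed)
    observed = cooccurrence(a_flags, b_flags)
    at_least_as_extreme = 0
    shuffled = list(b_flags)
    for _ in range(num_perms):
        rng.shuffle(shuffled)  # null model: B occurs in a random set of transactions
        if cooccurrence(a_flags, shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (num_perms + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    n = 2_000
    a_flags = [rng.random() < 0.3 for _ in range(n)]
    # Make B mildly correlated with A so the pattern carries some signal.
    b_flags = [a if rng.random() < 0.2 else (rng.random() < 0.3) for a in a_flags]
    print("p-value:", permutation_pvalue(a_flags, b_flags))
```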

Copyright Aristides Gionis, Matteo Riondato, and Eli Upfal

Participants


  • Florian Adriaens (University of Helsinki, FI) [dblp]
  • Aris Anagnostopoulos (Sapienza University of Rome, IT) [dblp]
  • Luca Becchetti (Sapienza University of Rome, IT) [dblp]
  • Fabian Brandt-Tumescheit (HU Berlin, DE) [dblp]
  • Gianmarco De Francisci Morales (CENTAI - Torino, IT) [dblp]
  • Alessandro Epasto (Google - New York, US) [dblp]
  • Esther Galbrun (University of Kuopio, FI) [dblp]
  • Aristides Gionis (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Shahrzad Haddadan (Rutgers Business School - Piscataway, US) [dblp]
  • Ravi Kumar (Google - Mountain View, US) [dblp]
  • Silvio Lattanzi (Google - Barcelona, ES) [dblp]
  • Stefano Leonardi (Sapienza University of Rome, IT) [dblp]
  • Sebastian Lüderssen (TU Wien, AT)
  • Alessio Mazzetto (Brown University - Providence, US) [dblp]
  • Pauli Miettinen (University of Kuopio, FI) [dblp]
  • Stefan Neumann (Technische Universität Wien, AT) [dblp]
  • Lutz Oettershagen (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Rasmus Pagh (University of Copenhagen, DK) [dblp]
  • Andrea Pietracaprina (University of Padova, IT) [dblp]
  • Giulia Preti (CENTAI - Torino, IT) [dblp]
  • Geppino Pucci (University of Padova, IT) [dblp]
  • Matteo Riondato (Amherst College, US) [dblp]
  • Will Rosenbaum (University of Liverpool, GB) [dblp]
  • Larry Rudolph (Two Sigma Investments LP - New York, US) [dblp]
  • Ilie Sarpe (KTH Royal Institute of Technology - Stockholm, SE) [dblp]
  • Mauro Sozio (Telecom Paris, FR) [dblp]
  • Mahito Sugiyama (National Institute of Informatics - Tokyo, JP) [dblp]
  • Nikolaj Tatti (University of Helsinki, FI) [dblp]
  • Eli Upfal (Brown University - Providence, US) [dblp]
  • Fabio Vandin (University of Padova, IT) [dblp]
  • Sergei Vassilvitskii (Google - New York, US) [dblp]
  • Yllka Velaj (Universität Wien, AT) [dblp]
  • Geoffrey Webb (Monash University - Clayton, AU) [dblp]
  • Anthony Wirth (The University of Sydney, AU) [dblp]

Classification
  • Data Structures and Algorithms
  • Databases
  • Social and Information Networks

Keywords
  • Data Mining
  • Knowledge Discovery from Data
  • Probabilistic Analysis
  • Randomized Algorithms
  • Statistical Hypothesis Testing