- Susanne Bach-Bernhard (for administrative matters)
The goal of this Dagstuhl Seminar is to foster the development of a solid and useful theoretical foundation for unsupervised machine learning tasks. With the recent explosion of data availability, there is a growing tendency in Machine Learning to "let the data speak for itself". Most of these huge amounts of easily accessible data are raw, i.e. not labeled, or "unsupervised". Furthermore, several prediction tasks often have to be performed on one data source. Thus, unsupervised learning is frequently employed as a first step in data analysis to build useful feature representations and to detect patterns and regularities independently of any specific prediction task. However, in contrast to the well-developed theory of supervised learning, systematic analysis of unsupervised learning tasks is currently scarce and our understanding of the subject is rather meager.
By bringing together experts concerned with different aspects of this topic, we wish to outline our current understanding and to identify important challenges and directions for further research. Topics for discussion may include:
- Systematic understanding of the tasks: The objective of an unsupervised task is often not defined a priori, and the outcomes of different methods may vary drastically. There is a need for reliable guidelines for users on how to choose methods, what criteria to apply and how to evaluate the results of their choices.
- Theoretical tools and frameworks of analysis: While the convergence of individual algorithms or the optimization of specific objective functions has been analyzed, there are no definitions of high-level goals of unsupervised learning tasks and no theory that is independent of a specific algorithm, objective function or generative data model.
- Computational and statistical complexity: Machine learning tasks inherently involve both computational and statistical challenges. We are only beginning to understand the interplay between these. One challenge is to quantify the statistical stability of algorithms and the convergence rates of sample-based unsupervised learning procedures. Another aspect to be addressed is the gap between true computational costs and the pessimistic predictions of worst-case computational complexity analysis for many common unsupervised learning tasks.
- Building bridges to practitioners and users: Incorporating prior domain knowledge is crucial to the successful use of machine learning. Modeling and communicating such knowledge are among the most fundamental challenges. This is particularly pressing in unsupervised learning, since a user currently has no way of evaluating the outcome of some central unsupervised learning tasks (e.g., clustering).
- Interactive data acquisition: Often, there is not enough information about the data and about a user's intentions for an algorithm to fully determine by itself how best to extract knowledge. For supervised learning, various forms of active label feedback have been analyzed. In contrast, there is currently little understanding of how a user can interactively provide feedback to an algorithm for a task that is not aimed at predicting labels.
The success of Machine Learning methods for prediction crucially depends on data preprocessing such as building a suitable feature representation. With the recent explosion of data availability, there is a growing tendency to "let the data speak for itself". Thus, unsupervised learning is often employed as a first step in data analysis to build a good feature representation, but also, more generally, to detect patterns and regularities independently of any specific prediction task. There is a wide range of tasks frequently performed for these purposes, such as representation learning, feature extraction, outlier detection, dimensionality reduction, manifold learning, clustering and latent variable modeling.
The outcome of such an unsupervised learning step has far-reaching effects. The quality of a feature representation will affect the quality of a predictor learned from this representation; a learned model of the data-generating process may lead to conclusions about causal relations; and a data mining method applied to a database of people may identify certain groups of individuals as "suspects" (for example, of being prone to developing a specific disease or of being likely to commit certain crimes).
However, in contrast to the well-developed theory of supervised learning, systematic analysis of unsupervised learning tasks is currently scarce and our understanding of the subject is rather meager. It is therefore more than timely to put effort into developing solid foundations for unsupervised learning methods. It is important to understand these methods and to be able to analyze the validity of the conclusions drawn from them. The goal of this Dagstuhl Seminar was to foster the development of a solid and useful theoretical foundation for unsupervised machine learning tasks.
The seminar hosted academic researchers from the fields of theoretical computer science and statistics as well as some researchers from industry. Bringing together experts from a variety of backgrounds highlighted the many facets of unsupervised learning. The seminar included a number of technical presentations and discussions about the state of the art of research on the statistical and computational analysis of unsupervised learning tasks.
We held lively discussions concerning the development of objective criteria for the evaluation of unsupervised learning tasks, such as clustering. These converged to a consensus that such universal criteria cannot exist and that there is a need to incorporate specific domain expertise to develop different objectives for different intended uses of the clusterings. Consequently, there was a debate about how theoretical research could build useful tools that assist practitioners in choosing suitable methods for their tasks. One promising direction for better aligning algorithmic objectives with application needs is the development of paradigms for interactive algorithms for such unsupervised learning tasks, that is, learning algorithms that incorporate adaptive "queries" to a domain expert. The seminar included presentations and discussions of various frameworks for the development of such active algorithms as well as tools for the analysis of their benefits.
We believe the seminar was a significant step towards further collaborations between research groups with related but different views on the topic. A very active interchange of ideas took place, and participants expressed their satisfaction at having gained new insights into directions of research relevant to their own. As a group, we developed a higher-level perspective on the important challenges that research on unsupervised learning is currently facing.
- Sanjeev Arora (Princeton University, US)
- Pranjal Awasthi (Rutgers University - New Brunswick, US)
- Shai Ben-David (University of Waterloo, CA)
- Olivier Bousquet (Google Switzerland - Zurich, CH)
- Kamalika Chaudhuri (University of California - San Diego, US)
- Sanjoy Dasgupta (University of California - San Diego, US)
- Debarghya Ghoshdastidar (Universität Tübingen, DE)
- Barbara Hammer (Universität Bielefeld, DE)
- Matthias Hein (Universität des Saarlandes, DE)
- Christian Hennig (University College London, GB)
- Adam Tauman Kalai (Microsoft New England R&D Center - Cambridge, US)
- Ravindran Kannan (Microsoft Research India - Bangalore, IN)
- Samory Kpotufe (Princeton University, US)
- Marina Meila (University of Washington - Seattle, US)
- Claire Monteleoni (George Washington University - Washington, D.C., US)
- Lev Reyzin (University of Illinois - Chicago, US)
- Heiko Röglin (Universität Bonn, DE)
- Sivan Sabato (Ben Gurion University - Beer Sheva, IL)
- Melanie Schmidt (Universität Bonn, DE)
- Karin Schnass (Universität Innsbruck, AT)
- Hans Ulrich Simon (Ruhr-Universität Bochum, DE)
- Christian Sohler (TU Dortmund, DE)
- Ingo Steinwart (Universität Stuttgart, DE)
- Ilya Tolstikhin (MPI für Intelligente Systeme - Tübingen, DE)
- Ruth Urner (MPI für Intelligente Systeme - Tübingen, DE)
- Ulrike von Luxburg (Universität Tübingen, DE)
- Robert C. Williamson (Australian National University, AU)
- artificial intelligence / robotics
- data structures / algorithms / complexity
- machine learning
- theory of computing
- unsupervised learning
- representation learning