18.09.16 - 23.09.16, Seminar 16382

Foundations of Unsupervised Learning

This seminar description was published on our web pages before the seminar and used in the invitation to the seminar.

Motivation

The goal of this Dagstuhl Seminar is to foster the development of a solid and useful theoretical foundation for unsupervised machine learning tasks. With the recent explosion of data availability, there is a growing tendency in Machine Learning to "let the data speak for itself". Most of this huge amount of easily accessible data is raw, i.e., unlabeled or "unsupervised". Furthermore, several prediction tasks often need to be performed on the same data source. Thus, unsupervised learning is frequently employed as a first step in data analysis to build useful feature representations and to detect patterns and regularities independently of any specific prediction task. However, in contrast to the well-developed theory of supervised learning, systematic analysis of unsupervised learning tasks is currently scarce and our understanding of the subject remains meager.

By bringing together experts concerned with different aspects of this topic we wish to outline our current understanding as well as identify important challenges and directions for further research. Topics for discussion may include:

  • Systematic understanding of the tasks: The objective of an unsupervised task is often not defined a priori, and the outcomes of different methods may vary drastically. Users need reliable guidelines on how to choose methods, which criteria to apply, and how to evaluate the results of their choices.
  • Theoretical tools and frameworks of analysis: While the convergence of individual algorithms optimizing specific objective functions has been analyzed, there are no formal definitions of the high-level goals of unsupervised learning tasks, and no theory that is independent of a specific algorithm, objective function, or generative data model.
  • Computational and statistical complexity: Machine learning tasks inherently involve both computational and statistical challenges, and we are only beginning to understand the interplay between them. One challenge is to quantify the statistical stability of algorithms and the convergence rates of sample-based unsupervised learning procedures. Another is the gap between the true computational cost of many common unsupervised learning algorithmic tasks and the pessimistic predictions of their worst-case complexity analysis.
  • Building bridges to practitioners and users: Incorporating prior domain knowledge is crucial to the successful use of machine learning, and modeling and communicating such knowledge are among the most fundamental challenges. This is particularly pressing in unsupervised learning, since a user currently has no way of evaluating the outcome of central unsupervised learning tasks (e.g., clustering).
  • Interactive data acquisition: Often there is not enough information about the data, or about a user's intentions, for an algorithm to determine by itself how best to extract knowledge. For supervised learning, various forms of active label feedback have been analyzed; in contrast, there is currently little understanding of how a user can interactively provide feedback to an algorithm for a task that is not aimed at predicting labels.