##### Dagstuhl Seminar 16481

### New Directions for Learning with Kernels and Gaussian Processes

##### ( Nov 27 – Dec 02, 2016 )

##### Organizers

- Arthur Gretton (University College London, GB)
- Philipp Hennig (MPI für Intelligente Systeme - Tübingen, DE)
- Carl Edward Rasmussen (University of Cambridge, GB)
- Bernhard Schölkopf (MPI für Intelligente Systeme - Tübingen, DE)

##### Contact

- Simone Schilke (for administrative matters)

##### Impacts

- Convolutional Gaussian Processes : article : Advances in Neural Information Processing Systems 30 : NIPS 2017 - Wilk, Mark van der; Rasmussen, Carl Edward; Hensman, James - nips.cc, 2018. - 10 pp.
- Gaussian Processes and Kernel Methods : A Review on Connections and Equivalences - Kanagawa, Motonobu; Hennig, Philipp; Sejdinovic, Dino; Sriperumbudur, Bharath K. - Cornell University : arXiv.org, 2018. - 64 pp.

##### Schedule

Positive definite kernels dominated machine learning research in the first decade of the millennium. They provide infinite-dimensional hypothesis classes that deliver expressive power in an elegant analytical framework. In their probabilistic interpretation as Gaussian process models, they are also a fundamental concept of Bayesian inference.

In recent years, hierarchical ('deep') parametric models have bounced back and delivered a series of impressive empirical successes. In areas like speech recognition and image classification, deep networks now far surpass the predictive performance previously achieved with nonparametric models. One of the central lessons from the 'deep resurgence' is that the kernel community has been too reliant on theoretical notions of universality. Instead, representations must be learned on a more general level than previously accepted. This process is often associated with an 'engineering' approach to machine learning, in contrast to the supposedly more 'scientific' air surrounding kernel methods. One central goal of this seminar is to discuss how the superior adaptability of deep models can be transferred to the kernel framework while retaining at least some analytical clarity.

In a separate but related development, kernels have had their own renaissance lately, in the young areas of probabilistic programming ('computing of probability measures') and probabilistic numerics ('probabilistic descriptions of computing'). In both areas, kernels and Gaussian processes have been used as a descriptive language. And, similar to the situation in general machine learning, only a handful of comparatively simple kernels have so far been used. The central question here, too, is thus how kernels can be designed for challenging, in particular high-dimensional, regression problems. In contrast to the wider situation in ML, though, kernel design here should take place at compile-time, and be a structured algebraic process mapping source code describing a graphical model into a kernel. This gives rise to new fundamental questions for the theoretical computer science part of machine learning.

With the goal of spurring research progress in these two related areas, some of the questions to be discussed at the seminar are:

- Are there 'deep' kernel methods? What are they? Are they necessary?
- How can nonparametric kernel models be scaled to Big Data? Is nonparametricity actually necessary, or does 'big' suffice in some clear sense?
- If a computational task is defined by source code, is it possible to parse this code and map it to a Gaussian process hypothesis class? What are the theoretical limits of this parsing process, and can it be practically useful? Is it possible for kernel models to solve the fundamental problems of high-dimensional integration (i.e. marginalization in graphical models)? If so, how?

We believe that this is a crucial point in time for these discussions to take place.

Machine learning is a young field that currently enjoys rapid, almost dizzying
advancement on both the theoretical and the practical side. On account of
both, the until quite recently obscure discipline is increasingly turning
into a central area of computer science. Dagstuhl Seminar 16481 on *"New
Directions for Learning with Kernels and Gaussian Processes"* attempted to
allow a key community within machine learning to get its bearings at this
crucial moment in time.

Positive definite kernels are a concept that dominated machine learning research in the first decade of the millennium. They provide infinite-dimensional hypothesis classes that deliver expressive power in an elegant analytical framework. In their probabilistic interpretation as Gaussian process models, they are also a fundamental concept of Bayesian inference:

A *positive definite kernel* $k: X \times X \to \mathbb{R}$ on some input domain $X$ is a function with the property that, for all finite sets $\{x_1,\dots,x_N\} \subset X$, the matrix $K \in \mathbb{R}^{N \times N}$, with elements $k_{ij} = k(x_i,x_j)$, is positive semidefinite. According to a theorem by Mercer, given certain regularity assumptions, such kernels can be expressed as a potentially *infinite* expansion

$$k(x,x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i^*(x'), \qquad \text{with} \qquad \sum_{i=1}^{\infty} \lambda_i < \infty,$$

where $*$ is the conjugate transpose, $\lambda_i \in \mathbb{R}_+$ is a non-negative
*eigenvalue* and $\phi_i$ is an *eigenfunction* with respect to some measure $\nu(x)$: a function satisfying

$$\int k(x,x') \phi_i(x') \, d\nu(x') = \lambda_i \phi_i(x).$$

Random functions $f(x)$ drawn by independently sampling Gaussian weights for each eigenfunction,

$$f(x) = \sum_{j=1}^{\infty} f_j \phi_j(x) \qquad \text{where} \qquad f_j \sim \mathcal{N}(0,\lambda_j),$$

are draws from the centered *Gaussian process* (GP) $p(f) = \mathcal{GP}(f;0,k)$
with *covariance function* $k$. The negative logarithm of this Gaussian process
measure is, up to constants and some technicalities, the square of the norm
$\|f\|^2_k$ associated with the *reproducing kernel Hilbert space*
(RKHS) of functions reproduced by $k$.
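
A finite-dimensional analogue of the expansion and sampling construction above can be sketched numerically. The snippet below is only an illustration under stated assumptions: the squared-exponential kernel, the grid, and the names `rbf_kernel`, `lengthscale`, and `K_rebuilt` are choices made here, not taken from the seminar material. The eigenvectors of the Gram matrix play the role of the eigenfunctions evaluated on the grid.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.5):
    """Squared-exponential kernel on 1-D inputs (an assumed example kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)   # finite set {x_1, ..., x_N}
K = rbf_kernel(x, x)            # Gram matrix with entries k(x_i, x_j)

# Eigendecomposition of the symmetric Gram matrix: K = U diag(lam) U^T.
lam, U = np.linalg.eigh(K)

# Positive semidefiniteness: all eigenvalues are non-negative up to round-off.
is_psd = bool(lam.min() > -1e-8)

# Clip tiny negative round-off values before taking square roots.
lam = np.clip(lam, 0.0, None)

# Draw independent weights f_j ~ N(0, lam_j) and sum the weighted eigenvectors.
weights = rng.normal(size=lam.shape) * np.sqrt(lam)
f = U @ weights  # one draw from N(0, K): the GP evaluated on the grid x

# The finite expansion sum_j lam_j u_j u_j^T reproduces the Gram matrix.
K_rebuilt = (U * lam) @ U.T
```

On a grid, this is the discrete counterpart of the Mercer expansion: the sample `f` has covariance `K` because the weights are independent with variances `lam`.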

Supervised machine learning methods that *infer* an unknown function $f$
from a data set of input-output pairs $(X,Y) := \{(x_i,y_i)\}_{i=1,\dots,N}$ can
be constructed by minimizing an empirical risk $\ell(f(X);Y)$ regularized by
$\|\cdot\|^2_k$ or, algorithmically equivalent but with a different
philosophical interpretation, by computing the *posterior* Gaussian
process measure arising from conditioning $\mathcal{GP}(f;0,k)$ on the observed data
points under a likelihood proportional to the exponential of the negative empirical
risk.
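
For the special case of a squared loss, the algorithmic equivalence described above can be checked in a few lines. The following is a minimal sketch, assuming a squared-exponential kernel, synthetic sine data, and Gaussian observation noise with variance `noise_var`; all names are illustrative. The regularized-risk minimizer (kernel ridge regression) and the GP posterior mean are computed by two different routes and coincide.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3):
    """Squared-exponential kernel on 1-D inputs (an assumed example kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(1)
noise_var = 0.1
X = np.sort(rng.uniform(0.0, 1.0, size=20))
Y = np.sin(2.0 * np.pi * X) + np.sqrt(noise_var) * rng.normal(size=X.shape)
X_test = np.linspace(0.0, 1.0, 5)

K = rbf_kernel(X, X)
K_star = rbf_kernel(X_test, X)
A = K + noise_var * np.eye(len(X))

# Regularized-risk view: minimize sum_i (y_i - f(x_i))^2 + noise_var * ||f||_k^2.
# With f = sum_i alpha_i k(., x_i), the minimizer is alpha = (K + noise_var I)^{-1} Y.
alpha = np.linalg.solve(A, Y)
f_krr = K_star @ alpha

# GP view: condition GP(f; 0, k) on Y = f(X) + eps, eps ~ N(0, noise_var I),
# here computed via a Cholesky factorization.
L = np.linalg.cholesky(A)
gp_mean = K_star @ np.linalg.solve(L.T, np.linalg.solve(L, Y))
w = np.linalg.solve(L, K_star.T)
gp_var = np.diag(rbf_kernel(X_test, X_test)) - np.sum(w ** 2, axis=0)

# Gradient of the regularized risk with respect to alpha, evaluated at the minimizer.
grad = -2.0 * K @ (Y - K @ alpha) + 2.0 * noise_var * K @ alpha
```

The point estimate is identical in both views; the GP view additionally yields the posterior variance `gp_var`, which is where the two interpretations start to differ.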

The prominence of kernel/GP models was founded on this conceptually and algorithmically compact yet statistically powerful description of inference and learning of nonlinear functions. In recent years, however, hierarchical ('deep') parametric models have bounced back and delivered a series of impressive empirical successes. In areas like speech recognition and image classification, deep networks now far surpass the predictive performance previously achieved with nonparametric models. One central goal of the seminar was to discuss how the superior adaptability of deep models can be transferred to the kernel framework while retaining at least some analytical clarity. Among the central lessons from the 'deep resurgence' identified by the seminar participants is that the kernel community has been too reliant on theoretical notions of universality. Instead, representations must be learned on a more general level than previously accepted. This process is often associated with an 'engineering' approach to machine learning, in contrast to the supposedly more 'scientific' air surrounding kernel methods. But its importance must not be dismissed.

At the same time, participants also pointed out that deep learning is often misrepresented, in particular in popular expositions, as an almost magical kind of process, when in reality the concept is closely related to kernel methods and can be understood to some degree through this connection: deep models provide a hierarchical parametrization of the feature functions $\phi_i(x)$ in terms of a finite-dimensional family. The continued relevance of the established theory for kernel/GP models hinges on how much of the power of deep models can be understood from within the RKHS view, and how many new concepts are required to understand the expressivity of a deep learning machine.
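
The connection between a parametrized feature map and a kernel can be made concrete with a toy example. This is a sketch under assumptions, not a method presented at the seminar: the two-layer `tanh` network, its sizes, and the fixed random weights (which would be learned in practice) are hypothetical. The inner product of the network's output features defines a positive semidefinite kernel whose expansion has finitely many terms.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-layer feature map; weights are random here, learned in practice.
W1 = rng.normal(size=(1, 16))
b1 = rng.normal(size=16)
W2 = rng.normal(size=(16, 8)) / np.sqrt(16)
b2 = rng.normal(size=8)

def features(x):
    """Map inputs of shape (N, 1) to M = 8 features phi_1(x), ..., phi_8(x)."""
    h = np.tanh(x @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def deep_kernel(a, b):
    """k(x, x') = sum_i phi_i(x) phi_i(x'): a finite expansion with unit weights."""
    return features(a) @ features(b).T

x = np.linspace(-2.0, 2.0, 30)[:, None]
K = deep_kernel(x, x)

# The Gram matrix is positive semidefinite by construction ...
eigvals = np.linalg.eigvalsh(K)
# ... and its rank cannot exceed the number of features.
rank = np.linalg.matrix_rank(K, tol=1e-8)
```

The rank bound is what distinguishes this finite-dimensional family from the infinite expansions of typical nonparametric kernels: expressivity comes from adapting the features, not from their number.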

There is also unconditionally good news: In a separate but related development, kernels have had their own renaissance lately, in the young areas of probabilistic programming ('computing of probability measures') and probabilistic numerics ('probabilistic descriptions of computing'). In both areas, kernels and Gaussian processes have been used as a descriptive language. And, similar to the situation in general machine learning, only a handful of comparatively simple kernels have so far been used. The central question here, too, is thus how kernels can be designed for challenging, in particular high-dimensional, regression problems. In contrast to the wider situation in ML, though, kernel design here should take place at compile-time, and be a structured algebraic process mapping source code describing a graphical model into a kernel. This gives rise to new fundamental questions for the theoretical computer science of machine learning.
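
What an algebraic construction of kernels could build on is the closure of positive definite kernels under sums and products. The sketch below illustrates only that closure property; it does not implement the compile-time mapping from source code to kernels raised above, and the base kernels and the combinators `add` and `mul` are names chosen here for illustration.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def periodic(a, b, period=1.0, lengthscale=1.0):
    """Periodic (exp-sine-squared) kernel on 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-2.0 * np.sin(np.pi * diff / period) ** 2 / lengthscale ** 2)

def linear(a, b):
    """Linear kernel k(x, x') = x * x' on 1-D inputs."""
    return a[:, None] * b[None, :]

def add(k1, k2):
    """Sum of two kernels, again a positive definite kernel."""
    return lambda a, b: k1(a, b) + k2(a, b)

def mul(k1, k2):
    """Pointwise product of two kernels, again a positive definite kernel."""
    return lambda a, b: k1(a, b) * k2(a, b)

# A composite kernel written as an expression tree over base kernels.
composite = add(mul(rbf, periodic), linear)

x = np.linspace(0.0, 3.0, 40)
K = composite(x, x)
min_eig = np.linalg.eigvalsh(K).min()  # non-negative up to round-off
```

Because every node of the expression tree preserves positive semidefiniteness, any kernel assembled this way is valid without a separate check.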

A third thread running through the seminar concerned the internal conceptual
schism between the probabilistic (Gaussian process) view and the statistical
learning theoretical (RKHS) view on the model class. Although the algorithms
and algebraic ideas used on both sides overlap *almost* to the point of
equivalence, their philosophical interpretations, and thus also the required
theoretical properties, differ strongly. Participants of the seminar were
deliberately invited from both "denominations" in roughly equal number.
Several informal discussions in the evenings, and in particular a lively
break-out discussion on Thursday, helped clear up the mathematical connections
(while also airing key conceptual points of contention from either side).
Thursday's group is planning to write a publication based on the results of
the discussion; this would be a highly valuable concrete contribution arising
from the seminar that may help draw this community closer together.

Despite the challenges to some of the long-standing paradigms of this community, the seminar was infused with an air of excitement. The participants seemed to share the sensation that machine learning is still only just beginning to show its full potential. The mathematical concepts and insights that have emerged from the study of kernel/GP models may have to evolve and be adapted to recent developments, but their fundamental nature means they are quite likely to stay relevant for the understanding of current and future model classes. Far from going out of fashion, mathematical analysis of the statistical and numerical properties of machine learning model classes seems slated for a revival in coming years. And much of it will be leveraging the notions discussed at the seminar.

- Florence d'Alché-Buc (Telecom ParisTech, FR)
- Marc Deisenroth (Imperial College London, GB)
- David Duvenaud (Toronto, CA)
- Roman Garnett (Washington University - St. Louis, US)
- Arthur Gretton (University College London, GB)
- Stefan Harmeling (Universität Düsseldorf, DE)
- Philipp Hennig (MPI für Intelligente Systeme - Tübingen, DE)
- James Hensman (Lancaster University, GB)
- José Miguel Hernández-Lobato (Harvard University - Cambridge, US)
- Frank Hutter (Universität Freiburg, DE)
- Motonobu Kanagawa (Institute of Statistical Mathematics - Tokyo, JP)
- Andreas Krause (ETH Zürich, CH)
- Neil D. Lawrence (University of Sheffield, GB)
- David Lopez-Paz (Facebook - AI Research, US)
- Maren Mahsereci (MPI für Intelligente Systeme - Tübingen, DE)
- Julien Mairal (INRIA - Grenoble, FR)
- Krikamol Muandet (Mahidol University, TH)
- Hannes Nickisch (Philips - Hamburg, DE)
- Sebastian Nowozin (Microsoft Research UK - Cambridge, GB)
- Cheng Soon Ong (Data61 - Canberra, AU)
- Manfred Opper (TU Berlin, DE)
- Peter Orbanz (Columbia University - New York, US)
- Michael A. Osborne (University of Oxford, GB)
- Carl Edward Rasmussen (University of Cambridge, GB)
- Stephen Roberts (University of Oxford, GB)
- Volker Roth (Universität Basel, CH)
- Simo Särkkä (Aalto University, FI)
- Bernt Schiele (MPI für Informatik - Saarbrücken, DE)
- Michael Schober (MPI für Intelligente Systeme - Tübingen, DE)
- Bernhard Schölkopf (MPI für Intelligente Systeme - Tübingen, DE)
- Dino Sejdinovic (University of Oxford, GB)
- Carl-Johann Simon-Gabriel (MPI für Intelligente Systeme - Tübingen, DE)
- Bharath Sriperumbudur (Pennsylvania State University - University Park, US)
- Ingo Steinwart (Universität Stuttgart, DE)
- Zoltán Szabó (Ecole Polytechnique - Palaiseau, FR)
- Ilya Tolstikhin (MPI für Intelligente Systeme - Tübingen, DE)
- Raquel Urtasun (University of Toronto, CA)
- Mark van der Wilk (University of Cambridge, GB)
- Harry van Zanten (University of Amsterdam, NL)
- Richard Wilkinson (University of Sheffield, GB)

##### Classification

- artificial intelligence / robotics
- data structures / algorithms / complexity
- modelling / simulation

##### Keywords

- Machine Learning
- Kernel Methods
- Gaussian Processes
- Probabilistic Programming
- Probabilistic Numerics