https://www.dagstuhl.de/16481

### November 27 – December 2, 2016, Dagstuhl Seminar 16481

# New Directions for Learning with Kernels and Gaussian Processes

## Organizers

Arthur Gretton (University College London, GB)

Philipp Hennig (MPI für Intelligente Systeme – Tübingen, DE)

Carl Edward Rasmussen (University of Cambridge, GB)

Bernhard Schölkopf (MPI für Intelligente Systeme – Tübingen, DE)

## Summary

Machine learning is a young field that currently enjoys rapid, almost dizzying
advancement on both the theoretical and the practical side. On both accounts,
the discipline, obscure until quite recently, is increasingly turning
into a central area of computer science. Dagstuhl Seminar 16481 on *"New
Directions for Learning with Kernels and Gaussian Processes"* attempted to
allow a key community within machine learning to get its bearings at this
crucial moment in time.

Positive definite kernels are a concept that dominated machine learning research in the first decade of the millennium. They provide infinite-dimensional hypothesis classes that deliver expressive power in an elegant analytical framework. In their probabilistic interpretation as Gaussian process models, they are also a fundamental concept of Bayesian inference:

A *positive definite kernel* $k: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ on some input domain $\mathbb{X}$ is a function with the property that, for all finite sets $\{x_1,\dots,x_N\} \subset \mathbb{X}$, the matrix $K \in \mathbb{R}^{N \times N}$, with elements $k_{ij} = k(x_i,x_j)$, is positive semidefinite. According to a theorem by Mercer, given certain regularity assumptions, such kernels can be expressed as a potentially *infinite* expansion

$$k(x,x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i^{*}(x'), \qquad \text{with} \qquad \sum_{i=1}^{\infty} \lambda_i < \infty,$$

where $*$ is the conjugate transpose, $\lambda_i \in \mathbb{R}_+$ is a non-negative
*eigenvalue* and $\phi_i$ is an *eigenfunction* with respect to some measure $\nu(x)$: a function satisfying

$$\int k(x,x') \, \phi_i(x') \, d\nu(x') = \lambda_i \phi_i(x).$$

Random functions $f(x)$ drawn by independently sampling Gaussian weights for each eigenfunction,

$$f(x) = \sum_{j=1}^{\infty} f_j \phi_j(x) \qquad \text{where} \qquad f_j \sim \mathcal{N}(0,\lambda_j),$$

are draws from the centered *Gaussian process* (GP) $p(f)=\mathcal{GP}(f;0,k)$
with *covariance function* $k$. The logarithm of this Gaussian process
measure is, up to constants and some technicalities, the square of the norm
$\|f\|_k^2$ associated with the *reproducing kernel Hilbert space*
(RKHS) of functions reproduced by $k$.
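
The construction above can be imitated on a finite grid, where the Gram matrix plays the role of the kernel and its eigenvectors stand in for the eigenfunctions (with respect to the empirical measure on the grid). The following is a minimal sketch under stated assumptions: the squared-exponential kernel, the grid, the sample count, and all function names are illustrative choices, not taken from the seminar.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs (an assumed example kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_gp_via_eigenexpansion(K, n_samples, rng):
    """Draw f = sum_j f_j phi_j with f_j ~ N(0, lambda_j), using eigenpairs of K."""
    eigvals, eigvecs = np.linalg.eigh(K)       # K = V diag(lambda) V^T
    eigvals = np.clip(eigvals, 0.0, None)      # clip tiny negative round-off
    # Independent Gaussian weights, one per eigenvector, scaled by sqrt(lambda_j).
    weights = rng.standard_normal((len(eigvals), n_samples)) * np.sqrt(eigvals)[:, None]
    return eigvecs @ weights                   # shape (N, n_samples)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 40)
K = rbf_kernel(x, x)

# Positive semidefiniteness of the Gram matrix, up to floating-point error.
min_eig = np.linalg.eigvalsh(K).min()

# The empirical covariance of the draws should approach K as n_samples grows.
samples = sample_gp_via_eigenexpansion(K, n_samples=20000, rng=rng)
empirical_cov = samples @ samples.T / samples.shape[1]
max_cov_error = np.abs(empirical_cov - K).max()
```

Sampling through the eigendecomposition rather than a Cholesky factor is a deliberate choice here, because it mirrors the weighted sum over eigenfunctions term by term.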

Supervised machine learning methods that *infer* an unknown function $f$
from a data set of input-output pairs $(X,Y) := \{(x_i,y_i)\}_{i=1,\dots,N}$ can
be constructed by minimizing an empirical risk $\ell(f(X);Y)$ regularized by
$\|\cdot\|_k^2$. Alternatively, in a way that is algorithmically equivalent but
carries a different philosophical interpretation, they can be constructed by
computing the *posterior* Gaussian process measure arising from conditioning
$\mathcal{GP}(f;0,k)$ on the observed data points under a likelihood
proportional to the exponential of the negative empirical risk.
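
For the special case of a squared loss, the equivalence of the two views can be checked numerically. The sketch below assumes a squared-exponential kernel, synthetic data, and a Gaussian noise variance `noise_var`; all of these, and the variable names, are illustrative assumptions. The regularized-risk solution is computed through the representer theorem, the posterior mean through a Cholesky factorization, and the two predictions should agree.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs (an assumed example kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-3.0, 3.0, size=25))
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(25)   # synthetic data
x_test = np.linspace(-3.0, 3.0, 50)
noise_var = 0.1 ** 2          # assumed noise variance sigma^2

K = rbf_kernel(x_train, x_train)
K_star = rbf_kernel(x_test, x_train)
A = K + noise_var * np.eye(len(x_train))

# Regularized-risk view: minimize sum_i (y_i - f(x_i))^2 + sigma^2 * ||f||_k^2.
# With f = sum_i alpha_i k(., x_i), the minimizer solves (K + sigma^2 I) alpha = y.
alpha = np.linalg.solve(A, y_train)
krr_prediction = K_star @ alpha

# Probabilistic view: condition GP(f; 0, k) on y = f(X) + eps, eps ~ N(0, sigma^2 I).
L = np.linalg.cholesky(A)                      # A = L L^T
v = np.linalg.solve(L, y_train)
W = np.linalg.solve(L, K_star.T)               # shape (N_train, N_test)
gp_mean = W.T @ v
gp_var = 1.0 - np.sum(W ** 2, axis=0)          # k(x, x) = 1 for this kernel

max_gap = np.abs(krr_prediction - gp_mean).max()
```

The point estimate is shared by both views; the posterior variance `gp_var` is the part that only the probabilistic interpretation provides.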

The prominence of kernel/GP models was founded on this conceptually and algorithmically compact yet statistically powerful description of inference and learning of nonlinear functions. In the past years, however, hierarchical ('deep') parametric models have bounced back and delivered a series of impressive empirical successes. In areas like speech recognition and image classification, deep networks now far surpass the predictive performance previously achieved with nonparametric models. One central goal of the seminar was to discuss how the superior adaptability of deep models can be transferred to the kernel framework while retaining at least some analytical clarity. Among the central lessons from the 'deep resurgence' identified by the seminar participants is that the kernel community has been too reliant on theoretical notions of universality. Instead, representations must be learned on a more general level than previously accepted. This process is often associated with an 'engineering' approach to machine learning, in contrast to the supposedly more 'scientific' air surrounding kernel methods. But its importance must not be dismissed.

At the same time, participants also pointed out that deep learning is often misrepresented, in particular in popular expositions, as an almost magical process, when in reality the concept is closely related to kernel methods and can be understood to some degree through this connection: deep models provide a hierarchical parametrization of the feature functions $\phi_i(x)$ in terms of a finite-dimensional family. The continued relevance of the established theory for kernel/GP models hinges on how much of the power of deep models can be understood from within the RKHS view, and how many new concepts are required to understand the expressivity of a deep learning machine.
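
The link between deep models and kernels mentioned above can be made concrete with a toy example: a small network defines finitely many feature functions, and their inner product is a positive semidefinite kernel whose Gram matrix has rank at most the number of features. This is only a sketch under assumptions; the architecture, the random (untrained) weights, and the names `init_params`, `features`, and `feature_kernel` are hypothetical.

```python
import numpy as np

def init_params(rng, sizes):
    """Random weights for a small fully connected network (illustrative only)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def features(params, X):
    """Hierarchical feature map phi_theta(x): composed affine maps and tanh."""
    h = X
    for W, b in params:
        h = np.tanh(h @ W + b)
    return h

def feature_kernel(params, Xa, Xb):
    """k_theta(x, x') = phi_theta(x)^T phi_theta(x'), with finitely many features."""
    return features(params, Xa) @ features(params, Xb).T

rng = np.random.default_rng(2)
params = init_params(rng, sizes=[3, 16, 8])    # 3 inputs -> 16 hidden -> 8 features
X = rng.standard_normal((50, 3))

K = feature_kernel(params, X, X)
eigvals = np.linalg.eigvalsh(K)
# Count eigenvalues that are non-negligible relative to the largest one.
numerical_rank = int(np.sum(eigvals > 1e-10 * eigvals.max()))
```

In a trained deep model the parameters would be fitted to data, which is exactly the representation learning that a fixed kernel does not perform.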

There is also unconditionally good news: In a separate but related development, kernels have had their own renaissance lately, in the young areas of probabilistic programming ('computing of probability measures') and probabilistic numerics ('probabilistic descriptions of computing'). In both areas, kernels and Gaussian processes have been used as a descriptive language. And, similar to the situation in general machine learning, only a handful of comparatively simple kernels have so far been used. The central question here, too, is thus how kernels can be designed for challenging, in particular high-dimensional, regression problems. In contrast to the wider situation in ML, though, kernel design here should take place at compile-time, and be a structured algebraic process mapping source code describing a graphical model into a kernel. This gives rise to new fundamental questions for the theoretical computer science of machine learning.
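
One simple ingredient that an algebraic approach to kernel design could build on is the fact that sums and products of positive definite kernels are again positive definite. The sketch below is not the compile-time procedure envisioned in the seminar. It only shows such closure rules as composable functions, and the combinator names `add` and `mul` and the chosen base kernels are assumptions.

```python
import numpy as np

def rbf(lengthscale):
    """Return a squared-exponential kernel on 1-D inputs."""
    def k(a, b):
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)
    return k

def periodic(period, lengthscale):
    """Return a standard periodic kernel on 1-D inputs."""
    def k(a, b):
        s = np.sin(np.pi * (a[:, None] - b[None, :]) / period)
        return np.exp(-2.0 * (s / lengthscale) ** 2)
    return k

def add(k1, k2):
    """Sum of two kernels: positive definite if both arguments are."""
    return lambda a, b: k1(a, b) + k2(a, b)

def mul(k1, k2):
    """Pointwise product of two kernels: positive definite if both are."""
    return lambda a, b: k1(a, b) * k2(a, b)

# Hypothetical composite: a slowly decaying periodic component plus short-range structure.
composite = add(mul(rbf(2.0), periodic(1.0, 0.7)), rbf(0.3))

x = np.linspace(0.0, 4.0, 30)
K = composite(x, x)
min_eig = np.linalg.eigvalsh(K).min()   # non-negative up to round-off
```

Because each combinator returns an ordinary kernel function, expressions of this kind can be nested to arbitrary depth while positive semidefiniteness is preserved by construction.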

A third thread running through the seminar concerned the internal conceptual
schism between the probabilistic (Gaussian process) view and the statistical
learning theoretical (RKHS) view on the model class. Although the algorithms
and algebraic ideas used on both sides overlap *almost* to the point of
equivalence, their philosophical interpretations, and thus also the required
theoretical properties, differ strongly. Participants of the seminar were
deliberately invited from both "denominations" in roughly equal number.
Several informal discussions in the evenings, and in particular a lively
break-out discussion on Thursday, helped clear up the mathematical connections
(while also airing key conceptual points of contention from either side).
Thursday's group is planning to write a publication based on the results of
the discussion; this would be a highly valuable concrete contribution arising
from the seminar that may help draw this community closer together.

Despite the challenges to some of the long-standing paradigms of this community, the seminar was infused with an air of excitement. The participants seemed to share the sensation that machine learning is still only just beginning to show its full potential. The mathematical concepts and insights that have emerged from the study of kernel/GP models may have to evolve and be adapted to recent developments, but their fundamental nature means they are quite likely to stay relevant for the understanding of current and future model classes. Far from going out of fashion, mathematical analysis of the statistical and numerical properties of machine learning model classes seems slated for a revival in coming years. And much of it will be leveraging the notions discussed at the seminar.

**Summary text license**

Creative Commons BY 3.0 Unported license

Arthur Gretton, Philipp Hennig, Carl Edward Rasmussen, and Bernhard Schölkopf

## Classification

- Artificial Intelligence / Robotics
- Data Structures / Algorithms / Complexity
- Modelling / Simulation

## Keywords

- Machine Learning
- Kernel Methods
- Gaussian Processes
- Probabilistic Programming
- Probabilistic Numerics