https://www.dagstuhl.de/16481

# New Directions for Learning with Kernels and Gaussian Processes

## Organizers

• Arthur Gretton (University College London, GB)
• Philipp Hennig (MPI für Intelligente Systeme – Tübingen, DE)
• Carl Edward Rasmussen (University of Cambridge, GB)
• Bernhard Schölkopf (MPI für Intelligente Systeme – Tübingen, DE)

## Summary

Machine learning is a young field that currently enjoys rapid, almost dizzying advancement on both the theoretical and the practical side. On both counts, this discipline, obscure until quite recently, is increasingly turning into a central area of computer science. Dagstuhl Seminar 16481, "New Directions for Learning with Kernels and Gaussian Processes", attempted to allow a key community within machine learning to get its bearings at this crucial moment in time.

Positive definite kernels are a concept that dominated machine learning research in the first decade of the millennium. They provide infinite-dimensional hypothesis classes that deliver expressive power in an elegant analytical framework. In their probabilistic interpretation as Gaussian process models, they are also a fundamental concept of Bayesian inference:

A positive definite kernel $k: X \times X \to \mathbb{R}$ on some input domain $X$ is a function with the property that, for all finite sets $\{x_1,\dots,x_N\} \subset X$, the matrix $K \in \mathbb{R}^{N \times N}$ with elements $k_{ij} = k(x_i,x_j)$ is positive semidefinite. According to a theorem by Mercer, given certain regularity assumptions, such kernels can be expressed as a potentially infinite expansion

$$k(x,x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i^*(x'), \qquad \text{with} \qquad \sum_{i=1}^{\infty} \lambda_i < \infty,$$

where $*$ denotes the conjugate transpose, $\lambda_i \in \mathbb{R}_+$ is a non-negative eigenvalue, and $\phi_i$ is an eigenfunction with respect to some measure $u(x)$, that is, a function satisfying

$$\int k(x,x') \phi_i(x) \, du(x) = \lambda_i \phi_i(x').$$

Random functions $f(x)$ drawn by independently sampling Gaussian weights for each eigenfunction,

$$f(x) = \sum_{j=1}^{\infty} f_j \phi_j(x) \qquad \text{where} \qquad f_j \sim \mathcal{N}(0,\lambda_j),$$

are draws from the centered Gaussian process (GP) $p(f) = \mathcal{GP}(f;0,k)$ with covariance function $k$. The logarithm of this Gaussian process measure is, up to constants and some technicalities, the square of the norm $\|f\|_k^2$ associated with the reproducing kernel Hilbert space (RKHS) of functions reproduced by $k$.
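As a concrete illustration of the expansion and the sampling construction above, the following minimal sketch uses the Brownian-motion kernel k(x, x') = min(x, x') on [0, 1], whose eigenvalues and eigenfunctions with respect to the Lebesgue measure are known in closed form. The choice of kernel, the truncation levels, and all function names are assumptions made for this example; none of them come from the seminar.

```python
import math
import random


def eigenvalue(j):
    """lambda_j of k(x, x') = min(x, x') on [0, 1] (Lebesgue measure), j = 1, 2, ..."""
    return 1.0 / ((j - 0.5) ** 2 * math.pi ** 2)


def eigenfunction(j, x):
    """phi_j(x) = sqrt(2) * sin((j - 1/2) * pi * x), orthonormal on [0, 1]."""
    return math.sqrt(2.0) * math.sin((j - 0.5) * math.pi * x)


def mercer_kernel(x, x_prime, num_terms=2000):
    """Truncated Mercer sum: sum_j lambda_j * phi_j(x) * phi_j(x')."""
    return sum(
        eigenvalue(j) * eigenfunction(j, x) * eigenfunction(j, x_prime)
        for j in range(1, num_terms + 1)
    )


def sample_gp_path(xs, num_terms=200, rng=None):
    """Draw f(x) = sum_j f_j * phi_j(x) with independent f_j ~ N(0, lambda_j)."""
    rng = rng or random.Random(0)
    # random.gauss takes a standard deviation, hence the square root of lambda_j.
    weights = [
        rng.gauss(0.0, math.sqrt(eigenvalue(j))) for j in range(1, num_terms + 1)
    ]
    return [
        sum(w * eigenfunction(j, x) for j, w in enumerate(weights, start=1))
        for x in xs
    ]


if __name__ == "__main__":
    # The truncated expansion should be close to min(0.3, 0.7).
    print(mercer_kernel(0.3, 0.7))
    # One approximate draw from the centered GP, evaluated on a coarse grid.
    print(sample_gp_path([i / 10 for i in range(11)]))
```

For this kernel the eigenvalues sum to 1/2, so the summability condition on the eigenvalues holds. Truncating the sum gives a finite-rank approximation of both the kernel and the sampled functions.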

Supervised machine learning methods that infer an unknown function $f$ from a data set of input-output pairs $(X,Y) := \{(x_i,y_i)\}_{i=1,\dots,N}$ can be constructed by minimizing an empirical risk $\ell(f(X);Y)$ regularized by $\|\cdot\|_k^2$ or, in a way that is algorithmically equivalent but carries a different philosophical interpretation, by computing the posterior Gaussian process measure arising from conditioning $\mathcal{GP}(f;0,k)$ on the observed data points under a likelihood proportional to the exponential of the negative empirical risk.
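The equivalence described above can be made concrete for the special case of a squared-error risk, which corresponds to a Gaussian likelihood. In that case both views lead to the weights alpha = (K + s I)^{-1} y, where s plays the role of the regularization weight in the risk view and of the noise variance in the GP view. The sketch below is a minimal illustration under these assumptions; the squared-exponential kernel, the hand-written solver, and all names are hypothetical choices for the example, and a practical implementation would use a numerical library instead.

```python
import math


def rbf_kernel(x, x_prime, lengthscale=1.0):
    """Squared-exponential kernel on the real line (an assumed example kernel)."""
    return math.exp(-0.5 * ((x - x_prime) / lengthscale) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def fit_weights(xs, ys, noise_var):
    """alpha = (K + noise_var * I)^{-1} y, shared by both interpretations."""
    A = [
        [rbf_kernel(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(xs)]
        for i, a in enumerate(xs)
    ]
    return solve(A, ys)


def predict(x_star, xs, alpha):
    """GP posterior mean at x_star, equal to the regularized risk minimizer there."""
    return sum(a * rbf_kernel(x_star, x) for a, x in zip(alpha, xs))


def regularized_risk(alpha, xs, ys, noise_var):
    """Squared error plus noise_var * ||f||_k^2 for f = sum_i alpha_i k(., x_i)."""
    n = len(xs)
    K = [[rbf_kernel(a, b) for b in xs] for a in xs]
    fitted = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    data_fit = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    rkhs_norm_sq = sum(alpha[i] * fitted[i] for i in range(n))  # alpha^T K alpha
    return data_fit + noise_var * rkhs_norm_sq


if __name__ == "__main__":
    xs, ys = [-1.0, 0.0, 0.5, 2.0], [0.2, 1.0, 0.7, -0.5]
    alpha = fit_weights(xs, ys, noise_var=0.1)
    print(predict(0.25, xs, alpha))
    print(regularized_risk(alpha, xs, ys, noise_var=0.1))
```

Perturbing any entry of `alpha` increases `regularized_risk`, which is a quick numerical check that the weights obtained by conditioning also minimize the regularized risk.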

The prominence of kernel/GP models was founded on this conceptually and algorithmically compact yet statistically powerful description of inference and learning of nonlinear functions. In recent years, however, hierarchical ('deep') parametric models have bounced back and delivered a series of impressive empirical successes. In areas like speech recognition and image classification, deep networks now far surpass the predictive performance previously achieved with nonparametric models. One central goal of the seminar was to discuss how the superior adaptability of deep models can be transferred to the kernel framework while retaining at least some analytical clarity. Among the central lessons from the 'deep resurgence' identified by the seminar participants is that the kernel community has been too reliant on theoretical notions of universality. Instead, representations must be learned on a more general level than previously accepted. This process is often associated with an 'engineering' approach to machine learning, in contrast to the supposedly more 'scientific' air surrounding kernel methods, but its importance must not be dismissed.

At the same time, participants also pointed out that deep learning is often misrepresented, in particular in popular expositions, as an almost magical process, when in reality the concept is closely related to kernel methods and can be understood to some degree through this connection: deep models provide a hierarchical parametrization of the feature functions $\phi_i(x)$ in terms of a finite-dimensional family. The continued relevance of the established theory for kernel/GP models hinges on how much of the power of deep models can be understood from within the RKHS view, and how many new concepts are required to understand the expressivity of a deep learning machine.
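The remark that deep models parametrize the feature functions by a finite-dimensional family can be sketched as follows: the outputs of the last layer of a small network are treated as a finite set of features, and their inner product defines a finite-rank positive semidefinite kernel. This is only an illustrative sketch; the layer sizes, the tanh nonlinearity, the random weights (standing in for learned parameters), and all names are assumptions, not a method presented at the seminar.

```python
import math
import random


def build_layers(sizes, seed=0):
    """Random weights and biases for consecutive tanh layers (stand-ins for learned ones)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [
            [rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
            for _ in range(n_out)
        ]
        biases = [rng.gauss(0.0, 0.1) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def features(x, layers):
    """Hierarchical feature map phi(x): a composition of tanh layers on a scalar input."""
    h = [x]
    for weights, biases in layers:
        h = [
            math.tanh(sum(w * v for w, v in zip(row, h)) + b)
            for row, b in zip(weights, biases)
        ]
    return h


def deep_kernel(x, x_prime, layers):
    """k(x, x') = sum_i phi_i(x) * phi_i(x'), with finitely many features phi_i."""
    fx, fx_prime = features(x, layers), features(x_prime, layers)
    return sum(a * b for a, b in zip(fx, fx_prime))


if __name__ == "__main__":
    layers = build_layers([1, 8, 4], seed=0)
    points = [-1.0, 0.0, 0.5, 2.0]
    gram = [[deep_kernel(a, b, layers) for b in points] for a in points]
    for row in gram:
        print(row)
```

Because the kernel is an inner product of feature vectors, every Gram matrix it produces is symmetric and positive semidefinite, with rank at most the number of features in the last layer.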

There is also unconditionally good news: in a separate but related development, kernels have had their own renaissance lately in the young areas of probabilistic programming ('computing of probability measures') and probabilistic numerics ('probabilistic descriptions of computing'). In both areas, kernels and Gaussian processes have been used as a descriptive language and, similar to the situation in general machine learning, only a handful of comparatively simple kernels have been used so far. The central question here, too, is thus how kernels can be designed for challenging, in particular high-dimensional, regression problems. In contrast to the wider situation in ML, though, kernel design here should take place at compile time and be a structured algebraic process that maps source code describing a graphical model into a kernel. This gives rise to new fundamental questions for the theoretical computer science of machine learning.

A third thread running through the seminar concerned the internal conceptual schism between the probabilistic (Gaussian process) view and the statistical learning theoretical (RKHS) view on the model class. Although the algorithms and algebraic ideas used on both sides overlap almost to the point of equivalence, their philosophical interpretations, and thus also the required theoretical properties, differ strongly. Participants in the seminar were deliberately invited from both "denominations" in roughly equal numbers. Several informal discussions in the evenings, and in particular a lively break-out discussion on Thursday, helped clear up the mathematical connections (while also airing key conceptual points of contention from either side). Thursday's group is planning to write a publication based on the results of the discussion; this would be a highly valuable concrete contribution arising from the seminar that may help draw this community closer together.

Despite the challenges to some of the long-standing paradigms of this community, the seminar was infused with an air of excitement. The participants seemed to share the sensation that machine learning is still only just beginning to show its full potential. The mathematical concepts and insights that have emerged from the study of kernel/GP models may have to evolve and be adapted to recent developments, but their fundamental nature means they are quite likely to stay relevant for the understanding of current and future model classes. Far from going out of fashion, mathematical analysis of the statistical and numerical properties of machine learning model classes seems slated for a revival in coming years. And much of it will be leveraging the notions discussed at the seminar.

Creative Commons BY 3.0 Unported license
Arthur Gretton, Philipp Hennig, Carl Edward Rasmussen, and Bernhard Schölkopf

## Classification

• Artificial Intelligence / Robotics
• Data Structures / Algorithms / Complexity
• Modelling / Simulation

## Keywords

• Machine Learning
• Kernel Methods
• Gaussian Processes
• Probabilistic Programming
• Probabilistic Numerics

## Documentation

Each Dagstuhl Seminar and Dagstuhl Perspectives Workshop is documented in the series Dagstuhl Reports. The seminar organizers, in cooperation with the collector, prepare a report that includes contributions from the participants' talks together with a summary of the seminar.

## Publications

Furthermore, a comprehensive peer-reviewed collection of research papers can be published in the series Dagstuhl Follow-Ups.

## Dagstuhl's Impact

Please inform us when a publication results from your seminar. These publications are listed in the category Dagstuhl's Impact and are presented on a special shelf on the ground floor of the library.