DAGSTUHL Seminar 01301
23.07. - 27.07.2001


[Schedule]
Abstracts
This page is maintained by Mikio Braun
Please send comments, requests, etc. also to Mikio Braun
[top]
Information Geometry and Inference Principles
Shun-ichi Amari
RIKEN Brain Science Institute
Information geometry studies the intrinsic geometrical structure of
a family of probability distributions. The structure is uniquely
defined from the principle of invariance, giving a Riemannian metric
(due to the Fisher information matrix) and a dual pair of affine
connections. It is useful in many problems related to stochastic
phenomena such as statistical inference, model selection, information
theory, control systems theory, etc.
We apply the method of information geometry to multilayer
perceptrons, which have nonlinear input-output relations depending on
the modifiable parameters. These parameters are modified by learning from
examples. When noise disturbs the output, the behavior of a
multilayer perceptron is described by the conditional probability
distribution of the output conditioned on the input, and the
probability distribution is parameterized by the modifiable
parameters.
The parameter space, called the neuromanifold, is a family of
probability distributions in which the learning process is represented
by a trajectory. The stochastic gradient learning method is most
popular in online learning. However, when the parameter space has a
Riemannian structure, the gradient does not represent the true
steepest direction and should be replaced by the Riemannian or natural
gradient. The backprop method is notorious for slow convergence, due
to plateaus. Such plateaus are given rise to by the underlying
geometrical structure, and the natural gradient method is shown to
have a very good convergence property. It is, however, difficult to
calculate the Fisher information matrix explicitly and to invert it. We
give an adaptive method of obtaining the inverse of the Fisher information
matrix directly.
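As a hedged illustration of the flavor of such an online scheme (the
learning rates eta and eps and the interface are illustrative
assumptions, not the exact algorithm of the talk), a natural-gradient
step with an adaptive running estimate of the inverse Fisher matrix
might look like:

    import numpy as np

    def natural_gradient_step(theta, G_inv, grad_loss, score,
                              eta=0.01, eps=0.001):
        # Adaptive estimate of the inverse Fisher information: nudges G_inv
        # towards the inverse of E[score score^T] without matrix inversion.
        G_inv = (1 + eps) * G_inv \
                - eps * G_inv @ np.outer(score, score) @ G_inv
        # Natural-gradient step: precondition the ordinary gradient by
        # G_inv, the steepest-descent direction w.r.t. the Riemannian metric.
        theta = theta - eta * G_inv @ grad_loss
        return theta, G_inv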
If the neuromanifold were not strongly curved, the natural
gradient would differ little from the ordinary gradient; the observed
difference in performance therefore suggests that the neuromanifold is
strongly curved. We show that many
hierarchical structures, such as multilayer perceptrons, Gaussian
mixtures, and ARMA models in time series, include singular points in
their parameter spaces, where the Fisher information matrix degenerates.
The singularities arise from the inner symmetry of these systems and
occur at points where the system parameters become redundant. Model
selection is important when the true system lies in a neighborhood of
such singularities.
Therefore, we need to analyze the behaviors of statistical
inference and learning, when the true system lies in a neighborhood of
a singular point. The conventional Cramér-Rao paradigm does not hold
in such a case, because the Fisher information is degenerate. The
central limit theorem cannot be applied, either. One should remark
that model selection is important in such a situation, but the
conventional theories of AIC and MDL are based on the Cramér-Rao
paradigm, which does not hold here. Hence, we need a new
theoretical paradigm. The Bayesian framework should also be
modified.
The present talk will discuss these aspects of the geometry of
neuromanifolds in connection with learning, inference and model
selection.
[top]
Gradient Estimates in Reinforcement Learning
Peter Bartlett
Biowulf Technologies
We consider the problem of controlling a partially observable Markov
decision process (POMDP), so as to maximize the time average of a
reward criterion. For parameterized stochastic policies, one approach
is to use the gradient of the performance criterion with respect to
the policy parameters. We present algorithms to estimate these
gradients from a single sample path, by relying on mixing properties
of the controlled POMDP. We give bounds on the estimation and
approximation errors of these estimates for finite samples, in terms
of a certain mixing time of the controlled POMDP. The variance of
these Monte Carlo estimates can be reduced using additive control
variate methods. Two commonly used approaches, reward baselines and
actor-critic algorithms, are special cases. We present bounds on the
expected error for these algorithms, and derive the baselines and
critics that minimize these bounds. These results allow us to evaluate
how suboptimal commonly used algorithms are, and lead to new
algorithms for gradient estimates.
(joint work with Evan Greensmith and Jonathan Baxter)
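A minimal sketch of such a single-path gradient estimator with a
constant baseline as control variate (the environment/policy
interface, the trace parameter beta and the fixed baseline are
illustrative assumptions, not the algorithms of the talk):

    import numpy as np

    def policy_gradient_estimate(env, sample_action, grad_log_policy,
                                 theta, T=100000, beta=0.95, baseline=0.0):
        # z is an eligibility trace of score vectors; beta trades estimation
        # variance against approximation bias via the POMDP's mixing time.
        z = np.zeros_like(theta)
        delta = np.zeros_like(theta)          # running gradient estimate
        obs = env.reset()
        for t in range(T):
            action = sample_action(obs, theta)
            score = grad_log_policy(obs, action, theta)
            obs, reward = env.step(action)
            z = beta * z + score
            # Subtracting a baseline (additive control variate) leaves the
            # expectation unchanged but can reduce the variance.
            delta += ((reward - baseline) * z - delta) / (t + 1)
        return delta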
[top]
Concentration inequalities and penalization methods in model selection
Stephane Boucheron
LRI, CNRS, Université Paris-Sud
Concentration inequalities constitute "natural" extensions of the
classical exponential bounds for sums of independent random
variables (Azuma-Hoeffding, Bennett, Bernstein). Concentration may be
regarded as a "new look at independence". The basic message may be
formulated as follows: any function of many independent random
variables that is smooth in an appropriate sense is almost constant.
The right definition of smoothness, or alternatively of the enlargement
of sets, ranging from smoothness w.r.t. the Hamming distance to the recent
formulations by Talagrand, is not straightforward. Such extensions are
very useful when trying to characterize the fluctuations of
quantities such as the empirical VC-dimension, empirical VC-entropies, and
Rademacher complexities. Those last results can be obtained using the
relatively transparent "entropy method" proposed by Ledoux. One of
the killer applications of the concentration approach was the tail
bounds for suprema of empirical processes indexed by bounded functions
(Talagrand 1996, Ledoux 1997, Massart 2000, Rio 2001). Concentration
inequalities deal with the very topic of the Vapnik-Chervonenkis
inequalities. In contrast to the latter, concentration inequalities
only deal with fluctuations around the mean, leaving the
characterization of this mean to chaining techniques (for empirical
processes).
As far as model selection is concerned, concentration inequalities
prove useful in the design and analysis of penalization strategies as
advocated by Birgé and Massart, Vapnik, etc. In the Structural
Risk Minimization framework, we are interested in selecting among a
set of models F_{1}, ..., F_{k}, ..., i.e. among
a set of estimates ^f_{1}, ..., ^f_{k}, ...
obtained from some randomly collected data set D_{n} =
((x_{1}, y_{1}), ..., (x_{n}, y_{n})),
such that the effective loss E[ l(^f(X), Y) ] is
minimal (l might be the absolute or the quadratic loss).
The basic idea is that one should minimize a penalized empirical risk
    L_{n}(^f_{k}) = (1/n) sum_{i=1}^{n} l(^f_{k}(x_{i}), y_{i})
plus some penalty term pen(n,k). pen(n,k) should
compensate for the amount of overfitting in F_{k}, i.e.
L(^f_{k}) - L_{n}(^f_{k}).
Concentration inequalities prove useful in designing and analyzing
data-dependent penalties. The latter constitute an important
ingredient in any would-be practical system with guaranteed
performance.
It remains to determine when such data-dependent
penalization techniques can achieve adaptivity in the Donoho-Johnstone
sense.
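As a minimal illustration of the selection rule itself (in Python; the
empirical risks and penalties are taken as given here, rather than
derived from concentration inequalities):

    def penalized_selection(empirical_risks, penalties):
        # empirical_risks[k] plays the role of L_n(^f_k),
        # penalties[k] the role of pen(n, k).
        # Return the index k minimizing the penalized empirical risk.
        scores = [L + p for L, p in zip(empirical_risks, penalties)]
        return min(range(len(scores)), key=scores.__getitem__)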
[top]
Olivier Bousquet
[top]
Learning and Combinatorial Optimization: The noisy Traveling Salesman
Problem
Joachim Buhmann
CVPR Group, University of Bonn
[download slides (pdf)]
Many problems in the real world which are modeled as combinatorial
optimization problems are stochastic in nature, i.e. the parameters
defining the problem are random variables. This fact is traditionally
neglected when a combinatorial optimization problem is formulated. I
demonstrate, using the traveling salesman problem as an example, that the
minimal solution computed on a single (training) instance of a random
problem can perform suboptimally on a second (test) instance. Computer
experiments provide empirical evidence that certain Markov Chain Monte
Carlo algorithms yield solutions which are more robust than the
optimal training solution. Learning is performed by sampling a typical
permutation matrix or by suitably averaging over TSP solutions.
The overfitting behavior of the ERM solution can be understood in
terms of statistical learning theory. The MCMC algorithm computes an
approximation to the empirical risk and the approximation accuracy
should be controlled by robustness against overfitting. Too precise
approximations of the training risk overfit a test instance, whereas
too crude approximations introduce an underfitting bias. A
generalization of the VC inequality quantifies this bias-variance
tradeoff. Large deviations between training and test performance are
bounded by Bernstein's inequality. The minimum of this bound
determines the stop temperature for an annealing scheme. (joint work
with Mikio Braun)
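A hedged sketch of such an annealing scheme for the TSP (Metropolis
sampling over 2-opt moves; T0 and the cooling rate are illustrative,
and the stop temperature T_stop, derived in the talk from a
Bernstein-type bound, is taken here as an input):

    import numpy as np

    def tour_length(dist, tour):
        # Total length of the closed tour through all cities.
        return sum(dist[tour[i], tour[(i + 1) % len(tour)]]
                   for i in range(len(tour)))

    def anneal_tsp(dist, T_stop, T0=10.0, cooling=0.99, rng=None):
        # Metropolis sampling over 2-opt moves with geometric cooling,
        # stopped at T_stop instead of T -> 0 so that the sampled tour does
        # not overfit the single training instance.
        rng = np.random.default_rng() if rng is None else rng
        n = len(dist)
        tour = rng.permutation(n)
        T = T0
        while T > T_stop:
            for _ in range(n):
                i, j = sorted(rng.integers(0, n, size=2))
                cand = tour.copy()
                cand[i:j + 1] = cand[i:j + 1][::-1]    # 2-opt: reverse segment
                delta = tour_length(dist, cand) - tour_length(dist, tour)
                if delta < 0 or rng.random() < np.exp(-delta / T):
                    tour = cand
            T *= cooling
        return tour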
[top]
Hilbertian Learning
Stephane Canu
INSA de Rouen, France
Kernels, and in particular Mercer or reproducing kernels, play a crucial
role in statistical learning theory and functional estimation.
But very little is known about the underlying functional space in which
algorithms look for the solution. How to choose it? How to
build it? What is its relationship with regularization?
Introducing Hilbert-Schmidt operators helps to answer some of
these questions. This allows us to introduce learnable frames as a
powerful and promising functional tool for building relevant kernels.
Furthermore, the learnable frames framework clarifies the
relationship between kernels and parametric components. This is a
theory for semiparametric learning. In particular, wavelets are
included in this framework together with their associated kernels.
[top]
Vicinal Risk Minimization
Olivier Chapelle
The Vicinal Risk Minimization principle establishes a bridge between
generative models and methods derived from the Structural Risk
Minimization Principle such as Support Vector Machines or Statistical
Regularization. We explain how VRM provides a framework which
integrates a number of existing algorithms, such as Parzen windows,
Support Vector Machines, Ridge Regression, Constrained Logistic
Classifiers and TangentProp.
We will show how the approach implies new algorithms for solving
problems usually associated with generative models. New algorithms are
described for dealing with pattern recognition problems with very
different pattern distributions and for dealing with unlabeled data.
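As a minimal sketch of the idea behind VRM with Gaussian vicinities
(the width sigma, the sample count, and the Monte Carlo estimate are
illustrative assumptions; VRM covers more general vicinity
distributions):

    import numpy as np

    def vicinal_risk(f, loss, X, y, sigma=0.1, n_samples=20, rng=None):
        # Replace each empirical point by a local "vicinity" distribution
        # (here an isotropic Gaussian centred at x_i) and average the loss
        # over it, instead of evaluating the loss at x_i alone.
        rng = np.random.default_rng() if rng is None else rng
        total = 0.0
        for xi, yi in zip(X, y):
            vicinity = xi + sigma * rng.standard_normal((n_samples, len(xi)))
            total += np.mean([loss(f(v), yi) for v in vicinity])
        return total / len(X)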
[top]
Bayesian and Prequential Inference for Model Selection
A. P. Dawid
University College London
The Bayesian approach to inference allows us to express, in simple
probabilistic form, two kinds of uncertainty: both about an unknown
parameter of an assumed model, and about the model itself. It also
very naturally allows us to make predictions for as yet unobserved
quantities, a feature that turns out to be particularly valuable for
model selection, as well as important in its own right.
In my talk I shall describe how the Bayesian approach naturally
behaves in an asymptotically desirable way in problems of model
selection, without the need for any extraneous or ad hoc ingredients
such as regularisation, nor oversimplifying assumptions such as
independent observations. Some of the problems arising with finite
datasets will also be considered.
I shall then generalise the Bayesian predictive approach by
introducing the methodology of Prequential Analysis, which assesses
and compares models directly in terms of their one-step-ahead predictive
performance. It is thus naturally suited to the tasks of model
criticism and model selection. An important result is the
availability of "optimal" forecasts, based on an extension of Empirical
Risk Minimisation, even when the model is incorrectly specified.
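A minimal sketch of a prequential comparison by cumulative
one-step-ahead log loss (the fit/predictive_density interface is an
assumed illustration, not a prescribed API):

    import numpy as np

    def prequential_log_score(models, data):
        # Each model is assumed to offer fit(past) and
        # predictive_density(x); lower cumulative log loss means better
        # prequential performance.
        scores = {name: 0.0 for name in models}
        for t in range(1, len(data)):
            past, x_next = data[:t], data[t]
            for name, model in models.items():
                model.fit(past)                       # update on data so far
                p = model.predictive_density(x_next)  # 1-step-ahead forecast
                scores[name] -= np.log(p)             # accumulate log loss
        return scores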
[top]
Model Selection and Infinite Models
Zoubin Ghahramani
University College London
[download slides (ps.gz)]
I will discuss two apparently conflicting views of Bayesian
Learning. The first invokes automatic Occam's Razor (which results
from averaging over the parameters) to do model selection, usually
preferring models of low complexity. The second advocates not limiting
the number of parameters in the model and doing inference in the limit
of a large number of parameters if computationally possible. The first
view lends itself to methods of approximating the evidence such as
variational approximations. I will briefly describe these and give
examples. For the second view, I will show that for a variety of
models it is possible to do efficient inference even with an infinite
number of parameters. I will discuss pros and cons of both views and
how they can be reconciled.
Joint work with Carl E. Rasmussen and Matthew J. Beal.
[top]
Bagging equalizes influence
Yves Grandvalet
Bagging constructs an estimator by averaging predictors trained on
bootstrap samples. Bagged estimates almost always improve on
the original predictor. It is thus important to understand the
reasons for this success, and also for the occasional failures. It is
widely believed that bagging is effective thanks to the variance
reduction stemming from averaging predictors. However, seven years
after its introduction, bagging is still not fully understood.
We provide experimental evidence that bagging stabilizes
prediction by equalizing the influence of training examples.
Bagging's improvements/deteriorations can be explained by the
goodness/badness of highly influential examples, whereas other
arguments reach their limits. Finally, the reasons behind the
equalization effect suggest that other resampling strategies, such as
half-sampling, should provide qualitatively identical effects while
being computationally more efficient than bootstrap sampling.
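For reference, a minimal sketch of bagging itself (the train/predict
interface is an assumption; X and y are numpy arrays):

    import numpy as np

    def bag(train, predict, X, y, X_test, B=50, rng=None):
        # Average B predictors, each fit on a bootstrap sample drawn with
        # replacement; the averaging equalizes the influence that any
        # single training example has on the final prediction.
        rng = np.random.default_rng() if rng is None else rng
        n = len(X)
        preds = []
        for _ in range(B):
            idx = rng.integers(0, n, n)       # bootstrap sample
            model = train(X[idx], y[idx])
            preds.append(predict(model, X_test))
        return np.mean(preds, axis=0)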
[top]
Kernel Methods
Isabelle Guyon
Kernel methods address a wide variety of induction problems, including
function approximation (interpolation and regression), classification,
density estimation, clustering and solving linear operator
equations. The history of kernel machines started at the turn of the
20th century when Hilbert and Schmidt introduced integral equations of the form
    ∫ K(s, t) f(t) dt = F(s).
This triggered a lot of research on the conditions that the kernel
function K(s,t) must satisfy. In 1909, Mercer stated the
equivalence of positive definite kernels and valid "dot products",
which opened the door to many theoretical derivations. Kernels then
appeared in density estimation (Parzen windows), classification and
regression (splines). Parzen-window-type kernels are shift invariant
and include Gaussian kernels and potential functions. They were
introduced in the 1960's (Parzen 1962, Aizerman et al. 1964). Kernels
are also similarity measures and include various dot products. One of
the most widely used is the polynomial kernel
(x.y)^{q}, q being the polynomial
degree. Kernels have been used in signal processing (convolutions)
and image processing. Using kernels in the preprocessing steps leads
to a nice unified framework for creating new kernels by combining
kernels. The duality between approximation functions linear in their
parameters, f(x) = w.phi(x), and kernel approximation
functions
    f(x) = sum_{k} alpha_{k} K(x, x_{k})
has been known since the 1960's. Interest in it was renewed in
1992 when it was first used in the context of support vector machines
(SVMs). Since then a lot of algorithms that exploit this duality have
been derived. Many extensions of the original simple kernel machines
have been made, allowing users to treat nonvectorial inputs (strings,
trees, sets) and fancy outputs (multiclass, multi-label, sequences) and
to address a variety of complex optimization problems. As of today
kernel machines span a wide range of applications with a wide spectrum
of sizes of input space and training data sets. The most popular
kernels are the Gaussian kernel and the polynomial kernel (with its
special case, the linear kernel), but specialized kernels are an active
area of research.
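A small numerical check of this duality for the quadratic polynomial
kernel K(x, y) = (x.y)^{2} in two dimensions (the explicit feature map
phi is an illustrative choice):

    import numpy as np

    def K(x, y):
        # Quadratic polynomial kernel (x.y)^2.
        return np.dot(x, y) ** 2

    def phi(x):
        # Explicit feature map of K in 2 dimensions:
        # phi(u).phi(v) = u1^2 v1^2 + u2^2 v2^2 + 2 u1 u2 v1 v2 = (u.v)^2.
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 2))         # points x_k
    alpha = rng.standard_normal(5)          # coefficients alpha_k
    x = rng.standard_normal(2)              # a test point

    f_dual = sum(a * K(x, xk) for a, xk in zip(alpha, X))
    w = sum(a * phi(xk) for a, xk in zip(alpha, X))    # primal weights w
    f_primal = np.dot(w, phi(x))
    print(np.isclose(f_dual, f_primal))     # True: both forms agree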
[top]
Algorithmic Luckiness
Ralf Herbrich
Microsoft Research, Cambridge, United Kingdom
[download slides (ps.gz)]
In contrast to standard statistical learning theory which studies
uniform bounds on the expected error we present a framework that
exploits the specific learning algorithm used. Motivated by the
luckiness framework [Shawe-Taylor et al., 1998] we are also able to exploit
the serendipity of the training sample. The main difference to
previous approaches lies in the complexity measure; rather than
covering all hypotheses in a given hypothesis space it is only
necessary to cover the functions which could have been learned using
the fixed learning algorithm. We show how the resulting framework
relates to the VC, luckiness and compression frameworks. Finally, we
present an application of this framework to the maximum margin
algorithm for linear classifiers which results in a bound that
exploits both the margin and the distribution of the data in feature
space.
This work is joint work with Bob Williamson.
[top]
Matthias Hild
[top]
Gaps and Bridges between Inductive Inference and Statistical Learning Theory
Wolfram Menzel
Institute for Logic, Complexity and Deduction Systems
Computer Science Department, University of Karlsruhe
[download slides (ps.gz)]
These two worlds look totally different, hardly
comparable to each other. Still, both come from a common intuition to
model phenomena of learning. Analysis of their differences and
possible relationship leads, among other things, to three main points:
- Probability measures on the power set of N are
``finitary'', thus contrary to all ``Lebesgue-style'' ones.
- Uniformity, as commonly required in learnability definitions of
statistical learning theory, is a sensitive and crucial point.
- When a hypothesis space is parameterized, the relationship between
distance measuring among the parameters on the one hand, and among the
functions they denote on the other hand, seems to be important, at
least in application-oriented approaches.
Results can be presented on two kinds of questions:
- An attempted combination of learning as ``approximating in a uniform
and arbitrarily good way'' and as ``finding an ultimate, stable
regularity''.
- In the Inductive Inference scenario: can hopes for some kind of
``continuity'' or at least ``compatibility'' between distance
measuring among programs (parameters) and among computable functions
ever be satisfied?
[top]
Assessing Reliability of Unsupervised Learning: a Resampling
Approach
Klaus-Robert Müller^{1,2}
Frank Meinecke^{1,2}, Andreas Ziehe^{1}
and Motoaki Kawanabe^{2}
Fraunhofer Gesellschaft FIRST, Berlin
Contact:
1. Fraunhofer Inst. FIRST, Kekuléstr. 7, Berlin
2. Univ. of Potsdam, Am neuen Palais 10, Potsdam
When applying unsupervised learning techniques like ICA or temporal
decorrelation, a key question is whether the discovered projections
are reliable. In other words, can we give error bars, or can we
assess the quality of our separation? We use resampling
methods to tackle these questions and show experimentally that our
proposed variance estimates are strongly correlated with the
separation error.
We demonstrate that this reliability estimate can be used to select
the appropriate ICA model, to significantly enhance the separation
performance, and, most importantly, to mark the components that can
really have a physical meaning. Application to data from an MEG
experiment underlines the usefulness of our approach.
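A toy sketch of the general recipe (the AMUSE-style
temporal-decorrelation separator, the block resampling, and all
constants are illustrative assumptions; the actual procedure of the
talk may differ): resample the data, re-estimate the unmixing matrix,
and read off the variance of the estimates as a reliability index.

    import numpy as np

    def unmix_amuse(X, tau=1):
        # Toy separation by temporal decorrelation (AMUSE-style): whiten,
        # then diagonalize a symmetrized time-lagged covariance matrix.
        X = X - X.mean(axis=1, keepdims=True)
        d, E = np.linalg.eigh(np.cov(X))
        W_white = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
        Z = W_white @ X
        C = Z[:, :-tau] @ Z[:, tau:].T / (Z.shape[1] - tau)
        _, V = np.linalg.eigh((C + C.T) / 2)
        return V.T @ W_white                 # estimated unmixing matrix

    def reliability(X, n_boot=50, block=100, rng=None):
        # Blocks (rather than single samples) are resampled to preserve
        # some temporal structure. NOTE: a full implementation must first
        # resolve the sign and permutation indeterminacies of the rows,
        # which is omitted here.
        rng = np.random.default_rng() if rng is None else rng
        n_blocks = X.shape[1] // block
        Ws = []
        for _ in range(n_boot):
            idx = rng.integers(0, n_blocks, n_blocks)
            Xb = np.hstack([X[:, b * block:(b + 1) * block] for b in idx])
            Ws.append(unmix_amuse(Xb))
        return np.var(np.asarray(Ws), axis=0)   # "error bars" per entry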
[top]
Bias of Estimators and Regularization Terms
Noboru Murata
Department of Electrical, Electronics, and Computer Engineering,
Waseda University
We deal with the role of regularization terms (penalty terms) from
the viewpoint of the bias of minimum training error estimation.
In the field of neural networks, for instance, regularization
terms are often utilized to avoid overfitting; however,
most of the time cross-validation is used to determine the
strength of the regularization.
First we clarify the bias of minimum training error
estimation, which is caused by the nonlinearity of the learning
system and depends on the size of the training sample. Then, taking
this bias into account, we consider the appropriate size of the
regularization term, namely the one that minimizes the predictive error.
The optimal size of the regularization term in this sense is
calculated from second- and third-order information of the loss
function. When the learning system has a large number of
modifiable parameters, it is computationally expensive to
calculate this higher-order information, so we propose a simple
method of approximating the optimal size via a generalized AIC.
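One concrete instance of such a criterion, as a hedged sketch in the
spirit of Murata's network information criterion (the exact
construction of the talk may differ): choose the strength lambda of
the regularization term to minimize

    \mathrm{NIC}(\lambda)
      = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl( f_{\hat\theta_\lambda}(x_i),\, y_i \bigr)
      + \frac{1}{n} \, \mathrm{tr}\bigl( \hat{G}_\lambda^{-1} \hat{Q}_\lambda \bigr)

where \hat{G}_\lambda is the Hessian of the penalized empirical loss at
the minimizer \hat\theta_\lambda and \hat{Q}_\lambda is the empirical
covariance of its gradients, i.e. precisely the kind of higher-order
information of the loss function referred to above.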
[top]
Models in Hyperbolic Space
Helge Ritter
Neuroinformatics Group
University of Bielefeld
A crucial question for model selection is the proper structural bias
for the task at hand. Hyperbolic spaces are in certain cases an
attractive alternative to the usually employed Euclidean
R^{n}, since they offer exponentially growing
neighborhoods already in d = 2. After a brief synopsis of
recent work, we present as an example the use of the discretized
hyperbolic plane for creating dimension-reduced mappings of
text-document data, exhibiting semantic relationships as neighborhoods
in hyperbolic 2d space.
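The exponential growth is easy to check numerically: in the hyperbolic
plane of curvature -1 a circle of radius r has circumference
2 pi sinh(r), against 2 pi r in the Euclidean plane.

    import numpy as np

    # Circumference of a circle of radius r: 2*pi*r in the Euclidean plane
    # versus 2*pi*sinh(r) in the hyperbolic plane of curvature -1, so
    # neighborhood sizes already grow exponentially in d = 2.
    for r in (1.0, 2.0, 4.0, 8.0):
        print(f"r={r}: euclidean={2 * np.pi * r:10.1f}  "
              f"hyperbolic={2 * np.pi * np.sinh(r):10.1f}")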
[top]
Optimal Inductive Inference Under an Algorithmic Prior Reflecting
Maximally Efficient Data Generation
Jürgen Schmidhuber
IDSIA, Lugano
Solomonoff's optimal but noncomputable strategy for inductive
inference assumes the observations are drawn from a recursive
prior. Here we make the additional assumption that the process
computing the data is optimally efficient, and that the cumulative
prior probability of all data whose computation costs at least O(n)
time is inversely proportional to n. Since in fact there exists a
very simple, general, asymptotically optimal algorithm for all
computable data, we can explicitly extract the corresponding
speed prior, and derive a computable strategy for optimal
inductive reasoning.
[top]
Stability of Posterior Estimates for Kernels
Alex Smola
Australian National University, MLG
Maximum a posteriori approximation is a popular technique for Gaussian
Processes, since the value of the negative log-posterior can be used
as an indicator of how plausible a certain hypothesis happens to
be. The minimum of the regularized risk functional in Support Vector
Machines can be used for a similar purpose.
We prove that the minimum of the negative log-posterior and the
minimum of the regularized risk functional are concentrated random
variables, provided the likelihood (or loss function) is log-concave.
[top]
Statistical Inference and Relevant Information Encoding
Naftali Tishby
The "Information bottleneck method" is an unsupervised nonparametric
data organization technique that aims at extracting the relevant
information in one random variable with respect to another one. Given
a joint distribution, p(x,y), this method constructs a new
variable ^X (or T) that induces soft partitions
over the values of X that are informative about
Y.
Many problems can be cast into this general framework, such as: time
series prediction, supervised and unsupervised learning, noise
filtering, feature extraction, etc. It can be formulated as a tradeoff
between two mutual information measures:
    L[p(^x|x)] = I(X; ^X) - beta I(^X; Y)
whose stationary points, for every value of the
positive Lagrange multiplier beta, are characterized by a closed set
of self-consistent equations. We have proved the general
convergence of an iterative algorithm - similar to the Blahut-Arimoto
algorithm in Rate-Distortion theory - that finds the optimal tradeoff
and partition p(^x|x). The algorithm has an agglomerative
greedy version which has been applied successfully to problems such as
document classification, gene expression analysis, spectral analysis
and neuronal coding. We have recently extended the method to
multivariate cases, by using the Lagrangian
    L = I^{G1}(X1, ..., Xn, T) - beta I^{G2}(X1, ..., Xn, T)
where I^{G1} and I^{G2} are multi-informations of a
Bayesian net.
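A compact sketch of the resulting self-consistent iteration for the
basic two-variable case (initialization, the fixed iteration count and
the numerical guard are illustrative choices):

    import numpy as np

    def information_bottleneck(pxy, n_t, beta, n_iter=200, rng=None):
        # Blahut-Arimoto-style iteration for the IB trade-off on a finite
        # joint distribution pxy (shape: n_x by n_y), with n_t clusters.
        rng = np.random.default_rng() if rng is None else rng
        eps = 1e-12                              # numerical guard
        px = pxy.sum(axis=1)
        py_x = pxy / px[:, None]                 # p(y|x)
        pt_x = rng.random((n_t, len(px)))        # p(t|x), random init
        pt_x /= pt_x.sum(axis=0)
        for _ in range(n_iter):
            pt = pt_x @ px                       # p(t)
            py_t = (pt_x * px) @ py_x / (pt[:, None] + eps)   # p(y|t)
            # KL( p(y|x) || p(y|t) ) for every pair (t, x)
            kl = (py_x[None, :, :] * (np.log(py_x[None, :, :] + eps)
                  - np.log(py_t[:, None, :] + eps))).sum(axis=2)
            pt_x = pt[:, None] * np.exp(-beta * kl)   # self-consistent update
            pt_x /= pt_x.sum(axis=0)
        return pt_x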
[top]
Optimal aggregation of classifiers in statistical learning
Alexandre Tsybakov
Université Paris 6
The problem of statistical learning can be considered as a problem of
nonparametric estimation of sets, where the risk is defined by means of
a specific distance function between sets associated with the
misclassification error. The rates of convergence of classifiers
depend on two parameters: the complexity of the class of candidate
sets and the "margin" parameter. The dependence is given explicitly;
in particular, the optimal rates up to O(n^{-1}), where n is the
sample size, can be attained, and the proposed
classifiers have the property of robustness to the margin. The main
result of the paper concerns optimal aggregation of classifiers: we
suggest a classifier that automatically adapts both to the complexity
and to the margin, and attains the optimal fast rates, up to a
logarithmic factor.
[top]
Development of Statistical Learning Theory
Vladimir Vapnik
AT&T Labs / Royal Holloway, University of London
The development of Statistical Learning Theory is considered from the
point of view of the foundations of statistics. Two different foundations
of classical statistics are considered: the
Glivenko-Cantelli-Kolmogorov (theoretical) approach, and Fisher's
(simplified) applied approach. In the early 60s, it was realized that
Fisher's approach is not sufficiently powerful for solving
high-dimensional problems. Therefore, a theory continuing the ideas
of the theoretical approach (the Kolmogorov approach) was
developed. Initially, it was rejected by the statistics community,
which is why it found its home in computer science. Now, it is a
well-developed branch of science studied by both statistics and
computer science.
[top]
Predictive complexity: theory, possible applications, and open problems
Volodya Vovk
Royal Holloway, University of London
This talk will give a high-level review of some applications of
Kolmogorov's notion of complexity and its variants to the problems of
inference and model selection. We will argue that approaches to
inference can be broadly classified as belonging to either the "Bayesian"
or the "Popperian" paradigm. Kolmogorov complexity and its
generalization, predictive complexity, are technical tools useful in
both paradigms. In particular, we will discuss the use of Kolmogorov
complexity and predictive complexity in the MDL principle and its
generalization, the "Complexity Approximation Principle"; the latter
will be contrasted with the Bayesian-type approach of the theory of prediction with
expert advice. The notion of predictive complexity makes it possible
to generalize the usual notions of randomness and information to a
wide class of loss functions; this generalization allows us to
formalize interesting questions about the limits of inference.
Several open problems about predictive complexity will also be
stated.
[top]
Online Learning - Methods and Open Problems
Manfred Warmuth
University of California at Santa Cruz
[download slides (ps.gz)]
An "online" learning algorithm sees the examples one at a time
and incurs a loss on each new example based on its current
hypothesis. This hypothesis is updated online as more examples
are seen. We are given a comparison class of predictors. The loss
of the online algorithm on a sequence of examples is typically
larger than the loss of the best offline predictor in the
comparison class. The goal of the learner is to bound the
additional loss of the online algorithm over the best offline
predictor on an arbitrary sequence of examples. Such bounds are
called "relative loss bounds" and quantify the price of hiding
the future examples from the learner.
We discuss methods for deriving online algorithms and for proving
relative loss bounds. No background is required. We will stay at
a high level and discuss directions for future research.
The key tool we use is Bregman divergences. They are used as
loss functions and as measures of "distance" between two members
of the comparison class.
We discuss families of algorithms that are characterized by
different Bregman divergences. The two main families are the
gradient descent and the exponentiated gradient families. The former
family includes all the kernel-based algorithms, and the latter
family is motivated by the minimum relative entropy principle (i.e.,
an information-theoretic motivation). We contrast the merits of both
families of algorithms.
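Minimal sketches of one update from each family (the learning rate eta
is an illustrative choice): the additive update corresponds to the
squared Euclidean distance as Bregman divergence, the multiplicative
one to the relative entropy.

    import numpy as np

    def gd_update(w, grad, eta=0.1):
        # Gradient descent family: additive update, motivated by the
        # squared Euclidean distance as Bregman divergence.
        return w - eta * grad

    def eg_update(w, grad, eta=0.1):
        # Exponentiated gradient family: multiplicative update, motivated
        # by the relative entropy; w stays a probability vector.
        w = w * np.exp(-eta * grad)
        return w / w.sum()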
[top]
Reinforcement Learning with Many Parameters
Chris Watkins
Royal Holloway, University of London
In my talk, I described a (well-known) reinforcement learning
algorithm, the "Relative Payoff Process", which can be motivated both
as a simple and biologically feasible learning method, and as a simple
model of evolution. The performance of the RPP was compared to that of
genetic algorithms, and, surprisingly, it emerged that genetic
algorithms have some desirable properties: the expected fitness of the
next generation bred from a selected population is independent of the
population size; the expected fitness of the next generation is
concentrated about its expected value; and there is a better bound on
the improvement in expected fitness in one generation for a genetic
algorithm than for the reinforcement learning algorithm. Hence the
comparison of GAs with a reinforcement learning approach to a similar
problem revealed that GAs have some advantages.
[top]
Constructive Model Building
Chris Williams
Institute for Adaptive and Neural Computation
Division of Informatics, University of Edinburgh
Much work in statistical modelling consists of
fitting the parameters of a given model to data, and can be carried
out e.g. in a maximum likelihood or Bayesian setting. However, there
is also the question of model structure choice, for example the number
of components in a mixture model or a search over belief network
structures.
In this talk I will give an overview of different
methods that have been used in constructive model building approaches,
where the structure of the model is built up depending on the data. I
have identified three main approaches: (1) constructive learning by
repeated re-representation, (2) constructive learning by data merging,
and (3) constructive learning as (greedy) search. Examples of models from
these categories will be given.
[top]
SVM and VC Theory (Statistical Learning Theory)
Robert Williamson
Australian National University
I presented a high-level view of the core insights of statistical
learning theory. Taking the minimization of expected risk as the goal
of learning, I considered three main induction principles for
abstract learning algorithms: ERM (Empirical Risk Minimization), SRM
(Structural Risk Minimization), and DSRM (Data-Dependent SRM). I explained
why covering numbers are the "right" quantity to consider in
analysing ERM. I explained the difficulties of DSRM and indicated how
the so-called luckiness framework is one way to reason rigorously
about DSRM. I also pointed out relationships with the work of Jack
Kieffer, Kylie Minogue and Britney Spears.