http://www.dagstuhl.de/04161

### 12. – 16. April 2004, Dagstuhl Seminar 04161

# Detecting Local Patterns

## Organisatoren

Jean-François Boulicaut (INSA Lyon, FR)

Katharina Morik (TU Dortmund, DE)

Arno Siebes (Utrecht University, NL)

## Auskunft zu diesem Dagstuhl Seminar erteilt

## Dokumente

Dagstuhl Seminar Proceedings

Teilnehmerliste

Dagstuhl's Impact: Dokumente verfügbar

Externe Homepage

## Introduction

The dramatic increase in available computer storage capacity over the last 10 years has led to the creation of very large databases of scientific and commercial information. The need to analyse these masses of data has led to the evolution of the new field *Knowledge Discovery in Databases (KDD)* at the intersection of machine learning, statistics and database technology (Fayyad et al, 1996). Being interdisciplinary by nature, the field offers the opportunity to combine the expertise of different fields to a common objective. Moreover, within each field diverse methods have been developed and justified with respect to different quality criteria. It has to investigated, in which way these methods can contribute to solving the problem of KDD.

Traditionally, KDD was seeking to find global models for the data that explain most of the instances of the database and describe the general structure of the data. Examples are statistical time series models, cluster models, logic programs with high coverage or classification models like decision trees or linear decision functions. In practice, though, the use of this models often is very limited, because global models tend to find only the obvious patterns in the data, that domain experts already are aware of (Guyon et al., 1996). What is really of interest to the users are the local patterns that deviate from the already known background knowledge. The new field of *local patterns* has been proposed by David Hand who organised a workshop in 2002.

The Dagstuhl Seminar on Local Pattern Detection brought together experts from Europe, Japan, and the United States -- 13 countries were represented. Moreover, the participants brought with them expertise in the following fields: Decision trees, Regression methods, Bayesian models, Kernel methods, Inductive Logic Programming, Deductive Databases, Constraint Propagation, Time Series Analysis, Query Optimisation, Outlier Detection, Frequent Set Mining, and Subgroup Detection. All talks were focused on the topic of local patterns in order to come to a clearer view of this new field.

## Novelty of Local Pattern Detection

Research has investigated global models for a long time in statistics and machine learning. The database community has inspected the storage and retrieval of very large datasets. When statistical methods encounter the extremely large amount of records and the high dimensionality of the stored observations, exploratory methods failed. Machine learning already scaled up to build-up global models, either in the form of complete decision functions or in the form of learning all valid rules from databases. However, the classification does not deliver new, surprising insights in the data, and the valid rules reflect almost exactly the domain knowledge of the database designers. In contrast, what users expect from the exploratory analysis of databases are new insights into their data. Hence, the matter of interestingness has become a key issue.

The success of Apriori or subsequently frequent set mining can be explained by it being the first step into the direction of local patterns. The correlation of more than the few features, which standard statistics could analyse, could successfully be determined by frequent set mining. Frequent set mining already outputs local patterns. Current research tasks within this set of methods are algorithmic concerns as well as the issues of interestingness measures and redundancy prevention. The collaboration of database specialists and data miners has led to the notion of inductive databases. The new approach writes measures of interest and the prevention of redundancy in terms of constraints. Also users can formulate their interests in terms of constraints. The constraints are pushed into the search process. This new approach was discussed at the seminar intensively and it was found a view covering diverse aspects of local patterns, namely their internal structure and the subjective part of interestingness as given by users. Talks on frequent set mining were presented by

- Francesco Bonchi and Fosca Giannotti showing the use of constraints within the search for local patterns,
- Rosa Meo presenting a language for inductive queries expressing constraints,
- Jean-Francois Boulicaut applying frequent set mining to gene expression data by exploiting Galois operators and mining bi-sets, which link situations and genes,
- Céline Rouveirol reporting on the combination of frequent sets found in gene expression and genome alteration data,
- Bart Goethals offering a new constraint on the patterns, namely that of s% of the database containing the minimal number of tiles, where each tile has the maximal number of 1s..

Subgroup discovery has had already some successes. Stefan Wrobel and Nada Lavrac reported on theory and applications of subgroup discovery, in particular focusing on interest measures and the significance of patterns found.

- Stefan Wrobel clearly indicated the problem of false discoveries and presented two approaches: the MIDOS algorithm, which finds subgroups according to the true deviation, and a sequential sampling algorithm, GSS, which makes subgroup discovery fast. He also tackled the redundancy problem by maximum entropy suppression effectively. Applications on spatial subgroup discovery concluded the talk.
- Nada Lavrac reported on successful applications of subgroup mining in medicine.
- Josef Fürnkranz presented a unifying view of diverse evaluation measures.

The statistical view was presented in five talks:

- Xiaohui Liu builds a noise model using supervised machine learning methods and after that local patterns are detected. Testing them against the noise model yields clean data. The approach was illustrated with two biomedical applications.
- Niall Adams and David Hand distinguish two stages in pattern discovery
- Identify potential patterns (given a suitable definition)
- Of these, identify significant (in some sense) patterns (expert or automatic)

They noticed that the former is primarily algorithmic and the latter has the potential to be statistical. They illustrated this with an application on discovering cheating students.

- Claus Weihs focused on the transformation of local patterns into global models illustrated with the transcription of vocal time series into sheet music
- Frank Höppner discussed the similarities and differences between clustering and pattern discovery. In particular he showed how interesting patterns can be found by the clever usage of a hierarchical clustering algorithm.
- Stefan Rüping introduced a general framework in which local patterns being produced by different processes are identified using a hidden variable. This allows for the use of the EM algorithm to discover the local patterns directly, that is, without reference to the global data distribution. A new scaling algorithm handles the combination of classifiers. The method was illustrated with business cycle data.

Local patterns need to have an internal structure in order to be meaningful, interesting, and distinguished from noise. Several talks focused on internal structures:

- Arno Siebes employed a graphical view on data and patterns to express this internal structure. Moreover, aggregate functions along paths in these graphs were used to compute new features.
- Katharina Morik discussed the importance of the example representation LE, because it determines the applicability of methods. For local pattern detection frequency features are best suited. She showed how to characterize time-stamped data using a frequency model.
- Helena Ahonen-Myka gave an overview of sequence discovery with a focus on applications on text.
- Myra Spiliopoulou gave an overview of local patterns exhibiting temporal structures.
- Thorsten Joachims investigated internal structures such as parse-trees and co-reference pairing. He presented a general method how such structures can be analyzed by SVMs. Moreover he showed how the combinatorial explosion of the number of constraints can be controlled by the upper bounds derived from statistical learning theory.

## Seminar Results

Based on the definition of David Hand (2002)

data = background model + local patterns + random

seminar participants came up with 12 definitions, what local patterns actually are. These were intensively discussed and we finally agreed on the following.

- Local Patterns cover small parts of the data space. If the learning result is considered a function, then global models are a complete function, whereas local patterns are partial.
- Local Patterns deviate from the distribution of the population of which they are part. This can be done iteratively - a local pattern can be considered the overall population and deviating parts of it are then determined.
- Local Patterns show some internal structure. For example, correlations of features, temporal or spatial ordering attributes, and sequences tie together instances of a local pattern.

Local patterns pose very difficult mining tasks:

- Interestingness measures (differ from standard criteria for global models)
- Deviation from background knowledge (global model) asks for good estimates of the global mode, where local patterns deviate from the overall distribution
- Modeling noise (for data cleaning, distinguish from local patterns)
- Automatic feature generation and selection for local patterns (for local patterns other features are successful than for global models, standard feature selection does not work)
- Internal structures of the patterns (correlation of several features, graph, sequence, spatial closeness, shape) can be expressed in several ways, e.g., TCat, constraints.
- Test theory for an extremely large space of possible hypotheses (large sets are less likely, hence global models do not encounter this problem)
- Curse of exponentiality - complexity
- Redundancy of learned patterns
- Sampling for local patterns speeds up mining and enhances quality of patterns
- Evaluation: benchmark missing
- Algorithm issues

The speakers at the workshop have been invited to submit a chapter for edited post-workshop proceedings.

## Related Dagstuhl Seminar

- 07181: "Parallel Universes and Local Patterns " (2007)