Dagstuhl Seminar 11171: Challenges in Document Mining

(Apr 25 – Apr 29, 2011)

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/11171

Organizers

Hamish Cunningham (University of Sheffield, GB)
Oren Etzioni (University of Washington - Seattle, US)
Norbert Fuhr (Universität Duisburg-Essen, DE)
Benno Stein (Bauhaus-Universität Weimar, DE)

Contact

Simone Schilke (for administrative matters)

Publications

Challenges in Document Mining (Dagstuhl Seminar 11171). Hamish Cunningham, Norbert Fuhr, and Benno M. Stein. In Dagstuhl Reports, Volume 1, Issue 4, pp. 65-99, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2011)

Summary

Document mining is the process of deriving high-quality information from large collections of documents like news feeds, databases, or the Web. Document mining tasks include cluster analysis, classification, generation of taxonomies, information extraction, trend identification, sentiment analysis, and the like. Although some of these tasks have a long research history, it is clear that the potential of document mining is still to be fully realised.

Part of the problem is that relevant document mining techniques are often applied in isolation, addressing, from a user perspective, only part of a task. For example, an intelligent cluster analysis requires adequate document models (from information retrieval) that are combined with sensible merging algorithms (from unsupervised learning) and complemented by intuitive labelling (from information extraction and natural language processing).

The deficit that we observe may also be understood as a lack of application and user orientation in research. For example, given a result set clustering task, users expect:

as many clusters as they identify topics in the result set,
that the documents within each cluster are semantically similar to each other, and
that each cluster is labeled intuitively.

To achieve such a satisfying solution, state-of-the-art concepts and algorithms from information retrieval, unsupervised learning, information extraction, and natural language processing have to be combined in a user-focussed manner.
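The combination of document model, merging algorithm, and labelling described above can be illustrated with a minimal sketch. It is not a method from the seminar: tf-idf weighting, average-link merging, and top-term labels are swapped-in textbook choices, and the function names (`tfidf_vectors`, `cosine`, `cluster`, `label`) and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Document model: one sparse tf-idf vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, so terms occurring in every document keep a small weight.
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster(vectors, threshold):
    """Merging algorithm: agglomerative, stops when no pair reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, a, b = max(
            (average_link(clusters[a], clusters[b], vectors), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_sim < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def label(cluster_ids, vectors, k=2):
    """Labelling: the k terms with the highest summed tf-idf weight in the cluster."""
    totals = Counter()
    for i in cluster_ids:
        totals.update(vectors[i])
    return sorted(totals, key=lambda t: (-totals[t], t))[:k]
```

Because the number of clusters follows from the similarity threshold rather than being fixed in advance, the sketch loosely mirrors the first user expectation listed above. On four toy documents such as "apple banana fruit", "banana fruit salad", "python code compiler", and "python compiler bug" with `threshold=0.2`, it groups the two fruit documents and the two programming documents. Whether the resulting top-term labels are intuitive for users is exactly the kind of question such a purely term-based label cannot answer.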

Goals of the Seminar

The general idea was to take stock of the state of the art in document mining research and to define a research agenda for further work. Since document mining tasks are not tackled by a single technology, we wanted to bring a sample of the leading teams together and look at the area from a multidisciplinary point of view. In particular, the seminar was to focus on the following questions:

What are the relevant document mining tasks? The expectations of and the potential for document mining have changed significantly over time. Particularly influential is the recognition that the enormous contributions of users to the Web, among others in the form of blogs, comments, and reviews, are a highly valuable information source.
What are the options and limitations of cluster analysis? A great deal of cluster analysis research has been devoted to merging principles and algorithms; today, and especially in document mining, the focus is on tailored document models, user integration, topic identification and cluster labelling, and on the combination with retrieval technology (e.g. as result set clustering). Non-topical classification tasks in particular have attracted interest in this connection, such as genre classification, sentiment analysis, or authorship grouping. Moreover, the theoretical foundations of cluster analysis performance in document mining, as well as commonly accepted optimality measures, remain open questions.
What are the document mining challenges from a machine learning perspective? A crucial constraint is the lack of sufficient amounts of labelled data. This situation will become even more unbalanced in the future, and current research, for instance on domain transfer learning and transductive learning, aims to develop technology that exploits the huge amount of unlabelled data to improve supervised classification.
How will NLP and IE affect the development of the field? The use of NLP and IE is a success factor of increasing importance for document mining. NLP contributes technology for document modelling, style quantification, document segmentation, topic identification, and various information extraction and semantic annotation tasks. In this regard, authorship and writing style modelling is still coming of age; this area forms the heart of high-level document mining tasks such as plagiarism analysis, authorship attribution, and information quality assessment.
Are new interaction paradigms on the rise? Interface design and visualization are very important for effective user access to the output of the document mining process. Moreover, interactive document mining approaches such as scatter-gather clustering pose new challenges for both the interface and the backend.
How to evaluate and compare the different research efforts? Evaluation is essential for developing any kind of data mining method. So far, mainly system-oriented evaluation approaches have been used, in which the data mining output is compared to some "gold standard". There is a lack of user-oriented evaluations (e.g. observing users browsing a cluster hierarchy) that also take into account the tasks the users want to perform, e.g. using Borlund's concept of simulated work tasks.
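As an illustration of the system-oriented evaluation mentioned in the last question, the sketch below compares a clustering against a gold standard using purity. Purity is one common textbook measure chosen here for brevity; the seminar text does not name a specific measure, and the function name `purity` and its argument layout are assumptions.

```python
from collections import Counter


def purity(clusters, gold_labels):
    """Fraction of documents that carry the majority gold label of their cluster.

    clusters:    list of clusters, each a list of document indices
    gold_labels: gold_labels[i] is the gold-standard class of document i
    """
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    majority = sum(
        Counter(gold_labels[i] for i in c).most_common(1)[0][1]
        for c in clusters
        if c  # skip empty clusters
    )
    return majority / total
```

For example, `purity([[0, 1, 2], [3, 4]], ["a", "a", "b", "b", "b"])` counts two majority documents in each cluster and returns 0.8. Purity reaches 1.0 trivially when every document forms its own cluster, which is one reason a single system-oriented number says little about whether users can actually complete their tasks with the result.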

Participants

Leif Azzopardi (University of Glasgow, GB)
Ted Briscoe (University of Cambridge, GB)
Steven Burrows (Bauhaus-Universität Weimar, DE)
John A. Carroll (University of Sussex - Brighton, GB)
Massimiliano Ciaramita (Google Switzerland, CH)
Hamish Cunningham (University of Sheffield, GB)
Arjen P. de Vries (CWI - Amsterdam, NL)
Norbert Fuhr (Universität Duisburg-Essen, DE)
Tim Gollub (Bauhaus-Universität Weimar, DE)
Thomas Gottron (Universität Koblenz-Landau, DE)
Michael Granitzer (Know-Center Graz, AT)
Andreas Henrich (Universität Bamberg, DE)
Gerhard Heyer (Universität Leipzig, DE)
Dennis Hoppe (Bauhaus-Universität Weimar, DE)
Melikka Khosh Niat (Universität Duisburg-Essen, DE)
Marc Lechtenfeld (Universität Duisburg-Essen, DE)
Alexander Löser (TU Berlin, DE)
Peter Prettenhofer (TU Graz, AT)
Andreas Rauber (TU Wien, AT)
Harald Reiterer (Universität Konstanz, DE)
Stefan M. Rüger (The Open University - Milton Keynes, GB)
Hinrich Schütze (Universität Stuttgart, DE)
Wolf Siberski (Leibniz Universität Hannover, DE)
Benno Stein (Bauhaus-Universität Weimar, DE)

Classification

Information Extraction / Information Retrieval
Data Mining / Natural Language Processing

Keywords

Cluster analysis
HCI
retrieval models
social mining and search
