
Dagstuhl Seminar 13441

Evaluation Methodologies in Information Retrieval

( Oct 27 – Nov 01, 2013 )


Evaluation of information retrieval (IR) systems has a long tradition. However, the test-collection based evaluation paradigm is of limited value for assessing today's IR applications, since it fails to address major aspects of the IR process. Thus there is a need for new evaluation methodologies, which are able to deal with the following issues:

  • In interactive IR, users have a wide variety of interaction possibilities. The classical paradigm considers only the document ranking for a single query. In contrast, functions such as search term completion, query term suggestion, faceted search, document clustering, and query-biased summaries also have a strong influence on the user's search experience and thus should be considered in an evaluation.
  • From a user's point of view, the performance of IR systems should be evaluated in terms of how well users are supported over whole search sessions. Typically, users initiate a session with a specific goal (e.g. acquiring crucial information for making a decision, learning about topics or events they are interested in, or just getting entertained). Thus, the overall quality of a system should be evaluated with respect to the user's goal. However, it is an open research issue how this can be achieved.
  • There is an increasing number of search applications (especially on mobile devices) which support specific tasks (e.g. finding the nearest restaurant, comparing prices for a specific product). Here, goal-oriented evaluation may be more straightforward. From an IR researcher's point of view, however, we would like to learn how much the underlying IR engine contributes to this quality, and how it can be improved.
  • Besides ad hoc retrieval, monitoring and filtering are important IR task types. Here, streams of short messages (e.g. tweets, chats) pose new challenges. It is an open question whether classical relevance-based evaluation is sufficient for a user-oriented evaluation.
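The classical paradigm that these points contrast with can be made concrete in a few lines: scoring a single ranked list for one query against binary relevance judgments from a test collection. The following sketch uses illustrative function names and toy data, not code from any seminar artifact:

```python
# Illustrative sketch of the classical test-collection paradigm: evaluate a
# single ranked list for one query against binary relevance judgments.
# Function names and data are invented for this example.

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Mean of precision at the ranks where relevant documents are retrieved."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d2", "d5"]   # system output for one query
relevant = {"d1", "d2", "d9"}              # assessor judgments (qrels)
print(precision_at_k(ranking, relevant, 5))   # 0.4
print(average_precision(ranking, relevant))   # (1/2 + 2/4) / 3 ≈ 0.333
```

Everything the measure sees is the ranking and the judgments; interaction features such as query suggestion or faceted search are invisible to it, which is precisely the limitation discussed above.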

In order to address these issues, there is a need for the development of appropriate methodologies such as:

  • Evaluation infrastructures provide test-beds along with evaluation methods, software, and databases for computing measures and for collecting and comparing results.
  • Test-beds for interactive IR evaluation are hardly reusable at the moment (with the exception of simulation approaches such as the TREC Session track). However, sharing data from user experiments might be an important step in this direction.
  • Living labs use operational systems as experimental platforms on which to conduct user-based experiments at scale. To be usable, we need sites attracting enough traffic, and an architecture that allows plugging in components from different research groups.
  • Frameworks for modeling system-user interactions with clear methodological implications are needed.

This seminar aims to

  • Increase understanding of the central problems in evaluating information retrieval
  • Foster cross-fertilization of ideas across the evaluation approaches of the different IR evaluation communities
  • Create new methodologies and approaches for solving existing problems
  • Enhance the validity and reliability of future evaluation experiments
  • Examine how, in the long run, pertinent IR system design elements can be extracted from the results of evaluation experiments.

To attain the goals of the seminar, each participant will be expected to identify one to five crucial issues in IR evaluation methodology. These perspectives will result in primarily theoretical presentations with empirical examples from current studies. Based on these contributions we will identify a selected set of methodological issues for further development in smaller working groups. The expected outcomes of the seminar will be the basis for one or more new evaluation frameworks and improved methodological solutions.


Evaluation of information retrieval (IR) systems has a long tradition. However, the test-collection based evaluation paradigm is of limited value for assessing today's IR applications, since it fails to address major aspects of the IR process. Thus there is a need for new evaluation approaches, which was the focus of this seminar.

Before the event, each participant was asked to identify one to five crucial issues in IR evaluation methodology. Pertti Vakkari presented a summary of this homework, pointing out that there are five major themes deemed relevant by the participants: 1) Evaluation frameworks, 2) Whole session evaluation and evaluation over sessions, 3) Evaluation criteria: from relevance to utility, 4) User modeling, and 5) Methodology and metrics.

Based on the evaluation model proposed by Saracevic & Covi [1], the seminar started with a series of introductory talks covering major areas of IR evaluation: Nick Belkin gave a survey of "Framework(s) for Evaluation (of whole-session) IR", addressing the system components to be evaluated and the context to be considered. In his presentation "Modeling User Behavior for Information Retrieval Evaluation", Charlie Clarke described efforts to improve system-oriented evaluation through explicit models of user behavior. Kalervo Järvelin talked about "Criteria in User-oriented Information Retrieval Evaluation", characterizing them as different types of experimental variables and distinguishing between output- and (task-)outcome-related criteria. "Evaluation Measures in Information Retrieval" by Norbert Fuhr outlined the steps necessary for defining a new metric and its underlying assumptions, calling for empirical foundation and theoretical soundness. Diane Kelly presented problematic issues related to "Methodology in IR Evaluation", such as the relationship between observation variables and criteria, the design of questionnaires, the difference between explanatory and predictive research, and the appropriateness of statistical methods when dealing with big data. The round of introductory talks was concluded by Maristella Agosti's presentation "Future in Information Retrieval Evaluation", in which she summarized challenges identified in three recent workshops in this area.
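To make the anatomy of such a metric concrete, here is a minimal sketch of nDCG, a standard graded-relevance measure (not a metric proposed at the seminar): a gain per document, a rank-based discount, and a normalization against the ideal ordering. Normalizing over the retrieved list only, rather than over all judged documents, is a simplifying assumption of this sketch.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank r is discounted by log2(r + 1)."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the ideal (descending) reordering of the same gains.

    Simplification: a full ideal ranking would use all judged documents,
    not only those the system actually retrieved.
    """
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Graded judgments (0-3) for the documents of one ranked result list.
print(ndcg([3, 2, 3, 0, 1]))
```

Each design decision here (the gain scale, the discount function, the normalization) embodies an assumption about user behavior, which is exactly why the talk called for empirical foundation and theoretical soundness.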

For the rest of the week, the participants formed working groups, which are described in the following.

"From Searching to Learning" focused on learning as a search outcome and the need for systems supporting this process. Learning may occur at two different levels, namely the content level and the search-competence level. There is a need for understanding the learning process, its relationship to the searcher's work task, and the role of the system, and for the development of appropriate evaluation methods. Approaches may address different aspects of the problem, such as the system, the interaction, the content, the user, and the process. For evaluation, the framework of Ingwersen and Järvelin [2] suggests criteria and measures at the levels of information retrieval, information seeking, the work task, and the socio-organizational and cultural level.

"Social Media" allow users to create and share content, with a strong focus on personal connections. While web search engines are still the primary starting point for many information seeking activities, information access activities are shifting to more personalized services that take social data into account. This trend leads to new IR-related research issues, such as utility, privacy, the influence of diverse cultural backgrounds, data quality, authority, content ownership, and social recommendations. Traditional assumptions about information seeking will have to be revised, especially since social media may play a role in a broad range of information spaces, ranging from everyday life and popular culture to professional environments such as journalism and research literature.

"Graph Search and Beyond" starts from the observation that an increasing amount of information on the Web is structured in terms of entities and relationships, thus forming a graph, which, in turn, allows for answering more complex information needs. To handle these, search engines should support incremental structured query input and dynamic structured result-set exploration. Thus, in contrast to the classical search engine result page, graph search calls for an incremental query exploration page, where entries represent the answers themselves (in the form of entities, relationships, and sub-graphs). These new possibilities of querying and result presentation call for the development of adequate evaluation methods.

"Reliability and Validity" is considered the most central issue in IR evaluation, especially given the increasing discussion in the research community about the reproducibility and generalizability of experimental results. This working group therefore decided to start preparing a book on best practices in IR evaluation, which will cover the following aspects: basic definitions and concepts, reliability and validity in experimentation, reporting on experiments, failure analysis, the definition of new measures and methods, and guidelines for reviewing experimental papers.

"Domain-Specific Information Retrieval" in domains such as cultural heritage, patents, and medical collections is characterized not only by the specifics of the content, but also by the typical context(s) in which this information is accessed and used, which require specific functionalities going beyond simple search interaction. Context thus often plays an important role and should be considered by the information system. However, there is a lack of appropriate evaluation methods for considering contexts and new functions.

"Task-Based IR" typically refers to research focusing on the task or goal motivating a person to invoke an IR system, thus calling for systems that can recognize the nature of the task and support the accompanying search process. As task types, we can distinguish between motivating tasks, seeking tasks, and search tasks. Task-based IR approaches should be able to model people as well as the process, and to distinguish between the (task-related) outcome and the (system) output.

"Searching for Fun" refers to interaction with an information system without a specific search objective, such as online window shopping, viewing pictures or movies, or reading online. This type of activity requires different evaluation criteria, e.g. with regard to stopping behavior, dwell time, and novelty. It is also important to distinguish between system criteria and user criteria, where the latter may be subdivided into process criteria and outcome criteria. A major problem in this area is the design of user studies, especially since the starting points (e.g. casual or leisure needs) are difficult to create under experimental conditions. A number of further issues were also identified.

The working group "The Significance of Search, Support for Complex Tasks, and Searcher-aware Information Access Systems" addressed three loosely related challenges. The first topic is the definition of IR in the light of the dramatic changes of the last two decades, and the limited impact of our research. The second is the development of tools supporting more complex tasks, and their evaluation. Finally, information systems should become better informed about the searcher and the user's progress in their task.

"Interaction, Measures and Models" discussed the need for a common framework for user interaction models and associated evaluation measures, especially as a means for achieving a higher degree of reliability in interactive IR experiments. This would allow for evaluating the effect of the interaction and the interface on performance. A possible solution could consist of three components, namely an interaction model, a gain model and a cost model.
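As a hypothetical illustration of such a three-component framework, the sketch below combines a simple interaction model (the user scans down the ranking and continues with a fixed probability), a gain model (one unit of gain per relevant document read), and a time-based cost model. All names and parameter values are illustrative assumptions, not results from the seminar.

```python
# Hypothetical sketch of a three-component evaluation framework:
#   interaction model: user scans ranks top-down, continues with prob. p_continue
#   gain model:        one unit of gain per relevant document read
#   cost model:        fixed time per snippet judged, extra time per doc read
# All parameters are illustrative assumptions, not values from the seminar.

def expected_gain_and_cost(relevance, p_continue=0.8,
                           t_snippet=4.0, t_read=30.0):
    """Expected total gain and time cost over a ranked list of binary labels."""
    p_reach = 1.0          # probability the user examines the current rank
    gain = cost = 0.0
    for rel in relevance:
        cost += p_reach * t_snippet        # time spent judging the snippet
        if rel:
            gain += p_reach * 1.0          # gain from reading a relevant doc
            cost += p_reach * t_read       # time spent reading it
        p_reach *= p_continue              # user continues with prob. p_continue
    return gain, cost

g, c = expected_gain_and_cost([1, 0, 1, 1, 0])
print(g, c, g / c)   # expected gain, cost in seconds, gain per second
```

Under such a framework, the effect of an interface change can be expressed as a change in the model parameters (e.g. faster snippet judging lowers t_snippet), which is what makes the interaction and the interface evaluable at all.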

Finally, many of the attendees were planning to continue collaborating on the topics addressed during the seminar, since the fruitful discussions were a useful basis for future cooperation.


  1. Tefko Saracevic, Lisa Covi (2000). Challenges for digital library evaluation. In D. H. Kraft (Ed.), Knowledge Innovations: Celebrating Our Heritage, Designing Our Future. Proceedings of the 63rd Annual Meeting of the American Society for Information Science. Washington, D.C.: American Society for Information Science. pp. 341–350.
  2. Peter Ingwersen, Kalervo Järvelin (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Dordrecht, NL: Springer. ISBN 1-4020-3850-X.
Copyright Maristella Agosti, Norbert Fuhr, Elaine Toms, and Pertti Vakkari

  • Maristella Agosti (University of Padova, IT) [dblp]
  • Omar Alonso (Microsoft Corp. - Mountain View, US) [dblp]
  • Leif Azzopardi (University of Glasgow, GB) [dblp]
  • Nicholas J. Belkin (Rutgers University - New Brunswick, US) [dblp]
  • Ann Blandford (University College London, GB) [dblp]
  • Charles Clarke (University of Waterloo, CA) [dblp]
  • Maarten de Rijke (University of Amsterdam, NL) [dblp]
  • Arjen P. de Vries (CWI - Amsterdam, NL) [dblp]
  • Floriana Esposito (University of Bari, IT) [dblp]
  • Nicola Ferro (University of Padova, IT) [dblp]
  • Luanne Freund (University of British Columbia - Vancouver, CA) [dblp]
  • Norbert Fuhr (Universität Duisburg-Essen, DE) [dblp]
  • Jacek Gwizdka (University of Texas - Austin, US) [dblp]
  • Matthias Hagen (Bauhaus-Universität Weimar, DE) [dblp]
  • Preben Hansen (Stockholm University, SE) [dblp]
  • Jiyin He (CWI - Amsterdam, NL) [dblp]
  • Kalervo Järvelin (University of Tampere, FI) [dblp]
  • Hideo Joho (University of Tsukuba, JP) [dblp]
  • Jaap Kamps (University of Amsterdam, NL) [dblp]
  • Noriko Kando (National Institute of Informatics - Tokyo, JP) [dblp]
  • Evangelos Kanoulas (Google Switzerland, CH) [dblp]
  • Diane Kelly (University of North Carolina - Chapel Hill, US) [dblp]
  • Birger Larsen (Aalborg University Copenhagen, DK) [dblp]
  • Dirk Lewandowski (HAW - Hamburg, DE) [dblp]
  • Christina Lioma (University of Copenhagen, DK) [dblp]
  • Thomas Mandl (Universität Hildesheim, DE) [dblp]
  • Peter Mutschke (GESIS - Köln, DE) [dblp]
  • Ragnar Nordlie (Oslo University College, NO) [dblp]
  • Heather O'Brien (University of British Columbia - Vancouver, CA) [dblp]
  • Doug Oard (University of Maryland - College Park, US) [dblp]
  • Vivien Petras (HU Berlin, DE) [dblp]
  • Martin Potthast (Bauhaus-Universität Weimar, DE) [dblp]
  • Soo Young Rieh (University of Michigan - Ann Arbor, US) [dblp]
  • Gianmaria Silvello (University of Padova, IT) [dblp]
  • Paul Thomas (CSIRO - Canberra, AU) [dblp]
  • Elaine Toms (Sheffield University, GB) [dblp]
  • Tuan Vu Tran (Universität Duisburg-Essen, DE) [dblp]
  • Pertti Vakkari (University of Tampere, FI) [dblp]
  • C .J. Keith van Rijsbergen (University of Cambridge, GB) [dblp]
  • Robert Villa (University of Sheffield, GB) [dblp]
  • Max L. Wilson (University of Nottingham, GB) [dblp]
  • Christa Womser-Hacker (Universität Hildesheim, DE) [dblp]

  • data bases / information retrieval

  • evaluation design
  • evaluation analysis
  • test collections
  • lab experiments
  • living labs