October 27 – November 1, 2013, Dagstuhl Seminar 13441

Evaluation Methodologies in Information Retrieval


Maristella Agosti (University of Padova, IT)
Norbert Fuhr (Universität Duisburg-Essen, DE)
Elaine Toms (Sheffield University, GB)
Pertti Vakkari (University of Tampere, FI)


Dagstuhl Report, Volume 3, Issue 10


Evaluation of information retrieval (IR) systems has a long tradition. However, the test-collection based evaluation paradigm is of limited value for assessing today's IR applications, since it fails to address major aspects of the IR process. Thus there is a need for new evaluation approaches, which was the focus of this seminar.

Before the event, each participant was asked to identify one to five crucial issues in IR evaluation methodology. Pertti Vakkari presented a summary of this homework, pointing out that there are five major themes deemed relevant by the participants: 1) Evaluation frameworks, 2) Whole session evaluation and evaluation over sessions, 3) Evaluation criteria: from relevance to utility, 4) User modeling, and 5) Methodology and metrics.

Based on the evaluation model proposed in Saracevic & Covi [1], the seminar started with six introductory talks covering major areas of IR evaluation. Nick Belkin gave a survey of "Framework(s) for Evaluation (of whole-session) IR", addressing the system components to be evaluated and the context to be considered. In his presentation "Modeling User Behavior for Information Retrieval Evaluation", Charlie Clarke described efforts to improve system-oriented evaluation through explicit models of user behavior. Kal Järvelin talked about "Criteria in User-oriented Information Retrieval Evaluation", characterizing them as different types of experimental variables and distinguishing between output-related and (task-)outcome-related criteria. "Evaluation Measures in Information Retrieval" by Norbert Fuhr outlined the steps necessary for defining a new metric and the underlying assumptions, calling for an empirical foundation and theoretical soundness. Diane Kelly presented problematic issues related to "Methodology in IR Evaluation", such as the relationship between observation variables and criteria, the design of questionnaires, the difference between explanatory and predictive research, and the appropriateness of statistical methods when dealing with big data. The round of introductory talks concluded with Maristella Agosti's presentation "Future in Information Retrieval Evaluation", in which she summarized challenges identified in three recent workshops in this area.

For the rest of the week, the participants formed working groups, which are described below.

"From Searching to Learning" focused on learning as a search outcome and the need for systems supporting this process. Learning may occur at two different levels, namely the content level and the search competence level. There is a need for an understanding of the learning process, its relationship to the searcher's work task, the role of the system, and the development of appropriate evaluation methods. Approaches may address different aspects of the problem, such as the system, the interaction, the content, the user and the process. For evaluation, the framework from Ingwersen and Järvelin [2] suggests criteria and measures at the levels of information retrieval, information seeking, the work task, and the social-organizational and cultural level.

"Social Media" allow users to create and share content, with a strong focus on personal connections. While web search engines are still the primary starting point for many information seeking activities, information access activities are shifting to more personalized services that take social data into account. This trend leads to new IR-related research issues, such as utility, privacy, the influence of diverse cultural backgrounds, data quality, authority, content ownership, and social recommendations. Traditional assumptions about information seeking will have to be revised, especially since social media may play a role in a broad range of information spaces, ranging from everyday life and popular culture to professional environments like journalism and research literature.

"Graph Search and Beyond" starts from the observation that an increasing amount of information on the Web is structured in terms of entities and relationships, thus forming a graph, which, in turn, allows for answering more complex information needs. For handling these, search engines should support incremental structured query input and dynamic structured result set exploration. Thus, in contrast to the classical search engine result page, graph search calls for an incremental query exploration page, where entries represent the answers themselves (in the form of entities, relationships and sub-graphs). The new possibilities of querying and result presentation call for the development of adequate evaluation methods.

"Reliability and Validity" is considered the most central issue in IR evaluation, especially in the current situation where there is increasing discussion in the research community about the reproducibility and generalizability of experimental results. Thus, this working group decided to start the preparation of a book on best practices in IR evaluation, which will cover the following aspects: basic definitions and concepts, reliability and validity in experimentation, reporting of experiments, failure analysis, definition of new measures and methods, and guidelines for reviewing experimental papers.

"Domain Specific Information Retrieval" in domains such as cultural heritage, patents and medical collections is characterized not only by the specifics of the content, but also by the typical context(s) in which this information is accessed and used, which requires specific functionalities that go beyond the simple search interaction. Also, context often plays an important role, and thus should be considered by the information system. However, there is a lack of appropriate evaluation methods for considering contexts and new functions.

"Task-Based IR" typically refers to research focusing on the task or goal motivating a person to invoke an IR system, thus calling for systems that are able to recognize the nature of the task and to support the accompanying search process. As task types, we can distinguish between motivating tasks, seeking tasks, and search tasks. Task-based IR approaches should be able to model people as well as the process, and to distinguish between the (task-related) outcome and the (system) output.

"Searching for Fun" refers to the interaction with an information system without a specific search objective, such as online window shopping, watching pictures or movies, or reading online. This type of activity requires different evaluation criteria, e.g. with regard to stopping behavior, dwell time and novelty. Also, it is important to distinguish between system criteria and user criteria, where the latter may be subdivided into process criteria and outcome criteria. A major problem in this area is the design of user studies, especially since the starting points (e.g. casual or leisure needs) are difficult to create under experimental conditions. A number of further issues were also identified.

The working group "The Significance of Search, Support for Complex Tasks, and Searcher-aware Information Access Systems" addressed three loosely related challenges. The first topic addresses the definition of IR in the light of the dramatic changes during the last two decades, and the limited impact of our research. The second topic is the development of tools supporting more complex tasks, and their evaluation. Finally, information systems should become more informed about the searcher and the progress in the user's task.

"Interaction, Measures and Models" discussed the need for a common framework for user interaction models and associated evaluation measures, especially as a means for achieving a higher degree of reliability in interactive IR experiments. This would allow for evaluating the effect of the interaction and the interface on performance. A possible solution could consist of three components, namely an interaction model, a gain model and a cost model.
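To make the three-component idea more concrete, the following is a minimal sketch of how an interaction model, a gain model and a cost model might be combined into a single measure. It is an illustration only and not the working group's proposal: the class names, the fixed continuation probability, the grade-to-gain mapping and the constant per-result cost are all assumptions chosen for simplicity.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class InteractionModel:
    """Assumed browsing model: after each result the user continues
    to the next one with a fixed probability."""
    continue_prob: float

    def view_probabilities(self, n_results: int) -> List[float]:
        # Probability that the user reaches rank i (0-based).
        return [self.continue_prob ** i for i in range(n_results)]


@dataclass(frozen=True)
class GainModel:
    """Assumed gain model: maps a graded relevance label to a gain."""
    gain_per_grade: Mapping[int, float]

    def gain(self, grade: int) -> float:
        # Unknown grades contribute no gain.
        return self.gain_per_grade.get(grade, 0.0)


@dataclass(frozen=True)
class CostModel:
    """Assumed cost model: every inspected result costs the same time."""
    seconds_per_result: float


def expected_gain_per_cost(
    grades: Sequence[int],
    interaction: InteractionModel,
    gain: GainModel,
    cost: CostModel,
) -> float:
    """Expected gain divided by expected cost for one ranked list."""
    probs = interaction.view_probabilities(len(grades))
    expected_gain = sum(p * gain.gain(g) for p, g in zip(probs, grades))
    expected_cost = sum(p * cost.seconds_per_result for p in probs)
    return expected_gain / expected_cost if expected_cost > 0 else 0.0


if __name__ == "__main__":
    score = expected_gain_per_cost(
        grades=[2, 0, 1],  # hypothetical relevance grades by rank
        interaction=InteractionModel(continue_prob=0.5),
        gain=GainModel(gain_per_grade={0: 0.0, 1: 1.0, 2: 3.0}),
        cost=CostModel(seconds_per_result=10.0),
    )
    print(f"expected gain per second: {score:.4f}")
```

Keeping the three models as separate objects means that, in such a sketch, one component (for example the interface-dependent cost) can be varied while the others are held fixed, which is one way to examine the effect of the interaction and the interface on performance.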

Finally, many of the attendees planned to continue collaborating on the topics addressed during the seminar, since the fruitful discussions provided a useful basis for future cooperation.


  1. Tefko Saracevic, Lisa Covi (2000). Challenges for digital library evaluation. In D. H. Kraft (Ed.), Knowledge Innovations: Celebrating Our Heritage, Designing Our Future. Proceedings of the 63rd Annual Meeting of the American Society for Information Science. Washington, D.C.: American Society for Information Science. pp. 341–350.
  2. Peter Ingwersen, Kalervo Järvelin (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Dordrecht, NL: Springer. ISBN 1-4020-3850-X
Summary text license
  Creative Commons BY 3.0 Unported license
  Maristella Agosti, Norbert Fuhr, Elaine Toms, and Pertti Vakkari


  • Data Bases / Information Retrieval


  • Evaluation design
  • Evaluation analysis
  • Test collections
  • Lab experiments
  • Living labs

