Dagstuhl Perspectives Workshop 27042
Experimental Evidence for Generative and Agentic AI in Information Access
(Jan 24 – Jan 29, 2027)
Organizers
- Nicola Ferro (University of Padova, IT)
- Norbert Fuhr (Universität Duisburg-Essen, DE)
- Iryna Gurevych (TU Darmstadt, DE)
- Damiano Spina (RMIT University - Melbourne, AU)
Contact
- Marsha Kleinbauer (for scientific matters)
- Christina Schwarz (for administrative matters)
Generative AI (GenAI) has been revolutionizing our society, and agentic AI is expected to do so even more. In particular, in the information access area, i.e. systems such as search engines, recommender systems, and various forms of chatbots, GenAI and agentic AI are expected to reshape the field fundamentally. We therefore focus on the need for evaluation and experimental evidence for such information access systems: appropriate methods are still missing, their development is very fragmented, and today's experiments provide little more than anecdotal evidence. For example, most evaluation efforts currently focus on Retrieval-Augmented Generation (RAG), and they still fail to yield a comprehensive methodology. When it comes to agentic AI, the very notion of evaluation is still ill-defined, boiling down at best to separate and uncoordinated evaluation of the constituent components.
The goal of this Dagstuhl Perspectives Workshop is to question evaluation from its fundamentals, e.g. the Cranfield paradigm or dataset-based benchmarks, in order to understand whether a unifying framework for generative and agentic AI evaluation can be achieved simply by adjusting current approaches or whether a complete paradigm shift is needed. Moreover, user aspects should be integrated into such an evaluation framework, as both the user base and the tasks performed by these systems are becoming increasingly diverse. In particular, we aim to explore some of the following issues:
- Is extending Cranfield enough to evaluate generative and agentic AI or do we need a paradigm shift, moving away from static benchmarks and syntactical checking of "answer correctness"?
- What are the current criteria for experimental evidence in information access, and which of them still apply to generative and agentic AI in information access?
- What are possible new evidence criteria for generative and agentic AI in information access?
- Can we develop a taxonomy of experimental evidence, similar to the levels of evidence in medicine?
- What governance and reporting standards (e.g., model cards, datasheets, evaluation protocols) are needed to make the evidence interpretable by non-technical stakeholders?
- Which additional measures can strengthen evidence, e.g. preregistered studies, results-blind reviewing, and more?
- How do we bring systematic and factorial experimentation back for generative and agentic AI, overcoming scale issues and black-box models?
- How do we enable scientifically solid, valid, and generalizable statements when faced with a lack of transparency and black-box models?
- How do we set reference models, in order to allow for systematic comparability and tracking of progress, in a continuously evolving landscape?
- How do we ensure reproducibility and replicability of experiments when confronted with non-determinism, scale issues, and lack of transparency?
- How do we evolve from a notion of static information needs to dynamic ones, where queries become queries+context? How do we account for users' tasks, personalization (expertise, language, …), sessions, and conversations?
- How do we measure transparency, explainability, and auditability in GenAI and agentic AI systems?
- How do we account for generated text/media instead of static documents? How does this impact the veracity of answers (hallucination) and traceability? And, especially, what becomes of the notion of re-using a test collection?
- How can the concepts, methodologies, and evidence standards developed in information access for GenAI and agentic AI be transferred, adapted, and validated in other domains?
Nicola Ferro, Norbert Fuhr, Iryna Gurevych, and Damiano Spina
Classification
- Artificial Intelligence
- Computation and Language
- Information Retrieval
Keywords
- Generative Artificial Intelligence
- Information Retrieval
- Recommender Systems
- Evaluation
- User Interaction

Creative Commons BY 4.0
