Dagstuhl Perspectives Workshop 27042: Experimental Evidence for Generative and Agentic AI in Information Access

( Jan 24 – Jan 29, 2027 )

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/27042

Organizers

Nicola Ferro (University of Padova, IT)
Norbert Fuhr (Universität Duisburg-Essen, DE)
Iryna Gurevych (TU Darmstadt, DE)
Damiano Spina (RMIT University - Melbourne, AU)

Contact

Marsha Kleinbauer (for scientific matters)
Christina Schwarz (for administrative matters)

Motivation

Generative AI (GenAI) has been revolutionizing our society, and agentic AI is expected to do so even more. In particular, GenAI and agentic AI are expected to reshape the field of information access, i.e. systems such as search engines, recommender systems, and various forms of chatbots. We therefore wish to focus on the need for evaluation and experimental evidence of such systems for information access: appropriate methods are still missing, their development is still very fragmented, and today's experiments provide little more than anecdotal evidence. For example, most evaluation efforts currently focus on Retrieval-Augmented Generation (RAG), yet they have not produced a comprehensive methodology. When it comes to agentic AI, the very notion of evaluation is still ill-defined, boiling down at best to separate and uncoordinated evaluation of the constituent components.

The goal of this Dagstuhl Perspectives Workshop is to question evaluation from its fundamentals, e.g. the Cranfield paradigm or dataset-based benchmarks, in order to understand whether a unifying framework for generative and agentic AI evaluation can be achieved simply by adjusting current approaches or whether a complete paradigm shift is needed. Moreover, user aspects should be integrated into such an evaluation framework, as both the user base and the tasks performed by these systems are becoming increasingly diverse. In particular, we aim to explore some of the following issues:

Is extending Cranfield enough to evaluate generative and agentic AI or do we need a paradigm shift, moving away from static benchmarks and syntactical checking of "answer correctness"?
What are the current criteria for experimental evidence in information access and which of them are still applicable for generative and agentic AI in information access?
What are possible new evidence criteria for generative and agentic AI in information access?
Can we develop a taxonomy of experimental evidence, similar to the levels of evidence in medicine?
What governance and reporting standards (e.g., model cards, datasheets, evaluation protocols) are needed to make the evidence interpretable by non-technical stakeholders?
Which additional measures, such as preregistered studies or results-blind reviewing, can strengthen the evidence?
How do we bring systematic and factorial experimentation back for generative and agentic AI, overcoming scale issues and black-box models?
How do we enable scientifically solid, valid, and generalizable statements when faced with a lack of transparency and black-box models?
How do we set reference models, in order to allow for systematic comparability and tracking of progress, in a continuously evolving landscape?
How do we ensure reproducibility and replicability of experiments when confronted with non-determinism, scale issues, and lack of transparency?
How do we evolve from a notion of static information needs to dynamic ones, where queries become queries+context? How do we account for users' tasks, personalization (expertise, language, …), sessions, and conversations?
How do we measure transparency, explainability, and auditability in GenAI and agentic AI systems?
How do we account for generated text/media instead of static documents? How does this impact the veracity of answers (hallucination) and traceability? And, especially, what becomes of the notion of reusing a test collection?
How can the concepts, methodologies, and evidence standards developed in information access for GenAI and agentic AI be transferred, adapted, and validated in other domains?

Creative Commons BY 4.0

Nicola Ferro, Norbert Fuhr, Iryna Gurevych, and Damiano Spina

Classification

Artificial Intelligence
Computation and Language
Information Retrieval

Keywords

Generative Artificial Intelligence
Information Retrieval
Recommender Systems
Evaluation
User Interaction
