Dagstuhl Perspectives Workshop 24352
Conversational Agents: A Framework for Evaluation (CAFE)
(Aug 25 – Aug 30, 2024)
Organizers
- Christine Bauer (Paris Lodron Universität Salzburg, AT)
- Li Chen (Hong Kong Baptist University, HK)
- Nicola Ferro (University of Padova, IT)
- Norbert Fuhr (Universität Duisburg-Essen, DE)
Contact
- Marsha Kleinbauer (for scientific matters)
- Jutka Gasiorowski (for administrative matters)
Conversational Agents (CA) as frontends to Information Retrieval (IR) and Recommender Systems (RS) are becoming more popular in everyday life, with a wider range of users and usages. The latest developments in Large Language Models (LLMs) will have tremendous consequences, especially for the workplace and education. In this Dagstuhl Perspectives Workshop, we want to focus on the evaluation of these conversational systems, as appropriate methods are still missing. The quality of these systems is limited in terms of personalization, veracity and correctness, bias, transparency, trustworthiness, and understandability. Thus, evaluation methods must address these shortcomings. Furthermore, user- and usage-oriented aspects should become a more prominent and integral component of evaluations, as both the user population and the tasks these systems are used for become more heterogeneous. For this reason, the topic-centric view of relevance has to be extended to a broad range of facets that are important for the different usage scenarios. Therefore, suitable evaluation criteria have to be specified, which form the basis for defining appropriate measures. Most importantly, the range of evaluation methods must be revisited and extended, as popular methods like the Cranfield approach or crowdsourcing must be complemented by new evaluation methods and strategies specifically tailored to this new type of system.
In more detail, we will focus our discussion on several key open issues, including the following:
- how to cross the borders of different areas, mainly Information Retrieval and Recommender Systems in our case, but also Natural Language Processing;
- how to create experimental collections and evaluate Large Language Models in terms of their bias, explainability, veracity, correctness, and hallucination in the CA context;
- how to incorporate user- and usage-oriented facets in order to understand how users’ perceived conversation qualities (e.g., attentiveness, adaptability, understanding, and response quality) and perceived recommendation qualities (e.g., accuracy, novelty, interaction adequacy, and explanation) might interact with each other in a CA to affect user beliefs (e.g., perceived usefulness, perceived ease of use, transparency, user control, rapport, humanness), user attitudes (e.g., user satisfaction, trust), and behavioral intentions (e.g., intention to use);
- how to measure information leakage and privacy, and how to ensure that a CA does not propagate sensitive information;
- how to devise proper simulation approaches to support both the development and the evaluation of a CA, avoiding circularity (the techniques used for simulation are similar to those used for developing systems), ensuring reliability, and reducing the gap between offline measurements and online user evaluations;
- how to evaluate to what extent answers/recommendations produced by a CA are appropriate, tailored to, and understandable for a specific audience, e.g., school kids, the general public, professionals, and people with (cognitive) disabilities.
Overall, the above questions call for envisioning, as one possible output of the workshop, a reference architecture for CA systems that is geared towards evaluation. Such an architecture would allow the different areas to cooperate on common ground and to share a common roadmap for improving our understanding of CA systems and making them more effective.
Participants
- Avishek Anand (TU Delft, NL)
- Christine Bauer (Paris Lodron Universität Salzburg, AT)
- Timo Breuer (TH Köln, DE)
- Li Chen (Hong Kong Baptist University, HK)
- Guglielmo Faggioli (University of Padova, IT)
- Nicola Ferro (University of Padova, IT)
- Ophir Frieder (Georgetown University - Washington, DC, US)
- Norbert Fuhr (Universität Duisburg-Essen, DE)
- Hideo Joho (University of Tsukuba - Ibaraki, JP)
- Jussi Karlgren (Silo AI - Helsinki, FI)
- Johannes Kiesel (Bauhaus-Universität Weimar, DE)
- Bart Knijnenburg (Clemson University, US)
- Aldo Lipani (University College London, GB)
- Lien Michiels (University of Antwerp, BE)
- Andrea Papenmeier (University of Twente, NL)
- Sole Pera (TU Delft, NL)
- Mark Sanderson (RMIT University - Melbourne, AU)
- Scott Sanner (University of Toronto, CA)
- Benno Stein (Bauhaus-Universität Weimar, DE)
- Johanne Trippas (RMIT University - Melbourne, AU)
- Karin Verspoor (RMIT University - Melbourne, AU)
- Martijn Willemsen (Eindhoven University of Technology, NL & JADS, NL)
Classification
- Artificial Intelligence
- Human-Computer Interaction
- Information Retrieval
Keywords
- Conversational Agents
- Information Retrieval
- Recommender Systems
- Evaluation
- User Interaction