Conversational Agents (CA) as frontends to Information Retrieval (IR) and Recommender Systems (RS) become more popular in everyday life, with a wider range of users and usages. The latest developments in Large Language Models (LLMs) will have tremendous consequences, especially for the workplace and education. In this Dagstuhl Perspectives Workshop, we want to focus on the evaluation of these conversational systems, as appropriate methods are still missing. The quality of these systems is limited in terms of personalization, veracity and correctness, bias, transparency, trustworthiness, and understandability. Thus, evaluation methods must address these shortcomings. Furthermore, user- and usage-oriented aspects should become a more prominent and integral component in evaluations, as the user population as well as the tasks these systems are used for become more heterogeneous. For this reason, the topic-centric view of relevance has to be extended to a broad range of facets which are important for the different usage scenarios. Therefore, suitable evaluation criteria have to be specified, which form the basis for defining appropriate measures. Most importantly, the range of evaluation methods must be revisited and extended, as popular methods like the Cranfield approach or crowdsourcing must be complemented by new evaluation methods and strategies specifically tailored to this new type of system.
More in detail, we will focus our discussion on several key open issues, among which are the following:
- how to cross the borders of different areas, mainly Information Retrieval and Recommender Systems in our case, but also Natural Language Processing;
- how to create experimental collections and evaluate Large Language Models in terms of their bias, explainability, veracity, correctness, and hallucination in the CA context;
- how to incorporate user- and usage-oriented facets in order to understand how users’ perceived conversation qualities (e.g., attentiveness, adaptability, understanding, and response quality) and perceived recommendation qualities (e.g.,, accuracy, novelty, interaction adequacy, and explanation) might interact with each other in a CA to affect user beliefs (e.g., perceived usefulness, perceived ease of use, transparency, user control, rapport, humanness), user attitudes (e.g., user satisfaction, trust), and behavioral intentions (e.g., intention to use);
- how to measure information leakage and privacy, and how to ensure that a CA does not propagate sensitive information;
- how to devise proper simulation approaches to support both the development and the evaluation of a CA, avoiding circularity (the techniques used for simulation are similar to those used for developing systems), ensuring reliability, and reducing the gap between offline measurements and online user evaluations;
- how to evaluate to what extent answers/recommendations produced by a CA are appropriate, tailored to, and understandable for a specific audience, e.g., school kids, the general public, professionals, and people with (cognitive) disabilities.
Overall, all the above questions call, as one possible output of the workshop, for envisioning some reference architecture for CA systems, geared towards evaluation, which allows the different areas to cooperate on a common ground and to share a common roadmap for improving our understanding of CA systems and making them more effective.
- Artificial Intelligence
- Human-Computer Interaction
- Information Retrieval
- Conversational Agents
- Information Retrieval
- Recommender Systems
- User Interaction