Dagstuhl Perspectives Workshop 24492

Human in the Loop Learning through Grounded Interaction in Games

(Dec 01 – Dec 06, 2024)


Permalink
Please use the following short url to reference this page: https://www.dagstuhl.de/24492

Summary

Background and Motivation

Over the past few years, there has been a decisive move in Artificial Intelligence (AI) towards human-centered intelligence and towards AI models that can learn through interaction. An important reason for this shift has been the appearance of the latest generation of Large Language Models, such as InstructGPT, ChatGPT, Bard, or Llama 2 [13, 12, 17], which delivered a step increase in performance. A good part of the success of these models is due to the adoption of training regimes that combine supervised learning with learning from interaction with humans, such as Reinforcement Learning from Human Feedback [2, 13]. The most recent of these models, such as GPT-4, are no longer simply language models: they are trained on multimodal data and are capable of producing output in different modalities as well. However, these models still suffer from a number of widely discussed issues, such as hallucinations.
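To make the training signal concrete, the following is a minimal sketch, in Python/PyTorch, of the pairwise preference loss that underlies RLHF-style reward modelling [2, 13]. It is an illustration only, not code from any of the systems cited above; the preference_loss helper and the reward values are hypothetical stand-ins for the scores a reward model would assign to a human-preferred and a rejected response.

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward-model scores for three preference pairs.
reward_chosen = torch.tensor([1.2, 0.4, 0.9])     # human-preferred responses
reward_rejected = torch.tensor([0.3, 0.6, -0.1])  # rejected responses
print(float(preference_loss(reward_chosen, reward_rejected)))  # smaller when preferences are respected

The reward model trained with such a loss is then used as the feedback signal in a reinforcement learning step (e.g., PPO in [13]) that fine-tunes the language model.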

In parallel, there has also been substantial progress on grounded interaction – developing models aware of the situation in which they operate (a physical world in the case of robots, a virtual world in the case of artificial agents) and able to, e.g., understand / produce references to this situation [4, 7, 8, 1, 16] perhaps through negotiation [3]. However, the communication between the interactive learning and grounded interaction communities is still limited [10].

One domain considered particularly promising to study learning through grounded interaction with human agents is virtual world games: games in which conversational agents impersonating characters can learn to perform tasks, or improve their communicative ability, by interacting with human players in platforms such as Minecraft or Light [6, 19, 11, 15, 9, 24]. Games have been shown to be a promising platform for collecting data from thousands of players [20, 23]; virtual worlds approach the complexity of the real world; and virtual agents operating in such virtual worlds need to be able to develop a variety of interactional skills to be perceived as “real” [14].

This Dagstuhl Perspectives Workshop aimed, first of all, to bring together the communities working on the related areas of learning through interaction, (conversational) agents in games, dialogue and interaction, and collecting judgments from crowds through games, to make each community aware of the most recent developments in the other areas. We also intended to discuss current challenges, and whether advances in one area (e.g., grounded interaction) can benefit other areas (e.g., interactive learning).

Directions Identified and Discussed

The workshop involved extensive discussions between researchers working in all the fields that contribute to the research area. After in-depth presentations of:

  • The State of the Art (SOTA) in The Grounded and Communication Task Performance Abilities of (Embodied/Multimodal, Conversational) AI Agents (group lead: Hockenmaier),
  • The Games and Multimodal Platforms Useful for Conversational AI Agents (group lead: Bernardi), and
  • Current Approaches to Human (and Artificial Agent)-in-the-Loop Learning for AI Agents (group lead: Suglia),

and presentations by the participants of some of the most recent relevant research, we identified a few research directions particularly worth discussing in depth and formed working groups around them. These included:

  • Complex Interaction.
    A common assumption in many computational models is that dialogue consists of a linear sequence of turns in which two agents alternately exchange information. Each turn is assumed to depend only on the last turn of the other participant. However, human conversation requires more complex forms of interaction spanning multiple turns to solve real-world tasks.
    Complexity arises for several reasons. Dialogue involves multiple participants: they start from different information states, have different perspectives, and cannot see what is in each other's minds; they have a social relationship that they have to manage; and their interaction happens in real time, across multiple modalities, in the presence not only of various kinds of noise but also of fundamental asymmetries in what the participants can perceive, know, and understand.
    To successfully overcome these asymmetries and solve tasks through such interactions, the interaction scheme needs to offer a number of functions (see below). Among humans, these are exemplified by a variety of phenomena that depend not just on sequential information exchange but on more complex structures, with richer models of the local and global interaction context. It is not clear to what degree current LLM-based models of dialogue can cope with them, and how much this limits their ability to collaborate efficiently with humans.
    The working group on Complex Interaction reviewed some of these complex interaction phenomena, gave pointers to the literature, and discussed ways in which future interactive systems might handle them.
  • Game Design for Grounded Interaction.
    Existing games and platforms used to evaluate and develop conversational agents are extremely diverse in their setting, goals, and complexities, and they are being developed in different subfields of AI, NLP, and computational linguistics [5]. Furthermore, within these subfields, games are designed for different purposes and, to some extent, classified using different taxonomies.
    The explosion and diversity of games raise new research questions for these communities that we suggest should be explored in future research:
    • Q1: How can games and game benchmarks be designed more systematically, such that they lead to a deeper understanding of games and the skills that games are testing? How do we generalize skills and agents' abilities across games?
    • Q2: What role does the complexity of the game have? And how do we measure it?
    • Q3: How do we evaluate agents within and across games? In particular, how do we evaluate whether the skills trained / tested with a game transfer to real world applications?
  • Perspectives for Language Learning from Human Interaction.
    Most of the recent AI breakthroughs have been in non-interactive settings: classical NLP tasks, mathematical reasoning, and so on, mainly because large evaluation datasets were readily available in those domains. This is now changing: several new types of more realistic benchmarks are emerging that require interacting with a given environment (WebArena [25], WebShop [22], OSWorld [21], AppWorld [18]), and learning paradigms that can deal with this interactivity need to be used.
    Games are a convenient tool for constructing scenarios that constructively approximate real ones. For example, the complexity of a game (what is observed vs. what is learned, the search space for ML) can be iteratively increased or decreased, so different learning methods can be studied in a more systematic and comparative fashion. Secondly, games, while not samples of real-world interaction, are close approximations of it: they are a good way to engage human interactors in providing behavioural information, and they create consistent environments in which data collection (and, if needed, data annotation) can be performed systematically (see the interaction-loop sketch after this list).
    The working group on Perspectives for Language Learning from Human Interaction produced a classification scheme of current ML approaches for learning from interaction, identifying a number of open questions, including:
    • Q4: How can agents learn to have/recognise intentions?
    • Q5: What are the tasks/games that can facilitate the acquisition of these skills?
  • Perceptual Grounding for Embodied Conversational Agents.
    This group built its discussion around the research hypothesis that interactivity plays a major role in human intelligence. Interactivity has multiple aspects, which we spelled out: 1) interacting with an environment (manipulating objects, acting on them), 2) interacting with others through language, and 3) interacting with others while acting in an environment. Through such multimodal embodied experiences, humans develop their cognitive intelligence (in other words, an understanding of the state of affairs in the world) and their social intelligence (an understanding of the mechanisms of interaction).
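As a companion to the points above on learning from interaction, and to the interactive benchmarks mentioned there (WebArena, WebShop, OSWorld, AppWorld), the following is a minimal sketch of the generic observe-act-feedback loop that such environments, games included, expose to a learning agent. The ToyEnv and EchoAgent classes are hypothetical illustrations, not the APIs of any of the cited benchmarks.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ToyEnv:
    # A deliberately tiny environment: the episode succeeds once the agent's
    # action mentions the hidden goal word.
    goal: str = "open"
    history: List[str] = field(default_factory=list)

    def reset(self) -> str:
        self.history = []
        return "You see a closed door."          # initial observation

    def step(self, action: str) -> Tuple[str, float, bool]:
        self.history.append(action)              # the full interaction history is kept
        done = self.goal in action
        reward = 1.0 if done else 0.0
        observation = "The door opens." if done else "Nothing happens."
        return observation, reward, done

class EchoAgent:
    # Placeholder policy; a real agent would condition on the whole history.
    def act(self, observation: str) -> str:
        return "open the door"

env, agent = ToyEnv(), EchoAgent()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(agent.act(obs))
    total += reward
print(f"episode return: {total}")

The point of the sketch is the shape of the loop: observations and rewards arrive only through interaction, so learning methods must handle exploration and credit assignment over multi-step episodes rather than a fixed supervised dataset.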

References

  1. Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. In 7th International Conference on Learning Representations, ICLR, 2019.
  2. Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
  3. Herbert H. Clark and Susan E. Brennan. Grounding in communication. In L. B. Resnick, J. Levine, and S. D. Behrend, editors, Perspectives on Socially Shared Cognition. APA, 1990.
  4. Nicholas FitzGerald, Yoav Artzi, and Luke Zettlemoyer. Learning distributions over logical forms for referring expression generation. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
  5. Roberto Gallotta, Graham Todd, Marvin Zammit, Sam Earle, Antonios Liapis, Julian Togelius, and Georgios N. Yannakakis. Large language models and games: A survey and roadmap. arXiv, 2024.
  6. Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In Proc. of IJCAI, pages 4246–4247, 2016.
  7. Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, October 2014. Association for Computational Linguistics.
  8. Casey Kennington and David Schlangen. A simple generative model of incremental reference resolution for situated dialogue. Computer Speech and Language, 41:43–67, 2017.
  9. Julia Kiseleva, Alexey Skrynnik, Artem Zholus, Shrestha Mohanty, Negar Arabzadeh, Marc-Alexandre Côté, Mohammad Aliannejadi, Milagro Teruel, Ziming Li, Mikhail Burtsev, Maartje ter Hoeve, Zoya Volovikova, Aleksandr Panov, Yuxuan Sun, Kavya Srinet, Arthur Szlam, Ahmed Awadallah, Seungeun Rho, Taehwan Kwon, Daniel Wontae Nam, Felipe Bivort Haiek, Edwin Zhang, Linar Abdrazakov, Guo Qingyam, Jason Zhang, and Zhibin Guo. Interactive grounded language understanding in a collaborative environment: Retrospective on IGLU 2022 competition. In Proceedings of the NeurIPS 2022 Competitions Track, PMLR, volume 220, pages 204–216, 2022.
  10. Jayant Krishnamurthy and Thomas Kollar. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics, 1:193–206, 2013.
  11. Anjali Narayan-Chen, Prashant Jayannavar, and Julia Hockenmaier. Collaborative dialogue in Minecraft. In Proc. of the 57th Annual Meeting of the ACL, pages 5405–5415, 2019.
  12. OpenAI. ChatGPT: A large-scale open-domain chatbot. Blog post, 2022.
  13. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
  14. David Schlangen. Dialogue games for benchmarking language understanding: Motivation, taxonomy, strategy. arXiv:2304.07007 [cs.CL], 2023.
  15. Arthur Szlam, Jonathan Gray, Kavya Srinet, Yacine Jernite, Armand Joulin, Gabriel Synnaeve, Douwe Kiela, Haonan Yu, Zhuoyuan Chen, Siddharth Goyal, Demi Guo, Danielle Rothermel, C. Lawrence Zitnick, and Jason Weston. Why build an assistant in Minecraft? arXiv: 1907.09273, 2019.
  16. Alberto Testoni and Raffaella Bernardi. “I’ve seen things you people wouldn’t believe”: Hallucinating entities in GuessWhat?! In Jad Kabbara, Haitao Lin, Amandalynne Paullada, and Jannis Vamvas, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 101–111, Online, August 2021. Association for Computational Linguistics.
  17. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
  18. Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents, 2024.
  19. Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, Samuel Humeau, and Jason Weston. Learning to speak and act in a fantasy text adventure game. arXiv preprint arXiv:1903.03094, March 2019.
  20. Luis von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006.
  21. Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024.
  22. Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents, 2023.
  23. Juntao Yu, Silviu Paun, Maris Camilleri, Paloma Carretero Garcia, Jon Chamberlain, Udo Kruschwitz, and Massimo Poesio. Aggregating crowdsourced and automatic judgments to scale up a corpus of anaphoric reference for fiction and Wikipedia texts. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 767–781, Dubrovnik, Croatia, 2023. Association for Computational Linguistics (ACL).
  24. Pei Zhou, Andrew Zhu, Jennifer Hu, Jay Pujara, Xiang Ren, Chris Callison-Burch, Yejin Choi, and Prithviraj Ammanabrolu. I cast detect thoughts: Learning to converse and guide with intents and theory-of-mind in dungeons and dragons. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11136–11155, Toronto, CAN, 2023. Association for Computational Linguistics.
  25. Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2024.
Copyright Massimo Poesio, Raffaella Bernardi, Julia Hockenmaier, and Udo Kruschwitz

Motivation

Over the past few years, there has been a decisive move in Artificial Intelligence (AI) towards human-centered intelligence and AI models that can learn through interaction. This shift is the result of the appearance of Large Language Models (LLMs) able to act as Intelligent Assistants, such as InstructGPT, ChatGPT, Bard, or Llama 2 (Ouyang et al, 2022; OpenAI, 2022; Touvron et al, 2023), which achieve an entirely new level of performance on many AI tasks. Much of the success of these models is due to training regimes combining supervised learning and learning from interaction with humans, such as Reinforcement Learning from Human Feedback (Christiano et al, 2017; Ouyang et al, 2022). In particular, the most recent of these models, such as GPT-4, are trained on multimodal data and are capable of producing output in different modalities. However, these models also have well-known issues, such as hallucinations, so that researchers speak of a Generative AI Paradox (West et al, 2023).

In parallel with the above developments, there has also been substantial progress on grounded interaction – developing models aware of the situation in which they operate (a physical world in the case of robots, a virtual world in the case of artificial agents) and able to, e.g., understand / produce references to this situation (Fitzgerald et al, 2013; Kazemzadeh et al, 2014; Kennington & Schlangen 2017; Chevalier-Boisvert et al, 2019; Testoni and Bernardi 2021; Suglia et al, 2024) perhaps through negotiation (Clark and Brennan, 1990). However, the communication between the intelligent assistant and grounded interaction communities is still limited (Krishnamurthy & Kollar 2013).

A particularly promising approach to study learning through grounded interaction with human agents is virtual world games: games in which conversational agents impersonating characters can learn to perform tasks, or improve their communicative ability, by interacting with human players in platforms such as Minecraft or Light (Johnson et al, 2016; Urbanek et al, 2019; Narayan-Chen et al, 2019; Szlam et al, 2019; Kiseleva et al, 2022; Zhou et al, 2023). Games have been shown to be a promising platform for collecting data from thousands of players (Ahn, 2006; Yu et al, 2023); virtual worlds approach the complexity of the real world; and virtual agents operating in such virtual worlds need to be able to develop a variety of interactional skills to be perceived as 'real' (Schlangen, 2023; Chamalasetti et al, 2023).

This Dagstuhl Perspectives Workshop aims, first of all, to bring together the communities working on the related areas of learning through interaction, (conversational) agents in games, dialogue and interaction, and collecting judgments from crowds through games, to make each community aware of the most recent developments in the other areas. We also intend to discuss current challenges, and whether advances in one area (e.g., grounded interaction) can benefit other areas (e.g., interactive learning). Topics to be discussed include:

  • Are there still gains to be had by training LLMs via games? Is interaction in games still a useful approach to training LLMs in this era when millions of people interact daily with them?
  • How beneficial is grounded human-in-the-loop interaction? Can grounded human-in-the-loop interaction result in better learning than purely textual interaction, or interaction involving text and images but without reference to a scene? For example, can it help with hallucinations?
  • Are there benefits from more complex interaction in interactive learning? Is there any advantage in moving towards a type of interaction more similar to actual human-human interaction – e.g., one in which conversational agents, too, are allowed to make clarification requests or take the initiative?
  • Gamification vs worldification: how does making externally motivated goals more game-like compare with making games more world-like?
Copyright Raffaella Bernardi, Julia Hockenmaier, Udo Kruschwitz, and Massimo Poesio

Participants


  • Malihe Alikhani (Northeastern University - Boston, US) [dblp]
  • Elisabeth André (Universität Augsburg, DE) [dblp]
  • Raffaella Bernardi (University of Trento, IT) [dblp]
  • Marc-Alexandre Côté (Microsoft - Montreal, CA) [dblp]
  • Simon Dobnik (University of Gothenburg, SE) [dblp]
  • Haishuo Fang (TU Darmstadt, DE)
  • Jonathan Ginzburg (University Paris-Diderot, FR) [dblp]
  • Ryuichiro Higashinaka (Nagoya University, JP) [dblp]
  • Julia Hockenmaier (University of Illinois - Urbana-Champaign, US) [dblp]
  • Nikolai Ilinykh (University of Gothenburg, SE) [dblp]
  • Prashant Jayannavar (University of Illinois - Urbana-Champaign, US) [dblp]
  • Alexander Koller (Universität des Saarlandes, DE) [dblp]
  • Udo Kruschwitz (Universität Regensburg, DE) [dblp]
  • Sharid Loáiciga (University of Gothenburg, SE) [dblp]
  • Catharine Oertel (TU Delft, NL) [dblp]
  • Diego Perez Liebana (Queen Mary University of London, GB) [dblp]
  • Massimo Poesio (Queen Mary University of London, GB & Utrecht University, NL) [dblp]
  • Matthew Purver (Queen Mary University of London, GB) [dblp]
  • David Schlangen (Universität Potsdam, DE) [dblp]
  • Carina Silberer (Universität Stuttgart, DE) [dblp]
  • Edwin Simpson (University of Bristol, GB) [dblp]
  • Alessandro Suglia (Heriot-Watt University - Edinburgh, GB) [dblp]
  • Alane Suhr (University of California - Berkeley, US)
  • Sina Zarrieß (Universität Bielefeld, DE) [dblp]
  • Andrew Zhu (University of Pennsylvania - Philadelphia, US)

Classification
  • Computation and Language
  • Computer Science and Game Theory
  • Human-Computer Interaction

Keywords
  • Artificial intelligence
  • Conversational agents in games
  • Human-in-the-loop learning
  • Grounded dialogue and interaction