Dagstuhl Seminar 26192
Evaluation of AI Models in Software Engineering
(May 03 – May 08, 2026)
Organizers
- Satish Chandra (Meta - Menlo Park, US)
- Maliheh Izadi (Google - Zürich, CH & TU Delft, NL)
- Michael Pradel (Universität Stuttgart, DE & CISPA - Saarbrücken, DE)
Contact
- Andreas Dolzmann (for scientific matters)
- Jutka Gasiorowski (for administrative matters)
Dagstuhl Reports
As part of the mandatory documentation, participants are asked to submit their talk abstracts, working group results, etc. for publication in our series Dagstuhl Reports via the Dagstuhl Reports Submission System.
Large Language Models (LLMs) are rapidly reshaping software engineering—powering code generation, debugging, and documentation. Yet while adoption is high, trust is low: only 3% of developers “highly trust” AI coding tools. A key reason is the lack of rigorous, standardized evaluation. Current benchmarks capture basic correctness but overlook qualities critical in real projects, such as readability, maintainability, security, and efficiency.
This Dagstuhl Seminar brings together researchers and practitioners to confront this evaluation gap. Our goal is to define what to measure, and how, when assessing LLM-based tools, and to develop shared benchmarks and guidelines. By building a stronger foundation for evaluation, we aim to foster reliable comparisons, drive progress in tool development, and strengthen confidence in AI for software engineering.
Objectives and Topics
The seminar aims to build a community-driven roadmap for evaluating AI in software engineering. We will:
- Benchmark AI coding tools: Define standardized tasks and benchmarks that better reflect real-world coding, from multi-file projects to collaborative scenarios.
- Improve evaluation frameworks: Develop reproducible, community-adopted frameworks and harnesses that support continuous and human-in-the-loop evaluation.
- Expand metrics: Go beyond accuracy to capture readability, maintainability, security, efficiency, and other critical dimensions of software quality.
- Address future open challenges: Anticipate emerging issues, such as evaluating coding agents, human–AI collaboration, and responsible benchmarking practices.
Together, these efforts will establish shared standards for rigorous and practical evaluation of AI in software engineering.
Satish Chandra, Maliheh Izadi, and Michael Pradel
This seminar qualifies for Dagstuhl's LZI Junior Researchers program. Schloss Dagstuhl wishes to enable the participation of junior scientists with a specialisation fitting for this Dagstuhl Seminar, even if they are not on the radar of the organizers. Applications by outstanding junior scientists are possible until December 5, 2025.
Participants
- Vivi Andersson (KTH Royal Institute of Technology - Stockholm, SE)
- Doehyun Baek (Universität Stuttgart, DE)
- Earl T. Barr (University College London, GB)
- Eric Bodden (Universität Paderborn, DE)
- Marcel Böhme (MPI-SP - Bochum, DE)
- Satish Chandra (Meta - Menlo Park, US)
- Jürgen Cito (TU Wien, AT)
- Premkumar T. Devanbu (University of California - Davis, US)
- Yintong Huo (SMU - Singapore, SG)
- Maliheh Izadi (Google - Zürich, CH & TU Delft, NL)
- Mehdi Keshani (Universität Zürich, CH)
- Claire Le Goues (Carnegie Mellon University - Pittsburgh, US)
- Ziyou Li (TU Delft, NL)
- Zhongxin Liu (Zhejiang University - Hangzhou, CN)
- Petros Maniatis (Google DeepMind - Mountain View, US)
- Pedro Orvalho (IIIA, CSIC - Barcelona, ES)
- Matteo Paltenghi (Meta - Menlo Park, US)
- Rangeet Pan (IBM TJ Watson Research Center - Yorktown Heights, US)
- Rahul Pandita (GitHub - San Francisco, US)
- Razvan Popescu (TU Delft, NL)
- Michael Pradel (Universität Stuttgart, DE & CISPA - Saarbrücken, DE)
- Francisco Ribeiro (New York University - Abu Dhabi, AE)
- Patrick Rondon (Google - New York, US)
- Abhik Roychoudhury (National University of Singapore, SG)
- Shin Yoo (KAIST - Daejeon, KR)
Classification
- Artificial Intelligence
- Software Engineering
Keywords
- large language models
- evaluation
- benchmarking
- metrics
- frameworks

Creative Commons BY 4.0
