Dagstuhl Seminar 26192

Evaluation of AI Models in Software Engineering

(May 03 – May 08, 2026)

Permalink
Please use the following short URL to reference this page: https://www.dagstuhl.de/26192

Organizers
  • Satish Chandra (Meta - Menlo Park, US)
  • Maliheh Izadi (TU Delft, NL)
  • Michael Pradel (Universität Stuttgart, DE & CISPA - Saarbrücken, DE)

Contact

Dagstuhl Seminar Wiki

Shared Documents

Schedule
  • Upload (use the personal credentials created in DOOR to log in)

Motivation

Large Language Models (LLMs) are rapidly reshaping software engineering—powering code generation, debugging, and documentation. Yet while adoption is high, trust is low: only 3% of developers “highly trust” AI coding tools. A key reason is the lack of rigorous, standardized evaluation. Current benchmarks capture basic correctness but overlook qualities critical in real projects, such as readability, maintainability, security, and efficiency.

This Dagstuhl Seminar brings together researchers and practitioners to confront this evaluation gap. Our goal is to define what to measure, and how, when assessing LLM-based tools, and to develop shared benchmarks and guidelines. By building a stronger foundation for evaluation, we aim to foster reliable comparisons, drive progress in tool development, and strengthen confidence in AI for software engineering.

Objectives and Topics

The seminar aims to build a community-driven roadmap for evaluating AI in software engineering. We will:

  • Benchmark AI coding tools: Define standardized tasks and benchmarks that better reflect real-world coding, from multi-file projects to collaborative scenarios.
  • Improve evaluation frameworks: Develop reproducible, community-adopted frameworks and harnesses that support continuous and human-in-the-loop evaluation.
  • Expand metrics: Go beyond accuracy to capture readability, maintainability, security, efficiency, and other critical dimensions of software quality.
  • Address open challenges: Anticipate emerging issues, such as evaluating coding agents, human–AI collaboration, and responsible benchmarking practices.

Together, these efforts will establish shared standards for rigorous and practical evaluation of AI in software engineering.
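For context on the correctness-only baseline the seminar aims to go beyond: current code-generation benchmarks typically report pass@k, the probability that at least one of k sampled solutions passes a task's unit tests, computed with the unbiased estimator from HumanEval-style evaluations. A minimal sketch (this example is illustrative and not part of the seminar materials):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples per task,
    of which c pass the unit tests, estimate the probability that at
    least one of k randomly drawn samples passes.
    Formula: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failing samples than the draw budget: a passing
        # sample is guaranteed to be included.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per task, 40 passing, budget of 10 attempts.
print(round(pass_at_k(200, 40, 10), 3))
```

Note that pass@k captures only functional correctness against the given tests; it says nothing about the readability, maintainability, security, or efficiency dimensions the seminar highlights.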

Copyright Satish Chandra, Maliheh Izadi, and Michael Pradel

LZI Junior Researchers

This seminar qualifies for Dagstuhl's LZI Junior Researchers program. Schloss Dagstuhl wishes to enable the participation of junior scientists with a specialization fitting this Dagstuhl Seminar, even if they are not yet known to the organizers. Applications by outstanding junior scientists are possible until December 5, 2025.


Participants

Please log in to DOOR to see more details.

  • Vivi Andersson
  • Doehyun Baek
  • Earl T. Barr
  • Eric Bodden
  • Marcel Böhme
  • Jialun Cao
  • Satish Chandra
  • Jürgen Cito
  • Premkumar T. Devanbu
  • Ye He
  • Yintong Huo
  • Maliheh Izadi
  • Mehdi Keshani
  • Claire Le Goues
  • Ziyou Li
  • Zhongxin Liu
  • Petros Maniatis
  • Nachiappan Nagappan
  • Pedro Orvalho
  • Matteo Paltenghi
  • Rangeet Pan
  • Rahul Pandita
  • Razvan Popescu
  • Michael Pradel
  • Francisco Ribeiro
  • Patrick Rondon
  • Abhik Roychoudhury
  • Shin Yoo

Classification
  • Artificial Intelligence
  • Software Engineering

Keywords
  • large language models
  • evaluation
  • benchmarking
  • metrics
  • frameworks