Dagstuhl Seminar 26281
Multimodal Data Quality – Human, Computational, and Institutional Perspectives
(Jul 05 – Jul 10, 2026)
Organizers
- Gianluca Demartini (University of Queensland - Brisbane, AU)
- Vanessa Murdock (Amazon Web Services - Seattle, US)
- Felix Naumann (Hasso-Plattner-Institut, Universität Potsdam, DE)
- Shazia Sadiq (University of Queensland - Brisbane, AU)
- Divesh Srivastava (AT&T Labs Research - Bedminster, US)
Contact
- Marsha Kleinbauer (for scientific matters)
- Susanne Bach-Bernhard (for administrative matters)
The reliance of advanced applications (e.g., in domains like education, finance, health) and emerging technologies (e.g., large language models (LLMs)) on an ever-increasing scale and variety of multimodal data has created the need for new tools, methods, algorithms, and even new professions (e.g., data quality officer), to ensure that the data coming from different sources is fit for purpose.
Data quality in its many dimensions is generally improved by data curation, which includes at least data preparation and cleaning (e.g., dealing with data quality issues, different formats and structures; data transformations; integration and augmentation); data annotation (e.g., collecting human labels to then train supervised machine learning models to scale up the annotation process); data synthesis and generation; and inclusion of human and institutional oversight in large-scale automated curation tasks.
This Dagstuhl Seminar is relevant to the diverse fields of data management, data engineering, human computation and crowdsourcing, data-driven decision-making, responsible AI, and data governance. The seminar looks at three perspectives: Human, Computational, and Institutional. There is an evident need to bring together perspectives from domain experts who understand the properties and semantics of the datasets (human perspective); from algorithmic advancements that can help improve data quality (computational perspective); and from institutional imperatives including regulations, standards and organizational policies that create necessary safeguards for governance of data pipeline processes (institutional perspective).
Topics discussed in the seminar will aim to span the three perspectives on data quality outlined above, including but not limited to:
- Human perspective
  - Data Bias and Quality of human annotations collected at scale via crowdsourcing
  - Active Learning methods to optimize data annotation strategies
  - Support human labelling by means of data augmentation or LLM-generated
  - The human impact of data quality due to bias, fairness, robustness, toxicity, privacy
- Computational perspective
  - The role of generative AI in data annotation
  - Hallucination, reliability, trust of LLMs
  - Training AI with noisy or unbalanced data and evaluating on representative data
  - Data-Centric AI: machine-generated data, data quality for fine-tuning, evaluation/training data for heterogeneous agentic systems
  - Algorithmic fairness, multi-calibration
  - Data security and privacy (e.g., jail-breaking LLMs)
- Institutional perspective
  - Data Governance and Responsible Information Use
  - Information Resilience across the data value chain
  - Compliance with data standards and regulation
  - Ethics related to the use of low-quality data to inform decisions and train AI
  - Organizational structures and best practice
  - Data monetization and value realization

Classification
- Computers and Society
- Databases
- Human-Computer Interaction
Keywords
- Data Quality
- Bias and Fairness
- Responsible AI