Dagstuhl Seminar 23372: Human-Centered Approaches for Provenance in Automated Data Science

( Sep 10 – Sep 15, 2023 )

Permalink

Please use the following short URL to reference this page: https://www.dagstuhl.de/23372

Organizers

Anamaria Crisan (Tableau Software - Seattle, US)
Lars Kotthoff (University of Wyoming - Laramie, US)
Marc Streit (Johannes Kepler Universität Linz, AT)
Kai Xu (University of Nottingham, GB)

Contact

Marsha Kleinbauer (for scientific matters)
Simone Schilke (for administrative matters)

Publications

Anamaria Crisan, Lars Kotthoff, Marc Streit, and Kai Xu. Human-Centered Approaches for Provenance in Automated Data Science (Dagstuhl Seminar 23372). In Dagstuhl Reports, Volume 13, Issue 9, pp. 116-136, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Summary

This Dagstuhl Seminar brought together an interdisciplinary group of researchers and practitioners, spanning Data Science (DS) and Machine Learning (ML), Visualization and Human-Computer Interaction (HCI), and Provenance, to tackle the challenges in automated data science (AutoDS). We specifically focused on how methods from human-centered design and provenance can be leveraged to “open up the black box” of AutoDS, introduce greater observability of these methods, and promote human-machine teaming. We observed that many parallel efforts across different disciplines have yet to be integrated; our seminar brought together these different perspectives as a first step towards producing a general synthesis of methodologies and techniques for advancing AutoDS.

Primitives for AutoDS and hybrid modes of automation. Initial implementations of AutoDS tooling focused on the so-called CASH problem (Combined Algorithm Selection and Hyperparameter optimization), which was limited exclusively to the modeling phase of the data science workflow. More recent work has expanded the scope to include tasks pertaining to data preparation, feature engineering, and even model deployment and monitoring for concept drift. Within this expanded end-to-end scope for AutoDS, the individual components of the data science pipeline are often referred to as data science primitives; whether a primitive concerns work carried out by a human (e.g., selecting a data set for analysis) or by a machine (e.g., hyperparameter tuning) depends on the implementation of the system. Discussions of these data science primitives and of the scope of hybrid automation, where humans and automated processes trade off work, helped frame the discussion around provenance and human-centered design.
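As a rough illustration of the CASH problem mentioned above, the following minimal sketch searches jointly over which algorithm to use and how to set its hyperparameters. It is not taken from the seminar or from any AutoDS tool: the toy data set, the two classifiers, and all names (`SEARCH_SPACE`, `BUILDERS`, `random_search_cash`) are hypothetical, and plain random search stands in for the more sophisticated optimizers that real systems use.

```python
import random

# Toy 1-D data set (an assumption for illustration): label is 1 when x > 0.5.
DATA = [(i / 20, int(i / 20 > 0.5)) for i in range(21)]


def threshold_classifier(threshold):
    """Predict 1 when the input exceeds a tunable threshold."""
    return lambda x: int(x > threshold)


def majority_classifier(label):
    """Always predict the same label, whatever the input."""
    return lambda x: label


# Joint search space: every algorithm brings its own hyperparameter sampler.
SEARCH_SPACE = {
    "threshold": lambda rng: {"threshold": rng.uniform(0.0, 1.0)},
    "majority": lambda rng: {"label": rng.choice([0, 1])},
}
BUILDERS = {"threshold": threshold_classifier, "majority": majority_classifier}


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def random_search_cash(n_trials=200, seed=0):
    """Sample (algorithm, hyperparameters) pairs and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        algo = rng.choice(sorted(SEARCH_SPACE))   # algorithm selection
        params = SEARCH_SPACE[algo](rng)          # hyperparameter sampling
        score = accuracy(BUILDERS[algo](**params), DATA)
        if best is None or score > best[0]:
            best = (score, algo, params)
    return best


best_score, best_algo, best_params = random_search_cash()
print(best_algo, best_params, round(best_score, 3))
```

The point of the sketch is that the algorithm choice and its hyperparameters form a single search space, so each trial is one candidate pipeline. That is also what makes every trial a natural unit for recording provenance.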

Provenance modalities in an end-to-end AutoDS pipeline. Existing methodologies for provenance in data analysis focus on three related themes: data provenance, computation provenance, and user provenance. These are often studied separately, but they should be explored together for AutoDS to be fully transparent and auditable. Participants identified that the modalities for capturing data, computation, and user provenance may not always align, and that few techniques attempt their integration. Moreover, user provenance can be especially complex to capture and surface, as the thinking and reasoning behind analysis choices and decisions are much more challenging to capture than data science workflows or user interactions. Many open problems and potential solutions were discussed at the seminar; more details are provided in the full report.
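One way to picture the integration problem is a single log in which data, computation, and user provenance events share a common pipeline-step key. The sketch below is only an assumption about how such a record could look; the class names, fields, and example events are hypothetical and do not describe a schema proposed at the seminar.

```python
from dataclasses import dataclass, field

# The three provenance modalities named in the text.
MODALITIES = ("data", "computation", "user")


@dataclass(frozen=True)
class ProvenanceEvent:
    step: str       # pipeline step (data science primitive) the event belongs to
    modality: str   # one of MODALITIES
    actor: str      # "human" or "machine"
    detail: str     # free-text description of what happened


@dataclass
class ProvenanceLog:
    events: list = field(default_factory=list)

    def record(self, step, modality, actor, detail):
        """Append one event, rejecting unknown modalities."""
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality!r}")
        event = ProvenanceEvent(step, modality, actor, detail)
        self.events.append(event)
        return event

    def by_step(self, step):
        """Group all events of one pipeline step by modality."""
        grouped = {m: [] for m in MODALITIES}
        for event in self.events:
            if event.step == step:
                grouped[event.modality].append(event)
        return grouped


# Illustrative events only; none of these come from a real system.
log = ProvenanceLog()
log.record("feature_engineering", "data", "machine", "dropped columns with many missing values")
log.record("feature_engineering", "computation", "machine", "ran imputation primitive")
log.record("feature_engineering", "user", "human", "analyst overrode imputation based on domain knowledge")
log.record("modeling", "computation", "machine", "evaluated candidate pipelines")

view = log.by_step("feature_engineering")
print({modality: len(events) for modality, events in view.items()})
```

Keying every event by pipeline step lets a viewer line up what changed in the data, what the system computed, and why a person intervened. The free-text `detail` field also shows where the hard part lies: the reasoning behind a user decision is not captured automatically.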

Visual and interaction techniques for explainable AutoDS (i.e., model-to-human communication). Data visualization is a powerful medium for helping users understand and analyze complex data (in our case, the AutoDS provenance), and for creating opportunities for domain experts and data scientists to interrogate the pipelines themselves. Visual techniques for provenance of AutoDS pipelines exist (e.g., PipelineProfiler, ATMSeer, ModelLineUpper, AutoVizAI, and Visus), but these focus almost exclusively on modeling and do not consider the broader scope of AutoDS primitives. Seminar participants explored the possibilities and utility of visualizing multiple provenance modalities across AutoDS primitives to achieve this goal.

Human-centered approaches to data science and analytics (i.e., human-to-model communication). Seminar participants acknowledged that humans and automated processes must collaborate in AutoDS, which makes it necessary to explicitly consider the needs of humans to understand and intervene. Human-centered design encapsulates a broad set of methodologies and techniques for designing technology that interfaces with people. Seminar participants advocated for a broader application of human-centered approaches to ML/AI, including to mitigate the concerns about “black box” algorithms discussed earlier. A related research challenge is to make DS models more “interactive” so that user expertise and knowledge can be incorporated more easily, especially for non-technical domain experts. This can happen during the training of a large model, through user “steering” to reduce training time, or after deployment, with techniques such as “active learning” to continuously improve the model.

Creative Commons BY 4.0

Anamaria Crisan, Lars Kotthoff, Marc Streit, and Kai Xu

Motivation

This Dagstuhl Seminar aims to bring together an interdisciplinary group of researchers and practitioners, spanning Data Science (DS) and Machine Learning (ML), Visualization and Human-Computer Interaction (HCI), and Provenance, to tackle the challenges in automated data science (AutoDS). We specifically focus on how methods from human-centered design and provenance can be leveraged to “open up the black box” of AutoDS, introduce greater observability of these methods, and promote human-machine teaming. We observed that many parallel efforts across different disciplines have yet to be integrated; our seminar aims to bring together these different perspectives to produce a general synthesis of methodologies and techniques for advancing AutoDS.

Primitives for AutoDS and hybrid modes of automation. Initial implementations of AutoDS tooling focused on the so-called CASH problem (Combined Algorithm Selection and Hyperparameter optimization), which was limited exclusively to the modeling phase of the data science workflow. More recent work has expanded the scope to include tasks pertaining to data preparation, feature engineering, and even model deployment and monitoring for concept drift. Within this expanded end-to-end scope for AutoDS, the individual components of the data science pipeline are often referred to as data science primitives; whether a primitive concerns work carried out by a human (e.g., selecting a data set for analysis) or by a machine (e.g., hyperparameter tuning) depends on the implementation of the system. Discussing these data science primitives and the scope of hybrid automation, where humans and automated processes trade off work, is important for framing a discussion around provenance and human-centered design.

Provenance modalities in an end-to-end AutoDS pipeline. Existing methodologies for provenance in data analysis focus on three related themes: data provenance, computation provenance, and user provenance. These are often studied separately, but they should be explored together for AutoDS to be fully transparent and auditable. Yet the modalities for capturing data, computation, and user provenance may not always align, and few techniques attempt their integration. Moreover, user provenance can be especially complex to capture and surface, as the thinking and reasoning behind analysis choices and decisions are much more challenging to capture than data science workflows or user interactions.

Visual and interaction techniques for explainable AutoDS (i.e., model-to-human communication). Data visualization is a powerful medium for helping users understand and analyze complex data (in our case, the AutoDS provenance), and for creating opportunities for domain experts and data scientists to interrogate the pipelines themselves. Visual techniques for provenance of AutoDS pipelines exist (e.g., PipelineProfiler, ATMSeer, ModelLineUpper, AutoVizAI, and Visus), but these focus almost exclusively on modeling and do not consider the broader scope of AutoDS primitives. This topic explores the possibilities and utility of visualizing multiple provenance modalities across AutoDS primitives.

Human-centered approaches to data science and analytics (i.e., human-to-model communication). Because humans and automated processes must collaborate in AutoDS, it becomes necessary to explicitly consider the needs of humans to understand and intervene. Human-centered design encapsulates a broad set of methodologies and techniques for designing technology that interfaces with people. Researchers and practitioners advocate for a broader application of human-centered approaches to ML/AI, including to mitigate the concerns about “black box” algorithms discussed earlier. Another research challenge is to make DS models more “interactive” so that user expertise and knowledge can be incorporated more easily, especially for non-technical domain experts. This can happen during the training of a large model, through user “steering” to reduce training time, or after deployment, with techniques such as “active learning” to continuously improve the model.

Creative Commons BY 4.0

Anamaria Crisan, Lars Kotthoff, Marc Streit, and Kai Xu

Participants

Marie Anastacio (RWTH Aachen, DE) [dblp]
Leilani Battle (University of Washington - Seattle, US) [dblp]
Jürgen Bernard (Universität Zürich, CH) [dblp]
Nadia Boukhelifa (INRAE - Palaiseau, FR) [dblp]
Camelia D. Brumar (Tufts University - Medford, US) [dblp]
Mehdi Chakhchoukh (Université de Paris-Saclay - Gif-sur-Yvette, FR & INRAE - Paris, FR) [dblp]
Anamaria Crisan (Tableau Software - Seattle, US) [dblp]
Klaus Eckelt (Johannes Kepler Universität Linz, AT) [dblp]
Mennatallah El-Assady (ETH Zürich, CH) [dblp]
Alex Endert (Georgia Institute of Technology - Atlanta, US) [dblp]
Rebecca Faust (Virginia Polytechnic Institute - Blacksburg, US) [dblp]
Kiran Gadhave (University of Utah - Salt Lake City, US) [dblp]
Andreas Kerren (Linköping University, SE) [dblp]
Steffen Koch (Universität Stuttgart, DE) [dblp]
David Koop (Northern Illinois University - DeKalb, US) [dblp]
Lars Kotthoff (University of Wyoming - Laramie, US) [dblp]
Alexander Lex (University of Utah - Salt Lake City, US) [dblp]
Dominik Moritz (Carnegie Mellon University - Pittsburgh, US) [dblp]
Alvitta Ottley (Washington University - St. Louis, US) [dblp]
Jen Rogers (Tufts University - Medford, US)
Sheeba Samuel (Friedrich-Schiller-Universität Jena, DE)
Marc Streit (Johannes Kepler Universität Linz, AT) [dblp]
Tanja Tornede (Leibniz Universität Hannover, DE) [dblp]
Cagatay Turkay (University of Warwick - Coventry, GB) [dblp]
Emily Wall (Emory University - Atlanta, US) [dblp]
Kai Xu (University of Nottingham, GB) [dblp]

Classification

Graphics
Human-Computer Interaction
Machine Learning

Keywords

Automation
Machine Learning
Visualization
Human-Centered Design
Provenance
