This Dagstuhl Seminar aims to bring together an interdisciplinary group of researchers and practitioners, spanning Data Science (DS) and Machine Learning (ML), Visualization and Human-Computer Interaction (HCI), and Provenance, to tackle the challenges in automated data science (AutoDS). We focus specifically on how human-centered design approaches and provenance methods can be leveraged to “open up the black box” of AutoDS, make these methods more observable, and promote human-machine teaming. Many parallel efforts across these disciplines have yet to be integrated; the seminar aims to bring these perspectives together to produce a general synthesis of methodologies and techniques for advancing AutoDS.
Primitives for AutoDS and hybrid modes of automation. Initial implementations of AutoDS tooling focused on the so-called CASH problem (combined algorithm selection and hyperparameter optimization), which is limited to the modeling phase of the data science workflow. More recent work has expanded the scope to include data preparation, feature engineering, and even model deployment and monitoring for concept drift. Within this expanded end-to-end scope, the individual components of the data science pipeline are often referred to as data science primitives; whether a primitive is carried out by a human (e.g., selecting a data set for analysis) or by a machine (e.g., hyperparameter tuning) depends on the implementation of the system. Discussing these primitives and the scope of hybrid automation, where humans and automated processes trade off work, is important for framing a discussion around provenance and human-centered design.
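To make the CASH formulation concrete, the following is a minimal sketch that treats the choice of algorithm and its hyperparameters as one joint search space and explores it by random search. Everything in it is an illustrative assumption: the toy data set, the two hand-written classifiers, and the function names are hypothetical, and random search is only one of several possible optimizers. A real AutoDS system would also score candidates on held-out data rather than on the data used for the search.

```python
import random

# Toy 1-D data set (hypothetical): the label is 1 when x > 0.6.
DATA = [(i / 20, int(i / 20 > 0.6)) for i in range(21)]


def threshold_classifier(threshold):
    """Predict 1 when x exceeds the threshold."""
    return lambda x: int(x > threshold)


def interval_classifier(low, high):
    """Predict 1 when x falls inside [low, high]."""
    return lambda x: int(low <= x <= high)


# Joint search space: each algorithm is paired with a sampler that draws
# one hyperparameter configuration for it.
SEARCH_SPACE = {
    "threshold": (
        threshold_classifier,
        lambda rng: {"threshold": rng.random()},
    ),
    "interval": (
        interval_classifier,
        lambda rng: {"low": rng.uniform(0.0, 0.5), "high": rng.uniform(0.5, 1.0)},
    ),
}


def accuracy(model, data):
    """Fraction of points the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def cash_random_search(data, n_trials=200, seed=0):
    """Pick an algorithm AND its hyperparameters in a single search loop."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        name = rng.choice(sorted(SEARCH_SPACE))   # algorithm selection
        build, sample = SEARCH_SPACE[name]
        params = sample(rng)                      # hyperparameter selection
        score = accuracy(build(**params), data)
        if best is None or score > best[0]:
            best = (score, name, params)
    return best


best_score, best_algorithm, best_params = cash_random_search(DATA)
print(best_algorithm, best_params, round(best_score, 3))
```

The point of the sketch is that the algorithm name is just another dimension of the search space, which is what distinguishes CASH from tuning a single, pre-selected model.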
Provenance modalities in an end-to-end AutoDS pipeline. Existing methodologies for provenance in data analysis focus on three related themes: data provenance, computation provenance, and user provenance. These are often studied separately, but they should be explored together if AutoDS is to be fully transparent and auditable. Yet the modalities for capturing data, computation, and user provenance may not always align, and few techniques attempt to integrate them. Moreover, user provenance can be especially complex to capture and surface, because the thinking and reasoning behind analysis choices and decisions are much harder to capture than data science workflows or user interactions.
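One way to picture an integrated record of the three provenance modalities is a single event log in which every entry is tagged with its kind and with the pipeline primitive it belongs to. The sketch below is purely illustrative: the class names, field names, and example events are hypothetical and do not correspond to any existing provenance system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProvenanceEvent:
    step: str                        # pipeline primitive, e.g. "data_preparation"
    kind: str                        # "data", "computation", or "user"
    actor: str                       # "human" or "machine"
    detail: str                      # what happened
    rationale: Optional[str] = None  # reasoning behind a user decision, if captured


@dataclass
class ProvenanceLog:
    events: List[ProvenanceEvent] = field(default_factory=list)

    def record(self, **kwargs) -> ProvenanceEvent:
        """Append one event to the shared log and return it."""
        event = ProvenanceEvent(**kwargs)
        self.events.append(event)
        return event

    def for_step(self, step: str) -> List[ProvenanceEvent]:
        """All events, of any kind, attached to one pipeline primitive."""
        return [e for e in self.events if e.step == step]

    def missing_rationale(self) -> List[ProvenanceEvent]:
        """User events whose reasoning was never captured."""
        return [e for e in self.events if e.kind == "user" and e.rationale is None]


log = ProvenanceLog()
log.record(step="data_preparation", kind="data", actor="machine",
           detail="dropped rows with missing values")
log.record(step="data_preparation", kind="user", actor="human",
           detail="excluded one input column")
log.record(step="modeling", kind="computation", actor="machine",
           detail="ran hyperparameter search")

print(len(log.for_step("data_preparation")))
print([e.detail for e in log.missing_rationale()])
```

The `missing_rationale` query illustrates the difficulty noted above: the interaction itself is easy to log, while the reasoning behind it is often absent.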
Visual and interaction techniques for explainable AutoDS (i.e., model-to-human communication). Data visualization is a powerful medium for helping users understand and analyze complex data (in our case, AutoDS provenance) and for creating opportunities for domain experts and data scientists to interrogate the pipelines themselves. Visual techniques for the provenance of AutoDS pipelines exist (e.g., PipelineProfiler, ATMSeer, ModelLineUpper, AutoVizAI, and Visus), but they focus almost exclusively on modeling and do not consider the broader scope of AutoDS primitives. This topic explores the possibilities and utility of visualizing multiple provenance modalities across AutoDS primitives.
Human-centered approaches to data science and analytics (i.e., human-to-model communication). Because humans and automated processes must collaborate in AutoDS, it becomes necessary to explicitly consider what humans need in order to understand and intervene. Human-centered design encompasses a broad set of methodologies and techniques for designing technology that interfaces with people. Researchers and practitioners advocate for a broader application of human-centered approaches to ML/AI, including to mitigate the concerns about “black box” algorithms discussed earlier. Another research challenge is to make DS models more “interactive”, so that user expertise and knowledge can be incorporated more easily, especially for non-technical domain experts. This can happen during the training of a large model, through user “steering” to reduce training time, or after deployment, with techniques such as “active learning” that continuously improve the model.
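As a small illustration of how active learning brings a human into the loop, the sketch below implements uncertainty sampling, one common query strategy: the model asks the user to label the unlabeled example it is least sure about. The one-feature logistic model, its parameters, and the function names are hypothetical choices made only for this example.

```python
import math


def predict_proba(x, weight, bias):
    """Probability of the positive class under a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def most_uncertain(pool, weight, bias):
    """Return the unlabeled point whose predicted probability is closest to 0.5."""
    return min(pool, key=lambda x: abs(predict_proba(x, weight, bias) - 0.5))


def query_batch(pool, weight, bias, k):
    """Pick the k most uncertain points to send to a human annotator."""
    ranked = sorted(pool, key=lambda x: abs(predict_proba(x, weight, bias) - 0.5))
    return ranked[:k]


unlabeled_pool = [-2.0, -0.1, 1.5, 3.0]
print(most_uncertain(unlabeled_pool, weight=1.0, bias=0.0))
print(query_batch(unlabeled_pool, weight=1.0, bias=0.0, k=2))
```

In a full loop, the labels returned by the user would be added to the training set and the model refit before the next query; that retraining step is omitted here.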
- Human-Computer Interaction
- Machine Learning
- Human-Centered Design