Dagstuhl Seminar 18401
Automating Data Science
(Sep 30 – Oct 05, 2018)
- Tijl De Bie (Ghent University, BE)
- Luc De Raedt (KU Leuven, BE)
- Holger H. Hoos (Leiden University, NL)
- Padhraic Smyth (University of California - Irvine, US)
- Michael Gerke (for scientific matters)
- Annette Beyer (for administrative matters)
Data science is concerned with the extraction of knowledge and insight, and ultimately societal or economic value, from data. It complements traditional statistics in that its object is data as it presents itself in the wild (often complex and heterogeneous, noisy, loosely structured, biased, etc.), rather than well structured data sampled in carefully designed studies. It also has a strong computer science focus, and is related to popular areas such as big data, machine learning, data mining and knowledge discovery.
Data science is becoming increasingly important with the abundance of big data, while the number of skilled data scientists is lagging. This has raised the question of whether data science can be automated, a question that is of interest in several contexts. First, from an artificial intelligence perspective, it is interesting to investigate whether (data) science, or portions of it, can be automated, as it is an activity currently requiring high levels of human expertise. Second, the field of machine learning has a long-standing interest in applying machine learning at the meta-level in order to obtain better machine learning algorithms, which has yielded recent successes in automated parameter tuning, algorithm configuration, and algorithm selection. Third, there is an interest in automating not only the model-building process itself (cf. the Automated Statistician) but also the preprocessing steps (data wrangling).
This Dagstuhl seminar will bring together researchers from all areas concerned with data science in order to study whether, to what extent, and how data science can be automated. It will focus on the following Data Science topics:
- Data Wrangling
- Predictive Modeling
- Exploratory Data Analysis
- Inductive querying
- Probabilistic Programming
- Visual Analytics
and will aim at answering the following questions:
- How can we automatically tune the parameters or configure algorithms? How can we apply this to machine learning and data science algorithms? This is related to expert / rule-based systems, information criteria, statistical learning theory, learning to learn, meta-learning, etc.
- How can we assist users in their exploratory data mining tasks? Can we automate it? What type of interactivity is needed? How to obtain models of the user and of interestingness?
- How can we support the data-wrangling process? How can inductive programming techniques help? Can it be realized fully automatically? What are the limitations and opportunities?
- How can one automate data-driven story-telling? How can we explain learned models to the user? To what extent can natural language be used?
- Can we (partially) automate Visual Analytics? Can we automatically visualize what is of interest to the user?
- What is the trade-off between automation and interaction? To what extent is automation (un)desirable?
- How can probabilistic programming and inductive querying techniques be used to facilitate data science?
- How can automation be married with the increasing tendency for personalization? With the impact on privacy and society of data science, are there any additional ethical issues to be taken into account?
- Data Science for the expert versus for the layperson: different optimal trade-offs?
Data science is concerned with the extraction of knowledge and insight, and ultimately societal or economic value, from data. It complements traditional statistics in that its object is data as it presents itself in the wild (often complex and heterogeneous, noisy, loosely structured, biased, etc.), rather than well-structured data sampled in carefully designed studies.
Such 'Big Data' is increasingly abundant, while the number of skilled data scientists is lagging. This has raised the question of whether data science can be automated, a question that is of interest in several contexts. First, from an artificial intelligence perspective, it is related to the issue of "robot scientists", which are concerned with the automation of scientific processes and which have so far largely focused on the life sciences. It is interesting to investigate whether principles of robot scientists can be applied to data science.
Second, there exist many results in the machine learning community, which has since the early 1980s been applying machine learning at a meta-level, in order to learn which machine learning algorithms, variants and (hyper-)parameter settings should be used on which types of data sets.
In recent years, there have been breakthroughs in this domain, and there now exist effective systems (such as Auto-WEKA and auto-sklearn) that automatically select machine learning methods and configure their hyperparameters in order to maximize the predictive performance on particular datasets.
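The core problem that systems such as Auto-WEKA and auto-sklearn address is often framed as Combined Algorithm Selection and Hyperparameter optimization (CASH): searching a joint space of learning algorithms and their hyperparameter settings for the configuration with the best (cross-validated) performance. The following is a minimal sketch of this idea using plain random search over a toy configuration space; the `evaluate` function is a hypothetical stand-in for training and validating an actual learner, and nothing here reflects the real implementations of those systems.

```python
import random

# Toy stand-in for cross-validated performance of a learner on a dataset.
# In a real AutoML system this would train and evaluate an actual model.
def evaluate(algorithm, params):
    if algorithm == "knn":
        # Pretend accuracy peaks around k = 10.
        return 1.0 - abs(params["k"] - 10) / 50
    else:  # "tree"
        # Pretend accuracy peaks around max_depth = 6.
        return 1.0 - abs(params["max_depth"] - 6) / 20

# Joint space of algorithms and hyperparameter samplers (the CASH search space).
SPACE = {
    "knn": lambda: {"k": random.randint(1, 50)},
    "tree": lambda: {"max_depth": random.randint(1, 20)},
}

def random_search(n_trials=200, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    random.seed(seed)
    best = (None, None, float("-inf"))
    for _ in range(n_trials):
        algo = random.choice(list(SPACE))
        params = SPACE[algo]()
        score = evaluate(algo, params)
        if score > best[2]:
            best = (algo, params, score)
    return best

algo, params, score = random_search()
```

Real systems replace the random sampler with smarter strategies such as Bayesian optimization, and warm-start the search with meta-learned knowledge from previous datasets.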
Third, there are projects such as the Automated Statistician that aim to fully automate the process of statistical modeling. Such systems could dramatically simplify scientific data modeling tasks, empowering scientists from data-rich disciplines such as bioinformatics, climate data analysis, computational social science, and so on. To ensure success, important challenges need to be tackled, not only from a purely modeling perspective, but also in terms of interpretability and the human-computer interface. For example, the input to the Automated Statistician is a dataset, and the system produces not only a complex statistical model by means of a search process, but also an explanation of it in natural language.
Fourth, there is an interest in not only automating the model building step in data science, but also various steps that precede it. It is well known in data science that 80% of the effort goes into preprocessing the data, putting it in the right format, and selecting the right features, whereas the model-building step typically only takes 20% of the effort. This has motivated researchers to focus on automated techniques for data wrangling, which is precisely concerned with transforming the given dataset into a format that can be handled by the data analysis component. Here, there are strong connections with inductive programming techniques.
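One way inductive programming can support data wrangling is programming by example, in the spirit of FlashFill: the user demonstrates the desired transformation on a few cells, the system searches a small domain-specific language for a program consistent with those examples, and the program is then applied to the rest of the column. The sketch below is a deliberately tiny illustration of this search; the DSL, its operations, and the examples are all hypothetical.

```python
from itertools import product

# A toy DSL of string transformations for column cleaning.
DSL = {
    "lower": str.lower,
    "upper": str.upper,
    "strip": str.strip,
    "first_word": lambda s: s.split()[0],
    "last_word": lambda s: s.split()[-1],
}

def run_program(prog, s):
    """Apply a sequence of DSL operations to a string."""
    for op in prog:
        s = DSL[op](s)
    return s

def synthesize(examples, max_len=2):
    """Enumerate DSL programs (shortest first) until one is
    consistent with all input/output examples."""
    for length in range(1, max_len + 1):
        for prog in product(DSL, repeat=length):
            if all(run_program(prog, i) == o for i, o in examples):
                return list(prog)
    return None

# The user demonstrates the wrangling step on two messy cells...
examples = [("  Ada Lovelace", "LOVELACE"), ("Alan Turing ", "TURING")]
prog = synthesize(examples)
# ...and the synthesized program generalizes to the rest of the column.
```

Production systems use far richer DSLs and prune the search with version-space algebra rather than brute-force enumeration, but the demonstrate-synthesize-apply loop is the same.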
Fifth, as it is often easier for non-expert users to interpret and understand visualisations of data rather than statistical models, work on automatic visualisation of data sets is also very relevant to this Dagstuhl seminar.
Finally, an interesting and challenging research question is whether it is possible to develop an integrated solution that tackles all these issues (as is the topic of the ERC AdG SYNTH).
Overview of the seminar
Structure of the seminar
The seminar was structured as follows. The mornings were generally dedicated to presentations (short tutorials on day one), whereas the afternoons were dedicated to plenary discussions, smaller-group breakout sessions, and flex time that had been left open in the schedule prior to the seminar. The flex time ended up being devoted to a mix of presentations and breakout sessions.
Challenges in automating data science
On day one, a range of challenges for research on automating data science were identified, which can be clustered around the following six themes:
- Automating Machine Learning (AutoML)
Main challenges: computational efficiency; ensuring generalization, also for small data; making AutoML faster and more data-efficient using meta-learning; extending ideas from AutoML to exploratory analysis and unsupervised learning.
- Exploratory data analysis and visualization
Main challenges: the fact that there is no single or clearly defined objective; helping the user make progress towards an ill-defined goal; the (subjective) interestingness of an analysis, a pattern, or a visualization; integrating machine learning and interaction in exploration; exploring data types beyond simply tabular data; the veracity of visualizations; how to quantify progress and measure success; the need for benchmarks.
- Data wrangling
Main challenges: extending the scope of AutoML to include data wrangling tasks; user interfaces that provide intuitive input for data wrangling tasks; how to quantify progress and measure success; the need for benchmarks.
- Automation and human-centric data science (explainability, privacy, fairness, trust, interaction)
Main challenges: building privacy and fairness constraints into automatic data science systems; the dangers of ignorant use of automated data science systems; the fact that different levels of expertise benefit from different degrees of automation; optimizing the performance of the combined human/machine 'team'; determining when and where the human must be involved; a definition of, or criteria for, explainability; the risk that automation will reduce explainability and transparency; explainability to whom: a data scientist or a layperson?
- Facilitating data science by novel querying and programming paradigms
Main challenges: interactive data models to help users gain intuitive understanding; declarative approaches for data analysis, querying, and visualization; a query language for automated data science.
- Evaluation and measures of success
Main challenges: robust objective measures for data science processes beyond predictive modelling; subjective measures, i.e., measures that depend on the user's background and goals; evaluation of the entire data science pipeline versus individual steps; reproducibility in the presence of user interactions.
Topics discussed in depth
These identified challenges were then used to determine the program of the rest of the seminar. Talks were held on partial solutions to a range of these challenges. In addition, breakout discussions were held on the following topics:
- The relation between data-driven techniques and knowledge-based reasoning.
- Data wrangling.
- Beyond the black-box: explainability.
- Automation of exploratory / unsupervised data science tasks, and visualization.
- Automating data science for human users.
Along with abstracts of the talks, detailed discussions of the main ideas and conclusions of each of these breakout sessions are included in this Dagstuhl report.
Discussion and outlook
Automating data science as a whole is an understudied area of research. AutoML is arguably the first subarea of automating data science in which some remarkable successes have been achieved. This seminar identified the main challenges for the field in translating these successes into advances in other subareas: most notably, automating exploratory data analysis, data wrangling and related tasks, integrating data-driven and knowledge-driven approaches, and ultimately the data science process as a whole, from data gathering to the creation of insights and value.
Further developing automated data science raises several challenges. A first challenge concerns the evaluation of automated data science methods: the possibility to automate is preconditioned on the availability of criteria to optimize. A second key challenge is how to ensure that automated data science systems remain human-centric, viewing humans as useful allies and ultimate beneficiaries. This can be achieved by designing effective user-interaction techniques, by ensuring explainability, and by ensuring that privacy is respected and individuals are treated fairly. These are basic requirements for ensuring justified trust in automated data science systems, and thus key drivers of success.
- Leman Akoglu (Carnegie Mellon University - Pittsburgh, US) [dblp]
- Mitra Baratchi (Leiden University, NL) [dblp]
- Michael R. Berthold (Universität Konstanz, DE) [dblp]
- Hendrik Blockeel (KU Leuven, BE) [dblp]
- Pavel Brazdil (University of Porto, PT) [dblp]
- Ray G. Butler (Butler Scientifics - Barcelona, ES)
- Remco Chang (Tufts University - Medford, US) [dblp]
- Felipe Leno da Silva (University of São Paulo, BR) [dblp]
- Tijl De Bie (Ghent University, BE) [dblp]
- Luc De Raedt (KU Leuven, BE) [dblp]
- Peter Flach (University of Bristol, GB) [dblp]
- Paolo Frasconi (University of Florence, IT) [dblp]
- Elisa Fromont (University of Rennes, FR) [dblp]
- José Hernández-Orallo (Technical University of Valencia, ES) [dblp]
- Holger H. Hoos (Leiden University, NL) [dblp]
- Frank Hutter (Universität Freiburg, DE) [dblp]
- Tobias Jacobs (NEC Laboratories Europe - Heidelberg, DE) [dblp]
- Lars Kotthoff (University of Wyoming - Laramie, US) [dblp]
- Nada Lavrac (Jozef Stefan Institute - Ljubljana, SI) [dblp]
- Kevin Leyton-Brown (University of British Columbia - Vancouver, CA) [dblp]
- Jefrey Lijffijt (Ghent University, BE) [dblp]
- Zhengying Liu (University of Paris Sud - Orsay, FR) [dblp]
- Siegfried Nijssen (UC Louvain, BE) [dblp]
- Andrea Passerini (University of Trento, IT) [dblp]
- María Pérez-Ortiz (University of Cambridge, GB) [dblp]
- Bernhard Pfahringer (University of Waikato, NZ) [dblp]
- Kai Puolamäki (University of Helsinki, FI) [dblp]
- Matteo Riondato (Two Sigma Investments LP - New York, US) [dblp]
- Ute Schmid (Universität Bamberg, DE) [dblp]
- Marc Schoenauer (INRIA Saclay, FR) [dblp]
- Michele Sebag (CNRS, FR) [dblp]
- Padhraic Smyth (University of California - Irvine, US) [dblp]
- Alexandre Termier (University of Rennes, FR) [dblp]
- Stefano Teso (KU Leuven, BE) [dblp]
- Heike Trautmann (Universität Münster, DE) [dblp]
- Isabel Valera (MPI für Intelligente Systeme - Tübingen, DE) [dblp]
- Matthijs van Leeuwen (Leiden University, NL) [dblp]
- Joaquin Vanschoren (TU Eindhoven, NL) [dblp]
- Jilles Vreeken (Universität des Saarlandes, DE) [dblp]
- Andreas Wierse (SICOS BW GmbH - Stuttgart, DE) [dblp]
- Christopher Williams (University of Edinburgh, GB) [dblp]
- artificial intelligence / robotics
- data bases / information retrieval
- programming languages / compiler
- data science
- artificial intelligence
- automated machine learning
- automated scientific discovery
- inductive programming