Dagstuhl Seminar 18411: Progressive Data Analysis and Visualization

( Oct 07 – Oct 12, 2018 )

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/18411

Organizers

Jean-Daniel Fekete (INRIA Saclay - Orsay, FR)
Danyel Fisher (Honeycomb - San Francisco, US)
Arnab Nandi (Ohio State University - Columbus, US)
Michael Sedlmair (Universität Stuttgart, DE)

Contact

Andreas Dolzmann (for scientific matters)
Annette Beyer (for administrative matters)

Publications

Progressive Data Analysis and Visualization (Dagstuhl Seminar 18411). Jean-Daniel Fekete, Danyel Fisher, Arnab Nandi, and Michael Sedlmair. In Dagstuhl Reports, Volume 8, Issue 10, pp. 1-40, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

Impacts

Schedule

Motivation

We live in an era where data is abundant and growing rapidly; databases storing big data sprawl past memory and computation limits, and across distributed systems. New hardware and software systems have been built to sustain this growth in terms of storage management and predictive computation. However, these infrastructures, while good for data at scale, do not support exploratory data analysis (EDA) well, as it is commonly practiced in, for instance, Visual Analytics. EDA allows human users to make sense of data with little or no known model of the data and is essential in many application domains, from network security and fraud detection to epidemiology and preventive medicine. Data exploration proceeds in an iterative loop: analysts issue computations on the data, the results are returned, usually as visualizations, and the analysts then interact with those visualizations to trigger further computations. Due to human cognitive constraints, exploration needs highly responsive systems: at 500 ms, users change their querying behavior; past five or ten seconds, users abandon tasks or lose attention. As datasets grow and computations become more complex, response time suffers. To address this problem, a new computation paradigm has emerged in the last decade under several names: online aggregation in the database community; progressive, incremental, or iterative visualization in other communities. It consists of splitting long computations into a series of approximate results that improve with time; partial or approximate results are returned to the user rapidly and can be interacted with in a fluent and iterative fashion. With the continuing growth in data, such progressive data analysis approaches will become one of the leading paradigms for data exploration systems, but they will also require major changes in the algorithms, data structures, and visualization tools used today.

This Dagstuhl Seminar sets out to discuss and address these challenges, by bringing together researchers from the different involved research communities: database, visualization, and machine learning. Thus far, these communities have often been divided by a gap hindering joint efforts in dealing with forthcoming challenges in progressive data analysis and visualization. The seminar will give a platform for these researchers and practitioners to exchange their ideas, experience, and visions, jointly develop strategies to deal with challenges, and create a deeper awareness of the implications of this paradigm shift. The implications are technical, but also human – both perceptual and cognitive – and the seminar will provide a holistic view of the problem by gathering specialists from all the communities.

Topics of the seminar will include: (1) Online aggregation and progressive data querying, (2) progressivity in analysis operations and algorithms, (3) visualization challenges and opportunities from progressivity, (4) progressivity in the user interface, (5) computational steering, and (6) revisiting the contract between the application and the data infrastructure.

Creative Commons BY 3.0 DE

Jean-Daniel Fekete, Danyel Fisher, Arnab Nandi, and Michael Sedlmair

Summary

We live in an era where data is abundant and growing rapidly; databases to handle big data are sprawling out past memory and computation limits, and across distributed systems. New hardware and software systems have been built to sustain this growth in terms of storage management and predictive computation. However, these infrastructures, while good for data at scale, do not support data exploration well.

The concept of exploratory data analysis (EDA) was introduced by John Tukey in the 1970s and is now commonplace in visual analytics. EDA allows users to make sense of data with little or no known model; it is essential in many application domains, from network security and fraud detection to epidemiology and preventive medicine. For most datasets, it is considered a best practice to explore data before beginning to construct formal hypotheses. Data exploration is done through an iterative loop where analysts interact with data through computations that return results, usually shown with visualizations. Analysts react to these results and visualizations by issuing new commands, which trigger new computations, until they have answered their questions.

However, due to human cognitive constraints, exploration needs highly responsive systems (see https://www.nngroup.com/articles/powers-of-10-time-scales-in-ux/): at 500 ms, users change their querying behavior; past five or ten seconds, users abandon tasks or lose attention. As datasets grow and computations become more complex, response time suffers. To address this problem, a new computation paradigm has emerged in the last decade under several names: online aggregation in the database community; progressive, incremental, or iterative visualization in other communities. It consists of splitting long computations into a series of approximate results that improve with time; the results are then returned at a controlled pace.
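As a rough illustration of this paradigm, the following Python sketch computes a mean progressively: instead of blocking until all values are read, it yields an estimate after every chunk, together with an approximate 95% confidence half-width that shrinks as more data is seen. This is a minimal sketch under stated assumptions, not the design of any system discussed at the seminar. The names `progressive_mean`, `Estimate`, and `chunk_size` are hypothetical, and the error bound assumes the values arrive in random order, as online aggregation typically requires.

```python
import math
import random
from typing import Iterable, Iterator, NamedTuple


class Estimate(NamedTuple):
    """One partial result (hypothetical structure, for illustration only)."""
    rows_seen: int
    mean: float
    half_width: float  # approximate 95% confidence half-width


def _make_estimate(n: int, mean: float, m2: float) -> Estimate:
    # Sample variance needs at least two values; report an infinite bound otherwise.
    if n < 2:
        return Estimate(n, mean, float("inf"))
    variance = m2 / (n - 1)
    return Estimate(n, mean, 1.96 * math.sqrt(variance / n))


def progressive_mean(values: Iterable[float], chunk_size: int = 1000) -> Iterator[Estimate]:
    """Yield a refined estimate of the mean after every `chunk_size` values.

    Assumes `values` arrive in random order; otherwise early estimates
    and their error bounds can be misleading.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        # Welford's online update of the running mean and sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % chunk_size == 0:
            yield _make_estimate(n, mean, m2)
    # Emit a final estimate if the last chunk was incomplete.
    if n > 0 and n % chunk_size != 0:
        yield _make_estimate(n, mean, m2)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [float(v) for v in range(1, 10_001)]
    rng.shuffle(data)  # random order, so early chunks are representative samples
    for est in progressive_mean(data, chunk_size=2000):
        print(f"{est.rows_seen:>6} rows: mean ~ {est.mean:.1f} +/- {est.half_width:.1f}")
```

A visualization front end could redraw a bar with its error margin each time a new estimate arrives, and the analyst could stop the computation as soon as the bound is tight enough for the decision at hand.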

This paradigm addresses scalability problems, as analysts can keep their attention on the results of long analyses as they arrive progressively. Initial research has shown promising results in progressive analysis for both big database queries and for machine learning.

The widespread use of progressive data analysis has been hampered by a chicken-and-egg problem: data visualization researchers do not have online database systems to work against, and database researchers do not have tools that will display the results of their work. As a result, progressive visualization systems are based on simulated systems or early prototypes. In many cases, neither side currently has the incentive, skills, or resources to build the components needed.

Recently, data analysis researchers and practitioners have started conversations with their colleagues involved in the data analysis pipeline to combine their efforts. This standard pipeline includes the following core communities: data management, statistics and machine learning, and interactive visualization. These initial conversations have led to fruitful evolutions of systems, combining two or three of these communities to complete a pipeline. Database and visualization researchers have collaborated to create systems allowing progressive, approximate query results. Machine learning and visualization researchers have collaborated to create systems combining progressive multidimensional projections with appropriate scalable visualizations, such as Progressive t-SNE. Most current machine learning algorithms are designed to examine the entirety of a dataset. A major contribution of work like Progressive t-SNE is a decomposable algorithm that can compute a meaningful partial result, which can then be passed on to a visual interface for fluent exploration. In these few existing collaborations, the researchers were able to work together and find concrete mechanisms by adapting existing systems rather than rebuilding them from the ground up. A systematic and widespread linkage between the involved communities, however, is still largely absent.

This Dagstuhl seminar brought together the researchers and practitioners who have started this software evolutionary process to exchange their ideas, experience, and visions. We are convinced that in the forthcoming years, progressive data analysis will become a leading paradigm for data exploration systems, but will require major changes in the algorithms and data structures in use today. The scientific communities involved need to understand the constraints and possibilities of their colleagues to converge faster, with a deeper awareness of the implications of this paradigm shift. The implications are technical, but also human, both perceptual and cognitive, and the seminar provided a holistic view of the problem by gathering specialists from all the communities.

This summary presents the outcomes of our seminar. The seminar focused on

defining and formalizing the concept of progressive data analysis,
addressing fundamental issues for progressive data analysis, such as software architecture, management of uncertainty, and human aspects,
identifying evaluation methods to assess the quality of progressive systems, and threats to research on the topic,
examining applications in data science, machine learning, and time-series analysis.

As a major result from the seminar, the following problems have been identified:

Implementing fully functional progressive systems will be difficult, since the progressive model is incompatible with most of the existing data analysis stack.
The human side of progressive data analysis requires further research to investigate how visualization systems and user interfaces should be adapted to help humans cope with progressiveness.
The potential of progressive data analysis is huge; in particular, it would reconcile exploratory data analysis with big data and modern machine learning methods.
data analysis and visualization mainstream in research and application domains.

Creative Commons BY 3.0 Unported license

Jean-Daniel Fekete, Danyel Fisher, and Michael Sedlmair

Participants

Marco Angelini (Sapienza University of Rome, IT) [dblp]
Michael Aupetit (QCRI - Doha, QA) [dblp]
Sriram Karthik Badam (University of Maryland - College Park, US) [dblp]
Carsten Binnig (TU Darmstadt, DE) [dblp]
Remco Chang (Tufts University - Medford, US) [dblp]
Jean-Daniel Fekete (INRIA Saclay - Orsay, FR) [dblp]
Danyel Fisher (Honeycomb - San Francisco, US) [dblp]
Hans Hagen (TU Kaiserslautern, DE) [dblp]
Barbara Hammer (Universität Bielefeld, DE) [dblp]
Christopher M. Jermaine (Rice University - Houston, US) [dblp]
Jaemin Jo (Seoul National University, KR) [dblp]
Daniel A. Keim (Universität Konstanz, DE) [dblp]
Jörn Kohlhammer (Fraunhofer IGD - Darmstadt, DE) [dblp]
Stefan Manegold (CWI - Amsterdam, NL) [dblp]
Luana Micallef (University of Copenhagen, DK) [dblp]
Dominik Moritz (University of Washington - Seattle, US) [dblp]
Thomas Mühlbacher (VRVis - Wien, AT) [dblp]
Hannes Mühleisen (CWI - Amsterdam, NL) [dblp]
Themis Palpanas (Paris Descartes University, FR) [dblp]
Adam Perer (Carnegie Mellon University - Pittsburgh, US) [dblp]
Nicola Pezzotti (TU Delft, NL) [dblp]
Gaëlle Richer (University of Bordeaux, FR) [dblp]
Florin Rusu (University of California - Merced, US) [dblp]
Giuseppe Santucci (Sapienza University of Rome, IT) [dblp]
Hans-Jörg Schulz (Aarhus University, DK) [dblp]
Michael Sedlmair (Universität Stuttgart, DE) [dblp]
Charles D. Stolper (Google, US) [dblp]
Hendrik Strobelt (IBM TJ Watson Research Center - Cambridge, US) [dblp]
Cagatay Turkay (City - University of London, GB) [dblp]
Frank van Ham (IBM Netherlands - Weert, NL) [dblp]
Anna Vilanova (TU Delft, NL) [dblp]
Yunhai Wang (Shandong University - Jinan, CN) [dblp]
Chris Weaver (University of Oklahoma - Norman, US) [dblp]
Emanuel Zgraggen (MIT - Cambridge, US) [dblp]

Classification

artificial intelligence / robotics
data bases / information retrieval
society / human-computer interaction

Keywords

visualization
visual analytics
database
big-data
human-computer interaction
