Dagstuhl Seminar 14091: Data Structures and Advanced Models of Computation on Big Data

Dagstuhl Seminar 14091

Data Structures and Advanced Models of Computation on Big Data

( Feb 23 – Feb 28, 2014 )

(Click in the middle of the image to enlarge)

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/14091

Organizers

Alejandro Lopez-Ortiz (University of Waterloo, CA)
Ulrich Carsten Meyer (Goethe-Universität - Frankfurt a. M., DE)
Robert Sedgewick (Princeton University, US)

Contact

Annette Beyer (for administrative matters)

Publications

Data Structures and Advanced Models of Computation on Big Data (Dagstuhl Seminar 14091). Alejandro López-Ortiz, Ulrich Carsten Meyer, and Robert Sedgewick. In Dagstuhl Reports, Volume 4, Issue 2, pp. 129-149, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2014)

Schedule

Schedule

Motivation

Show Motivation

Computing is about the processing, exchanging, handling and storage of information. The way information is stored profoundly influences the efficiency of operations over the data. Research in data structures attempts to provide ways of storing and manipulating data and information that are appropriate for the computational model at hand: internal random access memory, sequentially accessible data streams, multiple cache-external disk memory hierarchies, parallel and distributed computing, etc. In spite of the maturity of the field many data structuring problems are still open, while new ones arise due to technological advances. Almost every new computing or storage technology poses new data structuring problems. This Dagstuhl Seminar will address besides the "classical" data structuring topics the following issues:

Big data

One consequence of the exponential growth of computing usage has been a large increase in the scale of the problems addressed by computer scientists. For example, twenty-five years ago a few tens of megabytes constituted a substantial text corpus. Nowadays search engine indexes handle document collections whose size is in the order of a hundred terabytes of data or so. We have seen similar growth in the size of large data collections such as those derived from internet traffic management, genomic data, social networks, large scale scientific experiments such as the Large Hadron Collider (LHC) and sensor networks in general. In this context the introduction of novel data structures that can operate over massively distributed computing platforms with minimal interprocess communication is a must.

Memory Hierarchy Methods for CMPs and SSDs

Processor speeds and external memory sizes have increased exponentially over the last decades. However the transfer speed between external and internal memory has not kept pace. Data structuring research has been addressing this problem and has worked on solutions that attempt to minimize the data transfer between external and internal memory and the various caches in between. However, many data structuring problems still have unsatisfactory solutions in this model. New challenges arise from multicore computers in which certain caches are shared, giving rise to issues such as cache contention, coherence and optimal order of execution. Similarly, flash memory storage, suffer from certain peculiarities like read/write imbalances and wear, which are not captured in the standard I/O model.

Energy-Aware Data Structure Design

Reducing the energy consumption of computations has become an important issue both for large scale computing and for small mobile devices. Clearly, the power consumption of a program depends on its execution time and memory usage, but the details are quite sophisticated. Unstructured memory access patterns, for example, have be shown to severely influence the energy consumption of smart phones -- more than could be expected by their respective time delay. Thus, suitable theoretical models and experimental studies will be needed to develop meaningful data structures and algorithms for energy-efficient computing.

Web-scale Data Structures

They are playing an increasingly important role in modern computing. The focus on implementing them has properly been on systems and commercial issues, but as the scale continues to increase, the scientific basis for studying performance and developing new algorithms that has proven so successful for classical data structures is certain to play a critical role in the evolution of the web.

Summary

Show Summary

A persistent theme in the presentations in this Dagstuhl seminar is the need to refine our models of computation to adapt to modern architectures, if we are to develop a scientific basis for inventing efficient algorithms to solve real-world problems. For example, Mehlhorn's presentation on the cost of memory translation, Iacono's reexamination of the cache-oblivious model, and Sanders' description of communication efficiency all left many participants questioning basic assumptions they have carried for many years and are certain to stimulate new research in the future.

Better understanding of the properties of modern processors certainly can be fruitful. For example, several presentations, such as the papers by Aumüller, López-Ortiz, and Wild on Quicksort and the paper by Bingmann on string sorting, described faster versions of classic algorithms that are based on careful examination of modern processor design.

Overall, many presentations described experience with data from actual applications. For example, the presentations by Driemel and Varenhold on trajectory data described a relatively new big-data application that underscores the importance and breadth of application of classic techniques in computational geometry and data structure design.

Other presentations which discussed large data sets on modern architectures were the lower bound on parallel external list ranking by Jacob, which also applies on the MapReduce and BSP models commonly used in large distributed platforms; and by Hagerup who considered the standard problem of performing a depth first search (DFS) on a graph, a task that is trivial in small graphs but extremely complex on ``big data'' sets such as the Facebook graph. He proposed a space efficient algorithm that reduces the space required by DFS by a log n factor or an order of magnitude on practical data sets.

Schweikardt gave a model for MapReduce computations, a very common computing platform for very large server farms. Salinger considered the opposite end of the spectrum namely how to simplify the programming task as to take optimal advantage of a single server which also has its own degree of parallelism from multiple cores, GPUs and other parallel facilities.

In terms of geometric data structures for large data sets Afshani presented sublinear algorithms for the I/O model which generalize earlier work on sublinear algorithms. Sublinear algorithms are of key importance on very large data sets, which are thus presumably unable to fit in main memory. Yet most of the previously proposed algorithms assumed that such large data sets were hosted in main memory. Toma gave an external memory representation of the popular quad tree data structure commonly used in computer graphics as well as other spatial data applications.

Creative Commons BY 3.0 Unported license

Alejandro Lopez-Ortiz, Ulrich Carsten Meyer, and Robert Sedgewick

Participants

Show Participants

Peyman Afshani (Aarhus University, DK) [dblp]
Deepak Ajwani (Bell Labs - Dublin, IE) [dblp]
Lars Arge (Aarhus University, DK) [dblp]
Martin Aumüller (TU Ilmenau, DE) [dblp]
Hannah Bast (Universität Freiburg, DE) [dblp]
Timo Bingmann (KIT - Karlsruher Institut für Technologie, DE) [dblp]
Gerth Stølting Brodal (Aarhus University, DK) [dblp]
Andrej Brodnik (University of Primorska, SI) [dblp]
Martin Dietzfelbinger (TU Ilmenau, DE) [dblp]
Anne Driemel (TU Dortmund, DE) [dblp]
Rolf Fagerberg (University of Southern Denmark - Odense, DK) [dblp]
Johannes Fischer (TU Dortmund, DE) [dblp]
Mordecai Golin (HKUST - Kowloon, HK) [dblp]
Torben Hagerup (Universität Augsburg, DE) [dblp]
Tanja Hartmann (KIT - Karlsruher Institut für Technologie, DE) [dblp]
Herman J. Haverkort (TU Eindhoven, NL) [dblp]
Meng He (Dalhousie University, CA) [dblp]
John Iacono (Polytechnic Institute of NYU - Brooklyn, US) [dblp]
Riko Jacob (ETH Zürich, CH) [dblp]
Tsvi Kopelowitz (University of Michigan - Ann Arbor, US) [dblp]
Moshe Lewenstein (Bar-Ilan University - Ramat Gan, IL) [dblp]
Alejandro Lopez-Ortiz (University of Waterloo, CA) [dblp]
Kurt Mehlhorn (MPI für Informatik - Saarbrücken, DE) [dblp]
Ulrich Carsten Meyer (Goethe-Universität - Frankfurt a. M., DE) [dblp]
Gabriel Moruz (Goethe-Universität - Frankfurt a. M., DE) [dblp]
Ian Munro (University of Waterloo, CA) [dblp]
Markus E. Nebel (TU Kaiserslautern, DE) [dblp]
Patrick K. Nicholson (MPI für Informatik - Saarbrücken, DE) [dblp]
Jesper A. Sindahl Nielsen (Aarhus University, DK) [dblp]
John Owens (University of California - Davis, US) [dblp]
Rasmus Pagh (IT University of Copenhagen, DK) [dblp]
Rajeev Raman (University of Leicester, GB) [dblp]
Alejandro Salinger (Universität des Saarlandes, DE) [dblp]
Peter Sanders (KIT - Karlsruher Institut für Technologie, DE) [dblp]
Anita Schöbel (Universität Göttingen, DE) [dblp]
Nicole Schweikardt (Goethe-Universität - Frankfurt a. M., DE) [dblp]
Robert Sedgewick (Princeton University, US) [dblp]
He Sun (MPI für Informatik - Saarbrücken, DE) [dblp]
Sharma V. Thankachan (Louisiana State University - Baton Rouge, US) [dblp]
Laura I. Toma (Bowdoin College - Brunswick, US) [dblp]
Jan Vahrenhold (Universität Münster, DE) [dblp]
Sebastian Wild (TU Kaiserslautern, DE) [dblp]
Katharina A. Zweig (TU Kaiserslautern, DE) [dblp]

Related Seminars

Dagstuhl Seminar 9145: Data Structures (1991-11-04 - 1991-11-08) (Details)
Dagstuhl Seminar 9409: Data Structures (1994-02-28 - 1994-03-04) (Details)
Dagstuhl Seminar 9609: Data Structures (1996-02-26 - 1996-03-01) (Details)
Dagstuhl Seminar 98091: Data Structures (1998-03-02 - 1998-03-06) (Details)
Dagstuhl Seminar 00091: Data Structures (2000-02-27 - 2000-03-03) (Details)
Dagstuhl Seminar 02091: Data Structures (2002-02-24 - 2002-03-01) (Details)
Dagstuhl Seminar 04091: Data Structures (2004-02-22 - 2004-02-27) (Details)
Dagstuhl Seminar 06091: Data Structures (2006-02-26 - 2006-03-03) (Details)
Dagstuhl Seminar 08081: Data Structures (2008-02-17 - 2008-02-22) (Details)
Dagstuhl Seminar 10091: Data Structures (2010-02-28 - 2010-03-05) (Details)
Dagstuhl Seminar 16101: Data Structures and Advanced Models of Computation on Big Data (2016-03-06 - 2016-03-11) (Details)
Dagstuhl Seminar 19051: Data Structures for the Cloud and External Memory Data (2019-01-27 - 2019-02-01) (Details)
Dagstuhl Seminar 21071: Scalable Data Structures (2021-02-14 - 2021-02-19) (Details)
Dagstuhl Seminar 23211: Scalable Data Structures (2023-05-21 - 2023-05-26) (Details)
Dagstuhl Seminar 25191: Adaptive and Scalable Data Structures (2025-05-04 - 2025-05-09) (Details)
Dagstuhl Seminar 28221: Simplicity and Efficiency of Data Structures (2028-05-28 - 2028-06-02) (Details)

Classification

data bases / information retrieval
data structures / algorithms / complexity

Keywords

Data structures
algorithms
large data sets
external memory methods
big data
streaming
web-scale

Seminar 14091

Search the Dagstuhl Website

Schloss Dagstuhl Services

Seminars

Within this website:

External resources:

Publishing

Within this website:

External resources:

dblp

Within this website:

External resources:

Dagstuhl Seminar 14091

Data Structures and Advanced Models of Computation on Big Data

( Feb 23 – Feb 28, 2014 )

Permalink

Organizers

Contact

Publications

Schedule

Motivation

Summary

Participants

Related Seminars

Classification

Keywords