Data Structures for the Cloud and External Memory Data
( 27. Jan – 01. Feb, 2019 )
- Gerth Stølting Brodal (Aarhus University, DK)
- Ulrich Carsten Meyer (Goethe-Universität - Frankfurt a. M., DE)
- Markus E. Nebel (Universität Bielefeld, DE)
- Robert Sedgewick (Princeton University, US)
- Shida Kunz (für wissenschaftliche Fragen)
- Susanne Bach-Bernhard (für administrative Fragen)
- Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications : article - Dietzfelbinger, Martin; Walzer, Stefan - Wadern : LZI, 2019. - pp. 1-18 - Leibniz International Proceedings in Informatics ; 144 : article).
- Entropy Trees and Range-Minimum Queries In Optimal Average-Case Space - Munro, J. Ian; Wild, Sebastian - Cornell University : arXiv.org, 2019. - 16 pp..
Computing is about the processing, exchanging, handling and storage of information. The way information is stored profoundly influences the efficiency of operations over the data. Research in data structures attempts to provide ways of storing and manipulating data and information that are appropriate for the computational model at hand: A setting where data are to be stored in internal random access memory requires data structuring methods different from settings where data are to be stored on sequentially accessible tapes, or on a multiple cache-external disk memory hierarchy with highly varying access latency. Still other settings have data distributed over a network of nodes, involve large scale parallel processing, or just utilize the very small scale parallelism provided by the simultaneous processing of all the bits of one computer word.
In spite of the maturity of the field many data structuring problems are still open, while new ones arise due to technological advances. Moreover, almost every new computing or storage technology poses new data structuring problems. This Dagstuhl Seminar will address besides the “classical” data structuring topics the following issues:
- Distributed Data Structures for Cloud Services. There is a set of established programming models (like, e.g., Google’s MapReduce and Apache’s hadoop) and distributed data structures (like, e.g., Google’s BigTable) that serve as building blocks of many large-scale cloud services. They are designed to scale across thousands of machines in a way that allows to add additional servers without the need of reconfiguration. Due to hardware failures and latency issues components have varying and hardly predictable responsiveness and it is one of the challenges to construct a large scale responsive cloud services out of such parts. From a more abstract point of view, the following two subjects are of importance when it comes to the design of new distributed data structures for large scale applications: Problem decomposition and fault tolerance.
- Data Structures for Modern Memory Hierarchies. The amount of data processed by typical cloud services usually exceeds the size of DRAM available within all servers even on large clusters. As a matter of fact, lots of MapReduce operations in practical applications use B-tree files as input. The subject addressed under this headline is best explained by looking at the B-tree itself, which was introduced by Bayer and McCreight in 1972. It is still in use to index data that is too big to fit in memory and thus is still at the core of database and file system design. However, B-trees somehow are designed for hardware from the 70s. On disks of that era, there was a relatively small gap between sequential and random I/Os. But bandwidth increases as the square root of the capacity, and so it has enjoyed the same type of exponential growth as capacity has, though of course with twice the doubling time. In order to address the resulting problematic aspect of the B-tree performance profile, specifically the insertion bottleneck, write optimized data structures like, e.g., buffered repository trees and Bε-trees have been developed. The best of these can match a B-tree on its best operations (searches and sequential insertions) while outperforming them by orders of magnitude on their worst operations (random insertions). These raise questions to be addressed during our seminar: What is the right way to achieve concurrency in these high-insertion dictionaries? What is the right way to implement ACID (Atomicity, Consistency, Isolation, Durability – a set of properties of database transactions)? What is the right analytical framework for these data structures?
About the seminar
Data structures provide ways of storing and manipulating data and information that are appropriate for the computational model at hand. Every such model relies on assumptions that we have to keep questioning. The aim of this seminar was to exchange ideas for new algorithms and data structures, and to discuss our models of computations in light of recent technological advances. This Dagstuhl seminar was the 13th in a series of loosely related Dagstuhl seminars on data structures.
The presentations covered both advances in classic fields, as well as new problems and insights for recent trends in computing. In particular, Johnson (Section 3.12) and Muth (Section 3.17) reported on models and research opportunities in the cloud and external memory motivated by practical demands.
A number of talks highlighted technical challenges in storing and processing large datasets: Bast (Section 3.2) demonstrated the knowledge database QLever and discussed algorithmic aspects. Distributed frameworks were presented by Bingmann (Section 3.4) reporting on the progress of Thrill while focusing on parallel external sorting and by Yan (Section 3.32) who introduced G-thinker. Farach-Colton (Section 3.7) analyzed the slow-down of various filesystems caused by updates over time. Owens (Section 3.19) discuses intricacies of GPUs and presented efficient and practical data structures for this hardware.
In order to mitigate the impact of huge datasets, streaming and online algorithms were considered. Martinez (Section 3.15) discussed Affirmative Sampling which takes uniform samples of a stream and adapts the sample size to the stream’s diversity. Sedgewick (Section 3.26) revisited the cardinality estimation problem and proposed the HyperBitBit algorithm. A matching of requests to resources in an online setting was covered by Raghvendra (Section 3.22). Similarly, Mehlhorn (Section 3.16) presented a solution to assigning indivisible resources approximately optimizing the social welfare.
Nebel (Section 3.18) and Wild (Section 3.31) proposed and analyzed tree-based data structures. Additionally, various aspects on more general graph processing were covered ranging from their enumeration (Lumbroso, Section 3.14) and random sampling (Penschuck, Section 3.20), over representations for k-connectivity (Pettie, Section 3.21) to the detection of substructures (Silvestri, Section 3.28 and Tarjan, Section 3.29).
Regarding the complexity of graph algorithms, Fagerberg (Section 3.23) presented new lower bounds on the reorganisation cost of B-trees, while Thankachan (Section 3.30) gave hardness results on the recognizability of Wheeler graphs. Kopelowitz (Section 3.13) considered the complexity of data structures for the set-disjointness problem. Emphasizing cloud-related security concerns, Jacob (Section 3.11) showed that a range of simple data structures have to incur an (log n) overhead if one wants to prevent information leakage via their access patterns.
Problems involving large text corpora were considered by Fischer (Section 3.8) presenting an external memory bi-directional compression scheme, by Golin (Section 3.9) discussing AIFV codes, and by Salinger (Section 3.24) analyzing persistent full-text indices for versioned documents.
Data structures using hashing were examined by Conway (Section 3.5), Dietzfelbinger (Section 3.6), Even and Sanders (Section 3.25). Bender (Section 3.3) discussed variants of Bloom filters which adapt based on past queries.
Afshani (Section 3.1) presented Fragile Complexity, a novel model of computation with an element-centric cost function, and gave bounds for various classical problems. Iacono (Section 3.10) proposed to model locality-of-reference more explicitly and compared his proposal to the external memory and cache-oblivious model. Sen (Section 3.27) proposed the novel paradigm HAIbrid augmenting classic data structures with aritifical intelligence.
The organizers would like to thank the Dagstuhl team for their continuous support; the welcoming atmosphere made the seminar both highly productive and enjoyable. They also thank all participants for their contributions to this seminar.
- Peyman Afshani (Aarhus University, DK) [dblp]
- Hannah Bast (Universität Freiburg, DE) [dblp]
- Michael A. Bender (Stony Brook University, US) [dblp]
- Ioana Oriana Bercea (Tel Aviv University, IL) [dblp]
- Timo Bingmann (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Gerth Stølting Brodal (Aarhus University, DK) [dblp]
- Alexander Conway (Rutgers University - Piscataway, US) [dblp]
- Martin Dietzfelbinger (TU Ilmenau, DE) [dblp]
- Guy Even (Tel Aviv University, IL) [dblp]
- Tomer Even (Tel Aviv University, IL)
- Rolf Fagerberg (University of Southern Denmark - Odense, DK) [dblp]
- Martin Farach-Colton (Rutgers University - Piscataway, US) [dblp]
- Johannes Fischer (TU Dortmund, DE) [dblp]
- Mordecai Golin (HKUST - Kowloon, HK) [dblp]
- Herman J. Haverkort (Universität Bonn, DE) [dblp]
- John Iacono (UL - Brussels, BE) [dblp]
- Riko Jacob (IT University of Copenhagen, DK) [dblp]
- Rob Johnson (VMware - Palo Alto, US) [dblp]
- Tsvi Kopelowitz (Bar-Ilan University - Ramat Gan, IL) [dblp]
- Moshe Lewenstein (Bar-Ilan University - Ramat Gan, IL) [dblp]
- Jérémie Lumbroso (Princeton University, US) [dblp]
- Conrado Martinez (UPC - Barcelona, ES) [dblp]
- Kurt Mehlhorn (MPI für Informatik - Saarbrücken, DE) [dblp]
- Ulrich Carsten Meyer (Goethe-Universität - Frankfurt a. M., DE) [dblp]
- Friedhelm Meyer auf der Heide (Universität Paderborn, DE) [dblp]
- Ian Munro (University of Waterloo, CA) [dblp]
- Robert Muth (Google - New York, US) [dblp]
- Markus E. Nebel (Universität Bielefeld, DE) [dblp]
- John Owens (University of California, Davis, US) [dblp]
- Manuel Penschuck (Goethe-Universität - Frankfurt a. M., DE) [dblp]
- Seth Pettie (University of Michigan - Ann Arbor, US) [dblp]
- Sharath Raghvendra (Virginia Polytechnic Institute - Blacksburg, US) [dblp]
- Rajeev Raman (University of Leicester, GB) [dblp]
- Alejandro Salinger (SAP SE - Walldorf, DE) [dblp]
- Peter Sanders (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Robert Sedgewick (Princeton University, US) [dblp]
- Siddhartha Sen (Microsoft - New York, US) [dblp]
- Francesco Silvestri (University of Padova, IT) [dblp]
- Robert Endre Tarjan (Princeton University, US) [dblp]
- Sharma V. Thankachan (University of Central Florida - Orlando, US) [dblp]
- Sebastian Wild (University of Waterloo, CA) [dblp]
- Da Yan (The University of Alabama - Birmingham, US) [dblp]
- Norbert Zeh (Dalhousie University - Halifax, CA) [dblp]
- Dagstuhl-Seminar 9145: Data Structures (1991-11-04 - 1991-11-08) (Details)
- Dagstuhl-Seminar 9409: Data Structures (1994-02-28 - 1994-03-04) (Details)
- Dagstuhl-Seminar 9609: Data Structures (1996-02-26 - 1996-03-01) (Details)
- Dagstuhl-Seminar 98091: Data Structures (1998-03-02 - 1998-03-06) (Details)
- Dagstuhl-Seminar 00091: Data Structures (2000-02-27 - 2000-03-03) (Details)
- Dagstuhl-Seminar 02091: Data Structures (2002-02-24 - 2002-03-01) (Details)
- Dagstuhl-Seminar 04091: Data Structures (2004-02-22 - 2004-02-27) (Details)
- Dagstuhl-Seminar 06091: Data Structures (2006-02-26 - 2006-03-03) (Details)
- Dagstuhl-Seminar 08081: Data Structures (2008-02-17 - 2008-02-22) (Details)
- Dagstuhl-Seminar 10091: Data Structures (2010-02-28 - 2010-03-05) (Details)
- Dagstuhl-Seminar 14091: Data Structures and Advanced Models of Computation on Big Data (2014-02-23 - 2014-02-28) (Details)
- Dagstuhl-Seminar 16101: Data Structures and Advanced Models of Computation on Big Data (2016-03-06 - 2016-03-11) (Details)
- Dagstuhl-Seminar 21071: Scalable Data Structures (2021-02-14 - 2021-02-19) (Details)
- Dagstuhl-Seminar 23211: Scalable Data Structures (2023-05-21 - 2023-05-26) (Details)
- data structures / algorithms / complexity
- Data structures
- cloud computing
- large data sets
- external memory methods
- big data