##### Dagstuhl Seminar 16431

### Computation over Compressed Structured Data

##### ( Oct 23 – Oct 28, 2016 )

##### Permalink

##### Organizers

- Philip Bille (Technical University of Denmark - Lyngby, DK)
- Markus Lohrey (Universität Siegen, DE)
- Sebastian Maneth (University of Edinburgh, GB)
- Gonzalo Navarro (University of Chile - Santiago de Chile, CL)

##### Contact

- Andreas Dolzmann (for scientific matters)
- Annette Beyer (for administrative matters)

##### Impacts

- LZ78 Compression in Low Main Memory Space - Arroyuelo, Diego; Canovas, Rodrigo; Navarro, Gonzalo; Raman, Rajeev - Santiago : University, 2017. - 12 pp..
- LZ78 Compression in Low Main Memory Space : article in LNCS 10508 "String Processing and Information Retrieval 2017" : pp. 38-50 - Arroyuelo, Diego; Canovas, Rodrigo; Navarro, Gonzalo; Raman, Rajeev - Berlin : Springer, 2017 - (Lecture notes in computer science ; 10508 : article).
- On Two LZ78-style Grammars : Compression Bounds and Compressed-Space Computation - Badkobeh, Golnaz; Gagie, Travis; Inenaga, Shunsuke; Kosolobov, Dmitry; Puglisi, Simon J. - Cornell University : arXiv.org, 2017. - 16 pp..
- On Two LZ78-style Grammars : Compression Bounds and Compressed-Space Computation : article in LNCS 10508 "String Processing and Information Retrieval 2017" : pp. 51-67 - (Lecture notes in computer science ; 10508 : article) - Berlin : Springer, 2017 - (Lecture notes in computer science ; 10508 : article).

##### Schedule

Whenever large amounts of data have to be exchanged over small band-width connections or have to be stored in devices with limited memory, data compression is a promising technique to reduce transferred or stored data volumes. Since these data typically have to be processed further and decompression is in many applications too time or space-consuming, it is desired that the processing steps can be executed directly on the compressed data, i.e., without fully decompressing the data. A further performance gain from working directly on compressed data is due to reduced secondary memory access, since compressed data may fit to a larger percentage into main memory. This performance gain may outweigh the computational overhead resulting from working on compressed data. Processing structured data (i.e., strings, trees, or graphs) is a typical application domain in this context. The main research questions in this context are:

- Compression algorithms: How can structured data be compressed?
- Indexes for compressed structures: How can we give fast access to basic operations over the compressed structures?
- Algorithms over compressed structures: Which algorithms can be executed over compressed structures without prior decompression?

Solutions to these questions require techniques from various areas of computer science: data compression, algorithms, data structures, formal language theory, and information theory. The aim of this Dagstuhl Seminar is to bring together researchers from these areas with a particular interest in compression of structured data. To make the seminar more focused, we selected the following five currently active research directions as key topics for the seminar.

### Encoding Data Structures

The goal is to encode data structures with the minimal number of bits needed to support only the desired operations, which is also called the effective entropy. The best known example of such an encoding is the 2n-bit structure that answers range minimum queries on a permutation of [1,n], whose ordinary entropy is n log(n) bits. Determining the effective entropy and designing encodings that reach the effective entropy leads to challenging research problems in enumerative combinatorics, information theory, and data structures.

### Computation-Friendly Compression

Existing state-of-the-art compression schemes encode data by extensive and convoluted references between pieces of information. This leads to strong compression guarantees, but often makes it difficult to efficiently perform compressed computation. Recent developments have moved towards designing more computation-friendly compression schemes that achieve both strong compression and allow for efficient computation. Precise bounds on the worst-case compression of these schemes are mostly missing so far.

### Repetitive Text Collections

Many of the largest sequence collections that are arising are formed by many documents that are very similar to each other. Typical examples arise from version control systems, collaborative editing systems (wiki), or sequencing of genomes from the same species. Statistical-compression does not exploit this redundancy. Recently, compressed indexes based on grammar-based compressors have been developed for repetitive text collections. They achieve a considerable compression, but on the downside operations are much slower.

### Recompression

Recompression is a new technique that was successfully applied for the approximation of smallest string grammars and to solve several algorithmic problems on grammar-compressed strings. Recently, recompression has been extended from strings to trees. The long list of problems that were solved in a relatively short period using recompression indicates that there exist more applications of recompression.

### Graph Compression

A lot of recent work deals with succinct data structures for graphs and with graph compression, in particular for web and network graphs. At the same time, simple queries such as in- and out-neighbors can be executed efficiently on these structures. There is a wide range of important open problems and future work. For instance, there is a strong need to support more complex graph queries, like for instance regular path queries, on compressed graphs.

The Dagstuhl Seminar "Computation over Compressed Structured Data" took place from October 23rd to 28th, 2016. The aim was to bring together researchers from various research directions in data compression, indexing for compressed data, and algorithms for compressed data. Compression, and the ability to index and compute directly over compressed data, is a topic that is gaining importance as digitally stored data volumes are increasing at unprecedented speeds. In particular, the seminar focused on techniques for compressed *structured data*, i.e., string, trees, and graphs, where compression schemes can exploit complex structural properties to achieve strong compression ratios.

The seminar was meant to inspire the exchange of theoretical results and practical requirements related to compression of structured data, indexing, and algorithms for compressed structured data. The following specific points were addressed.

**Encoding Data Structures.** The goal is to encode data structures with the minimal number of bits needed to support only the desired operations, which is also called the effective entropy. The best known example of such an encoding is the 2n-bit structure that answers range minimum queries on a permutation of [1,n], whose ordinary entropy is nlog(n) bits. Determining the effective entropy and designing encodings that reach the effective entropy leads to challenging research problems in enumerative combinatorics, information theory, and data structures.

**Computation-Friendly Compression.** Existing state-of-the-art compression schemes encode data by extensive and convoluted references between pieces of information. This leads to strong compression guarantees, but often makes it difficult to efficiently perform compressed computation. Recent developments have moved towards designing more computation-friendly compression schemes that achieve both strong compression and allow for efficient computation. Precise bounds on the worst-case compression of these schemes are mostly missing so far.

**Repetitive Text Collections.** Many of the largest sequence collections that are arising are formed by many documents that are very similar to each other. Typical examples arise from version control systems, collaborative editing systems (wiki), or sequencing of genomes from the same species. Statistical-compression does not exploit this redundancy. Recently, compressed indexes based on grammar-based compressors have been developed for repetitive text collections. They achieve a considerable compression, but on the downside operations are much slower.

**Recompression.** Recompression is a new technique that was successfully applied for the approximation of smallest string grammars and to solve several algorithmic problems on grammar-compressed strings. Recently, recompression has been extended from strings to trees. The long list of problems that were solved in a relatively short period using recompression indicates that there exist more applications of recompression.

**Graph Compression.** A lot of recent work deals with succinct data structures for graphs and with graph compression, in particular for web and network graphs. At the same time, simple queries such as in- and out-neighbors can be executed efficiently on these structures. There is a wide range of important open problems and future work. For instance, there is a strong need to support more complex graph queries, like for instance regular path queries, on compressed graphs.

The seminar fully satisfied our expectations. The 41 participants from 16 countries (Algiers, Canada, Chile, Denmark, Finland, France, Germany, Great Britain, Ireland, Italy, Israel, Japan, Korea, Poland, Spain, and US) had been invited by the organizers to give survey talks about their recent research related to the topic of the seminar. The talks covered topics related to compression (e.g., grammar-based compression of string, trees, and graphs, Lempel-Ziv compression), indexing of compressed data (e.g., set-intersection, longest common extensions, labeling schemes), algorithms on compressed data (e.g., streaming, regular expression matching, parameterized matching) and covered a wide range of applications including databases, WWW, and bioinformatics. Most talks were followed by lively discussions. Smaller groups formed naturally which continued these discussions later.

We thank Schloss Dagstuhl for the professional and inspiring atmosphere. Such an intense research seminar is possible because Dagstuhl so perfectly meets all researchers’ needs. For instance, elaborate research discussions in the evening were followed by local wine tasting or by heated sauna sessions.

- Diego Arroyuelo (TU Federico Santa María - Valparaíso, CL) [dblp]
- Hideo Bannai (Kyushu University - Fukuoka, JP) [dblp]
- Djamal Belazzougui (CERIST - Algiers, DZ) [dblp]
- Philip Bille (Technical University of Denmark - Lyngby, DK) [dblp]
- Stefan Böttcher (Universität Paderborn, DE) [dblp]
- Rodrigo Cánovas (University of Montpellier 2, FR) [dblp]
- Patrick Hagge Cording (Technical University of Denmark - Lyngby, DK) [dblp]
- Héctor Ferrada (University of Helsinki, FI) [dblp]
- Johannes Fischer (TU Dortmund, DE) [dblp]
- Travis Gagie (Universidad Diego Portales, CL) [dblp]
- Adrià Gascón (University of Edinburgh, GB) [dblp]
- Pawel Gawrychowski (University of Wroclaw, PL) [dblp]
- Simon Gog (KIT - Karlsruher Institut für Technologie, DE) [dblp]
- Inge Li Gørtz (Technical University of Denmark - Lyngby, DK) [dblp]
- Cecilia Hernández Rivas (University of Concepción, CL) [dblp]
- Danny Hucke (Universität Siegen, DE) [dblp]
- Tomohiro I (Kyushu Institute of Technology, JP) [dblp]
- Shunsuke Inenaga (Kyushu University - Fukuoka, JP) [dblp]
- Artur Jez (University of Wroclaw, PL) [dblp]
- Juha Kärkkäinen (University of Helsinki, FI) [dblp]
- Susana Ladra González (University of A Coruña, ES) [dblp]
- Markus Lohrey (Universität Siegen, DE) [dblp]
- Sebastian Maneth (University of Edinburgh, GB) [dblp]
- Ian Munro (University of Waterloo, CA) [dblp]
- Gonzalo Navarro (University of Chile - Santiago de Chile, CL) [dblp]
- Yakov Nekrich (University of Waterloo, CA) [dblp]
- Patrick K. Nicholson (Bell Labs - Dublin, IE) [dblp]
- Alberto Ordóñez (University of A Coruña, ES) [dblp]
- Fabian Peternek (University of Edinburgh, GB) [dblp]
- Nicola Prezza (University of Udine, IT) [dblp]
- Rajeev Raman (University of Leicester, GB) [dblp]
- Wojciech Rytter (University of Warsaw, PL) [dblp]
- Hiroshi Sakamoto (Kyushu Institute of Technology - Fukuoka, JP) [dblp]
- Srinivasa Rao Satti (Seoul National University, KR) [dblp]
- Markus Schmid (Universität Trier, DE) [dblp]
- Manfred Schmidt-Schauss (Goethe-Universität - Frankfurt a. M., DE) [dblp]
- Rahul Shah (Louisiana State University - Baton Rouge, US) [dblp]
- Ayumi Shinohara (Tohoku University - Sendai, JP) [dblp]
- Tatiana Starikovskaya (University Paris-Diderot, FR) [dblp]
- Alexander Tiskin (University of Warwick - Coventry, GB) [dblp]
- Rossano Venturini (University of Pisa, IT) [dblp]

##### Related Seminars

##### Classification

- data bases / information retrieval
- data structures / algorithms / complexity

##### Keywords

- Data Compression
- Indexing
- Algorithms on Compressed Structures
- Straight- Line Programs