https://www.dagstuhl.de/16431

### October 23 – 28 , 2016, Dagstuhl Seminar 16431

# Computation over Compressed Structured Data

## Organizers

Philip Bille (Technical University of Denmark – Lyngby, DK)

Markus Lohrey (Universität Siegen, DE)

Sebastian Maneth (University of Edinburgh, GB)

Gonzalo Navarro (University of Chile – Santiago de Chile, CL)

## For support, please contact

## Documents

Dagstuhl Report, Volume 6, Issue 10

Aims & Scope

List of Participants

Dagstuhl's Impact: Documents available

Dagstuhl Seminar Schedule [pdf]

## Summary

The Dagstuhl Seminar "Computation over Compressed Structured Data" took place from October 23rd to 28th, 2016. The aim was to bring together researchers from various research directions in data compression, indexing for compressed data, and algorithms for compressed data. Compression, and the ability to index and compute directly over compressed data, is a topic that is gaining importance as digitally stored data volumes are increasing at unprecedented speeds. In particular, the seminar focused on techniques for compressed *structured data*, i.e., string, trees, and graphs, where compression schemes can exploit complex structural properties to achieve strong compression ratios.

The seminar was meant to inspire the exchange of theoretical results and practical requirements related to compression of structured data, indexing, and algorithms for compressed structured data. The following specific points were addressed.

**Encoding Data Structures.** The goal is to encode data structures with the minimal number of bits needed to support only the desired operations, which is also called the effective entropy. The best known example of such an encoding is the 2n-bit structure that answers range minimum queries on a permutation of [1,n], whose ordinary entropy is nlog(n) bits. Determining the effective entropy and designing encodings that reach the effective entropy leads to challenging research problems in enumerative combinatorics, information theory, and data structures.

**Computation-Friendly Compression.** Existing state-of-the-art compression schemes encode data by extensive and convoluted references between pieces of information. This leads to strong compression guarantees, but often makes it difficult to efficiently perform compressed computation. Recent developments have moved towards designing more computation-friendly compression schemes that achieve both strong compression and allow for efficient computation. Precise bounds on the worst-case compression of these schemes are mostly missing so far.

**Repetitive Text Collections.** Many of the largest sequence collections that are arising are formed by many documents that are very similar to each other. Typical examples arise from version control systems, collaborative editing systems (wiki), or sequencing of genomes from the same species. Statistical-compression does not exploit this redundancy. Recently, compressed indexes based on grammar-based compressors have been developed for repetitive text collections. They achieve a considerable compression, but on the downside operations are much slower.

**Recompression.** Recompression is a new technique that was successfully applied for the approximation of smallest string grammars and to solve several algorithmic problems on grammar-compressed strings. Recently, recompression has been extended from strings to trees. The long list of problems that were solved in a relatively short period using recompression indicates that there exist more applications of recompression.

**Graph Compression.** A lot of recent work deals with succinct data structures for graphs and with graph compression, in particular for web and network graphs. At the same time, simple queries such as in- and out-neighbors can be executed efficiently on these structures. There is a wide range of important open problems and future work. For instance, there is a strong need to support more complex graph queries, like for instance regular path queries, on compressed graphs.

The seminar fully satisfied our expectations. The 41 participants from 16 countries (Algiers, Canada, Chile, Denmark, Finland, France, Germany, Great Britain, Ireland, Italy, Israel, Japan, Korea, Poland, Spain, and US) had been invited by the organizers to give survey talks about their recent research related to the topic of the seminar. The talks covered topics related to compression (e.g., grammar-based compression of string, trees, and graphs, Lempel-Ziv compression), indexing of compressed data (e.g., set-intersection, longest common extensions, labeling schemes), algorithms on compressed data (e.g., streaming, regular expression matching, parameterized matching) and covered a wide range of applications including databases, WWW, and bioinformatics. Most talks were followed by lively discussions. Smaller groups formed naturally which continued these discussions later.

We thank Schloss Dagstuhl for the professional and inspiring atmosphere. Such an intense research seminar is possible because Dagstuhl so perfectly meets all researchers’ needs. For instance, elaborate research discussions in the evening were followed by local wine tasting or by heated sauna sessions.

**Summary text license**

Creative Commons BY 3.0 Unported license

Philip Bille, Markus Lohrey, Sebastian Maneth, and Gonzalo Navarro

## Dagstuhl Seminar Series

- 13232: "Indexes and Computation over Compressed Structured Data" (2013)
- 08261: "Structure-Based Compression of Complex Massive Data" (2008)

## Classification

- Data Bases / Information Retrieval
- Data Structures / Algorithms / Complexity

## Keywords

- Data Compression
- Indexing
- Algorithms on Compressed Structures
- Straight- Line Programs