23.10.16 - 28.10.16, Seminar 16431

Computation over Compressed Structured Data

Diese Seminarbeschreibung wurde vor dem Seminar auf unseren Webseiten veröffentlicht und bei der Einladung zum Seminar verwendet.

Motivation

Whenever large amounts of data have to be exchanged over small band-width connections or have to be stored in devices with limited memory, data compression is a promising technique to reduce transferred or stored data volumes. Since these data typically have to be processed further and decompression is in many applications too time or space-consuming, it is desired that the processing steps can be executed directly on the compressed data, i.e., without fully decompressing the data. A further performance gain from working directly on compressed data is due to reduced secondary memory access, since compressed data may fit to a larger percentage into main memory. This performance gain may outweigh the computational overhead resulting from working on compressed data. Processing structured data (i.e., strings, trees, or graphs) is a typical application domain in this context. The main research questions in this context are:

  • Compression algorithms: How can structured data be compressed?
  • Indexes for compressed structures: How can we give fast access to basic operations over the compressed structures?
  • Algorithms over compressed structures: Which algorithms can be executed over compressed structures without prior decompression?

Solutions to these questions require techniques from various areas of computer science: data compression, algorithms, data structures, formal language theory, and information theory. The aim of this Dagstuhl Seminar is to bring together researchers from these areas with a particular interest in compression of structured data. To make the seminar more focused, we selected the following five currently active research directions as key topics for the seminar.

Encoding Data Structures

The goal is to encode data structures with the minimal number of bits needed to support only the desired operations, which is also called the effective entropy. The best known example of such an encoding is the 2n-bit structure that answers range minimum queries on a permutation of [1,n], whose ordinary entropy is n log(n) bits. Determining the effective entropy and designing encodings that reach the effective entropy leads to challenging research problems in enumerative combinatorics, information theory, and data structures.

Computation-Friendly Compression

Existing state-of-the-art compression schemes encode data by extensive and convoluted references between pieces of information. This leads to strong compression guarantees, but often makes it difficult to efficiently perform compressed computation. Recent developments have moved towards designing more computation-friendly compression schemes that achieve both strong compression and allow for efficient computation. Precise bounds on the worst-case compression of these schemes are mostly missing so far.

Repetitive Text Collections

Many of the largest sequence collections that are arising are formed by many documents that are very similar to each other. Typical examples arise from version control systems, collaborative editing systems (wiki), or sequencing of genomes from the same species. Statistical-compression does not exploit this redundancy. Recently, compressed indexes based on grammar-based compressors have been developed for repetitive text collections. They achieve a considerable compression, but on the downside operations are much slower.

Recompression

Recompression is a new technique that was successfully applied for the approximation of smallest string grammars and to solve several algorithmic problems on grammar-compressed strings. Recently, recompression has been extended from strings to trees. The long list of problems that were solved in a relatively short period using recompression indicates that there exist more applications of recompression.

Graph Compression

A lot of recent work deals with succinct data structures for graphs and with graph compression, in particular for web and network graphs. At the same time, simple queries such as in- and out-neighbors can be executed efficiently on these structures. There is a wide range of important open problems and future work. For instance, there is a strong need to support more complex graph queries, like for instance regular path queries, on compressed graphs.