TOP
Search the Dagstuhl Website
Looking for information on the websites of the individual seminars? - Then please:
Not found what you are looking for? - Some of our services have separate websites, each with its own search option. Please check the following list:
Schloss Dagstuhl - LZI - Logo
Schloss Dagstuhl Services
Seminars
Within this website:
External resources:
  • DOOR (for registering your stay at Dagstuhl)
  • DOSA (for proposing future Dagstuhl Seminars or Dagstuhl Perspectives Workshops)
Publishing
Within this website:
External resources:
dblp
Within this website:
External resources:
  • the dblp Computer Science Bibliography


Dagstuhl Seminar 16321

Coding Theory in the Time of Big Data

( Aug 07 – Aug 12, 2016 )

(Click in the middle of the image to enlarge)

Permalink
Please use the following short url to reference this page: https://www.dagstuhl.de/16321

Organizers

Contact



Motivation

The main focus of this seminar is to explore the impact of contemporary challenges for efficient, reliable, and secure storage and delivery of files in the time of 'big data'. In particular, the seminar will examine the ways in which coding theory fundamentals must be extended to address the emerging issues in the evolving practice of storage and transmission of large files across networks. These novel coding applications are having a significant impact on coding theory fundamentals. The seminar will touch on topics such as algebraic coding theory, distributed storage, index coding, caching problems, streaming algorithms, cryptography, information theory, randomized algorithms and complexity theory.

Codes for Distributed Storage

In multi-disk systems, to provide reliability, if a disk fails, data must be recoverable from the remaining stored data. While it is empirically clear that disk failure is the norm rather than an exception (making systemic redundancy a storage requirement), the sheer scale of data involved means that redundancy must be added as efficiently as possible. In other words, coding is now a requirement for trustworthy distributed storage systems.

Regeneration codes, local reconstruction codes and update-efficient codes are among the current classes of codes with properties effective according to different distributed storage constraints such as repair bandwidth, disk I/O and repair locality. Algebraic methods and ideas from classical coding theory and network coding may be applied to give constructions of families of codes with high performance with respect to such different parameters. In addition to providing reliability against disk failures, coding reduces the time required to download content from a distributed storage system. The download time is reduced because when a content file is encoded to add redundancy and distributed across multiple disks, reading only a subset of the disks is sufficient to reconstruct the content. For the same total storage used, coding exploits the diversity in storage better than simple replication, and hence gives faster download. Storing coded data in distributed systems and even untrusted networks, although practical, poses novel challenges to maintaining data integrity, unless the codes are explicitly designed to handle these challenges. For example, an adversary may corrupt a large amount of stored data simply by altering the data stored on a small number of storage. Eavesdropping to the data becomes easier because of multiple downloads during nodal repair.

Codes for Video and other Big Data Delivery

Efficient use of storage and transmission resources is central to applications such as video broadcasting. Traditional unicast-based solutions underperform when the same content is sought by multiple users, while coding and information theoretic approaches can offer advantages.

Codes for streaming, heterogeneous multicast/broadcast codes, index coding and distributed caching are especially relevant to big data delivery. Convolutional codes have been proposed for low-delay streaming codes robust to burst and single errors. Rateless codes could be adapted for heterogeneous scenarios, where users not only have different channel conditions, but may demand different amounts of information, and have different computing and display capabilities. Index coding techniques can be applied to distributed caching problems, where a reduction in network traffic may be achieved by strategic storage of popular files at strategic nodes or end devices.


Summary

The Dagstuhl Seminar 16321 Coding Theory in the Time of Big Data, held in August 7-12, 2016, was the third of a series of Dagstuhl seminars relating modern aspects of coding theory and its applications in computer science. The overarching technical theme was on how fundamentals of coding theory could be applied to data storage and transmission in the context of big data and conversely, on emerging topics in coding theory arising from such applications. In Dagstuhl Seminar 11461 the main topics discussed were list decoding, codes on graphs, network coding and the relations between them. The themes of distributed storage, network coding and polar codes were central to Dagstuhl Seminar 13351.

The conference was organised into six main working groups, as listed below:

  1. Distributed Storage & Index Coding,
  2. Private Information Retrieval for Storage Codes,
  3. DNA-Based Storage,
  4. Age & Delay of Information.
  5. Code-Based Cryptography,
  6. Rank-Metric Codes.

The amount of data that is being stored is scaling at a rapid pace making efficient data storage an important problem that inspires several lines of scientific research. During the seminar, several discussions were conducted on the theme of using classical and new techniques from coding theory to store/compute data efficiently in distributed storage systems. A number of open problems were identified, such as the design of codes with optimal repair bandwidth, fundamental trade-offs between storage & communication cost, applications to content distribution networks, connections between fundamental limits of storage/caching and the index coding problem and applications of coding theory for parallel computing. A theoretical framework and numerical simulation for the long term reliability of a distributed storage system were presented by Luby.

DNA-based storage was recently proposed to address new challenges to handle extremely high volume recording media to propose new compression methods for non-traditional data formats. Since DNA may be easily replicated and a massive amount of information stored reliably with minimal space requirements, it has enormous potential as a method of big data storage. This was the focus of the DNA working group. Problems such read and write cost, insertion and deletion errors arising in sequences, error reduction were discussed. Milenkovic gave an introductory talk describing several problems associated with whole genome, sequencing read, RNA-seq and ChiP-seq data compression, and outlined the first portable DNA-based rewritable and random access storage system.

Private information retrieval (PIR) enables a user to retrieve a data item from a database without disclosing the identity of the item retrieved, while the data itself may be public. The PIR working group considered this problem in the context of storage codes, in particular for dynamic coded storage and adversarial PIR, with some extensions to asynchronized systems, batch codes and private keyword search. Hollanti gave a tutorial overview of recent results in the area.

Age of information is a metric for status updating systems, where a monitor is interested in staying timely about the status of a source. The optimal updating strategy that minimizes the average age exists when the updating rate is constrained by limited network resources. Streaming source coding problems can be applied to the problem of age analysis. The main focus the Age & Delay working group was to introduce the age of information concept to participating coding theorists and explore potential age and delay problems in coding and storage. An adaptive arithmetic coding scheme was proposed as a potential solution to avoid huge decoding delay. Several possible delay problems in file downloading from multiple servers were discussed. Two PhD students, Zhong and Najm gave a tutorial overview of the topic.

Code-based crypto-systems are some of the very few that resist quantum-based attacks. In the case subfield subcodes such as the Goppa or Srivastava codes no successful attack is known yet. Moderate-density parity-check (MDPC) codes have been proposed for key size reduction in such crytposystems. The group identified open problems such as investigating other subfield subcodes and attacks on MDPC structured codes. An overview was presented by Bossert.

Rank-metric codes have applications in random network coding, coded-caching and in code-based cryptography. The working group focussed on maximum rank distance (MRD) codes, specifically their classifications and on algebraic methods for constructing and decoding families of them. New nontrivial classifications were obtained. Further research directions on the classification problem were identified such as adapting semi-field theory techniques and searches for codes with high symmetry. Given the known limitation of list decoders for Gabidulin codes, the group worked on adapting decoders for Gabidulin codes to recent families of MRD codes. Sheekey presented recent results on MRD codes and described links to semifields.

A total of 44 researchers participated in the seminar across these working groups. In addition, several participants took the opportunity to collaborate with others on specific related projects. There were 16 talks in total, several related to storage of big data and others on topics such as maximum rank distance codes, chip-to-chip communication, the MDS conjecture, the SAGE computer algebra system, age of information, the edge removal problem, convolutional codes and network coding. Among the talks given were some tutorial presentations, aimed at introducing researchers to fundamentals of a related working group. The working groups focussed on identifying and addressing new and/or important open problems in the area. Age & Delay, PIR for storage codes and DNA-based storage were new topics to many participants and generated considerable interest.

Copyright Martin Bossert, Eimear Byrne, and Emina Soljanin

Participants
  • Iryna Andriyanova (University of Cergy-Pontoise, FR) [dblp]
  • Allison Beemer (University of Nebraska - Lincoln, US) [dblp]
  • Jessalyn Bolkema (University of Nebraska - Lincoln, US) [dblp]
  • Martin Bossert (Universität Ulm, DE) [dblp]
  • Michael Braun (Hochschule Darmstadt, DE) [dblp]
  • Eimear Byrne (University College Dublin, IE) [dblp]
  • Viveck Cadambe (Pennsylvania State University - University Park, US) [dblp]
  • Gerard Cohen (Telecom Paris Tech, FR) [dblp]
  • Alex Dimakis (University of Texas - Austin, US) [dblp]
  • Iwan M. Duursma (University of Illinois - Urbana Champaign, US) [dblp]
  • Michelle Effros (CalTech - Pasadena, US) [dblp]
  • Rafah El-Khatib (EPFL - Lausanne, CH) [dblp]
  • Tuvi Etzion (Technion - Haifa, IL) [dblp]
  • Christina Fragouli (University of California at Los Angeles, US) [dblp]
  • Heide Gluesing-Luerssen (University of Kentucky, US) [dblp]
  • Elisa Gorla (Université de Neuchâtel, CH) [dblp]
  • Marcus Greferath (Aalto University, FI) [dblp]
  • Venkatesan Guruswami (Carnegie Mellon University - Pittsburgh, US) [dblp]
  • Daniel Heinlein (Universität Bayreuth, DE) [dblp]
  • Camilla Hollanti (Aalto University, FI) [dblp]
  • Anna-Lena Horlemann-Trautmann (EPFL - Lausanne, CH) [dblp]
  • Christine A. Kelley (University of Nebraska - Lincoln, US) [dblp]
  • P. Vijay Kumar (Indian Institute of Science -Bangalore, IN) [dblp]
  • Julia Lieb (Universität Würzburg, DE) [dblp]
  • Hans-Andrea Loeliger (ETH - Zürich, CH) [dblp]
  • Michael Luby (Qualcomm Inc. - San Diego, US) [dblp]
  • Felice Manganiello (Clemson University, US) [dblp]
  • Carolyn Mayer (University of Nebraska - Lincoln, US) [dblp]
  • Sihem Mesnager (University of Paris VIII, Telecom Panis Tech, FR) [dblp]
  • Olgica Milenkovic (University of Illinois - Urbana Champaign, US) [dblp]
  • Elie Najm (EPFL - Lausanne, CH) [dblp]
  • Alessandro Neri (Universität Zürich, CH) [dblp]
  • Mario Osvin Pavcevic (University of Zagreb, HR) [dblp]
  • Sven Puchinger (Universität Ulm, DE) [dblp]
  • Tovohery Randrianarisoa (Universität Zürich, CH) [dblp]
  • Alberto Ravagnani (Université de Neuchâtel, CH) [dblp]
  • Johan Rosenkilde (Technical University of Denmark - Lyngby, DK) [dblp]
  • Joachim Rosenthal (Universität Zürich, CH) [dblp]
  • John Sheekey (University College Dublin, IE) [dblp]
  • M. Amin Shokrollahi (EPFL - Lausanne, CH) [dblp]
  • Vitaly Skachek (University of Tartu, EE) [dblp]
  • Emina Soljanin (Rutgers University - Piscataway, US) [dblp]
  • Pascal Vontobel (The Chinese University of Hong Kong, HK) [dblp]
  • Jing Zhong (Rutgers University - New Brunswick, US) [dblp]

Related Seminars
  • Dagstuhl Seminar 11461: Coding Theory (2011-11-13 - 2011-11-18) (Details)
  • Dagstuhl Seminar 13351: Coding Theory (2013-08-25 - 2013-08-30) (Details)
  • Dagstuhl Seminar 18511: Algebraic Coding Theory for Networks, Storage, and Security (2018-12-16 - 2018-12-21) (Details)

Classification
  • data structures / algorithms / complexity
  • networks
  • security / cryptology

Keywords
  • error-correction
  • coding theory
  • algebraic coding theory
  • distributed storage
  • index coding
  • caching problems
  • streaming algorithms
  • cryptography
  • information theory
  • randomized algorithms
  • complexity theory.