07.08.16 - 12.08.16, Seminar 16321

Coding Theory in the Time of Big Data

This seminar description was published on our web pages before the seminar and was used in the invitation to the seminar.


This seminar explores the contemporary challenges of storing and delivering files efficiently, reliably, and securely in the time of 'big data'. In particular, it will examine how coding theory fundamentals must be extended to address the issues emerging in the evolving practice of storing and transmitting large files across networks. These novel coding applications are, in turn, having a significant impact on coding theory fundamentals. The seminar will touch on topics such as algebraic coding theory, distributed storage, index coding, caching problems, streaming algorithms, cryptography, information theory, randomized algorithms, and complexity theory.

Codes for Distributed Storage

To be reliable, a multi-disk system must allow the data on a failed disk to be recovered from the data stored on the remaining disks. It is empirically clear that disk failure is the norm rather than the exception, which makes systemic redundancy a storage requirement, yet the sheer scale of the data involved means that redundancy must be added as efficiently as possible. In other words, coding is now a requirement for trustworthy distributed storage systems.

Regenerating codes, local reconstruction codes, and update-efficient codes are among the current classes of codes designed for different distributed storage constraints such as repair bandwidth, disk I/O, and repair locality. Algebraic methods and ideas from classical coding theory and network coding may be applied to construct families of codes that perform well with respect to these different parameters.

In addition to providing reliability against disk failures, coding reduces the time required to download content from a distributed storage system. The download time is reduced because, when a content file is encoded to add redundancy and distributed across multiple disks, reading only a subset of the disks is sufficient to reconstruct the content. For the same total storage used, coding exploits the diversity in storage better than simple replication, and hence gives faster download.

Storing coded data in distributed systems, and even in untrusted networks, is practical but poses novel challenges to maintaining data integrity unless the codes are explicitly designed to handle them. For example, an adversary may corrupt a large amount of stored data simply by altering the data stored on a small number of storage nodes. Eavesdropping on the data also becomes easier because of the multiple downloads that take place during node repair.
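The point that reading only a subset of the disks is sufficient to reconstruct the content can be illustrated with the simplest possible erasure code: k data shards plus one XOR parity shard. This is only a minimal sketch under stated assumptions. The function names `encode` and `decode` are hypothetical, each shard stands in for one disk, and a single-parity code tolerates only one lost shard, whereas the code classes named above target stronger guarantees and other repair metrics.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k equal-size shards and append one XOR parity shard.

    Hypothetical helper: each returned shard represents the content of one disk.
    """
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")  # zero-pad to a multiple of k
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def decode(shards: list, original_len: int) -> bytes:
    """Rebuild the data from k+1 shards, at most one of which is None (a lost disk)."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("a single-parity code tolerates only one lost shard")
    shards = list(shards)
    if missing:
        # The XOR of all surviving shards equals the missing shard.
        survivors = [s for s in shards if s is not None]
        shards[missing[0]] = reduce(xor_bytes, survivors)
    # Drop the parity shard and the zero padding.
    return b"".join(shards[:-1])[:original_len]
```

With k = 4, the file occupies five disks at a storage overhead of 25 percent, and any four of the five shards are enough to read it back. Two-fold replication would also survive one failure, but at an overhead of 100 percent.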

Codes for Video and other Big Data Delivery

Efficient use of storage and transmission resources is central to applications such as video broadcasting. Traditional unicast-based solutions underperform when the same content is sought by multiple users, whereas coding and information-theoretic approaches can offer advantages.

Codes for streaming, heterogeneous multicast/broadcast codes, index coding, and distributed caching are especially relevant to big data delivery. Convolutional codes have been proposed as low-delay streaming codes that are robust to both burst and isolated errors. Rateless codes could be adapted for heterogeneous scenarios, where users not only have different channel conditions but may also demand different amounts of information and have different computing and display capabilities. Index coding techniques can be applied to distributed caching problems, where network traffic may be reduced by storing popular files at strategically chosen nodes or end devices.
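The connection between index coding and caching can be seen in the smallest textbook-style instance, sketched below under stated assumptions. Two users each want one file and already hold the other file in a local cache, and the two files are assumed to have equal length. The function names `coded_broadcast` and `recover` are hypothetical. A single XOR-coded broadcast then serves both users, where unicast delivery would need two transmissions.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes files of equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def coded_broadcast(file_1: bytes, file_2: bytes) -> bytes:
    """Server side: one transmission that is useful to both users."""
    return xor_bytes(file_1, file_2)


def recover(broadcast: bytes, cached_file: bytes) -> bytes:
    """User side: cancel the cached file out of the broadcast to get the wanted one."""
    return xor_bytes(broadcast, cached_file)


# User A wants file_1 and has file_2 cached; user B wants file_2 and has file_1 cached.
file_1 = b"video-segment-0001"
file_2 = b"video-segment-0002"
packet = coded_broadcast(file_1, file_2)
user_a_gets = recover(packet, file_2)
user_b_gets = recover(packet, file_1)
```

The saving depends entirely on what each user has cached, which is why the placement of popular files and the choice of coded transmissions are studied together.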