01.05.17 - 05.05.17, Seminar 17181

Theory and Applications of Hashing

The following text appeared on our web pages prior to the seminar, and was included as part of the invitation.


The idea of hashing was proposed in the 1950s as an efficient method for implementing symbol tables in compilers. In succeeding decades it has emerged as an algorithmic tool that goes well beyond its original purpose, providing solutions for a wide range of algorithmic problems. While the theory of hashing may appear mature, in fact, many new advances have been made in recent years. The rich interplay between theory and practice in the field of hashing cannot be overstated. Applications drive the need for new algorithms and data structures based on hashing, as well as the need for more efficient hash function families with provable theoretical guarantees. Specific implementations developed in “the field” often do not have theoretical guarantees on performance or accuracy, creating new theoretical problems and driving a need to make theory and practice meet. Thus, the aim of this Dagstuhl Seminar is to bring together researchers with an interest in hashing methods (and more generally random mappings) from various perspectives, spanning theory and a diverse set of application areas. In this way we hope to identify opportunities for further advancing the field.

The theoretical side includes aspects such as the design of hash functions, the design of algorithm and data structure primitives, and the mathematical analysis of hashing schemes. For hash function design, since Carter and Wegman proposed universal hashing in 1977, there has been a fertile research agenda identifying sufficient randomness properties of hash functions for various applications, and on the other hand devising resource-efficient hash functions having these properties. While new simple and efficient hash function constructions with strong theoretical properties keep appearing, there are still many applications of hashing for which no efficient hash function construction is known. At the same time it is increasingly clear that traditional measures of hash function performance are inadequate to describe the randomness properties needed in many applications, and a more direct analysis is needed.

While hashing is interesting in its own right, it has also become a foundational building block for higher level algorithms and data structures, each of which can in turn often find uses in a variety of application spaces. The list of such basic algorithms and structures is long, and growing: cuckoo hash tables, Bloom filters of various kinds, including invertible Bloom lookup tables, balanced allocations, expanders derived from hashing, count-min filters, sketches of many different kinds, similarity measures for text and images, and so on. These algorithms and data structures provide the interface between the lower level of hash function design and the higher level applications; often they extend what we require from the hash functions.

Regarding applications, hashing is of course heavily used in information storage and retrieval contexts. It has been a central tool in the design of algorithms for data streams. In recent years the use of hashing has spread to many other settings where data is stored and accessible, but scalability can be achieved only by resorting to approximation algorithms. One early success story in this direction is min-wise hashing for identifying (near-)duplicate web pages. Bloom filters have become a staple in systems, with countless variations expanding on its initial functionality, such as counts associated with set elements or aging out of set items as new elements are dynamically added. In the field of machine learning, random mappings of data to lower-dimensional vectors that are easier to handle is of increasing importance for big data applications. Designing randomized mappings that meet the criteria of machine learning applications has been an active research area in recent years. We believe the topic of the Dagstuhl Seminar is highly relevant for companies dealing with big data, as some of the invitees who made fundamental contributions to the theory and applications of hashing are now employed in industry; we expect that applications beyond information retrieval, data stream analysis, and machine learning will also be discussed at the Dagstuhl Seminar.

Creative Commons BY 3.0 DE
Martin Dietzfelbinger, Michael Mitzenmacher, Rasmus Pagh, and David P. Woodruff