Dagstuhl Seminar 17181: Theory and Applications of Hashing

Dagstuhl Seminar 17181

Theory and Applications of Hashing

( May 01 – May 05, 2017 )

(Click in the middle of the image to enlarge)

Permalink

Please use the following short url to reference this page: https://www.dagstuhl.de/17181

Organizers

Martin Dietzfelbinger (TU Ilmenau, DE)
Michael Mitzenmacher (Harvard University - Cambridge, US)
Rasmus Pagh (IT University of Copenhagen, DK)
David P. Woodruff (IBM Almaden Center - San Jose, US)

Coordinator

Martin Aumüller (IT University of Copenhagen, DK)

Contact

Andreas Dolzmann (for scientific matters)
Susanne Bach-Bernhard (for administrative matters)

Shared Documents

Dagstuhl Materials Page (Use personal credentials as created in DOOR to log in)

Publications

Theory and Applications of Hashing (Dagstuhl Seminar 17181). Martin Dietzfelbinger, Michael Mitzenmacher, Rasmus Pagh, David P. Woodruff, and Martin Aumüller. In Dagstuhl Reports, Volume 7, Issue 5, pp. 1-21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017)

Motivation

Show Motivation

The idea of hashing was proposed in the 1950s as an efficient method for implementing symbol tables in compilers. In succeeding decades it has emerged as an algorithmic tool that goes well beyond its original purpose, providing solutions for a wide range of algorithmic problems. While the theory of hashing may appear mature, in fact, many new advances have been made in recent years. The rich interplay between theory and practice in the field of hashing cannot be overstated. Applications drive the need for new algorithms and data structures based on hashing, as well as the need for more efficient hash function families with provable theoretical guarantees. Specific implementations developed in “the field” often do not have theoretical guarantees on performance or accuracy, creating new theoretical problems and driving a need to make theory and practice meet. Thus, the aim of this Dagstuhl Seminar is to bring together researchers with an interest in hashing methods (and more generally random mappings) from various perspectives, spanning theory and a diverse set of application areas. In this way we hope to identify opportunities for further advancing the field.

The theoretical side includes aspects such as the design of hash functions, the design of algorithm and data structure primitives, and the mathematical analysis of hashing schemes. For hash function design, since Carter and Wegman proposed universal hashing in 1977, there has been a fertile research agenda identifying sufficient randomness properties of hash functions for various applications, and on the other hand devising resource-efficient hash functions having these properties. While new simple and efficient hash function constructions with strong theoretical properties keep appearing, there are still many applications of hashing for which no efficient hash function construction is known. At the same time it is increasingly clear that traditional measures of hash function performance are inadequate to describe the randomness properties needed in many applications, and a more direct analysis is needed.

While hashing is interesting in its own right, it has also become a foundational building block for higher level algorithms and data structures, each of which can in turn often find uses in a variety of application spaces. The list of such basic algorithms and structures is long, and growing: cuckoo hash tables, Bloom filters of various kinds, including invertible Bloom lookup tables, balanced allocations, expanders derived from hashing, count-min filters, sketches of many different kinds, similarity measures for text and images, and so on. These algorithms and data structures provide the interface between the lower level of hash function design and the higher level applications; often they extend what we require from the hash functions.

Regarding applications, hashing is of course heavily used in information storage and retrieval contexts. It has been a central tool in the design of algorithms for data streams. In recent years the use of hashing has spread to many other settings where data is stored and accessible, but scalability can be achieved only by resorting to approximation algorithms. One early success story in this direction is min-wise hashing for identifying (near-)duplicate web pages. Bloom filters have become a staple in systems, with countless variations expanding on its initial functionality, such as counts associated with set elements or aging out of set items as new elements are dynamically added. In the field of machine learning, random mappings of data to lower-dimensional vectors that are easier to handle is of increasing importance for big data applications. Designing randomized mappings that meet the criteria of machine learning applications has been an active research area in recent years. We believe the topic of the Dagstuhl Seminar is highly relevant for companies dealing with big data, as some of the invitees who made fundamental contributions to the theory and applications of hashing are now employed in industry; we expect that applications beyond information retrieval, data stream analysis, and machine learning will also be discussed at the Dagstuhl Seminar.

Creative Commons BY 3.0 DE

Martin Dietzfelbinger, Michael Mitzenmacher, Rasmus Pagh, and David P. Woodruff

Summary

Show Summary

Background

The idea of hashing was proposed in the 1950s as an efficient method for implementing symbol tables in compilers. In succeeding decades it has emerged as an algorithmic tool that goes well beyond its original purpose, providing solutions for a wide range of algorithmic problems. While the theory of hashing may appear mature, in fact, many new advances have been made in recent years. Also, the number of applications has grown to an extent that few people realize how broad the reach of hashing is, or have a comprehensive overview. The aim of this seminar was to bring together researchers with an interest in hashing methods (and more generally random mappings) from various perspectives, spanning theory and a diverse set of application areas. In this way we wanted to identify opportunities for further advancing the field.

Theory

The theoretical side includes aspects such as the design of hash functions, the design of algorithm and data structure primitives, and the mathematical analysis of hashing schemes. For hash function design, since Carter and Wegman proposed universal hashing there has been a fertile research agenda identifying sufficient randomness properties of hash functions for various applications, and on the other hand devising resource-efficient hash functions having these properties. While new simple and efficient hash function constructions with strong theoretical properties keep appearing (seminar participants have contributed to this), there are still many applications of hashing for which no efficient hash function construction is known. At the same time it is increasingly clear that traditional measures of hash function performance like collision probability and independence are inadequate to describe the randomness properties needed in many applications.

While hashing is interesting in its own right, it has also become a foundational building block for higher level algorithms and data structures, each of which can in turn often find uses in a variety of application spaces. Various hash-based sketches laid the groundwork for the field of streaming algorithms, by showing how approximate counting could be done effectively. More recently, hashing algorithms have provided frameworks for similarity measures for text and images, data reconciliation, and even fast sparse Fast Fourier transform algorithms. As we construct richer, more elaborate structures on top of the hashing building blocks, we stretch what we require from hashing further.

Mathematical analysis of hashing schemes, an area enriched by early work of Knuth but grown well beyond, has inspired the development of a number of combinatorial methods and results. In the 1990s it was realized that load balancing using the best of two (or more) random choices leads to a much more even distribution, often referred to as the ``power of two choices''. Another great success in this area, obtained during the last decade, has been the analysis of cuckoo hashing; several seminar participants were involved in this. On the other hand, as questions become answered, new questions arise; in the case of cuckoo hashing, the effectiveness of variants including double hashing and partial-key cuckoo hashing are not understood. Beyond the hashing schemes themselves, data structures and algorithms that make use of lower-level hashing primitives further require and benefit from mathematical analysis. The fundamental connections between hashing, sparse reconstruction algorithms, and various sketching approaches are only now starting to be realized.

Applications

Hashing is of course heavily used in information storage and retrieval contexts. For example, the "power of two choices" paradigm(where several seminar participants were among the pioneers) has resulted in extremely scalable and robust architectures for distributed hash tables (also known as key-value stores).

Other applications of hashing are appearing at a tremendous rate, as systems-designers and builders become accustomed to a world where approximate answers (as opposed to exact answers) are not only sufficient, they are necessary for efficiency. Indeed, hashing was one of the key methodologies for handling big data well before "big data" was even a widely used term. Since the seminal paper of Flajolet and Martin that showed how to efficiently compute an approximate count of the number of elements in a data stream, hashing has been a central tool in the design of algorithms for data streams, where the inability to store all the information that passes requires approximations. But in recent years the use of hashing has spread to many other settings where data is stored and accessible, but scalability can be achieved only by resorting to approximation algorithms similar to those developed in the setting of data streams. One early success story in this direction is the method for identifying near-duplicate web pages in Altavista using min-wise hashing. Another is HyperANF, a refined version of the Flajolet-Martin method that was used in 2012 to compute the distance distribution of the Facebook social network, making it by far the largest scale Milgram-like experiment ever performed. Finally, Bloom filters, invented around 1970 to provide a small-memory approximate representation of a set, have become a staple in systems, with countless variations expanding on its initial functionality, such as counts associated with set elements or aging out of set items as new elements are dynamically added.

In the field of machine learning, random mappings of data to lower-dimensional vectors that are easier to handle is of increasing importance for big data applications. This is true in particular since machine learning algorithms often work with kernelized feature vectors whose dimension far exceeds the size of data vectors. Designing randomized mappings that meet the criteria of machine learning applications has been an active research area in recent years. NIPS 2014 awarded one of two best paper awards a paper co-authored by seminar participant Shrivastava that presents a new asymmetric locality-sensitive hash function design for machine learning applications and shows how it leads to significant speedups.

The rich interplay between theory and practice in the field of hashing cannot be overstated. Applications drive the need for new algorithms and data structures based on hashing, as well as the need for more efficient classes of hash function families with provable theoretical guarantees. Specific implementations developed in the field often do not have theoretical guarantees on performance or accuracy, creating new theoretical problems and driving a need to make theory and practice meet.

Industrial relevance.

The workshop topic was highly relevant for companies dealing with big data. Three seminar participants are affiliated with Google, one has worked on hashing algorithms at AT&T for a decade, and one is affiliated with VMware. Also, one organizer was affiliated with IBM at the time of the seminar.

Outcome of the seminar

The seminar brought together quite a few leading figures in the area of hashing, mixed with a good fraction of young researchers, some of which are behind many of the most exciting results in the area in recent years. Areas that were particularly well represented were: Analysis of multiple-choice hashing methods, hashing for high-dimensional search problems, applications of hashing in string algorithms, applications of hashing in machine learning, streaming (approximation) algorithms, high-performance hash function construction, and algorithm engineering. Many results in these areas were presented in 18 shorter talks. Four longer talks (by A. Andoni, A. McGregor, U. Wieder, and Q. Zhang) contributed background, overview over new results, and aspects of applications of hashing in industry. Open problems were discussed in an open problem session; four of them are included in this report.

The paper [1] on fillable arrays co-authored by J. Nelson was motivated by a talk given by T. Hagerup (see Section 3.7) and can thus be seen as a first direct result of the seminar.

We, the organizers, would like to thank all participants for their contributions in talks, material, and discussions, and in particular the speakers of the longer talks. Many thanks are due to the Dagstuhl staff both in the offices and in the castle for their support in making this seminar a success.

References

Jacob Teo Por Loong, Jelani Nelson, Huacheng Yu: Fillable arrays with constant time operations and a single bit of redundancy. CoRR abs/1709.09574 (2017)

Creative Commons BY 3.0 Unported license

Martin Dietzfelbinger, Michael Mitzenmacher, Rasmus Pagh, David P. Woodruff, and Martin Aumüller

Participants

Show Participants

Alexandr Andoni (Columbia University - New York, US) [dblp]
Martin Aumüller (IT University of Copenhagen, DK) [dblp]
Hannah Bast (Universität Freiburg, DE) [dblp]
Djamal Belazzougui (CERIST - Algiers, DZ) [dblp]
Vladimir Braverman (Johns Hopkins University - Baltimore, US) [dblp]
Andrei Z. Broder (Google Inc. - Mountain View, US) [dblp]
Tobias Christiani (IT University of Copenhagen, DK) [dblp]
Artur Czumaj (University of Warwick - Coventry, GB) [dblp]
Søren Dahlgaard (University of Copenhagen, DK) [dblp]
Martin Dietzfelbinger (TU Ilmenau, DE) [dblp]
Stefan Edelkamp (Universität Bremen, DE) [dblp]
Tom Friedetzky (Durham University, GB) [dblp]
Andreas Goerdt (TU Chemnitz, DE) [dblp]
Bernhard Haeupler (Carnegie Mellon University - Pittsburgh, US) [dblp]
Torben Hagerup (Universität Augsburg, DE) [dblp]
Michael Kapralov (EPFL - Lausanne, CH) [dblp]
Mathias Bæk Tejs Knudsen (University of Copenhagen, DK) [dblp]
Ravi Kumar (Google Research - Mountain View, US) [dblp]
Yi Li (Nanyang TU - Singapore, SG) [dblp]
Andrew McGregor (University of Massachusetts - Amherst, US) [dblp]
Michael Mitzenmacher (Harvard University - Cambridge, US) [dblp]
Moni Naor (Weizmann Institute - Rehovot, IL) [dblp]
Jelani Nelson (Harvard University - Cambridge, US) [dblp]
Rasmus Pagh (IT University of Copenhagen, DK) [dblp]
Konstantinos Panagiotou (LMU München, DE) [dblp]
Ely Porat (Bar-Ilan University - Ramat Gan, IL) [dblp]
Ilya Razenshteyn (MIT - Cambridge, US) [dblp]
Peter Sanders (KIT - Karlsruher Institut für Technologie, DE) [dblp]
Tamás Sarlós (Google Inc. - Mountain View, US) [dblp]
Ludwig Schmidt (MIT - Cambridge, US) [dblp]
Anshumali Shrivastava (Rice University - Houston, US) [dblp]
Francesco Silvestri (University of Padova, IT) [dblp]
Christian Sohler (TU Dortmund, DE) [dblp]
Tatiana Starikovskaya (University Paris-Diderot, FR) [dblp]
Mikkel Thorup (University of Copenhagen, DK) [dblp]
Stefan Walzer (TU Ilmenau, DE) [dblp]
Udi Wieder (VMware - Palo Alto, US) [dblp]
Philipp Woelfel (University of Calgary, CA) [dblp]
David P. Woodruff (IBM Almaden Center - San Jose, US) [dblp]
Qin Zhang (Indiana University - Bloomington, US) [dblp]

Classification

data bases / information retrieval
data structures / algorithms / complexity

Keywords

hash function construction and analysis
hashing primitives
information retrieval applications
locality-sensitive hashing
data streaming applications
machine learning applications
connections to complexity theory

Seminar 17181

Search the Dagstuhl Website

Schloss Dagstuhl Services

Seminars

Within this website:

External resources:

Publishing

Within this website:

External resources:

dblp

Within this website:

External resources:

Dagstuhl Seminar 17181

Theory and Applications of Hashing

( May 01 – May 05, 2017 )

Permalink

Organizers

Coordinator

Contact

Shared Documents

Publications

Motivation

Summary

Background

Theory

Applications

Industrial relevance.

Outcome of the seminar

References

Participants

Classification

Keywords