Research Article
Distributed perfect hashing for very large key sets
@INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3518, author={Fabiano C. Botelho and Daniel Galinkin and Wagner Meira Jr. and Nivio Ziviani}, title={Distributed perfect hashing for very large key sets}, proceedings={3rd International ICST Conference on Scalable Information Systems}, publisher={ICST}, proceedings_a={INFOSCALE}, year={2010}, month={5}, keywords={Distributed minimal perfect hash function large key sets}, doi={10.4108/ICST.INFOSCALE2008.3518} }
- Fabiano C. Botelho
Daniel Galinkin
Wagner Meira Jr.
Nivio Ziviani
Year: 2010
Distributed perfect hashing for very large key sets
INFOSCALE
ICST
DOI: 10.4108/ICST.INFOSCALE2008.3518
Abstract
A perfect hash function (PHF) h: S → [0, m -- 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for memory efficient storage and fast retrieval of items from static sets.
In this paper we present a distributed and parallel version of a simple, highly scalable and near-space optimal perfect hashing algorithm for very large key sets, recently presented in [4]. The sequential implementation of the algorithm constructs a MPHF for a set of 1.024 billion URLs of average length 64 bytes collected from the Web in approximately 50 minutes using a commodity PC.
The parallel implementation proposed here presents the following performance using 14 commodity PCs: (i) it constructs a MPHF for the same set of 1.024 billion URLs in approximately 4 minutes; (ii) it constructs a MPHF for a set of 14.336 billion 16-byte random integers in approximately 50 minutes with a performance degradation of 20%; (iii) one version of the parallel algorithm distributes the description of the MPHF among the participating machines and its evaluation is done in a distributed way, faster than the centralized function.