Distributed perfect hashing for very large key sets

Fabiano C. Botelho; Daniel Galinkin; Wagner Meira Jr.; Nivio Ziviani

3rd International ICST Conference on Scalable Information Systems

Research Article

Distributed perfect hashing for very large key sets

Download661 downloads

Cite: BibTeX Plain Text

@INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3518,
    author={Fabiano C. Botelho and Daniel Galinkin and Wagner Meira Jr. and Nivio Ziviani},
    title={Distributed perfect hashing for very large key sets},
    proceedings={3rd International ICST Conference on Scalable Information Systems},
    publisher={ICST},
    proceedings_a={INFOSCALE},
    year={2010},
    month={5},
    keywords={Distributed minimal perfect hash function large key sets},
    doi={10.4108/ICST.INFOSCALE2008.3518}
}

Fabiano C. Botelho
Daniel Galinkin
Wagner Meira Jr.
Nivio Ziviani
Year: 2010
Distributed perfect hashing for very large key sets
INFOSCALE
ICST
DOI: 10.4108/ICST.INFOSCALE2008.3518

Fabiano C. Botelho¹^,*, Daniel Galinkin¹^,*, Wagner Meira Jr.¹^,*, Nivio Ziviani¹^,*

1: Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil

*Contact email: fbotelho@dcc.ufmg.br, dggc@dcc.ufmg.br, meira@dcc.ufmg.br, nivio@dcc.ufmg.br

Abstract

A perfect hash function (PHF) h: S → [0, m -- 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for memory efficient storage and fast retrieval of items from static sets.

In this paper we present a distributed and parallel version of a simple, highly scalable and near-space optimal perfect hashing algorithm for very large key sets, recently presented in [4]. The sequential implementation of the algorithm constructs a MPHF for a set of 1.024 billion URLs of average length 64 bytes collected from the Web in approximately 50 minutes using a commodity PC.

The parallel implementation proposed here presents the following performance using 14 commodity PCs: (i) it constructs a MPHF for the same set of 1.024 billion URLs in approximately 4 minutes; (ii) it constructs a MPHF for a set of 14.336 billion 16-byte random integers in approximately 50 minutes with a performance degradation of 20%; (iii) one version of the parallel algorithm distributes the description of the MPHF among the participating machines and its evaluation is done in a distributed way, faster than the centralized function.

Keywords: Distributed minimal perfect hash function large key sets

Published: 2010-05-16
Publisher: ICST

: http://dx.doi.org/10.4108/ICST.INFOSCALE2008.3518