3rd International ICST Conference on Scalable Information Systems

Research Article

Distributed perfect hashing for very large key sets

Download661 downloads
  • @INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3518,
        author={Fabiano C. Botelho and Daniel Galinkin and Wagner Meira Jr. and Nivio Ziviani},
        title={Distributed perfect hashing for very large key sets},
        proceedings={3rd International ICST Conference on Scalable Information Systems},
        publisher={ICST},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={Distributed minimal perfect hash function large key sets},
        doi={10.4108/ICST.INFOSCALE2008.3518}
    }
    
  • Fabiano C. Botelho
    Daniel Galinkin
    Wagner Meira Jr.
    Nivio Ziviani
    Year: 2010
    Distributed perfect hashing for very large key sets
    INFOSCALE
    ICST
    DOI: 10.4108/ICST.INFOSCALE2008.3518
Fabiano C. Botelho1,*, Daniel Galinkin1,*, Wagner Meira Jr.1,*, Nivio Ziviani1,*
  • 1: Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
*Contact email: fbotelho@dcc.ufmg.br, dggc@dcc.ufmg.br, meira@dcc.ufmg.br, nivio@dcc.ufmg.br

Abstract

A perfect hash function (PHF) h: S → [0, m -- 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for memory efficient storage and fast retrieval of items from static sets.

In this paper we present a distributed and parallel version of a simple, highly scalable and near-space optimal perfect hashing algorithm for very large key sets, recently presented in [4]. The sequential implementation of the algorithm constructs a MPHF for a set of 1.024 billion URLs of average length 64 bytes collected from the Web in approximately 50 minutes using a commodity PC.

The parallel implementation proposed here presents the following performance using 14 commodity PCs: (i) it constructs a MPHF for the same set of 1.024 billion URLs in approximately 4 minutes; (ii) it constructs a MPHF for a set of 14.336 billion 16-byte random integers in approximately 50 minutes with a performance degradation of 20%; (iii) one version of the parallel algorithm distributes the description of the MPHF among the participating machines and its evaluation is done in a distributed way, faster than the centralized function.