2nd International ICST Conference on Scalable Information Systems

Research Article

Load-Balancing and Caching for Collection Selection Architectures

  • @INPROCEEDINGS{10.4108/infoscale.2007.892,
        author={Diego Puppin and Fabrizio Silvestri and Raffaele Perego and Ricardo Baeza-Yates},
        title={Load-Balancing and Caching for Collection Selection Architectures},
        proceedings={2nd International ICST Conference on Scalable Information Systems},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={},
        doi={10.4108/infoscale.2007.892}
    }
    
  • Diego Puppin
    Fabrizio Silvestri
    Raffaele Perego
    Ricardo Baeza-Yates
    Year: 2010
    Load-Balancing and Caching for Collection Selection Architectures
    INFOSCALE
    ICST
    DOI: 10.4108/infoscale.2007.892
Diego Puppin¹*, Fabrizio Silvestri¹*, Raffaele Perego¹*, Ricardo Baeza-Yates²*
  • 1: ISTI-CNR, Pisa.
  • 2: Yahoo! Research, Barcelona/Santiago.
*Contact email: diego.puppin@isti.cnr.it, fabrizio.silvestri@isti.cnr.it, raffaele.perego@isti.cnr.it, ricardo@baeza.cl

Abstract

To address the rapid growth of the Internet, modern Web search engines have to adopt distributed organizations, where the collection of indexed documents is partitioned among several servers, and query answering is performed as a parallel and distributed task. Collection selection can reduce the overall computing load by trading off the quality of the results retrieved against the cost of solving queries. In this paper, we analyze how the collection selection strategy affects load balancing and the caching subsystem, by exploring the design space of a distributed search engine based on collection selection. In particular, we propose a strategy to perform collection selection in a load-driven way, and a novel caching policy that incrementally refines the effectiveness of the results returned at each subsequent cache hit. The combination of load-driven collection selection and incremental caching allows our system to retrieve two-thirds of the top-ranked results returned by a baseline centralized index, with only one-fifth of the computing workload.
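
The abstract does not spell out either mechanism, so the following is only a minimal sketch of how load-driven selection and incremental caching could fit together, not the authors' implementation. Every name here (`CacheEntry`, `select_servers`, `answer`, the `load` and `capacity` counters, and the `rank_servers` and `search` callbacks) is hypothetical. The assumed behavior is that the best-ranked server is always polled on a cache miss, lower-ranked servers are polled only while they have spare capacity, and each cache hit polls servers that were skipped earlier and merges their results into the cached entry.

```python
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    results: dict = field(default_factory=dict)  # doc_id -> best score seen so far
    polled: set = field(default_factory=set)     # servers already queried for this entry


def select_servers(ranking, load, capacity, already_polled, must_poll):
    """Pick servers for one query (assumed policy, not taken from the paper).

    ranking: server ids, best first, as produced by some collection
    selection function. The first `must_poll` unpolled servers are always
    chosen; the others are added only if their load is below capacity.
    """
    chosen = []
    for server in ranking:
        if server in already_polled:
            continue
        if len(chosen) < must_poll or load[server] < capacity[server]:
            chosen.append(server)
    return chosen


def answer(query, cache, rank_servers, search, load, capacity, k=10):
    """Answer a query, refining the cached entry on every repeated request."""
    entry = cache.setdefault(query, CacheEntry())
    # On a miss, force at least one server; on a hit, only use idle servers.
    must_poll = 0 if entry.polled else 1
    servers = select_servers(rank_servers(query), load, capacity,
                             entry.polled, must_poll)
    for server in servers:
        load[server] += 1  # the caller is assumed to reset counters per time window
        for doc_id, score in search(server, query):
            entry.results[doc_id] = max(score, entry.results.get(doc_id, score))
        entry.polled.add(server)
    top = sorted(entry.results.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in top[:k]]
```

In this sketch a repeated query never re-polls a server, so a cache hit costs nothing once every server has contributed, and the cached result list can only gain documents over time.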