1st International ICST Conference on Scalable Information Systems

Research Article

Query-driven document partitioning and collection selection

  • @INPROCEEDINGS{10.1145/1146847.1146881,
        author={Diego Puppin and Fabrizio Silvestri and Domenico Laforenza},
        title={Query-driven document partitioning and collection selection},
        proceedings={1st International ICST Conference on Scalable Information Systems},
        publisher={ACM},
        proceedings_a={INFOSCALE},
        year={2006},
        month={6},
        keywords={},
        doi={10.1145/1146847.1146881}
    }
    
  • Diego Puppin
    Fabrizio Silvestri
    Domenico Laforenza
    Year: 2006
    Query-driven document partitioning and collection selection
    INFOSCALE
    ACM
    DOI: 10.1145/1146847.1146881
Diego Puppin1,*, Fabrizio Silvestri1,*, Domenico Laforenza1,*
  • 1: Istituto di Scienza e Tecnologie dell’Informazione (A. Faedo), Consiglio Nazionale delle Ricerche, Pisa, Italy
*Contact email: diego.puppin@isti.cnr.it, fabrizio.silvestri@isti.cnr.it, domenico.laforenza@isti.cnr.it

Abstract

We present a novel strategy for partitioning a document collection across several servers and for performing effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vector model: each document is represented as a list of the queries for which it is a match, together with the rank at which it is returned for each query. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters, which group queries that return similar results, are used for collection selection. First, we show that this document partitioning strategy greatly boosts the performance of standard collection selection algorithms, including CORI, compared with a round-robin assignment. Second, we show that by matching the query to the existing query clusters and then choosing only one server, we reach an average precision-at-5 of up to 1.74 and consistently improve on CORI's precision by between 11% and 15%. As a side result, we show a way to identify rarely requested documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the top 100 results of any query.
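
As an illustration of the document representation described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes the query log has already been replayed against a reference search engine, so that each query maps to its ranked list of document identifiers. The function names (`build_query_vectors`, `never_returned`) and the toy data are hypothetical. The sketch builds one query-vector per document and then lists the documents that no logged query returns, which is the set the abstract proposes to separate from the main index. The co-clustering step is not shown.

```python
from collections import defaultdict


def build_query_vectors(results_by_query):
    """Map each document to the (query, rank) pairs for which it was returned.

    results_by_query: dict mapping a query string to its ranked list of
    document ids (best match first). This input format is an assumption.
    """
    vectors = defaultdict(list)
    for query, ranked_docs in results_by_query.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            vectors[doc_id].append((query, rank))
    return dict(vectors)


def never_returned(all_doc_ids, vectors):
    """Return the documents with an empty query-vector, in sorted order."""
    return sorted(doc_id for doc_id in all_doc_ids if doc_id not in vectors)


# Toy data: two logged queries over a four-document collection.
results_by_query = {
    "jaguar speed": ["d1", "d3"],
    "jaguar car": ["d3", "d2"],
}
all_docs = ["d1", "d2", "d3", "d4"]

query_vectors = build_query_vectors(results_by_query)
silent_docs = never_returned(all_docs, query_vectors)
```

In this toy run, `d3` is represented by both queries with its rank in each result list, while `d4` has no query-vector and would fall into the rarely requested set. In a setting like the one the abstract reports, each result list would be truncated to the top 100 results before the vectors are built.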