Research Article
Query-driven document partitioning and collection selection
@INPROCEEDINGS{10.1145/1146847.1146881, author={Diego Puppin and Fabrizio Silvestri and Domenico Laforenza}, title={Query-driven document partitioning and collection selection}, proceedings={1st International ICST Conference on Scalable Information Systems}, publisher={ACM}, proceedings_a={INFOSCALE}, year={2006}, month={6}, keywords={}, doi={10.1145/1146847.1146881} }
Diego Puppin
Fabrizio Silvestri
Domenico Laforenza
Year: 2006
INFOSCALE
ACM
DOI: 10.1145/1146847.1146881
Abstract
We present a novel strategy to partition a document collection across several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results and are used for collection selection. We show that this document partitioning strategy greatly boosts the performance of standard collection selection algorithms, including CORI, compared with a round-robin assignment. We also show that, by matching the query to the existing query clusters and then choosing only one server, we reach an average precision-at-5 of up to 1.74 and consistently improve on CORI's precision by between 11% and 15%. As a side result, we show a way to identify rarely requested documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only the documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the top 100 results of any query.
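The document representation described in the abstract can be illustrated with a minimal Python sketch. It assumes the query log is already available as a mapping from each query to its ranked result list. The function names, the toy queries, and the document identifiers are hypothetical and do not come from the paper, and the co-clustering step is not shown.

```python
from collections import defaultdict


def build_query_vectors(results_by_query):
    """Map each document to {query: rank} for every query that returned it.

    `results_by_query` is an assumed input format: query -> ranked list of
    document ids (best first), e.g. the top results stored in a query log.
    """
    vectors = defaultdict(dict)
    for query, ranked_docs in results_by_query.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            vectors[doc_id][query] = rank
    return dict(vectors)


def split_unrecalled(all_doc_ids, query_vectors):
    """Separate documents returned by at least one query from the rest."""
    recalled = [d for d in all_doc_ids if d in query_vectors]
    unrecalled = [d for d in all_doc_ids if d not in query_vectors]
    return recalled, unrecalled


# Toy data, purely illustrative.
results_by_query = {
    "jaguar speed": ["d1", "d3"],
    "jaguar car": ["d2", "d3"],
}
all_doc_ids = ["d1", "d2", "d3", "d4"]

query_vectors = build_query_vectors(results_by_query)
recalled, unrecalled = split_unrecalled(all_doc_ids, query_vectors)
```

In this toy setting, `d4` is matched by no query, so it would fall into the group of rarely requested documents that the abstract suggests indexing separately. The remaining query vectors are the input on which a co-clustering algorithm of one's choice would operate.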