Research Article
Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval
@INPROCEEDINGS{10.4108/infoscale.2007.881, author={Gleb Skobeltsyn and Toan Luu and Ivana Podnar Zarko and Martin Rajman and Karl Aberer}, title={Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval}, proceedings={2nd International ICST Conference on Scalable Information Systems}, proceedings_a={INFOSCALE}, year={2010}, month={5}, keywords={P2P DHT IR Query-Driven index Scalability.}, doi={10.4108/infoscale.2007.881} }
- Gleb Skobeltsyn
Toan Luu
Ivana Podnar Zarko
Martin Rajman
Karl Aberer
Year: 2010
Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval
INFOSCALE
ICST
DOI: 10.4108/infoscale.2007.881
Abstract
We present a query-driven algorithm for the distributed in- dexing of large document collections within structured P2P networks. To cope with bandwidth consumption that has been identi¯ed as the major problem for the standard P2P approach with single term indexing, we leverage a distributed index that stores up to top-k document references only for carefully chosen indexing term combinations. In addition, since the number of possible term combinations extracted from a document collection can be very large, we propose to use query statistics to index only such combinations that are indeed frequently requested by the users. Thus, by avoid- ing the maintenance of super°uous indexing information, we achieve a substantial reduction in bandwidth and storage. A speci¯c activation mechanism is applied to continuously up- date the indexing information according to changes in the query distribution, resulting in an e±cient, constantly evolv- ing query-driven indexing structure. We show that the size of the index and the generated indexing/retrieval tra±c remains manageable even for web- size document collections at the price of a marginal loss in precision for rare queries. Our theoretical analysis and ex- perimental results provide convincing evidence about the feasibility of the query-driven indexing strategy for large scale P2P text retrieval. Moreover, our experiments con¯rm that the retrieval performance is only slightly lower than the one obtained with state-of-the-art centralized query engines.