Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval

Gleb Skobeltsyn; Toan Luu; Ivana Podnar Zarko; Martin Rajman; Karl Aberer

2nd International ICST Conference on Scalable Information Systems

Research Article

Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval

Download612 downloads

Cite: BibTeX Plain Text

@INPROCEEDINGS{10.4108/infoscale.2007.881,
    author={Gleb Skobeltsyn and Toan Luu and Ivana Podnar Zarko and Martin Rajman and Karl Aberer},
    title={Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval},
    proceedings={2nd International ICST Conference on Scalable Information Systems},
    proceedings_a={INFOSCALE},
    year={2010},
    month={5},
    keywords={P2P DHT IR Query-Driven index Scalability.},
    doi={10.4108/infoscale.2007.881}
}

Gleb Skobeltsyn
Toan Luu
Ivana Podnar Zarko
Martin Rajman
Karl Aberer
Year: 2010
Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval
INFOSCALE
ICST
DOI: 10.4108/infoscale.2007.881

Gleb Skobeltsyn¹^,*, Toan Luu¹^,*, Ivana Podnar Zarko²^,*, Martin Rajman¹^,*, Karl Aberer¹^,*

1: Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne, Switzerland
2: University of Zagreb Zagreb, Croatia

*Contact email: gleb.skobeltsyn@epfl.ch, vinhtoan.luu@epfl.ch, ivana.podnar@fer.hr, martin.rajman@epfl.ch, karl.aberer@epfl.ch

Abstract

We present a query-driven algorithm for the distributed in- dexing of large document collections within structured P2P networks. To cope with bandwidth consumption that has been identi¯ed as the major problem for the standard P2P approach with single term indexing, we leverage a distributed index that stores up to top-k document references only for carefully chosen indexing term combinations. In addition, since the number of possible term combinations extracted from a document collection can be very large, we propose to use query statistics to index only such combinations that are indeed frequently requested by the users. Thus, by avoid- ing the maintenance of super°uous indexing information, we achieve a substantial reduction in bandwidth and storage. A speci¯c activation mechanism is applied to continuously up- date the indexing information according to changes in the query distribution, resulting in an e±cient, constantly evolv- ing query-driven indexing structure. We show that the size of the index and the generated indexing/retrieval tra±c remains manageable even for web- size document collections at the price of a marginal loss in precision for rare queries. Our theoretical analysis and ex- perimental results provide convincing evidence about the feasibility of the query-driven indexing strategy for large scale P2P text retrieval. Moreover, our experiments con¯rm that the retrieval performance is only slightly lower than the one obtained with state-of-the-art centralized query engines.

Keywords: P2P DHT IR Query-Driven index Scalability.

Published: 2010-05-16
Modified: 2011-09-11

: http://dx.doi.org/10.4108/infoscale.2007.881