2nd International ICST Conference on Scalable Information Systems

Research Article

Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval

Download474 downloads
  • @INPROCEEDINGS{10.4108/infoscale.2007.881,
        author={Gleb Skobeltsyn and Toan Luu and Ivana Podnar Zarko and Martin Rajman and Karl Aberer},
        title={Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval},
        proceedings={2nd International ICST Conference on Scalable Information Systems},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={P2P DHT IR Query-Driven index Scalability.},
        doi={10.4108/infoscale.2007.881}
    }
    
  • Gleb Skobeltsyn
    Toan Luu
    Ivana Podnar Zarko
    Martin Rajman
    Karl Aberer
    Year: 2010
    Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval
    INFOSCALE
    ICST
    DOI: 10.4108/infoscale.2007.881
Gleb Skobeltsyn1,*, Toan Luu1,*, Ivana Podnar Zarko2,*, Martin Rajman1,*, Karl Aberer1,*
  • 1: Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne, Switzerland
  • 2: University of Zagreb Zagreb, Croatia
*Contact email: gleb.skobeltsyn@epfl.ch, vinhtoan.luu@epfl.ch, ivana.podnar@fer.hr, martin.rajman@epfl.ch, karl.aberer@epfl.ch

Abstract

We present a query-driven algorithm for the distributed in- dexing of large document collections within structured P2P networks. To cope with bandwidth consumption that has been identi¯ed as the major problem for the standard P2P approach with single term indexing, we leverage a distributed index that stores up to top-k document references only for carefully chosen indexing term combinations. In addition, since the number of possible term combinations extracted from a document collection can be very large, we propose to use query statistics to index only such combinations that are indeed frequently requested by the users. Thus, by avoid- ing the maintenance of super°uous indexing information, we achieve a substantial reduction in bandwidth and storage. A speci¯c activation mechanism is applied to continuously up- date the indexing information according to changes in the query distribution, resulting in an e±cient, constantly evolv- ing query-driven indexing structure. We show that the size of the index and the generated indexing/retrieval tra±c remains manageable even for web- size document collections at the price of a marginal loss in precision for rare queries. Our theoretical analysis and ex- perimental results provide convincing evidence about the feasibility of the query-driven indexing strategy for large scale P2P text retrieval. Moreover, our experiments con¯rm that the retrieval performance is only slightly lower than the one obtained with state-of-the-art centralized query engines.