Mining Query Logs to Optimize Index Partitioning in Parallel Web Search Engines

Claudio Lucchese; Salvatore Orlando; Raffaele Perego; Fabrizio Silvestri

2nd International ICST Conference on Scalable Information Systems

Research Article




@INPROCEEDINGS{10.4108/infoscale.2007.227,
    author={Claudio Lucchese and Salvatore Orlando and Raffaele Perego and Fabrizio Silvestri},
    title={Mining Query Logs to Optimize Index Partitioning in Parallel Web Search Engines},
    proceedings={2nd International ICST Conference on Scalable Information Systems},
    proceedings_a={INFOSCALE},
    year={2010},
    month={5},
    keywords={},
    doi={10.4108/infoscale.2007.227}
}

DOI: 10.4108/infoscale.2007.227

Claudio Lucchese¹,*, Salvatore Orlando²,*, Raffaele Perego³,*, Fabrizio Silvestri³,*

1: Dipartimento di Informatica, Università Ca’ Foscari di Venezia, Venezia, Italy
2: Dipartimento di Informatica, Università Ca’ Foscari di Venezia, Venezia, Italy
3: ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy

*Contact email: c.lucchese@isti.cnr.it, orlando@dsi.unive.it, r.perego@isti.cnr.it, f.silvestri@isti.cnr.it

Abstract

Large-scale Parallel Web Search Engines (WSEs) need to adopt a strategy for partitioning the inverted index among a set of parallel server nodes. In this paper we are interested in devising an effective term-partitioning strategy, in which the global vocabulary of terms and the associated inverted lists are split into disjoint subsets and assigned to distinct servers. Because the skewed distribution of terms in user queries causes workload imbalance, finding an effective partitioning strategy is considered a very complex task. We first formally introduce Term Partitioning as a new optimization problem. We then show how knowledge mined from past WSE query logs can be profitably used to discover good solutions to this problem. Finally, we report results showing that we can effectively reduce both the average number of servers activated per query and the workload imbalance. Experiments are conducted on large query logs of real WSEs.
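
The two quantities named in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the function name `evaluate_partitioning`, the toy query log, and the choice to measure a server's load as the number of query-term occurrences it handles are all assumptions made for illustration only.

```python
from collections import Counter


def evaluate_partitioning(queries, term_to_server, num_servers):
    """Score a term-to-server assignment against a query log.

    Returns a pair:
      - average number of distinct servers activated per query
      - load imbalance, as the ratio of the maximum to the mean server load
    Load is counted as query-term occurrences per server (an assumption,
    not a cost model taken from the paper).
    """
    load = Counter()
    activated_total = 0
    for query in queries:
        # A server is activated if it holds at least one term of the query.
        servers = {term_to_server[term] for term in query}
        activated_total += len(servers)
        for term in query:
            load[term_to_server[term]] += 1
    avg_servers = activated_total / len(queries)
    mean_load = sum(load.values()) / num_servers
    imbalance = max(load.values()) / mean_load
    return avg_servers, imbalance


# Hypothetical toy query log: each query is a list of terms.
queries = [
    ["web", "search"],
    ["web", "index"],
    ["query", "log"],
    ["web", "search", "engine"],
]

# Assignment A groups terms that co-occur in queries on the same server.
assignment_a = {"web": 0, "search": 0, "engine": 0,
                "index": 1, "query": 1, "log": 1}

# Assignment B spreads co-occurring terms across servers.
assignment_b = {"web": 0, "search": 1, "engine": 0,
                "index": 1, "query": 0, "log": 1}

for name, assignment in [("A", assignment_a), ("B", assignment_b)]:
    avg_servers, imbalance = evaluate_partitioning(queries, assignment, 2)
    print(f"{name}: servers/query={avg_servers:.2f}, imbalance={imbalance:.2f}")
```

On this toy log, assignment A activates fewer servers per query but concentrates more load on one server, while assignment B balances load better at the cost of activating more servers. This tension between the two objectives is what makes the choice of partitioning non-trivial.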

Published: 2010-05-16
Modified: 2011-09-11

http://dx.doi.org/10.4108/infoscale.2007.227
