Optimization Issues in Inverted Index-based Entity Annotation

Ganesh Ramakrishnan; Sachindra Joshi; Sanjeet Khaitan; Sreeram Balakrishnan

3rd International ICST Conference on Scalable Information Systems

Research Article

Optimization Issues in Inverted Index-based Entity Annotation

Download1146 downloads

Cite: BibTeX Plain Text

@INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3475,
    author={Ganesh Ramakrishnan and Sachindra Joshi and Sanjeet Khaitan and Sreeram Balakrishnan},
    title={Optimization Issues in Inverted Index-based Entity Annotation},
    proceedings={3rd International ICST Conference on Scalable Information Systems},
    publisher={ICST},
    proceedings_a={INFOSCALE},
    year={2010},
    month={5},
    keywords={Inverted index evaluation plan cost estimate},
    doi={10.4108/ICST.INFOSCALE2008.3475}
}

Ganesh Ramakrishnan
Sachindra Joshi
Sanjeet Khaitan
Sreeram Balakrishnan
Year: 2010
Optimization Issues in Inverted Index-based Entity Annotation
INFOSCALE
ICST
DOI: 10.4108/ICST.INFOSCALE2008.3475

Ganesh Ramakrishnan¹^,*, Sachindra Joshi¹^,*, Sanjeet Khaitan²^,*, Sreeram Balakrishnan³^,*

1: IBM India Research Lab, New Delhi, INDIA
2: InfoSpace Inc., Bangalore, INDIA
3: IBM Software Group, San Jose, CA, United States

*Contact email: ganramkr@in.ibm.com, jsachind@in.ibm.com, sanjeet.khaitan@infospace.com, sreevb@us.ibm.co

Abstract

Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on ‘John’s address’, that returns matches to all entities annotated as an address that co-occur with ‘John’. A dominant paradigm adopted by rulebased named entity annotators is to annotate a document at a time. The complexity of this approach varies linearly with the number of documents and the cost for annotating each document, which could be prohibiting for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16], operates on the inverted index of a document collection and achieves an order of magnitude speed-up over the document-based counterpart. In addition the index based approach permits collection level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules and provide effective heuristics for handling this case. Our empirical evaluations show a speed-up factor upto 2 over the baseline system without optimizations.

Keywords: Inverted index, evaluation plan, cost estimate

Published: 2010-05-16
Publisher: ICST
Modified: 2010-05-16

: http://dx.doi.org/10.4108/ICST.INFOSCALE2008.3475

Optimization Issues in Inverted Index-based Entity Annotation

Abstract

About EAI

Community

Publish with EAI