Research Article
Optimization Issues in Inverted Index-based Entity Annotation
@INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3475, author={Ganesh Ramakrishnan and Sachindra Joshi and Sanjeet Khaitan and Sreeram Balakrishnan}, title={Optimization Issues in Inverted Index-based Entity Annotation}, proceedings={3rd International ICST Conference on Scalable Information Systems}, publisher={ICST}, proceedings_a={INFOSCALE}, year={2010}, month={5}, keywords={Inverted index evaluation plan cost estimate}, doi={10.4108/ICST.INFOSCALE2008.3475} }
- Ganesh Ramakrishnan
Sachindra Joshi
Sanjeet Khaitan
Sreeram Balakrishnan
Year: 2010
Optimization Issues in Inverted Index-based Entity Annotation
INFOSCALE
ICST
DOI: 10.4108/ICST.INFOSCALE2008.3475
Abstract
Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on ‘John’s address’, that returns matches to all entities annotated as an address that co-occur with ‘John’. A dominant paradigm adopted by rulebased named entity annotators is to annotate a document at a time. The complexity of this approach varies linearly with the number of documents and the cost for annotating each document, which could be prohibiting for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16], operates on the inverted index of a document collection and achieves an order of magnitude speed-up over the document-based counterpart. In addition the index based approach permits collection level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules and provide effective heuristics for handling this case. Our empirical evaluations show a speed-up factor upto 2 over the baseline system without optimizations.