3rd International ICST Conference on Scalable Information Systems

Research Article

Optimization Issues in Inverted Index-based Entity Annotation

Download614 downloads
        author={Ganesh Ramakrishnan and Sachindra Joshi and Sanjeet Khaitan and Sreeram Balakrishnan},
        title={Optimization Issues in Inverted Index-based Entity Annotation},
        proceedings={3rd International ICST Conference on Scalable Information Systems},
        keywords={Inverted index evaluation plan cost estimate},
  • Ganesh Ramakrishnan
    Sachindra Joshi
    Sanjeet Khaitan
    Sreeram Balakrishnan
    Year: 2010
    Optimization Issues in Inverted Index-based Entity Annotation
    DOI: 10.4108/ICST.INFOSCALE2008.3475
Ganesh Ramakrishnan1,*, Sachindra Joshi1,*, Sanjeet Khaitan2,*, Sreeram Balakrishnan3,*
  • 1: IBM India Research Lab, New Delhi, INDIA
  • 2: InfoSpace Inc., Bangalore, INDIA
  • 3: IBM Software Group, San Jose, CA, United States
*Contact email: ganramkr@in.ibm.com, jsachind@in.ibm.com, sanjeet.khaitan@infospace.com, sreevb@us.ibm.co


Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on ‘John’s address’, that returns matches to all entities annotated as an address that co-occur with ‘John’. A dominant paradigm adopted by rulebased named entity annotators is to annotate a document at a time. The complexity of this approach varies linearly with the number of documents and the cost for annotating each document, which could be prohibiting for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16], operates on the inverted index of a document collection and achieves an order of magnitude speed-up over the document-based counterpart. In addition the index based approach permits collection level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules and provide effective heuristics for handling this case. Our empirical evaluations show a speed-up factor upto 2 over the baseline system without optimizations.