3rd International ICST Conference on Scalable Information Systems

Research Article

Analysis of Varying Approaches to Topical Web Query Classification

  • @INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3487,
        author={Steven M. Beitzel and Eric C. Jensen and Abdur Chowdhury and Ophir Frieder},
        title={Analysis of Varying Approaches to Topical Web Query Classification},
        proceedings={3rd International ICST Conference on Scalable Information Systems},
        publisher={ICST},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={query classification web search},
        doi={10.4108/ICST.INFOSCALE2008.3487}
    }
    
Steven M. Beitzel (1,*), Eric C. Jensen (2,*), Abdur Chowdhury (2,*), Ophir Frieder (3,*)
  • 1: Telcordia Technologies, Inc.
  • 2: Summize, Inc.
  • 3: Illinois Institute of Technology
*Contact email: steve@research.telcordia.com, ej@summize.com, abdur@summize.com, ophir@ir.iit.edu

Abstract

Topical classification of web queries has drawn recent interest from forums such as the 2005 KDD Cup because of its promise for improving retrieval effectiveness and efficiency. Many proposed techniques use documents classified in taxonomies, such as the Open Directory Project (ODP, http://www.dmoz.org), to infer the class of a web query. Implicit in these approaches is the assumption that topically classifying queries is equivalent to the general topical text classification task, albeit with few features directly available from such short queries. We test this assumption by comparing and combining three kinds of classifiers: those trained directly from manually classified queries and their retrieved documents, those trained from categorized documents in the ODP, and those induced from unlabeled query logs for pre-retrieval classification. We find that training classifiers directly from manually classified queries outperforms the best general topical classifier by 48% in relative F1 score. We attribute this to a task mismatch that arises when a general classifier is applied to queries. For example, a typically vague web query classified as "business" is likely to retrieve documents classified as "news" and "organizations" in addition to those labeled "business." Equating a "business" class of queries with a "business" class of documents is therefore not appropriate.
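
To make the "relative F1 score" comparison concrete, the following minimal sketch shows how such a figure could be computed. It assumes the standard definition of F1 as the harmonic mean of precision and recall. The function names and the precision and recall values are hypothetical illustrations, not results or code from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional gain of candidate over baseline (0.48 would mean 48%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


# Hypothetical scores for illustration only; NOT taken from the paper.
general_f1 = f1_score(precision=0.5, recall=0.5)  # general topical classifier
query_f1 = f1_score(precision=0.8, recall=0.8)    # query-trained classifier

gain = relative_improvement(general_f1, query_f1)
print(f"relative F1 gain: {gain:.0%}")
```

A relative gain is measured against the baseline score, so it differs from the absolute difference between the two F1 values.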