2nd International ICST Conference on Scalable Information Systems

Research Article

Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval

  • @INPROCEEDINGS{10.4108/infoscale.2007.932,
        author={Lixin Shi and Jian-Yun Nie and Jing Bai},
        title={Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval},
        proceedings={2nd International ICST Conference on Scalable Information Systems},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={CLIR Language Model Parallel Corpus Translation Model Translation Unit},
        doi={10.4108/infoscale.2007.932}
    }
    
Lixin Shi1,*, Jian-Yun Nie1,*, Jing Bai1,*
  • 1: Département d'informatique et de recherche opérationnelle, Université de Montréal C.P. 6128, succursale Centre-ville, Montréal, Québec, H3C 3J7 Canada
*Contact email: shilixin@iro.umontreal.ca, nie@iro.umontreal.ca, baijing@iro.umontreal.ca

Abstract

Although both words and character n-grams have been used in Chinese IR, they have often been treated as two competing methods. For cross-language IR (CLIR) with Chinese, all previous studies have used word translation. In this paper, we re-examine the use of n-grams and words for monolingual Chinese IR. We show that both types of indexing unit can be combined within the language modeling framework to produce higher retrieval effectiveness. For CLIR with Chinese, we investigate the possibility of using bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are created based on a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines all these types of translation. Our experiments on several collections show that Chinese character n-grams are reasonable alternative translation units to words, and that they lead to retrieval effectiveness comparable to that of words. In addition, combining words and n-grams produces higher effectiveness.
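
As a rough illustration of the units discussed in the abstract, the sketch below extracts character unigrams and overlapping bigrams from a Chinese string and merges per-unit ranking scores with a weighted sum. The abstract does not state how the scores are combined, so the linear interpolation, the function names `char_ngrams` and `combine_scores`, and the example weights are all assumptions made for illustration, not the authors' method.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def combine_scores(scores, weights):
    """Weighted average of per-unit ranking scores.

    Assumption: a simple linear interpolation; the weights are hypothetical
    and are normalized here so that they sum to 1.
    """
    total = sum(weights[unit] for unit in scores)
    return sum(weights[unit] * scores[unit] for unit in scores) / total


query = "信息检索"  # "information retrieval"
unigrams = char_ngrams(query, 1)
bigrams = char_ngrams(query, 2)

# Hypothetical scores for one document under each translation unit.
doc_scores = {"word": 1.0, "bigram": 3.0}
unit_weights = {"word": 0.5, "bigram": 0.5}
final_score = combine_scores(doc_scores, unit_weights)
```

In practice the interpolation weights would need to be tuned on held-out queries; equal weights are used here only to keep the example small.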