Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval

Lixin Shi; Jian-Yun Nie; Jing Bai

2nd International ICST Conference on Scalable Information Systems

Research Article

@INPROCEEDINGS{10.4108/infoscale.2007.932,
    author={Lixin Shi and Jian-Yun Nie and Jing Bai},
    title={Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval},
    proceedings={2nd International ICST Conference on Scalable Information Systems},
    proceedings_a={INFOSCALE},
    year={2010},
    month={5},
    keywords={CLIR, Language Model, Parallel Corpus, Translation Model, Translation Unit},
    doi={10.4108/infoscale.2007.932}
}

Lixin Shi, Jian-Yun Nie, Jing Bai (2010). Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval. INFOSCALE. ICST. DOI: 10.4108/infoscale.2007.932

Lixin Shi¹,*, Jian-Yun Nie¹,*, Jing Bai¹,*

1: Département d'informatique et de recherche opérationnelle, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, Québec, H3C 3J7, Canada

*Contact email: shilixin@iro.umontreal.ca, nie@iro.umontreal.ca, baijing@iro.umontreal.ca

Abstract

Although both words and character n-grams have been used in Chinese IR, they have usually been treated as two competing methods. For cross-language IR (CLIR) with Chinese, all previous studies have used word translation. In this paper, we re-examine the use of n-grams and words for monolingual Chinese IR and show that both types of indexing units can be combined within the language modeling framework to produce higher retrieval effectiveness. For CLIR with Chinese, we investigate the possibility of using bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams, and words are created from a parallel corpus. An English query is then translated in several ways, each producing a ranking score, and the final ranking score combines all these types of translation. Our experiments on several collections show that Chinese character n-grams are reasonable alternative translation units to words and lead to retrieval effectiveness comparable to that of words. In addition, combining words and n-grams produces higher effectiveness.
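
As a rough illustration of the units discussed in the abstract, the sketch below extracts overlapping character unigrams and bigrams from a Chinese string and combines per-unit ranking scores. The function names `char_ngrams` and `combine_scores`, the weights, and the normalized weighted sum are assumptions made for illustration only; the abstract does not specify how the scores are combined.

```python
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-unit ranking scores with a normalized weighted sum.

    This combination rule is a hypothetical stand-in, not the paper's method.
    """
    total = sum(weights.values())
    return sum(weights[unit] * scores[unit] for unit in weights) / total


# Example: the two n-gram unit types for one short Chinese string.
text = "信息检索"
unigrams = char_ngrams(text, 1)  # one entry per character
bigrams = char_ngrams(text, 2)   # overlapping pairs of adjacent characters

# Hypothetical scores for one document under each translation unit.
doc_scores = {"word": 1.0, "bigram": 3.0}
unit_weights = {"word": 0.5, "bigram": 0.5}
final_score = combine_scores(doc_scores, unit_weights)
```

Overlapping n-grams need no word segmenter, which is one practical reason they are considered as an alternative to words here.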

Keywords: CLIR, Language Model, Parallel Corpus, Translation Model, Translation Unit

Published: 2010-05-16
Modified: 2011-09-11

http://dx.doi.org/10.4108/infoscale.2007.932
