Research Article
Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval
@INPROCEEDINGS{10.4108/infoscale.2007.932,
  author={Lixin Shi and Jian-Yun Nie and Jing Bai},
  title={Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval},
  proceedings={2nd International ICST Conference on Scalable Information Systems},
  proceedings_a={INFOSCALE},
  year={2010},
  month={5},
  keywords={CLIR, Language Model, Parallel Corpus, Translation Model, Translation Unit},
  doi={10.4108/infoscale.2007.932}
}
Lixin Shi
Jian-Yun Nie
Jing Bai
Year: 2010
INFOSCALE
ICST
DOI: 10.4108/infoscale.2007.932
Abstract
Although both words and character n-grams have been used in Chinese IR, they have often been treated as two competing methods. For cross-language IR (CLIR) with Chinese, word translation has been used in all previous studies. In this paper, we re-examine the use of n-grams and words for monolingual Chinese IR. We show that both types of indexing units can be combined within the language modeling framework to produce higher retrieval effectiveness. For CLIR with Chinese, we investigate the possibility of using bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are created based on a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines all these types of translation. Our experiments on several collections show that Chinese character n-grams are reasonable alternative translation units to words, and that they lead to retrieval effectiveness comparable to that of words. In addition, combinations of both words and n-grams produce higher effectiveness.
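To illustrate the units discussed in the abstract, the following minimal Python sketch extracts overlapping character unigrams and bigrams from a Chinese string and merges per-unit ranking scores into one final score. The abstract does not give the combination formula, so the weighted linear combination here is an assumption. The function names `char_ngrams` and `combine_scores`, the weights, and the score values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return overlapping character n-grams of a string (no word segmentation)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear combination of per-unit ranking scores.

    This form is an assumption; the abstract only says the scores are combined.
    """
    total = sum(weights.values())
    return sum(weights[unit] * scores[unit] for unit in weights) / total


text = "信息检索"  # "information retrieval"
unigrams = char_ngrams(text, 1)
bigrams = char_ngrams(text, 2)

# Hypothetical per-unit scores for one document (made-up values, not results).
doc_scores = {"word": -4.0, "bigram": -5.0, "unigram": -6.0}
# Hypothetical interpolation weights; in practice these would be tuned.
weights = {"word": 0.5, "bigram": 0.25, "unigram": 0.25}
final_score = combine_scores(doc_scores, weights)
```

Character n-grams need no word segmenter, which is one reason they are a natural alternative to words as indexing and translation units for Chinese.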