2nd International ICST Conference on Scalable Information Systems

Research Article

Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval

  • @INPROCEEDINGS{10.4108/infoscale.2007.932,
        author={Lixin Shi and Jian-Yun Nie and Jing Bai},
        title={Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval},
        proceedings={2nd International ICST Conference on Scalable Information Systems},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={CLIR, Language Model, Parallel Corpus, Translation Model, Translation Unit},
        doi={10.4108/infoscale.2007.932}
    }
    
  • Lixin Shi
    Jian-Yun Nie
    Jing Bai
    Year: 2010
    Comparing Different Units for Query Translation in Chinese Cross-Language Information Retrieval
    INFOSCALE
    ICST
    DOI: 10.4108/infoscale.2007.932
Lixin Shi¹,*, Jian-Yun Nie¹,*, Jing Bai¹,*
  • ¹ Département d'informatique et de recherche opérationnelle, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, Québec, H3C 3J7, Canada
*Contact email: shilixin@iro.umontreal.ca, nie@iro.umontreal.ca, baijing@iro.umontreal.ca

Abstract

Although both words and character n-grams have been used in Chinese IR, they have often been treated as two competing methods. For cross-language IR (CLIR) with Chinese, word translation has been used in all previous studies. In this paper, we re-examine the use of n-grams and words for monolingual Chinese IR. We show that both types of indexing units can be combined within the language modeling framework to produce higher retrieval effectiveness. For CLIR with Chinese, we investigate the possibility of using bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams and words are created based on a parallel corpus. An English query is then translated in several ways, each producing a ranking score. The final ranking score combines all these types of translation. Our experiments on several collections show that Chinese character n-grams are a reasonable alternative to words as translation units, and that they lead to retrieval effectiveness comparable to that of words. In addition, combinations of both words and n-grams produce higher effectiveness.
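
As a rough illustration of the units discussed in the abstract, the following Python sketch splits a Chinese string into overlapping character unigrams and bigrams and then merges per-unit ranking scores. The abstract does not say how the scores are combined, so the weighted sum, the function names `char_ngrams` and `combine_scores`, and the weights and scores used here are all hypothetical and are not taken from the paper.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def combine_scores(scores, weights):
    """Weighted sum of per-unit ranking scores.

    Assumption: a linear combination is used purely for illustration;
    the paper's actual combination method is not given in the abstract.
    """
    return sum(weights[unit] * scores[unit] for unit in scores)


query = "信息检索"  # "information retrieval"
unigrams = char_ngrams(query, 1)  # single characters
bigrams = char_ngrams(query, 2)   # overlapping character pairs

# Hypothetical log-scale scores for one document under each translation unit.
scores = {"word": -2.0, "bigram": -3.0, "unigram": -4.0}
# Hypothetical interpolation weights; in practice these would be tuned.
weights = {"word": 0.5, "bigram": 0.3, "unigram": 0.2}
final_score = combine_scores(scores, weights)
```

Overlapping bigrams need no word segmenter, which is one reason they are a natural candidate to compare against segmented words as translation units.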

Keywords
CLIR, Language Model, Parallel Corpus, Translation Model, Translation Unit
Published
2010-05-16
Modified
2011-09-11
http://dx.doi.org/10.4108/infoscale.2007.932