Forensics in Telecommunications, Information, and Multimedia. Third International ICST Conference, e-Forensics 2010, Shanghai, China, November 11-12, 2010, Revised Selected Papers

Research Article

Text Content Filtering Based on Chinese Character Reconstruction from Radicals

  • @INPROCEEDINGS{10.1007/978-3-642-23602-0_22,
        author={Wenlei He and Gongshen Liu and Jun Luo and Jiuchuan Lin},
        title={Text Content Filtering Based on Chinese Character Reconstruction from Radicals},
        proceedings={Forensics in Telecommunications, Information, and Multimedia. Third International ICST Conference, e-Forensics 2010, Shanghai, China, November 11-12, 2010, Revised Selected Papers},
        proceedings_a={E-FORENSICS},
        year={2012},
        month={10},
        keywords={Chinese character radical, multi-pattern matching, text filtering},
        doi={10.1007/978-3-642-23602-0_22}
    }
    
  • Wenlei He
    Gongshen Liu
    Jun Luo
    Jiuchuan Lin
    Year: 2012
    Text Content Filtering Based on Chinese Character Reconstruction from Radicals
    E-FORENSICS
    Springer
    DOI: 10.1007/978-3-642-23602-0_22
Wenlei He1, Gongshen Liu1, Jun Luo2, Jiuchuan Lin2
  • 1: Shanghai Jiao Tong University
  • 2: The Third Research Institute of Ministry of Public Security

Abstract

Content filtering through keyword matching is widely adopted in network censoring and has proven successful. However, a technique that bypasses this kind of censorship by decomposing Chinese characters has recently appeared. Chinese characters are combinations of radicals, and splitting characters into radicals poses a major obstacle to keyword filtering. To tackle this challenge, we propose the first filtering technology based on the combination of Chinese character radicals. We use a modified Rabin-Karp algorithm to reconstruct characters from radicals according to a Chinese character structure library. We then use another modified Rabin-Karp algorithm to filter keywords in massive text content. Experiments show that our approach identifies most keywords written as combinations of radicals and yields a visible improvement in filtering results compared with traditional keyword filtering.
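
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the authors modify Rabin-Karp, so the code below uses a textbook multi-pattern Rabin-Karp for both stages. The three-entry `STRUCTURE_LIBRARY`, the hash constants `BASE` and `MOD`, and the function names `find_all`, `reconstruct`, and `filter_text` are all assumptions made for illustration, not details from the paper.

```python
# Illustrative sketch only: textbook Rabin-Karp, not the paper's modified variant.
BASE = 257               # assumed polynomial hash base
MOD = (1 << 61) - 1      # assumed large prime modulus

# Hypothetical, tiny stand-in for a Chinese character structure library:
# maps a left-to-right radical sequence to the character it composes.
STRUCTURE_LIBRARY = {"女子": "好", "日月": "明", "木木": "林"}


def poly_hash(s):
    """Polynomial hash of a string over its code points."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def find_all(text, patterns):
    """Multi-pattern Rabin-Karp: return sorted (start, pattern) matches."""
    by_len = {}  # pattern length -> {hash -> set of patterns}
    for p in patterns:
        by_len.setdefault(len(p), {}).setdefault(poly_hash(p), set()).add(p)
    hits = []
    for m, table in by_len.items():
        if m == 0 or m > len(text):
            continue
        high = pow(BASE, m - 1, MOD)      # weight of the window's first char
        h = poly_hash(text[:m])
        for i in range(len(text) - m + 1):
            # Verify the substring on a hash hit to rule out collisions.
            if h in table and text[i:i + m] in table[h]:
                hits.append((i, text[i:i + m]))
            if i + m < len(text):         # roll the window one char right
                h = ((h - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return sorted(hits)


def reconstruct(text, library):
    """Replace radical sequences with composed characters (leftmost, longest first)."""
    hits = sorted(find_all(text, library.keys()), key=lambda t: (t[0], -len(t[1])))
    out, pos = [], 0
    for start, seq in hits:
        if start < pos:                   # overlaps an earlier replacement
            continue
        out.append(text[pos:start])
        out.append(library[seq])
        pos = start + len(seq)
    out.append(text[pos:])
    return "".join(out)


def filter_text(text, keywords, library):
    """Stage 1: rebuild characters from radicals. Stage 2: match keywords."""
    return find_all(reconstruct(text, library), keywords)
```

With these assumed entries, `filter_text("日月天见", ["明天"], STRUCTURE_LIBRARY)` finds the keyword 明天 that plain matching on the split text misses. The sketch is deliberately naive: it would also merge adjacent characters that legitimately appear side by side (女子 is itself a common word), an ambiguity that a practical system has to handle.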