Heuristic Learning of Rules for Information Extraction from Web Documents

Dawei Hu; Huan Li; Tianyong Hao; Enhong Chen; Liu Wenyin

2nd International ICST Conference on Scalable Information Systems

Research Article

@INPROCEEDINGS{10.4108/infoscale.2007.214,
    author={Dawei Hu and Huan Li and Tianyong Hao and Enhong Chen and Liu Wenyin},
    title={Heuristic Learning of Rules for Information Extraction from Web Documents},
    proceedings={2nd International ICST Conference on Scalable Information Systems},
    proceedings_a={INFOSCALE},
    year={2010},
    month={5},
    keywords={Information Extraction (IE), Conditional Entropy, Extraction Rule},
    doi={10.4108/infoscale.2007.214}
}

DOI: 10.4108/infoscale.2007.214

Dawei Hu^1,2,3^,*, Huan Li^1,2,3^,*, Tianyong Hao^3^,*, Enhong Chen^1,2^,*, Liu Wenyin^2,3^,*

1: Department of Computer Science and Technology, University of Science & Technology of China, Hefei, China
2: Joint Research Lab of Excellence, CityU-USTC Advanced Research Institute, Suzhou, China
3: Department of Computer Science, City University of Hong Kong, Hong Kong, China

*Contact email: dwhu@mail.ustc.edu.cn, huanl@mail.ustc.edu.cn, tianyong@cityu.edu.hk, ehchen@ustc.edu.cn, csliuwy@cityu.edu.hk

Abstract

The efficacy of an information extraction system is determined largely by the quality of its extraction rules. Building these rules by hand is time-consuming and difficult. We therefore propose a Heuristic Rule Learning (HRL) algorithm that automatically and efficiently acquires high-quality extraction rules from a user-labeled training corpus. The algorithm also keeps each rule at the most suitable generalization level to improve extraction efficacy. In HRL, a Dynamic tErm eXtraction Technique (DEXT) constructs terms and extraction rules at different generalization levels. A conditional entropy model then evaluates how suitable each generalization level is, so that only high-quality rules are retained. Experimental results show that the algorithm is effective at acquiring extraction rules at different generalization levels, and that the learned rules are effective in information extraction tasks.
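The abstract does not give the scoring formula, so the following is only a minimal sketch of how conditional entropy could be used to compare candidate rules at different generalization levels. It assumes each training token is reduced to a pair (rule fired, token is a target), and it computes the standard quantity H(Y | X). The function name, the toy corpora, and the pair encoding are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Return H(Y | X) in bits for a sequence of (x, y) observations."""
    pairs = list(pairs)
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    xy_counts = Counter(pairs)
    h = 0.0
    for (x, _y), count in xy_counts.items():
        p_xy = count / n                   # joint probability p(x, y)
        p_y_given_x = count / x_counts[x]  # conditional probability p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h


# Hypothetical toy corpora: each pair is (rule_fired, token_is_target).
# A rule that is too specific misses half of the targets.
too_specific = [(True, True)] * 3 + [(False, True)] * 3 + [(False, False)] * 4
# A well-generalized rule fires on exactly the targets.
well_generalized = [(True, True)] * 6 + [(False, False)] * 4
# A rule that is too general fires on every token.
too_general = [(True, True)] * 6 + [(True, False)] * 4

scores = {
    "too_specific": conditional_entropy(too_specific),
    "well_generalized": conditional_entropy(well_generalized),
    "too_general": conditional_entropy(too_general),
}
best = min(scores, key=scores.get)
```

Under this assumed encoding, a lower H(Y | X) means the rule's firing pattern leaves less uncertainty about which tokens are targets, so `best` selects the candidate with the smallest score. How HRL actually combines this measure with DEXT is described in the full paper, not here.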

Keywords: Information Extraction (IE), Conditional Entropy, Extraction Rule

Published: 2010-05-16
Modified: 2011-09-11
