Third International Conference on Advances in Communication, Network and Computing

Research Article

Similarity Based Web Data Extraction and Integration System for Web Content Mining

  • @INPROCEEDINGS{10.1007/978-3-642-35615-5_41,
        author={Srikantaiah K.C. and Suraj M. and Venugopal K.R. and Iyengar S.S. and L. Patnaik},
        title={Similarity Based Web Data Extraction and Integration System for Web Content Mining},
        booktitle={Third International Conference on Advances in Communication, Network and Computing},
        proceedings_a={CNC},
        year={2012},
        month={12},
        keywords={Offline Browsing, Web Data Extraction, Web Data Integration, World Wide Web, Web Wrapper},
        doi={10.1007/978-3-642-35615-5_41}
    }
    
  • Srikantaiah K.C.
    Suraj M.
    Venugopal K.R.
    Iyengar S.S.
    L. Patnaik
    Year: 2012
    Similarity Based Web Data Extraction and Integration System for Web Content Mining
    CNC
    Springer
    DOI: 10.1007/978-3-642-35615-5_41
Srikantaiah K.C.1,*, Suraj M.2, Venugopal K.R.1, Iyengar S.S.3, L. Patnaik4
  • 1: University Visvesvaraya College of Engineering, Bangalore University
  • 2: SJB Institute of Technology
  • 3: Florida International University
  • 4: Indian Institute of Science
*Contact email: Srikantaiahkc@gmail.com

Abstract

The Internet is a major source of the information we need, but information on the web cannot readily be analyzed and queried according to user requests. We propose and develop a similarity-based web data extraction and integration system (WDES and WDICS) that extracts search result pages from the web and integrates their contents so that the user can perform the intended analysis. The system replicates search result pages locally in a form convenient for offline browsing, and it operates in two phases to perform this task. We develop and implement algorithms for extracting and integrating web content. Experiments on Bluetooth product listings show better precision and recall than DEPTA [1].
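
The abstract compares systems by precision and recall without stating how they are computed. The sketch below uses the standard set-based definitions: precision is the fraction of extracted records that are correct, and recall is the fraction of correct records that were extracted. The function name `precision_recall` and the sample records are hypothetical, chosen for illustration only; they are not the paper's code or data.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for extracted records against ground truth.

    Both arguments are iterables of hashable record identifiers.
    An empty denominator yields 0.0 rather than raising an error.
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)  # records that were both extracted and relevant
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical records (assumption: not taken from the paper's experiments).
extracted = {"headset A", "dongle B", "speaker C", "ad banner"}
relevant = {"headset A", "dongle B", "speaker C", "keyboard D", "mouse E"}

p, r = precision_recall(extracted, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.75 recall=0.60"
```

Here three of the four extracted records are correct (precision 0.75), and three of the five relevant records were found (recall 0.60).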