sis 20(25): e6

Research Article

Optimised Transformation Algorithm For Hadoop Data Loading in Web ETL Framework

  • @ARTICLE{10.4108/eai.13-7-2018.160600,
        author={Gaurav Gupta and Neelesh Kumar and Indu Chhabra},
        title={Optimised Transformation Algorithm For Hadoop Data Loading in Web ETL Framework},
        journal={EAI Endorsed Transactions on Scalable Information Systems},
        volume={7},
        number={25},
        publisher={EAI},
        journal_a={SIS},
        year={2019},
        month={10},
        keywords={Redundant Data, Data Transformation, Data Loading, Levenshtein Distance Matching, Hadoop},
        doi={10.4108/eai.13-7-2018.160600}
    }
    
  • Gaurav Gupta
    Neelesh Kumar
    Indu Chhabra
    Year: 2019
    Optimised Transformation Algorithm For Hadoop Data Loading in Web ETL Framework
    SIS
    EAI
    DOI: 10.4108/eai.13-7-2018.160600
Gaurav Gupta1,*, Neelesh Kumar2, Indu Chhabra3
  • 1: Research Planning & Project Management, CSIR-Indian Institute of Petroleum, Dehradun, India
  • 2: BioMedical Instrumentation, CSIR-Central Scientific Instruments Organisation, Chandigarh, India
  • 3: Department of Computer Science & Applications, Panjab University, Chandigarh, India
*Contact email: gaurav.gupta@iip.res.in

Abstract

Unlike a conventional ETL framework, Web ETL requires considerable improvements in all three layers, i.e. Extraction, Transformation and Loading, owing to the inherent nature of web input data. Websites are a vast and unique source of information, and finding and analysing the required, relevant data within them is critical because the data may be dirty, containing redundant or misspelled records. Identifying records that represent the same real-world entity in many different ways is a major problem for any database. Data transformation in the Web ETL transformation layer is therefore essential for determining the pertinent information to be examined. Because web data is very voluminous, loading only clean data into the data warehouse is necessary for fast processing and accurate results. The present research focuses on data transformation in the Web ETL framework and proposes a modified technique that employs token-wise sentence sorting, together with Levenshtein distance for string matching, to remove redundant records from a patent database. The cleaned data is then transformed and loaded from the staging area into the Hadoop environment. Integrating the proposed transformation technique with a Hadoop system relieves the constraints on processing, storage and retrieval of large data structures that apply to a conventional data warehouse system.
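
The de-duplication idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify in detail. It assumes that "token-wise sentence sorting" means lower-casing a record, splitting it into word tokens and sorting them, and that two records count as redundant when the Levenshtein distance between their sorted forms is at most a threshold. The function names `normalise`, `levenshtein` and `deduplicate`, and the threshold `max_distance=2`, are hypothetical choices made for this illustration.

```python
import re


def normalise(record: str) -> str:
    """Lower-case a record, split it into word tokens, sort them, rejoin."""
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    return " ".join(sorted(tokens))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def deduplicate(records, max_distance=2):
    """Keep the first record of each group of near-identical records.

    max_distance is an assumed threshold, not a value from the paper.
    """
    kept, kept_keys = [], []
    for record in records:
        key = normalise(record)
        if any(levenshtein(key, seen) <= max_distance for seen in kept_keys):
            continue  # redundant: a close match was already kept
        kept.append(record)
        kept_keys.append(key)
    return kept


if __name__ == "__main__":
    titles = [
        "Data loading in Hadoop",
        "Hadoop data loading in",  # same tokens, different order
        "Hadop data loading in",   # misspelled variant
        "Web ETL framework",
    ]
    print(deduplicate(titles))
```

Sorting tokens first makes word order irrelevant, so the edit distance only has to absorb spelling differences. This sketch compares every record against every kept record, which is quadratic; a large patent database would likely need blocking or a distributed comparison step before loading into Hadoop.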

Keywords
Redundant Data, Data Transformation, Data Loading, Levenshtein Distance Matching, Hadoop
Received: 2019-05-11
Accepted: 2019-10-01
Published: 2019-10-02
Publisher: EAI
http://dx.doi.org/10.4108/eai.13-7-2018.160600

Copyright © 2019 Gaurav Gupta et al., licensed to EAI. This is an open access article distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.
