Innovations and Interdisciplinary Solutions for Underserved Areas. Second International Conference, InterSol 2018, Kigali, Rwanda, March 24–25, 2018, Proceedings

Research Article

A Parallelized Spark Based Version of mRMR

  • @INPROCEEDINGS{10.1007/978-3-319-98878-8_18,
        author={Reine Marone and Fod\'{e} Camara and Samba Ndiaye},
        title={A Parallelized Spark Based Version of mRMR},
        proceedings={Innovations and Interdisciplinary Solutions for Underserved Areas. Second International Conference, InterSol 2018, Kigali, Rwanda, March 24--25, 2018, Proceedings},
        proceedings_a={INTERSOL},
        year={2018},
        month={9},
        keywords={Feature selection, Filter method, Parallel computing, Apache Spark, mRMR, SVM},
        doi={10.1007/978-3-319-98878-8_18}
    }
    
  • Reine Marone
    Fodé Camara
    Samba Ndiaye
    Year: 2018
    A Parallelized Spark Based Version of mRMR
    INTERSOL
    Springer
    DOI: 10.1007/978-3-319-98878-8_18
Reine Marone1,*, Fodé Camara2,*, Samba Ndiaye1
  • 1: Cheikh Anta Diop University
  • 2: Alioune Diop University
*Contact email: reine.marie.marone@ucad.edu.sn, fode.camara@uadb.edu.sn

Abstract

Nowadays, we are surrounded by large-scale, high-dimensional data, so-called big data, and reducing the dimensionality of such data is crucial for machine learning problems. Feature selection therefore plays a vital role in the machine learning process: it aims to reduce dimensionality by removing irrelevant and redundant features from the original data. However, characteristics of big data such as velocity, volume, and variety have brought new challenges to the field of feature selection. In fact, most existing feature selection algorithms were designed to run on a single machine (a centralized computing architecture) and do not scale well when dealing with big data; their efficiency may deteriorate to the point where they become inapplicable. For this reason, there is an increasing need for scalable yet efficient feature selection methods. We therefore present a distributed and effective version of the mRMR (Max-Relevance and Min-Redundancy) algorithm to address real-world data mining problems, and we evaluate the empirical performance of the proposed algorithms in selecting features on several public datasets. When we compared the efficiency and scalability of our parallelized method with those of the centralized one, we found that the parallelized method gave better results.