Emerging Technologies for Developing Countries. Second EAI International Conference, AFRICATEK 2018, Cotonou, Benin, May 29–30, 2018, Proceedings

Research Article

Two Parallelized Filter Methods for Feature Selection Based on Spark

  • @INPROCEEDINGS{10.1007/978-3-030-05198-3_16,
        author={Reine Marone and Fod\^{e} Camara and Samba Ndiaye and Demba Kande},
        title={Two Parallelized Filter Methods for Feature Selection Based on Spark},
        proceedings={Emerging Technologies for Developing Countries. Second EAI International Conference, AFRICATEK 2018, Cotonou, Benin, May 29--30, 2018, Proceedings},
        proceedings_a={AFRICATEK},
        year={2018},
        month={12},
        keywords={Feature selection Parallel computing Apache spark mRMR Novel method Big data Large scale High dimensional},
        doi={10.1007/978-3-030-05198-3_16}
    }
    
  • Reine Marone
    Fodé Camara
    Samba Ndiaye
    Demba Kande
    Year: 2018
    Two Parallelized Filter Methods for Feature Selection Based on Spark
    AFRICATEK
    Springer
    DOI: 10.1007/978-3-030-05198-3_16
Reine Marone (1,*), Fodé Camara (2,*), Samba Ndiaye (1), Demba Kande (1)
  • 1: Cheikh Anta Diop University
  • 2: Alioune Diop University
*Contact email: reine.marie.marone@ucad.edu.sn, fode.camara@uadb.edu.sn

Abstract

The goal of feature selection is to reduce computation time, improve prediction performance, build simpler and more comprehensible models, and allow a better understanding of the data in machine learning or data mining problems. A major problem today, however, is that datasets keep growing both vertically (more instances) and horizontally (more features). This poses a challenge for feature selection, since there is an increasing need for methods that are both scalable and efficient. To address these problems, we present two effective parallel algorithms developed on Apache Spark, a unified analytics engine for big data processing. The first is a parallelized version of the well-known feature selection method mRMR. In the second, we propose a novel metric for selecting the most relevant and least redundant features. To show the superiority of that algorithm, we also created its centralized version, which we call CNFS_Spark.
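
For readers unfamiliar with mRMR (minimum Redundancy Maximum Relevance), the following is a minimal single-machine sketch of the commonly used greedy "relevance minus mean redundancy" form of the criterion, with mutual information estimated from discrete value counts. It is not the authors' Spark implementation, and it does not show their novel metric, which the abstract does not define. The function names (`mutual_information`, `mrmr`) and the toy data (`demo_columns`, `demo_target`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(columns, target, k):
    """Greedily pick k feature names from `columns` (name -> list of values).

    Each step selects the feature maximizing
    relevance(f) - mean redundancy(f, already selected features).
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    relevance = {
        name: mutual_information(col, target) for name, col in columns.items()
    }
    selected = []
    remaining = set(columns)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(columns[name], columns[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): the target is determined by "a" and "c" together;
# "b" is an exact copy of "a", so it is relevant but fully redundant.
demo_columns = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 0, 1, 1, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
demo_target = [2 * a + c for a, c in zip(demo_columns["a"], demo_columns["c"])]

print(mrmr(demo_columns, demo_target, 2))
```

In this toy case the redundant copy `b` is passed over in favor of `c`. The per-feature mutual-information computations are independent of one another, which is one plausible reason the method lends itself to distribution on an engine such as Spark; how the paper actually partitions the work is not stated in the abstract.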