Two Parallelized Filter Methods for Feature Selection Based on Spark

Reine Marone; Fodé Camara; Samba Ndiaye; Demba Kande

Emerging Technologies for Developing Countries. Second EAI International Conference, AFRICATEK 2018, Cotonou, Benin, May 29–30, 2018, Proceedings

Research Article

@INPROCEEDINGS{10.1007/978-3-030-05198-3_16,
    author={Reine Marone and Fod\'{e} Camara and Samba Ndiaye and Demba Kande},
    title={Two Parallelized Filter Methods for Feature Selection Based on Spark},
    proceedings={Emerging Technologies for Developing Countries. Second EAI International Conference, AFRICATEK 2018, Cotonou, Benin, May 29--30, 2018, Proceedings},
    proceedings_a={AFRICATEK},
    year={2018},
    month={12},
    keywords={Feature selection, Parallel computing, Apache spark, mRMR, Novel method, Big data, Large scale, High dimensional},
    doi={10.1007/978-3-030-05198-3_16}
}

Springer
DOI: 10.1007/978-3-030-05198-3_16

Reine Marone¹,*, Fodé Camara²,*, Samba Ndiaye¹, Demba Kande¹

1: Cheikh Anta Diop University
2: Alioune Diop University

*Contact email: reine.marie.marone@ucad.edu.sn, fode.camara@uadb.edu.sn

Abstract

The goal of feature selection is to reduce computation time, improve prediction performance, build simpler and more comprehensible models, and allow a better understanding of the data in machine learning and data mining problems. The major problem today is that datasets keep growing, both vertically and horizontally. This poses challenges for feature selection, since there is an increasing need for methods that are both scalable and efficient. In response, we present two effective parallel algorithms developed on Apache Spark, a unified analytics engine for big data processing. The first is a parallelized version of the well-known feature selection method mRMR. In the second, we propose an entirely new metric for selecting the most relevant and least redundant features. To demonstrate the superiority of that algorithm, we also built a centralized version of it, which we call CNFS_Spark.

Keywords: Feature selection, Parallel computing, Apache spark, mRMR, Novel method, Big data, Large scale, High dimensional
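
For readers unfamiliar with mRMR, the criterion named in the abstract can be sketched on a single machine as follows. This is a minimal illustration of the standard greedy mRMR rule in its difference form (relevance minus mean redundancy, both measured by mutual information), assuming discrete features. It is not the authors' Spark implementation and says nothing about their novel metric; the function names `mutual_information` and `mrmr_select` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y), in bits, between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of counts
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick up to k feature names by the mRMR difference criterion.

    features: dict mapping a feature name to its list of discrete values.
    labels:   list of class labels, aligned with the feature values.
    """
    # Relevance of each feature: mutual information with the class label.
    relevance = {
        name: mutual_information(column, labels)
        for name, column in features.items()
    }
    selected = []
    candidates = list(features)

    def score(name):
        # The first pick is driven by relevance alone.
        if not selected:
            return relevance[name]
        # Redundancy: mean mutual information with already selected features.
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In this sketch an exact copy of an already selected feature is penalized by its full redundancy, so an independent feature of equal relevance is preferred over it. A parallel version would presumably distribute the mutual information computations across partitions, but how the paper does so is not described in the abstract.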

Published: 2018-12-14
Appears in: SpringerLink

http://dx.doi.org/10.1007/978-3-030-05198-3_16
