Research Article
Two Parallelized Filter Methods for Feature Selection Based on Spark
@INPROCEEDINGS{10.1007/978-3-030-05198-3_16,
  author={Reine Marone and Fod\'{e} Camara and Samba Ndiaye and Demba Kande},
  title={Two Parallelized Filter Methods for Feature Selection Based on Spark},
  proceedings={Emerging Technologies for Developing Countries. Second EAI International Conference, AFRICATEK 2018, Cotonou, Benin, May 29--30, 2018, Proceedings},
  proceedings_a={AFRICATEK},
  year={2018},
  month={12},
  keywords={Feature selection, Parallel computing, Apache Spark, mRMR, Novel method, Big data, Large scale, High dimensional},
  doi={10.1007/978-3-030-05198-3_16}
}
Reine Marone
Fodé Camara
Samba Ndiaye
Demba Kande
Year: 2018
AFRICATEK
Springer
DOI: 10.1007/978-3-030-05198-3_16
Abstract
The goal of feature selection is to reduce computation time, improve prediction performance, build simpler and more comprehensible models, and allow a better understanding of the data in machine learning or data mining problems. The major difficulty today is that datasets keep growing, both vertically and horizontally. This growth challenges feature selection and creates an increasing need for methods that are both scalable and efficient. In response, we present two effective parallel algorithms developed on Apache Spark, a unified analytics engine for big data processing. The first is a parallelized algorithm based on the well-known feature selection method mRMR (minimum Redundancy Maximum Relevance). The second proposes a novel metric for selecting the most relevant and least redundant features. To show the superiority of that algorithm, we have also created a centralized version of it, which we call CNFS_Spark.
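For readers unfamiliar with mRMR, the following is a minimal single-machine sketch of the standard greedy criterion in its difference form: at each step, pick the feature whose mutual information with the class labels, minus its mean mutual information with the features already selected, is largest. It is not the authors' Spark implementation, and it says nothing about their novel metric. The function names `mutual_information` and `mrmr_select`, the toy data, and the alphabetical tie-breaking rule are all assumptions made for illustration. The sketch handles discrete features only.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y))).
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy mRMR, difference form (hypothetical helper, not from the paper).

    features: dict mapping a feature name to a list of discrete values.
    Returns the names of up to k selected features, in selection order.
    """
    relevance = {
        name: mutual_information(col, labels) for name, col in features.items()
    }
    selected = []
    candidates = set(features)

    def score(name):
        # Relevance minus mean redundancy with the already-selected features.
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical): an assumption.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (assumed): the label is determined jointly by features "a" and "b";
# "a_copy" duplicates "a", and "noise" is independent of everything else.
toy_features = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "b":      [0, 0, 1, 1, 0, 0, 1, 1],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1],
}
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]

print(mrmr_select(toy_features, toy_labels, 2))
```

On this toy data the redundancy term is what keeps the duplicate `a_copy` from being chosen after `a`, even though both are equally relevant to the labels. In a distributed setting the pairwise mutual-information computations are the expensive part and a natural candidate for parallelization, but the abstract does not state how the authors partition that work.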