A Parallelized Spark Based Version of mRMR

Reine Marone; Fodé Camara; Samba Ndiaye

Innovations and Interdisciplinary Solutions for Underserved Areas. Second International Conference, InterSol 2018, Kigali, Rwanda, March 24–25, 2018, Proceedings

Research Article



Cite: BibTeX Plain Text

@INPROCEEDINGS{10.1007/978-3-319-98878-8_18,
    author={Reine Marone and Fod\'{e} Camara and Samba Ndiaye},
    title={A Parallelized Spark Based Version of mRMR},
    proceedings={Innovations and Interdisciplinary Solutions for Underserved Areas. Second International Conference, InterSol 2018, Kigali, Rwanda, March 24--25, 2018, Proceedings},
    proceedings_a={INTERSOL},
    year={2018},
    month={9},
    keywords={Feature selection, Filter method, Parallel computing, Apache Spark, mRMR, SVM},
    doi={10.1007/978-3-319-98878-8_18}
}

Reine Marone, Fodé Camara, Samba Ndiaye (2018). A Parallelized Spark Based Version of mRMR. INTERSOL. Springer. DOI: 10.1007/978-3-319-98878-8_18

Reine Marone¹,*, Fodé Camara²,*, Samba Ndiaye¹

1: Cheikh Anta Diop University
2: Alioune Diop University

*Contact email: reine.marie.marone@ucad.edu.sn, fode.camara@uadb.edu.sn

Abstract

We are now surrounded by large-scale, high-dimensional data, commonly called big data, and reducing the dimensionality of such data is crucial for machine learning problems. Feature selection therefore plays a vital role in the machine learning process: it aims to reduce dimensionality by removing irrelevant and redundant features from the original data. However, characteristics of big data such as velocity, volume, and variety have brought new challenges to the field of feature selection. Most existing feature selection algorithms were designed to run on a single machine (a centralized computing architecture) and do not scale well when dealing with big data; their efficiency may deteriorate to the point of becoming inapplicable. For this reason, there is an increasing need for scalable yet efficient feature selection methods. We present a distributed and effective version of the mRMR (Max-Relevance and Min-Redundancy) algorithm for real-world data mining problems, and we evaluate the empirical performance of the proposed algorithm in selecting features on several public datasets. When we compared the efficiency and scalability of our parallelized method with those of the centralized one, the parallelized method gave better results.

Keywords: Feature selection, Filter method, Parallel computing, Apache Spark, mRMR, SVM
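
The abstract does not describe how the Spark version is implemented. As a point of reference only, the following is a minimal single-machine sketch of the standard greedy mRMR criterion (relevance to the class minus mean redundancy with the already selected features, both measured by mutual information on discrete values). The function names and the toy data are illustrative assumptions, not the authors' code.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedy mRMR with the difference criterion (hypothetical helper).

    features: dict mapping a feature name to its list of discrete values.
    target:   list of discrete class labels.
    Returns the names of up to k selected features, in selection order.
    """
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    candidates = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): the class is determined jointly by "a" and "c";
# "a_copy" duplicates "a" and is therefore fully redundant with it.
a = [0, 0, 0, 0, 1, 1, 1, 1]
c = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [2 * ai + ci for ai, ci in zip(a, c)]
print(mrmr({"a": a, "a_copy": list(a), "c": c}, labels, 2))
```

In this sketch each candidate's score is computed independently of the other candidates, which is the kind of work a distributed implementation could spread across workers; how the paper partitions the computation is not stated in the abstract.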

Published: 2018-09-03
Appears in: SpringerLink

DOI: http://dx.doi.org/10.1007/978-3-319-98878-8_18