ew 18: e3

Research Article

Unsupervised Approach for Email Spam Filtering using Data Mining

  • @ARTICLE{10.4108/eai.9-3-2021.168962,
        author={Mehdi Ebady Manaa and Ahmed J. Obaid and Mohammed Hussein Dosh},
        title={Unsupervised Approach for Email Spam Filtering using Data Mining},
        journal={EAI Endorsed Transactions on Energy Web: Online First},
        volume={},
        number={},
        publisher={EAI},
        journal_a={EW},
        year={2021},
        month={3},
        keywords={Spam Emails, Vector Space Model, Data Security, Machine Learning, M-DBSCAN},
        doi={10.4108/eai.9-3-2021.168962}
    }
    
  • Mehdi Ebady Manaa
    Ahmed J. Obaid
    Mohammed Hussein Dosh
    Year: 2021
    Unsupervised Approach for Email Spam Filtering using Data Mining
    EW
    EAI
    DOI: 10.4108/eai.9-3-2021.168962
Mehdi Ebady Manaa1,*, Ahmed J. Obaid2, Mohammed Hussein Dosh3
  • 1: Department of Information Networks, College of Information Technology, University of Babylon, Iraq
  • 2: Department of Computer Science, Faculty of Computer Science and Mathematics, University of Kufa, Iraq
  • 3: College of Education for Girls, University of Kufa, Iraq
*Contact email: It.mehdi.ebady@itnet.uobabylon.edu.iq

Abstract

Computer networks are overwhelmed with unwanted emails, known as spam. These emails cause financial damage to companies and harm users' reputations. Their increasing volume has created an intense need for robust anti-spam filtering, which this paper designs and implements using the vector space model and Machine Learning (ML). ML algorithms have been used successfully to detect and filter spam emails that jeopardize network resources and consume bandwidth. The main objective is to apply an unsupervised learning method, Modified Density-Based Spatial Clustering of Applications with Noise (M-DBSCAN), to separate spam from ham emails. N representative points are extracted from each cluster during the training step and then applied in the online test. These points stand in for the cluster objects, so that both spherical and non-spherical clusters can be detected, and spam is identified by measuring the distance from an incoming email to them. The data set, taken from the Kaggle website, contains many ham and spam emails. The results show an accuracy of 97.848% for M-DBSCAN compared with 95.918% for standard DBSCAN, together with favorable values for false-negative rate, false-positive rate, F-score, and online detection time.
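
The pipeline summarized in the abstract (vector space model, density-based clustering, N representative points per cluster, distance-based online test) can be illustrated with a minimal sketch. It makes several assumptions that do not come from the paper: the abstract does not describe how M-DBSCAN modifies DBSCAN or how the representative points are chosen, so the sketch uses standard DBSCAN with cosine distance and a farthest-point selection as stand-ins. The function names, the parameters `eps`, `min_pts`, and `n_rep`, and the toy emails are all hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency per term, from the training corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine_distance(a, b):
    """1 - cosine similarity; inputs are assumed to be unit-length."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def dbscan(vectors, eps, min_pts):
    """Standard DBSCAN (not the paper's M-DBSCAN); label -1 means noise."""
    labels = [None] * len(vectors)

    def neighbours(i):
        return [j for j, v in enumerate(vectors)
                if cosine_distance(vectors[i], v) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # noise becomes a border point
                labels[j] = cluster
                continue
            if labels[j] is not None:    # already assigned to a cluster
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:       # core point: keep expanding
                queue.extend(nb)
    return labels

def representatives(vectors, labels, n_rep):
    """Assumed selection rule: farthest-point sampling inside each cluster."""
    reps = {}
    for c in sorted(set(labels) - {-1}):
        members = [v for v, l in zip(vectors, labels) if l == c]
        chosen = [members[0]]
        while len(chosen) < min(n_rep, len(members)):
            chosen.append(max(
                members,
                key=lambda v: min(cosine_distance(v, r) for r in chosen)))
        reps[c] = chosen
    return reps

def nearest_cluster(vec, reps):
    """Online test: cluster whose closest representative is nearest to vec."""
    return min(reps, key=lambda c: min(cosine_distance(vec, r) for r in reps[c]))

# Toy training emails, invented for illustration (three spam-like, three ham-like).
train = [
    "win free prize money now",
    "free money prize claim now",
    "claim free prize win money",
    "meeting agenda for monday project",
    "project meeting notes monday agenda",
    "agenda for project review meeting",
]
idf = fit_idf(train)
vectors = [vectorize(d, idf) for d in train]
labels = dbscan(vectors, eps=0.5, min_pts=2)
reps = representatives(vectors, labels, n_rep=2)
print(labels)
print(nearest_cluster(vectorize("claim your free prize now", idf), reps))
```

Because the clustering is unsupervised, the sketch only reports which cluster an incoming email falls into; deciding which cluster corresponds to spam is left outside the example. Keeping only a few representative points per cluster means the online test compares each new email against a small, fixed set of vectors instead of the whole training set.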