Unsupervised Approach for Email Spam Filtering using Data Mining

Computer networks are overwhelmed with unwanted emails, known as spam emails. These emails cause financial damage to companies and harm users' reputations. The increasing volume of these emails has created an intense need to design and implement robust anti-spam filtering, which this paper addresses using the vector space model and Machine Learning (ML). ML algorithms have been used successfully to detect and filter spam emails that jeopardize network resources and consume bandwidth. The main objective is to apply the unsupervised learning method Modified Density-Based Spatial Clustering of Applications with Noise (M-DBSCAN) to classify spam and ham emails. N-representative points are extracted from each cluster during the training step and are then applied in the online test, where spam email is detected using distance measures. These points represent the cluster objects, so that both spherical and non-spherical clusters can be detected. The data set, taken from the Kaggle website, includes many ham and spam emails. The results show good performance, with an accuracy of 97.848% for M-DBSCAN compared with 95.918% for standard DBSCAN, and efficient values for false-negative rate, false-positive rate, f-score and online detection time.


Introduction
Computer networks have recently been jeopardized by unwanted bulk commercial emails. These emails are called spam emails, and those who send them are called spammers. Several factors damage a company's reputation and consume network resources. Some of these factors are inadequate staff training, viruses, worms, theft of data, unauthorized copying of data, and program alteration [1]. Kaspersky reported the volume of spam emails sent within two years (2016-2018), see figure (1). Furthermore, pernicious attachments of spam emails such as malware, macro scripts, and JavaScript were illustrated in this report. In 2019, cybercriminals tried to blackmail users: the message claimed that a case had been opened against the recipient for the storage and distribution of pornographic images of minors, and demanded a payment of $10,000 [2] [3]. There are four spam email filtering approaches: content-based, sample-based, heuristic-based, and trained-based. E-commerce is the main motivation, since it needs to avoid a lack of trust in online marketing. A spammer sends a large number of spam emails with harmful scripts attached. This could overwhelm the network and prevent legitimate users from accessing resources securely [4] [5].
Machine learning has come to play an integral role in identifying and preventing spam email. In general, data mining tasks can be classified into two main categories: descriptive data mining and predictive data mining. The first category summarizes the general characteristics of the data succinctly. We can apply the methods of statistical analysis to describe these characteristics. For instance, a histogram gives a graphical display of the data distribution. Moreover, frequency counts can be used to extract the number of occurrences of each value. In the second category, predictive data mining aims to build a model of the data and then predict the behaviour of particular data [6][7].

Data Mining Techniques
A data mining technique may use one or more of the following methods: association, classification, and clustering.

Association Methods
These methods aim to find correlations between tuples or sets of data records. They mainly depend on the rule expression form. Association rule mining consists of deriving a set of rules in the form $X \Rightarrow Y$, where X and Y are sets of attribute values with $X \cap Y = \emptyset$. It is commonly used in market transaction data [8].

Classification Methods
Classification is a predictive model that organizes data into classes. It comprises two steps: training and testing. In the first step, the classification model is trained using a classification technique such as a neural network or a decision tree. This step requires the presence of the target (class). In the second step, the features learned in training are used to classify unknown data or points. For example, the decision tree features are used to assign a data point of unknown class to one of the classes obtained in the training step [9]. Several performance metrics are used for evaluation, including accuracy, false-positive rate, false-negative rate, and time. The confusion matrix is used to calculate these metrics [10].

Clustering Methods
Clustering is an unsupervised, descriptive data mining method that extracts new knowledge by grouping data into clusters. Objects in each cluster are similar to each other and different from the objects in other clusters. A distance measure is used to group the objects efficiently. The distance measure d(X, Y) takes two points in the space and returns a numeric distance between them. Figure (2) illustrates the output of one of the cluster analysis methods: three groups (A, B and C) in coordinate space. Furthermore, some cluster analysis methods produce clusters with different sizes and densities [11].

Figure 2. The landscape of three different clusters in coordinate space
In this study, M-DBSCAN is proposed to classify spam and ham emails using the Kaggle dataset, which consists of (4993) unique emails, of which 29% are spam and 71% are ham.

Related Works
Spam emails are among the most complicated problems of email services because they jeopardize network resources and consume bandwidth. A lot of work has been done on spam filtering. Most dominant spam filters are based on machine learning classification techniques. Classification plays an important role in detecting the target type at the training and testing steps, in applications that include finding fraud, checking for attacks and intruders, and detecting diseases early. These algorithms include support vector machines, neural networks, Naive Bayes and decision trees. Therefore, most researchers are interested in finding the best classifier for spam detection.
In a published study [12], emails were represented using a digest algorithm and then clustered using DBSCAN. The results showed that accuracy was improved by 30% compared with standard DBSCAN. In [13], the Naïve Bayes, KNN, and reverse DBSCAN classifiers were implemented to classify spam emails based on two public datasets.
The results showed good performance in terms of accuracy, recall, precision, and F-measure. A comparative study of classification algorithms with and without n-grams was conducted using the public 2007 TREC Public Spam Corpus [31].
The study reported that the best results were obtained with n-grams and combined datasets for the Naïve Bayes, decision tree, artificial neural network, random forest, and linear regression classifiers. On the other hand, another study [14] shows that spam emails can be distinguished by certain features, and that spammers know that if they use the same set of features for a long time, anti-spam companies can develop tools to filter them. Since spam emails may be prone to so-called "concept drift", the study proposed Ensemble-based Lifelong Classification using Adjustable Dataset Partitioning (ELCADP).
The results showed that this model outperforms several data mining algorithms in terms of accuracy. Spam emails are clustered using K-means cluster analysis in [32].
The proposed work consists of four steps. The first step tokenizes the incoming message. Information gain is calculated in the second step to select the best features from the incoming email message, while the feature vector is created in the third step [33]. Finally, K-means is applied to separate the spam cluster from the ham cluster. Naïve Bayes is used to detect spam emails using two public datasets, Spam Data and SPAMBASE [14]. The datasets were evaluated using the accuracy, recall, and precision performance metrics [34].

The Proposed System
Currently, there is increasing interest in data mining approaches for spam email detection. Specifically, we are concerned with low processing time and accurate results for large spam email datasets. These approaches have been designed and implemented to discover knowledge from spam email datasets. The main steps of data mining are data selection, preprocessing, transformation, data mining models, and evaluation [10]. The dataset was obtained from the website https://www.kaggle.com/venky73/spam-mailsdataset/data and contains the Enron-1 spam and ham emails. The spam email dataset consists of two columns, the email data and label_num, which is either (1) for spam or (0) for ham, with (4993) unique values in total, of which 29% are spam emails and 71% are ham emails [15]; [20]. As shown in figure (3), after the spam email dataset is obtained, the preprocessing step removes the stop words and special characters. We then use the Vector Space Model (VSM), introduced by Salton, to measure the similarity between texts [16]; [21]. Each word in a vector is given a binary representation (0 or 1), and the cosine similarity between two vectors is then calculated from formula (1).
$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}} \qquad (1)$$

where x and y are vectors of n components each. The cosine similarity is found by taking the dot product of the two vectors and dividing it by the product of their norms. In general, its values range between -1 and 1, where -1 is perfectly dissimilar and 1 is perfectly similar; for binary vectors such as those used here, the values lie between 0 and 1.

The main idea of DBSCAN is to identify sets of objects in the data space on the basis that a cluster is a region of high point density [22]. Points inside a cluster have a high density, while the regions between clusters have a low point density. The parameter (MinPts) is the minimum number of points (threshold) required for a region to be considered dense, while the parameter (eps) is the distance used to locate the neighbouring points of any point (P) [19]. There are three kinds of points in DBSCAN: core, border and noise. A point P is a core point if it has at least MinPts points within distance eps of itself. A point P is a border point if it is not a core point but has at least one core point within distance eps. A point P is noise if it is neither a core nor a border point [17] [18]. Figure (4) shows the main concepts of core, noise, and border points [23].
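The preprocessing, binary vector space model, cosine measure and DBSCAN point types described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this study: the stop-word list, the function names and the four toy emails are hypothetical, the cosine distance is assumed to be one minus the cosine similarity, and a point's neighbours are assumed to exclude the point itself.

```python
import math
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration only).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "for", "you", "your"}

def tokenize(text):
    """Lowercase the text, drop special characters and remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def binary_vectors(docs):
    """Binary VSM: component is 1 if the vocabulary word occurs in the email, else 0."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    return [[1 if w in s else 0 for w in vocab] for s in token_sets]

def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def cosine_distance(x, y):
    """Assumed distance: one minus the cosine similarity."""
    return 1.0 - cosine_similarity(x, y)

def label_points(vectors, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    n = len(vectors)
    # Neighbours of i: all other points within distance eps.
    neighbours = [
        [j for j in range(n)
         if j != i and cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours[i]):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

emails = [
    "Win a free prize now",
    "Free prize, claim now",
    "Claim your free prize",
    "Meeting agenda for Monday",
]
vectors = binary_vectors(emails)
print(label_points(vectors, eps=0.3, min_pts=2))
# prints ['border', 'core', 'border', 'noise']
```

In this toy run the second email is the only one with two neighbours within eps, so it is the core point; the first and third emails are border points, and the unrelated fourth email is noise.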

Results and Discussion
The dataset was labeled as ham and spam. N-representative points were extracted from the two clusters (ham and spam), with the number of N-representative points set to 5 [26]. Each testing point is assigned the class label of the N-representative point at the minimum distance from it [27]. Quantitative statistical analysis was performed based on the confusion matrix, and four metrics were evaluated (accuracy, precision, recall, and f-score), as shown in formulas (2)-(5) [28].
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

Formula (4) gives the recall:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

The relation between precision and recall is calculated by the f-score in formula (5):

$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives in the confusion matrix.

Table (1) shows the main results using the Modified DBSCAN with the number of N-representative points set to 5 for each cluster. We found by trial and error that 5 N-representative points gave the best results for the performance metrics above [29]. It is clearly shown in table (6) that increasing the training percentage does not have an impact on the accuracy of the proposed system. However, the time increases for the Modified DBSCAN [30]. The results show that the proposed system displays superior accuracy compared with k-means, which has an accuracy of (93%) and a time of (88 secs). The standard DBSCAN also shows less accurate results than the Modified DBSCAN. The reason is that k-means cannot detect clusters with non-spherical shapes, whereas the Modified DBSCAN can.
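The online test and the evaluation described in this section can be illustrated with the following minimal Python sketch. It is a sketch under stated assumptions rather than the code used for the reported results: the representative points are taken as given (how they are extracted from each cluster is not shown), the 2-D toy points and the function names are hypothetical, and Euclidean distance stands in for whichever distance measure is used in practice.

```python
import math

def euclidean(x, y):
    """Euclidean distance; an assumed stand-in for the distance measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def classify(point, representatives, distance=euclidean):
    """Return the label of the representative point closest to `point`.

    representatives: dict mapping a class label to its representative points.
    """
    best_label, best_dist = None, float("inf")
    for label, reps in representatives.items():
        for rep in reps:
            d = distance(point, rep)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label

def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, TN, FP, FN with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and f-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Hypothetical representative points (two per cluster, for brevity).
representatives = {
    "spam": [(0.9, 0.8), (0.8, 0.9)],
    "ham": [(0.1, 0.2), (0.2, 0.1)],
}
test_points = [(0.85, 0.85), (0.15, 0.15), (0.6, 0.6), (0.3, 0.4)]
y_true = ["spam", "ham", "ham", "ham"]

y_pred = [classify(p, representatives) for p in test_points]
print(y_pred)
print(metrics(*confusion_counts(y_true, y_pred)))
```

Because each test point is compared only with a few representative points per cluster instead of with every training object, the cost of the online test grows with the number of representatives rather than with the size of the training set.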

Conclusion
This study was intended to identify intruders who send spam emails by using one of the unsupervised data mining algorithms. The main role of the N-representative points extracted from each cluster is their use in online testing. Each test point is classified proactively for both spherical and non-spherical cluster shapes. The evaluation results of the M-DBSCAN algorithm showed relatively higher accuracy compared with the results of the original DBSCAN algorithm and K-means. A total of (747) spam emails and (4825) ham emails were used, and the results demonstrated the efficiency of the proposed algorithm, with accuracy and precision of more than 97% and a recall rate over 88%. It is thereby concluded that M-DBSCAN deals with non-spherical shapes (clusters) with very high accuracy and detection rates.