Using ensemble methods to detect DoS attacks in Network Intrusion Detection Systems
EAI Endorsed Transactions on Context-aware Systems and Applications, 08 2019 - 11 2019 | Volume 6 | Issue 19 | e5

Building a good IDS model from a given dataset is one of the main tasks in machine learning. Training multiple classifiers to solve the same problem and then combining their outputs to improve classification quality is called the ensemble method. This paper analyzes and evaluates the performance of well-known ensemble techniques, namely Bagging, AdaBoost, Stacking, Decorate, Random Forest and Voting, for detecting DoS attacks on the UNSW-NB15 dataset, created by the Australian Cyber Security Centre in 2015. The experimental results show that the Stacking technique with heterogeneous classifiers gives the best classification quality, with an F-Measure of 99.28%, compared to 98.61%, the best result obtained with single classifiers, and 99.02% obtained with the Random Forest technique.


Introduction
For companies that are constantly connected to the Internet in today's technology age, the terms DoS and DDoS are not strange. In recent years in particular, news that the information technology divisions of large organizations around the world have been hacked and had data stolen almost always involves the terms DoS and DDoS. DoS stands for "Denial of Service" and DDoS for "Distributed Denial of Service", a distributed form of denial-of-service attack. This is a fairly common form of attack today: it prevents the target computer from handling its tasks and leads to overload. DoS attacks often target virtual private servers (VPS) or the web servers of large businesses such as banks, governments or e-commerce websites. A fourth-quarter security report produced by Kaspersky [1] indicated that DDoS attacks originated from 86 nations, with attack durations of up to 279 hours. The most common type of attack is still SYN flooding (79.7%), with UDP flooding in second place (9.4%); the least common is ICMP flooding (0.5%). Johnson Singh et al. [2] reported that a 540 Gbps DDoS attack occurred on 31 August 2016 against an official federal government website of the Rio 2016 Olympics and the Brazilian Ministry of Sport. DDoS attacks have been around for decades and are not particularly sophisticated, at least by modern network attack standards, but they continue to be a popular method for attackers. An Intrusion Detection System (IDS) is an important tool used to monitor and identify intrusion attacks. To determine whether an intrusion attack has occurred, an IDS relies on one of several approaches. The first is the signature-based approach, in which known attack signatures are stored in the IDS database and matched against current system data. When the IDS finds a match, it recognizes it as an intrusion. This approach provides quick and accurate detection.
* Corresponding author. Email: thanhhn@bvu.edu.vn
However, the disadvantage is that the signature database must be updated periodically, and the system may be compromised before the latest attack signatures are available. The second approach is anomaly-based (behavior-based), in which the IDS identifies an attack when the system deviates from its normal behavior. This approach can detect both known and unknown attacks; however, its disadvantage is lower accuracy and a high false alarm rate.
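The two detection strategies can be contrasted in a minimal sketch. The signature strings and traffic threshold below are hypothetical illustrations, not real IDS rules:

```python
# Hypothetical signature patterns; real IDS rule sets are far richer.
KNOWN_SIGNATURES = {"syn_flood": "SSSSSSSS", "udp_flood": "UUUUUUUU"}

def signature_based(event: str) -> bool:
    """Flag an event that contains any known attack signature."""
    return any(sig in event for sig in KNOWN_SIGNATURES.values())

def anomaly_based(packets_per_sec: float, baseline: float, factor: float = 5.0) -> bool:
    """Flag traffic that deviates too far from the learned normal baseline."""
    return packets_per_sec > factor * baseline

print(signature_based("...SSSSSSSS..."))    # matches a stored signature -> True
print(anomaly_based(12000, baseline=1000))  # 12x the baseline -> True
print(anomaly_based(900, baseline=1000))    # close to normal -> False
```

The trade-off described above is visible even in this toy: the signature rule misses anything absent from its database, while the anomaly rule can flag unseen attacks but will also flag any benign traffic spike above the threshold.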
Building a good IDS model from a given dataset is one of the main tasks in machine learning. A strong classifier is desirable but difficult to find. Marwane Zekri et al. [3] designed a DDoS detection system based on the C4.5 algorithm to mitigate the DDoS threat. This algorithm, coupled with signature detection techniques, generates a decision tree to perform automatic, effective signature-based detection of DDoS flooding attacks. In another study, Pei et al. [4] used the random forest algorithm to train an attack detection model. Their experimental results show that the proposed machine-learning-based DDoS detection method has a good detection rate for currently popular DDoS attacks. The idea of training multiple classifiers to solve the same problem and then combining their outputs to improve accuracy is called the ensemble method. When a combination, also known as a multi-classifier system, is based on learners of the same type, it is called a homogeneous ensemble; when it is based on learners of different types, it is called a heterogeneous ensemble. Generally, the generalization ability of an ensemble classifier is better than that of a single classifier, because combining weak classifiers can produce better results than a single strong classifier. In this paper we evaluate and analyze six different ensemble techniques, namely Bagging, AdaBoost, Stacking, Decorate, Random Forest and Voting, using various base classifiers such as Decision Trees (DT), Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), k-Nearest Neighbors (KNN) and Random Tree (RT), applied to the UNSW-NB15 dataset.
The remainder of this paper is organized as follows: Section 2 presents the ensemble machine learning methods used in the experiments; Section 3 presents the dataset, the evaluation metrics and the results obtained by using ensemble techniques to detect DoS attacks on the UNSW-NB15 dataset; and Section 4 discusses the results and issues that need further study.

Ensemble techniques
Since the 1990s, the machine learning community has been studying ways to combine multiple classification models into an ensemble of models with greater accuracy than a single model. The purpose of combining models is to reduce the variance and/or bias of algorithms. Bias is the systematic error of a model (not related to the training data), and variance is the error due to the variability of the model with respect to the randomness of the data samples (Figure 1). Buntine [5] introduced Bayesian techniques to reduce the variance of learning methods. Wolpert's stacking method [6] aims to minimize the bias of algorithms. Freund and Schapire [7] introduced Boosting, and Breiman [8] suggested ArcX4 to reduce both bias and variance, while Breiman's Bagging [9] reduces the variance of an algorithm without increasing its bias too much. The random forest approach [10] is one of the most successful ensemble methods. The random forest algorithm builds unpruned trees to keep the bias low and uses randomness to keep the correlation between trees in the forest low.

Bootstrap
Bootstrap is a very well known method in statistics, introduced by Bradley Efron in 1979 [11]. It is mainly used to estimate standard errors and biases and to calculate confidence intervals for parameters. The method is performed as follows: from an initial population, draw a sample L = (x_1, x_2, ..., x_n) of n instances and calculate the desired parameters. Then repeat b times: create a sample L_b, also of n instances, by sampling with replacement from the original sample L, and calculate the desired parameters on it.
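The procedure above can be sketched as follows; this is a minimal illustration estimating the standard error of the sample mean, with made-up sample values:

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.mean, b=1000, seed=42):
    """Estimate the standard error of `stat` on `sample` by drawing b
    bootstrap resamples (same size, with replacement) and taking the
    standard deviation of the resulting b estimates."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [stat([rng.choice(sample) for _ in range(n)]) for _ in range(b)]
    return statistics.stdev(estimates)

L = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4]
print(round(bootstrap_se(L), 3))  # bootstrap standard error of the mean
```

The same loop works for any statistic (median, variance, a model parameter) simply by swapping the `stat` argument.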

Bagging (Bootstrap Aggregation)
This method can be considered a way of aggregating the results obtained from Bootstrap. The main idea is as follows: given a training set D = {(x_i, y_i) : i = 1, 2, ..., n}, suppose we want to make a prediction for an input x. Draw B datasets, each consisting of n elements randomly selected from D with replacement (as in Bootstrap), train a classifier on each of them, and aggregate the B predictions for x, typically by majority vote for classification or by averaging for regression.
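A minimal sketch of Bagging with majority voting follows. The 1-nearest-neighbour weak learner and the toy data are illustrative assumptions, not the classifiers used in the paper's experiments:

```python
import random
from collections import Counter

def bagging_predict(train, x, learn, B=25, seed=7):
    """Train B classifiers, each on a bootstrap sample of `train`,
    then combine their predictions for x by majority vote."""
    rng = random.Random(seed)
    n = len(train)
    votes = []
    for _ in range(B):
        boot = [rng.choice(train) for _ in range(n)]  # sample with replacement
        votes.append(learn(boot)(x))
    return Counter(votes).most_common(1)[0][0]

def learn_1nn(data):
    """Toy weak learner: 1-nearest neighbour on a single numeric feature."""
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]

train = [(0.1, "normal"), (0.3, "normal"), (0.9, "dos"), (1.1, "dos")]
print(bagging_predict(train, 1.0, learn_1nn))  # neighbours of 1.0 are "dos"
```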

Boosting
Unlike the Bagging method, which builds an ensemble of classifiers from equally weighted training instances, the Boosting method builds an ensemble of classifiers from differently weighted training instances. After each iteration, incorrectly predicted training instances are given larger weights and correctly predicted training instances are given smaller weights. This helps Boosting focus on improving accuracy for instances that were predicted incorrectly in the previous iteration. A Boosting algorithm was originally defined as an algorithm that converts a weak machine learning algorithm into a strong one: it converts an algorithm that solves a binary classification problem only slightly better than random guessing into an algorithm that solves the problem well. Schapire's original Boosting algorithm is recursive; at the end of the recursion, it combines the hypotheses generated by the weak learners, and the error probability of this ensemble is proven to be smaller than the error probability of the weak hypotheses. AdaBoost is an algorithm that combines a diverse set of classifiers by running a machine learning algorithm with different distributions over the training set [12].
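The weight update at the heart of AdaBoost can be sketched as a single round on made-up weights and predictions; this is a generic illustration of the AdaBoost.M1 update, not Weka's implementation:

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost.M1 round: compute the weak learner's vote weight
    alpha from its weighted error, then increase the weights of
    misclassified instances and decrease the others, renormalising."""
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [w * math.exp(alpha if p != y else -alpha)
             for w, p, y in zip(weights, predictions, labels)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

w0 = [0.25, 0.25, 0.25, 0.25]  # uniform initial weights
alpha, w1 = adaboost_round(w0, predictions=[1, 1, 0, 0], labels=[1, 1, 1, 0])
print(round(alpha, 3), [round(w, 3) for w in w1])
# -> 0.549 [0.167, 0.167, 0.5, 0.167]
```

After one round the single misclassified instance (index 2) carries half of the total weight, so the next weak learner is forced to concentrate on it.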

Stacking
Stacking is a way to combine multiple models that introduces the concept of a meta classifier. It is less widely used than Bagging and Boosting, but unlike them, Stacking can combine models of different types. The process is as follows: (1) Divide the training set into two separate parts.
(2) Train the base classifiers on the first part.
(3) Test the base classifiers on the second part.
(4) Use the predictions from (3) as inputs, and the correct labels as outputs, to train a meta classifier.
In Stacking, the combining mechanism is that the outputs of the base classifiers (level-0 classifiers) are used as training data for another classifier (the level-1 classifier) to produce the most accurate final predictions. Basically, we let the level-1 classifier (meta classifier) find the best mechanism for combining the level-0 classifiers on its own.
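Steps (1)-(4) can be sketched as follows. The threshold base learners and the simple agreement-weighted meta rule are toy assumptions; the paper's experiments use Weka classifiers instead:

```python
def make_threshold_learner(t):
    """Toy base learner: ignores the data, predicts 1 iff feature >= t."""
    return lambda data: (lambda x: 1 if x >= t else 0)

def train_stacking(train, base_learners):
    half = len(train) // 2
    part1, part2 = train[:half], train[half:]            # (1) split the training set
    models = [learn(part1) for learn in base_learners]   # (2) train base classifiers
    meta_rows = [([m(x) for m in models], y)             # (3) predictions on part 2
                 for x, y in part2]
    # (4) toy level-1 classifier: weight each base model by how often its
    # part-2 predictions matched the true labels, then threshold the score.
    weights = [sum(1 for preds, y in meta_rows if preds[i] == y)
               for i in range(len(models))]
    def predict(x):
        score = sum(w * m(x) for w, m in zip(weights, models))
        return 1 if 2 * score >= sum(weights) else 0
    return predict

train = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
bases = [make_threshold_learner(t) for t in (0.3, 0.5, 0.7)]
model = train_stacking(train, bases)
print(model(0.85), model(0.15))  # -> 1 0
```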

Random Forest
Random Forest (RF) is a classification method developed by Leo Breiman at the University of California, Berkeley. The RF algorithm for classification can be summarized as follows:
- Draw K bootstrap samples from the training set.
- For each bootstrap sample, grow an unpruned tree as follows: at each node, instead of choosing the best split among all predictor variables, randomly select m of the predictor variables and choose the best split among those.
- Make predictions by aggregating the predictions of the K trees (majority vote).
The learning of an RF includes the random use of input variables, or combinations of them, at each node in the process of constructing a decision tree. RF has several strong points: (1) high precision; (2) it handles problems with very noisy data; (3) it runs faster than other ensemble machine learning algorithms; (4) it provides built-in estimates such as the generalization accuracy of the model and the strength and relevance of the features; (5) it is easy to parallelize.
However, the execution time of the algorithm can still be quite long, and it requires considerable system resources.
From these findings about the RF algorithm, we observe that RF is a good classification method because: (1) in RF, the error due to variance is minimized, since the results of RF are synthesized over many learners; (2) random selection at each step in RF reduces the correlation between learners when their results are combined.
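The random feature selection at each node can be sketched as follows; this is a toy split search scored by misclassification count on binary labels. In practice m is much smaller than the total number of features and impurity measures such as Gini are used:

```python
import random

def best_split(data, features, m, rng):
    """At a tree node, consider only m randomly chosen features and return
    the (feature, threshold) pair minimising the misclassification count
    of the induced left/right partition (binary labels 0/1)."""
    best = None
    for f in rng.sample(features, m):        # the RF randomisation step
        for t in sorted({row[f] for row, _ in data}):
            left = [y for row, y in data if row[f] < t]
            right = [y for row, y in data if row[f] >= t]
            err = (min(left.count(0), left.count(1))
                   + min(right.count(0), right.count(1)))
            if best is None or err < best[0]:
                best = (err, f, t)
    return best[1], best[2]

# Feature 0 separates the classes; feature 1 is noise.
data = [((0.1, 5.0), 0), ((0.2, 1.0), 0), ((0.8, 4.0), 1), ((0.9, 2.0), 1)]
f, t = best_split(data, features=[0, 1], m=2, rng=random.Random(0))
print(f, t)  # -> 0 0.8
```

Because each tree sees a different bootstrap sample and a different random feature subset at every node, the trees end up weakly correlated, which is exactly point (2) above.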

Decorate
In Decorate (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples), an ensemble is built iteratively, in each step learning a classifier and adding it to the current ensemble. We initialize the ensemble to contain the classifier trained on the given training data. The classifiers in each successive iteration are trained on the original training data combined with some artificial data. In each iteration, artificial training instances are generated from the data distribution, where the number of instances generated is a fraction, Rsize, of the training set size. The labels for these artificial training instances are chosen to be maximally different from the predictions of the current ensemble. We refer to this artificially labeled set as the diversity data. We train a new classifier on the combination of the original training data and the diversity data, thereby forcing it to differ from the current ensemble; adding this classifier to the ensemble should therefore increase its diversity. While enforcing diversity, we still want to maintain training accuracy, so we reject a new classifier if adding it to the existing ensemble reduces accuracy. This process is repeated until we reach the desired ensemble size or exceed the maximum number of iterations.
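The Decorate loop described above can be sketched as follows. Binary labels, a single numeric feature and a toy nearest-neighbour learner are assumed purely for illustration:

```python
import random
from collections import Counter

def decorate(train, learn, ensemble_size=5, r_size=0.5, max_iter=20, seed=3):
    """DECORATE sketch: repeatedly train a new member on the real data plus
    artificial examples labelled *against* the current ensemble, and keep
    the member only if ensemble training accuracy does not drop."""
    rng = random.Random(seed)
    xs = [x for x, _ in train]
    lo, hi = min(xs), max(xs)

    def ens_predict(ensemble, x):
        return Counter(m(x) for m in ensemble).most_common(1)[0][0]

    def accuracy(ensemble):
        return sum(ens_predict(ensemble, x) == y for x, y in train) / len(train)

    ensemble = [learn(train)]
    acc = accuracy(ensemble)
    for _ in range(max_iter):
        if len(ensemble) >= ensemble_size:
            break
        # Rsize * |train| artificial points drawn from the feature range,
        # labelled opposite to the ensemble's prediction (binary 0/1).
        artificial = [(x, 1 - ens_predict(ensemble, x))
                      for x in (rng.uniform(lo, hi)
                                for _ in range(int(r_size * len(train))))]
        candidate = learn(train + artificial)
        if accuracy(ensemble + [candidate]) >= acc:   # reject harmful members
            ensemble.append(candidate)
            acc = accuracy(ensemble)
    return ensemble

def learn_1nn(data):
    """Toy learner: 1-nearest neighbour on a single numeric feature."""
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]

train = [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)]
ens = decorate(train, learn_1nn)
# The acceptance rule guarantees the final ensemble still fits the training set.
print(all(Counter(m(x) for m in ens).most_common(1)[0][0] == y for x, y in train))
```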

Experiments
The programs and algorithms in the experiments were implemented in the Java programming language, based on the Weka machine learning framework developed by the University of Waikato, New Zealand [13]. Section 3.1 presents the dataset used in the experiments. Section 3.2 presents the evaluation metrics. Section 3.3 presents the results obtained by using homogeneous and heterogeneous ensemble techniques to detect DoS attacks on the UNSW-NB15 dataset.

Datasets
KDD99, NSL-KDD and UNSW-NB15 are the most popular datasets used in IDS research; statistics from 2015 to 2018 show that the NSL-KDD dataset is used in 38% of studies, KDD99 in 23% and UNSW-NB15 in 12% [14]. The UNSW-NB15 dataset has the following characteristics:
(1) It contains modern normal behavior and contemporary synthesized attack activities.
(2) The probability distribution of training and testing datasets is similar.
(3) It includes a set of features extracted from the payload and header of packets to effectively reflect network traffic.
(4) The complexity of evaluating UNSW-NB15 for existing classification systems shows that this dataset has complex patterns [14], especially when classifying DoS attacks [16].
This means that the dataset can be used to evaluate current and new classification methods effectively and reliably. However, because the UNSW-NB15 dataset is still quite new, it has not yet been used by many researchers, so there are limitations when comparing results with other studies. The performance of the classifiers is evaluated by measuring and comparing the following metrics.

Evaluation metrics
- Time for training and testing.
Accuracy has been used by many researchers to assess classification quality. However, the class distribution in most nonlinear classification problems is very imbalanced, so using Accuracy to evaluate the classification quality of a model is not really effective [17]. The more comprehensive metrics recommended for evaluation, F-Measure and G-Means, are calculated as follows [18], [19]:

F-Measure = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall)

G-Means = sqrt(Sensitivity · Specificity)

Here, β is a coefficient that adjusts the trade-off between Precision and Recall, and usually β = 1. The AUC is the probability that a randomly sampled positive instance is ranked higher than a randomly sampled negative instance: AUC = P(score(x⁺) > score(x⁻)). The higher the AUC, the more accurately the model separates the classes. The ROC curve represents the pairs (TPR, FPR) at each threshold, with TPR on the vertical axis and FPR on the horizontal axis (Figure 2). The evaluation metrics used in the experiments of this paper are: Sensitivity, Specificity, Precision, F-Measure, G-Means, AUC, training time and testing time.
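These metrics can be computed from confusion-matrix counts as follows; the counts and scores below are made-up examples:

```python
import math

def confusion_metrics(tp, fp, tn, fn, beta=1.0):
    """Evaluation metrics used in the paper, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # Recall / True Positive Rate
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = ((1 + beta ** 2) * precision * sensitivity /
                 (beta ** 2 * precision + sensitivity))
    g_means = math.sqrt(sensitivity * specificity)
    return sensitivity, specificity, precision, f_measure, g_means

def auc(pos_scores, neg_scores):
    """AUC = P(score(x+) > score(x-)), estimated over all pairs (ties = 0.5)."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    return sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

sens, spec, prec, f1, g = confusion_metrics(tp=95, fp=5, tn=90, fn=10)
print(round(f1, 4), round(g, 4))                   # -> 0.9268 0.9258
print(round(auc([0.9, 0.8, 0.4], [0.5, 0.3]), 4))  # -> 0.8333
```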

Experimental results
We apply various machine learning algorithms in the misuse detection module in order to find the best method for detecting DoS attacks based on F-Measure, G-Means, AUC and speed (computation time). We use six single algorithms from the Weka Data Mining Tools: J48 (DT), NaiveBayes (NB), Logistic (LR), LibSVM (SVM), IBk (KNN) and RandomTree (RT), and then apply these algorithms within six different ensemble classifiers: Bagging, AdaBoost, Stacking, Decorate, Voting and Random Forest, as shown in Figure 3. The classification results are presented in Table 2, whereby KNN was selected as the best single classifier because of its highest F-Measure (0.9416). As a reminder, F-Measure captures the balance between Precision and Recall.
(1) KNN is the single classifier giving the best classification results in Table 1.
(2) Voting is a technique that builds multiple models (with component classifiers using DT, NB, LR, SVM, KNN and RT) and combines their predictions with a simple majority-vote statistic.
(3) Mix Stacking is a Stacking technique with heterogeneous base classifiers using DT, NB, LR, SVM, KNN and RT; the meta classifier selected is KNN (k = 3).
(4) AdaBoost is a homogeneous ensemble algorithm that produces the best results among the homogeneous ensemble algorithms Bagging, AdaBoost, Stacking and Decorate.
Accordingly, Mix Stacking is selected because of its highest F-Measure (0.9928). Figure 4 compares the evaluation metrics of the Mix Stacking algorithm with the other algorithms, whereby Mix Stacking gives better results than the decision tree and random forest approaches suggested by other authors [3], [4].

Conclusions
From the results of the experiments with homogeneous and heterogeneous ensemble techniques on the UNSW-NB15 dataset above, we draw some conclusions:
(1) The ensemble classifiers give better classification quality than single classifiers.
(2) The AdaBoost algorithm with DT component classifiers gives the best classification quality compared to the other ensemble algorithms when classifying DoS attacks on the UNSW-NB15 dataset.
(3) The training and testing time of ensemble classifiers is longer than that of single classifiers, especially when using the Stacking and Decorate techniques.
(4) The Decorate technique helps to improve classification quality on small training datasets such as NSL-KDD and KDD99 [20]. However, for large datasets such as UNSW-NB15, this algorithm is not effective.
(5) Using F-Measure to evaluate classification quality captures the harmonic balance between Precision and Recall.
At the same time, the experimental results raise issues that need further study, in particular: (1) combining the classifiers with feature reduction techniques [21], [16] to obtain a classification system that is more effective in terms of both training time and F-Measure.
(2) The data processing and computing capability of machine systems plays an important role in the performance of algorithms and machine learning methods; improving processing efficiency would therefore also broaden access to artificial intelligence.