Majority Voting and Feature Selection Based Network Intrusion Detection System

Attackers continually develop new exploits and attack strategies designed to evade defenses. Many attacks rely on malware or social engineering to collect user credentials that grant access to networks and data. A network intrusion detection system (NIDS) is essential for network security because it makes it possible to detect and react to malicious traffic. In this paper, we propose a feature selection and majority voting based solution for detecting intrusions. A multi-model intrusion detection system is designed using the Majority Voting approach. Our proposed approach was tested on the NSL-KDD benchmark dataset. The experimental results show that models based on Majority Voting and the Chi-square feature selection method achieved the best accuracy of 99.50% with an error rate of 0.501%, an FPR of 0.005, and an FNR of 0.005 using only 14 features.


Introduction
The development of malware is a significant problem for network intrusion detection system designers. Malicious attacks have become more sophisticated, and the most difficult task is identifying unknown and obfuscated malware, because malware authors use a variety of obfuscation and evasion techniques to avoid detection by an NIDS. Furthermore, security threats such as zero-day attacks on Internet users have proliferated; as a result, computer security has become more crucial as the use of information technology has become a part of our daily lives [1][2][3][4].
There is a considerable difference between network-based intrusion detection systems (NIDS) and host-based intrusion detection systems (HIDS). An intrusion detection system (IDS) is a piece of software or hardware that detects malicious traffic and responds automatically to stop intrusions. Despite their use, intrusion detection approaches are constrained by a number of factors, including Gain-Ratio Attribute Evaluation, and Info-Gain Attribute Evaluation to increase data quality. We found that feature selection yields high attack detection accuracy while incurring minimal system overhead. Using only 14 features with Majority Voting and the CHI feature selection method, the accuracy was 99.50% with an error rate of 0.501%, an FPR of 0.005, and an FNR of 0.005.
• We have conducted a performance comparison of our proposed work with other comparable techniques and found that our approach outperforms them in attack detection accuracy.
The remainder of this paper is organized as follows. Section 2 discusses related work on intrusion detection. Section 3 describes the proposed methodology in detail. Section 4 discusses the experimental results, the performance of the proposed model, and its comparison to existing models. Section 5 provides the discussion. Finally, Section 6 presents the conclusion.

Related Work
Previous studies have reported research on machine learning approaches in network intrusion detection systems. The issues that each study addresses differ, such as feature selection, data reduction, and classification model optimization. Leevy, J. L., et al. found that, where reported, the top performance scores of the surveyed studies were exceptionally high overall, probably due to overfitting. They also noticed that the majority of the studies did not address class imbalance, which might bias the findings of a big data study. They found that information on the cleaning of the CSE-CIC-IDS2018 data was frequently insufficient, raising concerns about experiment reproducibility. According to them, their survey found considerable research gaps [1]. Khraisat A. et al. provided a taxonomy of current IDS, as well as a comprehensive review of notable recent work and an overview of datasets commonly used for evaluation purposes. They also highlighted future research challenges in countering evasion techniques and improving the security of computer systems [2]. Laghrissi F. et al. used deep learning based on Long Short-Term Memory (LSTM) to detect attacks. They employed PCA and mutual information (MI) as techniques for dimensionality reduction and feature selection. They tested their technique on a standard dataset, KDD99, and the findings show that PCA-based models attain the highest accuracy for training and testing in both binary and multiclass classification [3]. Megantara, A. A. et al. proposed a hybrid machine learning approach that combines feature selection, representing supervised learning, with data reduction, representing unsupervised learning, to build an appropriate model. According to them, it works by selecting relevant and significant features using a decision tree feature importance based recursive feature elimination approach and by finding anomaly/outlier data using the Local Outlier Factor technique.
Their experimental results demonstrate that the suggested approach achieves the highest accuracy in identifying R2L (i.e., 99.89%) and outperforms most previous research on the NSL-KDD dataset for other attack types [4]. Jadhav, A. D., et al. created the Two-Phase Intrusion Detection System (TP-IDS), which works in two phases to improve accuracy. They employed SVM and kNN in Phase I of the TP-IDS. In order to improve accuracy, Decision Tree and Naïve Bayes are used during Phase II, the validation stage of the TP-IDS. According to them, each phase uses the Hadoop distributed system as the primary data storage and processing framework, which permits parallel processing and so improves the overall performance and efficiency of the TP-IDS [5]. Divyasree T. H. et al. proposed an efficient intrusion detection system based on an Ensemble Core Vector Machine (CVM). They employed CVM algorithms based on the notion of the minimum enclosing ball. The system detects U2R and R2L attacks, as well as Probe and DoS attacks. They used the KDD Cup99 dataset to train and test the classifiers. They also used the chi-square test to determine the most significant attributes for each attack, and then applied a weighted function to these features to reduce dimensionality. The test results reveal that the model outperformed earlier approaches on all four attack types while needing less processing time [6]. To identify and categorize network threats, Ashiku, L. et al. proposed leveraging deep learning architectures to construct an adaptive and robust network intrusion detection system. Their emphasis is on how deep learning, or deep neural networks (DNNs), may enable adaptable IDSs with learning capabilities to identify known and novel or zero-day network behaviour patterns, eject the system intruder, and limit the risk of compromise.
They used the UNSW-NB15 dataset to demonstrate the model, as it represents real-time network communication behaviour together with synthetic attack activities [7]. To safeguard the cloud from potential attacks, Elmasry, W. et al. proposed a novel integrated cloud-based intrusion detection system (CIDS). According to them, the suggested CIDS consists of five primary modules that perform the following tasks: monitoring the network, capturing traffic flows, extracting features, analyzing the flows, identifying intrusions, responding to them, and logging all actions. They employed an enhanced bagging ensemble system with three deep learning models to predict intrusions accurately. They reported that the suggested technique addresses the issues raised in the cloud threat literature [13]. Sistla, V. P. K., et al. developed deep learning based NIDS prediction models to identify anomalies and threats automatically. They evaluated the proposed model's performance on the NSL-KDD dataset using metrics such as accuracy, recall, precision, and F1 score. They claim that the experimental findings suggest that the proposed deep learning model outperforms earlier shallow models [14]. RNNIDS was created by Sohi, S. M. et al., who used Recurrent Neural Networks (RNNs) to learn complex patterns in attacks and generate similar ones. They verified that RNNs work well for generating new, previously unseen variants of attacks, as well as synthetic signatures from the most complicated malware, to improve intrusion detection performance further. They also showed that RNNs can be used to produce malicious datasets that include, for example, previously unseen malware variants.
To evaluate the practicality of their methods, they conducted extensive experiments on publicly accessible datasets, which showed a significant increase in the detection rate of commercially available NIDS (up to 16.67%) [15]. Zhou, Y., et al. created an intrusion detection system based on feature selection and ensemble techniques. First, they proposed the CFS-BA heuristic dimensionality reduction algorithm, which picks the optimal subset based on the correlation between features. They then introduced an ensemble technique that combines the C4.5, Random Forest (RF), and Forest by Penalizing Attributes (Forest PA) algorithms. Finally, for intrusion detection, they used a voting technique to combine the probability distributions of the base learners. According to the experimental findings on the NSL-KDD, AWID, and CIC-IDS2017 datasets, the suggested CFS-BA-Ensemble approach outperforms other related and state-of-the-art techniques [16]. Mane, S. et al. employed deep neural networks to detect network intrusions and proposed an explainable AI framework to add transparency at every stage of the machine learning pipeline. They accomplished this with Explainable AI algorithms, which make ML models less of a black box by explaining why a prediction is made. The explanations are generated using SHAP, LIME, the Contrastive Explanations Method, ProtoDash, and Boolean Decision Rules via Column Generation (BRCG). They present the results of applying these algorithms to the NSL-KDD dataset for an intrusion detection system [17]. Guezzaz, A., et al. proposed a decision tree based technique for detecting intrusions with improved data quality. They used network data pre-processing and entropy decision feature selection to improve data quality and training, and they built a decision tree classifier for reliable intrusion detection.
An experimental study on two datasets shows that the proposed model gives reliable results. On the NSL-KDD and CICIDS2017 datasets, their technique obtained 99.42% and 98.80% accuracy, respectively [18]. Li, L., et al. proposed a novel hybrid model for efficiently detecting network intrusions. In the suggested model, the Gini index is used to choose the best subset of features, the gradient boosted decision tree (GBDT) algorithm is used to detect network intrusions, and the particle swarm optimization (PSO) algorithm is used to tune the GBDT parameters. They evaluated the suggested model on the NSL-KDD dataset in terms of accuracy, detection rate, precision, F1 score, and false alarm rate.
According to the results, the suggested model outperforms the compared techniques [19]. Using the UNSW-NB15 dataset as a benchmark, Moualla, S. et al. proposed a novel network IDS that plays an important role in network security and counters current cyber-attacks on networks. According to them, their suggested system is a machine learning based, multi-class system. The method uses the Synthetic Minority Oversampling Technique (SMOTE) to deal with the imbalanced classes in the dataset, and then uses an Extremely Randomized Trees Classifier (Extra Trees Classifier) to select the important features in the dataset based on the Gini impurity criterion. Following that, they employed a pre-trained extreme learning machine (ELM) model for each attack separately, using it as a "One-Versus-All" binary classifier. The experimental results show that the proposed approach outperforms related work [20]. Xu, W., et al. proposed the Interpretable Intrusion Detection System (I2DS), a novel intrusion detection system based on model-based interpretability. They combined Normal and Attack samples reconstructed with an AutoEncoder (AE) with the training samples to emphasise the Normal and Attack features, producing a notable improvement for the classifier. They then employed Additive Tree (AddTree) as a binary classifier, which offered good predictive performance on the combined dataset while preserving adequate model-based interpretability. They evaluated the suggested approach using the UNSW-NB15 dataset. According to them, I2DS obtained a detection accuracy of 99.95%, which is higher than most current intrusion detection systems [21]. Kabir, E. et al. proposed a novel intrusion detection system based on sampling with the Least Squares Support Vector Machine. They split the detection procedure into two stages. In the first stage, the whole dataset is divided into predetermined arbitrary subgroups.
In the second stage, the least squares support vector machine is applied to the extracted samples to detect intrusions. They achieved a reasonable level of accuracy and efficiency [22]. According to Zhang, F. et al., learning-based classifiers have been shown to be vulnerable to adversarial examples. They illustrate how adversarial inputs crafted solely from the model's decision outputs can readily evade a discrete-valued random forest classifier. They presented a gradient-free evasion method and showed that random forests are much more vulnerable than SVMs [23]. CyberPulse++, a machine learning based security system described by Rasool, R. U., et al., uses a pre-trained machine learning repository to evaluate collected network statistics in real time and detect abnormal path performance on network links. According to them, it efficiently addresses various issues faced by network security solutions, including the feasibility of large-scale network-level monitoring and data collection. They showed that the system can proactively identify and defend against link flooding attacks in real time with little bandwidth and computational overhead [24]. Rasool, R. U., et al. illustrated the susceptibility of the software-defined networking control layer to link flooding attacks, as well as how the attack technique differs from that used against traditional networks, which mostly entails attacking the links directly. They introduced CyberPulse, a novel and effective countermeasure that uses a machine learning based classifier to mitigate link flooding attacks in software-defined networks. They compared CyberPulse to competing techniques for accuracy, false positive rate, and effectiveness on realistic networks built with Mininet. According to them, the results suggest that CyberPulse is capable of accurately classifying malicious traffic flows and successfully mitigating them [25].
Figure 1 shows the framework of our proposed network intrusion detection system. It consists of three phases: a feature selection phase, a training phase, and a testing phase. The NSL-KDD dataset is given as input to four feature selection methods: correlation-based feature selection (CFS), information gain ratio (Gain Ratio), chi-square (CHI), and information gain (InfoGain) [26][27][28].

Feature Selection Techniques
The merit of a feature subset in CFS is given by equation (1),

$$Merit_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (1)$$

where $Merit_S$ = the heuristic "merit" of a feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ = the mean feature-class correlation, and $\overline{r_{ff}}$ = the average feature-feature intercorrelation. We evaluated CFS with the genetic search technique on the NSL-KDD training dataset and selected the 8 top-ranked features, shown in Table 1, out of 41 total features to evaluate the performance of the machine learning classifiers.
To determine the worth of an attribute, CHI computes the value of the chi-squared statistic with respect to the class. The CHI score is typically calculated using equation (2) [26,30],

$$\chi^2(a, c) = \frac{N\,(XZ - YW)^2}{(X + W)(Y + Z)(X + Y)(W + Z)} \qquad (2)$$

where $X$ = the number of times feature $a$ and class $c$ occur together, $Y$ = the number of times feature $a$ occurs without class $c$, $W$ = the number of times class $c$ occurs without feature $a$, $Z$ = the number of times neither $a$ nor $c$ occurs, and $N$ = the total size of the training set. We evaluated chi-squared attribute selection with the ranker search technique on the NSL-KDD training dataset and selected a subset of the 14 top-ranked features, shown in Table 2, out of 41 total features to evaluate the performance of the machine learning classifiers.
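The two scores described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WEKA implementation used in the experiments: it assumes the standard 2x2 contingency form of the chi-squared statistic and the standard CFS merit formula, and the function names and the example counts are hypothetical.

```python
import math


def chi_square_score(x, y, w, z):
    """Chi-squared statistic for one feature/class pair (assumed standard 2x2 form).

    x: times the feature and the class occur together
    y: times the feature occurs without the class
    w: times the class occurs without the feature
    z: times neither occurs
    """
    n = x + y + w + z  # total size of the training set
    denominator = (x + w) * (y + z) * (x + y) * (w + z)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence of dependence
    return n * (x * z - y * w) ** 2 / denominator


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k features (assumed standard formula)."""
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )


# Hypothetical counts: the feature co-occurs strongly with the class.
print(chi_square_score(40, 10, 10, 40))  # 36.0
# Four equally relevant features: redundancy (high r_ff) lowers the merit.
print(cfs_merit(4, 0.5, 0.0), cfs_merit(4, 0.5, 1.0))  # 1.0 0.5
```

A ranker-style search then simply sorts the features by their chi-squared score and keeps the top ones, whereas CFS compares whole subsets by merit, which is why redundant features are penalized.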
Gain is a criterion for attribute selection in the ID3 approach. It is also referred to as information gain. The attribute with the largest information gain is chosen as the splitting attribute for node N. This choice minimizes the amount of information required to classify the dataset D in the resulting partitions and yields the partitions with the lowest impurity. Information gain is defined as the difference in entropy before and after splitting the dataset D on attribute A. Equation (4) computes the entropy, that is, the uncertainty in the dataset D,

$$H(D) = -\sum_{x \in X} p(x) \log_2 p(x) \qquad (4)$$

where $X$ = the set of classes in dataset $D$ and $p(x)$ = the proportion of the number of elements in class $x$ to the number of elements in dataset $D$. SplitInfo describes how equally the attribute splits the dataset and is calculated by equation (5),

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \qquad (5)$$

where $v$ is the number of partitions produced by splitting $D$ on attribute $A$ and $\frac{|D_j|}{|D|}$ represents the weight of the $j$th partition.
We evaluated gain ratio feature selection with the ranker search technique on the NSL-KDD training dataset and selected a subset of the 14 top-ranked features, shown in Table 3, out of 41 total features to evaluate the performance of the machine learning classifiers.
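InfoGain evaluates the worth of an attribute by measuring the information gain with respect to the class, as in equation (6),

$$InfoGain(Class, Attribute) = H(Class) - H(Class \mid Attribute) \qquad (6)$$

where $H$ is the information entropy. We evaluated InfoGain feature selection with the ranker search technique on the NSL-KDD training dataset and selected a subset of the 14 top-ranked features, shown in Table 4, out of 41 total features to evaluate the performance of the machine learning classifiers.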
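To make the entropy-based criteria concrete, the following minimal sketch computes entropy, information gain, SplitInfo, and gain ratio for a nominal attribute. It assumes the usual definitions (information gain as class entropy minus the weighted entropy of the partitions, and gain ratio as information gain divided by SplitInfo); the helper names and the toy protocol values are illustrative assumptions, and numeric attributes would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def partition(values, labels):
    """Group class labels by the attribute value of each instance."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def info_gain(values, labels):
    """Entropy of the class before the split minus the weighted entropy after it."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def split_info(values):
    """Entropy of the partition sizes: large when the attribute has many small partitions."""
    return entropy(values)


def gain_ratio(values, labels):
    """Information gain normalized by SplitInfo (0.0 when the attribute never splits)."""
    s = split_info(values)
    return info_gain(values, labels) / s if s > 0 else 0.0


labels = ["attack", "attack", "normal", "normal"]
print(info_gain(["tcp", "tcp", "udp", "udp"], labels))  # 1.0 (perfectly informative)
print(info_gain(["tcp", "udp", "tcp", "udp"], labels))  # 0.0 (uninformative)
print(gain_ratio(["a", "b", "c", "d"], labels))         # 0.5 (many-valued attribute penalized)
```

The last call shows why gain ratio is used alongside information gain: an attribute with a distinct value per instance has maximal information gain, but its large SplitInfo halves its gain ratio.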

Classification Methods
Six machine learning methods were used for training and testing: XGBoost, J48 Decision Tree, AdaBoost, Random Forest, REPTree, and Majority Voting. We used the Majority Voting method to create a multi-model classification system. These are among the most extensively used and efficient classification methods. To perform the experiments, we employed the WEKA API implementations of these learning approaches.
The specifics of each algorithm are as follows [31]. These algorithms are widely used and researched for classification tasks, with applications in a broad variety of fields such as text classification, image classification, intrusion detection, and malware detection, where they have provided promising classification results.

(i) XGBoost
XGBoost (eXtreme Gradient Boosting) is a well-known and effective method. Gradient boosting is a supervised learning technique that combines the estimates of an ensemble of simpler, weaker models in order to predict a target variable accurately. The XGBoost approach performs well in machine learning challenges due to its robust handling of a wide range of data types, relationships, and distributions, as well as the wide range of hyperparameters that can be fine-tuned. XGBoost [32] can address regression, classification (binary and multiclass), and ranking problems.

(ii) J48 Decision Tree
J48 decision tree learning is one of the most popular classification approaches. It is highly efficient and has classification accuracy comparable to other learning methods. A decision tree is a tree structure that represents the learned classification model, and it is easy to understand. In WEKA, J48 is an implementation of C4.5. By recursively partitioning the data, the C4.5 algorithm generates a classification decision tree for a given dataset. The tree is grown using a depth-first strategy. The method evaluates all feasible tests that can split the data and chooses the one with the highest information gain [33].

(iii) AdaBoost
AdaBoost is one of the most widely used and researched boosting algorithms, with applications in a broad variety of fields. Freund and Schapire developed the AdaBoost algorithm in 1995. Boosting is a machine learning approach that combines a large number of weak and inaccurate classifiers to create a highly accurate classifier. AdaBoost is easy to use, fast, and simple to understand. It requires no prior knowledge about the weak learner, so it may be used in conjunction with any method for finding weak hypotheses [34].
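The way boosting turns weak classifiers into a strong one can be illustrated with a single reweighting round. The sketch below assumes the common binary (discrete) AdaBoost formulation with labels in {-1, +1} and instance weights that sum to 1; it is not the WEKA AdaBoostM1 code used in the experiments, and the function name and toy data are hypothetical.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One round of discrete AdaBoost (assumed formulation, labels in {-1, +1}).

    Returns the weak learner's vote weight (alpha) and the renormalized
    instance weights for the next round.
    """
    # Weighted error of the weak learner on the current distribution.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    if not 0 < error < 0.5:
        raise ValueError("the weak learner must have a weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified instances are up-weighted, correct ones down-weighted.
    updated = [
        w * math.exp(alpha if y != h else -alpha)
        for w, y, h in zip(weights, labels, predictions)
    ]
    total = sum(updated)
    return alpha, [u / total for u in updated]


# Toy round: four equally weighted instances, the last one misclassified.
alpha, new_weights = adaboost_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    labels=[1, 1, -1, -1],
    predictions=[1, 1, -1, 1],
)
print(round(alpha, 4))                      # 0.5493
print([round(w, 4) for w in new_weights])   # [0.1667, 0.1667, 0.1667, 0.5]
```

After the update the single misclassified instance carries half of the total weight, so the next weak learner is pushed to classify it correctly.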

(iv) Random Forest
As the name indicates, a random forest is made up of a large number of individual decision trees that work together as an ensemble. Each tree in the random forest produces a class prediction, and the class with the most votes becomes the model's prediction. The core concept behind the random forest is the wisdom of crowds, and it is a simple yet powerful one. The random forest model is particularly effective because it consists of a large number of relatively uncorrelated models (trees) that work together to outperform each of the individual constituent models [35].

(v) REPTree
The REPTree classifier is a fast decision tree learner that builds classification and regression trees using the C4.5 method. It constructs a regression/decision tree using information gain/variance and prunes it using reduced-error pruning [36].

(vi) Majority Voting
Voting is the most fundamental ensemble approach, and it is typically quite effective. It may be used to solve classification and regression problems. It works by building two or more sub-models, in this case five. The majority voting process is used to combine the predictions from each sub-model. The majority voting method is depicted in Figure 2. It is a meta-classifier that combines conceptually similar or dissimilar machine learning classifiers by majority vote. We use majority voting to predict the final class label, which is the class label most frequently predicted by the classification models. We predict the class label $\hat{y}$ from the majority vote of the classifiers $C_j$ using equation (7) [26][37][38],

$$\hat{y} = \operatorname{mode}\{C_1(x), C_2(x), \ldots, C_m(x)\} \qquad (7)$$

where $m$ is the number of classifiers and $x$ is the input sample.
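The voting rule described above amounts to taking the most frequent label among the base classifiers for each sample. The following minimal sketch shows this for five hypothetical sets of predictions; it is not the WEKA Vote implementation, the prediction values are made up for illustration, and the tie-breaking rule (the first tied label in classifier order wins) is an assumption that never triggers with five voters and two classes.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers for one sample."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    # Assumed tie-break: the first tied label in classifier order wins.
    for label in predictions:
        if counts[label] == top:
            return label


def vote_all(per_classifier_predictions):
    """Combine one list of predictions per classifier into final labels per sample."""
    return [majority_vote(list(sample)) for sample in zip(*per_classifier_predictions)]


# Hypothetical outputs of the five base classifiers on three samples.
base_predictions = [
    ["attack", "normal", "normal"],  # XGBoost
    ["attack", "normal", "attack"],  # J48
    ["normal", "normal", "attack"],  # AdaBoost
    ["attack", "attack", "attack"],  # Random Forest
    ["attack", "normal", "normal"],  # REPTree
]
print(vote_all(base_predictions))  # ['attack', 'normal', 'attack']
```

Each base classifier is wrong on at least one of the three toy samples under the voted labels or disagrees with another, yet the combined output follows the majority in every column, which is the error-compensation effect the ensemble relies on.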

Motivations for using the multi-model strategy
The motivations for using a multi-model strategy to detect network intrusions are as follows:
• It is a strategy for improving model performance, with the goal of outperforming any single model in the ensemble.
• Majority Voting is based on the predictions of many models, so large errors or misclassifications by one model are not a hindrance.
• A model's poor performance can be compensated for by the good performance of other models.
• Combining models to produce a prediction reduces the chance that one model's incorrect prediction determines the outcome, because several other models can make the correct prediction.
• Majority Voting makes the estimator more robust and less prone to overfitting.

Dataset
The NSL-KDD dataset was created to address some of the drawbacks of the KDD99 dataset. Although this revised version of the KDD dataset still has some flaws and may not be a perfect representation of real-world networks, it can still be used as a benchmark dataset to help researchers compare different intrusion detection systems, given the lack of public datasets for network-based IDS. The NSL-KDD train and test sets also contain a reasonable number of records. This advantage allows experiments to be run on the complete sets rather than on a random sample. As a result, the findings of different research studies are consistent and comparable. In this work, we use the NSL-KDD dataset, which contains the KDDTrain+ and KDDTest+ sets. The KDDTrain+ set has a total of 125,973 instances, of which 58,630 are attack traffic and 67,343 are normal traffic. The KDDTest+ set has a total of 22,544 instances. A detailed overview of the instances is shown in Table 5 [39][40].

Evaluation Measures
To evaluate the performance of the classifiers, we used the following metrics. A binary classifier assigns a positive or negative label to every data item in a test dataset. This classification (or prediction) produces four outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) [41].
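The metrics reported in this paper (accuracy, error rate, FPR, and FNR) follow directly from these four counts. The sketch below assumes their conventional definitions; the function name is hypothetical and the counts in the example are invented solely to show the arithmetic, not taken from the experiments.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, error rate, FPR, and FNR from confusion-matrix counts.

    Assumes conventional definitions and that both classes are present
    (so neither fp + tn nor fn + tp is zero).
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,    # share of correct predictions
        "error_rate": (fp + fn) / total,  # share of incorrect predictions
        "fpr": fp / (fp + tn),            # normal traffic flagged as attack
        "fnr": fn / (fn + tp),            # attacks missed as normal traffic
    }


# Invented counts for illustration only.
metrics = confusion_metrics(tp=995, tn=995, fp=5, fn=5)
print(metrics)
```

With these illustrative counts the accuracy is 0.995 and the error rate, FPR, and FNR are each 0.005; accuracy and error rate always sum to 1 under these definitions.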

Performance Evaluation of Random Forest classifier using feature selection techniques on NSL-KDD dataset
Table 12 and Figure 3 show the comparative performance evaluation of our proposed approach using the Majority Voting classifier against other similar works on the NSL-KDD dataset. Our approach outperforms the other approaches, with an attack detection accuracy of 99.50% using only 14 of the 41 features.

Positive impacts of Network Intrusion Detection System
Following are some of the positive impacts of network intrusion detection systems.
• NIDS can be configured to display the specific information included within the packets. This feature can be used to detect intrusions such as exploitation attacks and botnet-infected packets.
• NIDS looks at the number and types of attacks. This information can be utilised to improve security measures. It can also be examined for flaws with network settings.
• To satisfy certain standards, NIDS logs can be used as documentation.
• This increased efficiency can help a business to save money on employees while also covering the costs of installing the NIDS.

Negative impacts of Network Intrusion Detection System
Following are some of the negative impacts of network intrusion detection systems.
• An NIDS does not stop or prevent attacks; rather, it aids in their detection.
• An NIDS is extremely valuable for network monitoring, but the value of the information it provides is entirely dependent on what you do with it.
• Intruders can use encrypted packets to sneak into the network, since an NIDS cannot inspect them.
• Although an NIDS reads the data from an IP packet, the network address may still be spoofed.
• One major drawback of an NIDS is that it frequently raises false positives.
• Because an NIDS examines protocols as they are captured, it is vulnerable to the same protocol-based attacks as network hosts.
• An NIDS's signature library determines how effective it is. If the library is not updated often, the NIDS will not register the latest attacks and will not be able to warn about them.

Benefits for the academic community and government
Following are the benefits of NIDS for the academic community and government.
• The academic community can use NIDS as a research platform for further study to solve open challenges in this field.
• Researchers can provide more effective solutions that detect new threats efficiently with minimal false positives.
• Government agencies and organizations can use NIDS to detect threats before they can compromise their systems.

Conclusions
We presented feature selection and majority voting based classification for detecting attacks using the NSL-KDD dataset. Correlation-based feature selection (CFS) with GreedyStepwise, Chi-squared attribute evaluation (CHI) with Ranker, Gain Ratio Feature Evaluation with Ranker, and InfoGain Feature Evaluation with Ranker search strategies were used for feature selection. Using these feature selection strategies, we selected 8 features with CFS, 14 with CHI, 14 with GainRatio, and 14 with InfoGain. To evaluate the performance of the feature selection techniques and the selected features, we used six machine learning classifiers: XGBoost, Random Forest, AdaBoost, REPTree, J48 Decision Tree, and Majority Voting. We built a multi-model classification model using the Majority Voting classifier. According to the experimental results, our proposed solution, which employs a Majority Voting classifier and feature selection algorithms such as CHI and InfoGain, achieved a 99.50% attack detection accuracy with minimal system overhead. Furthermore, we compared our proposed technique to other comparable approaches and found that it outperforms them in terms of attack detection accuracy.