Improving Network Intrusion Detection Classifiers by Non-payload-Based Exploit-Independent Obfuscations: An Adversarial Approach

Machine-learning-based intrusion detection classifiers are able to detect unknown attacks, but at the same time they may be susceptible to evasion by obfuscation techniques. An adversary who possesses crucial knowledge about a protection system can easily bypass the detection module. The main objective of our work is to improve the performance of intrusion detection classifiers against such adversaries. To this end, we first propose several obfuscation techniques for remote attacks that are based on the modification of various properties of network connections; then we conduct a set of comprehensive experiments to evaluate the effectiveness of intrusion detection classifiers against obfuscated attacks. We instantiate our approach by means of a tool, based on NetEm and Metasploit, which implements our obfuscation operators on any TCP communication. This allows us to generate modified network traffic for machine learning experiments employing features for assessing network statistics and behavior of TCP connections. We evaluate five classifiers: Gaussian Naive Bayes, Gaussian Naive Bayes with kernel density estimation, Logistic Regression, Decision Tree, and Support Vector Machines. Our experiments confirm the assumption that it is possible to evade the intrusion detection capability of all classifiers trained without prior knowledge about obfuscated attacks, causing a deterioration of the TPR ranging from 7.8% to 66.8%. Further, when widening the training knowledge of the classifiers by a subset of obfuscated attacks, we achieve a significant improvement of the TPR by 4.21% - 73.3%, while the FPR deteriorates only slightly (0.1% - 1.48%). Finally, we test the capability of an obfuscations-aware classifier to detect unknown obfuscated attacks, where we achieve a detection rate of over 90% on average for most of the obfuscations.


Introduction
Network intrusion attacks such as exploiting unpatched services continue to be one of the most dangerous threats in the domain of information security [1], [2]. Due to an increasing sophistication in the techniques used by attackers, misuse-based/knowledge-based [3] of intrusion detection. Anomaly-based approaches build profiles of normal users and try to detect anomalies deviating from these profiles [3], which might lead to the detection of unknown intrusions but might also generate many false positives. In contrast, classification approaches combine the misuse-based and anomaly-based models in order to leverage their respective advantages. Classification detection methods first build a model based on labeled samples from both classes: intrusions and legitimate instances. Second, they compare a new input to the model and select the more similar class as the predicted label. Classification and anomaly-based approaches are capable of detecting some unknown intrusions, but at the same time they may be susceptible to evasion by obfuscation techniques.
Scope and Assumptions: For efficiency reasons, as well as because of pervasive encryption, we assume in this work a classification-based network intrusion detection system that does not perform deep packet inspection and whose model works with TCP connection objects, not single packets. Also, we assume an adversary who knows the design details of such a system but cannot modify its training data. The adversary can only modify the input of the system in a limited way that has to conform to the protocol specification of the TCP/IP stack, including the victim's application. The adversary can achieve this in several ways: modifying the exploit code, adding padding at the application layer of the exploit code, or artificially influencing network or transport layer protocols. If an adversary wants to take advantage of a huge database of existing exploits to create their obfuscated mutations and massively exploit targets, adding padding or making various changes to the exploit code may be time consuming and unsustainable with newly obtained exploits. Therefore, the easiest way for an adversary is to design non-payload-based obfuscation techniques working at the network and transport layers, which mutate instances of known intrusions in an exploit-independent way. This makes attacks similar to legitimate traffic. We follow this idea in our paper and construct exploit-independent modifications of attacks at the network and transport layers of TCP/IP. According to the taxonomy of adversarial attacks against IDS [4], our adversarial approach belongs to evasions of the measurement phase of IDS. Considering influence, security violation, and specificity as dimensions of the taxonomy of attacks against learning systems [5], our obfuscated attacks belong to: 1) exploratory attacks, which exploit misclassification but do not affect training data, 2) integrity attacks, which compromise assets via false negatives, and 3) indiscriminate attacks, which compromise a wide class of instances.
Although non-payload-based evasions and obfuscations of network attacks are not new research topics [6], [7], [8], they are still challenging subjects. To date, this is witnessed by a few citations of Stonesoft's technical report [9], which for the first time successfully applied non-payload-based obfuscations to existing network attacks. There exist several related works considering non-payload-based adversarial evasions of network attacks for payload-based intrusion detection [7], [10], [11]. However, to the best of our knowledge, there are no studies on non-payload-based intrusion detection and obfuscation-based adversarial evasion (except our previous research [12][13][14]).
Problem statement: In this work we address the following questions:
• Is it possible to evade the detection of a non-payload-based intrusion detection classifier by obfuscation techniques?
• If so, is it possible to increase the resilience of such a classifier against obfuscated attacks, or even detect unknown ones?
Proposed Solution: To address this problem, we define a set of obfuscation operators based on non-payload-based modifications of connection-oriented communications, accomplished by the NetEm utility [15] and the ifconfig command. Subsequently, we propose several experiments to train a classifier using obfuscated attacks as well as obfuscated legitimate connections, and compare it against another classifier model that is unaware of obfuscated attacks.

Contributions:
The main contributions of this paper are as follows:
a) We define non-payload-based obfuscation techniques and their influence on a classification task in an intrusion detection classifier.
b) We implement several obfuscation techniques as part of our obfuscation tool and later conduct a data collection experiment that employs the obfuscation tool.
c) We perform an evaluation of non-payload-based obfuscation techniques using our dataset, and we reveal them as: 1) successful in evading detection by five classifiers that leverage a selected subset of the network connection features designed in [16], as well as 2) successful in improving the evasion resistance of the classifiers against unknown obfuscated attacks.
d) Moreover, we elucidate an alternative view on the outcome of our results, which is denoted as a training-data-driven approximation of a network traffic normalizer.
e) The collected dataset is provided to the research community.

Background
Consider a session of a protocol at the application layer of the TCP/IP stack that serves for data transfer between the client and the server of an application. Considering the TCP/IP stack up to the transport layer, such application data exchanges between client and server can be formulated as a connection $k$ that is constrained to the connection-oriented protocol TCP at L4, the Internet protocol IP at L3, and the Ethernet protocol at L2. The TCP connection $k$ is represented by start and end timestamps, ports of the client and the server, IP addresses of the client and the server, and the sets of packets sent by the client, $P_c$, and by the server, $P_s$, respectively.

Features Extraction
At this point, we can express the characteristics of a TCP connection by network connection features. The feature extraction process is defined as a function that maps a connection $k$ into the space of features $F$: $f(k) \rightarrow F = (F_1, F_2, \ldots, F_n)$, where $n$ represents the number of defined features. Each function $f_i$ that extracts feature $i$ is defined as a mapping of a connection $k$ into the feature space $F_i$: $f_i(k) \rightarrow F_i$, $i \in \{1, \ldots, n\}$, and each element of the codomain $F_i$ is defined as $e = (e_0, \ldots, e_n)$, $n \in \mathbb{N}_0$, where $\Gamma^+$ denotes the positive iteration of the set $\Gamma$. In the context of this work, examples of such features are shown in Table A.1 of the Appendix.
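The connection representation and the feature extraction function can be illustrated with a minimal Python sketch. The class names, the field names, and the five feature functions below are hypothetical choices made for illustration only; they are not the ASNM features of [16], although standard deviations of packet sizes per direction are of the same flavor as features discussed later in the text.

```python
from dataclasses import dataclass, field
from statistics import pstdev
from typing import List, Tuple


@dataclass
class Packet:
    timestamp: float  # seconds since the start of the capture
    size: int         # packet length in bytes


@dataclass
class Connection:
    """TCP connection k: time span, endpoints, and both packet sets."""
    start: float
    end: float
    client: Tuple[str, int]  # (IP address, port) of the client
    server: Tuple[str, int]  # (IP address, port) of the server
    p_c: List[Packet] = field(default_factory=list)  # packets sent by the client
    p_s: List[Packet] = field(default_factory=list)  # packets sent by the server


def std_size(packets: List[Packet]) -> float:
    """Population standard deviation of packet sizes (0.0 for an empty set)."""
    sizes = [p.size for p in packets]
    return pstdev(sizes) if sizes else 0.0


# Hypothetical feature functions f_1..f_n; each maps a connection to one value.
FEATURES = [
    lambda k: k.end - k.start,    # duration of the connection
    lambda k: float(len(k.p_c)),  # number of client packets
    lambda k: float(len(k.p_s)),  # number of server packets
    lambda k: std_size(k.p_c),    # variability of client packet sizes
    lambda k: std_size(k.p_s),    # variability of server packet sizes
]


def extract_features(k: Connection) -> List[float]:
    """f(k) -> (F_1, ..., F_n): apply every feature function to the connection."""
    return [f_i(k) for f_i in FEATURES]
```

Each entry of the returned list corresponds to one $F_i$; a real extractor would additionally need to reassemble connections from a packet trace, which this sketch leaves out.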

Intrusion Detection Classification Task
Referring to [17], let $X = V \times Y$ be the space of labeled samples, where $V$ represents the space of unlabeled samples and $Y$ represents the space of possible labels.
Let $D_{tr} = \{x_1, x_2, \ldots, x_n\}$ be a training dataset consisting of $n$ labeled samples, where $x_i = (v_i, y_i)$, $v_i \in V$, $y_i \in Y$. Consider a classifier $C$ which maps an unlabeled sample $v \in V$ to a label $y \in Y$: $y = C(v)$, and a learning algorithm $A$ which maps the given dataset $D$ to a classifier $C$: $C = A(D)$. The notation $y_{predict} = A(D_{tr}, v)$ denotes the label assigned to an unlabeled sample $v$ by the classifier $C$, built by the learning algorithm $A$ on the dataset $D_{tr}$. Now, all extracted features of the connection $k$ can be used as an input of the trained classifier $C$ that predicts the target label: $y_{predict} = A(D_{tr}, f(k))$, where $y_{predict} \in \{Intrusion, Legitimate\}$.
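To make the roles of the learning algorithm and the classifier concrete, the following is a minimal pure-Python sketch in which the learning algorithm is a small Gaussian Naive Bayes learner that returns the classifier as a function. The function name `learn_gaussian_nb` and the variance floor are assumptions of this sketch; it is not the RapidMiner implementation used in the experiments.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # labeled sample x = (v, y)


def learn_gaussian_nb(d_tr: List[Sample]) -> Callable[[Sequence[float]], str]:
    """Learning algorithm A: maps a labeled dataset D_tr to a classifier C."""
    by_label: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for v, y in d_tr:
        by_label[y].append(v)

    model = {}
    for y, rows in by_label.items():
        stats = []
        for column in zip(*rows):  # one column per feature
            mean = sum(column) / len(column)
            var = sum((c - mean) ** 2 for c in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        log_prior = math.log(len(rows) / len(d_tr))
        model[y] = (log_prior, stats)

    def classify(v: Sequence[float]) -> str:
        """Classifier C: maps an unlabeled sample v to the most probable label."""
        def log_posterior(y: str) -> float:
            log_prior, stats = model[y]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
                for x, (mean, var) in zip(v, stats)
            )
        return max(model, key=log_posterior)

    return classify
```

With this shape, `learn_gaussian_nb(d_tr)(v)` plays the role of $A(D_{tr}, v)$: the outer call builds the classifier and the inner call predicts the label.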

Proposed Approach
Considering the background from the previous section, now we describe non-payload-based obfuscations that aim at modification of the behavioral characteristics of a remote attack connection, and thus can influence the outcome of the intrusion detection classification task.

Non-Payload-Based Obfuscations
Consider a connection $k_a$ representing a remote attack communication executed without any obfuscation. Then, $k_a$ can be represented by features $f(k_a) \rightarrow F^a = (F^a_1, F^a_2, \ldots, F^a_n)$, which are delivered to the previously trained classifier $C$. Assume that $C$ can correctly predict the target label as an intrusive one, because its knowledge base is derived from a training dataset $D_{tr}$ containing intrusive connections having similar (or the same) behavioral characteristics. Now, consider a connection $k'_a$ which represents the intrusive communication $k_a$ executed with the employment of non-payload-based obfuscations aimed at modifying its network behavioral properties. The obfuscations can modify the $P_c$ and $P_s$ packet sets of the original connection $k_a$ by insertion, removal, and transformation of packets. The modifications of $P_c$ and $P_s$ of the connection $k_a$ can alter the original feature values $F^a$ to new ones. Thus, features extracted over $k'_a$ are represented by $f(k'_a) \rightarrow F'^a = (F'^a_1, F'^a_2, \ldots, F'^a_n)$ and have different values than the features $F^a$ of the connection $k_a$. Therefore, we conjecture that the likelihood of a correct prediction of the features $F'^a$ of the connection $k'_a$ by the previously assumed classifier $C$ is lower than in the case of the connection $k_a$. Also, we conjecture that a classifier $C'$ trained by the learning algorithm $A$ on a training dataset $D'_{tr}$, containing some obfuscated intrusion instances, will be able to correctly predict a higher number of unknown obfuscated intrusions than the classifier $C$. These assumptions will be evaluated and analyzed later.
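The effect of insertion, removal, and transformation of packets on feature values can be illustrated with a toy Python sketch that treats one direction of a connection as a list of packet sizes. The operator names and the three-value feature vector are assumptions of this sketch; in particular, it ignores header overhead during fragmentation and TCP retransmission after loss, both of which matter in real traffic.

```python
import random
from statistics import mean, pstdev
from typing import List, Tuple


def fragment(sizes: List[int], mtu: int) -> List[int]:
    """Transformation: split every packet larger than mtu into mtu-sized pieces."""
    out = []
    for size in sizes:
        while size > mtu:
            out.append(mtu)
            size -= mtu
        out.append(size)
    return out


def duplicate(sizes: List[int], ratio: float, rng: random.Random) -> List[int]:
    """Insertion: send each packet a second time with probability ratio."""
    out = []
    for size in sizes:
        out.append(size)
        if rng.random() < ratio:
            out.append(size)
    return out


def drop(sizes: List[int], ratio: float, rng: random.Random) -> List[int]:
    """Removal: lose each packet with probability ratio."""
    return [size for size in sizes if rng.random() >= ratio]


def features(sizes: List[int]) -> Tuple[int, float, float]:
    """Toy feature vector: packet count, mean size, std. deviation of sizes."""
    return (len(sizes), mean(sizes), pstdev(sizes))
```

For example, `features([1500, 400])` and `features(fragment([1500, 400], 500))` differ in all three components, which is the kind of shift from $F^a$ to $F'^a$ described above.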

Obfuscation Tool
We designed a tool that morphs network characteristics of a TCP connection at network and transport layers of the TCP/IP stack by applying one or a combination of several non-payload-based obfuscation techniques. Execution of direct communications (non-obfuscated ones) is also supported by the tool as well as capturing network traffic related to a communication.
The tool is capable of automatic/semi-automatic run and restoring of all modified system settings and consequences of attacks/legitimate communications on a target machine. After the successful execution of each desired obfuscation on the selected service, the output contains several network packet traces associated with pertaining obfuscations. The behavioral state diagram of the obfuscation tool is depicted in Figure 1.

Description of Data Collection
We applied the obfuscation tool to a specific set of vulnerable network services and obtained samples of network packet traces related to malicious as well as legitimate communications executed with the employment of particular obfuscations in a virtual network environment. Also, we collected network traffic samples of direct attacks for each vulnerable service. These network packet traces were passed to a feature extraction process that first identified all TCP connections and then extracted features for each TCP connection. The collection of these TCP connection-based feature vectors is referred to as the dataset, which is analyzed in further machine learning experiments.

Description of Machine Learning Experiments
We performed several classification experiments in order to evaluate the effectiveness of the proposed obfuscation techniques as well as the feedback of a classifier having obfuscated data included in its training process. All of our experiments considered two-class prediction, discerning between legitimate and malicious TCP connections. Therefore, obfuscated and direct attacks were represented by the same class. We executed the following experiments:
• For the purpose of finding the best subset of network connection features, we ran the Forward Feature Selection (FFS) method. FFS started with an empty set of features and, in each iteration executing cross validation, it added the new feature contributing the best improvement of the average recall of all classes. In order to alleviate the possibility of the selection process becoming stuck in local extremes, we allowed acceptance of one iteration without improvement.
• Considering the selected subset of features, we evaluated the evasion resistance of a classifier trained on direct attacks and legitimate traffic only, while testing was performed on the whole dataset including obfuscated attacks.
• Next, we widened the knowledge of a classifier by adding some obfuscated attacks into the training set and compared its evasion resistance with the previous case.
• Another experiment tested the capability of the classifier to detect unknown obfuscated attacks by customized leave-one-out validation.
• Finally, we analyzed the success rate of evasion per particular vulnerable service.

Table 1. Experimental obfuscation techniques with parameters and IDs

Packets' order modifications
• reordering of 25% packets; reordered packets are sent with 10ms delay and 50% correlation (i)
• reordering of 50% packets; reordered packets are sent with 10ms delay and 50% correlation

Combinations
• normal distribution delay (µ = 10ms, σ = 20ms) and 25% correlation; loss: 23% of packets; corrupt: 23% of packets; reorder: 23% of packets (o)
• normal distribution delay (µ = 7750ms, σ = 150ms) and 25% correlation; loss: 0.1% of packets; corrupt: 0.1% of packets; duplication: 0.1% of packets; reorder: 0.1% of packets (p)
• normal distribution delay (µ = 6800ms, σ = 150ms) and 25% correlation; loss: 1% of packets; corrupt: 1% of packets; duplication: 1% of packets; reorder: 1% of packets (q)

Implementation
The proposed obfuscation techniques were instantiated as part of the obfuscation tool designed and implemented in the Unix environment. Parametrized instances of these techniques (introduced in our previous work [18]) are presented in Table 1. The selection of particular obfuscation techniques was primarily motivated by the need to achieve divergent behavior of obfuscated network attacks, as well as by the capabilities of the Unix OS. We experimented with various parameter values with the intention of covering a wide range of divergent behaviors and, in the case of attacks, keeping the exploitation successful. The methodology presented in this paper allows for a straightforward extension of the proposed obfuscation set.

Implementation Notes and Setup
The obfuscation tool is based on open source tools and is written in the Python and Ruby programming languages. For the purpose of automatic attack execution, a utility from the Metasploit framework was used. The tcpdump tool was chosen to capture network traffic between the attacker's machine and the vulnerable one. Most obfuscations were carried out by the Linux tc utility and its extension NetEm [15]. NetEm enabled us to add latency, loss, duplication, and reordering of packets, and to modify other characteristics of the outgoing traffic of the selected network interface. Virtual machines (VMs) were configured with private static IP addresses in order to enable easy automation of the whole exploitation process. Our testing network infrastructure consisted of the attacker's machine equipped with Kali Linux and vulnerable machines that were running Metasploitable 1, 2 and Windows XP with SP 3.
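The exact command lines used by the tool are not given here, so the following Python sketch only illustrates how the parameters of Table 1 could be mapped onto `tc` NetEm and `ifconfig` invocations. The helper names, the interface name `eth0`, and the mapping of σ onto NetEm's jitter argument are assumptions of this sketch. It only builds argument lists and does not execute anything; actually running them (for example via `subprocess.run`) would require root privileges.

```python
from typing import List, Optional


def netem_command(dev: str, *,
                  delay_ms: Optional[float] = None,
                  jitter_ms: Optional[float] = None,
                  delay_corr: Optional[float] = None,
                  normal: bool = False,
                  loss: Optional[float] = None,
                  corrupt: Optional[float] = None,
                  duplicate: Optional[float] = None,
                  reorder: Optional[float] = None,
                  reorder_corr: Optional[float] = None) -> List[str]:
    """Build a 'tc qdisc add ... netem' argument list (percentages in %)."""
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms:g}ms"]
        if jitter_ms is not None:
            cmd.append(f"{jitter_ms:g}ms")
            if delay_corr is not None:
                cmd.append(f"{delay_corr:g}%")
            if normal:  # a distribution only makes sense together with jitter
                cmd += ["distribution", "normal"]
    for name, value in (("loss", loss), ("corrupt", corrupt),
                        ("duplicate", duplicate)):
        if value is not None:
            cmd += [name, f"{value:g}%"]
    if reorder is not None:  # NetEm reordering requires a delay to be set
        cmd += ["reorder", f"{reorder:g}%"]
        if reorder_corr is not None:
            cmd.append(f"{reorder_corr:g}%")
    return cmd


def mtu_command(dev: str, mtu: int) -> List[str]:
    """Build an ifconfig argument list that changes the interface MTU."""
    return ["ifconfig", dev, "mtu", str(mtu)]
```

Under these assumptions, obfuscation (i) would correspond to `netem_command("eth0", delay_ms=10, reorder=25, reorder_corr=50)`, and the fragmentation-based obfuscations to `mtu_command` with the chosen MTU.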

Vulnerable Services
For proof-of-concept purposes, we aimed to select vulnerable services whose successful exploitation is of high severity and leads to remote shell code execution through an established backdoor communication. Although there exists a plethora of publicly available exploit codes for contemporary vulnerabilities, the corresponding vulnerable software is much less available, for understandable prevention reasons. Therefore, we selected older available high-severity vulnerable services that are outdated but may serve as a demonstration of our approach. The following listing enumerates the vulnerable services involved in our experiments, complemented by a brief description of their exploitation:
a) Apache Tomcat: first, a dictionary attack was executed in order to obtain access credentials for the application manager instance. Then, the server's application manager was exploited for the transmission and execution of malicious code.
b) Microsoft SQL Server: a dictionary attack was employed to obtain access credentials of an MSSQL user, and then the procedure xp_cmdshell, which enables the execution of arbitrary code, was exploited.
c) Samba service: a vulnerability in the Samba service enabled the attacker to execute arbitrary commands by exploiting MS-RPC functionality when the configuration option username map script was allowed. No authentication was needed in this attack.
d) Server service of Windows: the service enabled the attacker to execute arbitrary code through a crafted RPC request resulting in a stack overflow during path canonicalization.
e) PostgreSQL database: a dictionary attack was executed in order to obtain access credentials for the PostgreSQL instance. A standard PostgreSQL Linux installation had write access to the /tmp directory and could call user-defined functions (UDF) that utilized shared libraries located on an arbitrary path (e.g., /tmp). An attacker exploited this fact, copied their own UDF code to the /tmp directory, and then executed it.
f) DistCC service: a vulnerability enabled the attacker to remotely execute an arbitrary command through compilation jobs that were executed on the server without any permission check.
Metasploitable images are available from https://information.rapid7.com/metasploitable-download.html.
An example of a TCP sequence diagram comparing direct and obfuscated attacks on Samba service is depicted in Figure 2, where each arrow contains the

Collected Network Traffic Dataset
We applied our obfuscation tool for automatic exploitation of the enumerated vulnerable services using the proposed obfuscations. We captured the related malicious network traffic, which we further passed to a TCP connection-level feature extractor [16]. When an exploitation leading to a remote shell was successful, simulated attackers performed simple activities involving various shell commands (such as listing directories, opening and reading files). The average number of issued commands was around 10, and text files of up to 50kB were opened/read. Note that we labeled each TCP connection representing dictionary attacks as legitimate for two reasons: 1) from the behavioral point of view, they independently appeared just as unsuccessful authentication attempts, which may occur in legitimate traffic as well; 2) more importantly, we employed ASNM features, a subset of which involves the context of an analyzed TCP connection in their computation, i.e., ASNM features capture relations to other TCP connections initiated from/to a corresponding service. Legitimate network traffic, on the other hand, was collected from two sources:
a) Common usage of all previously mentioned services was obtained in an anonymized form, excluding the payload, from a real campus network in accordance with the policies in force. Analyzing packet headers, we observed that a lot of the expected legitimate traffic contained malicious activity, as many students did not care about up-to-date software. Therefore, we filtered out network connections yielding high and medium severity alerts by signature-based Network Intrusion Detection Systems (NIDS), namely Suricata and Snort, through the Virus Total API [19].
b) The second source was a simulation of legitimate traffic in our virtual network architecture, which also employed all of our non-payload-based obfuscations for the purpose of partially addressing overstimulation in adversarial attacks against IDS [4], and thus making the classification task more challenging. However, only 109 TCP connections were obtained from this stage, which was also caused by the fact that services such as Server and DistCC were hard to emulate. The simulation of legitimate traffic was aimed at various SELECT and INSERT statements when interacting with the database services (i.e., PostgreSQL, MSSQL); several GET and POST queries to our custom pages as well as downloading of high-volume data when interacting with our HTTP server (i.e., Apache Tomcat); and several queries for downloading and uploading small files to a Samba share.
The final dataset is summarized in Table 2 and is also available from http://www.fit.vutbr.cz/~ihomoliak/asnm/ASNM-NPBO.html.

Evaluation
All machine learning experiments were performed in Rapid Miner Studio [20] using five different classifiers: two with parametric models, Gaussian Naïve Bayes and Logistic Regression; and three with non-parametric models, Gaussian Naïve Bayes with kernel density estimation, SVM with radial kernel function, and Decision Tree with a maximal depth of 10 levels. Note that parametric models make assumptions about the data, which means that they use a finite set of parameters for modeling the data. This makes them simple and fast, but they are not flexible in modeling data that do not follow their assumed distribution. In contrast, non-parametric models make no assumptions about the data, and thus they may use an unlimited number of parameters. The advantage of these models is their flexibility, but they may overfit the training data. Across all of our experiments, network connection features were instantiated by an FFS-selected subset of ASNM features (e.g., Table A.1), whose full list is available in Appendix D of [14]. Note that some ASNM features may lead to overfitting of the training data due to the laboratory conditions of the VM setup in which the attacks were executed. Therefore, such features were removed from the dataset in the preprocessing phase of our experiments; they consist of TTL-based features, IP addresses, ports, MAC addresses, and the occurrence of the source/destination host in the monitored network. Considering the class distribution of our current dataset, we decided to use 5-fold cross validation, which creates folds that are big enough for binary classification. All cross validation experiments employed stratified sampling when assembling the folds, which ensured an equally balanced class distribution in each fold.
Table 3. Direct attacks and legitimate traffic cross validation
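The stratified assembling of folds can be sketched in a few lines of pure Python. This is an illustrative stand-in for the stratified sampling of the cross validation operator used in the experiments, not its actual implementation; the function name and the fixed seed are assumptions of this sketch.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[str], k: int = 5,
                     seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with (nearly) equal class proportions."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin over folds
    return folds
```

In each of the k rounds, one fold serves as the validation set and the remaining folds form the training set, so every sample is validated exactly once.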

Forward Feature Selection
The experiment consisted of two executions of the FFS for each classifier. In each execution, we optimized a few important parameters of the classifiers using a grid approach. In the cases of both Naïve Bayes classifiers, we enabled Laplace correction in order to prevent the models from being highly influenced by zero probabilities of some values; moreover, in the kernel-based version we optimized the bandwidth of the kernels. In SVM, we optimized: 1) the parameter C, which represents the trade-off between a soft and a hard boundary of the hyperplane, and 2) the parameter gamma of the Gaussian radial kernel, which influences the variance of the Gaussian kernel. In the case of Logistic Regression, the regularization parameter lambda was optimized; this parameter controls overfitting of the model at the expense of incorporating bias. Finally, in the case of the decision tree, we used gain ratio as the criterion for selecting attributes for splitting, while we optimized the minimal gain required for splitting, which controls the number of splits. The first execution set of FFS took as input just legitimate traffic and direct attack entries, and represented the case where intrusion detection classifiers were trained without knowledge about obfuscated attacks. We denote the selected features as FFS DL (Direct + Legitimate). The second execution set took as input the whole dataset of network traffic, consisting of legitimate traffic and direct attacks as well as obfuscated ones, and thus represented the case where classifiers were aware of obfuscated attacks. Here, we denote the selected features as FFS DOL (Direct + Obfuscated + Legitimate).
We consider the FFS DL feature set to be less informed (and thus less tuned) than the FFS DOL feature set. Therefore, when FFS DL features are used, we assume that classifiers do not have knowledge about obfuscated attacks, while FFS DOL features are used when we assume the opposite case. Both FFS-selected feature sets are presented in Table A.1 of the Appendix (columns FFS DOL and FFS DL).
Table 4. Prediction of obfuscated/all attacks by classifiers trained without knowledge about obfuscated attacks
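The forward feature selection procedure, including the tolerance of one iteration without improvement, can be sketched as follows. The `score` callable stands for the cross-validated average recall of all classes and must be supplied by the caller; the function name, the `patience` parameter, and the choice to return the best-scoring subset seen are assumptions of this sketch rather than details confirmed by the experimental setup.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def forward_feature_selection(
    candidates: Sequence[Hashable],
    score: Callable[[List[Hashable]], float],
    patience: int = 1,
) -> Tuple[List[Hashable], float]:
    """Greedy FFS that tolerates `patience` consecutive non-improving steps."""
    selected: List[Hashable] = []   # current subset, may include tolerated steps
    best_set: List[Hashable] = []   # best-scoring subset seen so far
    best_score = float("-inf")
    stalls = 0
    remaining = list(candidates)

    while remaining:
        # Evaluate every single-feature extension and take the best one.
        step_feature = max(remaining, key=lambda f: score(selected + [f]))
        step_score = score(selected + [step_feature])
        selected.append(step_feature)
        remaining.remove(step_feature)

        if step_score > best_score:
            best_set, best_score, stalls = list(selected), step_score, 0
        else:
            stalls += 1
            if stalls > patience:
                break  # too many steps without improvement
    return best_set, best_score
```

With `patience=1`, a feature that does not help on its own can still be accepted once, which lets the search reach subsets whose benefit only appears when two features are combined.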

Evasion of Intrusion Detection Classifiers
A 5-fold cross validation was performed on direct attacks and legitimate traffic using FFS DL features. The performance measures of the classifiers validated by cross validation are shown in Table 3. Then, the classifiers trained on all direct attacks and legitimate traffic instances were applied to the prediction of the obfuscated attacks and of all attacks, respectively (Table 4). Here, the TPRs deteriorated for all classifiers, which means that some obfuscated attacks were successful: they were predicted as legitimate traffic, and thus evaded the classifiers. Note that in the case of cross validation on direct attacks and legitimate traffic, non-parametric classifiers achieved better performance than parametric classifiers, while in the case of obfuscated attacks the TPR of non-parametric classifiers deteriorated more significantly, which was caused by their tendency to overfit known data.
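For reference, the two performance measures used throughout the evaluation can be computed from true and predicted labels as in the following minimal sketch, which assumes that the intrusion class is the positive class.

```python
from typing import Sequence, Tuple


def tpr_fpr(y_true: Sequence[str], y_pred: Sequence[str],
            positive: str = "Intrusion") -> Tuple[float, float]:
    """Return (TPR, FPR) with `positive` treated as the attack class."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive:
            if predicted == positive:
                tp += 1   # attack detected
            else:
                fn += 1   # attack predicted as legitimate, i.e., an evasion
        else:
            if predicted == positive:
                fp += 1   # legitimate connection raised a false alarm
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

A successful evasion increases the false negative count, which is why the obfuscations show up as a drop in TPR while the FPR, computed over legitimate connections only, is unaffected by them.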

Widening the Knowledge of the Classifiers
In order to improve the resistance of the classifiers against evasions, we widened their knowledge about different mixtures of obfuscated attack instances, which was accomplished by random 5-fold cross validation of the whole dataset. In this experiment, it is justified to use FFS DOL features, which consider knowledge about obfuscated attacks for updating not only the model of a classifier but also the underlying feature set. Additionally, we show the results with FFS DL features, which correspond to updating the model only. The results of this experiment are shown in Table 5. Compared with the results of the previous experiment (see FPRs in Table 3 and TPRs in Table 4b), most of the classifiers improved significantly in TPR, while FPR deteriorated only slightly. This confirms the assumption that classifiers trained with knowledge about some obfuscated attacks are able to detect the same and similar obfuscated attacks. The only exception is the Gaussian Naïve Bayes classifier when only the model is updated, not the underlying feature set (Table 5a). It is important to note that this classifier makes strong assumptions about the modeled data, and when we searched for the optimal feature set with direct and legitimate traffic (FFS DL), it was unable to further optimize FPR, which remained high in contrast to the other classifiers. Therefore, when obfuscated attacks were added into the cross validation, the classifier was unable to use the same features and the same strong assumptions about the original data for fitting the different data. However, when the feature set was updated as well (Table 5b), both the TPR and the FPR of the classifier improved.

Detection of Unknown Obfuscated Attacks
For the purpose of explicitly testing the classifiers' capability to detect new kinds of obfuscated attacks, we performed customized leave-one-out validation using FFS DOL features, where the classifier was step-by-step trained on all permutations of the whole dataset excluding only the obfuscated attack samples created by a single obfuscation technique, or its instance, respectively, while it was validated on the excluded part of the dataset. Table 6 presents ordered ratios of correctly detected unknown obfuscated attacks per obfuscation technique as well as per its instance. Comparing the detection performance on unknown obfuscated attacks, either per instance or per obfuscation technique, we concluded that high detection rates were achieved for most of the obfuscations, which indicates an acceptable resistance of the obfuscations-aware classifiers against unknown obfuscated attacks. The only exceptions are obfuscation techniques that modify the MTU. This can be explained by the fact that the majority of the features employed in our experiments are sensitive to packet lengths, which are influenced by fragmentation-based obfuscations. This phenomenon is more significant in the cases of non-parametric classifiers due to their tendency to overfit the training data. In general, parametric classifiers were more successful in detecting unknown obfuscated attacks, and their correct predictions were more stable than those of non-parametric classifiers. However, in the case of one parametric classifier, Gaussian Naïve Bayes, we also have to take into account its worse FPR in comparison to Logistic Regression (shown in Table 3).
Table 7. Successfully obfuscated attacks (evasions) per service
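The customized leave-one-out validation can be sketched as a generator of train/validation splits. The `Sample` record and its `obfuscation` field (an obfuscation ID, or `None` for non-obfuscated connections) are hypothetical names introduced for this sketch; the grouping key could equally be an obfuscation technique instead of a single instance.

```python
from typing import Iterator, List, NamedTuple, Optional, Sequence, Tuple


class Sample(NamedTuple):
    label: str                  # "Intrusion" or "Legitimate"
    obfuscation: Optional[str]  # obfuscation ID, None for direct connections


def leave_one_obfuscation_out(
    samples: Sequence[Sample],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_id, train_indices, validation_indices) per obfuscation.

    Only obfuscated attack samples of the held-out obfuscation are excluded
    from training; they alone form the validation part.
    """
    ids = sorted({s.obfuscation for s in samples
                  if s.label == "Intrusion" and s.obfuscation is not None})
    for held_out in ids:
        validation = [i for i, s in enumerate(samples)
                      if s.label == "Intrusion" and s.obfuscation == held_out]
        excluded = set(validation)
        train = [i for i in range(len(samples)) if i not in excluded]
        yield held_out, train, validation
```

Because the validation part contains attacks only, the natural measure for each split is the ratio of correctly detected attacks, as reported in Table 6.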

Successful Evasions per Service
This experiment compares and analyzes the success rate of evasion per vulnerable service. The results presented here originate from a binary classification experiment in which the classifier was trained without obfuscated attacks and validated on the whole dataset (Table 4) using FFS DL features. An obfuscated attack is considered successful if it is predicted as legitimate traffic; this situation represents the evasion case. Ordered ratios of successfully obfuscated attacks per service are presented in Table 7. The minimum achieved ratios of attacks that evaded detection are shown in bold and belong to parametric classifiers, as already mentioned above. The most successful obfuscated attacks are those exploiting the Apache service. From a detailed analysis of this service, we found that instances of direct attacks had a very flat value distribution of many features in comparison to other direct attacks. Examples of such features are the standard deviation of inbound and outbound packet sizes of the connection, and other features dependent on the variability of packet lengths. Therefore, obfuscated attacks caused more variability in these features, whose values were in many cases similar to those of legitimate traffic. On the other hand, in the case of the Server service, many features of the direct attacks were more divergent across their instances, and thus obfuscations contributed to the divergence only on a small scale. Therefore, most of the obfuscated attacks had characteristics similar to direct ones, which enabled their detection.
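The per-service evasion ratio described above can be computed as in the following sketch, which assumes that each obfuscated attack is available as a hypothetical `(service, predicted_label)` pair.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def evasion_ratio_per_service(
    records: Iterable[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Ratio of obfuscated attacks predicted as legitimate, per service.

    `records` holds (service, predicted_label) pairs of obfuscated attacks;
    the result is ordered from the most to the least evaded service.
    """
    total: Counter = Counter()
    evaded: Counter = Counter()
    for service, predicted in records:
        total[service] += 1
        if predicted == "Legitimate":  # the attack evaded detection
            evaded[service] += 1
    ratios = {service: evaded[service] / total[service] for service in total}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)
```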

Discussion
Impact of Obfuscations on Feature Divergence. To assess the impact of the proposed obfuscations on the divergence of ASNM features, we compared the values of each FFS-selected feature of obfuscated attacks with the feature values obtained from direct attacks executed with the same exploit. In other words, we quantified the change that obfuscations bring by computing the ratio of divergent single-feature obfuscated attacks against the closest single-feature direct attack using the same exploit on the same service. We found that this ratio (averaged over all FFS DL features) is higher than 55% for all classifiers, which indicates that the proposed obfuscations were able to influence the majority of features in obfuscated attacks. The example of this ratio computed per each input feature of the Gaussian Naïve Bayes classifier is presented in
Retraining. Although we have demonstrated that our adversarial classification approach to network intrusion detection can detect unknown obfuscated attacks with high performance, it is still possible to design and apply unknown network connection morphing techniques to bypass the detection. In order to keep this performance as high as possible, the classifier should be retrained each time a new form of obfuscation is known to occur. However, such retraining of generative classifiers (both Naïve Bayes classifiers) concerns the sub-model of the malicious class only, and is therefore faster than retraining the monolithic models of discriminative classifiers such as SVM, decision tree, and logistic regression, where the whole model incorporating both classes has to be retrained. This favors generative classifiers over discriminative ones when a frequent update of the model is required. On the other hand, the legitimate sub-model of generative classifiers should also be retrained once in a while in order to ensure that all new manners of using particular services are captured.
Another aspect related to fast retraining of classifiers is whether they can be retrained while preserving the feature set and still provide high performance. From this point of view, we found Naïve Bayes with Gaussian kernels to be the most convenient classifier (see results in Table 5a).
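The retraining argument for generative classifiers can be made concrete with a small sketch. It assumes a plain Gaussian Naïve Bayes model with one independent Gaussian per feature and class (not the kernel density variant), and all class and method names are hypothetical. The point it illustrates is that refitting the malicious sub-model leaves the legitimate sub-model untouched; only the class priors, which are cheap to recompute, change.

```python
import math


class GaussianClassModel:
    """Sub-model of one class: an independent Gaussian per feature."""

    def __init__(self, rows):
        n, d = len(rows), len(rows[0])
        self.n = n
        self.mean = [sum(r[j] for r in rows) / n for j in range(d)]
        # A small floor keeps the variance strictly positive.
        self.var = [
            max(sum((r[j] - self.mean[j]) ** 2 for r in rows) / n, 1e-9)
            for j in range(d)
        ]

    def log_likelihood(self, x):
        return sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, self.mean, self.var)
        )


class TwoClassNaiveBayes:
    """Generative classifier composed of two independent class sub-models."""

    def __init__(self, legit_rows, attack_rows):
        self.legit = GaussianClassModel(legit_rows)
        self.attack = GaussianClassModel(attack_rows)

    def retrain_attack(self, attack_rows):
        # Only the malicious sub-model is refitted; self.legit is not touched.
        self.attack = GaussianClassModel(attack_rows)

    def predict(self, x):
        total = self.legit.n + self.attack.n
        score_legit = math.log(self.legit.n / total) + self.legit.log_likelihood(x)
        score_attack = math.log(self.attack.n / total) + self.attack.log_likelihood(x)
        return 1 if score_attack > score_legit else 0


# Illustrative one-feature data (not taken from the paper's dataset).
LEGIT = [[1.0], [1.2], [0.8], [1.1]]
DIRECT = [[5.0], [5.2], [4.8]]
OBFUSCATED = [[2.5], [2.0], [3.0]]

if __name__ == "__main__":
    model = TwoClassNaiveBayes(LEGIT, DIRECT)
    print("before retraining:", model.predict([2.2]))
    model.retrain_attack(DIRECT + OBFUSCATED)
    print("after retraining:", model.predict([2.2]))
```

A discriminative model such as logistic regression has no comparable per-class component, so adding the obfuscated samples would require fitting the whole decision boundary again.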
Extension of Obfuscations. Our obfuscations are not exhaustive but cover a wide range of network connection morphing possibilities that can influence the detection performance of a non-payload-based intrusion detection classifier. On the other hand, the methodology presented in this paper allows for a straightforward extension of the obfuscations.
High Rate of Attacks. In our dataset, the ratio of malicious to legitimate connections equals 5.9%, while in practice this ratio is usually several orders of magnitude lower. An arbitrary value of this ratio does not distort the measured performance of the classifier when a correct performance measure is chosen (e.g., F1-measure, average recall of classes). However, it might impact the accuracy of modeling the legitimate class: the high volume of legitimate traffic that occurs in practice can result in a high divergence of data, which might not be sufficiently captured by models built from our dataset. Therefore, in practice, classifiers would require much more legitimate data than our dataset contains.
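A small worked example may help to show why the average recall of classes is insensitive to the ratio of malicious to legitimate connections. The sketch below assumes that the per-class error rates of the classifier stay fixed while the legitimate class grows; the confusion-matrix counts are illustrative only and the function names are hypothetical. It does not capture the second effect discussed above, namely that a larger legitimate class may be harder to model.

```python
def rates(tp, fn, fp, tn):
    """Compute TPR, FPR, average recall of classes, and accuracy."""
    tpr = tp / (tp + fn)          # recall of the malicious class
    tnr = tn / (tn + fp)          # recall of the legitimate class
    return {
        "tpr": tpr,
        "fpr": 1 - tnr,
        "avg_recall": (tpr + tnr) / 2,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


def scale_legitimate(tp, fn, fp, tn, factor):
    """Multiply the legitimate class by `factor`, keeping its error rate fixed."""
    return tp, fn, fp * factor, tn * factor


if __name__ == "__main__":
    base = (90, 10, 20, 980)                      # illustrative counts
    print(rates(*base))
    print(rates(*scale_legitimate(*base, 100)))   # 100 times more legitimate traffic
```

Under this assumption the average recall is identical in both cases, while plain accuracy shifts toward the recall of the legitimate class as that class starts to dominate.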
Normalizers. If we assumed the existence of an optimal network normalizer able to completely eliminate the impact of the proposed non-payload-based obfuscation techniques, then these obfuscation techniques would be useless. Nevertheless, even if such an optimal network normalizer existed, it would still be prone to state holding and CPU overload attacks [21], [22], [23]. Conversely, if we do not assume a network normalizer as part of our system, then non-payload-based obfuscation techniques might be employed as a training-data-driven approximation of a network normalizer that would not be prone to the previously mentioned attacks. The situation can be demonstrated by our binary classification experiments (Section 5.2). Consider an intrusion detection classifier validated on direct attacks and legitimate traffic, whose average recall is higher than 90% for each classifier (Table 3). Here, the training and testing data of the classifier were built upon normalized malicious network traffic represented by direct attacks. Then, the model trained on the direct attacks and legitimate traffic was applied to the prediction of the obfuscated attacks. In this case, obfuscated attacks may represent un-normalized malicious network traffic, and thus the classifier achieved worse performance than in the previous case: the TPR decreased significantly, while the FPR was preserved from the previous step (Table 4a). In order to alleviate the negative performance impact of un-normalized malicious network traffic (represented by obfuscated attacks) on our system, we can include obfuscated attacks in the training process of the classifier. This case is represented by the performance measures contained in Table 5b. An average recall of over 97% was achieved for each classifier, primarily thanks to a significant improvement of the TPR in most of the classifiers.
Thus, as an alternative outcome of our work, a network normalizer element may be omitted from classification-based intrusion detection infrastructure and can be approximated by appropriate training data.

Related Work
Using the taxonomy of attacks against learning systems [5], we categorize our approach into the class of exploratory indiscriminate attacks violating integrity via false negatives. The same class of attacks has been addressed, for example, in the fields of spam filtering [24], [25], malware detection [24], payload-based anomaly intrusion detection [26], [27], and automatic speech recognition [28]. However, to the best of our knowledge, this class of attacks has not yet been studied in non-payload-based intrusion detection, including anomaly-based and classification-based approaches. Further, we focus on related work on evasive adversarial attacks against IDS, which we divide into payload-based and non-payload-based approaches, plus their combination. Additionally, papers dealing with network traffic normalization are also described.
Payload-based Evasions. The first work dealing with payload-based evasions is described in [8] and presents a tool called Whisker. The author aims at anti-intrusion detection tactics by mutating the HTTP request in such a way that a web server is able to understand the request, while intrusion detection systems can be confused. Vigna et al. [29] proposed a framework that generates exploit mutations in order to change the appearance of a malicious payload and thus bypass detection by a NIDS. The proposed framework was evaluated on two well-known signature-based NIDSs, Snort and RealSecure. A similar approach was proposed by Fogla et al. [26] in their polymorphic blending attacks, which change the payload of a network worm in order to make it look normal, and thus effectively evade a byte frequency-based anomaly NIDS. Other approaches use many different techniques for evading detection by changing the payload, e.g., obfuscation techniques such as malware morphism [30] and other attack tactics against IDSs [4]. All of these adversarial approaches are similar to our approach, but in contrast they deal only with evasions of payload-based NIDSs.
Non-payload-based Evasions. The previous methods can evade payload-based NIDSs primarily by morphing the payload, but they need not be effective against non-payload-based network intrusion detectors, which are most sensitive to attack morphing at the network and transport layers of the TCP/IP stack. Fragroute [7] is a tool that was written to test intrusion detection systems by using a simple ruleset language enabling interception and modification of egress traffic, with minimal support for randomized or probabilistic behavior. Fragroute implements three classes of attacks: insertion, evasion, and denial of service. AGENT [10] uses several methods of altering network traffic, such as packet splitting and duplicate insertions. Watson et al. [11] proposed a method called Protocol Scrubbing, which represents active mechanisms for transparent removal of network attacks from protocol layers in order to allow passive IDSs to operate correctly against evasion techniques. Wright et al. [31] proposed thwarting network traffic classifiers by optimally morphing one class of traffic to look like another class with respect to a given set of features, employing padding or splitting of the data into smaller parts. This is similar to our approach, but in contrast the authors aim at network traffic classification in general, rather than intrusion detection. An example of evasion that deals with tunneling of malicious network traffic in the payload of the HTTP(S) protocol is presented in [12] and [13]. Although the authors evade payload-based NIDSs as well, the main focus of the work is the evasion of non-payload-based NIDSs.

Conclusion
The motivation behind our work is to strengthen non-payload-based intrusion detection classifiers in an attack-independent way, assuming an adversary who can massively mutate known exploits to attack a huge number of targets. With this in mind, we executed remote attacks and legitimate communications on selected vulnerable network services while utilizing various non-payload-based obfuscation techniques based on NetEm and MTU modifications, with the intention of making the behavioral characteristics of the attacks similar to those of legitimate traffic, and thus causing evasion of our experimental non-payload-based intrusion detection classifiers. The summary of the presented results revealed non-payload-based obfuscation techniques to be partially successful in evading detection by five classifiers (two parametric and three non-parametric) that were trained without prior knowledge about them. On the other hand, if some of the obfuscated attacks were included in the training process of the classifiers, then the classifiers were able to detect other unknown obfuscated attacks with high performance. From the practical point of view, we discussed the requirements on fast retraining of classifiers, where we identified the Naïve Bayes classifier with Gaussian kernels as the most convenient one, because it can update the model of a single class independently of the other class and because it can provide high performance without replacing the feature set. Note that we do not envision using our obfuscation-aware non-payload-based classifiers as an independent security solution, but as a complementary part of existing solutions, such as misuse-based and anomaly-based intrusion detection systems that perform deep packet inspection. In our future work, we plan to perform experiments with existing implementations of network normalizers as well as to verify the effect of non-payload-based obfuscation techniques on a wider spectrum of vulnerabilities.
Another option is to explore the impact of the proposed obfuscations on the communication between bots and a C&C server.