An Investigation of Performance Analysis of Anomaly Detection Techniques for Big Data in SCADA Systems

Anomaly detection is an important aspect of data mining, where the main objective is to identify anomalous or unusual data in a given dataset. However, there is no formal categorization of application-specific anomaly detection techniques for big data, which creates confusion for data miners. In this paper, we categorise anomaly detection techniques into nearest neighbour, clustering and statistical approaches and analyse the performance of these techniques in critical infrastructure applications such as SCADA systems. Extensive experiments are conducted to compare representative algorithms from each category using seven benchmark SCADA datasets (both real and simulated). The effectiveness of the representative algorithms is measured through a number of metrics. We highlight the set of algorithms that perform best for SCADA systems.


Big Data Analysis in SCADA Systems
Supervisory Control and Data Acquisition (SCADA) systems are widely used for monitoring and controlling the Industrial Control Systems (ICS) of national critical infrastructures, including the emerging energy system, transportation system, gas and water systems, and so on. Generally, an ICS comprises Programmable Logic Controllers (PLCs), Remote Terminal Units (RTUs) with Intelligent Electronic Devices (IEDs), a telemetry system, a Human Machine Interface (HMI) and a supervisory (computer) system. In a SCADA based ICS, communication infrastructures connect the supervisory (computer) systems and the RTUs. The operational process and requirements of SCADA systems, which are used for industrial networks, have characteristics distinct from those of enterprise networks. The primary objective of a SCADA system is to control real-life physical equipment and devices, e.g., an energy system SCADA may be used for monitoring and control of the generation plants. On the other hand, a conventional information traffic network is used for data processing and transfer [23]. As the primary objective of SCADA is different from that of the conventional information network, the operational process and its requirements vary significantly. Since SCADA is used to control critical infrastructures, the severity of a failure is very high, which demands a high level of reliability. Moreover, data acquisition, processing, and transmission require real-time or at least near real-time operation. Besides, the data transferred through the SCADA devices are both periodic and aperiodic [23]. For example, in a SCADA based energy transmission system, an RTU continually sends the voltages and currents of a node every few seconds (which is periodic), and it also sends a warning when the current exceeds the maximum rating (which is aperiodic). It is also important to ensure that the transmitted data is received within a specific time-frame without losing any information. A conventional information traffic network can withstand even a high data loss, but this is not the case for SCADA devices, as the real-time physical process is highly dependent on the data they receive. Figure 1 gives a brief overview of the SCADA architecture. Next, we briefly discuss the importance and significance of big data analysis in a SCADA based ICS.
Typically, big data has three dimensional properties (3V): volume, velocity and variety [28]. The term 'volume' relates to the amount of data and its dimensionality. 'Velocity' is the processing speed of the data. The last property of big data, 'variety', refers to the mix of different types of data. We now discuss the essence of big data analysis in a SCADA network in terms of the 3V properties of big data.

[Figure 1: Overview of the SCADA architecture, showing the communication network.]
Generally, a SCADA system is dispersed across a large geographic area and is composed of multiple independent systems [23]. Hence, a large number of sensor devices and actuators are used to monitor and control these widespread, large networks. Therefore, the amount of data received in a SCADA system is also huge, which makes data analysis a challenging issue. Moreover, the recent trend of using Ethernet and web standards combined with traditional SCADA standards has shifted the SCADA paradigm from event-driven to process-driven, enabling the control of SCADA devices under streaming information exchange. Besides, a significant number of monitoring devices are used to ensure the observability of the processes. All of these technological advancements have improved the control performance of the SCADA system; however, a big data issue has emerged with the increased volume of information used in a SCADA network [28].
The second property of big data is the 'velocity' at which the data is processed. In a SCADA system, this property is crucial because SCADA data exchange must happen in real time or near real time. Therefore, for applications that need fast processing, big data is a critical factor that needs significant attention. Even applications based on post-event analyses face a noticeable challenge in handling the huge amount of data from a SCADA network. Therefore, improved and robust techniques that are capable of handling big data within a sufficient time frame will help to manage the SCADA network more efficiently and reliably. In a SCADA system, field devices are responsible for collecting different types of data for monitoring a physical system. Therefore, data received from a 'variety' of sources also makes the processing very challenging. As a result, the big data issues need to be addressed, as all 3V properties of big data are observed in the data received from the SCADA system.
Based on this scenario, performance analysis of anomaly detection techniques is a research requirement. Recently, a number of approaches have been proposed for big data analysis [4][5][6][7][8][9][10]. However, for SCADA systems, we are the first to investigate anomaly detection techniques from a big data perspective. Our contributions in this paper are the following:
• We categorize the anomaly detection techniques based on nearest neighbour, clustering and statistics.
• Representative algorithms in each category are applied to benchmark SCADA systems datasets.
• We evaluate the performance of the algorithms using a number of metrics such as accuracy, false positive rate, hit rate, F-measure and MCC.
• Finally, we highlight the set of techniques that are efficient for big data analysis.
The rest of the paper is organized as follows. Section 2 provides fundamental aspects of anomaly detection and a taxonomy. Section 3 discusses the different categories of anomaly detection algorithms. Section 4 discusses the proposed criteria to benchmark anomaly detection algorithms and their merits and demerits. Section 5 provides the experimental results and a detailed discussion of the performance comparison. We conclude the paper in Section 6.

Anomaly Detection Fundamentals
Anomaly detection is an important data analysis task. The main objective of anomaly detection is to detect anomalous or abnormal data in a given dataset. This is an interesting area of data mining research as it involves discovering new and rare patterns in a dataset. Anomaly detection has been widely studied in statistics and machine learning. It is also known as outlier detection, novelty detection, deviation detection and exception mining [1]. Based on the characteristics of data instances, anomalies are grouped into three categories (Figure 2), discussed below:
• Point Anomaly: When a particular data instance deviates from the normal pattern of the dataset, it can be considered a point anomaly. As a familiar example, consider expenditure on electricity bills. If the usual bill per month is about 100 dollars and for one month it is 500 dollars, then it is clearly a point anomaly [3].
• Contextual Anomaly: When a data instance is anomalous in a particular context, but not otherwise, it is termed a contextual anomaly, or conditional anomaly. For example, the expenditure on a credit card during a festive period, e.g., Christmas or New Year, is usually higher than during the rest of the year. Although the expenditure during a festive month can be high, it may not be anomalous because the expenses are contextually normal. On the other hand, an equally high expense during a non-festive month could be considered a contextual anomaly.
• Collective Anomaly: A collective anomaly is a pattern in the data in which a group of similar data instances behaves anomalously with respect to the entire dataset. An individual data instance might not be an anomaly by itself, but due to its presence in a collection it is identified as an anomaly. For example, in a denial of service attack, a group of network traffic instances collectively affects the network and can be considered a collective anomaly [2,24].
One important issue in anomaly detection is how the anomalies are represented as output. Generally there are two categories:
• Scores: Scoring based anomaly detection techniques assign a score to each of the data instances. The scores are then ranked, and the analyst either chooses the anomalies from the ranking or uses a threshold to select them.
• Binary: These techniques produce a binary output, i.e., each instance is either an anomaly or not. Techniques which provide binary labels are computationally efficient since a score does not have to be computed for each data instance.
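The two output styles can be related with a short sketch. The following Python snippet is only an illustration under stated assumptions: the score values, the function names `top_n_anomalies` and `threshold_labels`, and the threshold of 1.0 are hypothetical and do not come from any technique evaluated in this paper.

```python
def top_n_anomalies(scores, n):
    """Rank instances by score (highest first) and return the indices of the top n."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def threshold_labels(scores, threshold):
    """Turn scores into binary labels: True marks an instance whose score exceeds the threshold."""
    return [s > threshold for s in scores]


# Hypothetical anomaly scores for five data instances.
scores = [0.1, 0.3, 2.7, 0.2, 1.9]
print(top_n_anomalies(scores, 2))
print(threshold_labels(scores, 1.0))
```

Either selection step requires the analyst to supply n or the threshold, which is the extra input that scoring based techniques need compared with binary ones.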

Anomaly Detection Techniques
In this section, we discuss the anomaly detection techniques covered in the scope of this paper. There are various kinds of anomaly detection techniques based on different theories [1,25]. In this paper, we classify the anomaly detection techniques into two major categories:
• Supervised Learning: This is the machine learning task of inferring a function from labelled training data [39]. The training data consist of a set of training examples, each of which consists of an input object and a desired output value. A supervised learning algorithm learns from the training data and creates a knowledge base which can be used for mapping new and unseen data.
• Unsupervised Learning: This tries to find hidden structure in unlabelled data, which distinguishes it from supervised learning [44]. For example, clustering algorithms can be considered unsupervised learning algorithms, as pre-labelled data is not necessary [48].
Supervised learning algorithms require pre-labelled data. Labelled data are rare and difficult to find. Moreover, even when pre-labelled data is available, unseen data that are not represented in the labelled data cannot be mapped, such as zero day attacks in the intrusion detection domain [24]. Inspired by this

Nearest Neighbor (NN) based Anomaly Detection and Related Works
The concept of nearest neighbor has been widely used in several anomaly detection techniques. The key assumption used in this scenario is 'Normal data instances stay in a dense neighborhoods and the anomalies stay far away from their neighbors' [20]. Next, we present a couple of anomaly detection techniques [1] based on this idea. Figure 4 shows a simple example of the k-NN method. The corresponding algorithm is shown in Algorithm 1. [20] a non-outlier, otherwise it is an outlier. This concept was further extended by Ramaswamy et al [11], where the anomaly score is based on the k-nearest neighbor implementation. Ramaswamy et al [11] defined outliers based on the distance of a point from its k-th nearest neighbor. They provided a ranking of the top-n outliers by the measure of the outlierness of the points. According to them, the top-n points with the maximum distance to their own k-th nearest neighbor are considered outliers. They also exploited index-based and nested-loop algorithms to detect outliers. Furthermore, they proposed a partition-based algorithm that prunes and processes the partitioned groups to improve the efficiency of outlier detection. Their algorithm reduces the cost of computation in large, multidimensional datasets. Breunig et al [21] proposed assigning each object a degree of being an outlier. This degree is called the Local Outlier Factor (LOF). LOF depends on how isolated the object is with respect to the surrounding neighbourhood. The local outlier factor of an object p is calculated using equation (2), where MinPts defines the minimum number of points as a notion of density and lrd is the local reachability density (1). (For more details on the mathematical terms, please see [21].)
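The top-n ranking by distance to the k-th nearest neighbor described for [11] can be sketched as below. This is a minimal nested-loop illustration, not the index-based or partition-based algorithm of [11]; the function names, the Euclidean distance choice and the sample points are assumptions made for the example.

```python
import math


def kth_nn_distance(points, k):
    """For each point, return the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Nested loop: distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Return indices of the n points with the largest k-th nearest neighbour distance."""
    scores = kth_nn_distance(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical 2-D data: four points in a tight group and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
print(top_n_outliers(points, k=2, n=1))
```

The nested loop costs a number of distance computations that is quadratic in the number of points, which is the cost that the partition-based pruning in [11] is intended to reduce.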

Knorr et al
The outlier factor of object p captures the degree to which p can be called an outlier. It is the average of the ratio of the local reachability density (lrd) of p and those of p's MinPts-nearest neighbours. The authors also described mathematically the LOF for objects deep in a cluster, along with general bounds (upper, lower, and tight). Theorem 1 gives a general upper and lower bound on LOF(p) for any data object p. For the theorem, the following terms are necessary.
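For reference, the definitions of lrd and LOF in [21], to which equations (1) and (2) refer, can be written as follows. The notation is an assumption that follows [21]: N_MinPts(p) is the MinPts-nearest neighbourhood of p, d(p, o) is the distance between p and o, and reach-dist is the reachability distance.

```latex
\begin{align}
\text{reach-dist}_{MinPts}(p, o) &= \max\{\, MinPts\text{-distance}(o),\; d(p, o) \,\} \notag \\
lrd_{MinPts}(p) &= \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right)^{-1} \tag{1} \\
LOF_{MinPts}(p) &= \frac{\sum_{o \in N_{MinPts}(p)} \dfrac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|} \tag{2}
\end{align}
```

Under these definitions, an object deep inside a cluster has an LOF close to 1, because its density is similar to that of its neighbours, while an isolated object has an LOF well above 1.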
Jin et al [36] proposed an approach for mining only the top-n local outliers, because computing the LOF [21] values for every data object requires a large number of k-nearest neighbour searches and can be computationally very expensive. They proposed an efficient microcluster-based local outlier mining algorithm to find the top-n local outliers in a large database. A microcluster MC(n, c, r) is a summarized representation of a group of data p_1, ..., p_n, which are so close together that they are likely to belong to the same cluster. Here, c is the mean center while r = max(d(p_i, c)), i = 1, ..., n, is the radius. Data are compressed into small clusters, and the small clusters are represented as microclusters using some statistical information. Three different algorithms are combined to find the top-n local outliers. First, k-distance bounds for each microcluster are computed. Then, using these k-distance bounds, the LOF bounds are calculated. Finally, given an upper bound and a lower bound for the LOF of each microcluster, the top-n local outliers are ranked. He et al [26] introduced a new definition of outlier, the semantic outlier. A semantic outlier is a data point that behaves differently from the other data points in the same class. They presented a measure of the degree to which each object is an outlier, called the semantic outlier factor (SOF), and proposed an algorithm to mine semantic outliers. They used the SQUEEZER algorithm, which produces good clusters for categorical datasets, and then used their algorithm to calculate the SOF value for each of the objects. Their proposed outlier definition works by identifying the similarity between a specific set and a record. Given a set of records R and a record t, the similarity between R and t is defined as follows: The semantic outlier factor of a record t is defined as in equation (6).
Spiros et al [33] introduced the local correlation integral (LOCI) for evaluating outlierness, which is very efficient in detecting outliers and groups of outliers. The main advantage of this approach is an automatic, data-dictated cut-off to determine whether a point is an outlier. They introduced the multi-granularity deviation factor (MDEF), which at radius r for a point p_i is the relative deviation of its local neighborhood density from the average local neighborhood density in its neighborhood.
Zhang et al [17] proposed a new outlier definition, the local distance-based outlier factor (LDOF), which is sensitive to outliers in scattered datasets (Figure 5). LDOF uses the relative distances from an object to its neighbors to measure how much the object deviates from its scattered neighborhood. The higher the violation degree of an object, the more likely the object is an outlier. The local distance-based outlier factor of p_i is defined in equation (7), where d_{p_i} is the k-nearest neighbors distance of object p_i and D_{p_i} is the k-nearest neighbors inner distance of p_i. (8). To achieve a normalization that makes the scaling of PLOF independent of the particular data distribution, the aggregate value nPLOF (9) is obtained during PLOF computation.

Clustering based Anomaly Detection and Related Works
As discussed earlier, an anomaly deviates from the regular characteristics of the data. Since the goal of clustering is to group together similar data, it can be used to detect anomalous patterns in a dataset [40].
There are three key assumptions when using clustering to detect anomalies [24]:
1. Assumption 1: Once the clusters are created, any new data that do not fit well with the existing clusters of normal data are considered anomalous. For example, density based clustering algorithms [48] such as DBSCAN do not include noise inside the clusters. As a result, noise is considered anomalous. For example, in Figure 6, C1 and C2 are clusters containing normal instances and A1 and A2 are anomalies.
2. Assumption 2: In some cases, a cluster contains both normal and anomalous data. It is expected that normal data lie close to the nearest cluster centroid and anomalies lie far away from the centroids (Figure 7). Based on this assumption, anomalies are detected using a distance score.
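The distance-score idea, in which normal data lie close to the cluster centroid and anomalies lie far from it, can be illustrated with a small sketch. It assumes a single, already formed cluster and flags instances whose centroid distance exceeds a multiple of the mean centroid distance; the `multiplier` value of 2.0, the function names and the sample data are hypothetical, and the mean here includes the tested object itself as a simplification.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def centroid_distance_outliers(cluster, multiplier=2.0):
    """Return indices of points farther from the centroid than multiplier * mean distance."""
    c = centroid(cluster)
    dists = [math.dist(p, c) for p in cluster]
    mean_dist = sum(dists) / len(dists)  # simplification: mean over all points
    return [i for i, d in enumerate(dists) if d > multiplier * mean_dist]


# Hypothetical cluster: five nearby points and one distant point.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (0.9, 1.0), (6.0, 6.0)]
print(centroid_distance_outliers(cluster))
```

Note that a distant point also pulls the centroid and the mean distance towards itself, so the multiplier has to be chosen with the data in mind.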
In [40], the authors identified an outlier according to the distance of a data instance from the centroid. If the distance exceeds a fixed multiple of the mean distance of all other data points from the centroid, the instance is considered an outlier. Formally, 'an object in a set of data is an outlier if the distance between the object and the centroid of the dataset is greater than multi times the mean of the distances between centroid and other objects in the dataset' [40]. They also showed that removing outliers from clusters can significantly improve
Amer et al [14] introduced the Local Density Cluster-Based Outlier Factor (LDCOF), which can be considered a variant of CBLOF [22]. The LDCOF score (16) is calculated as the distance to the nearest large cluster divided by the average distance to the cluster center of the elements in that large cluster. With SC and LC denoting the sets of small and large clusters and $\mathrm{distance}_{avg}(C) = \frac{\sum_{i \in C} d(i, C)}{|C|}$, the score is
$$\mathrm{LDCOF}(p) = \begin{cases} \dfrac{\min_{C_j \in LC} d(p, C_j)}{\mathrm{distance}_{avg}(C_j)} & \text{when } p \in C_i \in SC \text{ where } C_j \in LC \\ \dfrac{d(p, C_i)}{\mathrm{distance}_{avg}(C_i)} & \text{when } p \in C_i \in LC \end{cases} \quad (16)$$
Cluster Based Outlier Detection (CBOD) [37] is another technique, which consists of two stages. In the first stage, it generates clusters from a given dataset, and in the second stage it computes an outlier factor for each cluster. The outlier factor of cluster C_i, OF(C_i), is defined as the weighted sum of distances between cluster C_i and the rest of the clusters. OF(C_i) measures the outlier degree of the cluster: the larger the value, the higher the possibility that the cluster is an outlier cluster.

OF(C
Minimum b clusters which satisfy the criteria as follows are labelled as outlier clusters.They used detection rate and false alarm rate to measure performance.

Statistical based Anomaly Detection and Related Works
The statistical approaches discussed here are considered the first generation of techniques for anomaly detection. Figure 9 portrays the most commonly used µ ± 3σ rule for detecting anomalous data. Normally distributed data follow a bell curve, which can be represented mathematically as in equation (18): $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Here, µ stands for the mean or average, σ is the standard deviation and σ² is the variance. When µ = 0 and σ = 1, the distribution is called the standard normal distribution. Data with values greater than µ + 3σ or less than µ − 3σ are considered anomalous.
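The µ ± 3σ rule can be sketched in a few lines of Python. This is a minimal illustration that estimates µ and σ from the sample itself; the function name `three_sigma_anomalies` and the readings are hypothetical and are not taken from the datasets used in this paper.

```python
import statistics


def three_sigma_anomalies(values):
    """Return indices of values outside [mu - 3*sigma, mu + 3*sigma]."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation of the sample
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical sensor readings: 40 values near 10 and a single spike of 25.
readings = [10.0] * 20 + [10.5] * 10 + [9.5] * 10 + [25.0]
print(three_sigma_anomalies(readings))
```

Because the anomalous value itself inflates the estimated σ, the rule needs a reasonably large sample before a single extreme value can fall outside the 3σ band.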
These techniques are also called model-based techniques. Models are based on the probability distribution of the data, and anomalies are detected according to how well the data fit the model. Statistical approaches are categorized into two groups depending on the probability distribution, as follows:
• Parametric Approaches: In these approaches the probability distribution of the data is known (supervised). Anomalies are then detected using the distribution parameters. A point is an anomaly if it deviates significantly from the data model. However, in many situations prior knowledge of the distribution cannot be obtained. As a result, unsupervised learning techniques are often preferred over supervised ones, despite being less accurate. Wu et al [30] proposed two algorithms, for outlying sensor detection and event boundary detection. The basic idea of outlying sensor detection is that each sensor first computes the difference between its reading and the median of the neighboring readings. Each sensor then collects all the differences from its neighborhood and standardizes them. A sensor is an outlier if the absolute value of its standardized difference is sufficiently large. The algorithm for event boundary detection is based on the outlying sensor detection algorithm. For an event sensor, there often exist two regions, each containing the sensor, such that the absolute value of the difference between the reading of the sensor and the median reading of all other sensors in one region is much larger than that in the other region. These approaches are not effective because they do not consider the temporal correlation of sensor readings [1]. Bettencourt et al [29] proposed an anomaly detection technique to identify anomalous events and errors in ecological applications of distributed sensor networks. This method uses the spatio-temporal correlation of sensor data to distinguish erroneous measurements from events. A measurement is considered anomalous when its value in the statistical significance test is less than a user specified threshold. The disadvantage of this approach is its dependence on the user specified threshold [1]. Jun et al [31] presented a statistical approach that uses the alpha-stable distribution. The proposed algorithm consists of collaborative time-series estimation, variogram application and principal component analysis (PCA). Each node detects any temporally abnormal data and transmits the verified data to a local cluster-head, which detects any surviving spatial outliers and determines the faulty sensors accordingly. Their approach achieves 94% accuracy when the noise level is alpha = 0.9. Although an alpha-stable distribution might be suitable for real sensor data, the cluster based structure may be susceptible to dynamic changes of network topology [1].
• Non Parametric Approaches: These approaches have no knowledge of the underlying data distribution, like unsupervised learning methods. A distance measure is used to identify anomalies in this scenario. Anomalies are those points which are distant from their own neighborhood in a dataset. Various detection techniques are available with a wide range of parameters. They resemble anomaly detection using clustering based assumption 2. Parametric methods are not as flexible as non-parametric methods; however, the efficiency of non-parametric methods might deteriorate in some cases due to dimensionality and computational complexity. Two widely used approaches in this category are discussed below:
- Histogramming: This model counts the frequency of occurrence of different data instances and compares the test instance with each of the histogram categories to test whether it belongs to any of them [18]. Sheng et al [32] proposed a histogram-based technique for anomaly detection to reduce the communication cost of data collection applications in sensor networks. Rather than collecting all the data in one location for centralized processing, they propose collecting hints about the data distribution and using the hints to filter out unnecessary data and identify potential anomalies. The main drawbacks of this technique are its communication overhead and its restriction to one-dimensional data [1].
- Kernel Function: This function is used to estimate the probability distribution function (pdf) of the normal instances. Data instances which lie in the low probability area of the pdf are declared anomalies. Palpanas et al [15] proposed a technique for online deviation detection in streaming data. They discussed how their technique can be operated efficiently in the distributed environment of a sensor network. In the sensor data, a value is considered an anomaly if the number of values in its neighborhood is less than a user specified threshold. This technique can also be applied to the identification of anomalies from a more global perspective [1].
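The histogramming idea can be sketched for one-dimensional data as follows. This is only an illustration, not the technique of [32] or the HBOS implementation of [18]: it assumes equal-width bins, and the function name `histogram_scores`, the negative log-frequency score and the sample values are hypothetical choices.

```python
import math


def histogram_scores(values, n_bins=5):
    """Score each value by the negative log of the relative frequency of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal

    def bin_index(v):
        # Clamp so that the maximum value falls into the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_index(v)] += 1
    total = len(values)
    # Rare bins have a low relative frequency and therefore a high score.
    return [-math.log(counts[bin_index(v)] / total) for v in values]


# Hypothetical readings: seven values near 1.0 and a single value of 5.0.
values = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.95, 5.0]
scores = histogram_scores(values)
print(scores.index(max(scores)))
```

Instances that fall into sparsely populated bins receive the highest scores, which mirrors the low-probability criterion used by the kernel-based approach.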

Criteria for Benchmarking Anomaly Detection Algorithms
This section discusses the key aspects for evaluating anomaly detection algorithms in terms of big data. We propose that the following points be considered when selecting benchmark anomaly detection techniques for SCADA systems:
• Size of the Data (Volume): Size is an important factor for anomaly detection algorithms. In the case of big data, it is a crucial parameter for measuring the efficiency of an anomaly detection algorithm. Some anomaly detection techniques might work well on a small dataset but perform poorly on big data, and vice versa.
• Dimensionality: This is closely related to the computing efficiency of any data mining technique. Big data commonly has high dimensionality, and as the dimensionality increases the data become sparse. As a result, similarity/dissimilarity calculation becomes challenging.
• Type of Data: Handling a single data type and handling mixed types are completely different tasks. For example, handling only numerical data for anomaly detection is more computationally efficient than handling a dataset with numerical, categorical and binary data. In the case of big data, this is an important issue for the efficiency of anomaly detection.
• Velocity: This criterion deals with the complexity of the anomaly detection algorithms.
• Input Parameter: Selecting the best possible parameters for any algorithm is a challenge. It is even more challenging when input parameters are required for big data. A non-optimal value of an input parameter causes a computational burden. Also, the more input parameters there are, the more complex the task becomes. In an unsupervised setting, it is also a challenge to provide the best parameter values to the anomaly detection techniques. So, fewer parameters are better in this case.
In Table 1, we showcase the characteristics of anomaly detection algorithms based on the criterion

Strength and Weakness
We highlight the merits and demerits of the anomaly detection techniques discussed in Section 3.
Nearest Neighbour Techniques: The main advantage of nearest neighbour based techniques is that they are unsupervised. However, when anomalies have a large number of close neighbours, it is not possible to identify them correctly. Also, the distance computation is expensive, and it becomes more complex when the data are of mixed type, such as numerical, categorical and binary.
Clustering Techniques: The techniques that detect anomalies in a binary fashion are computationally efficient irrespective of the clustering algorithm, since an outlying factor does not have to be assigned to each object in the dataset, as it does with scoring based output. The top-N anomaly concept is absent in these techniques and hence they are unsupervised. The main drawback of these techniques is their inaccuracy in detecting all the rare class instances. Since not all the data objects are considered as candidate outliers, many anomalies might be missed and normal instances may be detected as anomalies.
The scoring based techniques are the most effective in detecting anomalies accurately, since all the objects are considered as candidate anomalies. However, the drawback of these techniques is their computational cost, since an outlyingness factor has to be assigned to every object. The top-N anomalies must be specified by the data analyst, and thus the approach becomes supervised.
Statistical Techniques: Statistical approaches come with a strong mathematical background for detecting anomalies. However, parametric approaches are not feasible when prior knowledge of the data distribution is not available, and are therefore of limited use in many cases. In comparison, non-parametric methods are quite useful since knowledge of the data distribution is not required. However, these methods might have high computational complexity for high dimensional datasets. Also, user-defined parameters are not easy to set.

Experimental Evaluation on SCADA Systems Big Data
This section starts with a brief discussion of the datasets used. We then discuss the evaluation metrics used in the paper. Finally, we present the evaluation results in figures and tables.

SCADA Datasets used in this paper
Table 2 describes the characteristics of some of the common, widely used SCADA datasets [28]. The simulated anomalies in Sim1 and Sim2 are man-in-the-middle attacks [38]. Here, a water distribution system is simulated using the EPANET library [46]. Anomalies were created using the man-in-the-middle attacks. In this scenario, water pumps were turned off when the reserve in the tanks was low.
In the single-hop and multi-hop (indoor and outdoor) datasets, anomalies are injected [45]. For the single-hop scenario, two indoor and two outdoor sensor nodes are used to collect temperature and humidity data for six hours. Anomalies are introduced by placing a kettle of hot water at one of the sensors. The simultaneous rise in temperature and humidity is considered anomalous in this scenario. In the multi-hop scenario, multi-hop routing is used to create a larger sensor network. As with the single-hop datasets, anomalies are introduced using hot water at the temperature and humidity sensors.

Evaluation Measures
We measure the performance of the anomaly detection algorithms using the standard evaluation criteria [1], which are briefly discussed here. All of them are based on the confusion matrix. The 2 × 2 matrix contains the number of True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). Table 3 displays the confusion matrix.
TP: No. of anomalies correctly identified as anomalous.
FP: No. of normal data incorrectly identified as anomalous.
TN: No. of normal data correctly identified as normal.
FN: No. of anomalies incorrectly identified as normal.
Listed below are the five evaluation measures based on the confusion matrix.
• FPR - The False Positive Rate (FPR) is the proportion of non-relevant data that are retrieved, out of all the non-relevant data available. The lower the value, the better the anomaly detection technique. Equation (20) shows how to calculate FPR.
$$FPR = \frac{FP}{FP + TN} \quad (20)$$
• Recall - Recall is the fraction of the data relevant to the query that are successfully retrieved. In the case of anomaly detection, recall is also known as TPR or Hit Rate, and can be calculated using (21).
$$Recall = \frac{TP}{TP + FN} \quad (21)$$
• F-1 - The F-1 score is the harmonic mean of precision (TP/(TP + FP)) and recall. Equation (22) shows how to calculate F-1.
$$F\text{-}1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (22)$$
Last but not least, we also consider the run time (in seconds) as an important evaluation criterion for anomaly detection algorithms.
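The confusion-matrix measures can be computed as in the sketch below. FPR, recall and F-1 follow the definitions given in this section; accuracy and MCC use their standard textbook definitions, which is an assumption since their equations are not shown here. The function name and the example counts are hypothetical.

```python
import math


def evaluation_measures(tp, fp, tn, fn):
    """Compute accuracy, FPR, recall, F-1 and MCC from confusion-matrix counts.

    Assumes every denominator is non-zero, except for MCC, which is set to 0.0
    when its denominator vanishes.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"accuracy": accuracy, "fpr": fpr, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts: 50 true anomalies among 1000 instances.
measures = evaluation_measures(tp=40, fp=10, tn=940, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Accuracy alone can be misleading when anomalies are rare, which is one reason for reporting recall, F-1 and MCC alongside it.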

Experimental Results
This section contains the performance analysis of anomaly detection techniques based on the evaluation measures discussed in the previous section. For simplicity, we scale all the metrics between 0 and ±100. Standard values are used for the input parameters of all the techniques. The representative algorithms are the following:
• Nearest Neighbour:
  - k-NN: Each data instance is given a score for being anomalous based on the average distance to its nearest neighbours [11].
  - LOF: LOF assigns an anomaly score to the data instances based on the local density of the data points [21].
  - COF: The connectivity based outlier factor is a modification of the LOF approach which can handle outliers deviating from low density patterns [43].
  - aLOCI: Calculates the outlier score based on the local correlation integral [33].
  - LoOP: The LoOP score represents the probability that the object is a local density outlier [47].
• Clustering:
  - CBLOF: CBLOF creates clusters from the given dataset and then categorizes them into small clusters and large clusters using the parameters α and β. The anomaly score is then calculated based on the size of the cluster the point belongs to as well as the distance to the nearest large cluster centroid [22].
  - LDCOF: This local density based anomaly detection algorithm sets the anomaly score based on the distance to the nearest large cluster divided by the average cluster distance of the large cluster [14].
  - CMGOS: This method calculates the anomaly score based on a clustering result. The outlier score of an instance depends on how likely its distance to the cluster center is [14].
• Statistical:
  - HBOS: Calculates an outlier score by creating a histogram with a fixed or a dynamic bin width [18].
  - LIBSVM: Computes the outlier score using one-class SVMs [42]. This operator extends the semi-supervised one-class SVM such that it can be used for unsupervised anomaly detection.
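To make two of these scoring rules concrete, the sketch below implements a k-NN score (average distance to the k nearest neighbours) and a simplified HBOS score with a fixed bin width. It is an illustration under stated assumptions, not the implementation used in the experiments: the function names, the brute-force neighbour search, the default of five bins and the toy data are all hypothetical choices.

```python
import math


def knn_scores(points, k):
    """Anomaly score = average Euclidean distance to the k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def hbos_scores(points, n_bins=5):
    """Fixed-bin-width histogram score: sum over features of log(1 / relative bin height)."""
    n, dims = len(points), len(points[0])
    scores = [0.0] * n
    for d in range(dims):
        values = [p[d] for p in points]
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
        bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        tallest = max(counts)  # heights are normalised so the tallest bin equals 1
        for i, b in enumerate(bins):
            scores[i] += math.log(tallest / counts[b])
    return scores


# Five clustered points and one obvious outlier (index 5).
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0), (8.0, 8.0)]
knn = knn_scores(points, k=2)
hbos = hbos_scores(points)
print(max(range(len(points)), key=knn.__getitem__))
print(max(range(len(points)), key=hbos.__getitem__))
```

Both scores rank the isolated point highest here. The brute-force search costs O(N²) distance computations, which is why index structures are normally used for larger datasets.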

We categorize the performance of the anomaly detection algorithms based on the taxonomy of anomalies in SCADA systems (Figure 10). Table 4 reports the results on the real dataset (WTP). For the simulated datasets, it is surprising that the semi-supervised anomaly detection technique LIBSVM has better recall than the others; however, it suffers from an unacceptable run time. On the other hand, the nearest neighbour based method k-NN has a very low run time and acceptable recall. Clustering based approaches are not well suited to the simulated datasets used here, and the statistical approach HBOS outperforms the clustering techniques. Table 5 displays the results on the simulated datasets.

For the datasets with injected anomalies in the multi-hop scenario, we found that the clustering based approaches perform best across the evaluation measures (Table 6). Nearest neighbour based approaches are the next best. Of HBOS and LIBSVM, the latter detects anomalies better but incurs a high computational burden (run time).
Finally, for the datasets in the single-hop scenario, the clustering based methods perform consistently well, whereas the nearest neighbour methods are quite variable (Table 7). LIBSVM performs better than HBOS but still suffers from a high run time.
It is interesting to observe that, for all the anomaly detection techniques, the Recall and F-1 values are identical. Since the top N anomalies detected by each technique are matched against the actual N anomalies in the dataset, the number of false positives always equals the number of false negatives. Precision therefore equals recall, and the Recall and F-1 scores always yield exactly the same values. Finally, we summarise the performance of each of the anomaly detection techniques in Figure 11. In Table 8 we also summarise the performance on the different SCADA datasets, and suggest when to use each technique based on the results discussed earlier. The sign (√) recommends applying the technique and the sign (×) discourages its usage.
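The equality of Recall and F-1 under top-N matching can be checked numerically. The sketch below is illustrative only: the function name `top_n_metrics`, the scores and the labels are made up, and it assumes that exactly N instances are flagged, where N is the true number of anomalies (N > 0).

```python
def top_n_metrics(scores, y_true):
    """Flag the N highest-scoring instances, where N is the true number of anomalies."""
    n = sum(y_true)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n])
    tp = sum(1 for i in flagged if y_true[i] == 1)
    fp = sum(1 for i in flagged if y_true[i] == 0)   # flagged but normal
    fn = sum(1 for i in range(len(y_true))
             if y_true[i] == 1 and i not in flagged)  # anomalous but not flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return fp, fn, precision, recall, f1


# Hypothetical anomaly scores for six instances, three of which are true anomalies.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
y_true = [1, 0, 0, 0, 1, 1]
fp, fn, precision, recall, f1 = top_n_metrics(scores, y_true)
print(fp, fn, precision, recall, f1)
```

Because N instances are flagged and N are truly anomalous, FP = N − TP = FN, so 2TP + FP + FN = 2(TP + FN) and F-1 reduces to TP / (TP + FN), which is the recall.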

Conclusion and Future Works
This paper gives a detailed discussion of popular anomaly detection techniques for SCADA systems and analyses their performance. We conclude that nearest neighbour and clustering based approaches are more suitable for SCADA systems than statistical and semi-supervised support vector machine based approaches. In future we will investigate the following:
• How to find the most suitable input parameter values?
• How to incorporate the idea of contextual anomaly from a big data perspective?
• How can the incorporation of multi-view clustering [16], hierarchical clustering [12] and co-clustering [13] improve the efficiency of clustering-based anomaly detection techniques?

Figure 1. An overview of SCADA Architecture

Figure 2. A simple taxonomy of anomaly

Figure 3. Taxonomy of Anomaly Detection Techniques

Figure 3 shows a simple taxonomy for anomaly detection in the scope of this paper. The terms anomaly and outlier are used interchangeably throughout the paper.
presented an algorithm to detect distance-based outliers. They consider a data point O in a dataset T a DB(p;D)-outlier if at least a fraction p of the data points in T lie at a distance greater than D from O. Their index-based algorithm executes a range search with radius D for each data point. If the number of data points in its D-neighborhood exceeds a threshold, the search stops and that data point is declared as

Algorithm 1: Basic k-NN Algorithm
Input: D = {(x_1, c_1), ..., (x_N, c_N)}
       x = (x_1, ..., x_N), new instance to be classified
Begin
  for each labelled instance (x_i, c_i)
    Calculate d(x_i, x), the distance from x_i to x
  end
  Order d(x_i, x) from lowest to highest (i = 1, ..., N)
  Let D_x^k be the k nearest instances to x
  Label x by the most frequent label in D_x^k
End
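The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance for d(x_i, x). The function name `knn_classify`, the tie-breaking behaviour and the toy training set are assumptions and do not come from the paper.

```python
import math
from collections import Counter


def knn_classify(labelled, x, k):
    """Label x by the most frequent class among its k nearest labelled instances."""
    # Calculate d(x_i, x) for every labelled instance and order from lowest to highest.
    by_distance = sorted(labelled, key=lambda pair: math.dist(pair[0], x))
    # D_x^k: the k nearest instances to x.
    nearest = by_distance[:k]
    # Majority vote; on a tie, Counter.most_common keeps the label seen first,
    # i.e. the label of the closer instance.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: three "normal" and two "anomalous" instances.
training = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"),
    ((5.0, 5.0), "anomalous"),
    ((5.2, 4.9), "anomalous"),
]
label = knn_classify(training, (4.8, 5.1), k=3)
print(label)
```

Note that this is the supervised, label-voting form of k-NN. The unsupervised k-NN detector evaluated in the experiments instead scores each instance by its average distance to the nearest neighbours.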

Figure 5. The explicit outlierness of object p_i with the help of the LDOF definition. A is the center of the neighborhood system of p_i. The dashed circle includes all neighbors of p_i. The solid circle is the reformed neighborhood region of p_i. Adapted from [1]

Figure 10 displays a simple taxonomy of anomalous scenarios in SCADA systems. There are three major categories of anomalies based on the datasets used in this paper. The real anomalies are from a water treatment plant. The simulated anomalies are designed by computer software. In real sensor nodes, the anomalies are injected by creating changes in temperature. The real anomalies in the WTP dataset [35] are caused by inclement weather. The dataset contains the daily sensor measurements of an urban waste water treatment plant. Solid overload caused by stormy
EAI Endorsed Transactions on Industrial Networks And Intelligent Systems 02 -05 2015 | Volume 2 | Issue 3 | e5

Figure 10. Taxonomy of anomaly in SCADA systems

$$F\text{-}1 = \frac{2TP}{2TP + FP + FN} \tag{22}$$

• MCC - The Matthews correlation coefficient is a popular measure in machine learning of the quality of binary (two-class) classifications. It takes the true and false positives and negatives into account. The MCC takes a value between -1 and +1. An MCC score of +1 represents a perfect prediction and -1 indicates complete disagreement between observation and prediction (23).

$$\mathrm{MCC} = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{23}$$

• How to reduce the run time complexity of semi-supervised support vector machine based anomaly detection?

Table 1. Characteristics of anomaly detection algorithms

Table 2. Characteristics of the SCADA datasets

Table 3. Standard confusion metrics for evaluation of anomaly detection algorithm

Table 4. Performance of Anomaly Detection Techniques on Real SCADA Dataset (WTP: Water Treatment Plant)

Table 5. Performance of Anomaly Detection Techniques on Simulated SCADA Datasets

Table 6. Performance of Anomaly Detection Techniques on Datasets with Injected Anomalies

Table 7. Performance of Anomaly Detection Techniques on Datasets with Injected Anomalies (Single-Hop)

Table 8. Characteristics of anomaly detection algorithms