An Evaluation Framework for Moving Target Defense Based on Analytic Hierarchy Process

A Moving Target Defense (MTD)-enabled system is one which can dynamically and rapidly change its properties and code such that attackers do not have sufficient time to exploit it. Although a variety of MTD systems have been proposed, little work has focused on assessing the relative cost-effectiveness of different MTD approaches. In this paper, based on a generic MTD theory, we propose five general evaluation metrics and an assessment framework on top of the Analytic Hierarchy Process (AHP), which aggregates these five metrics and systematically evaluates and compares the security strengths and costs of multiple MTD-based approaches in the same category. This framework can be widely used in different MTD categories under various attacks, and it enables a security specialist to choose the best MTD approach from a set of possible alternatives based on his/her goal and understanding of the problem. A detailed case study on a specific MTD category called software diversification validates the effectiveness of this framework. Our evaluation results rank three software diversity algorithms and choose the best one among the three based on the problem setting and situation constraints.

Received on 24 December 2017; accepted on 26 December 2017; published on 4 January 2018


Introduction
In the history of the arms race between attackers and defenders, the game setting has always favored the attackers. This is because as defenders we must assume that the attackers know how the system works, and hence we must carefully examine the whole system to make sure that no vulnerability exists, while the attackers only need to know a single vulnerability to break the system. Ensuring that a system is free of vulnerabilities is extremely difficult, if not impossible, especially for large systems with millions of lines of code. Hence, besides keeping up with novel and advanced techniques for identifying and patching system vulnerabilities, detecting malware, and building new systems with security embedded from scratch, an innovative theme in cyber security has recently emerged.
This new theme is named Moving Target Defense (or MTD) [1]. The philosophy of MTD is that instead of attempting to build flawless systems to prevent attacks, one may continually change certain system dimensions over time in order to increase the complexity and cost for attackers to probe the system and launch attacks. Rather than leaving the system properties and code static and persistent long enough for an attacker to exploit vulnerabilities, an MTD-enabled system rapidly changes its properties and code such that the attackers do not have sufficient time to study, search, and further exploit them. This strategy ultimately reverses the asymmetric advantage of attackers.
The state-of-the-art approaches based on the concept of MTD can be roughly classified into a few categories. For example, various obfuscation techniques have been proposed to safeguard individual systems against code injection attacks and memory error exploits, e.g., address space layout randomization (ASLR) [2][3][4] and instruction set randomization (ISR) [5, 6]. Higher-level MTD approaches have mainly been based on diversity-inspired software assignment [7][8][9], and on system and network re-configuration, substitution, and shuffling techniques [10][11][12][13]. In military environments, frequency hopping techniques such as Frequency Hopping Spread Spectrum (FHSS) [14] have long been used to defend against eavesdropping and radio jamming. Such techniques are all examples of MTD.
Although a variety of moving target defenses have been discussed in the literature [1], a well-accepted methodology to assess the cost-effectiveness of different MTDs is still missing. Many fundamental questions have been raised regarding the evaluation of MTD. For example, how does one compare two or more MTD approaches, and how does one determine the most cost-effective MTD-based approach for addressing a specific security issue?
The difficulty of evaluating the strength of MTD-based approaches is mainly due to the following reasons. First, there is a lack of knowledge about which criteria or metrics must be considered. Currently, different MTD approaches [1] use different sets of terminologies to evaluate their performance. Second, because factors such as the types of exposed attack surface and the capabilities, resources, and strategies of both attackers and defenders vary among systems, different MTD approaches are needed to address different threats; thus, it is difficult to find unified criteria that fit all MTD approaches. Third, the effectiveness of an MTD may or may not be quantitatively measurable, as in practice system security analysis is often done in an ad-hoc and descriptive way. Despite all these challenges, we still need a way to evaluate and compare different MTD techniques in the same category. Although there are many previous works on evaluating system and network security mechanisms, to our knowledge, few have focused on MTD evaluation.
In this paper, we propose an assessment framework for systematically evaluating and comparing the security strengths and costs of multiple MTD-based approaches. This framework will enable a security specialist to choose the best MTD approach from a set of possible alternatives based on his/her goal and understanding of the problem. The ability to choose the appropriate MTD approach (w.r.t. the specific system or network setting and constraints) is critical for the successful deployment of MTD systems. Our main contributions are in three aspects:
• Based on a generic MTD theory, we carefully select five general evaluation metrics tailored for MTD evaluation and comparison;
• We aggregate the five evaluation metrics by proposing, for the first time, a generic MTD evaluation framework based on the Analytic Hierarchy Process (AHP). Due to its generality, this framework can be used in different MTD categories under a variety of attacks;
• We present a detailed case study on evaluating a specific MTD category called software diversification, with evaluation results that validate the effectiveness of our proposed evaluation framework.
The rest of this paper is organized as follows. We summarize the related work on MTD systems and their evaluations in Section 2. We present our system models, including a uniform MTD theory model and our attack model, in Section 3. Then, we propose a generic evaluation framework for MTD in Section 4. After that, we give a detailed case study on a specific MTD category, named software diversification, and present our evaluation results in Section 5. In Section 6, we discuss how to apply our evaluation methodology to other MTD categories and at different levels. Last, we conclude our paper and discuss future work in Section 7.

Related Work
The dynamic nature of Moving Target Defense (MTD) alleviates the timing asymmetry between attacks and defenses. So far, many MTD systems have been developed [15][16][17].
Major MTD systems can be divided into four categories. The first category is software-based diversification, such as [18]. The basic defensive idea is to switch among multiple functionally equivalent but internally different program variants to hinder attacks. Different software implementations are not supposed to share the same vulnerability, so even if the attacker exploits a vulnerability in one software version, it still takes a similar amount of time to compromise the other software versions. The second category is called runtime-based diversification. The basic defensive idea here is to mitigate attacks by dynamically introducing randomization into the runtime environment. Examples in this category include instruction set randomization [5, 6], system call number randomization [19], and address space layout randomization [20]. The third MTD category is named network-based diversification [21], such as host IP mutation and hopping, network database schema mutation, and random fingerprinting, with the basic defensive idea being: randomly change network configurations without causing network service failures. The fourth MTD category is dynamic platform techniques [21], which change platform properties or switch among different platforms to stop attacking processes. Examples in this category include virtual machine rotation, server-switching techniques, and self-cleaning techniques.
However, based on the current literature, it is hard to compare different MTD systems through evaluation and to pick an optimal system based on a given situation and its constraints. Current MTD evaluation methodologies [22][23][24] can be divided into four main categories as well: attack-based experiments, simulation-based evaluation, mathematics-based evaluation, and game-theory-based evaluation [25]. Among these, mathematics-based evaluation can be further categorized into probability-based evaluation and Markov-model-based evaluation [26]. There also exist hybrid approaches which combine at least two of the above methods. The major limitation of previous work is the lack of a generic, systematic, and fine-tuned way to evaluate and compare MTD systems of the same category.
In this paper, for the first time, we carefully choose five general evaluation metrics based on a generic MTD theory. We also aggregate these five metrics by proposing an evaluation framework on top of the Analytic Hierarchy Process (AHP). We validate the effectiveness of this framework through a case study on a specific MTD category called software diversification, and we discuss how to apply it to other MTD categories.

A Uniform MTD Theory Model
A uniform MTD theory model is shown in Figure 1. In this model, the MTD system starts from an initial state. Based on the current pool of the adaptation space, there are numerous valid states (after eliminating those excluded by the environment constraints) that the initial state can transit to. The MTD system randomly picks a state based on the current situation and transits to that state after a small but unpredictable delay. Ideally, this process is an infinite loop and the MTD system never runs out of valid states, even under attack.
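As an illustration, the following minimal Python sketch (our own, not part of the original model) captures this loop; `adaptation_space` and `is_valid` are hypothetical placeholders for a concrete MTD system's state enumeration and environment constraints.

```python
import random
import time

def mtd_self_evolving_loop(initial_state, adaptation_space, is_valid):
    """Sketch of the uniform MTD theory model in Figure 1."""
    state = initial_state
    while True:
        # Valid states: the current adaptation space minus those
        # eliminated by the environment constraints.
        valid = [s for s in adaptation_space(state) if is_valid(s)]
        if not valid:
            return state  # the loop ends only if no valid state remains
        # Transition after a small but unpredictable delay, then
        # randomly pick the next state based on the current situation.
        time.sleep(random.uniform(0.1, 2.0))
        state = random.choice(valid)
```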

Attack Model
The attacker's eventual goal is to stop the MTD self-evolving process, i.e., to terminate the infinite loop. There are two ways for the attacker to break the MTD system. The first way is to break as many states as possible by exploiting vulnerabilities; then, once the MTD system evolves into one of those vulnerable states, the MTD system will be broken into. The second way is for the attacker to predict the next state and break that specific state. These two methods should be equally difficult for the attacker to achieve if the MTD defensive system is well designed.

The Proposed Generic MTD Evaluation Framework
We first give an overview of the evaluation framework. We then briefly present the five general evaluation metrics. The AHP procedure, which aggregates all the evaluation metrics, is discussed at the end of this section.

An Overview of the Evaluation Framework
We propose that the evaluation of MTD-based approaches be generalized with five metrics: survivability, unpredictability, movability, stability, and usability. In practice, however, not all metrics can be directly and quantitatively measured with absolute meanings, so we further propose a way to aggregate the metrics based on their relative importance, which enables us to quantitatively assign weights to the metrics in an automatic way and to compare MTD approaches within each category. An overall flowchart of the evaluation framework can be seen in Figure 2.

Evaluation Metrics
Below, we justify the rationale for picking these five general evaluation metrics (survivability, unpredictability, movability, stability, and usability) based on our system models; then we briefly explain the concepts/definitions of all five metrics. The whole picture and the interactions among the different metrics can be seen in Figure 3.
Survivability. Survivability describes the degree to which a system/network is able to withstand attacks. Higher survivability means that the system under attack is able to function as normal with high probability for a longer time. Survivability is closely related to the size of the attack surface and the types of vulnerabilities. The larger the attack surface, the more ways a system can be attacked, and the lower the survivability. On the other hand, if the vulnerability is at the kernel level, greater damage can be done. Although it is desirable to directly measure the survivability of the whole system, in reality this is often not possible because a system may carry multiple types of vulnerabilities of different impacts. In this paper, we consider survivability in the context of a particular attack, and it is measured in either of the following two ways, depending on whether the system under protection is a host or a networked system: (i) the likelihood that a host machine will be compromised in the presence of an attack when an attacker exploits a vulnerability; (ii) the maximal number of machines in a network that will be compromised altogether when an attacker discovers and exploits a particular vulnerability.
More specifically, survivability relates to the life span (or depth) of the MTD self-evolving loop. Ideally, a successful MTD system should have an infinite loop. In reality, a longer loop means higher survivability of the MTD system. Considering the cost of the system, there exists a maximum number of possible states in practice.
Definition 1. Survivability is defined as the relative depth of the MTD self-evolving loop, i.e., a real value between 0 and 1 that represents the ratio between the total number of states in the MTD self-evolving loop and the maximum number of possible states in practice.

Unpredictability. MTD-based approaches introduce randomness into the protected system to make its state more unpredictable to an attacker. As a fundamental criterion, unpredictability requires the critical aspects of the system to remain uncertain to the attacker, which makes it difficult for the attacker to anticipate defensive actions. Higher unpredictability means that an attacker is less likely to accurately determine the "key" of transformation of a particular MTD strategy; consequently, the data collected by the attacker contains more noise. In general, unpredictability is determined by the number of states in the moving space of a system and the probability that each state is traversed.
More specifically, unpredictability relates to the unlinkability between two consecutive valid states.
Definition 2. Unpredictability is defined as the unlinkability between two consecutive valid states in the MTD self-evolving loop. Unpredictability is high (close to 1) if the current state is equally likely to transit to any of the valid states in the next adaptation space, i.e., $prob(S_i \to S_j) = \frac{1}{n}$, where $n$ is the total number of valid states in the next adaptation space and $j$ is an integer between 1 and $n$.

Movability.
Movability is an important characteristic of an MTD-based strategy: it overcomes the limitation of static defense mechanisms by dynamically altering some properties of systems/programs (e.g., proxy substitution or reconfiguration of the network, such as IP addresses). For an actual host or network, there may be many practical constraints on moving. Instead of randomly changing states, a movable strategy or algorithm should be designed to conform to such constraints. Hence, movability is defined as the degree to which certain aspects of a system can be altered without impacting its normal operations; that is, how well an MTD strategy is able to accommodate the given practical constraints.
More specifically, movability relates to the breadth (or width) of the MTD system self-evolving loop.
Definition 3. Movability $M_t$ is defined as the width of the MTD system self-evolving loop at time $t$, i.e., the total number of valid states in the adaptation space at time $t$, after eliminating all the states invalidated by practical constraints.

Stability.
Stability describes whether the performance of the MTD approach is sustained and effective over the "moving space". Assuming a particular MTD approach changes the system's state from one to another, stability requires that the security level of the system remain approximately the same. A non-stable MTD approach might produce one state with a greater attack surface (e.g., many entry points available to untrusted users, or much untrusted software running) and another state with a smaller attack surface, thus exposing the system to greater danger once compromised in the first state.
More specifically, stability means that the MTD self-evolving channel should have roughly even or balanced widths.
Definition 4. Stability is defined via the standard deviation of the widths of the MTD system self-evolving loop. The higher the standard deviation, the lower the stability of the MTD system.
Usability. Usability is defined, from the user's experience, as "ease of use". It is the degree to which a user is satisfied with a particular system state. In other words, it reflects how comfortable it is for system users to perform tasks or routine operations on a protected system. A protected system adopting MTD is considered to have high usability if it causes little inconvenience (e.g., not requiring users to be re-trained for every change of state) and is efficient in time and resources (e.g., minimum delay and disruption).
More specifically, usability reflects the tradeoffs among ease of use, performance, and security.
Definition 5. Usability complements the other evaluation metrics by defining how easy the MTD system is to use. An ideal MTD system should maximize ease of use and minimize the performance penalty while achieving the maximum possible security.
Note that we do not consider hardware/software cost when comparing MTD approaches, because the comparable MTD approaches are from the same category and hence have similar hardware/software requirements.
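To make the loop-based definitions concrete, the following minimal sketch (our own illustration, not from the original text) computes Definitions 1, 3, and 4 from a recorded trace of the self-evolving loop; `widths`, `visited_states`, and `max_states` are assumed inputs. Definition 2 additionally needs the observed transition distribution, and Definition 5 needs user feedback, so both are left out here.

```python
import statistics

def loop_based_metrics(widths, visited_states, max_states):
    """Compute loop-based MTD metrics from a recorded loop trace (sketch).

    widths: list of valid-next-state counts observed at each step
    visited_states: set of distinct states the loop has traversed (its depth)
    max_states: maximum number of possible states in practice
    """
    survivability = len(visited_states) / max_states  # Definition 1: relative depth
    movability_t = widths[-1]                         # Definition 3: width at the latest step
    width_std = statistics.pstdev(widths)             # Definition 4: higher std => lower stability
    return survivability, movability_t, width_std
```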

Aggregation of Evaluation Metrics by Analytic Hierarchy Process (AHP)
To achieve a comprehensive and systematic evaluation, we need to fully consider different types of information, each of which relates to a specific criterion/attribute. Hence, after identifying the criteria for evaluation, we further adopt a methodology to aggregate them and provide a comprehensive analysis. Many analytical tools have been proposed to address such decision-analysis problems, such as Multiple Attribute Utility Theory (MAUT) [27], Multiple-Attribute Decision Analysis (MADA) [28], Multiple Correspondence Analysis (MCA) [29], and the Analytic Hierarchy Process (AHP) [30], to name a few.
We adopt the multi-criteria evaluation methodology named the Analytic Hierarchy Process (AHP) for measuring the relative strength of MTD approaches. AHP [30, 31] plays an important role in many real-world decision situations in government, business, industry, healthcare, and education. It provides decision makers with a comprehensive and rational framework for structuring a decision problem, for representing and quantifying its elements, for relating those elements to overall goals and their understanding of the problem, and for evaluating alternative solutions (a running example on how to choose a company leader can be found in [32]).
AHP suits our problem setting very well for the following reason. For MTD approaches of different categories (e.g., software diversity, address space/data/instruction set randomization, N-version), the actual security and performance concerns vary a lot, because the type and size of the attack surface, the likelihood of successful attacks, and the cost of dynamically changing the attack surface in each category are very different from one another. As such, although the five metrics we propose are generic, their relative weights in the final evaluation vary across categories. Hence, to determine the best approach in each category, we first understand the network/system model, the security model, and the cost model for that category. We can leverage AHP to evaluate the alternative MTD approaches in each category against each criterion/metric, which measures how well a method accomplishes a particular criterion. Then we compare the alternative MTD approaches by generating a score for each alternative for ranking. Note that a very attractive feature of AHP is that, for comparison purposes, it helps generate relative scores for criteria which cannot be directly quantified with an absolute meaning.
General Principles of AHP. Specifically, using AHP we first construct a hierarchy, as shown in Figure 4. This hierarchy has three levels: the top one is our goal to find the best MTD strategy in a category, the second one includes the five criteria we proposed, and the bottom one includes the alternative MTD approaches to evaluate. Once the hierarchy is built, we can systematically evaluate its various elements by comparing them in a pairwise way, with respect to their impact on the element above them in the hierarchy.

Figure 4. An AHP hierarchy to choose the best MTD approach
For example, we start from the bottom level by comparing each pair of alternatives w.r.t. each of the five metrics above; in total we perform 3 × 5 = 15 comparisons (three pairs of alternatives for each of the five metrics). During the comparison, we calculate numerical weights (priorities) for each of the decision alternatives, and these numbers represent the alternatives' relative ability to achieve the decision goal. For each criterion (metric), the weights of all alternatives are then transferred to an AHP matrix to calculate the priority of each alternative. After evaluating the alternatives with respect to their strength in meeting the criteria, we then evaluate the criteria with respect to their importance in reaching the overall goal. Following a similar process, each criterion is given a weight w.r.t. the goal. Finally, with all the priorities of the criteria with respect to the goal, and the priorities of the alternatives with respect to the criteria, we can synthesize and calculate the priorities of the alternatives with respect to the goal. The one with the highest priority is the winner. (We will show a concrete example later in the case study of Section 5.)

A Detailed Procedure of AHP. Suppose that $m$ criteria are considered and $n$ alternatives are to be evaluated. In our case, $m = 5$ and $n = 3$. The AHP can be implemented in three consecutive steps [33, 34].
Step 1. Computing the vector of criteria weights: The AHP starts with creating a pairwise comparison matrix $A$. The matrix $A$ is an $m \times m$ matrix with real entries. Each entry $a_{jk}$ in matrix $A$ represents the importance of the $j$th criterion relative to the $k$th criterion. The entries $a_{jk}$ and $a_{kj}$ satisfy the constraint $a_{jk} \times a_{kj} = 1$. Obviously, $a_{jj} = 1$ for all $1 \leq j \leq m$. The relative importance between two criteria is measured according to a numerical scale from 1 to 9 [34].
Once the matrix $A$ is constructed, it is possible to derive from $A$ the normalized pairwise comparison matrix $A_{norm}$ by dividing each entry by the sum of the entries in its column, i.e., each entry $\bar{a}_{jk}$ of the matrix $A_{norm}$ is computed as

$$\bar{a}_{jk} = \frac{a_{jk}}{\sum_{l=1}^{m} a_{lk}}.$$

Finally, the criteria weight vector $w$ (an $m$-dimensional column vector) is built by averaging the entries on each row of $A_{norm}$, i.e.,

$$w_j = \frac{1}{m} \sum_{k=1}^{m} \bar{a}_{jk}.$$

Step 2. Computing the matrix of alternative scores: The matrix of alternative scores is an $n \times m$ matrix $S$ with real entries. Each entry $s_{ij}$ of $S$ represents the score of the $i$th alternative with respect to the $j$th criterion. In order to derive such scores, a pairwise comparison matrix $B^{(j)}$ is first built for each of the $m$ criteria. The matrix $B^{(j)}$ is an $n \times n$ matrix with real values. Each entry $b_{ih}$ of the matrix $B^{(j)}$ represents the evaluation of the $i$th alternative compared to the $h$th alternative with respect to the $j$th criterion. Similarly, the entries $b_{ih}$ and $b_{hi}$ satisfy the constraints $b_{ih} \times b_{hi} = 1$ and $b_{ii} = 1$ for all $i$. An evaluation scale [34] is used to translate the decision maker's pairwise evaluations into numbers.
Second, the AHP applies to each matrix $B^{(j)}$ the same two-step procedure described for the pairwise comparison matrix $A$: it divides each entry by the sum of the entries in the same column, and then it averages the entries on each row, thus obtaining the score vectors $s^{(j)}$, $j = 1, \dots, m$. The vector $s^{(j)}$ contains the scores of the evaluated alternatives with respect to the $j$th criterion.
Finally, the score matrix $S$ is obtained as $S = [s^{(1)} \cdots s^{(m)}]$, i.e., the $j$th column of $S$ corresponds to $s^{(j)}$.
Step 3. Ranking the alternatives: Once the weight vector $w$ and the score matrix $S$ have been computed, the AHP obtains a vector $v$ of global scores by multiplying $S$ and $w$, i.e., $v = S \times w$. The $i$th entry $v_i$ of $v$ represents the global score assigned by the AHP to the $i$th alternative.
AHP incorporates an effective technique for checking the consistency of the evaluations made by the decision maker when building the pairwise comparison matrices involved in the process, namely the matrix $A$ and the matrices $B^{(j)}$. The technique relies on the computation of a suitable consistency index $CI$. $CI$ is obtained by first computing the scalar $\lambda$ as the average of the elements of the vector whose $j$th element is the ratio of the $j$th element of the vector $A \times w$ to the corresponding element of the vector $w$. Then,

$$CI = \frac{\lambda - m}{m - 1}.$$

A perfectly consistent decision maker always obtains $CI = 0$, but small values of inconsistency can be tolerated.
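The three steps translate directly into code. The following minimal sketch (our own, using NumPy, not taken from the cited AHP references) implements the procedure above; the pairwise matrices are assumed to be filled in by the decision maker on the 1-to-9 scale.

```python
import numpy as np

def priority_vector(pairwise):
    """Shared sub-procedure for A and each B^(j): normalize columns, average rows."""
    pairwise = np.asarray(pairwise, dtype=float)
    norm = pairwise / pairwise.sum(axis=0)   # divide each entry by its column sum
    return norm.mean(axis=1)                 # average the entries on each row

def consistency_index(pairwise, w):
    """CI = (lambda - m) / (m - 1); equals 0 for a perfectly consistent matrix."""
    m = len(w)
    lam = np.mean((pairwise @ w) / w)
    return (lam - m) / (m - 1)

def ahp_rank(A, B_list):
    """A: m x m criteria matrix; B_list: one n x n matrix per criterion.
    Returns the vector v of global scores, one per alternative."""
    w = priority_vector(A)                                      # Step 1
    S = np.column_stack([priority_vector(B) for B in B_list])   # Step 2
    return S @ w                                                # Step 3: v = S x w
```

For our case study, calling `ahp_rank` with a 5 × 5 criteria matrix and five 3 × 3 alternative matrices returns three global scores; the alternative with the highest score wins.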

A Case Study on Software Diversification MTD
To demonstrate the applicability of our proposed evaluation framework, we next present a case study on a network-level approach: dynamic software diversity based MTD.

Software Diversification MTD
Software diversity [7][8][9], in the spirit of survivability through heterogeneity, has been one of the major MTD approaches. Specifically, the purpose of software diversity is to select and deploy a set of off-the-shelf software products on the hosts of a networked system, such that the number and types of vulnerabilities present on one host differ from those on its neighboring nodes. In this way, one is able to contain an automated worm attack in an isolated "island".
An illustrative example is shown in Figure 5. We use an undirected graph as the abstraction of a general networked system. Example networked systems include intranets, enterprise social networks, tactical mobile ad hoc networks, and wireless sensor networks of different network topologies. In this figure, there are 11 machines represented by nodes and 5 distinct pieces of vulnerable software represented by different colors. An attack can propagate by exploiting one type of vulnerability (color). From the figure, we can see that a successful attack exploiting the green color can compromise up to four machines (v2, v5, v7, v11), but it can only compromise one machine when it exploits the yellow color, as machines with the yellow color cannot communicate directly.

Quantifying Five Evaluation Metrics
Next, we discuss how to instantiate and quantify five general evaluation metrics.
Survivability. According to Figure 5, a defective edge between two nodes (which share the same color) indicates that the exploitation of one type of vulnerability on one host can lead to the compromise of the other. The size of a connected component indicates the number of compromised machines if the corresponding vulnerability is discovered and exploited by the attacker (e.g., via a worm attack). We call such a connected component a common vulnerability graph (CVG). If one can effectively limit the size of the largest CVG, system survivability can be improved; a better software assignment algorithm should produce an assignment solution with a smaller largest CVG. Formally, the survivability of a networked system can be computed as

$$\text{Survivability} = 1 - \frac{\max_{c_i} s_{\max}(c_i)}{N},$$

where $s_{\max}(c_i)$ denotes the size of the largest CVG that is formed by color $c_i$, and $N$ denotes the network size.
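As a sketch of how $s_{\max}(c_i)$ can be computed (our own illustration, not the paper's implementation): keep only the edges whose endpoints share a color, then take the largest connected component per color.

```python
from collections import defaultdict, deque

def largest_cvg_sizes(edges, color_of):
    """For each color, the size of the largest common vulnerability graph (CVG):
    the largest connected component among same-colored, directly linked nodes."""
    adj = defaultdict(list)
    for u, v in edges:
        if color_of[u] == color_of[v]:   # only same-color edges propagate an attack
            adj[u].append(v)
            adj[v].append(u)
    sizes = defaultdict(int)
    seen = set()
    for start in color_of:
        if start in seen:
            continue
        # BFS over the same-color component containing `start`.
        seen.add(start)
        component, queue = 0, deque([start])
        while queue:
            node = queue.popleft()
            component += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        color = color_of[start]
        sizes[color] = max(sizes[color], component)
    return sizes
```

With `sizes = largest_cvg_sizes(edges, color_of)` and `N = len(color_of)`, the survivability above is `1 - max(sizes.values()) / N`; an isolated yellow node in Figure 5 correctly yields a component of size one.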
Unpredictability. The goal of the attacker is to compromise as many nodes as possible; thus, the vulnerability (color) of the largest CVG is always the attacker's first choice to exploit. Unpredictability of a software assignment strategy in this case describes the difficulty for an attacker to determine the prevailing color (the one which forms the largest CVG) after shuffling. For example, given two software assignment algorithms, to compare their unpredictability we can observe the distribution of the colors of the largest CVGs across a number of software shufflings. If the prevailing colors are uniformly randomly distributed, it is hard for an attacker to learn the pattern and predict the next prevailing color. The attacker has to try out every color with about equal probability in order to compromise the network to the largest extent.
To formulate the unpredictability of software assignment in a quantitative way, we use entropy to measure the expected or average 'surprise' over all shufflings, reflecting the uncertainty about the prevailing color before it is determined. If the color of the largest CVG is $c$, let $p(c_i)$ be the probability that $c = c_i$. Given a set of colors $C$ and the probabilities of their occurrences, the unpredictability produced by the software diversity algorithm is quantified as:

$$H(C) = -\sum_{c_i \in C} p(c_i) \log p(c_i).$$

Movability. In order to survive long-lasting (persistent) attacks, software diversity mechanisms should further adopt the technique of software shuffling. By (periodically) re-allocating software on the machines, the attack surfaces of the systems continually change to confuse the attacker and thus delay the attack. In practice, however, to make the assignment solution generated after each shuffling acceptable, the software assignment algorithm needs to take a number of realistic constraints into account. A software assignment algorithm may be able to fully or only partially accommodate the practical constraints that arise from host and software requirements. Here, a host constraint means that certain hosts should be installed with some specific types of software to perform required functionality (e.g., to deploy a database server it is required to assign DB2). A software constraint means certain combinations of software should (or should not) be assigned to specified hosts simultaneously (e.g., PHP, Apache, MySQL, and Linux need to be assigned together to implement LAMP on a single node). Besides, in practice the constraints are not equally important. Some of the constraints are critical and thus cannot be violated, as in the case of LAMP: a lack of any one of these four components would cause a service failure on the web server. On the other hand, some constraints are less critical and thus can be relaxed to some extent without impacting the essential functionality of the machine. For example, suppose there is a constraint that a PC should not install a certain program (for lack of understanding of its security). If a software assignment algorithm has to assign the program to this machine for the greater security of the network, one may, for example, relax this constraint by launching the program through a browser-based SaaS cloud service.
We assume that a software assignment algorithm unable to satisfy the practical constraints is less appropriate than an algorithm that accommodates them well. Thus we propose to use a penalty score to quantitatively determine the functionality loss caused by violations of the given practical constraints. Specifically, we initially assign a penalty score to each constraint reflecting its significance, and the penalty is only applied when that particular constraint is violated. For instance, $penalty(x_i)$ denotes the penalty score of violating constraint $x_i$. For the most critical constraints that cannot be violated, we assign an infinite value. In this way, given a set of constraints $CSTR = \{x_1, x_2, \dots, x_m\}$, the total penalty score for the system is given as:

$$\text{Penalty} = \sum_{i=1}^{m} \delta(x_i) \cdot penalty(x_i),$$

where $\delta(x_i) \in \{0, 1\}$ specifies whether constraint $x_i$ is violated.
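A minimal sketch of this scoring (our own, with `float('inf')` standing in for the infinite penalty on hard constraints):

```python
def total_penalty(constraints, assignment):
    """Sum the penalty scores of all violated constraints (sketch).

    constraints: list of (is_violated, penalty) pairs, where is_violated
    is a predicate over a software assignment, and penalty may be
    float('inf') for critical constraints that must never be violated.
    """
    return sum(penalty for is_violated, penalty in constraints
               if is_violated(assignment))
```

For example, the LAMP constraint above could be expressed as `(lambda a: not {'PHP', 'Apache', 'MySQL', 'Linux'} <= a['web-server'], float('inf'))`, assuming `a` maps each host to its set of assigned software.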
Stability. The stability of a software assignment algorithm measures the variation of the quality of the assignment solutions, in terms of the variation of the largest CVG sizes generated by shuffling. In this case, we measure stability by variability. Note that for the unpredictability of shuffling, ideally the shuffling process should be stateless, i.e., each shuffling should be independent of the history. The standard deviation can be used to quantify the average difference between assignment solutions, independent of the temporal order in which each assignment solution is generated. The less the variation of the largest CVG size, the more stable a given software assignment algorithm is. Let $s_{\max}^{(k)}$ be the largest CVG size of the $k$th shuffling, $E(S)$ be the mean size, and $N$ be the number of shufflings; the stability of the software assignment algorithm is quantified as:

$$\sigma = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left(s_{\max}^{(k)} - E(S)\right)^2}.$$

Usability. The combination of software installed on a machine is prone to variation due to shuffling. For example, Microsoft Office may be substituted by Open Office, and Firefox may be substituted by Internet Explorer or Google Chrome. Among all these options, some software products are easier to operate than others. Besides, users tend to choose software that they are familiar with or have a certain preference for. A software assignment algorithm might cause inconvenience to users in accomplishing their tasks (e.g., when a secretary's computer cannot install Microsoft Word despite the fact that she has been using it for years), and such an assignment is very likely not comfortable for some users. Thus, for software diversity algorithms, usability is defined to reflect the user's experience regarding the assigned software, e.g., familiarity, comfort, satisfaction, and ease of use from a user's point of view. A good algorithm should be able to take users' concerns as one input and maximize overall usability.
We use the acceptance rate (a real value from 0 to 1) to measure a user's satisfaction level, reflecting their attitude toward shifting from a particular software product to another. We first categorize software products based on their functionalities. For example, Linux, Snow Leopard, and Windows are in the operating system category, while Firefox, Chrome, IE, and Opera are in the web browser category. A software product may be replaced by another one in the same category. A software substitution with a high acceptance rate indicates that users are satisfied or at least have little trouble with the assignment. A low acceptance rate, on the other hand, indicates that a user finds it inconvenient (or is even prevented from doing his job) to switch to the new software.
There are two methods to measure the usability of a software shuffling. The first asks users to assign an acceptance rate to every pairwise software substitution in each category, based on their experience and attitude. Given these ratings, one can automatically compute the overall acceptance rate for each assignment (compared with the previous assignment) and check whether it is optimal. The other way is to conduct a survey after each shuffling. Users are asked to provide feedback/scores indicating their willingness to accept or reject the assignment of software on their systems. By adding up the scores from users, the final score is then used as the overall usability of a software assignment. Generally speaking, the first approach is preferable as it conducts the user survey only once. A good algorithm may take individual user acceptance rates into consideration when running.
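A sketch of the first method (our own illustration; the pairwise rating table is assumed to be collected from users once, up front):

```python
def assignment_usability(prev, new, acceptance):
    """Average user acceptance rate of a new assignment vs. the previous one.

    prev, new: dicts mapping host -> installed software product
    acceptance: dict mapping (old, new) product pairs within a category
    to a rate in [0, 1]; keeping the same product counts as rate 1.0.
    """
    rates = [1.0 if prev[h] == new[h] else acceptance[(prev[h], new[h])]
             for h in prev]
    return sum(rates) / len(rates)
```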

Evaluation Results
We first evaluate three software diversification algorithms in terms of our general evaluation metrics, including survivability, unpredictability, movability, and stability (usability is subject to user survey results), which builds a foundation for our AHP procedure. Then, we use the AHP procedure to produce the best alternative among the three algorithms for this category. The three software diversification algorithms under our consideration are:
• Algorithm I: a basic software diversity algorithm [35];
• Algorithm II: an adjusted software assignment algorithm [36];
• Algorithm III: an Ant Colony Optimization (ACO) based software assignment algorithm [18].
Figure 6 shows an example comparing the three software assignment algorithms in terms of survivability. Here the x-axis is the ratio #weight/#color, where #weight is the number of vulnerable software products assigned to a single machine and #color is the total number of vulnerable software products installed in the whole network system. This ratio implicitly reflects the likelihood of nodes sharing the same colors. The y-axis is the size of the largest CVG. As we can see, Algorithm I outperforms the other two algorithms by creating a smaller $s_{max}$ under the same situation, and Algorithm III is better than Algorithm II. Thus it is straightforward to rank these algorithms as Algorithm I > Algorithm III > Algorithm II in terms of survivability. Note that in practice it is not necessarily the case that one algorithm outperforms the others all the time. (These three algorithms will be used for illustration purposes throughout this paper.)

Figure 6. Survivability of different algorithms
To gain intuition for unpredictability, we show below an illustrative example of the distribution of the prevailing color (of the largest CVG). As observed in Figure 7, in Algorithm III, $s_{max}(c_i)$ is formed by any color $c_i$ among all the available colors with approximately equal probability, which indicates the random nature of the assignments generated by Algorithm III. According to Shannon theory, when $p(c_i) = p(c_j)$ for any $i \neq j$, unpredictability (entropy) reaches its maximum. As for Algorithm I and Algorithm II, they are more predictable to the attacker than Algorithm III. For example, in the shuffling outcome of Algorithm I, color 1 is more likely to cause the largest CVGs, and is hence more preferable for the attacker to exploit.

Figure 8 is an example that plots the moving penalties for the three algorithms. Each data point in this figure is obtained by calculating the penalty score for a corresponding software assignment. It is observed that, in general, the penalties resulting from Algorithm II are the highest, indicating that the performance of Algorithm II is largely restricted by practical constraints (lowest movability). Algorithm III cannot accommodate constraints well either. Algorithm I has the lowest penalty scores, so it offers better software assignment strategies than Algorithm II and Algorithm III in terms of movability.

Figure 8. Movability of different algorithms
To fully evaluate the stability of an algorithm, we need to try different network settings. In this way, one can see the ability of the assignment algorithm to accommodate mutable environments (e.g., different types of network topologies such as scale-free, random, or regular graphs). Again we use the three algorithms as an example to explain the concept of stability. Suppose we run each of the three algorithms twenty times while changing network topologies and use all generated assignment solutions. In Figure 9, we observe that the variation range of $s_{max}(c_i)$ for Algorithm II is much smaller, which means its standard deviation is lower than that of the other two. Hence, it produces more stable results, even though the size of $s_{max}(c_i)$ for Algorithm II is larger than for Algorithm I (that is, Algorithm II is inferior to Algorithm I in terms of survivability).

Figure 9. Stability of different algorithms
Table 1 is a simple illustrative example to compare and rank the three available software diversity algorithms. We consider the five proposed metrics as the evaluation criteria for ranking (if more metrics need to be considered, this example can be expanded accordingly). We are interested in comparing the relative strength of the alternative software assignment algorithms and determining the best one in terms of the proposed evaluation metrics. First, we need a judgment matrix for determining the weights of the 5 metrics according to their importance. In comparing the 5 metrics, for illustration purposes, we assume they are ordered as Unpredictability > Movability > Survivability > Stability > Usability based on their importance. The weight assigned to each metric can then be determined by adopting a scale referred to as the 9-point scale of measurement [30]. The 5 × 5 matrix containing all of the pairwise comparisons for the metrics is shown in Table 1.
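For illustration, a judgment matrix consistent with this ordering could look like the following (hypothetical 9-point-scale values of our own choosing, not the actual entries of Table 1):

```latex
% Criteria order: Unpredictability, Movability, Survivability, Stability, Usability
A = \begin{pmatrix}
1   & 2   & 3   & 5   & 7   \\
1/2 & 1   & 2   & 3   & 5   \\
1/3 & 1/2 & 1   & 2   & 3   \\
1/5 & 1/3 & 1/2 & 1   & 2   \\
1/7 & 1/5 & 1/3 & 1/2 & 1
\end{pmatrix}, \qquad a_{jk} \cdot a_{kj} = 1 .
```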
The matrices for the metrics of the MTD alternatives are omitted here because the value for each metric is mostly given in the previous table. The final decision matrix for this problem can then be generated, and the final scores are shown in Table 2.
Finally, the ranking for these three software diversity algorithms is: Algorithm III > Algorithm I > Algorithm II, as shown in Figure 10.

Discussions
In this section, we discuss i) how to apply our generic evaluation framework to other MTD categories, such as runtime-based diversification and network-based diversification; and ii) how to apply our generic evaluation at different levels, such as the system level, platform level, and network level.

Evaluations on Other MTD Categories
In Section 2 we discussed four different MTD categories, and in Section 5 we presented a detailed case study on a specific MTD category called software diversification. In this section, we discuss how to apply our generic evaluation framework to the other three MTD categories. Since our five evaluation metrics are general, they can be applied to all three; only the ways we instantiate or quantify them may differ per category. The idea of the Analytic Hierarchy Process (AHP) in the generic evaluation framework remains the same, although the five general evaluation metrics may carry different weights for different categories, and the alternative algorithms in each MTD category will be different. Therefore, the flowchart of our generic evaluation framework stays the same for all MTD category evaluations and comparisons, and our generic framework works for all MTD categories.

Applying the Proposed Generic Evaluation Framework in Different Levels
Our generic evaluation framework can also be applied at different levels, including the system level, platform level, and network level. Software diversification is a network-level solution, so we have already seen the instantiation of our generic framework at the network level. For the system level, we consider a specific operating system on a machine, and all the general evaluation metrics are defined in this domain. Likewise, for the platform level, we consider a single machine with potentially multiple operating systems, and our general evaluation metrics need to be defined for this scope. Other than this, the AHP procedure is similar. Thus, our generic evaluation framework applies to the system level and platform level, too.

Conclusion and Future Work
In this paper, we carefully chose five general metrics for MTD evaluation and comparison: survivability, unpredictability, movability, stability, and usability. We also aggregated these five evaluation metrics by proposing, for the first time, a generic evaluation framework based on the Analytic Hierarchy Process (AHP). We presented a detailed case study on a specific MTD category named software diversification, with numerical evaluation results that validate the effectiveness of our generic evaluation framework. Our evaluation framework can be easily ported and applied to other MTD categories (such as runtime-based diversification and network-based diversification) and at different levels; we discussed how to do so.
Our future work includes applying our generic evaluation framework to the other three MTD categories: runtime-based diversification, network-based diversification, and dynamic platform techniques. We will study each in detail by instantiating our general evaluation metrics and the AHP procedure. We believe that our generic MTD evaluation framework will be effective and efficient for all MTD categories at different scopes/levels.

Figure 1. A uniform MTD theory model

Figure 2. An overview of the generic evaluation framework

Figure 3. The MTD self-evolving loop with evaluation metrics

Figure 5. Network topology utilizing a diverse software distribution. Dashed or solid lines mean two nodes can communicate directly (e.g., through TCP/IP or being friends in a social network). Solid lines further indicate that two nodes share at least one common color.

Figure 7. Unpredictability of different algorithms. Here a total of 20 colors (vulnerable software) are assigned in the networked system, and the x-axis is the numerical label for each color.

Figure 10. Evaluation result

Table 2. Final scores for every algorithm