Characterizing and Leveraging Granger Causality in Cybersecurity: Framework and Case Study

Causality is an intriguing concept that, once tamed, can have many applications. While it has been widely investigated in other domains, its relevance and usefulness in the cybersecurity domain have received little attention. In this paper, we present a systematic investigation of a particular approach to causality, known as Granger causality (G-causality), in cybersecurity. We propose a framework, dubbed Cybersecurity Granger Causality (CGC), for characterizing the presence of G-causality in cyber attack rate time series and for leveraging G-causality to predict (i.e., forecast) cyber attack rates. The framework offers a range of research questions, which can be adopted or adapted to study G-causality in other kinds of cybersecurity time series data. In order to demonstrate the usefulness of CGC, we present a case study by applying it to a particular cyber attack dataset collected at a honeypot. From this case study, we draw a number of insights into the usefulness and limitations of G-causality in the cybersecurity domain.


Introduction
Cyber attacks have become a major threat to modern society in many aspects, such as critical infrastructure, the economy, and citizen privacy. According to a 2019 report by Symantec [1], a compromised credit card can be sold/purchased for up to US$45 in the underground market, whereas compromised websites can be sold/purchased for up to US$2.2 million each month. According to a 2019 report by ForgeRock [2], 2.8 billion consumer data records were breached in 2018, costing U.S. organizations more than US$654 billion; the report also states that in the first quarter of 2019, cyber attacks against the U.S. financial services sector cost more than US$6.2 billion. These huge damages call for studies to understand and characterize cyber attacks from various perspectives and at various levels of abstraction.
Most studies on cyber attacks focus on microscopic levels of abstraction (e.g., how to defend against a particular attack). These studies are absolutely important because they provide the necessary building-block solutions. However, understanding and characterizing cyber attacks from macroscopic levels of abstraction is equally important but much less investigated. Such macroscopic-level studies are important because they would offer insights towards holistic solutions to defending against cyber attacks.
One particular kind of macroscopic study is to forecast (i.e., predict) cyber attacks at macroscopic levels, so as to achieve what may be called predictive situational awareness. There have been a number of studies on both univariate time series analysis in the cybersecurity domain (e.g., [3][4][5][6][7][8][9][10][11][12][13][14]) and multivariate time series analysis in the cybersecurity domain (e.g., [7,[15][16][17]]). The present study belongs to this category, but initiates a new perspective of research. Specifically, we investigate the usefulness of causality in cybersecurity. Since the notion of causality is elusive, we focus on a particular approach known as Granger Causality (G-causality) [18], which can be understood as follows: if one time series can be leveraged to help predict another time series more accurately than using the historic data of the latter alone, then the former is said to G-cause the latter. We call the former time series a "helper" because it can be used to help predict the latter more accurately.
Our Contributions. This paper makes two contributions. First, we initiate the investigation on the usefulness and limitations of G-causality in cyber attack rate time series. We propose a framework, dubbed Cybersecurity Granger-Causality (CGC) with a range of intuitive research questions, which can be adopted or adapted to characterize and leverage G-causality in other kinds of cybersecurity time series data. In order to formalize research problems, we propose using a graph-theoretic representation of G-causality between time series. In particular, CGC aims to help achieve predictive cybersecurity situational awareness and therefore possibly proactive defense (e.g., allocating more defense resources when predicting that there will be more incoming attacks). To achieve this, we recognize that one key research issue is to select an appropriate number of good helpers at a proper network resolution (e.g., /8 vs. /24), where a helper is, as mentioned above, a time series that G-causes the time series in question. This research issue goes beyond the original G-causality framework [18]. We systematically address this issue by considering multiple factors and models. To the best of our knowledge, this is the first study in characterizing the usefulness and limitations of G-causality in predictive cybersecurity situational awareness.
Second, in order to demonstrate the usefulness of the framework, we conduct a case study by applying it to a dataset collected by a low-interaction honeypot. This case study enables us to draw a number of insights, such as the following. (i) For measuring cyber attack situational awareness, network resolution matters and using a higher resolution (e.g., /24) would be better than using a lower resolution (e.g., /16 or /8). (ii) Cybersecurity posture at the /16 and /24 network resolutions do change over a period of time, albeit slowly. (iii) G-causality is widely exhibited by cyber attack rate time series at multiple network resolutions, hinting that cyber attacks are not random. (iv) Bidirectional G-causality is widely exhibited at multiple network resolutions, suggesting that G-causality does not really capture the intuitive notion of causality, which should be unidirectional.
(v) G-causality is widely exhibited across network resolutions; this represents an aspect that also goes beyond the original G-causality framework [18]. (vi) Leveraging bidirectional G-causality leads to higher prediction accuracy than leveraging unidirectional G-causality, especially when the time series in question are dense or correspond to low-resolution networks. This suggests that G-causality is useful even though it does not really capture the intuitive notion of unidirectional causality. (vii) When leveraging G-causality to predict time series, using an excessive number of helpers can decrease prediction accuracy. This highlights the importance of selecting an appropriate number of helpers. (viii) When time series are dense, a smaller p-value incurred in the G-causality test, which hints at a stronger degree of G-causality, would lead to more accurate predictions.
Paper Outline. Section 2 reviews preliminary knowledge on the Auto-Regressive model and the notion of G-causality. Section 3 presents the CGC framework. Section 4 describes the case study. Section 5 discusses the limitations of the present study. Section 6 concludes the paper. Table 1 summarizes the main notations that are used throughout the paper.

The Auto-Regressive (AR) Model
Table 1. Main notations used throughout the paper:
• X_i(T_C): the time series representing the number of attacks waged from network i up to time T_C
• X′(T_C): the subset of stationary time series of X(T_C)
• X″(T_C): the subset of stationary time series of X(T_C) that are also associated with G-causality
• x̂_{i,t}, ŷ_{i,t}: predictions of x_{i,t} and y_{i,t}, respectively
• ℓ: the lag value (i.e., time steps used in a prediction model)
• G^[r](T_C): G-causality graph G(T_C) at network resolution r; e.g., r ∈ {/8, /16, /24}

AR is a widely used statistical model, which leverages temporal correlations of a time series to predict its future values [46]. AR uses linear regression to predict future values as a function of ℓ past observation values (indicating how far one looks back), where ℓ is the order (or lag) of the AR model. Formally, the AR(ℓ) model for time series X_i = (x_{i,t})_{t=1,2,...} is:

x_{i,t} = β_0 + Σ_{k=1}^{ℓ} β_k x_{i,t−k} + ξ_{i,t},

where β_0, . . . , β_ℓ are coefficients and ξ_{i,t} is a white-noise random variable (i.e., an independent and identically distributed normal random variable with mean 0). In this paper, x_{i,t} is the number of attacks that are waged from network i at time t.
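As an illustration, the AR(ℓ) fit above can be sketched with ordinary least squares; the helper names (`fit_ar`, `predict_next`) and the noiseless toy series are ours, not the paper's:

```python
import numpy as np

def fit_ar(x, lag):
    """Fit an AR(lag) model x_t = b0 + sum_k b_k * x_{t-k} + noise
    by ordinary least squares. Returns [b0, b1, ..., b_lag]."""
    x = np.asarray(x, dtype=float)
    # Each design row holds 1 (intercept) and the `lag` most recent past values.
    rows = [np.r_[1.0, x[t - lag:t][::-1]] for t in range(lag, len(x))]
    X, y = np.array(rows), x[lag:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict_next(x, beta):
    """One-step-ahead forecast using the fitted AR coefficients."""
    lag = len(beta) - 1
    past = np.asarray(x, dtype=float)[-lag:][::-1]
    return beta[0] + past @ beta[1:]

# Toy attack-rate series following a noiseless AR(1) law x_t = 2 + 0.5 * x_{t-1},
# so OLS should recover beta = [2, 0.5] (up to floating-point error).
series = [10.0]
for _ in range(20):
    series.append(2 + 0.5 * series[-1])

beta = fit_ar(series, lag=1)
print(beta, predict_next(series, beta))
```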

G-causality
The notion of G-causality is named after its inventor, Clive Granger, and aims to capture causal relations between time series [18]. It was introduced to predict time series in the economics domain and was later adapted to other domains [47][48][49][50]. It is defined for stationary time series, whose statistical properties (e.g., mean, variance, co-variance) do not change with time (cf., e.g., [3,5,51]).
In practice, stationarity may be tested based on the first and second moments, sometimes known as wide-sense stationarity. There are many methods for testing whether a time series is stationary or not (e.g., Phillips-Perron [52] and Augmented Dickey-Fuller [53]). As mentioned above, in this paper X_i = (x_{i,t})_{t=1,2,...} represents a cybersecurity time series, such as a cyber attack rate time series [3]. Intuitively, X_i is said to Granger-cause or G-cause X_j, where i ≠ j, if the past observation values of X_i contain some information that can be leveraged to predict future values of X_j more accurately than predicting X_j by only leveraging its own past observation values [18]. Similar to the lag in the AR model, the number of past observation values of X_i that are leveraged to predict future values of X_j is also called lag and denoted by ℓ. Since a large ℓ may cause over-fitting and a small ℓ may cause auto-correlation errors [54], it is important to select an appropriate ℓ via some criterion, such as the Bayesian Information Criterion (BIC) [55] or the Akaike Information Criterion (AIC) [56], meaning that the optimal ℓ is the one that minimizes the AIC or BIC function.
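A minimal sketch of criterion-based lag selection for an AR model; the BIC variant used here is the common Gaussian-error form n·log(RSS/n) + k·log(n), and the function names are ours:

```python
import numpy as np

def ar_bic(x, lag):
    """BIC of an OLS-fitted AR(lag) model: n*log(RSS/n) + k*log(n),
    a common formulation under Gaussian errors."""
    x = np.asarray(x, dtype=float)
    rows = [np.r_[1.0, x[t - lag:t][::-1]] for t in range(lag, len(x))]
    X, y = np.array(rows), x[lag:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = len(y), lag + 1
    rss = float(resid @ resid)
    return n * np.log(rss / n + 1e-12) + k * np.log(n)

rng = np.random.default_rng(0)
# Simulate an AR(2) process; BIC should prefer a small lag near 2
# over a needlessly large lag that only adds parameters.
x = np.zeros(400)
for t in range(2, 400):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

best = min(range(1, 8), key=lambda L: ar_bic(x, L))
print(best)
```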
Formally, G-causality is defined using the linear Vector Auto-Regressive (VAR) model over multivariate time series. In order to highlight the idea, let us consider the example of the bi-variate VAR model, while noting that the idea is equally applicable to other multivariate time series. The bi-variate VAR model (or 2VAR) involves two time series X_i and X_j with lag ℓ and is described as:

x_{i,t} = Σ_{k=1}^{ℓ} A_{X_i,X_i,k} x_{i,t−k} + Σ_{k=1}^{ℓ} A_{X_i,X_j,k} x_{j,t−k} + err_{i,t}
x_{j,t} = Σ_{k=1}^{ℓ} A_{X_j,X_i,k} x_{i,t−k} + Σ_{k=1}^{ℓ} A_{X_j,X_j,k} x_{j,t−k} + err_{j,t}    (3)

where the A_{*,*,k}'s are regression coefficients and err_{i,t} and err_{j,t} are white-noise errors (i.e., independent and identically distributed normal random variables with mean 0).
The VAR model is used to test G-causality between X_i and X_j as follows. The null hypothesis is that X_i does not G-cause X_j, namely that A_{X_j,X_i,k} = 0 for 1 ≤ k ≤ ℓ, or that X_i has no impact on predicting X_j [18]. To test this, one may use the F-statistic hypothesis test [57]. Time series X_i is said to G-cause X_j if the null hypothesis is rejected, meaning that the p-value in the F-statistic test is less than 0.05, which is a widely-used significance level. The same method is used to test whether X_j G-causes X_i or not. G-causality is not necessarily symmetric, meaning that X_i G-causing X_j does not necessarily mean X_j G-causes X_i.
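The restricted-vs-unrestricted F-test behind G-causality can be sketched as follows. This is our own illustration; in practice one would convert the F-statistic into a p-value via the F-distribution (e.g., scipy.stats.f.sf), and libraries such as statsmodels provide ready-made Granger tests:

```python
import numpy as np

def rss_ols(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def granger_f(x_cause, x_effect, lag):
    """F-statistic of the null 'x_cause does not G-cause x_effect':
    compare the restricted model (x_effect regressed on its own past)
    with the unrestricted model (x_effect regressed on both pasts)."""
    xc = np.asarray(x_cause, float)
    xe = np.asarray(x_effect, float)
    rows_r, rows_u, y = [], [], []
    for t in range(lag, len(xe)):
        own, other = xe[t - lag:t][::-1], xc[t - lag:t][::-1]
        rows_r.append(np.r_[1.0, own])          # restricted design row
        rows_u.append(np.r_[1.0, own, other])   # unrestricted design row
        y.append(xe[t])
    y = np.array(y)
    rss_restricted = rss_ols(np.array(rows_r), y)
    rss_full = rss_ols(np.array(rows_u), y)
    n, k_full = len(y), 1 + 2 * lag
    return ((rss_restricted - rss_full) / lag) / (rss_full / (n - k_full))

rng = np.random.default_rng(1)
cause = rng.normal(size=300)
effect = np.zeros(300)
for t in range(1, 300):
    # `effect` depends on the past of `cause`, so `cause` G-causes `effect`.
    effect[t] = 0.8 * cause[t - 1] + 0.2 * rng.normal()

f_fwd = granger_f(cause, effect, lag=1)   # expected to be large
f_rev = granger_f(effect, cause, lag=1)   # expected to be near 1 under the null
print(f_fwd, f_rev)
```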

Prediction Accuracy Metric
In order to evaluate prediction accuracy, we propose adopting the standard metric known as the Symmetric Mean Absolute Percentage Error (SMAPE) [58]. Let (x_{i,t}, . . . , x_{i,t+δ}) be the observation values of a time series and (x̂_{i,t}, . . . , x̂_{i,t+δ}) be their respective prediction values, where t is the time at which prediction starts. Then,

SMAPE = (1/(δ+1)) Σ_{k=0}^{δ} |x̂_{i,t+k} − x_{i,t+k}| / ((|x_{i,t+k}| + |x̂_{i,t+k}|)/2),

with the convention that a term is 0 when x_{i,t+k} = x̂_{i,t+k} = 0. This metric is chosen because of its robustness in accommodating x_{i,t} = 0, which is often encountered in cybersecurity.
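A direct implementation of SMAPE under the zero-handling convention stated above:

```python
def smape(actual, predicted):
    """Symmetric Mean Absolute Percentage Error.
    Convention: a term with actual == predicted == 0 contributes 0,
    which is what makes SMAPE usable on sparse attack-rate series."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0:
            total += abs(p - a) / denom
    return total / len(actual)

# Third term: |2 - 4| / ((4 + 2) / 2) = 2/3; averaged over 3 terms -> 2/9.
print(smape([2, 0, 4], [2, 0, 2]))  # 2/9 ≈ 0.222
```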

The Cybersecurity Granger-Causality (CGC) Framework
The CGC framework aims to characterize the presence and utility of G-causality in the context of cybersecurity time series, as illustrated by cyber attack rate time series. The framework is designed with the mindset that it can be adopted or adapted to study G-causality in other kinds of cybersecurity time series data of a similar nature. As highlighted in Figure 1, CGC has 4 modules. As elaborated below, these modules are associated with a unique set of Research Questions (RQs).

(Figure 1 depicts the four modules of CGC: Data Pre-processing, Basic Statistical Analysis, G-causality Analysis, and Leveraging G-causality.)

Data Pre-processing
The input to the framework is some cybersecurity data, such as the cyber attacks observed by cyber defense instruments (e.g., honeypots [59,60] or network telescopes [4]) over a period of time. In order to represent the data as time series, we propose considering a discrete time horizon t = 1, 2, . . . , T at some time resolution (e.g., hour or day). The dataset contains the attacking IP addresses that wage attacks against some victims. Depending on the semantic richness of the dataset, the attacks may be further divided into, for example, different types (e.g., denial-of-service or not). The basic CGC framework focuses on coping with cyber attack rates, while leaving the treatment of richer information to its extensions (partly because we have no such semantically rich datasets). We propose grouping the attacking IP addresses into networks at some resolution, such as /8, /16 or /24 networks. Recall that a /8, /16, and /24 network consists of 2^24, 2^16, and 2^8 IPv4 addresses, respectively. Let n denote the number of networks at the resolution in question (e.g., n = 2^8 at the /8 network resolution). For network i (1 ≤ i ≤ n) at a resolution, an appropriate pre-processing step is often needed to derive a time series X_i(T_C) = (x_{i,t})_{t=1,...,T_C}, where T_C is the current time and x_{i,t} is the number of attacks that are waged from network i at time t.
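The grouping step can be sketched with Python's standard ipaddress module; the event format and the function names are our assumptions, not the paper's:

```python
from collections import defaultdict
import ipaddress

def network_key(ip, prefix):
    """Map an IPv4 address to its containing /prefix network (e.g., /8, /16, /24)."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def attack_rate_series(events, prefix, horizon):
    """events: list of (day, attacker_ip) pairs, with day in 1..horizon.
    Returns {network: [x_1, ..., x_horizon]} where x_t is the number of
    attacks waged from that network on day t."""
    series = defaultdict(lambda: [0] * horizon)
    for day, ip in events:
        series[network_key(ip, prefix)][day - 1] += 1
    return dict(series)

events = [(1, "10.1.2.3"), (1, "10.1.2.99"), (2, "10.9.0.1"), (2, "192.0.2.7")]
s24 = attack_rate_series(events, 24, horizon=3)
s8 = attack_rate_series(events, 8, horizon=3)
print(s24)
print(s8)
```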

Basic Statistical Analysis
Given the pre-processed set of n time series at a network resolution, denoted by X(T_C) = {X_1(T_C), . . . , X_n(T_C)}, we propose conducting some basic analyses to deepen the understanding of the dataset. We propose associating this module with the following Research Questions (RQs):
• RQ1: What is the overall cyber attack situational awareness?
• RQ2: What is the evolution of the situational awareness?
• RQ3: What is the intensity of attacks (e.g., the number of attacks per time interval, the sparsity of the time series)?
• RQ4: What are the characteristics of the time series?
The preceding basic statistical analysis is both necessary and important. This is because, for model-fitting purposes, when a time series is sparse (i.e., containing many zeros in its observation values), it should be eliminated from further analysis because state-of-the-art statistical techniques cannot cope with such sparse time series data. (Nevertheless, we note that innovative methods are emerging to deal with such sparse time series [16], which is orthogonal to the purpose of the present study.) For the time series that are not sparse, we analyze their basic statistics (e.g., mean, median, max, and variance). In order to see how the basic statistical analysis can deepen our understanding of the data in question and guide us in modeling the data, we mention the following. If a time series has a mean value that is much smaller than its variance, then the time series cannot be modelled by the Poisson process but should be fitted with another appropriate model. If several time series exhibit a similar pattern (e.g., all increasing, all decreasing, or changing simultaneously), then they may be correlated with each other.
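The mean-vs-variance (overdispersion) check described above can be sketched as:

```python
def mean_and_variance(xs):
    n = len(xs)
    m = sum(xs) / n
    # Population variance; sufficient for a quick dispersion check.
    v = sum((x - m) ** 2 for x in xs) / n
    return m, v

def overdispersed(xs):
    """A Poisson model implies mean == variance; variance >> mean
    (overdispersion) rules the Poisson process out for the series."""
    m, v = mean_and_variance(xs)
    return v > m

# A bursty attack-rate series: long quiet stretches with occasional spikes.
print(overdispersed([0, 0, 0, 50, 0, 0, 60, 0]))   # True (variance >> mean)
print(overdispersed([5, 5, 5, 5]))                 # False (variance is 0)
```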

G-causality Analysis
Given a set of pre-processed time series at a network resolution (with sparse ones eliminated), namely X(T_C) = {X_1(T_C), . . . , X_n(T_C)}, this module proceeds as follows. First, test the stationarity of each X_i(T_C) ∈ X(T_C), because G-causality is defined over stationary time series. For this purpose, there are many methods (e.g., Phillips-Perron [52] or Augmented Dickey-Fuller [53]). Second, test the G-causality for every pair of stationary time series (X_i(T_C), X_j(T_C)) in X(T_C) with i ≠ j, while recalling that X_i(T_C) G-causes X_j(T_C) if the null hypothesis that X_i(T_C) does not G-cause X_j(T_C) is rejected in the F-statistic test. Let p_{i,j}(T_C) denote the p-value in the F-statistic test. Then, we only consider the time series with associated p-values that are smaller than 0.05, because such p-values indicate that the null hypothesis is rejected. Now we propose the notion of a G-causality graph, which is a simple, directed, weighted graph representation of the G-causality relations between the time series. A G-causality graph is denoted by G(T_C) = (X(T_C), E(T_C), W(T_C)), where X(T_C) = {X_1(T_C), . . . , X_n(T_C)} is the vertex or node set that corresponds to the set of networks and represents their respective cyber attack rate time series, an arc (X_i(T_C), X_j(T_C)) ∈ E(T_C) indicates that X_i(T_C) G-causes X_j(T_C), and W(T_C) = {p_{i,j}(T_C) : (X_i(T_C), X_j(T_C)) ∈ E(T_C)} is the set of arc weights given by the p-values. The set N(X_j(T_C)) = {X_i(T_C) : (X_i(T_C), X_j(T_C)) ∈ E(T_C)} represents the set of neighbor nodes that G-cause X_j(T_C), and the in-degree of a node X_j(T_C) is deg(X_j(T_C), G(T_C)) = |N(X_j(T_C))|. Note that an isolated vertex or node means that (i) the corresponding time series is not stationary or (ii) it has no G-causality relation to any other node. We propose associating this module with the following RQs:
• RQ5: What are the characteristics of G-causality at a single network resolution?
• RQ6: Is the G-causality relation unidirectional or bidirectional at a single network resolution?
• RQ7: Is the G-causality relation exhibited between network resolutions?
In order to simplify notations, we may omit mentioning T_C when discussing general concepts that are applicable to any T_C, such as the in-degree of node X_j in graph G, or when T_C is clear from the context. This leads to G = (X, E, W ) and simplifies the notation deg(X_j(T_C), G(T_C)) as deg(X_j, G). We may further use X′ ⊆ X to denote the set of nodes corresponding to stationary time series and use X″ ⊆ X′ to denote the set of nodes associated with a G-causality relation.
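Building the G-causality graph from pairwise p-values can be sketched as follows; plain dictionaries stand in for a graph library, and the example p-values are hypothetical:

```python
def g_causality_graph(p_values, alpha=0.05):
    """p_values: {(i, j): p} for the F-test of 'series i G-causes series j'.
    Returns the arc set E = {(i, j): p_{i,j}} with p < alpha, the neighbor
    sets N(j) = {i : (i, j) in E}, and the in-degrees deg(j) = |N(j)|."""
    E = {(i, j): p for (i, j), p in p_values.items() if p < alpha}
    neighbors = {}
    for (i, j) in E:
        neighbors.setdefault(j, set()).add(i)
    in_degree = {j: len(ns) for j, ns in neighbors.items()}
    return E, neighbors, in_degree

# Hypothetical p-values between three networks' attack-rate series.
p = {("A", "B"): 0.01, ("B", "A"): 0.03, ("C", "B"): 0.20, ("A", "C"): 0.04}
E, N, deg = g_causality_graph(p)
# A pair with arcs in both directions exhibits bidirectional G-causality.
bidirectional = {frozenset((i, j)) for (i, j) in E if (j, i) in E}
print(deg, bidirectional)
```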

Leveraging G-causality
One important utility of G-causality is to leverage it to predict cyber attack rate time series, so as to achieve predictive situational awareness and possibly proactive defense. Therefore, we propose associating this module with the following RQ:
• RQ8: How should one leverage G-causality to predict cyber attack rates?
The key research issue is to select an appropriate number of good helpers at the proper network resolution(s); this research issue goes beyond the notion of G-causality. In order to address this issue, we propose considering 4 factors: the direction of G-causality (i.e., unidirectional vs. bidirectional), the number of helpers that are leveraged for prediction, the p-value (i.e., small vs. medium vs. large), and the layers of network resolutions (e.g., one vs. multiple layers). For either empirical or theoretical comparison purposes, these factors can be tied to any prediction model of interest. As examples, we propose considering 4 classes of models:
• AR: It leverages X_j itself to predict X_j. This model, as reviewed above, does not leverage G-causality (or helpers) at all and serves as the baseline model.
• GC(z + 1)VAR: This is a family of models that are inherent to the notion of G-causality, leveraging z time series (helpers) that G-cause X_j (i.e., neighbor nodes pointing to X_j in the G-causality graph) to predict X_j, where 1 ≤ z ≤ deg(X_j, G) with deg(X_j, G) being the in-degree of node X_j in the G-causality graph G. These models are elaborated in Algorithm 1 below.
• PenVAR: This model is elaborated below and aims to avoid overfitting and overparameterization of the standard VAR model, without leveraging G-causality.
• GCPenVAR: This is a hybrid of PenVAR and GC(m + 1)VAR where m = deg(X j , G), by leveraging all of the times series that G-cause X j (i.e., all neighbors pointing to X j in the G-causality graph) to predict X j .
GC(z + 1)VAR models. Algorithm 1 leverages z helpers, namely z neighbors that G-cause X_j, to predict X_j, where 1 ≤ z ≤ m and m = deg(X_j, G). The heuristic used in Algorithm 1 is to leverage the z helpers with the smallest p-values in the G-causality test, so as to avoid the combinatorial explosion of choosing z out of m helpers, where m can be large. The heuristic corresponds to a greedy algorithm because p_{i,j} may be interpreted as the degree of G-causality, meaning that the smaller the p_{i,j}, the stronger the G-causality. Specifically, we sort the p-values corresponding to X_j's neighbors in increasing order.

Algorithm 1
The GC(z + 1)VAR algorithm for predicting x_{j,T_C+1}, 1 ≤ j ≤ n, by leveraging z (z ≥ 1) of the X_i's that G-cause X_j as helpers: use the (z + 1)-variate VAR model to fit X_j(T_C) by leveraging X_j and the z helpers corresponding to the z smallest p-values, according to Eq. (3) and an appropriate model selection criterion; then use the fitted model to predict x̂_{j,T_C+1}.

Figure 2 illustrates 4 scenarios of GC(z + 1)VAR for predicting X_j, where z ∈ {1, 2, 3, m}:
• GC2VAR: Leverage X_1, the helper with the smallest p-value, to predict X_j via the bi-variate VAR model.
• GC3VAR: Leverage X 1 and X 2 as helpers to predict X j via the 3-variate VAR model.
• GC4VAR: Leverage X 1 , X 2 and X 3 as helpers to predict X j via the 4-variate VAR model.
• GC(m + 1)VAR: Leverage all of the m neighbors X_1, . . . , X_m as helpers to predict X_j via the (m + 1)-variate VAR model.
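The greedy helper selection of Algorithm 1, followed by an OLS fit of the target equation of the resulting (z + 1)-variate VAR, can be sketched as follows; the function names and toy data are ours, not the paper's:

```python
import numpy as np

def select_helpers(p_in, z):
    """Greedy heuristic from Algorithm 1: among the neighbors that G-cause
    X_j, keep the z with the smallest p-values, avoiding the C(m, z)
    enumeration of all helper subsets."""
    ranked = sorted(p_in.items(), key=lambda kv: kv[1])
    return [name for name, _ in ranked[:z]]

def gc_var_predict(target, helpers, lag):
    """Fit the target equation of a (z+1)-variate VAR(lag) by OLS and
    return the one-step-ahead prediction of the target series."""
    series = [np.asarray(target, float)] + [np.asarray(h, float) for h in helpers]
    T = len(series[0])
    rows, y = [], []
    for t in range(lag, T):
        feats = [1.0]
        for s in series:
            feats.extend(s[t - lag:t][::-1])   # own past first, then helpers' pasts
        rows.append(feats)
        y.append(series[0][t])
    beta, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
    feats = [1.0]
    for s in series:
        feats.extend(s[T - lag:T][::-1])
    return float(np.array(feats) @ beta)

# Hypothetical p-values of X_j's in-neighbors in the G-causality graph.
p_in = {"X1": 0.001, "X2": 0.004, "X3": 0.02, "X4": 0.04}
print(select_helpers(p_in, 2))   # ['X1', 'X2']

# Toy data where the target is exactly the helper shifted by one step,
# so the fitted model should predict the helper's latest value.
rng = np.random.default_rng(2)
hs = rng.normal(size=50)
target = np.zeros(50)
target[1:] = hs[:-1]
pred = gc_var_predict(target, [hs], lag=1)
print(pred)
```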
PenVAR model. The standard VAR model over d + 1 time series is:

x_t = ν + Σ_{l=1}^{ℓ} Φ^{(l)} x_{t−l} + u_t,

where ℓ is the lag, ν represents a (d + 1) × 1 intercept vector, Φ^{(l)} denotes a (d + 1) × (d + 1) coefficient matrix, and u_t is a (d + 1) × 1 white-noise vector (i.e., an independent and identically distributed normal random vector with mean 0 and covariance matrix Σ_u, namely a diagonal matrix with elements representing variances). The model fitting is to minimize the least-square errors, which involves (d + 1) + ℓ(d + 1)^2 regression parameters, where ||·||_F^2 represents the squared Frobenius norm. This means that the standard VAR model is likely unstable or infeasible when d is large (i.e., high dimensions), which motivates PenVAR.
Let Φ = [Φ^{(1)}, . . . , Φ^{(ℓ)}]. The PenVAR model reduces the parameter space and has the following optimization objective [61]:

min_{ν,Φ} Σ_t ||x_t − ν − Σ_{l=1}^{ℓ} Φ^{(l)} x_{t−l}||_2^2 + λ||Φ||_1,

where λ > 0 is the penalty parameter and ||Φ||_1 represents the L_1 norm. The penalty parameter λ is selected to minimize the one-step-ahead mean square prediction error (MSPE) [61]:

MSPE(λ) = (1/T_1) Σ_t ||x̂_t(λ) − x_t||_2^2,

where ||·||_2 is the L_2 norm, x̂_t(λ) is the one-step-ahead prediction of x_t, and T_1 is the number of evaluation points. The GCPenVAR model is obtained by incorporating G-causality into the PenVAR model. Unlike the PenVAR model, which predicts a vector of d-variate time series simultaneously, the GCPenVAR model, like GC(z + 1)VAR, leverages z helpers only, where 1 ≤ z ≤ d.
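The sparsifying effect of the L_1 penalty at the heart of PenVAR can be illustrated with a generic lasso regression solved by iterative soft-thresholding (ISTA); this is our sketch of the penalty's behavior on a toy problem, not the exact solver of [61]:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, iters=2000):
    """Minimize ||y - X b||_2^2 + lam * ||b||_1 via ISTA:
    a gradient step on the smooth part, then soft-thresholding."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1/L for grad of ||y - Xb||^2
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ b - y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
true_b = np.zeros(10)
true_b[0], true_b[3] = 2.0, -1.5          # sparse ground truth
y = X @ true_b + 0.1 * rng.normal(size=200)

b = lasso_ista(X, y, lam=20.0)
# The L1 penalty should zero out the 8 irrelevant coefficients.
print(np.nonzero(np.abs(b) > 0.1)[0])
```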

Case Study
Now we present a case study by applying the framework to analyze a specific dataset collected by a low-interaction honeypot. A honeypot is a cyber defense instrument that emulates real-world Internet-based vulnerable services at a number of IP addresses. Since these services are exclusively set up for attracting attacks (i.e., no legitimate services are associated with these IP addresses), it is a widely-accepted practice to treat the incoming, unsolicited network traffic as attacks [4,5,7,[62][63][64][65][66][67][68][69]]. The honeypot in question monitors 1,024 IP addresses and runs a number of low-interaction honeypot programs, including Honeyd [70] and Nepenthes [71]. The notion of low-interaction means that the honeypot only partly emulates the services in question, which explains why the dataset is only used for analyzing cyber attack rates (rather than semantically richer analyses). The dataset was collected between 2/6/2014 and 5/13/2014 (i.e., T = 97 days). Although the dataset is six years old, it is sufficient for demonstrating the usefulness of CGC (while noting that it is often difficult for academic researchers to have access to such data). Researchers with newer datasets can adopt or adapt CGC to analyze their own datasets.

Data Pre-processing
The raw data collected by the honeypot is in the standard PCAP format, which is converted into the flow format to represent attacks. A flow is described by a tuple of five fields: source (i.e., attacker in this paper) IP address, destination (i.e., victim or honeypot in this paper) IP address, source port number, destination port number, and protocol [72]. For converting PCAP data into flow data, a widely-used tool, known as Yet Another Flowmeter (YAF) with super_mediator [72], is used. This process involves two parameters: the flow idle time and the flow lifetime. We use a widely-used combination of them: 60 seconds for the flow idle time and 300 seconds for the flow lifetime [3][4][5]69]. The source (i.e., attacker) IP addresses are grouped into networks at three resolutions: (i) /8 networks, each consisting of 2^24 IP addresses; (ii) /16 networks, each consisting of 2^16 IP addresses; and (iii) /24 networks, each consisting of 2^8 IP addresses. Since the framework is equally applicable to any network resolution, we extend the notation G(T_C) to G^[r](T_C), where r ∈ {/8, /16, /24} denotes the network resolution. In order to indicate cross-network-resolution analyses, we associate each higher-resolution network with the lower-resolution network to which it belongs. For example, a /24 network belongs to exactly one /16 network, which in turn belongs to exactly one /8 network.

Basic Statistical Analysis
Characterizing the Overall Cyber Attack Situational Awareness (Answering RQ1). In order to draw insights into the network resolution(s) that would be more appropriate for characterizing the overall cyber threat situational awareness, Figure 3 plots the attacking networks observed by the honeypot at the three network resolutions. In other words, the attacking networks observed by the honeypot on a daily basis concentrate at a small percentage (0.24%) of /24 networks, which are scattered in a significant number of /16 networks and even more so in a large number of /8 networks. These metrics reflect the average cyber threat landscape situational awareness, especially that some networks are better managed than others and that network resolution matters when measuring the percentage of attacking networks.
Second, cumulatively over the T = 97 days, the honeypot observed attacks from 204 (i.e., 204/2^8 ≈ 79.69%) of the /8 networks. Insight 2 hints that defenders have been active in detecting and cleaning up attacking computers, albeit it may take a long period of time (e.g., at least 40 days as shown by this dataset) to have a globally visible effect. If we define intense attacking networks as those which wage attacks on more than 90 days of the entire T = 97 days, we observe 70.70% of /8 attacking networks, 6.59% of /16 attacking networks, and 0.007% of /24 attacking networks. This means that the intensity of attacking networks at different resolutions can differ by orders of magnitude, which resonates with Insight 1, which however covers all attacking networks.

Insight 3.
The percentages of intense attacking networks at different network resolutions are orders-of-magnitude different, which means that network resolution matters when characterizing the evolution of intense attacking networks.
Characterizing the Time Series (Answering RQ4). The preceding characteristics are made over the entire time horizon of the dataset. A core value of studying cyber attack data is to conduct prediction (i.e., forecasting): in the real world, a defender would use the data collected up to the current time T_C to predict future cyber attack rates for time t > T_C. This means that for prediction purposes, we should characterize the time series in X(T_C). Given that T = 97 days is the time horizon of the dataset, we propose starting prediction at T_C = 90 (i.e., predicting cyber attack rates for t ∈ [91, 97]). The observed attack rates fall in a limited range (e.g., [3, 255]), perhaps because the honeypot is small (i.e., 1,024 IP addresses). Note that the medians of some /8, /16, and /24 networks are 0, because these networks do not wage attacks for at least 45 days (50% of the 90 days). The variances of some /16 and /24 networks are 0 because a constant number of attacks are waged from these networks. Specifically, one /16 network wages 1 attack per day for the first 90 days, one /16 network wages 2 attacks per day for the first 90 days, and 31 /24 networks wage a constant number of attacks per day for the first 90 days. We further observe that most variances are larger than their respective means. By looking into the individual time series, we observe that among the |X^[/8](90)| = 190 time series corresponding to the /8 resolution, 181 time series have a variance that is larger than their respective mean (including 132 with a variance that is at least one order of magnitude larger than the mean); no time series has a variance that equals its respective mean; and 9 time series have a variance that is smaller than their respective mean.
Among the |X^[/16](90)| = 15,242 time series corresponding to the /16 resolution, 9,953 have a variance that is larger than their respective mean (including 233 with a variance that is at least one order of magnitude larger than their respective mean); 14 have a variance that equals their respective mean; and 5,275 have a variance that is smaller than their respective mean. Among the |X^[/24](90)| = 18,534 time series corresponding to the /24 resolution, 2,434 have a variance that is larger than their respective mean (including 9 with a variance that is at least one order of magnitude larger than their respective mean); 28 have a variance that equals their respective mean; and 16,099 have a variance that is smaller than their respective mean. The fact that a time series has a variance that is much larger than its mean hints that the time series cannot be modelled by the Poisson process, which means that other kinds of models should be used to fit them.

Insight 4.
Most cyber attack rate time series cannot possibly be modeled by the Poisson process because their variance is (much) larger than their respective mean.

Characterizing G-causality
Guided by the framework, we now analyze G-causality between the time series at a given resolution and across network resolutions. The first step is to test the stationarity of the cyber attack rate time series. Since we will use the time series in X(90) to start making predictions, we test the stationarity of these time series rather than those which belong to X^[r](97). Table 3 summarizes the stationarity test results. We visualize the G-causality graph at the /8 resolution, excluding the nodes of X^[/8](90) that correspond to non-stationary time series (while noting that each stationary time series has an associated G-causality connection in this particular case), where a node's size and color encode its in-degree (i.e., the larger and darker the node, the larger the in-degree), E^[/8](90) represents the G-causality connections between the nodes, and W^[/8](90) is encoded into the color such that a smaller p_{i,j}(90) is indicated by a darker color (albeit the differences are too small to be visually noticeable). The 30 nodes at the outer circle have the highest in-degrees, meaning that they are G-caused by many nodes. Note that we are unable to visualize G^[/16](90) and G^[/24](90) because they have too many nodes.
Insight 5. G-causality is widely exhibited by cyber attack behaviors, hinting that cyber attacks are not random.

Characterizing the Direction of G-causality (Answering RQ6).
Recall that unidirectional G-causality means that (X_i, X_j) ∈ E implies (X_j, X_i) ∉ E, whereas bidirectional G-causality means that (X_i, X_j) ∈ E implies (X_j, X_i) ∈ E. Figure 7a plots the density of the nodes' in-degrees at the /8 resolution, separating unidirectional from bidirectional G-causality. Among the |X″^[/8]| = 75 nodes associated with G-causality, the in-degrees vary between 1 and 54. This means that some models are not applicable to some nodes (e.g., a node with in-degree 1 can only be used with GC2VAR, but not GC3VAR, because the latter requires 2 helpers). There are |E^[/8](90)| = 2,452 pairs of stationary time series at the /8 resolution that have an associated G-causality relation. Among them, 1,011 (41.24%) are unidirectional and 1,441 (58.76%) are bidirectional. Figure 7b plots

Insight 6.
Bidirectional G-causality is widely exhibited at a single network resolution.

G-causality Analysis across Network Resolutions (Answering RQ7).
For cross-resolution G-causality analysis, we want to know, for example, whether a large number of cyber attacks waged from a /8 network are mainly caused by the cyber attacks waged from some of its constituent /16 networks; in other words, we want to know whether G-causality can be leveraged to identify the /16 networks that are responsible for a /8 network's attack behavior. As mentioned above, there are 74 /8 networks, 2,360 /16 networks, and 6,017 /24 networks that are applicable to this analysis. The observed cross-resolution G-causality is reasonable because a lower-resolution network is composed of a number of higher-resolution networks.
Insight 7. G-causality across network resolution is widely observed, with most being bidirectional.

Leveraging G-causality (Answering RQ8)
As described in the framework, the key issue is to select an appropriate number of good helpers at the proper network resolution(s). The framework suggests investigating at least the following 4 factors: the direction of G-causality (i.e., unidirectional vs. bidirectional), the number of helpers that are leveraged for prediction, the p-value (i.e., small vs. medium vs. large), and the layers of network resolutions (e.g., one vs. two layers). In order to distinguish predictions leveraging bidirectional vs. unidirectional G-causality, we use the suffix "2" to indicate the former (e.g., GC3VAR2 vs. GC3VAR).
Owing to the limited length of the time series, we use rolling forecast, namely using X^[r](T_C) to predict X^[r](T_C + 1), where 90 ≤ T_C ≤ 96 and r ∈ {/8, /16, /24}. In order to build prediction models for PenVAR and GCPenVAR with T_C = 90, we divide the attack data into three parts: 40 observations for model initialization, 50 observations for training, and the remaining 7 for forecast evaluation. For fitting the PenVAR model, we need to select the model with the parameter λ that minimizes the one-step-ahead MSPE given by Eq. (7), namely MSPE(λ) = (1/50) Σ_{t=41}^{90} ‖x̂_t − x_t‖²₂, where ‖·‖₂ is the L2 norm. For this purpose, we use the grid search method to identify the optimal λ. At each network resolution r, we set the lag ℓ = 2 to balance model parsimony and fitting performance. This is because a large ℓ will lead to more variables in the model, which will impose a higher computational cost; moreover, we observe that ℓ = 2 leads to good prediction accuracy. PenVAR is only conducted at the /8 network resolution because the compute resources available to us make it infeasible to apply PenVAR to the /16 and /24 resolutions, where n is very large or extremely large.

Factor 1: Direction of G-causality. At each network resolution, we apply the applicable models described in the framework to predict X_j(T_C + 1) for X_j(T_C) ∈ X(T_C) and T_C ∈ [90, 96], except that PenVAR is only conducted at the /8 network resolution (owing to the computational feasibility issue mentioned above). In our experiments, we further find that many predictions cannot succeed because of the singularity problem associated with the GC(z + 1)VAR models, namely that the inverse of a matrix does not exist when its determinant is zero. (It is worth mentioning that it may be technically possible to prune the list of independent variables to alleviate the singularity problem. However, we do not consider this because we aim at a fair comparison between the models, which means that the input to these models should be the same. Nevertheless, developing new models to cope with the singularity problem in general is certainly an interesting problem for future research.) As a consequence, predictions are successful only for 31 /8 networks, 1,820 /16 networks, and 2,159 /24 networks. In order to compare the effect of unidirectional vs. bidirectional G-causality, we propose separating the unidirectional G-causality relations from the bidirectional ones. This also means that we separately sort the p-values associated with the unidirectional G-causality relations and the p-values corresponding to the bidirectional ones. Table 4 presents the summary statistics of the prediction SMAPEs incurred by the same set of models leveraging unidirectional vs. bidirectional G-causality, while recalling that AR is the baseline model and PenVAR is only feasible at the /8 network resolution. Figure 8 further presents the boxplots of the SMAPEs of the models. From Table 4 and Figure 8, we make the following observations. First, at the /8 network resolution, all of the 7 summary statistics show that bidirectional G-causality leads to lower SMAPE values (i.e., higher prediction accuracy) when compared with unidirectional G-causality. This phenomenon is also exhibited by GC3VAR and GC4VAR, but not by GC2VAR, GC(z + 1)VAR, or GCPenVAR. At both the /16 and /24 network resolutions, we observe that each GC(z + 1)VAR2 has no or few statistics that are lower than its GC(z + 1)VAR counterpart, and that the GC(z + 1)VAR2 models are not necessarily more accurate than the baseline AR model.
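The rolling one-step-ahead forecast and its SMAPE evaluation can be sketched as follows. This is a simplified GC2VAR-style model — the target regressed on its own lags and one helper's lags, with ℓ = 2 — fitted by ordinary least squares on synthetic 97-day data; it is an illustrative sketch, not the paper's fitted models.

```python
# Rolling one-step-ahead forecast of a target series using one G-causal
# "helper" series, evaluated by SMAPE; data and coefficients are synthetic.
import numpy as np

def smape(actual, predicted):
    """Symmetric mean absolute percentage error (assumes |a| + |p| > 0)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(2.0 * np.abs(p - a) / (np.abs(a) + np.abs(p)))

def one_step_forecast(target, helper, lag=2):
    """OLS fit of target[t] on its own lags and the helper's lags,
    then predict the next value from the most recent observations."""
    n = len(target)
    rows = n - lag
    y = target[lag:]
    cols = [np.ones(rows)]
    cols += [target[lag - i : n - i] for i in range(1, lag + 1)]
    cols += [helper[lag - i : n - i] for i in range(1, lag + 1)]
    beta = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0]
    recent = np.concatenate(([1.0], target[-lag:][::-1], helper[-lag:][::-1]))
    return recent @ beta

# Synthetic 97-day series in which the helper leads the target by one day.
rng = np.random.default_rng(1)
helper = 20 + rng.poisson(5, 97).astype(float)
target = np.empty(97)
target[0] = 25.0
target[1:] = 0.7 * helper[:-1] + 5 + 0.5 * rng.normal(0, 1, 96)

# Rolling forecast: use the first T_C observations to predict day T_C + 1,
# for T_C = 90, ..., 96 (matching the 7-day evaluation window).
preds = [one_step_forecast(target[:tc], helper[:tc]) for tc in range(90, 97)]
err = smape(target[90:97], preds)
print(round(err, 4))
```

Each of the 7 forecasts refits the model on all data observed so far, which is what "rolling forecast" means here; the SMAPE over the 7 days is the quantity summarized in Table 4.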

Insight 8. Leveraging bidirectional G-causality at a single network resolution achieves a higher prediction accuracy than leveraging unidirectional G-causality.
Insight 8 may be partly attributed to the fact that the time series corresponding to lower network resolutions are denser (i.e., on average over the T_C = 90 days, 60.17%, 71.67%, and 97.94% of the observations are non-zero at the /24, /16, and /8 network resolutions, respectively).
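The density measure used above is simply the fraction of non-zero observations over the first 90 days; it can be computed as follows (with made-up toy series — the percentages in the text come from the real dataset).

```python
# "Density" of an attack-rate time series: the fraction of non-zero
# observations over the evaluation horizon (toy data only).
def density(series, horizon=90):
    window = series[:horizon]
    return sum(1 for v in window if v != 0) / len(window)

toy = {
    "/8 net":  [3] * 88 + [0, 0],  # dense: attacks observed almost daily
    "/24 net": [1, 0, 0] * 30,     # sparse: attacks on one day in three
}
for name, xs in toy.items():
    print(name, round(density(xs), 4))
```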
Second, AR is indeed less accurate than the models that leverage G-causality at the /8 network resolution, but is surprisingly more accurate for 546 time series at the /16 network resolution and 272 time series at the /24 network resolution. By looking into the data, we notice that these time series have high variations. On the other hand, GCPenVAR is more accurate than AR and the other models that leverage G-causality for 72 time series at the /16 network resolution and 47 time series at the /24 network resolution. By looking into the data, we notice that these time series have low variations and are relatively sparse (typically having no more than 50 non-zero observations over the T_C = 90 days). This suggests that GCPenVAR may be better at dealing with sparse time series. Nevertheless, GCPenVAR is less accurate than AR and the other models that leverage G-causality for 77 time series at the /16 resolution and 12 time series at the /24 resolution.

Insight 9. G-causality is not always useful in predicting time series with high variations, and GCPenVAR may be better at predicting sparse time series, perhaps because it can leverage more information (i.e., all, rather than some, neighbors).
Factor 2: The number of "helpers". From the perspective of whether more "helpers" would lead to more accurate predictions, Table 4 shows that the prediction accuracy increases with the number z of helpers when z ∈ [1, 4]. We observe that all of the 7 summary statistics show that GC5VAR2 has a lower SMAPE value (i.e., higher prediction accuracy) than GC4VAR2, which in turn has a lower SMAPE value than GC3VAR2. While the Min of GC3VAR2 equals that of GC2VAR2, the Mean, SD (standard deviation), and Max of GC3VAR2 are lower than those of GC2VAR2. This suggests that GC3VAR2 offers a higher prediction accuracy than GC2VAR2. That is, the behavior of GC(z + 1)VAR2 for z ∈ [1, 4] suggests that more helpers would lead to more accurate predictions. However, when z > 4, Table 4 shows that the prediction accuracy decreases with z.

Insight 10. While leveraging G-causality can improve prediction accuracy at a single network resolution, using too many helpers might reduce prediction accuracy, because GC5VAR is more accurate than GC(z + 1)VAR when z > 4.
Insight 10 suggests that there may be a threshold, below which using more helpers (i.e., the number of time series that G-causes the target time series) can improve prediction accuracy, but above which the prediction accuracy actually decreases. This may or may not be attributed to over-fitting.
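A plausible (though, as noted, not conclusive) contributor is the parameter count: in a standard VAR equation with lag ℓ and z + 1 variables, each equation estimates an intercept plus ℓ(z + 1) coefficients, so with ℓ = 2 and only 90 training days the observations available per parameter shrink quickly as z grows. A back-of-the-envelope check:

```python
# Coefficients per equation of a GC(z+1)VAR model with lag 2 (intercept plus
# lag terms for the target and its z helpers), against 90 training days.
lag, observations = 2, 90
for z in range(1, 9):
    params = 1 + lag * (z + 1)
    print(f"z={z}: {params} params, {observations / params:.1f} obs/param")
```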
Factor 3: The magnitude of p-values. We propose classifying the p_{i,j}(T_C)'s into three categories: small (p_{i,j}(T_C) < 0.01), medium (0.01 ≤ p_{i,j}(T_C) < 0.03), and large (0.03 ≤ p_{i,j}(T_C) < 0.05). Since leveraging bidirectional G-causality leads to more accurate predictions than leveraging unidirectional G-causality (Factor 1), we only consider the former.

Factor 4: Layers of network resolutions. In order to see the usefulness of G-causality across network resolutions, we perform predictions with the same set of models, while making a time series corresponding to a low-resolution network the target node and leveraging its sub-networks as helpers for prediction purposes. Table 6 presents the summary statistics of the prediction accuracy of leveraging peer networks vs. sub-networks as helpers. At the /8 resolution, we observe that all models, except GCPenVAR, achieve a higher prediction accuracy when leveraging peer networks at the /8 resolution as helpers than when leveraging sub-networks at the /16 resolution as helpers. At the /16 resolution, we observe that GCPenVAR, GC3VAR2, GC2VAR, and GC2VAR2 achieve a higher prediction accuracy when leveraging sub-networks at the /24 resolution as helpers than when leveraging peer networks at the /16 resolution as helpers.
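The three p-value categories defined under Factor 3 can be expressed directly (the "not significant" label for p ≥ 0.05 is our addition, reflecting that such pairs carry no G-causality edge at the usual 0.05 level):

```python
# Classify G-causality p-values into the three categories used in Factor 3.
def p_category(p):
    if p < 0.01:
        return "small"
    if p < 0.03:
        return "medium"
    if p < 0.05:
        return "large"
    return "not significant"  # no G-causality edge at the 0.05 level

print([p_category(p) for p in (0.005, 0.02, 0.04, 0.2)])
# ['small', 'medium', 'large', 'not significant']
```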

Insight 12. When leveraging information across network resolutions for prediction, GCPenVAR is particularly useful.

Discussion
Causality vs. G-causality. Intuitively, causality should be asymmetric, meaning that "A causes B" would imply that "B does not cause A". However, our empirical analysis shows that G-causality does not satisfy this property, because bidirectional G-causality is widely exhibited at the /8, /16, and /24 network resolutions. These bidirectional G-causality relations highlight a gap between the intuitive notion of causality and the G-causality approach.

Table 5. Summary statistics of the SMAPE values for /8, /16 and /24 networks (#net is the number of networks involved), where bold font highlights the most accurate predictions among the different p-values with respect to a specific model.

Table 6. Summary statistics of the SMAPE values of leveraging peer networks vs. sub-networks as helpers, where a indicates models leveraging sub-networks (i.e., time series data across multiple network resolutions), and bold font highlights the more accurate prediction when comparing the prediction leveraging time series across network resolutions (i.e., leveraging sub-networks) and the prediction leveraging time series at a single network resolution (i.e., leveraging peer networks).

Limitations. Our CGC framework has two limitations. (i) We assume away the sparse time series. (ii) We assume away the non-stationary time series, which is inherited from the notion of G-causality. Addressing these limitations is an important open problem. Our case study has two further limitations that are imposed by the particular dataset. (iii) The dataset only spans 97 days. Although this is sufficient to demonstrate the usefulness of the framework, it would be better to have access to a dataset covering a longer period of time. (iv) The dataset is collected by low-interaction honeypots and therefore does not present rich semantic information about the attacks. Using datasets collected from high-interaction honeypots or production networks would resolve this issue, and the framework is equally applicable to such datasets.

Conclusion
We presented the CGC framework for characterizing the usefulness of G-causality in cybersecurity, especially its usefulness in predicting cyber attack rates. We investigated a number of models and drew a number of insights, which can be leveraged as a stepping-stone towards fully understanding the usefulness and limitations of G-causality in cybersecurity. There are many open problems for future research, including: How can we rigorously, rather than empirically, characterize the usefulness and limitations of G-causality? What are the utilities of other kinds of causality (e.g., Pearl-causality) in the cybersecurity domain? How can we model sparse and non-stationary time series in a principled fashion?