

Anomaly Detection and Attribution in Networks with Temporally Correlated Traffic

Ido Nevat (1), Dinil Mon Divakaran (2), Sai Ganesh Nagarajan (2), Pengfei Zhang (3), Le Su (2), Li Ling Ko (4), Vrizlynn L. L. Thing (2)

(1) TUM CREATE, Singapore
(2) Cyber Security & Intelligence Department, A*STAR Institute for Infocomm Research (I2R), Singapore
(3) Department of Engineering Science, University of Oxford, UK
(4) Department of Mathematics, University of Notre Dame, USA

Abstract—Anomaly detection in communication networks is the first step in the challenging task of securing a network, as anomalies may indicate suspicious behaviors, attacks, network malfunctions or failures. In this work, we address the problem of not only detecting the anomalous events, but also of attributing the anomaly to the flows causing it. To this end, we develop a new statistical decision theoretic framework for temporally correlated traffic in networks via Markov Chain modelling. We first formulate the optimal anomaly detection problem via the Generalized Likelihood Ratio Test (GLRT) for our composite model. This results in a combinatorial optimization problem which is prohibitively expensive to solve. We then develop two low-complexity anomaly detection algorithms. The first is based on the Cross Entropy (CE) method, which detects anomalies as well as attributes anomalies to flows. The second algorithm performs anomaly detection via GLRT on the aggregated flows transformation, a compact low-dimensional representation of the raw traffic flows. The two algorithms complement each other and allow the network operator to first activate the flow aggregation algorithm in order to quickly detect anomalies in the system. Once an anomaly has been detected, the operator can further investigate which specific flows are anomalous by running the CE based algorithm. We perform extensive performance evaluations, test our algorithms on synthetic and semi-synthetic data, as well as on real Internet traffic data obtained from the MAWI archive, and finally make recommendations regarding their usability.

Index Terms—Anomaly detection, Network traffic, Likelihood ratio test, Markov Chain, Cross entropy method

I. INTRODUCTION

Network attacks have been increasing over the years, with attackers seeking different and more sophisticated means to disrupt or intrude into networks [1]. These attacks may cause not only financial damage but also physical losses to vital infrastructure. It is therefore extremely important to develop methods and applications to counter these and other related threats. The first step in combating these attacks is the detection of anomalies, based on which countermeasures can be put in place in order to eliminate, or at least reduce, the impact of the attacks.

Anomalies in data networks are patterns that deviate from the 'normal' expected behavior of the network. These patterns might consist of suspicious behaviors such as network scans for vulnerable ports/services, attacks such as TCP SYN flooding, DDoS amplification attacks, etc., or they could also be the result of spurious traffic caused by network failures [2]. In Section III, we formally define our notion of normal and anomalous conditions in a network.

Anomaly detection has many applications in various areas of research. These include change detection in sensor networks [3], location spoofing detection in IoT networks [4], fraud detection [5], smart grid security [6], etc. In this work, we concentrate on anomalies in data networks, but the model and algorithms we develop can be applied to other domains as well.

There are numerous works that have attempted to solve the problem of network anomaly detection; see, for example, the surveys [7], [8]. In past works, a number of features with granularities varying from packets to flows to sessions have been considered (see [9] for a recent analysis). Similarly, a number of models have been used to analyze these features with the aim of detecting anomalies in network traffic (see Section II). In our work here, we concentrate on temporally correlated features of network traffic, which we model via a Markov Chain, with the aim of studying the feasibility and performance of such a model in detecting network anomalies. In the process, we choose as our working example one important feature, namely the state transitions of TCP (Transmission Control Protocol) flows, for modeling traffic data. TCP state transitions for normal flows are based on TCP's Finite State Machine (FSM), which evolves stochastically according to a first order Markov Chain [10], [11]. We note that our model is general enough to encompass any feature that can be accurately modelled as a Markovian FSM (e.g., traffic rate).

In the simplest model for anomalies, one defines two mutually exclusive hypotheses: in the first, all the traffic in the network is normal; and in the second, there are one or more anomalous flows. This constitutes a binary hypothesis testing problem, which can be solved via the classical decision theory framework [12], [13].
However, this approach may be too restrictive in many practical cases, since in practice we also need to identify the flows that are anomalous, in order to trigger actions and also for further investigations. This realistic requirement makes the problem much more difficult as it involves multiple competing models, and this is the scenario we tackle in this paper. In particular, we consider a multi-flow system (e.g., traffic of an enterprise network), where a subset of active and parallel traffic flows may be anomalous. This problem is of practical interest, because it is important to not only detect an anomalous event, but also to attribute the event to the set of flows causing it, consequently attributing the event to the end-hosts.

To summarize, the main goal of this paper is to answer the following important questions:
1) Can we reliably detect the existence of anomalies in temporally correlated network traffic, when they affect only a subset of flows? Also, how subtle an anomaly can we detect?
2) Can we attribute the detected anomalous event to the specific subset of flows that caused it? In other words, can we detect which subset of flows is anomalous?

To the best of our knowledge, these important aspects have not been addressed using a single model before. Instead, the treatment so far has been to perform a flow-by-flow detection, [11] being a recent example. This approach is clearly suboptimal, since the joint density of the flows' observations does not decompose into independent per-flow terms; the flows are only conditionally independent, where the conditioning is on the model parameters (i.e., the transition matrix). As such, a principled statistical analysis should consider the joint density and develop the detection algorithms based on this quantity.

This kind of joint anomaly detection and attribution problem can be cast as a combinatorial hypothesis testing problem, which may be feasible only when the number of flows is small, since the number of hypotheses to be tested is exponential in the number of flows. If there are n flows, the number of competing models required to attribute the flows to the anomaly would be 2^n; that is, the search space is exponential in the number of flows.
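To make the size of this search space concrete, the following minimal Python sketch enumerates every candidate subset of anomalous flows for a small number of flows. The function name `candidate_anomalous_subsets` is hypothetical and introduced only for illustration; it is not part of the proposed algorithms.

```python
from itertools import combinations

def candidate_anomalous_subsets(n_flows):
    """Yield every subset of flow indices, including the empty (all-normal) one."""
    for size in range(n_flows + 1):
        yield from combinations(range(n_flows), size)

# The number of competing models doubles with every additional flow.
for n in (3, 10, 16):
    count = sum(1 for _ in candidate_anomalous_subsets(n))
    print(n, count, count == 2 ** n)
```

Explicit enumeration of this kind quickly becomes impractical as the number of flows grows, which is the motivation for the compact formulation that follows.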
To deal with this complexity, we frame the problem via a compact model in which the system contains only two hypotheses: one of normal behaviour, which describes the case where all the flows are normal; and the other of anomalous behaviour, which describes the case where at least one of the flows is anomalous. We therefore embed all possible subsets of models which contain anomalous flows into the alternative hypothesis. This simplifies the problem structure considerably as we do not need to compare all possible sets of models.

Constructing the optimal detector for our model still incurs exponential complexity, a difficulty which we overcome by constructing two different ways to perform the detection tasks. The first algorithm solves the optimization problem via a modified version of the Cross Entropy method [14], [15], which searches the corresponding state-space in an efficient manner, providing the flexibility to manage the exploration-exploitation trade-off. For the second algorithm, referred to as the Flow Aggregation algorithm hereafter, we first apply a transformation of the traffic flows which results in a compact low-dimensional representation; this allows us to develop a quick anomaly detection algorithm with runtime complexity linear in the number of flows. Though the transformation retains the ability to detect anomalies, it results in some loss of information about individual flows, so it cannot identify which flows are anomalous. The two algorithms complement each other, as they allow the network operator to first activate the low complexity Flow Aggregation algorithm in order to quickly detect anomalies in the system. Once an anomaly has been detected, the operator can further investigate which specific flows are anomalous by running the Cross Entropy based algorithm.

Our model's detection capability is not based on any particular pattern. Instead, we model the problem to detect anomalies that deviate from normal traffic. In particular, we develop solutions to detect anomalies (say attacks) that affect a subset of flows. Such anomalies might include attacks that are known today or even zero-day attacks. In fact, even an anomaly that consists of only a very small number of flows, yet changes the state-space significantly, would be detected by our solutions.

In this paper, we present a new perspective for detecting and attributing anomalies in communication networks with temporally correlated traffic. Although the modelling of traffic features via Markov chains is not new [10], [11], there is still a gap in the translation of these models into a principled statistical decision making problem. We fill this gap by formulating the problem as the optimal statistical test, known as the Likelihood Ratio Test (LRT), which minimizes the statistical risk and balances between Type I and Type II errors. We list the contributions of this paper:
1) We formulate the problem of jointly detecting anomalies in communication network traffic and attributing them to the subset of flows causing the anomalies, via the Likelihood Ratio Test (Section IV).
2) To solve the resulting combinatorial problem, we develop a joint detection algorithm which is computationally efficient, based on a modified version of the Cross-Entropy (CE) method [14], [15] (Section V). This algorithm not only detects the presence of an anomaly, but also identifies the subset of flows causing it.
3) We also develop an alternative low-complexity detection algorithm which takes as input the aggregated traffic flow data, thus reducing the state-space size from the number of flows (usually in the order of thousands) to the number of states which a single flow contains (a few dozen) (Section VI).
4) We evaluate our algorithms using synthetic, semi-synthetic as well as real network traffic data, and demonstrate the effectiveness of our algorithms in anomaly detection and attribution (Section VII). We also compare our solutions with a flow-by-flow approach based on the Hoeffding Test [16], which was used recently for network anomaly detection [11].

II. RELATED WORKS

Anomaly detection algorithms in communication networks typically use some sort of summary of the raw data, known generally as features. In developing a statistical framework, one needs to understand the properties of the selected features in order to design an appropriate model. For example, some features present time correlation and should therefore be modelled accordingly, while some features are independent over time. Regardless of these attributes, it is of interest to have a summary of data useful for an investigation. One such widely used summary is an information theoretic criterion, known as information entropy [17]. One of the well-known works which builds on the principle of maximum entropy to detect anomalies in networks was presented in [18]. While the principle of maximum entropy is clearly a good summary for detecting network traffic anomalies, it has no direct interpretation in terms of the final decision; that is, a binary decision which partitions the observation space into two subspaces, one of normal behaviour, and the other of anomalous behaviour. This interpretation is vital for understanding and quantifying the false alarm (Type I error) and mis-detection (Type II error) probabilities, which quantify the quality of the detection algorithm.

In order to obtain such an interpretation, a statistical decision theoretic framework should be considered [19]. One such application of decision theory was developed in [20], where the authors proposed a network anomaly detection algorithm via the Likelihood Ratio Test (LRT) for independent heavy tailed (α-stable) random variables, where the feature they selected was the (aggregate) traffic rate. In [21], the authors developed a sequential LRT for independent and identically distributed samples, where the features were packet rate and the sample entropy of the packet size distribution. They modelled the packet rate as a random variable which follows a generalized Poisson distribution and the packet size as a random variable which follows a Normal distribution. Since the features are treated independently, the final algorithm applies a logical conjunction of detections made independently. In [22], the authors developed a network anomaly detection algorithm via change-point detection, based on the cumulative sum control chart (CUSUM) statistic, where the monitored feature was the number of attempted connections. Their approach follows the classic sequential hypothesis testing problem of Wald [23].
A detection scheme based on the LRT for both independent and dependent time-series correlated features (such as flow size, flow duration and flow start time) was developed in [11]. In particular, the proposal investigated the suitability of Markovian-like features for anomaly detection. The calibration of the proposed model involves solving an integer programming optimization problem which is NP-hard. To overcome this problem, the authors developed a heuristic algorithm. In [24], an algorithm was developed to detect anomalies using statistical characterizations of the TCP traffic, using high order homogeneous Markov chains under both stationary and non-stationary models.

In most practical cases of anomaly detection, the model parameters are unknown, making the evaluation of the marginal likelihood difficult if not intractable. In special cases, the resulting test is independent of those parameters, in which case the Uniformly Most Powerful (UMP) test exists. In other situations, there are two main approaches to handle this problem. The first is a Bayesian approach, corresponding to an assumption of a certain prior for each hypothesis. Then, by marginalizing the nuisance parameters, the problem is converted into a simple hypothesis testing problem. Unfortunately, the Bayesian approach has a few drawbacks: the assumption that the prior is known is hard to justify in most practical applications. Furthermore, the resulting integral is difficult to compute in practice. The second approach is the Generalized Likelihood Ratio Test (GLRT) [12], [13], in which one calculates the LRT with the unknown parameters being replaced by their Maximum Likelihood (ML) estimates. This approach was adopted in [25], where the authors developed an anomaly detection algorithm which utilises the GLRT by measuring the coherence of the current observation. The feature they selected was the number of packets of a certain type (number of TCP SYN packets sent to port 80, number of unique IP addresses contacted and volume of traffic on port 25).

An important aspect of the LRT framework is to derive the probabilities of false alarm and mis-detection, which leads to the choice of the optimal threshold of the test. Unfortunately, these quantities are known to be intractable for fully observed Markov Chains and do not admit simple closed form expressions. It is therefore common practice to approximate these quantities for large sample sizes via an application of large deviations asymptotics [26], which may be inaccurate [27]. The authors in [28] developed a tighter approximation of the threshold via the Central Limit Theorem.

Regardless of the specific feature that the aforementioned works utilised, or whether they considered temporally correlated or independent samples, they did not consider two important aspects:
1) How to efficiently detect anomalies when the number of parallel flows is very large?
2) How to detect anomalies when only a subset of (parallel) flows is anomalous?
In the following, we address these questions.

III. SYSTEM MODEL

In this section, we present the model assumptions and the related definitions, where we begin by defining a TCP traffic flow.

Definition 1 (TCP traffic flow). A flow is defined as a set of packets localized in time, having the same five-tuple of source and destination IP addresses, source and destination ports, and protocol. For a TCP flow, the transport protocol used is TCP.

A TCP flow transits through a set of states of a finite state machine (FSM), which can be modelled as a Discrete-Time Markov Chain (DTMC) [29]–[31]. In the following, we formally define the DTMC, and then TCP's state-path.

Definition 2 (Discrete-time Markov chain (DTMC) [32], [33]). Consider a sequence of random variables $\{S_0, S_1, S_2, \ldots, S_n\}$ on a finite or countable set of states $\Omega$, such that

$$Q_{i,j} = \Pr\left(S_{k+1} = j \mid S_k = i, S_{k-1} = i_{k-1}, \ldots, S_1 = i_1\right) = \Pr\left(S_{k+1} = j \mid S_k = i\right).$$

The square matrix $Q = (Q_{i,j})$, $i, j \in \Omega$, is called the one-step transition matrix, and $Q_{i,j}$ denotes the probability that the chain moves to state $j$, given that it was in state $i$. One can incorporate a non-homogeneous (time varying) behaviour of the DTMC by defining the transition matrix $Q(\Phi)$, where $\Phi$ is a set of covariates (possibly time dependent) which explains the transition matrix $Q$.
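As an illustration of Definition 2, the following minimal Python sketch draws a state-path from a homogeneous DTMC. The three-state matrix `Q`, the initial distribution `lam` and the helper `sample_path` are hypothetical and chosen only for illustration; they are not the TCP state machine used in this paper.

```python
import random

def sample_path(Q, lam, T, rng):
    """Draw S_0 from the initial distribution lam, then T one-step transitions from Q."""
    states = range(len(Q))
    # Each row of Q must be a probability distribution over the next state.
    for row in Q:
        assert abs(sum(row) - 1.0) < 1e-9
    path = rng.choices(states, weights=lam)  # list holding the initial state
    for _ in range(T):
        # Markov property: the next state depends only on the current state.
        path += rng.choices(states, weights=Q[path[-1]])
    return path

# Hypothetical 3-state chain (illustrative values only).
Q = [[0.8, 0.2, 0.0],
     [0.1, 0.7, 0.2],
     [0.0, 0.3, 0.7]]
lam = [1.0, 0.0, 0.0]
path = sample_path(Q, lam, T=10, rng=random.Random(0))
print(path)
```

Every consecutive pair of states in the sampled path corresponds to an entry of `Q` with non-zero probability, which is the property the detection algorithms exploit.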

Definition 3 (State-path of a TCP traffic flow). For a given TCP traffic flow, we define a state-path as a sequence of states which are realizations from the TCP DTMC between a source IP address and a destination IP address.

With these definitions, we now present our network system model. The network traffic can be seen as a collection of $K$ concurrent flows, where each of the traffic flows is a realization from the TCP DTMC. We assume that under normal conditions, the probability law which governs the behaviour of a flow generates a state-path of a stochastic process according to the transition matrix $Q_n(\Phi)$ and covariates $\Phi$. The values of $Q_n(\Phi)$ can be learned from labelled normal traffic, and the covariates can be used to incorporate time-varying non-stationary features, such as periodicity [11]. Here we assume that these covariates are known; and in the sequel, for brevity, we use the shorthand notation $Q_n$. When at least a single flow deviates from the expected "normal" behaviour, its state-path can be seen as a realization from a TCP DTMC whose transition matrix is different from $Q_n$.

Let $S^{(1:K)}_{1:T}$ represent the observed state-paths of $K$ TCP flows, where $T$ denotes the number of states visited by the TCP flows in the time-window (or frame) being processed. In other words, there are $T$ sampling points in a time-window, and the state of each flow is recorded at each sampling time. Then, the goal of the detection algorithm is to decide, given $S^{(1:K)}_{1:T}$, whether at least one of the flows is anomalous. Formally, we design a decision rule $\delta\left(S^{(1:K)}_{1:T}\right)$, which maps the input observation space, $S^{(1:K)}_{1:T}$, into a binary decision $\mathcal{A} = \{\text{normal}, \text{anomaly}\}$:

$$\delta\left(S^{(1:K)}_{1:T}\right) : \mathcal{S} \to \mathcal{A}.$$

This problem constitutes a binary composite hypothesis testing problem, since the alternative model contains the unknown parameter $Q_a$, which represents the unknown transition matrix of the anomalous flows. In the more complex setting, which we present in Section V, the detection rule is not only binary, but also indicates probabilistically which of the $K$ flows is anomalous, which means that the decision space is given by $\mathcal{A} = \left\{\text{normal}, \text{anomaly}, [0, 1]^K\right\}$. We now present the network system model:

A1 Consider a communication network consisting of a set $\mathcal{K}$ of $K$ flows, where the state of the $k$-th TCP flow at time $t$ is denoted by $\left\{S^{(k)}_t\right\}_{t \geq 1}$, $k \in \{1, \ldots, K\}$.

A2 The sequence of observations of the $k$-th flow, $\left\{S^{(k)}_t\right\}_{t \geq 1}$, is modelled as a Discrete-Time Markov Chain (DTMC), which takes values from a finite set $\Omega = \{1, \ldots, J\}$. The discrete sampling times are such that the state transitions of all flows are captured. We express explicitly the complete description of the TCP-DTMC transitions within a frame of length $T$ as follows:

$$
S^{(1:K)}_{1:T} := \left. \overbrace{\begin{bmatrix}
s^{(1)}_1 & s^{(1)}_2 & \cdots & s^{(1)}_T \\
s^{(2)}_1 & s^{(2)}_2 & \cdots & s^{(2)}_T \\
\vdots & \vdots & \ddots & \vdots \\
s^{(K)}_1 & s^{(K)}_2 & \cdots & s^{(K)}_T
\end{bmatrix}}^{\text{A time frame of length } T} \right\} K \text{ TCP flows} \tag{1}
$$

A3 Under normal condition (the $H_0$ hypothesis): the TCP transition matrix of the $k$-th flow is denoted by $Q_n$, where the transition probability is given by
$$[Q_n]_{i,j} := \Pr\left(S^{(k)}_t = j \mid S^{(k)}_{t-1} = i; H_0\right) \quad \forall i, j \in \Omega.$$

A4 Under anomalous condition (the $H_1$ hypothesis):
A4.1 At some unknown time $\tau$, an anomaly may occur and change the values of the TCP-DTMC transition matrix for a subset of flows $\mathcal{K}_a \subset \mathcal{K}$ (we denote the set of flows which remains under normal condition by $\mathcal{K}_n \subset \mathcal{K}$, where $\mathcal{K}_n \cup \mathcal{K}_a = \mathcal{K}$ and $\mathcal{K}_n \cap \mathcal{K}_a = \emptyset$).
A4.2 The transition matrix of a flow which is anomalous is denoted by $Q_a$, where $Q_a \neq Q_n$ (meaning that at least one entry is different), and its transition probability is given by
$$[Q_a]_{i,j} := \Pr\left(S^{(k)}_t = j \mid S^{(k)}_{t-1} = i; H_1\right) \quad \forall i, j \in \Omega.$$

The network system model described in A1-A4 is expressed compactly as follows:

$$
\begin{aligned}
H_0 &: \; S^{(k)}_t \mid S^{(k)}_{t-1} \sim \text{Markov}\left(\lambda_0, Q_n\right), \quad k \in \mathcal{K}, \; t = \{1, \ldots, T\} \\
H_1 &: \; \begin{cases}
S^{(k)}_t \mid S^{(k)}_{t-1} \sim \text{Markov}\left(\lambda_0, Q_n\right), & k \in \mathcal{K}_n, \; t = \{1, \ldots, T\}, \\
S^{(k)}_t \mid S^{(k)}_{t-1} \sim \text{Markov}\left(\lambda_1, Q_a\right), & k \in \mathcal{K}_a, \; t = \{1, \ldots, T\}
\end{cases}
\end{aligned} \tag{2}
$$

where $Q_n$ and $Q_a$ are the one-step transition matrices of the chains under normal and anomalous conditions, respectively. The vectors $\lambda_0$ and $\lambda_1$ represent the initial distribution under each model.

IV. PROBLEM STATEMENT

In this work, we develop two algorithms to answer the following two key questions. Given an observation of a time-frame with $K$ TCP flows:
1) Anomaly detection: Is the frame normal or anomalous? (decide between $H_0$ and $H_1$).
2) Attribution: In case the decision is $H_1$, which of the flows are anomalous? (find $\mathcal{K}_a$).
To address these important questions, we use a statistical decision theoretic framework of hypothesis testing [12], [13]. We formulate the problem in two different ways to address these questions and develop the optimal detection algorithms in Sections V and VI.
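The system model, with K concurrent flows observed over a frame of length T and a subset of flows driven by a different transition matrix, can be made concrete with a small simulation. The sketch below is illustrative only: the three-state matrices, the initial distributions and the helper `simulate_frame` are hypothetical, and the anomalous flows are made to stay in state 0 purely so that the effect is easy to see.

```python
import random

def simulate_frame(Qn, Qa, lam0, lam1, K, T, anomalous, rng):
    """Return K state-paths of length T; flows listed in `anomalous` follow (lam1, Qa)."""
    states = range(len(Qn))
    frame = []
    for k in range(K):
        lam, Q = (lam1, Qa) if k in anomalous else (lam0, Qn)
        path = rng.choices(states, weights=lam)       # state at the first sampling time
        while len(path) < T:
            path += rng.choices(states, weights=Q[path[-1]])
        frame.append(path)
    return frame

# Hypothetical normal behaviour: a 3-state chain that moves between all states.
Qn = [[0.6, 0.4, 0.0],
      [0.1, 0.6, 0.3],
      [0.3, 0.0, 0.7]]
lam0 = [0.5, 0.5, 0.0]
# Hypothetical anomalous behaviour: the flow never leaves state 0.
Qa = [[1.0, 0.0, 0.0]] * 3
lam1 = [1.0, 0.0, 0.0]

frame = simulate_frame(Qn, Qa, lam0, lam1, K=5, T=8, anomalous={3},
                       rng=random.Random(1))
for k, path in enumerate(frame):
    print(k, path)
```

Each row of `frame` plays the role of one flow's state-path within the frame, so the nested list has the same K by T layout as the observation matrix of the model.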

A. Known Transition Matrices: the Likelihood Ratio Test

In the ideal scenario where both the transition matrices $Q_n$ and $Q_a$ are known, the classical Neyman-Pearson approach [34] provides the optimal test for the binary decision regarding each frame (whether it is normal or anomalous), and is known to be the Uniformly Most Powerful (UMP) test. The resulting test is obtained via a threshold test which is based on the likelihood ratio, given by:

$$\Lambda\left(f\left(S^{(1:K)}_{1:T}\right)\right) = \frac{\Pr\left(f\left(S^{(1:K)}_{1:T}\right) \mid H_0\right)}{\Pr\left(f\left(S^{(1:K)}_{1:T}\right) \mid H_1\right)} \underset{H_1}{\overset{H_0}{\gtrless}} \gamma, \tag{3}$$

where $f(\cdot)$ is a statistical model (e.g., summary statistic) of $S^{(1:K)}_{1:T}$; $\gamma$ is a threshold that can be set to either assure a fixed false-alarm rate under the Neyman-Pearson approach or minimize the overall error probability under the Bayesian approach [13]. This idealised case (since in practice $Q_a$ is unknown) will serve as the upper bound on the detection performance.

B. Unknown Transition Matrix $Q_a$: the Generalized Likelihood Ratio Test

In many practical scenarios, we know the transition matrix of normal traffic, $Q_n$; but the transition matrix of anomalous traffic, $Q_a$, is unknown and is attack specific. Learning $Q_a$ for all kinds of anomalies in an ever-changing and dynamic Internet is a challenge. Therefore, in our model a simple hypothesis $Q = Q_n$ is tested against a composite alternative $Q \neq Q_n$. Clearly, in this case, the Neyman-Pearson fundamental lemma does not apply and optimality in terms of a UMP test is difficult to establish [35]. In such cases, it is common to use the Generalized Likelihood Ratio Test (GLRT) [12], [13], given by

$$\Lambda_{\text{GLRT}}\left(f\left(S^{(1:K)}_{1:T}\right)\right) = \frac{\Pr\left(f\left(S^{(1:K)}_{1:T}\right) \mid H_0\right)}{\sup_{Q} \Pr\left(f\left(S^{(1:K)}_{1:T}\right) \mid H_1\right)} \underset{H_1}{\overset{H_0}{\gtrless}} \gamma. \tag{4}$$

The optimality of the GLRT for Markov sequences for the asymptotic case and a finite number of sub-classes of $Q_a$ has been established in [35], and extended to the set of all finite states in [36], [37]. We note that one could also consider the case where $Q_n$ is unknown and estimate it in a similar manner to the estimation of $Q_a$. In that case the GLRT has the following form:

$$\Lambda_{\text{GLRT}}\left(f\left(S^{(1:K)}_{1:T}\right)\right) = \frac{\sup_{Q \in \Phi_n} \Pr\left(f\left(S^{(1:K)}_{1:T}\right) \mid H_0\right)}{\sup_{Q \in \Phi_a} \Pr\left(f\left(S^{(1:K)}_{1:T}\right) \mid H_1\right)} \underset{H_1}{\overset{H_0}{\gtrless}} \gamma.$$

However, to have a well posed problem, the space of matrices under $H_0$, denoted $\Phi_n$, should not overlap with the space of matrices under $H_1$, denoted $\Phi_a$. Otherwise, it would be mathematically impossible to distinguish between the two hypotheses in cases where the realizations of the null and alternative transition matrices fall inside the intersection of these spaces. This means that we need to assume that $H_0 : Q_n \in \Phi_n$, and $H_1 : Q_a \in \Phi_a \setminus \Phi_n$. This might not be a realistic assumption in some applications, and in particular for the problem of network traffic anomaly detection (for example, anomalous traffic may consist of a few flows that appear to be normal when observed independently). Therefore, in our work below, we do not make such an assumption (i.e., we assume $Q_n$ is known a-priori).

In the next sections, we utilize the GLRT to derive the optimal test for detecting an anomaly in a subset of flows. First, in Section V, we re-formulate the GLRT and generalize it to incorporate all possible combinations of subsets of flows which may be anomalous. This results in a combinatorial hypothesis test procedure which is impractical to solve via brute force due to the sheer number of possible combinations. To overcome this difficulty we derive an efficient algorithm via the Cross Entropy method. Next, in Section VI, we develop a low-complexity algorithm to evaluate the GLRT in Eq. (4) via flow aggregation, which allows for a compact representation of the state-space.

V. ALGORITHM I: OPTIMAL SUBSET ANOMALY DETECTION VIA CROSS ENTROPY METHOD

Given the system model in Eq. (2), we now derive the optimal anomaly detection algorithm to solve the two questions we postulated before. We note that, under the $H_0$ model we have a simple hypothesis; while under the $H_1$ model, we have a composite model, due to the unknown model parameters $\mathcal{K}_a$ and $Q_a$. We therefore use the Generalized Likelihood Ratio Test (GLRT) [13], and maximize the likelihood function over the space $\Theta = \{\mathcal{K} \cup Q_a\}$, where

$$[Q_a]_{i,j} \geq 0, \quad \forall i, j \in \{1, \ldots, J\}, \quad \text{and} \quad \sum_{i=1}^{J} [Q_a]_{i,j} = 1, \quad \forall j \in \{1, \ldots, J\}.$$

We now express the optimal detection algorithm based on the GLRT method.

Lemma 1. The test statistic of the optimal subset anomaly detection is given by:

$$\Lambda\left(S^{(1:K)}_{1:T}\right) = \sup_{\Theta} \frac{\prod_{k \in \mathcal{K}_a} \Pr\left(S^{(k)}_0 \mid Q_n\right) \prod_{t=1}^{T} \Pr\left(S^{(k)}_t \mid S^{(k)}_{t-1}, Q_n\right)}{\prod_{k \in \mathcal{K}_a} \Pr\left(S^{(k)}_0 \mid Q_a\right) \prod_{t=1}^{T} \Pr\left(S^{(k)}_t \mid S^{(k)}_{t-1}, Q_a\right)}.$$

Proof. See Appendix ?? in the supplementary file.

Algorithm I: The optimal flow subset anomaly detection algorithm is given by:

$$\Lambda\left(S^{(1:K)}_{1:T}\right) \underset{H_1}{\overset{H_0}{\gtrless}} \gamma.$$

The computation of the test statistic in Lemma 1 not only provides us with the likelihood ratio of the model, but also finds the subset of flows which is anomalous, $\mathcal{K}_a$. However, calculating $\Lambda\left(S^{(1:K)}_{1:T}\right)$ involves solving a joint combinatorial problem (over $\Theta := \{\mathcal{K} \cup Q_a\}$) which is impractical even for a moderate number of flows. For example, for the case where we have $K = 100$ flows, there are $2^{100}$ possible combinations to evaluate in order to calculate $\Lambda\left(S^{(1:K)}_{1:T}\right)$. To overcome this computational difficulty we develop an algorithm with runtime complexity acceptable for practical purposes, that approximates the optimization problem via

B. Maximum Likelihood Estimator of Qa

the Cross Entropy method [14]. This method will enable us to evaluate the test statistic only for flows which are more likely to be anomalous, rather than explore the whole state-space.

As mentioned before, to evaluate the CE objective function, we need to calculate the MLE of Qa , given the (1:K) data, S1:T . The MLE of Qa is obtained as the solution to the following constrained optimization problem:

A. Anomaly Detection using the Cross Entropy Method

b a = arg max Pr S(1:K) |Q Q 1:T

We now develop a novel algorithm to evaluate the test statistic in Lemma 1. We evaluate the combinatorial optimization problem in Lemma 1 via a Monte Carlo approach which has low complexity. In particular, we consider the Cross Entropy (CE) method [14]. Based on a variation of the well known Importance Sampling technique [38], the CE method minimizes the Kullback-Leibler (KL) divergence for approximating the optimal sampling distribution. The CE method can be applied to solve NP-hard problems, by translating them into stochastic optimization problem, and subsequently performing rare event simulation techniques. We present the generic CE algorithm in Section ?? of the supplementary file. The following theorem establishes the conditions for the CE method to converge asymptotically to the optimal solution.

Q

s.t [Q]i,j ≥ 0, J X

∀i, j ∈ {1, . . . , J},

[Q]i,j = 1,

∀j ∈ {1, . . . , J}.

i=1

It is not difficult to see that by using Lagrange multipliers this optimization problem can be solved, and that the th b a is given by: (i, j) entry of Q K P T P

h

ba Q

i

=

k=1 t=1

k 1 St+1 = j 1 Stk = i

i,j

K ×T

.

(5)

The MLE in this case is known to be consistent but biased, with the bias tending towardh zero i as the sample size P b increases, which means that Qa → [Qa ]i,j as the

Theorem 1. Convergence for constant smoothing factor [39]: If the smoothing sequence is a constant, i.e., αt = α, α ∈ (0, 1], then the sequence of probability mass functions f (x; pt ), t ≥ 1, converges with probability 1 to a unit mass located at some n (random) candidate x ∈ {0, 1} . Furthermore, the probability that an optimal solution is generated can be made arbitrarily close to 1 by selecting a sufficiently small value of α.

i,j

number of samples goes to infinity [40]. Our modified CE algorithm is presented in Algorithm 1. The stopping criterion we use is the number of iterations. This gives the flexibility to control the running time, depending on the computational capability of the system.

C. Computational complexity
The runtime complexity of the CE algorithm we developed for anomaly detection and attribution is determined by the number of samples N, the number of flows K, the length of the state-paths T and the stopping criterion. In Algorithm 1, steps (2) and (3) are the most expensive and therefore dominant. The runtime complexity of these steps, and hence of the algorithm, is O(N × R × K × T), where R is the number of iterations. As we process traffic in fixed time-windows, the state-path length T is typically small (a few tens) and can be considered a constant. The number of flows or, to be precise, the number of instantaneous flows, can range from a few hundred to a few thousand for a network of considerable size. Therefore, we can safely assume that the running time of the algorithm is governed mostly by the number of samples N and the number of iterations R. These two are also the only free parameters that we can control. The values of N and R are dictated by the balance between the available computational budget, the required detection performance, and the difference between $Q_n$ and $Q_a$. We highlight that a naive brute-force algorithm would consider $2^K$ samples for K flows, i.e., the entire population, whereas in our CE algorithm $N \ll 2^K$. In Section VII, where we evaluate the performance of the algorithms, we present the values used for N and R.

The candidate solutions are generated randomly following the function f. The learning parameter α creates a trade-off between achieving the optimal solution with high probability and obtaining a fast rate of convergence of the sampling distribution. The smoothing sequences $\alpha_t = \frac{1}{t+1}$ and $\alpha_t = \frac{1}{(t+1)\log(t+1)^{\beta}}$, $\beta > 0$, lead to the optimal solution with probability 1; see [39] for details.

Solving the subset anomaly detection problem in Lemma 1 involves two aspects which we need to consider:
1) The objective function contains a high-dimensional nuisance parameter in the form of the alternative transition matrix $Q_a$. We are therefore unable to apply the CE method directly to the test statistic in Lemma 1, since it is a mixed integer (K) and continuous ($Q_a$) optimization problem. However, the Maximum Likelihood Estimator (MLE) of $Q_a$ can be derived efficiently, as we present next. We can therefore embed the MLE of $Q_a$ into the CE algorithm and solve the remaining combinatorial problem efficiently.
2) The optimization problem over the subset of flows requires us to specify a parametrized random mechanism to generate samples θ ∈ Θ. To keep this sampling mechanism efficient and of low computational complexity, we choose to generate independent sample components rather than dependent ones, and note that other mechanisms could be considered.
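The two points above (plugging the MLE of $Q_a$ into the objective, and sampling independent Bernoulli components) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default parameter values and the score `glr_score` are assumptions made for illustration, with `glr_score` standing in for the statistic of Lemma 1 and the MLE taken to be the row-normalised transition counts of the selected flows.

```python
import numpy as np

def transition_counts(paths, J):
    """Count i->j transitions over all state-paths (rows of the integer array `paths`)."""
    counts = np.zeros((J, J))
    np.add.at(counts, (paths[:, :-1].ravel(), paths[:, 1:].ravel()), 1)
    return counts

def glr_score(gamma, paths, Q_n, eps=1e-12):
    """Hypothetical plug-in score for one candidate subset (stand-in for Lemma 1).

    gamma marks the flows assumed anomalous. The MLE of Q_a is taken to be the
    row-normalised transition counts of those flows; the score is their
    log-likelihood ratio against Q_n.
    """
    selected = paths[gamma.astype(bool)]
    if selected.shape[0] == 0:
        return 0.0
    counts = transition_counts(selected, Q_n.shape[0])
    rows = counts.sum(axis=1, keepdims=True)
    Q_a_hat = np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
    mask = counts > 0
    return float(np.sum(counts[mask] * (np.log(Q_a_hat[mask]) - np.log(Q_n[mask] + eps))))

def cross_entropy_subset(score, K, N=200, R=30, alpha=0.7, rho=0.1, psi=0.5, seed=0):
    """CE search over binary vectors of length K, following the steps of Algorithm 1."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 0.5)  # assumed reading of step 0: every inclusion probability starts at 0.5
    for _ in range(R):  # stopping criterion: fixed number of iterations
        samples = (rng.random((N, K)) < pi).astype(int)  # step 1: independent Bernoulli draws
        u = np.array([score(g) for g in samples])        # steps 2-3: evaluate each candidate
        beta = np.quantile(u, 1.0 - rho)                 # step 4: (1 - rho) quantile
        elite = samples[u >= beta]                       # never empty: the maximum always qualifies
        pi = alpha * elite.mean(axis=0) + (1.0 - alpha) * pi  # step 5: smoothed update
    return (pi >= psi).astype(int), pi                   # step 6: threshold with Psi
```

With state-paths stored as a K × T integer array `paths`, the search would be started as `cross_entropy_subset(lambda g: glr_score(g, paths, Q_n), K=paths.shape[0])`. The default values of N, R, α and ρ here are illustrative only and are not the settings used in the experiments.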

Algorithm 1 Cross Entropy Algorithm
Require: $S_{1:T}^{(1:K)}$, $U(\cdot) = \Lambda(\cdot)$ as per Lemma 1, learning parameter α, quantile level ρ
0. Initialize $\Gamma := \gamma_{1:N,1:K} = 0.5$ (all elements set to 0.5).
while stopping criterion ≠ TRUE do
1. Generate N independent samples of the binary set $\Gamma_n = \{\gamma_{n,1}, \gamma_{n,2}, \ldots, \gamma_{n,K}\}$, where
\[ \Gamma_n \sim \mathrm{Ber}(\Gamma_n, \pi) = \prod_{k=1}^{K} \pi_k^{\gamma_{n,k}} (1 - \pi_k)^{1 - \gamma_{n,k}}, \]
where π is a vector of parameters belonging to the K-cube (i.e., $[0, 1]^K$).
2. Calculate the MLE of the transition matrix, $\hat{Q}_a$, where the observations $S_{1:T}^{(\cdot)}$ contain only the flows which have been selected according to the values of the binary vector $\Gamma_n$.
3. Evaluate $U(\Gamma_n; \hat{Q}_a) = \Lambda(S_{1:T}^{(1:K)} | \Gamma_n)$ for all the N samples, according to Lemma 1.
4. Calculate β, the (1 − ρ) quantile of the values $U(\Gamma; \hat{Q}_a)$.
5. Update π as follows:
\[ \pi_k = \alpha \, \frac{\sum_{n=1}^{N} 1(U(\Gamma_n) \ge \beta) \, 1(\Gamma_{n,k} = 1)}{\sum_{n=1}^{N} 1(U(\Gamma_n) \ge \beta)} + (1 - \alpha) \pi_k, \quad \forall k \in \{1, \ldots, K\}. \]
end while
6. For each element $\pi_k$ perform:
\[ \Gamma_k = \begin{cases} 1, & \pi_k \ge \Psi \\ 0, & \text{otherwise} \end{cases} \]
where Ψ is a pre-defined threshold which controls the attribution detection performance. This binary vector Γ is our CE estimator of $K_a$ in Lemma 1.
7. Calculate the test statistic in Lemma 1, $\Lambda(S_{1:T}^{(1:K)}; \Gamma, \hat{Q}_a)$.
8. Make a binary decision:
\[ \Lambda(S_{1:T}^{(1:K)}; \Gamma, \hat{Q}_a) \underset{H_1}{\overset{H_0}{\gtrless}} \gamma \]

VI. ALGORITHM II: OPTIMAL ANOMALY DETECTION VIA FLOWS AGGREGATION
In this section, we develop an alternative formulation to the problem presented in Eq. (3). Our approach is based on the observation that we do not need to store all the raw information in $S_{1:T}^{(1:K)}$, but can instead use a highly summarized version of $S_{1:T}^{(1:K)}$. This results in a significant dimensionality reduction, leading to a more computationally efficient algorithm. However, this summary leads to a loss in the ability to identify anomalous flows, while retaining the ability to detect an anomalous event (that is, at least one of the flows is anomalous). As such, our goal here is to design a decision rule $\delta(S_{1:T}^{(1:K)})$, whose input is the state-paths of K flows (each consisting of T samples), and which maps this observation space S into a binary decision A = {normal, anomaly}. We begin by defining a summary statistic which is based on two quantities: an Aggregate Count Vector (ACV) and a Flow Transition Matrix (FTM).

1) Aggregate Count Vector (ACV): we define the vector $Z_t \in \mathbb{R}^{J \times 1}$ to denote the aggregate number of flows occupying each of the states at time t:
\[ Z_t := \begin{bmatrix} \text{Number of flows in TCP state 1 at time } t \\ \text{Number of flows in TCP state 2 at time } t \\ \vdots \\ \text{Number of flows in TCP state } J \text{ at time } t \end{bmatrix} \]
Mathematically, the j-th entry of $Z_t$, denoted $Z_t^{(j)}$, is given by:
\[ Z_t^{(j)} = \sum_{k=1}^{K} 1\left(S_t^{(k)} = j\right), \quad \forall j \in \{1, 2, \ldots, J\}, \tag{6} \]
and we have the constraint $\sum_{j=1}^{J} Z_t^{(j)} = K$.

2) Flow Transition Matrix (FTM): we define $F_t \in \mathbb{R}^{J \times J}$, where $[F_t]_{i,j}$ is the number of flows that have transitioned from state i to state j at time t:
\[ [F_t]_{i,j} := \sum_{k=1}^{K} 1\left(S_{t-1}^{(k)} = i\right) 1\left(S_t^{(k)} = j\right), \quad i, j \in \{1, 2, \ldots, J\}. \tag{7} \]

A. Anomaly Detection via Flows Aggregation Method
We now present a novel anomaly detection algorithm which is based on our previous definitions. To this end, we evaluate the log of the test statistic in Eq. (3), which is presented in the following Lemma.

Lemma 2. The test statistic of the optimal anomaly detection via flow aggregation is given by:
\[ \Lambda(Z_1, F_{2:T}) = \log \frac{\Pr(Z_1 = z_1 | Q_n)}{\Pr(Z_1 = z_1 | \hat{Q}_a)} + \sum_{t=2}^{T} \sum_{i=1}^{J} \sum_{k=1}^{J} [F_t]_{i,k} \log \frac{[Q_n]_{i,k}}{[\hat{Q}_a]_{i,k}}, \]
where $\hat{Q}_a$ is derived in a similar way to the unrestricted MLE of the transition matrix $Q_a$, given in (5), but instead of using $S_{1:T}^{(1:K)}$, we use the FTM sequence $F_{2:T}$. It is straightforward to show that $F_{2:T}$ is a sufficient statistic for the MLE $\hat{Q}_a$, and as such incurs no loss of information.

Proof. See Appendix ?? in the supplementary file.

Algorithm II: The optimal flow aggregation based anomaly detection algorithm is given by:
\[ \Lambda(Z_1, F_{2:T}) \underset{H_1}{\overset{H_0}{\gtrless}} \gamma. \]
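The summary statistics of Eqs. (6) and (7) and the transition term of Lemma 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, states are assumed to be encoded as integers 0, ..., J−1, and the MLE of Eq. (5) is assumed to be the row-normalised transition count matrix. The first term of Lemma 2, which depends on the distribution of $Z_1$, is not reproduced here.

```python
import numpy as np

def aggregate_count_vector(states_t, J):
    """ACV, Eq. (6): number of flows in each of the J states at one time step."""
    return np.bincount(states_t, minlength=J)

def flow_transition_matrices(paths, J):
    """FTM sequence F_2, ..., F_T of Eq. (7), stacked with shape (T-1, J, J).

    paths is a K x T integer array; row k is the state-path of flow k.
    """
    K, T = paths.shape
    F = np.zeros((T - 1, J, J), dtype=int)
    for t in range(1, T):
        np.add.at(F[t - 1], (paths[:, t - 1], paths[:, t]), 1)
    return F

def mle_transition_matrix(F):
    """Row-normalised transition counts (assumed form of the MLE in Eq. (5))."""
    counts = F.sum(axis=0).astype(float)
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def transition_term(F, Q_n, Q_a_hat):
    """sum_t sum_{i,k} [F_t]_{i,k} log([Q_n]_{i,k} / [Q_a_hat]_{i,k}) from Lemma 2.

    Assumes Q_n is strictly positive wherever a transition was observed.
    """
    counts = F.sum(axis=0)
    mask = counts > 0
    return float(np.sum(counts[mask] * (np.log(Q_n[mask]) - np.log(Q_a_hat[mask]))))
```

Only the J × J counts enter the statistic, so once the FTM sequence has been built in a single pass over the K flows, the cost of evaluating the transition term no longer depends on K.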

Next, we present the anomaly detection algorithm using flow aggregation, in Algorithm 2. The main difference between the Cross Entropy (CE) algorithm we developed in Section V and the Anomaly Detection via Flows Aggregation algorithm is that the CE algorithm is able to detect which subset of flows is anomalous, while the Flows Aggregation based algorithm is unable to do so, since we compress all the states into a single chain. The advantage of the Flows Aggregation algorithm is its reduced computational complexity. This means that the two methods complement each other, in the sense that a system operator may choose to activate the Flows Aggregation algorithm first in order to quickly detect anomalies in the system. Once an anomaly has been detected, the system operator can investigate which specific flows are anomalous by running the CE based algorithm.

Algorithm 2 Flow Aggregation Algorithm
Require: $S_{1:T}^{(1:K)}$, $Q_n$, γ
1: Obtain the maximum likelihood estimate $\hat{Q}_a$, given in Eq. (5).
2: Calculate the ACV $Z_1$ as per Eq. (6).
3: Calculate the FTM sequence $F_{2:T}$ as per Eq. (7).
4: Calculate the test statistic $\Lambda(Z_1, F_{2:T})$ as per Lemma 2.
5: If $\Lambda(Z_1, F_{2:T}) \ge \gamma$, then declare normal, else declare anomaly.

B. Computational complexity
The complexity of Algorithm 2 is defined in terms of the number of flows and the number of states. The construction of the flow transition matrix takes O(K) time; the state transition of a given flow can be computed in constant time, as flows are uniquely identified and indexed using hashes. The runtime of the algorithm is governed by the computational complexity of the test statistic $\Lambda(Z_1, F_{2:T})$ as per Lemma 2; the computation of the test statistic takes $O(T \times J^2)$ steps. Note that the number of flows does not affect the computational time of the test statistic. The computational complexity of the algorithm is therefore $O(K + T J^2)$.

VII. PERFORMANCE EVALUATIONS
In this section, we present the performance of the algorithms we developed in Sections V and VI. We conduct extensive experiments for this purpose. Below, we describe the real traffic data used in the experiments. We also use synthetic data in our experiments. In Section VII-B, we define the different scenarios considered for experiments, based on the data. We present and discuss the results in Section VII-C.

A. Traffic dataset
We used network traffic data obtained from the MAWI repository [41], which archives traffic datasets collected from the upstream ISP link of the WIDE consortium (a Japanese academic network connecting universities and research institutes [42]). Specifically, we used the 75-minute TCP traffic of 10th December 2014. This data was used in (the last) two of the three scenarios described below. The processed dataset had more than 370,000 flows (connections) in count. The first hour of traffic was used for training, while the last 15 minutes were used for testing. To ensure that the training data was free from attacks, we segregated the traffic by type of maliciousness and removed all attack traffic that we found. This data segregation was performed in two stages. In the first stage, we identified and removed flows that did not conform to the standard TCP protocol. The flows that remain would be free from common attacks such as TCP network scans, TCP port scans, and TCP floods. In the second stage, we processed the remaining TCP flows and removed a number of attacks by identifying signatures and commonly repeated patterns of certain attacks (such as SSH brute force [43]), aided by visualization tools. This process involved manual inspection.

Internet traffic (say, of an enterprise network) can be split into a number of time-windows, such that in each time-window the traffic is assumed to be stationary [44], [45]. The length of such time-windows is in the range of minutes, and not more than an hour or two. In practice, we need a model for each such time-window, and these time-windows will be used across days. For example, if we have a model for each hour of the day, the model for 9:00-10:00 will be used across days of the week. However, separate models are also required for weekdays and weekends. For the testing datasets, we used the last 15 minutes of clean data as a base, and injected different attacks with varying intensities by sampling the attack traffic that was segregated earlier. The interval of attack for each of the four testing datasets lasts approximately one minute; the different attack intensities are given below in Section VII-C3.

B. Scenarios for experiments
We consider three scenarios differing in the types of data used for the experiments:
1) Scenario 1: Synthetic data: We generate synthetic data for the transition matrices $Q_n$ and $Q_a$. That is, this scenario does not use real network traffic traces for experimentation. This allows us to control the parameters of the system and understand the performance of our algorithms. The values of the elements of the transition matrices were chosen randomly from a uniform distribution, and normalized in order to obtain a valid transition probability matrix. Experiments on this set of data allow us to evaluate the detection performance of the GLRT approach in Eq. (4) compared to the optimal and ideal case of the LRT in Eq. (3), and to quantify the (loss in) detection performance. In addition, we quantify the impact of a model mismatch under H0 between the true transition matrix under normal conditions, $Q_n$, and the assumed one, say $\hat{Q}_n$. As mentioned earlier, the transition matrix $Q_n$ can be learned from labelled normal traffic, and the covariates can be used to incorporate time-varying non-stationary features. However, since $\hat{Q}_n$ is the estimated value of $Q_n$, these matrices could be different. Such a problem is very likely with applications generating huge amounts of data, because (with increasing data) it is practically challenging to label each and every data item correctly. Even a moderate-size network generates tens of gigabytes of traffic per minute. It is therefore important to study the impact of model mismatch on the overall performance of the detection algorithms.
2) Scenario 2: Semi-synthetic data: In this scenario, traffic models for both normal and attack flows are learned and built from the MAWI network traces. The transition matrices $Q_n$ and $Q_a$ are obtained from an off-line learning phase (maximum likelihood estimator as per Eq. (5)); subsequently, based on these two matrices, the dataset for this scenario is generated. These experiments allow us to compare the performance of our algorithms under realistic, yet controlled conditions, and as a function of various system parameters, such as the length of state-paths, the number of flows, etc. In addition, we also evaluate the convergence of the Cross Entropy method and its ability to correctly detect which flows are anomalous.
3) Scenario 3: Real dataset: The experiments in this scenario use real network traffic flows. This traffic dataset contains both normal and anomalous TCP traffic flows from the MAWI dataset, as described in Section VII-A. The labelled anomalous traffic used in this scenario comprises port scans and network scans, TCP SYN flooding, and brute force attacks on the following application protocols: SSH (secure shell), TELNET, RDP (remote desktop protocol), MySQL and SMTP (mail). We will test the detection performance of our algorithms and make recommendations regarding their suitability to handle real TCP traffic.
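As an illustration of the Scenario 1 setup described above, the following Python sketch draws a random transition matrix with uniformly distributed, row-normalised entries, simulates a state-path from it, and evaluates the stationary-weighted KL divergence between two matrices that is used in Section VII-C. The function names and the fixed initial state `s0` are assumptions of this sketch; the paper does not specify how the initial state is drawn.

```python
import numpy as np

def random_transition_matrix(J, rng):
    """Uniform random entries, each row normalised to sum to one (Scenario 1)."""
    Q = rng.uniform(size=(J, J))
    return Q / Q.sum(axis=1, keepdims=True)

def stationary_distribution(Q):
    """Left eigenvector of Q for the eigenvalue closest to 1, scaled to sum to one."""
    vals, vecs = np.linalg.eig(Q.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def kl_divergence(Q_n, Q_a):
    """D(Q_n || Q_a) = sum_i pi_i sum_j [Q_n]_ij log([Q_n]_ij / [Q_a]_ij).

    Assumes all entries of both matrices are strictly positive.
    """
    pi = stationary_distribution(Q_n)
    return float(np.sum(pi[:, None] * Q_n * np.log(Q_n / Q_a)))

def sample_state_path(Q, T, rng, s0=0):
    """Draw one state-path of length T from the chain Q, starting in state s0 (assumed)."""
    path = [s0]
    for _ in range(T - 1):
        path.append(rng.choice(Q.shape[0], p=Q[path[-1]]))
    return np.array(path)
```

A pair of matrices with a prescribed divergence (for instance 0.01 or 0.04) could then be obtained by repeated sampling or by perturbing $Q_n$ until `kl_divergence` returns the desired value; the paper does not state which procedure was used.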

Fig. 1: Performance comparison between the Cross Entropy method and the Flow Aggregation algorithm for two KL divergence values: (a) $D(Q_n \| Q_a) = 0.01$, (b) $D(Q_n \| Q_a) = 0.04$. [Each panel shows ROC curves (probability of detection versus probability of false alarm) for T = 50 and T = 100; curves: LRT: Cross Entropy, LRT: Flow Aggregation, GLRT: Cross Entropy, GLRT: Flow Aggregation.]

We study these important aspects as a function of the KL divergence between the transition matrices $Q_n$ and $Q_a$ [46], given by
\[ D(Q_n \| Q_a) = \sum_{(i,j) \in \Omega} \pi_i [Q_n]_{i,j} \log\left(\frac{[Q_n]_{i,j}}{[Q_a]_{i,j}}\right), \]
where $\pi_i$, $i \in \{1, \ldots, J\}$, is the stationary distribution of $Q_n$. We consider a set of K = 10 flows, out of which three are anomalous, and consider state-path lengths of T = {50, 100}. The Markov chains have J = 4 states, and we consider two sets of transition matrices, for which the KL divergence values between $Q_n$ and $Q_a$ are x and 4x. The larger difference between $Q_n$ and $Q_a$ in the second case should therefore ideally translate into better detection performance.

GLRT vs. LRT detection performance: Figures 1(a) and 1(b) present the ROC comparisons between the LRT and the GLRT, for varying lengths of state-paths and two values of the KL divergence. It is clear that the performance gap between the LRT and the GLRT is quite significant in both cases, but becomes smaller as the length of the state-path increases. This is because, under the GLRT, we calculate the MLE of the unknown matrix $Q_a$ as given in Eq. (5), and the quality of the estimator improves with increasing length of the state-paths. As pointed out earlier, as the number of samples goes to infinity, the MLE converges to the true value of the transition matrix, and therefore the GLRT is asymptotically optimal [35]. We also note that the

C. Results
The detection performance is quantified via the Receiver Operating Characteristic (ROC) curves, which depict the probability of detection against the probability of false alarm for various threshold settings γ, such that
• Probability of detection := $\Pr\left(\Lambda(S_{1:T}^{(1:K)}) \ge \gamma \,|\, H_1\right)$,
• Probability of false alarm := $\Pr\left(\Lambda(S_{1:T}^{(1:K)}) \ge \gamma \,|\, H_0\right)$.

1) Scenario 1 - Synthetic data: In this section, we study and compare the performance of the two algorithms, namely the Cross Entropy method (Section V) and the Flow Aggregation algorithm (Section VI), for generic temporally correlated data. The purpose of this study is to understand the effects on the detection performance for the following cases:
• What is the performance loss incurred by using the GLRT in Eq. (4), compared to the ideal and impractical case of the LRT in Eq. (3)? Since the LRT is the UMP test (according to the Neyman-Pearson Lemma), it acts as an upper bound on the detection performance.
• When there is a model mismatch, or in other words, when the learned transition matrix $Q_n$ is not accurate, what is the performance loss incurred?
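The ROC definitions above translate directly into an empirical estimate from Monte Carlo runs. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the test statistics collected under H0 and H1 are oriented so that larger values favour the anomaly hypothesis (for a statistic with the opposite orientation, its negative can be passed instead).

```python
import numpy as np

def empirical_roc(scores_h0, scores_h1, thresholds=None):
    """Empirical (false alarm, detection) probabilities over a sweep of thresholds.

    scores_h0 / scores_h1: test statistics from runs generated under H0 / H1,
    oriented so that larger values favour 'anomaly' (assumption of this sketch).
    """
    scores_h0 = np.asarray(scores_h0, dtype=float)
    scores_h1 = np.asarray(scores_h1, dtype=float)
    if thresholds is None:
        # Use every observed value as a threshold, from strictest to most lenient.
        thresholds = np.sort(np.concatenate([scores_h0, scores_h1]))[::-1]
    p_fa = np.array([(scores_h0 >= g).mean() for g in thresholds])
    p_d = np.array([(scores_h1 >= g).mean() for g in thresholds])
    return p_fa, p_d

def detection_at_false_alarm(scores_h0, scores_h1, target_p_fa):
    """Largest empirical detection probability with false-alarm rate <= target."""
    p_fa, p_d = empirical_roc(scores_h0, scores_h1)
    ok = p_fa <= target_p_fa
    return float(p_d[ok].max()) if ok.any() else 0.0
```

The second function corresponds to reading a ROC curve at a fixed false-alarm probability, which is how a single operating point (for example, a false-alarm probability of 0.1 or 0.4) can be compared across settings.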

Fig. 4: Effect of model mismatch, $D(Q_n \| \hat{Q}_n) = 0.088$. [Two panels, T = 50 and T = 100, each showing ROC curves (probability of detection versus probability of false alarm); curves: Flow Aggregation and Cross Entropy, each with and without mismatch.]

algorithms. We quantify the distance between the true and estimated matrices via the KL divergence and set it to $D(Q_n \| \hat{Q}_n) = 0.088$. The distance between $Q_n$ and $Q_a$ was kept at close to double $D(Q_n \| \hat{Q}_n)$. Fig. 4 presents the ROC curves for this scenario, as a function of the state-path length. Observe that, as the length of the state-paths increases, the performance gap due to the error in estimating the transition matrix $Q_n$ gets larger. This example illustrates the importance of obtaining a correct estimate of the transition matrix $Q_n$.

2) Scenario 2 - Semi-synthetic data: In this section, we present the detection performance of our algorithms, where we generate TCP state-path realizations based on the transition matrices of the real dataset. To this end, we first estimated the transition matrix $Q_n$ from the real dataset. We chose the SSH brute force attack for the anomalous traffic. We identified 19 states of the TCP state machine; these include the standard states such as SYN, SYN+ACK, FIN, DATA, etc. States in the forward and reverse directions are treated as different. We introduced an artificial state, OTHER, to include all new states in the testing phase (that were not observed in the training phase). Therefore, we have Ω = {1, ..., 19}. We then emulated these traffic patterns 1000 times, for K = 100 flows and various lengths of state-paths, T = {5, 10, 20} states, with two attack intensities (percentage of anomalous flows), $|K_a|/K = \{5\%, 10\%\}$. For the CE method, we used N = 100 Monte Carlo samples, quantile level ρ = 0.9, learning parameter α = 0.1 and R = 50 iterations of the algorithm. We note that the KL divergence between the normal transition matrix and the anomalous transition matrix is $D(Q_n \| Q_a) = 1.62$. The ROC curves are presented in Fig. 5 for different values of T.
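To put these settings in perspective against the brute-force search discussed in Section V-C, a rough count (an illustration based only on the parameter values quoted above, not a measured result) is:

```latex
\begin{align*}
\text{CE candidate subsets evaluated} &= N \times R = 100 \times 50 = 5{,}000, \\
\text{CE runtime order for } T = 20 &= N \times R \times K \times T
  = 100 \times 50 \times 100 \times 20 = 10^{7}, \\
\text{subsets in a brute-force search} &= 2^{K} = 2^{100} \approx 1.27 \times 10^{30}.
\end{align*}
```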
In addition, we have also implemented a detection algorithm which follows a flow-by-flow approach (i.e., independent analysis of each flow) and is based on the Hoeffding test [16]. Such an approach was recently used for network anomaly detection [11], [47]. For a fair comparison, the feature used in our implementation of the flow-by-flow approach is TCP's FSM, as in our model. Similarly, the important parameter T takes the same value as in our algorithms. In addition, our implementation of the flow-by-flow approach compares the theoretical stationary probabilities of the Markov chain (corresponding to the state-paths of

Fig. 2: Effect of model mismatch on detection probabilities when false alarm probability is set to 0.1. [Probability of detection versus $D(Q_n \| Q_a)$, from 0.01 to 0.1; curves: LRT and GLRT variants of the Cross Entropy and Flow Aggregation algorithms, for T = 50 and T = 100.]

Fig. 3: Effect of model mismatch on detection probabilities when false alarm probability is set to 0.4. [Probability of detection versus $D(Q_n \| Q_a)$, from 0.01 to 0.1; curves: LRT and GLRT variants of the Cross Entropy and Flow Aggregation algorithms, for T = 50 and T = 100.]

detection performance increases as the gap between the transition matrices $Q_n$ and $Q_a$ increases. Figs. 2-3 present the detection probabilities as a function of $D(Q_n \| \hat{Q}_n)$ for two choices of the probability of false alarm, $P_{fa} = \{0.1, 0.4\}$. The results clearly show that, as the distance between $Q_n$ and $Q_a$ increases, the detection probabilities improve for all values of the probability of false alarm.

Sensitivity analysis to model mismatch in $Q_n$: We had assumed that the values of the transition matrix under normal behaviour are known a priori (e.g., from historical data). It is important to quantify the impact that a mismatch between the true transition matrix $Q_n$ and the assumed one $\hat{Q}_n$ would have on the performance of the algorithms. To this end, we generate the samples from $Q_n$, but use a different transition matrix $\hat{Q}_n$ (hence, a model mismatch) to evaluate the GLRT in our

[Fig. 5, panels (a) State-path length T = 5 and (b) State-path length T = 10: ROC curves (probability of detection versus probability of false alarm); curves: Cross Entropy and Flow Aggregation at 5% and 10% abnormal flows, and Independent Flows Detection at 10% abnormal flows.]

Fig. 6: Detection performance under SSH attack. [ROC curves (probability of detection versus probability of false alarm); curves: Cross Entropy and Flow Aggregation at 5%, 10% and 20% abnormal flows.]

Fig. 7: Detection performance under RDP attack. [ROC curves (probability of detection versus probability of false alarm); curves: Cross Entropy and Flow Aggregation at 5%, 10% and 20% abnormal flows.]

rate of 10%, for a state-path length of 20. We also observe that the proposed algorithms perform much better than the flow-by-flow approach. This is because the estimation of the transition matrix $Q_n$ in the flow-by-flow approach cannot properly capture the variance in the state-paths of flows, and it therefore performs poorly.

3) Scenario 3 - Real dataset: We now present the detection performance of our algorithms on the real dataset obtained from the MAWI archive (see Sections VII-A and VII-B for details on the traffic dataset used). Tracking and storing the state information of an entire flow has practical limitations; it also increases the time to detect. In this scenario, we therefore experiment on partial flows: flows are segregated into multiple non-overlapping partial flows. Partial flows are also natural, as we process flows in time-windows. A single flow can be spread across multiple time-windows; hence segregation of flows into partial flows is essential. With a slight abuse of notation, T, the length of the state-paths of partial flows, is set to three. This means that a flow with a state-path of length nine is partitioned into three partial flows, each with a state-path length of three. In all the experiments, we used K = 100. We vary the rates of intensity of attack to be $|K_a|/K = \{5\%, 10\%, 20\%\}$. In Figures 6-10, we present the ROC curves for the brute force attacks on the different

[Fig. 5, panel (c) State-path length T = 20: ROC curves (probability of detection versus probability of false alarm); curves: Cross Entropy and Flow Aggregation at 5% and 10% abnormal flows, and Independent Flows Detection at 10% abnormal flows.]

Fig. 5: ROC curves under SSH brute force attack

flows) under H0 with the empirical estimate of the stationary probabilities of the Markov chain, on a per-flow basis. Compared to the previous scenario, we have used smaller values for T in this scenario, because in practice the length of the state-paths of attack flows is quite short. We observe that, as before, the Cross Entropy method is superior to the Flow Aggregation algorithm for all values of T considered, and that the overall performance improves with longer state-paths. The detection probability of both algorithms is more than 0.95 at a small false-alarm

Fig. 8: Detection performance under MySQL attack. [ROC curves (probability of detection versus probability of false alarm); curves: Cross Entropy and Flow Aggregation at 5%, 10% and 20% abnormal flows.]

Fig. 9: Detection performance under SMTP attack. [ROC curves (probability of detection versus probability of false alarm); curves: Cross Entropy and Flow Aggregation at 5%, 10% and 20% abnormal flows.]

Fig. 10: Detection performance under Telnet attack. [ROC curves (probability of detection versus probability of false alarm); curves: Cross Entropy and Flow Aggregation at 5%, 10% and 20% abnormal flows.]

Fig. 11: Attribution of anomalous events to flows: performance of the Cross Entropy method for various attack intensities. [Three panels, for intensity of attack = 5%, 10% and 20%, each showing the probabilities of false alarm and detection for SSH, TELNET, RDP, MYSQL, SMTP, SCANS and SYN FLOOD.]

protocols mentioned earlier, for the two algorithms we have developed. The other anomalies, namely, network and port scans and TCP SYN flooding were easily detected with negligible false alarms; due to limitation of space, we do not plot the corresponding ROCs here. We observe that the performance of the CE algorithm is slightly better than the Flow Aggregation algorithm in all cases. We also observe that for intensity of attack of 10% and above, the algorithms have a very high detection rate with low probability of false alarm. Attribution detection performance: Next, Fig. 11 summarises the performance of the Cross entropy method in attributing anomalous flows successfully, for three intensity rates of attacks. These results were obtained for a specific threshold value Ψ, that gave the best performance in terms of accuracy and false positives. It shows that, for all types of attacks considered, the probability of detection is quite high. We also observe that, with increasing attack intensity, the probability of false alarm decreases to around 0.2. Computational time of the algorithms: Finally, we compare the execution time of our algorithms. To this end, we implemented the algorithms in Matlab 8.1, and used the Parallel Computing Toolbox which allows to execute Steps 1-3 of Algorithm 1 in parallel. We performed

experiments on a server running Intel Xeon Processor E51630 v4 (4 cores, 3.7GHz) with 16GB RAM. The results are presented in Fig. 12. Evidently, the computational time of the Flow Aggregation algorithm is significantly smaller (by orders of magnitude) than the Cross Entropy method. For Cross Entropy method, the computational time increases (approximately) by an order of magnitude, when the number of flows is increased from 100 to 1000, demonstrating a linear relationship of time on the number of flows. D. Discussion Based on the extensive experiments conducted using synthetic, semi-synthetic and real traffic data, we make the following conclusions regarding the two algorithms: 1) Both algorithms perform very well (high detection and low false alarm probabilities) even for short state-path lengths, as well as when the portion of flows under attack is small. 2) The Cross Entropy based algorithm not only performs better than the flow aggregation based algorithm, it is also able to attribute which subset of flows is anomalous with high accuracy (see Fig. 11). 12

In this work, we modeled TCP’s FSM as it naturally aligns with a Markov Chain. However, our solution can also be used to model other features that are helpful in detecting different network attacks. For example, traffic rate can be modeled using Markov Chain, as the current rate of flows (or sessions) is dependent on the previous rate (assuming discrete time slots). Such a feature is useful in detecting DDoS attacks. The feature can be broken down to protocol levels, to detect attacks specific to protocols, such as DNS for DNS-based DDoS reflection attack. Hence, one potential future direction is to explore and study the effectiveness of other features (that can be modeled using Markov Chain). We are also interested in extending our algorithms to the case where the traffic characteristics of normal traffic is not known (or not available). This would translate to the transition matrix H0 being unknown, and therefore needs to be estimated from the data (as described in Section IV-B).

1

10

0

Computation time [Sec]

10

LRT: Cross Entropy, T=50 GLRT: Cross Entropy, T=50 LRT: Cross Entropy, T=100 GLRT: Cross Entropy, T=100 LRT: Flow Aggregation, T=50 GLRT: Flow Aggregation, T=50 LRT: Flow Aggregation, T=100 GLRT: Flow Aggregation, T=100

−1

10

Cross entropy −2

10

−3

10

Flow aggregation −4

10

100

200

300

400

500

600

700

800

900

1000

Number of flows

Fig. 12: Computational time of the algorithms, as a function of number of flows

A CKNOWLEDGMENT This material is based on research work supported by the Singapore National Research Foundation under NCR Award No. NRF2014NCR-NCR001-034.

This of course comes at the price of higher computational complexity, as presented in Section V-C. 3) The performance of the CE based algorithm heavily depends on the number of Monte Carlo samples N . While we do not make specific recommendations, we empirically observed that in order to maintain the same performance for different number of flows, N should grow linearly with the number of flows. 4) In all the real data simulations, very good detection performance was achieved when the intensity of attacks was 10% or more. The two algorithms complement each other; therefore, the flow aggregation algorithm can be used to detect anomalies when the computational resources do not scale in proportion to the network size. Cross Entropy method can then be triggered, whenever there is an anomaly, to detect the anomalous flows related to the event.


VIII. CONCLUSIONS
In this work, we developed a new statistical framework for anomaly detection in communication networks with temporally correlated traffic, via Markov Chain modelling of the monitored traffic features. We formulated the optimal anomaly detection problem as the Neyman-Pearson Likelihood Ratio Test and developed two low-complexity detection algorithms. The first algorithm, based on the Cross Entropy method, not only detects the existence of an anomaly but also attributes it to the subset of flows that are anomalous. The second algorithm is based on flow aggregation, which allows for a compact low dimensional representation of the raw traffic flows. We evaluated the detection performance of our algorithms via extensive simulations using synthetic, semi-synthetic and real data, and demonstrated that good performance (in terms of high detection probability and low false alarm probability) is obtained even for short state-paths, as well as when the portion of anomalous flows is small.

