Dynamic Anomalography: Tracking Network Anomalies via Sparsity and Low Rank^\dag

# Dynamic Anomalography: Tracking Network Anomalies via Sparsity and Low Rank†

## Abstract

In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt unusual changes known as traffic volume anomalies, which can result in congestion and limit the extent to which end-user quality of service requirements are met. As a means of maintaining seamless end-user experience in dynamic environments, as well as for ensuring network security, this paper deals with a crucial network monitoring task termed dynamic anomalography. Given link traffic measurements (noisy superpositions of unobserved OD flows) periodically acquired by backbone routers, the goal is to construct an estimated map of anomalies in real time, and thus summarize the network ‘health state’ along both the flow and time dimensions. Leveraging the low intrinsic-dimensionality of OD flows and the sparse nature of anomalies, a novel online estimator is proposed based on an exponentially-weighted least-squares criterion regularized with the sparsity-promoting -norm of the anomalies, and the nuclear norm of the nominal traffic matrix. After recasting the non-separable nuclear norm into a form amenable to online optimization, a real-time algorithm for dynamic anomalography is developed and its convergence established under simplifying technical assumptions. For operational conditions where computational complexity reductions are at a premium, a lightweight stochastic gradient algorithm based on Nesterov’s acceleration technique is developed as well. Comprehensive numerical tests with both synthetic and real network data corroborate the effectiveness of the proposed online algorithms and their tracking capabilities, and demonstrate that they outperform state-of-the-art approaches developed to diagnose traffic anomalies.

Submitted:

## 1Introduction

Communication networks have evolved from specialized, research and tactical transmission systems to large-scale and highly complex interconnections of intelligent devices. Thus, ensuring compliance to service-level agreements and quality-of-service (QoS) guarantees necessitates ground-breaking management and monitoring tools providing operators with a comprehensive and updated view of the network landscape. Situational awareness provided by such tools will be a key enabler for effective information dissemination, routing and congestion control, network health management, and security assurance.

In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt unusual changes which can result in congestion, and limit QoS provisioning of the end users. These so-termed traffic volume anomalies could be due to unexpected failures in networking equipment, cyberattacks (e.g., denial of service (DoS) attacks), or, intruders which hijack the network services [35]. Unveiling such anomalies in a promptly manner is a crucial monitoring task towards engineering network traffic. This is a challenging task however, since the available data are usually high-dimensional, noisy and possibly incomplete link-load measurements, which are the superposition of unobservable OD flows. Several studies have experimentally demonstrated the low intrinsic dimensionality of the nominal traffic subspace, that is, the intuitive low-rank property of the traffic matrix in the absence of anomalies, which is mainly due to common temporal patterns across OD flows, and periodic behavior across time [21]. Exploiting the low-rank structure of the anomaly-free traffic matrix, a landmark principal component analysis (PCA)-based method was put forth in [20] to identify network anomalies; see also [27] for a distributed implementation. A limitation of the algorithm in [20] is that it cannot identify the anomalous flows. Most importantly, [20] has not exploited the sparsity of anomalies across flows and time – anomalous traffic spikes are rare, and tend to last for short periods of time relative to the measurement horizon.

Capitalizing on the low-rank property of the traffic matrix and the sparsity of the anomalies, the fresh look advocated here permeates benefits from rank minimization [11], and compressive sampling [12], to perform dynamic anomalography. The aim is to construct a map of network anomalies in real time, that offers a succinct depiction of the network ‘health state’ across both the flow and time dimensions (Section 2). Special focus will be placed on devising online (adaptive) algorithms that are capable of efficiently processing link measurements and track network anomalies ‘on the fly’; see also [3] for a ‘model-free’ approach that relies on the kernel recursive LS (RLS) algorithm. Accordingly, the novel online estimator entails an exponentially-weighted least-squares (LS) cost regularized with the sparsity-promoting -norm of the anomalies, and the nuclear norm of the nominal traffic matrix. After recasting the non-separable nuclear norm into a form amenable to online optimization (Section 3.1), a real-time algorithm for dynamic anomalography is developed in Section 4 based on alternating minimization. Each time a new datum is acquired, anomaly estimates are formed via the least-absolute shrinkage and selection operator (Lasso), e.g, [17], and the low-rank nominal traffic subspace is refined using RLS [34]. Convergence analysis is provided under simplifying technical assumptions in Section 4.2. For situations were reducing computational complexity is critical, an online stochastic gradient algorithm based on Nesterov’s accelaration technique [5] is developed as well (Section 5.1). The possibility of implementing the anomaly trackers in a distributed fashion is further outlined in Section 5.2, where several directions for future research are also delineated.

Extensive numerical tests involving both synthetic and real network data corroborate the effectiveness of the proposed algorithms in unveiling network anomalies, as well as their tracking capabilities when traffic routes are slowly time-varying, and the network monitoring station acquires incomplete link traffic measurements (Section 6). Different from [40] which employs a two-step batch procedure to learn the nominal traffic subspace first, and then unveil anomalies via -norm minimization, the approach here estimates both quantities jointly and attains better performance as illustrated in Section 6.2. Concluding remarks are given in Section 7, while most technical details relevant to the convergence proof in Section 4.3 are deferred to the Appendix.

: Bold uppercase (lowercase) letters will denote matrices (column vectors), and calligraphic letters will be used for sets. Operators , , , , and , will denote transposition, matrix trace, minimum eigenvalue, projection onto the nonnegative orthant, and expectation, respectively; will be used for the cardinality of a set, and the magnitude of a scalar. The positive semidefinite matrix will be denoted by . The -norm of is for . For two matrices , denotes their trace inner product. The Frobenious norm of matrix is , is the spectral norm, is the -norm, and is the nuclear norm, where denotes the -th singular value of . The identity matrix will be represented by , while will stand for the vector of all zeros, and .

## 2Modeling Preliminaries and Problem Statement

Consider a backbone Internet protocol (IP) network naturally modeled as a directed graph , where and denote the sets of nodes (routers) and physical links of cardinality and , respectively. The operational goal of the network is to transport a set of OD traffic flows (with ) associated with specific source-destination pairs. For backbone networks, the number of network layer flows is much larger than the number of physical links . Single-path routing is adopted here, that is, a given flow’s traffic is carried through multiple links connecting the corresponding source-destination pair along a single path. Let , denote the flow to link assignments (routing), which take the value one whenever flow is carried over link , and zero otherwise. Unless otherwise stated, the routing matrix is assumed fixed and given. Likewise, let denote the unknown traffic rate of flow at time , measured in e.g., Mbps. At any given time instant , the traffic carried over link is then the superposition of the flow rates routed through link , i.e., .

It is not uncommon for some of the flow rates to experience unusual abrupt changes. These so-termed traffic volume anomalies are typically due to unexpected network failures, or cyberattacks (e.g., DoS attacks) which aim at compromising the services offered by the network [35]. Let denote the unknown traffic volume anomaly of flow at time . In the presence of anomalous flows, the measured traffic carried by link over a time horizon is then given by

where the noise variables account for measurement errors and unmodeled dynamics.

In IP networks, traffic volume can be readily measured on a per-link basis using off-the-shelf tools such as the simple network management protocol (SNMP) supported by most routers. Missing entries in the link-level measurements may however skew the network operator’s perspective. SNMP packets may be dropped for instance, if some links become congested, rendering link count information for those links more important, as well as less available [32]. To model missing link measurements, collect the tuples associated with the available observations in the set . Introducing the matrices , and , the (possibly incomplete) set of measurements in can be expressed in compact matrix form as

where the sampling operator sets the entries of its matrix argument not in to zero, and keeps the rest unchanged. Matrix contains the nominal traffic flows over the time horizon of interest. Common temporal patterns among the traffic flows in addition to their periodic behavior, render most rows (respectively columns) of linearly dependent, and thus typically has low rank. This intuitive property has been extensively validated with real network data; see e.g, [21]. Anomalies in are expected to occur sporadically over time, and last shortly relative to the (possibly long) measurement interval . In addition, only a small fraction of the flows is supposed to be anomalous at a any given time instant. This renders the anomaly traffic matrix sparse across both rows (flows) and columns (time).

Given measurements adhering to and the binary-valued routing matrix , the main goal of this paper is to accurately estimate the anomaly matrix , by capitalizing on the sparsity of and the low-rank property of . Special focus will be placed on devising online (adaptive) algorithms that are capable of efficiently processing link measurements and tracking network anomalies in real time. This critical monitoring task is termed dynamic anomalography, and the resultant estimated map offers a depiction of the network’s ‘health state’ along both the flow and time dimensions. If , the -th flow at time is deemed anomalous, otherwise it is healthy. By examining the network operator can immediately determine the links carrying the anomalous flows. Subsequently, planned contingency measures involving traffic-engineering algorithms can be implemented to address network congestion.

## 3Unveiling Anomalies via Sparsity and Low Rank

Consider the nominal link-count traffic matrix , which inherits the low-rank property from . Since the primary goal is to recover , the following observation model

can be adopted instead of . A natural estimator leveraging the low rank property of and the sparsity of will be sought next. The idea is to fit the incomplete data to the model in the least-squares (LS) error sense, as well as minimize the rank of , and the number of nonzero entries of measured by its -(pseudo) norm. Unfortunately, albeit natural both rank and -norm criteria are in general NP-hard to optimize [28]. Typically, the nuclear norm and the -norm are adopted as surrogates, since they are the closest convex approximants to and , respectively [30]. Accordingly, one solves

where are rank- and sparsity-controlling parameters. When an estimate of the noise variance is available, guidelines for selecting and have been proposed in [42]. Being convex (P1) is appealing, and it is known to attain good performance in theory and practice [25]. Also and its estimator (P1) are quite general, as discussed in the ensuing remark.

Albeit convex, (P1) is a non-smooth optimization problem (both the nuclear and -norms are not differentiable at the origin). In addition, scalable algorithms to unveil anomalies in large-scale networks should effectively overcome the following challenges: (c1) the problem size can easily become quite large, since the number of optimization variables is ; (c2) existing iterative solvers for (P1) typically rely on costly SVD computations per iteration; see e.g., [25]; and (c3) different from the Frobenius and -norms, (columnwise) nonseparability of the nuclear-norm challenges online processing when new columns of arrive sequentially in time. In the remainder of this section, the ‘big data’ challenges (c1) and (c2) are dealt with to arrive at an efficient batch algorithm for anomalography. Tracking network anomalies is the main subject of Section Section 4.

To address (c1) and reduce the computational complexity and memory storage requirements of the algorithms sought, it is henceforth assumed that an upper bound is a priori available [ is the estimate obtained via (P1)]. As argued next, the smaller the value of , the more efficient the algorithm becomes. Small values of are well motivated due to the low intrinsic dimensionality of network flows [20]. Because , (P1)’s search space is effectively reduced and one can factorize the decision variable as , where and are and matrices, respectively. It is possible to interpret the columns of (viewed as points in ) as belonging to a low-rank ‘nominal traffic subspace’, spanned by the columns of . The rows of are thus the projections of the columns of onto the traffic subspace.

Adopting this reparametrization of in (P1), one arrives at an equivalent optimization problem

which is non-convex due to the bilinear terms . The number of variables is reduced from in (P1), to in (P2). The savings can be significant when is small, and both and are large. Note that the dominant -term in the variable count of (P2) is due to , which is sparse and can be efficiently handled even when both and are large.

### 3.1A separable low-rank regularization

To address (c2) [along with (c3) as it will become clear in Section 4], consider the following alternative characterization of the nuclear norm [30]

The optimization is over all possible bilinear factorizations of , so that the number of columns of and is also a variable. Leveraging , the following reformulation of (P2) provides an important first step towards obtaining an online algorithm:

As asserted in [24], adopting the separable Frobenius-norm regularization in (P3) comes with no loss of optimality relative to (P1), provided . By finding the global minimum of (P3) [which could have considerably less variables than (P1)], one can recover the optimal solution of (P1). However, since (P3) is non-convex, it may have stationary points which need not be globally optimum. Interestingly, the next proposition shows that under relatively mild assumptions on and the noise variance, every stationary point of (P3) is globally optimum for (P1). For a proof, see [24].

Let be a stationary point of (P3). If , then is the globally optimal solution of (P1).

The qualification condition captures tacitly the role of . In particular, for sufficiently small the residual becomes large and consequently the condition is violated [unless is large enough, in which case a sufficiently low-rank solution to (P1) is expected]. The condition on the residual also implicitly enforces , which is necessary for the equivalence between (P1) and (P3). In addition, the noise variance affects the value of , and thus satisfaction of the said qualification inequality.

### 3.2Batch block coordinate-descent algorithm

The block coordinate-descent (BCD) algorithm is adopted here to solve the batch non-convex optimization problem (P3). BCD is an iterative method which has been shown efficient in tackling large-scale optimization problems encountered with various statistical inference tasks, see e.g., [6]. The proposed solver entails an iterative procedure comprising three steps per iteration

[S1]

Update the anomaly map:

[S2]

Update the nominal traffic subspace:

[S3]

Update the projection coefficients:

To update each of the variable groups, the cost of (P3) is minimized while fixing the rest of the variables to their most up-to-date values. The minimization in [S1] decomposes over the columns of . At iteration , these columns are updated in parallel via Lasso

where and respectively denote the -th column of and , while the diagonal matrix contains a one on its -th diagonal entry if is observed, and a zero otherwise. In practice, each iteration of the proposed algorithm minimizes inexactly, by performing one pass of the cyclic coordinate-descent algorithm in [17]; see Algorithm ? for the detailed iterations. As shown at the end of this section, this inexact minimization suffices to claim convergence to a stationary point of (P3).

Similarly, in [S2] and [S3] the minimizations that give rise to and are separable over their respective rows. For instance, the -th row of the traffic subspace matrix is updated as the solution of the following ridge-regression problem

where and represent the -th row of and , respectively. The -th diagonal entry of the diagonal matrix is an indicator variable testing whether measurement is available. A similar regularized LS problem yields , ; see Algorithm ? for the details and a description of the overall BCD solver. The novel batch scheme for unveiling network anomalies is less complex computationally than the accelerated proximal gradient algorithm in [25], since Algorithm ?’s iterations are devoid of SVD computations.

Despite being non-convex and non-differentiable, (P3) has favorable structure which facilitates convergence of the iterates generated by Algorithm ?. Specifically, the resulting cost is convex in each block variable when the rest are fixed. The non-smooth -norm is also separable over the entries of its matrix argument. Accordingly, [37] guarantees convergence of the BCD algorithm to a stationary point of (P3). This result together with Proposition ? establishes the next claim.

If a subsequence of iterates generated by Algorithm ? satisfies , then it converges to the optimal solution set of (P1) as .

In practice, it is desirable to monitor anomalies in real time and accomodate time-varying traffic routes. These reasons motivate devising algorithms for dynamic anomalography, the subject dealt with next.

## 4Dynamic Anomalography

Monitoring of large-scale IP networks necessitates collecting massive amounts of data which far outweigh the ability of modern computers to store and analyze them in real time. In addition, nonstationarities due to routing changes and missing data further challenge identification of anomalies. In dynamic networks routing tables are constantly readjusted to effect traffic load balancing and avoid congestion caused by e.g., traffic anomalies or network infrastructure failures. To account for slowly time-varing routing tables, let denote the routing matrix at time 1. In this dynamic setting, the partially observed link counts at time adhere to [cf. ]

where the link-level traffic , for from the (low-dimensional) traffic subspace. In general, routing changes may alter a link load considerably by e.g., routing traffic completely away from a specific link. Therefore, even though the network-level traffic vectors live in a low-dimensional subspace, the same may not be true for the link-level traffic when the routing updates are major and frequent. In backbone networks however, routing changes are sporadic relative to the time-scale of data acquisition used for network monitoring tasks. For instance, data collected from the operation of Internet-2 network reveals that only a few rows of change per week [1]. It is thus safe to assume that still lies in a low-dimensional subspace, and exploit the temporal correlations of the observations to identify the anomalies.

On top of the previous arguments, in practice link measurements are acquired sequentially in time, which motivates updating previously obtained estimates rather than re-computing new ones from scratch each time a new datum becomes available. The goal is then to recursively estimate at time from historical observations , naturally placing more importance to recent measurements. To this end, one possible adaptive counterpart to (P3) is the exponentially-weighted LS estimator found by minimizing the empirical cost

in which is the so-termed forgetting factor. When data in the distant past are exponentially downweighted, which facilitates tracking network anomalies in nonstationary environments. In the case of static routing () and infinite memory , the formulation coincides with the batch estimator (P3). This is the reason for the time-varying factor weighting .

### 4.1Tracking network anomalies

Towards deriving a real-time, computationally efficient, and recursive solver of , an alternating minimization method is adopted in which iteration coincides with the time scale of data acquisition. A justification in terms of minimizing a suitable approximate cost function is discussed in detail in Section 4.2. Per time instant , a new datum is drawn and are jointly estimated via

It turns out that can be efficiently solved. Fixing to carry out the minimization with respect to first, one is left with an -norm regularized LS (ridge-regression) problem

Note that is an affine function of , and the update rule for is not well defined until is replaced with . Towards obtaining an expression for , define for notational convenience, and substitute back into to arrive at the Lasso estimator

where . The diagonal matrix was defined in Section 3.2, see the discussion after .

In the second step of the alternating-minimization scheme, the updated subspace matrix is obtained by minimizing with respect to , while the optimization variables are fixed and take the values . This yields

Similar to the batch case, decouples over the rows of which are obtained in parallel via

where denotes the -th diagonal entry of . For , subproblems can be efficiently solved using the RLS algorithm [34]. Upon defining , , and , with one simply updates

and forms , for .

However, for the regularization term in makes it impossible to express in terms of plus a rank-one correction. Hence, one cannot resort to the matrix inversion lemma and update with quadratic complexity only. Based on direct inversion of , , the overall recursive algorithm for tracking network anomalies is tabulated under Algorithm ?. The per iteration cost of the inversions (each , which could be further reduced if one leverages also the symmetry of ) is affordable for moderate number of links, because is small when estimating low-rank traffic matrices. Still, for those settings where computational complexity reductions are at a premium, an online stochastic gradient descent algorithm is described in Section 5.1.

### 4.2Convergence Analysis

This section studies the convergence of the iterates generated by Algorithm ?, for the infinite memory special case i.e., when . Upon defining the function

in addition to , the online solver of Section 4.1 aims at minimizing the following average cost function at time

Normalization (by ) ensures that the cost function does not grow unbounded as time evolves. For any finite , it is essentially identical to the batch estimator in (P3) up to a scaling, which does not affect the value of the minimizers. Note that as time evolves, minimization of becomes increasingly complex computationally. Even evaluating is challenging for large , since it entails solving Lasso problems to minimize all and define the functions , . Hence, at time the subspace estimate is obtained by minimizing the approximate cost function

in which are obtained based on the prior subspace estimate after solving [cf. ]

Obtaining this way resembles the projection approximation adopted in [39], and can only be evaluated after is obtained [cf. ]. Since is a smooth convex function, the minimizer is the solution of the quadratic equation .

So far, it is apparent that the approximate cost function overestimates the target cost , for . However, it is not clear whether the dictionary iterates converge, and most importantly, how well can they optimize the target cost function . The good news is that asymptotically approaches , and the subspace iterates null as well, both as . The latter result is summarized in the next proposition, which is proved in the next section.

Assume that: a1) and are independent and identically distributed (i.i.d.) random processes; a2) is uniformly bounded; a3) iterates are in a compact set ; a4) is positive definite, namely for some ; and a5) the Lasso has a unique solution. Then almost surely (a.s.), i.e., the subspace iterates asymptotically coincide with the stationary points of the batch problem (P3).

To clearly delineate the scope of the analysis, it is worth commenting on the assumptions a1)-a5) and the factors that influence their satisfaction. Regarding a1), the acquired data is assumed statistically independent across time as it is customary when studying the stability and performance of online (adaptive) algorithms [34]. Still, in accordance with the adaptive filtering folklore, as the upshot of the analysis based on i.i.d. data extends accurately to the pragmatic setting whereby the link-counts and missing data patterns exhibit spatiotemporal correlations. Uniform boundedness of [cf. a2)] is satisfied in practice, since the traffic is always limited by the (finite) capacity of the physical links. The bounded subspace requirement in a3) is a technical assumption that simplifies the arguments of the ensuing proof, and has been corroborated via computer simulations. It is apparent that the sampling set plays a key role towards ensuring that a4) and a5) are satisfied. Intuitively, if the missing entries tend to be only few and somehow uniformly distributed across links and time, they will not markedly increase coherence of the regression matrices , and thus compromise the uniqueness of the Lasso solutions. This also increases the likelihood that holds. As argued in [23], if needed one could incorporate additional regularization terms in the cost function to enforce a4) and a5). Before moving on to the proof, a remark is in order.

### 4.3Proof of Proposition

The main steps of the proof are inspired by [23], which studies convergence of an online dictionary learning algorithm using the theory of martingale sequences; see e.g., [22]. However, relative to [23] the problem here introduces several distinct elements including: i) missing data with a time-varying pattern ; ii) a non-convex bilinear term where the tall subspace matrix plays a role similar to the fat dictionary in [23], but the multiplicative projection coefficients here are not sparse; and iii) the additional bilinear terms which entail sparse coding of as in [23], but with a known regression (routing) matrix. Hence, convergence analysis becomes more challenging and demands, in part, for a new treatment. Accordingly, in the sequel emphasis will be placed on the novel aspects specific to the problem at hand.

The basic structure of the proof consists of three preliminary lemmata, which are subsequently used to establish that a.s. through a simple argument. The first lemma deals with regularity properties of functions and , which will come handy later on; see Appendix A for a proof.

If a2) and a5) hold, then the functions: i) , ii) , iii) , and iv) are Lipschitz continuous for ( is a compact set), with constants independent of .

The next lemma (proved in Appendix B) asserts that the distance between two subsequent traffic subspace estimates vanishes as , a property that will be instrumental later on when establishing that a.s.

If a2)-a5) hold, then .

The previous lemma by no means implies that the subspace iterates converge, which is a much more ambitious objective that may not even hold under the current assumptions. The final lemma however, asserts that the cost sequence indeed converges with probability one; see Appendix C for a proof.

If a1)-a5) hold, then converges a.s. Moreover, a.s.

Putting the pieces together, in the sequel it is shown that the sequence converges a.s. to zero, and since by algorithmic construction, the subspace iterates coincide with the stationary points of the target cost function . To this end, it suffices to prove that every convergent subsequence nulls the gradient asymptotically, which in turn implies that the entire sequence converges to the set of stationary points of the batch problem (P3).

Since is compact by virtue of a3), one can always pick a convergent subsequence whose limit point is , say2. Consider the positive-valued decreasing sequence (that necessarily converges to zero), and recall that for any . From the mean-value theorem and for arbitrary , expanding both sides of the inequality around the point one arrives at

for some and all . Taking limit as and applying Lemma ? it follows that

One can readily show that is bounded since is uniformly bounded [cf. a2)]. Consequently, . Furthermore, since is Lipschitz as per Lemma ?, is Lipschitz as well and it follows that . All in all, the second term in vanishes and one is left with

Because is arbitrary, can only hold if a.s., which completes the proof.

## 5Further Algorithmic Issues

For completeness, this section outlines a couple of additional algorithmic aspects relevant to anomaly detection in large-scale networks. Firstly, a lightweight first-order algorithm is developed as an alternative to Algorithm ?, which relies on fast Nesterov-type gradient updates for the traffic subspace. Secondly, the possibility of developing distributed algorithms for dynamic anomalography is discussed.

Reduction of the computational complexity in updating the traffic subspace is the subject of this section. The basic alternating minimization framework in Section 4.1 will be retained, and the updates for will be identical to those tabulated under Algorithm ?. However, instead of solving an unconstrained quadratic program per iteration to obtain [cf. ], the refinements to the subspace estimate will be given by a (stochastic) gradient algorithm.

As discussed in Section 4.2, in Algorithm ? the subspace estimate is obtained by minimizing the empirical cost function , where

By the law of large numbers, if data are stationary, solving yields the desired minimizer of the expected cost , where the expectation is taken with respect to the unknown probability distribution of the data. A standard approach to achieve this same goal – typically with reduced computational complexity – is to drop the expectation (or the sample averaging operator for that matter), and update the nominal traffic subspace via a stochastic gradient iteration [34]

where is a stepsize, , and . In the context of adaptive filtering, stochastic gradient algorithms such as are known to converge typically slower than RLS. This is expected since RLS can be shown to be an instance of Newton’s (second-order) optimization method [34].

Building on the increasingly popular accelerated gradient methods for (batch) smooth optimization [29], the idea here is to speed-up the learning rate of the estimated traffic subspace , without paying a penalty in terms of computational complexity per iteration. The critical difference between standard gradient algorithms and the so-termed Nesterov’s variant, is that the accelerated updates take the form , which relies on a judicious linear combination of the previous pair of iterates . Specifically, the choice , where , has been shown to significantly accelerate batch gradient algorithms resulting in convergence rate no worse than ; see e.g., [5] and references therein. Using this acceleration technique in conjunction with a backtracking stepsize rule [6], a fast online stochastic gradient algorithm for unveiling network anomalies is tabulated under Algorithm ?. Different from Algorithm ?, no matrix inversions are involved in the update of the traffic subspace . Clearly, a standard (non accelerated) stochastic gradient descent algorithm with backtracking stepsize rule is subsumed as a special case, when ,

Convergence analysis of Algorithm ? is beyond the scope of this paper, and will only be corroborated using computer simulations in Section 6. It is worth pointing out that since a non-diminishing stepsize is adopted, asymptotically the iterates generated by Algorithm ? will hover inside a ball centered at the minimizer of the expected cost, with radius proportional to the noise variance.

### 5.2In-network anomaly trackers

Implementing Algorithms ?- ? presumes that network nodes continuously communicate their local link traffic measurements to a central monitoring station, which uses their aggregation in to unveil network anomalies. While for the most part this is the prevailing operational paradigm adopted in current network technologies, it is fair to say there are limitations associated with this architecture. For instance, collecting all this information centrally may lead to excessive protocol overhead, especially when the rate of data acquisition is high at the routers. Moreover, minimizing the exchanges of raw measurements may be desirable to reduce unavoidable communication errors that translate to missing data. Performing the optimization in a centralized fashion raises robustness concerns as well, since the central monitoring station represents an isolated point of failure.

These reasons motivate devising fully-distributed iterative algorithms for dynamic anomalography in large-scale networks, embedding the network anomaly detection functionality to the routers. In a nutshell, per iteration nodes carry out simple computational tasks locally, relying on their own link count measurements (a few entries of the network-wide vector corresponding to the router links). Subsequently, local estimates are refined after exchanging messages only with directly connected neighbors, which facilitates percolation of local information to the whole network. The end goal is for network nodes to consent on a global map of network anomalies, and attain (or at least come close to) the estimation performance of the centralized counterpart which has all data available.

Relying on the alternating-directions method of multipliers (AD-MoM) as the basic tool to carry out distributed optimization, a general framework for in-network sparsity-regularized rank minimization was put forth in a companion paper [24]. In the context of network anomaly detection, results therein are encouraging yet there is ample room for improvement and immediate venues for future research open up. For instance, the distributed algorithms of [24] can only tackle the batch formulation (P3), so extensions to a dynamic network setting, e.g., building on the ideas here to devise distributed anomaly trackers seems natural. To obtain desirable tradeoffs in terms of computational complexity and speed of convergence, developing and studying algorithms for distributed optimization based on Nesterov’s acceleration techniques emerges as an exciting and rather pristine research direction; see [19] for early work dealing with separable batch optimization.

## 6Performance Tests

Performance of the proposed batch and online estimators is assessed in this section via computer simulations using both synthetic and real network data.

### 6.1Synthetic network data tests

Synthetic network example. A network of nodes is considered as a realization of the random geometric graph model with agents randomly placed on the unit square, and two agents link if their Euclidean distance is less than a prescribed communication range of ; see Fig. ?. The network graph is bidirectional and comprises links, and OD flows. For each candidate OD pair, minimum hop count routing is considered to form the routing matrix . Entries of are i.i.d., zero-mean, Gaussian with variance ; i.e., . Flow-traffic vectors are generated from the low-dimensional subspace with i.i.d. entries , and projection coefficients such that . Every entry of is randomly drawn from the set , with . Entries of are sampled uniformly at random with probability to form the diagonal sampling matrix . The observations at time instant are generated according to . Unless otherwise stated, , , and are used throughout. Different values of , and are tested.

Performance of the batch estimator. To demonstrate the merits of the batch BCD algorithm for unveiling network anomalies (Algorithm ?), simulated data are generated for a time interval of size . For validation purposes, the benchmark estimator (P1) is iteratively solved by alternating minimization over (which corresponds to Lasso) and . The minimizations with respect to can be carried out using the iterative singular-value thresholding (SVT) algorithm [7]. Note that with full data, SVT requires only a single SVD computation. In the presence of missing data however, the SVT algorithm may require several SVD computations until convergence, rendering the said algorithm prohibitively complex for large-scale problems. In contrast, Algorithm ? only requires simple inversions. Fig. ? (a) depicts the convergence of the respective algorithms used to solve (P1) and (P3), for different amounts of missing data (controlled by ). It is apparent that both estimators attain identical performance after a few tens of iterations, as asserted by Proposition ?. To corroborate the effectiveness of Algorithm ? in unveiling network anomalies across flows and time, Fig. ? (b) maps out the magnitude of the true and estimated anomalies when and . To discard spurious estimates, consider the hypothesis test , with anomalous and anomaly-free hypotheses and , respectively. The false alarm and detection rates achieved are then and , respectively.

Performance of the online algorithms. To confirm the convergence and effectiveness of the online Algorithms ? and ?, simulation tests are carried out for infinite memory and invariant routing matrix . Figure ? (a) depicts the evolutions of the average cost in for different amounts of missing data when the noise level is . It is evident that for both online algorithms the average cost converges (possibly within a ball) to its batch counterpart in (P3) normalized by the window size . Impressively, this observation together with the one in Fig. ? (a) corroborate that the online estimators can attain the performance of the benchmark estimator, whose stable/exact recovery performance is well documented e.g., in [25]. It is further observed that the more data are missing, the more time it takes to learn the low-rank nominal traffic subspace, which in turn slows down convergence.

To examine the tracking capability of the online estimators, Fig. ? (b) depicts the estimated versus true anomalies over time as Algorithm ? evolves for three representative flows indicated on Fig. ?, namely corresponding to the -th rows of . Setting the detection threshold to the value as before, for the flows Algorithm ? attains detection rate at false alarm rate , respectively. The quantification error per flow is also around , respectively. As expected, more false alarms are declared at early iterations as the low-rank subspace has not been learnt accurately. Upon learning the subspace performance improves and almost all anomalies are identified. Careful inspection of Fig. ? (b) reveals that the anomalies for are better identified visually than those for . As shown in Fig. ?, is carried over links each one carrying additional flows, respectively, whereas is aggregated over link with only additional flows. Hence, identifying ’s anomalies from the highly-superimposed load of links is a more challenging task relative to link . This simple example manifests the fact that the detection performance strongly depends on the network topology and the routing policy implemented, which determine the routing matrix. In accordance with [25], the coherence of sparse column subsets of the routing matrix plays an important role in identifying the anomalies. In essence, the more incoherent the column subsets of are, the better recovery performance one can attain. An intriguing question left here to address in future research pertains to desirable network topologies giving rise to incoherent routing matrices.

Tracking routing changes. The measurement model in has two time-varying attributes which challenge the identification of anomalies. The first one is missing measurement data arising from e.g., packet losses during the data collection process, and the second one pertains to routing changes due to e.g., network congestion or link failures. It is thus important to test whether the proposed online algorithm succeeds in tracking these changes. As discussed earlier, missing data are sampled uniformly at random. To assess the impact of routing changes on the recovery performance, a simple probabilistic model is adopted where each time instant a single link fails, or, returns to the operational state. Let denote the adjacency matrix of the network graph , where if there exists a physical link joining nodes and , and zero otherwise. Similarly, the active links involved in routing the data at time are represented by the effective adjacency matrix . At time instant , a biased coin is tossed with small success probability , and one of the links, say , is chosen uniformly at random and removed from while ensuring that the network remains connected. Likewise, an edge is added with the same probability . The resulting adjacency matrix is then , where the indicator function equals one when , and zero otherwise; and are i.i.d. Bernoulli random variables. The minimum hop-count algorithm is then applied to , to update the routing matrix . Note that with probability