Anomaly Detection in Dynamic Networks of Varying Size
Dynamic networks, also called network streams, are an important data representation that applies to many real-world domains. Many sets of network data, such as e-mail networks, social networks, or internet traffic networks, are best represented by a dynamic network due to the temporal component of the data. One important application in the domain of dynamic network analysis is anomaly detection. Here the task is to identify points in time where the network exhibits behavior radically different from a typical time, either due to some event (like the failure of machines in a computer network) or a shift in the network properties. This problem is made more difficult by the fluid nature of what is considered "normal" network behavior. The volume of traffic on a network, for example, can change over the course of a month or even vary based on the time of day without being considered unusual. Anomaly detection tests using traditional network statistics have difficulty in these scenarios due to their Density Dependence: as the volume of edges changes, the value of the statistics changes as well, making it difficult to determine whether the change in signal is due to the traffic volume or to some fundamental shift in the behavior of the network. To more accurately detect anomalies in dynamic networks, we introduce the concept of Density-Consistent network statistics. These statistics are designed to produce results that reflect the state of the network independent of the volume of edges. On synthetically generated graphs, anomaly detectors using these statistics show a 20-400% improvement in recall when distinguishing graphs drawn from different distributions. When applied to several real datasets, Density-Consistent statistics recover multiple network events which standard statistics failed to find, and the times flagged as anomalies by Density-Consistent statistics have subgraphs with radically different structure from normal time steps.
Network analysis is a broad field but one of the more important applications is in the detection of anomalous or critical events. These anomalies could be a machine failure on a computer network, an example of malicious activity, or the repercussions of some event on a social network’s behavior [16, 12]. In this paper, we will focus on the task of anomaly detection in a dynamic network where the structure of the network is changing over time. For example, each time step could represent one day’s worth of activity on an e-mail network. The goal is then to identify any time steps where the pattern of those communications seems abnormal compared to those of other time steps.
As comparing the communication patterns of two network examples directly is complex, one simple approach is to summarize each network using a network statistic and then compare the statistics. A number of anomaly detection methods rely on these statistics [15, 2, 8]. Another method is to use a network model such as an exponential random graph model (ERGM). However, both methods often encounter difficulties when the properties of the network are not static.
A typical real-world network experiences many changes in the course of its natural behavior, changes which are not examples of anomalous events. The most common of these is variation in the volume of edges. In the case of an e-mail network where the edges represent messages, the total number of messages could vary based on the time, or there could be random variance in the number of messages sent each day. The statistics used to measure the network properties are usually intended to capture some property of the network other than the volume of edges: for example, the clustering coefficient is usually treated as a measure of transitivity. However, the common clustering coefficient measure is statistically inconsistent as the density of the network changes. Even on an Erdős-Rényi network, which does not explicitly capture transitive relationships, the clustering coefficient will be greater as the density of the network increases.
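The density effect on an Erdős-Rényi graph can be checked with a small simulation. The sketch below is illustrative only: the helper names `erdos_renyi` and `global_clustering`, the graph size, the two edge probabilities, and the use of the unweighted global clustering coefficient (rather than any specific measure analyzed later in this paper) are all assumptions made for the example.

```python
import itertools
import random


def erdos_renyi(n, p, rng):
    """Sample an undirected G(n, p) graph as a dict of neighbor sets."""
    adj = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def global_clustering(adj):
    """Fraction of connected triples (paths of length 2) that are closed."""
    closed = 0
    triples = 0
    for nbrs in adj.values():
        for a, b in itertools.combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
cc_sparse = global_clustering(erdos_renyi(60, 0.05, rng))
cc_dense = global_clustering(erdos_renyi(60, 0.40, rng))
print(f"p=0.05: {cc_sparse:.3f}   p=0.40: {cc_dense:.3f}")
```

Because every edge in this model is sampled independently, the expected fraction of closed triples equals the edge probability itself, so the denser graph reports more clustering even though neither graph contains any transitive mechanism.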
When statistics vary with the number of edges in the network, it is not valid to compare different network time steps using those statistics unless the number of edges is constant in each time step. A similar effect occurs with network models that employ features or statistics which are size sensitive: prior work shows that ERGMs learn different parameters given subsets of the same graph, so even if the network properties are identical, observing a smaller portion of the network leads to learning a different set of parameters.
The purpose of this work is to analytically characterize statistics by their sensitivity to network density, and offer principled alternatives that are consistent estimators, which empirically give more accurate results on networks with varying densities.
The major contributions of this paper are:
We prove that several commonly used network statistics are Density Dependent and poorly reflect the network behavior if the network size is not constant.
We offer alternative statistics that are Density Consistent which measure changes to the distribution of edges regardless of the total number of observed edges.
We demonstrate through theory and synthetic trials that anomaly detection tests using Density Consistent statistics are better at identifying when the distribution of edges in a network has changed.
We apply anomaly detection tests using both types of statistics to real data to show that Density Consistent statistics recover more major events of the network stream while Density Dependent statistics flag many time steps due to a change in the total edge count rather than an identifiable anomaly.
We analyze the subgraphs that changed the most in the anomalous time steps and demonstrate that Density Consistent statistics are better at finding local features which changed radically during the anomaly.
2 Statistic-Based Anomaly Detection
A statistic-based anomaly detection method is any method which makes its determination of anomalous behavior using network statistics calculated on the graph examples. The actual anomaly detection process can be characterized in the form of a hypothesis test. The network statistics calculated on examples demonstrating normal network behavior form the null distribution, while the statistic on the network being tested for anomalies forms the test statistic. If the test statistic is not likely to have been drawn from the null distribution, we can reject the null hypothesis (that the test network reflects normal behavior) and conclude that it is anomalous.
Let $G = \{V, E\}$ be a multigraph that represents a dynamic network, where $V$ is the node set and $E_t$ is the set of edges at time $t$, with $e_{ij,t}$ the number of edges between nodes $i$ and $j$ at time $t$. The edges represent the number of interactions that occurred between the nodes observed within a discrete window of time. As the number of participating nodes is relatively static compared to the number of communications, we will assume that the node set is constant; in time steps where a node has no communications we will treat it as being part of the network but having zero edges.
Let us define some network statistic $S_k(G_t)$ designed to measure network property $k$, which we will use as the test statistic (e.g., clustering coefficient). Given some set of time steps such that the graphs at those times are all examples of normal network behavior, $S_k$ is calculated on each of these learning set examples to estimate an empirical null distribution. If $t_{test}$ is the time step we are testing for abnormality, then the value $S_k(G_{t_{test}})$ is the test statistic. Given a specified $p$-value (referred to as $\alpha$), we can find threshold(s) that reject a fraction $\alpha$ of the null distribution, then reject the null hypothesis (conclude an anomaly is present) if the test statistic falls outside those bounds. For this work we will use a two-tailed Z-test at significance level $\alpha$, with the thresholds $\phi_{lower}$ and $\phi_{upper}$. Anomalous test cases where the null hypothesis is rejected correspond to true positives; normal cases where the null hypothesis is rejected correspond to false positives. Likewise, anomalous cases where the null is not rejected correspond to false negatives and normal cases where the null is not rejected correspond to true negatives.
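The test described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the normal fit to the learning-set values, the default `alpha=0.05`, and the sample numbers are all assumptions chosen for the example.

```python
from statistics import NormalDist, mean, stdev


def z_test_thresholds(null_values, alpha):
    """Two-tailed thresholds cutting alpha/2 from each tail of a normal fit."""
    mu = mean(null_values)
    sigma = stdev(null_values)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mu - z * sigma, mu + z * sigma


def is_anomalous(test_value, null_values, alpha=0.05):
    """Reject the null hypothesis when the test statistic leaves the bounds."""
    lower, upper = z_test_thresholds(null_values, alpha)
    return test_value < lower or test_value > upper


# Hypothetical statistic values from time steps assumed to be normal.
learning_values = [0.31, 0.29, 0.30, 0.33, 0.28, 0.32, 0.30, 0.31]
print(z_test_thresholds(learning_values, 0.05))
print(is_anomalous(0.30, learning_values), is_anomalous(0.45, learning_values))
```

The thresholds widen as the learning-set values become noisier, which is the mechanism behind the loss of statistical power discussed in Section 3.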
Deltacon is an example of the statistic-based approach, as are many others [7, 8, 9, 14]. As these models also often incorporate network statistics, we will focus on the statistics themselves in this paper. Moreover, not all methods rely solely on statistics calculated from single networks: there are also delta statistics which measure the difference between two network examples. Netsimile is an example of such a network comparison statistic. For a dynamic network, these delta statistics are usually calculated between the networks in two consecutive time steps. In practical use for hypothesis testing, these delta statistics function the same as their single network counterparts.
If the network properties being tested are not static with respect to time, this natural evolution may cause $S_k(G_t)$ to change over time regardless of anomalies, which makes the null distribution invalid. It is useful in these instances to replace the statistic with a detrended version $S_k(G_t) - f(t)$, where the function $f(t)$ is some fit to the original statistic values. Prior work describes how to do dynamic anomaly detection using a linear detrending function, but other functions can be used for the detrending. This detrending operation does not change the overall properties of the statistic, so for the remainder of the paper assume $S_k(G_t)$ refers to a detrended version, if appropriate.
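As a hedged illustration of the linear case, the sketch below fits a least-squares line to a series of statistic values and returns the residuals. The function name `linear_detrend`, the use of the time index 0, 1, 2, ... as the regressor, and the example series are assumptions made here; the cited approach may differ in its details.

```python
def linear_detrend(values):
    """Subtract the least-squares line f(t) = a + b * t from a series.

    Assumes at least two observations, taken at times t = 0, 1, 2, ...
    """
    n = len(values)
    times = range(n)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return [v - (intercept + slope * t) for t, v in zip(times, values)]


# A steadily growing statistic with one unusual time step at t = 6.
series = [2.0 + 0.5 * t for t in range(10)]
series[6] += 3.0
residuals = linear_detrend(series)
print([round(r, 2) for r in residuals])
```

After detrending, the steady growth no longer inflates the later time steps, and the residual at the unusual step stands out against the others.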
2.1 Common Network Statistics
Listed here are some of the more commonly used network statistics for anomaly detection.
Graph Edit Distance:
The graph edit distance (GED) is defined as the number of edits that one network needs to undergo to transform into another network and is a measure of how different the network topologies are. In the context of a dynamic network, the GED is applied as a delta measure to report how quickly the network is changing.
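A minimal sketch of this delta measure follows, under assumptions that go beyond the text: the node set is constant (so only edge edits count), each multigraph is stored as a hypothetical dictionary from node pair to edge count, and one edit inserts or deletes a single edge.

```python
def graph_edit_distance(edges_a, edges_b):
    """Edge insertions plus deletions needed to turn one multigraph into the other.

    Each argument maps a node pair to the number of edges on that pair.
    """
    pairs = set(edges_a) | set(edges_b)
    return sum(abs(edges_a.get(p, 0) - edges_b.get(p, 0)) for p in pairs)


g_prev = {("a", "b"): 3, ("b", "c"): 1}
g_now = {("a", "b"): 1, ("a", "c"): 2}
print(graph_edit_distance(g_prev, g_now))

# Doubling every edge count keeps the proportions but doubles the distance.
g_prev_x2 = {p: 2 * c for p, c in g_prev.items()}
g_now_x2 = {p: 2 * c for p, c in g_now.items()}
print(graph_edit_distance(g_prev_x2, g_now_x2))
```

The second call hints at the issue examined later in the paper: scaling the edge volume changes the distance even though the relative distribution of edges is unchanged.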
Degree Distribution Difference:
Define $D_i(t)$ to be the degree of node $v_i$ at time $t$, the total number of messages to or from the node. The degree distribution is then a histogram of the number of nodes with a particular degree. Typically real-world networks exhibit a power-law degree distribution, but others are possible. To compare the degree distributions of $G_t$ and $G_{t-1}$, one option is to take the squared difference between the degree counts for each possible degree. We will call this the degree distribution difference (DD):
$$DD(G_t) = \sum_{k} \Big( \sum_{i \in V} I[D_i(t) = k] - \sum_{i \in V} I[D_i(t-1) = k] \Big)^2$$
Other measures can be used to compare the two distributions but the statistical properties of the squared distance described later extend to other distance measures.
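The squared-difference comparison of degree histograms can be sketched as below. The dictionary representation of a multigraph (node pair to edge count), the function names, and the toy graphs are assumptions for illustration only.

```python
from collections import Counter


def degrees(edges, nodes):
    """Total number of edges incident to each node of a multigraph."""
    deg = {v: 0 for v in nodes}
    for (u, v), count in edges.items():
        deg[u] += count
        deg[v] += count
    return deg


def degree_distribution_difference(edges_a, edges_b, nodes):
    """Sum over degrees k of the squared difference in node counts at k."""
    hist_a = Counter(degrees(edges_a, nodes).values())
    hist_b = Counter(degrees(edges_b, nodes).values())
    # Counter returns 0 for degrees that never occur in one of the graphs.
    return sum((hist_a[k] - hist_b[k]) ** 2 for k in set(hist_a) | set(hist_b))


nodes = ["a", "b", "c"]
g_prev = {("a", "b"): 1}
g_now = {("a", "b"): 1, ("b", "c"): 1}
print(degree_distribution_difference(g_prev, g_now, nodes))
```

In this toy case one node moves from the degree-0 bin to the degree-1 bin and another moves from degree 1 to degree 2, so two bins each change by one node.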
Weighted Clustering Coefficient:
Clustering coefficient is intended as a measure of the transitivity of the network. It is a critical property of many social networks, as the odds of interacting with friends of your friends often lead to these triangular relationships. As the standard clustering coefficient is not designed for weighted graphs, we will be analyzing a weighted clustering coefficient, specifically the Barrat weighted clustering coefficient (CB):
$$CB(G_t) = \sum_{i \in V} \frac{1}{s_i (k_i - 1)} \sum_{j,h} \frac{e_{ij,t} + e_{ih,t}}{2} a_{ij,t} a_{ih,t} a_{jh,t}$$
where $a_{ij,t} = I[e_{ij,t} > 0]$, $s_i = \sum_j e_{ij,t}$, and $k_i = \sum_j a_{ij,t}$. Other weighted clustering coefficients exist, but they behave similarly to the Barrat coefficient.
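For concreteness, here is a hedged sketch of the Barrat coefficient summed over nodes. It assumes an undirected multigraph stored as a hypothetical dictionary from `frozenset({i, j})` to edge count, and it adopts the convention (chosen here, not taken from the text) that nodes with fewer than two neighbors contribute 0.

```python
def barrat_clustering(weights, nodes):
    """Sum over nodes of the Barrat weighted local clustering coefficient."""

    def w(i, j):
        return weights.get(frozenset((i, j)), 0)

    total = 0.0
    for i in nodes:
        nbrs = [j for j in nodes if j != i and w(i, j) > 0]
        k_i = len(nbrs)                      # number of distinct neighbors
        s_i = sum(w(i, j) for j in nbrs)     # node strength (total weight)
        if k_i < 2:
            continue  # coefficient undefined; treated as 0 in this sketch
        acc = 0.0
        for j in nbrs:
            for h in nbrs:
                # Ordered neighbor pairs (j, h) that close a triangle with i.
                if j != h and w(j, h) > 0:
                    acc += (w(i, j) + w(i, h)) / 2
        total += acc / (s_i * (k_i - 1))
    return total


nodes = ["a", "b", "c", "d"]
weights = {
    frozenset(("a", "b")): 1,
    frozenset(("a", "c")): 1,
    frozenset(("b", "c")): 1,
    frozenset(("a", "d")): 2,
}
print(barrat_clustering(weights, nodes))
```

In this example nodes `b` and `c` each sit in one fully closed triangle, while node `a` also has a heavy edge to `d` that closes no triangle, which pulls its local coefficient down.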
3 Density Dependence
To illustrate why dependency of the network statistic on the edge count affects the conclusion of hypothesis tests, we will first investigate statistics that are density dependent.
A statistic $S_k$ is density dependent if the value of $S_k(G_t)$ is dependent on the density of $G_t$ (i.e., on $|E_t|$).
False Positives for Density Dependent Statistics
Let $G_x$ be a learning set of graphs and $G_{test}$ be the test graph. If $S_k(G_t)$ is monotonically dependent on $|E_t|$ and the learning set edge counts $|E_x|$ are bounded by finite $|E_{min}|$ and $|E_{max}|$, there is some $|E_{test}|$ that will cause the test case to be rejected regardless of whether the network is an example of an anomaly with respect to property $k$.
Let $S_k(G_t)$ be a network statistic that is a monotonic and divergent function with respect to the number of edges $|E_t|$ in $G_t$. Given a set of learning graphs $G_x$, the values of $S_k(G_x)$ are bounded by $\max_x S_k(G_x)$ and $\min_x S_k(G_x)$, so the critical points $\phi_{lower}$ and $\phi_{upper}$ of a hypothesis test using this learning set will be within these bounds. Since an increasing $|E_t|$ implies that $S_k(G_t)$ increases or decreases, there exists a $|E_{test}|$ such that $S_k(G_{test})$ is not within $\phi_{lower}$ and $\phi_{upper}$ and will be rejected by the test. ∎
If changing the number of edges in an observed network changes the output of the statistic, then if the test network differs sufficiently in its number of edges compared to the learning examples, the null hypothesis will be rejected regardless of the other network properties. It is not sufficient to simply label these times as anomalies (due to unusual edge volume), because the hypothesis test is designed to test for abnormality in a specific network property. If edge count anomalies also flag anomalies on other network properties, we cannot distinguish the case where both are anomalous from the case where only the message volume is. If just the message volume is unusual, this might simply be an example of an exceptionally busy day where the pattern of communication is roughly the same, just elevated. This is a very different case from one where both the volume and the distribution of edges are unusual.
A second problem occurs when the edge counts in the learning set have high variance. If the statistic is dependent on the number of edges, noise in the edge counts translates to noise in the statistic values which lowers the statistical power of the test.
False Negatives for Density Dependent Statistics
For any $S_k(G_{test})$ calculated on a network that is anomalous with respect to property $k$, if $S_k(G_t)$ is dependent on $|E_t|$ there is some value of the variance of the learning network edge counts such that $G_{test}$ is not detected as an anomaly.
Let $S_k(G_{test})$ be the test statistic and $G_x$ be the set of learning graphs, where the edge count $|E_x|$ of any learning graph is drawn from some distribution. If $S_k$ is a monotonic divergent function of $|E_x|$, then as the variance of $|E_x|$ increases the variance of $S_k(G_x)$ increases as well. For a given $\alpha$, the hypothesis test thresholds $\phi_{lower}$ and $\phi_{upper}$ will widen as the variance increases to incorporate learning set instances. Therefore, for a given $S_k(G_{test})$ there is some value of the edge count variance such that $\phi_{lower} \le S_k(G_{test}) \le \phi_{upper}$. ∎
With a sufficient amount of edge count noise, the statistical power of the anomaly detector drops to zero.
These theorems have been defined using a statistic calculated on a single network, but some statistics are delta measures which are measured on two networks. In these cases, the edge counts of either or both of the networks can cause edge dependency issues.
For some delta statistic $S_k(G_t, G_{t-1})$, if the statistic is dependent on $|E_t|$, $|E_{t-1}|$, or both, Theorems 3.1 and 3.2 apply to any edge count which influences the statistic. If $S_k$ depends on the difference $\Delta|E| = \big| |E_t| - |E_{t-1}| \big|$, then as either $|E_t|$ or $|E_{t-1}|$ changes, the statistic produced changes, leading to the problems described in Theorems 3.1 and 3.2. If $S_k$ depends on $\min(|E_t|, |E_{t-1}|)$, then if both $|E_t|$ and $|E_{t-1}|$ increase or decrease the statistic is affected, leading to the same types of errors. ∎
These theorems show that dependency on edge counts can lead to both false positives and false negatives from anomaly detectors that look for unusual network properties other than edge count. In order to distinguish between the observed edge counts in each time step and the other network properties, we need a more specific data model to represent the network generation process with components for each.
4 Density Independence
Now that we have established the problems with density dependence, we need to define the properties that we would prefer our network statistics to have. To do this we need a more detailed model of how the graph examples were generated.
4.1 Data Model
Let the number of edges in any time step, $|E_t|$, be a random variable drawn from one distribution in times where there is a normal message volume and from a different distribution in times where there is anomalous message volume. Now let the distribution of edges amongst the nodes of the graph be represented by a $|V| \times |V|$ matrix $P_t$, where the value of any cell $p_{ij,t}$ is the probability of observing a message between a particular pair of nodes at a particular time. This is a probability distribution, so the total matrix mass sums to 1. Like the edge count, treat this matrix as drawn from one distribution in normal times and from another in times that are anomalous with respect to some network property $k$ (for example, an atypical degree distribution). Any observed network slice can be treated as having been generated by a multinomial sampling process where $|E_t|$ edges are selected independently from the node pairs with probabilities $P_t$. Denote the sampling procedure for a graph with $G_t \sim Multinomial(|E_t|, P_t)$. In the next section, we will detail how this decomposition into the count of edges and their distribution allows us to define statistics which are not sensitive to the number of edges in the network.
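The generative process above can be sketched as follows. The dictionary representation of the edge probability matrix, the function name `sample_graph`, and the example probabilities are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_graph(cell_probs, num_edges, rng):
    """Draw num_edges edges independently from a node-pair distribution.

    cell_probs maps a node pair to its probability; the values sum to 1.
    Returns a Counter mapping each sampled node pair to its edge count.
    """
    pairs = list(cell_probs)
    weights = [cell_probs[p] for p in pairs]
    return Counter(rng.choices(pairs, weights=weights, k=num_edges))


edge_probs = {("a", "b"): 0.5, ("a", "c"): 0.3, ("b", "c"): 0.2}
rng = random.Random(1)  # fixed seed so the run is repeatable
quiet_day = sample_graph(edge_probs, 100, rng)
busy_day = sample_graph(edge_probs, 10_000, rng)
print(quiet_day)
print(busy_day)
```

Both samples come from the same edge distribution and differ only in volume, which is exactly the situation a density-consistent statistic should treat as non-anomalous.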
4.2 Density Consistency and Unbiasedness
In the above data model, $P_t$ is the distribution of edges in the network, thus any property of the network aside from the volume of edges is encapsulated by the $P_t$ matrix. Therefore, a network statistic designed to capture some network property other than edge count should be a function of $P_t$ alone.
Let $S_k(P_t)$ be some test statistic designed to capture a network property $k$ of $P_t$ that is independent of the density of $G_t$. Since $P_t$ is not directly observable, we can estimate the statistic with the empirical statistic $\hat{S}_k(G_t)$, where $\hat{P}_t$ is used to estimate $P_t$, with $\hat{p}_{ij,t} = e_{ij,t} / |E_t|$.
A statistic $S_k$ is density consistent if $\hat{S}_k(G_t)$ is a consistent estimator of $S_k(P_t)$.
If $\hat{S}_k(G_t)$ is a consistent estimator of the true value of $S_k(P_t)$, then observing more edges should cause the estimated statistic to converge to the true value given $P_t$. More specifically, it is asymptotically consistent with respect to the true value as the number of observed edges increases. Another way to describe this property is that $\hat{S}_k(G_t)$ has some bias term dependent on the edge count $|E_t|$, but the bias converges to zero as the edge count increases: $\lim_{|E_t| \to \infty} E[\hat{S}_k(G_t)] - S_k(P_t) = 0$. Density consistent statistics allow us to perform accurate hypothesis tests as long as a sufficient number of edges are observed in the networks. To begin, we will prove that the rate of false positives does not exceed the selected $p$-value $\alpha$.
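A small simulation can illustrate how the empirical edge probabilities approach the true ones as more edges are observed. This is a sketch under assumed inputs: the three-cell distribution, the helper names, and the sample sizes are hypothetical, and the estimate used is the edge count on a node pair divided by the total edge count.

```python
import random
from collections import Counter


def empirical_probs(edge_counts):
    """Estimate each cell probability as its share of the observed edges."""
    total = sum(edge_counts.values())
    return {pair: count / total for pair, count in edge_counts.items()}


def max_abs_error(true_probs, num_edges, rng):
    """Largest gap between true and estimated cell probabilities."""
    pairs = list(true_probs)
    weights = [true_probs[p] for p in pairs]
    sample = Counter(rng.choices(pairs, weights=weights, k=num_edges))
    p_hat = empirical_probs(sample)
    return max(abs(p_hat.get(p, 0.0) - true_probs[p]) for p in pairs)


true_probs = {("a", "b"): 0.5, ("a", "c"): 0.3, ("b", "c"): 0.2}
rng = random.Random(7)  # fixed seed so the run is repeatable
for num_edges in (10, 1_000, 100_000):
    print(num_edges, round(max_abs_error(true_probs, num_edges, rng), 4))
```

Any statistic that is a continuous function of these estimates inherits the same convergence, which is the intuition behind density consistency.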
False Positive Rate for Density Consistent Statistics
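As $|E_t| \to \infty$, if $\hat{S}_k(G_t)$ converges to $S_k(P_t)$, the probability of a false positive when testing a time with an edge count anomaly approaches $\alpha$.
Let all learning set graphs be drawn from non-anomalous edge count and $P$ distributions, and the test instance be drawn from an anomalous edge count distribution but a non-anomalous $P$ distribution. If $\hat{S}_k(G_t)$ is a consistent estimator of $S_k(P_t)$, then $\lim_{|E_t| \to \infty} \hat{S}_k(G_t) = S_k(P_t)$. Then as both the learning set edge counts $|E_x|$ and the test edge count $|E_{test}|$ increase, the statistics of all learning set instances and of the test instance approach the distribution of $S_k(P)$ under non-anomalous $P$. As any threshold is as likely to reject a learning set instance as the test instance, the false positive rate approaches $\alpha$. ∎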
As the bias converges to zero, graphs created with the same underlying properties will produce statistic values within the same distribution, making the test case come from the same distribution as the null. Even if the test case has an unusual number of edges, as long as the number of edges is not too small there will not be a false positive. Density consistency is also beneficial in the case of false negatives.
False Negative Rate for Density Consistent Statistics
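As $|E_t| \to \infty$, let $\hat{S}_k(G_x)$ converge to $S_k(P_x)$ and $\hat{S}_k(G_{test})$ converge to $S_k(P_{test})$. If $S_k(P_x)$ and $S_k(P_{test})$ are separable, then the probability of a false negative approaches 0 as $|E_t|$ increases.
Let all learning set graphs be drawn from non-anomalous edge count and $P$ distributions, and the test instance be drawn from a non-anomalous edge count distribution but a $P$ distribution that is anomalous on the network property being tested. Let $\hat{S}_k(G_t)$ be a consistent estimator of $S_k(P_t)$. If $S_k(P_x)$ and $S_k(P_{test})$ are separable, then $\hat{S}_k(G_x)$ and $\hat{S}_k(G_{test})$ converge to two non-overlapping distributions and the probability of rejection approaches 1 for any $\alpha$. ∎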
The statistical power of a density consistent statistic depends only on whether the $P$ matrices of normal and anomalous graphs are separable using the true statistic value: as long as $|E_t|$ is sufficiently large, the bias is small enough that it is not a factor in the rate of false negatives.
A special case of density consistency is density consistent and unbiased, which refers to statistics where, in addition to consistency, the empirical statistic is also an unbiased estimator of the true $S_k(P_t)$.
A statistic $S_k$ is density unbiased if $\hat{S}_k(G_t)$ is an unbiased estimator of $S_k(P_t)$.
Unbiasedness is a desirable property because a density consistent statistic without it may produce errors due to bias when the number of observed edges is low.
4.3 Proposed Density-Consistent Statistics
We will now define a set of Density-Consistent statistics designed to measure network properties similar to the previously described dependent statistics, but without the sensitivity to total network edge count.
Probability Mass Shift:
The probability mass shift (MS) is a parallel to GED as a measure of how much change has occurred between the two networks examined. Mass Shift, however, attempts to measure the change in the underlying distributions and avoids being directly dependent on the edge counts. The probability mass shift between time steps $t-1$ and $t$ is
$$MS(t) = \sum_{i,j} (p_{ij,t} - p_{ij,t-1})^2$$
The MS can be thought of as the total edge probability that changes between the edge distributions in $t-1$ and $t$.
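A hedged sketch of the mass shift on two known edge distributions follows. It assumes the statistic is the sum of squared differences between corresponding cells, with each distribution stored as a hypothetical dictionary from node pair to probability and missing pairs treated as probability 0.

```python
def mass_shift(p_now, p_prev):
    """Sum of squared differences between two edge probability distributions."""
    pairs = set(p_now) | set(p_prev)
    return sum((p_now.get(q, 0.0) - p_prev.get(q, 0.0)) ** 2 for q in pairs)


p_prev = {("a", "b"): 0.5, ("a", "c"): 0.5}
p_now = {("a", "b"): 0.5, ("b", "c"): 0.5}
print(mass_shift(p_now, p_prev))
```

The inputs here are probabilities rather than counts, so multiplying the number of observed edges in either time step does not enter the calculation at all.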
Probabilistic Degree Shift:
We will now propose a counterpart to the degree distribution which is density consistent. Define the probabilistic degree of a node $v_i$ to be $\sum_{j} p_{ij,t}$. Then, let the probabilistic degree shift of a particular node in $G_t$ be defined as the squared difference of the probabilistic degree of that node in times $t-1$ and $t$. The total probabilistic degree shift (DS) of $G_t$ is then:
$$DS(t) = \sum_{i \in V} \Big( \sum_{j} p_{ij,t} - \sum_{j} p_{ij,t-1} \Big)^2$$
This is a measure of how much the total probability mass of nodes in the graph changes across a single time step. If the shape of the degree distribution is changing, the probabilistic degree of nodes will be changing as well.
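The degree shift can be sketched in the same style. The assumptions here are that the graph is undirected, that each unordered node pair is stored once in a hypothetical dictionary of probabilities, and that a pair's probability counts toward the probabilistic degree of both endpoints.

```python
def probabilistic_degrees(probs, nodes):
    """Total edge probability incident to each node."""
    deg = {v: 0.0 for v in nodes}
    for (u, v), p in probs.items():
        deg[u] += p
        deg[v] += p
    return deg


def degree_shift(p_now, p_prev, nodes):
    """Sum over nodes of the squared change in probabilistic degree."""
    d_now = probabilistic_degrees(p_now, nodes)
    d_prev = probabilistic_degrees(p_prev, nodes)
    return sum((d_now[v] - d_prev[v]) ** 2 for v in nodes)


nodes = ["a", "b", "c"]
p_prev = {("a", "b"): 0.5, ("a", "c"): 0.5}
p_now = {("a", "b"): 0.5, ("b", "c"): 0.5}
print(degree_shift(p_now, p_prev, nodes))
```

In this example the hub role moves from node `a` to node `b` while node `c` keeps the same share of the traffic, so only two nodes contribute to the shift.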
Triangle Probability:
As the name suggests, the triangle probability (TP) statistic is an approach to capturing the transitivity of the network and an alternative to traditional clustering coefficient measures. Define the triangle probability as:
$$TP(t) = \sum_{i,j,k \in V,\ i \ne j \ne k} p_{ij,t} \, p_{ik,t} \, p_{jk,t}$$
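One possible reading of the triangle probability is sketched below: the sum, over node triples, of the product of the three edge probabilities that would close the triangle. Treat the details as assumptions: the sketch iterates over unordered triples (a sum over ordered triples would differ only by a constant factor), and the edge distribution is a hypothetical dictionary keyed by `frozenset({i, j})`.

```python
from itertools import combinations


def triangle_probability(probs, nodes):
    """Sum over unordered node triples of the product of their three edge probabilities."""

    def p(i, j):
        return probs.get(frozenset((i, j)), 0.0)

    return sum(p(i, j) * p(i, k) * p(j, k) for i, j, k in combinations(nodes, 3))


third = 1.0 / 3.0
triangle = {
    frozenset(("a", "b")): third,
    frozenset(("a", "c")): third,
    frozenset(("b", "c")): third,
}
star = {
    frozenset(("a", "b")): third,
    frozenset(("a", "c")): third,
    frozenset(("a", "d")): third,
}
print(triangle_probability(triangle, ["a", "b", "c"]))
print(triangle_probability(star, ["a", "b", "c", "d"]))
```

The star places the same amount of probability mass on three edges as the triangle does, but no triple has all three of its edges present, so its triangle probability is zero.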
5 Properties of Network Statistics
Now that we have described the different categories of network statistics and their relationship to the network density we will characterize several common network statistics as density dependent or consistent, comparing them to our proposed alternatives. Table 1 summarizes our findings.
5.1 Graph Edit Distance
GED is a density dependent statistic.
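When the edge counts of the two time steps are the same, the GED (Eq. 2.1) can be thought of as the difference in the distribution of edges in the network. However, the GED is sensitive to $|E_t|$ in two ways: the change in the number of edges from $t-1$ to $t$, $\Delta|E| = \big| |E_t| - |E_{t-1}| \big|$, and the minimum number of edges in each time step, $\min(|E_t|, |E_{t-1}|)$. In both cases the statistic is density dependent, and in fact it diverges as the number of edges increases. The first case is discussed in Theorem 5.1, the second in Theorem 5.2.
As $\Delta|E| \to \infty$, $GED(G_t) \to \infty$, regardless of $P_t$ and $P_{t-1}$.
Let $\Delta|E|$ be $\big| |E_t| - |E_{t-1}| \big|$. Since, for a constant node set, the GED corresponds to $|E_t| + |E_{t-1}| - 2|E_t \cap E_{t-1}|$, the minimum edit distance between $G_t$ and $G_{t-1}$ occurs when their edge sets overlap maximally and is equal to $\Delta|E|$. Therefore, as $\Delta|E|$ increases even the minimum (i.e., best case) $GED(G_t)$ also increases. ∎
As $\min(|E_t|, |E_{t-1}|) \to \infty$, $GED(G_t) \to \infty$ if $P_t \ne P_{t-1}$.
Let $G_t \sim Multinomial(|E_t|, P_t)$ and $G_{t-1} \sim Multinomial(|E_{t-1}|, P_{t-1})$, with $P_t \ne P_{t-1}$. Select two nodes $i, j$ such that $p_{ij,t} \ne p_{ij,t-1}$. The edit distance contributed by those two nodes is $|e_{ij,t} - e_{ij,t-1}|$. Let $\min(|E_t|, |E_{t-1}|)$ increase but $\Delta|E|$ remain constant. As the edge counts increase, the edit distance of the two nodes approaches $\big| p_{ij,t}|E_t| - p_{ij,t-1}|E_{t-1}| \big|$, which grows with the edge counts. Since every pair of nodes with differing edge probabilities in the two time steps will have increasing edit distance as the edge counts increase, the global edit distance will also increase. ∎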
Since the GED measure is the literal count of edges and nodes that differ in each graph, the statistic is dependent on the difference in size between the two graphs. Even if the graphs are the same size, comparing two large graphs is likely to produce more differences than two very small graphs due to random chance. In addition, even when small non-anomalous differences occur between the probability distributions of edges in two time steps, variation in the edge count can result in large differences in GED.
Table 1 (excerpt). Graph edit distance: density dependent (✓).
5.2 Degree Distribution Difference
DD is a density dependent statistic.
The degree distribution is naturally very dependent on the total degree of a network: the average degree of nodes is larger in networks with many edges. The DD measure (Eq. 2) is again sensitive to $|E_t|$ via $\Delta|E|$ and $\min(|E_t|, |E_{t-1}|)$. In both cases the statistic is density dependent. The first case, in Theorem 5.3, shows that as $\Delta|E|$ increases, the DD measure will also increase even if the graphs were generated with the same probabilities. The second case, in Theorem 5.4, shows that as $\min(|E_t|, |E_{t-1}|)$ increases, small variations in $P$ will increase the DD measure.
As $\Delta|E|$ increases, $DD(G_t)$ increases, regardless of $P_t$ and $P_{t-1}$.
Pick any node $v_i$ in $G_t$ and $v_j$ in $G_{t-1}$. If $|E_t|$ increases and $|E_{t-1}|$ stays the same, the expected degree of $v_i$ increases, while that of $v_j$ stays the same; likewise the inverse is true if $|E_{t-1}|$ increases and $|E_t|$ stays the same. Thus, as $\Delta|E|$ increases the probability of any two nodes having the same degree approaches zero, so the degree distribution difference of the two networks increases with greater $\Delta|E|$. ∎
As $\min(|E_t|, |E_{t-1}|)$ increases, $DD(G_t)$ increases if $P_t \ne P_{t-1}$.
Let $\sum_j p_{ij,t}$ be the probabilistic degree of node $v_i$ under $P_t$. Pick any node $v_i$ in $G_t$ and $v_j$ in $G_{t-1}$ such that their probabilistic degrees differ. For a constant $\Delta|E|$, as $\min(|E_t|, |E_{t-1}|)$ increases, the expected degrees of $v_i$ and $v_j$ converge to their probabilistic degrees scaled by $|E_t|$ and $|E_{t-1}|$ respectively. This means that the probability of the two nodes having the same degree approaches zero. Since every pair of nodes with differing edge probabilities in the two time steps will have unique degrees, the degree distribution difference will also increase. ∎
If a node has very similar edge probabilities in matrices $P_t$ and $P_{t-1}$, when few edges are sampled it is likely to have the same degree in both time steps, and thus the impact on the degree distribution difference will be low. However, as the number of edges increases, even small non-anomalous differences in the matrices will become more apparent (i.e., the node is likely to be placed in different bins in the degree distribution difference calculation), and the impact on the measure will be larger.
5.3 Weighted Clustering Coefficient
CB is a density consistent statistic.
As shown in Theorem 5.5 below, the weighted Barrat clustering coefficient (CB, Eq. 3) is in fact density consistent. However, we will show later that the triangle probability statistic is also density unbiased, which gives more robust results, even on very sparse networks.
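$CB(G_t)$ is a consistent estimator of $CB(P_t)$, with a bias that converges to 0.
The number of edges observed on any pair of nodes can be represented using a multinomial distribution. As $|E_t| \to \infty$, the rate of sampling a particular node pair is $p_{ij,t}$, so:
$$\lim_{|E_t| \to \infty} \frac{e_{ij,t}}{|E_t|} = p_{ij,t}$$
Let $c_i$ refer to the clustering coefficient of node $i$. Then, in the limit of $|E_t| \to \infty$ it converges to:
$$\lim_{|E_t| \to \infty} c_i = \frac{1}{(k_i - 1) \sum_j p_{ij,t}} \sum_{j,h} \frac{p_{ij,t} + p_{ih,t}}{2} a_{ij,t} a_{ih,t} a_{jh,t}$$
where $a_{ij,t} = I[p_{ij,t} > 0]$ and $k_i = \sum_j a_{ij,t}$. Since this limit can be expressed in terms of $P_t$ alone, and CB is a sum of the clustering over all nodes, the Barrat clustering coefficient is a density consistent statistic. ∎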
If we calculate the expectation of the Barrat clustering, we obtain:
As this does not simplify to the limit of CB, it is not an unbiased estimator, and is thus density consistent but not unbiased. Other weighted clustering coefficients are also available but they have the same properties as the Barrat statistic.
5.4 Probability Mass Shift
MS is a density consistent statistic.
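As the true $P$ is unobserved, we cannot calculate the Mass Shift (MS, Eq. 4) statistic exactly and must use the empirical Probability Mass Shift: $\widehat{MS}(t) = \sum_{i,j} (\hat{p}_{ij,t} - \hat{p}_{ij,t-1})^2$. As shown in Theorem 5.6 below, the bias of this estimator approaches 0 as $|E_t|$ and $|E_{t-1}|$ increase, making this a density consistent statistic.
$\widehat{MS}(t)$ is a consistent estimator of $MS(t)$, with a bias that converges to 0.
The expectation of the empirical Mass Shift can be calculated with
$$E[\widehat{MS}(t)] = \sum_{i,j} E\big[(\hat{p}_{ij,t} - \hat{p}_{ij,t-1})^2\big] = \sum_{i,j} \Big( E[\hat{p}_{ij,t}^2] - 2 p_{ij,t} p_{ij,t-1} + E[\hat{p}_{ij,t-1}^2] \Big)$$
where the cross term uses the independence of the samples in the two time steps. As the edge count on a node pair is binomially distributed, the expectation of $\hat{p}_{ij,t}^2$ for any node pair can be written as:
$$E[\hat{p}_{ij,t}^2] = p_{ij,t}^2 + \frac{p_{ij,t}(1 - p_{ij,t})}{|E_t|}$$
The expected empirical mass shift can then be written as:
$$E[\widehat{MS}(t)] = MS(t) + \sum_{i,j} \frac{p_{ij,t}(1 - p_{ij,t})}{|E_t|} + \sum_{i,j} \frac{p_{ij,t-1}(1 - p_{ij,t-1})}{|E_{t-1}|}$$
As the two additional bias terms converge to 0 as $|E_t|$ and $|E_{t-1}|$ increase, the empirical mass shift is a consistent estimator of the true mass shift, and is density consistent. ∎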
We can improve the rate of convergence as well by using our empirical estimates of the probabilities to subtract an estimate of the bias from the statistic. We use the following bias-corrected version of the empirical statistic in all experiments:
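A hedged sketch of such a correction is shown below. It assumes the bias of the empirical mass shift consists of one variance term per time step, each of the form p(1 - p) / |E|, and subtracts plug-in estimates of both; the exact form used in the experiments may differ (for example in the denominator). The function name and the count dictionaries are hypothetical.

```python
def corrected_mass_shift(counts_now, counts_prev):
    """Empirical mass shift minus plug-in estimates of its two bias terms.

    Each argument maps a node pair to its observed edge count in one time step.
    """
    m_now = sum(counts_now.values())
    m_prev = sum(counts_prev.values())
    total = 0.0
    for pair in set(counts_now) | set(counts_prev):
        p_now = counts_now.get(pair, 0) / m_now
        p_prev = counts_prev.get(pair, 0) / m_prev
        total += (p_now - p_prev) ** 2
        total -= p_now * (1 - p_now) / m_now      # estimated bias from time t
        total -= p_prev * (1 - p_prev) / m_prev   # estimated bias from time t-1
    return total


sparse_now = {("a", "b"): 6, ("a", "c"): 4}
sparse_prev = {("a", "b"): 5, ("a", "c"): 5}
dense_now = {pair: 1000 * c for pair, c in sparse_now.items()}
dense_prev = {pair: 1000 * c for pair, c in sparse_prev.items()}
print(corrected_mass_shift(sparse_now, sparse_prev))
print(corrected_mass_shift(dense_now, dense_prev))
```

With only ten edges per time step the observed difference is smaller than what sampling noise alone would produce, so the corrected value is negative; with the same proportions at a thousand times the volume, the correction is negligible and the shift is clearly positive.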
5.5 Probabilistic Degree Shift
DS is a density consistent statistic.
Again, since the true $P$ is unobserved, we cannot calculate the Degree Shift (DS, Eq. 5) statistic exactly and must use the empirical Probability Degree Shift: $\widehat{DS}(t) = \sum_{i \in V} \big( \sum_j \hat{p}_{ij,t} - \sum_j \hat{p}_{ij,t-1} \big)^2$. As shown in Theorem 5.7 below, the bias of this estimator approaches 0, making this a density consistent statistic.
$\widehat{DS}(t)$ is a consistent estimator of $DS(t)$, with a bias that converges to 0.
The expectation of the empirical degree shift equals the true degree shift plus two additional bias terms. The statistic is density consistent because these bias terms converge to 0 as $|E_t|$ and $|E_{t-1}|$ increase. ∎
Since the bias converges to 0, the statistic is density consistent. By subtracting out the empirical estimate of this bias term we can hasten the convergence. We use the following bias-corrected empirical degree shift in our experiments:
5.6 Triangle Probability
TP is a density consistent and density unbiased statistic.
Again, since the true $P$ is unobserved, we cannot calculate the Triangle Probability (TP, Eq. 6) statistic exactly and must use the empirical Triangle Probability
$$\widehat{TP}(t) = \sum_{i,j,k \in V,\ i \ne j \ne k} \hat{p}_{ij,t} \, \hat{p}_{ik,t} \, \hat{p}_{jk,t}$$
which is an unbiased estimator of the true statistic (shown below in Theorem 5.8). This means that there is no minimum number of edges necessary to attain an unbiased estimate of the true triangle probability.
$\widehat{TP}(t)$ is a consistent and unbiased estimator of $TP(t)$.
The expectation of the empirical Triangle Probability can be written as
As the number of edges on any node pair can be represented with a multinomial, the expectation of each $\hat{p}_{ij,t}$ is $p_{ij,t}$. This lets us rewrite the triangle probability as
Therefore the empirical triangle probability is an unbiased estimator of the true triangle probability and is a density consistent statistic. ∎
Now that we have established the properties of density-consistent and -dependent statistics we will show the tangible effects of these properties using both synthetic datasets as well as data from real networks. The purpose of the synthetic data experiments is to show the ability of hypothesis tests using various statistics to distinguish networks that have different distributions of edges but also a random number of observed edges. The real data experiments demonstrate the types of events that generate anomalies as well as the characteristics of the anomalies that hypothesis tests using each statistic are most likely to