Changepoint Detection over Graphs with the
Spectral Scan Statistic
We consider the change-point detection problem of deciding, based on noisy measurements, whether an unknown signal over a given graph is constant or is instead piecewise constant over two connected induced subgraphs of relatively low cut size. We analyze the corresponding generalized likelihood ratio (GLR) statistics and relate it to the problem of finding a sparsest cut in a graph. We develop a tractable relaxation of the GLR statistic based on the combinatorial Laplacian of the graph, which we call the spectral scan statistic, and analyze its properties. We show how its performance as a testing procedure depends directly on the spectrum of the graph, and use this result to explicitly derive its asymptotic properties on few significant graph topologies. Finally, we demonstrate both theoretically and by simulations that the spectral scan statistic can outperform naive testing procedures based on edge thresholding and testing.
In this article we are concerned with the basic but fundamental task of deciding whether a given graph, over which a noisy signal is observed, contains a cluster of anomalous or activated nodes comprising an induced connected subgraph. Such a problem is highly relevant in a variety of scientific areas, such as community detection in social networks, surveillance, disease outbreak detection, biomedical imaging, sensor network detection, gene network analysis, environmental monitoring and malware detection. Recent theoretical contributions in the statistical literature (see, e.g., castro:05; arias2008searching; arias2011detection; addario2010combinatorial) have detailed the inherent difficulty of such a testing problem in relatively simplified settings and under specific conditions on the graph topology. From a practical standpoint, the natural algorithm for detection of anomalous clusters of activity in graphs is the the generalized likelihood ratio test (GLRT) or scan statistic, a computationally intensive procedure that entails scanning all well connected clusters and testing individually for anomalous activation. Unfortunately, its performance over general graphs is not well understood, and little attention has been paid to determining alternative, computationally tractable, procedures.
In this article we assume that the class of clusters of activation consists of sub-graphs of small cut size. We believe this is a natural and realistic assumption which, as we demonstrate below, allows us to explicitly incorporate into the detection problem the properties of the graph topology through its spectrum. In particular, we show that the GLRT is an integer program with a term in the objective that corresponds to the sparsest cut in a graph, a known NP-hard problem. With this in mind, we propose a relaxation of the GLRT, called the spectral scan statistic, which is based on the combinatorial Laplacian of the graph and, importantly, is a tractable program. As our main result, we derive theoretical guarantees for the performance of the spectral scan statistic, which hold for any graph and are based on the spectral measure of the combinatorial Laplacian. For comparison purposes, we derive theoretical guarantees for two simple estimators, the edge thresholding and the test. We conclude our study by applying the main result to balanced binary trees, the lattice, and Kronecker graphs, giving us precise asymptotic results. We find that, modulo logarithm terms, the spectral scan statistic has nearly optimal power for balanced binary trees. Simulations for these models verify that the spectral scan statistic dominates the simple estimators.
Contributions. Our contributions are as follows. (1) We define a new class of activation patterns based on the notion of small cut size that reflects in a natural way the topological properties of the graph. (2) We analyze the corresponding GLR statistics and show that it is indeed related to the problem of finding sparest cuts. We then develop a computationally tractable relaxation of the GLR statistic, called the spectral scan statistic and analyze its properties. In our main theoretical result, we show show that the performance of the spectral scan statistic depends explicitly on the spectral properties of the graph. (3) Using such results we are able to characterize in a very explicit form the performance of the spectral scan statistic on a few notable graph topologies and demonstrate its superiority over naive detectors, such as the edge thresholding and the test. (4) Finally, we have formulated the detection problem under more general and realistic scenarios, which involve composite null and alternative hypotheses as opposed to simple hypotheses as is customary in the theoretical statistical literature on this subject.
Related Work. Normal means testing in high-dimensions is a well established and fundamental problem in statistics (see, e.g., ingster2003nonparametric). A significant portion of the recent work in this area (castro:05; arias2008searching; arias2011detection; addario2010combinatorial) has focused on incorporating structural assumptions on the signal, as a way to mitigate the effect of high-dimensionality and also because many real-life problems can be represented as instances of the normal means problem with graph-structured signals (see, for an example, jacob2010gains). These contributions have considered the generalized likelihood ratio test of means when the alternative hypothesis takes on the form of a combinatorial space. However, the performance of such test has been analyzed only for certain types of graphs, and it is unclear to what extent those analyses extend to general graph topologies. Moreover, while much is known about the theoretical performance of the GLRT, no mention is made about its computational feasibility. Another line of research relevant to our problem is the optimal fail detection with nuisance parameters and matched subspace detection in the signal processing literature: see, e.g. scharf94; baygun.hero:95; fouladirad2005optimal; fouladirad2008optimal. Though our problem can be cast as a special case of the more general problem of optimal testing of a linear subspace under nuisance parameters considered in that line of work, the focus on a graph-structured signal, as well as the type of analysis based on the interplay between the scan statistics and the spectral properties of the graph contained in our work, are novel.
1.1 Problem Setup
We now formalize the problem of detecting a change of signal over the vertices of a graph from noisy observations in the high-dimensional setting. For a given connected, undirected, possibly weighted graph on nodes, we observe one realization of the random vector
where and , with known. We will assume that there are two groups of constant activation for the signal , namely that there exists a subset such that is constant within both and it complement . We formalize this assumption by writing
where are unknown parameters, is a -dimensional vector of ones and is the indicator function of the subset . The parameter can be thought of as the magnitude of the background signal and is a nuisance parameter, while quantifies the the gap in signal between the two clusters. Setting , we will use to measure the energy of the signal (note that this quantity is independent of ), and we will define the signal-to-noise ratio (SNR) to be
We will not assume any knowledge of the true clustering , other than that it belongs to a given class of bi-partitions of such that and are both large and can be easily disconnected, in that they have low cut size . Formally, we define, for some ,
where is the boundary of . Note that is a symmetric class in the sense that if and only if . We are interested in the problem of testing whether the gap parameter in equation (2) is zero (i.e. the signal is constant) or it is non-zero for some , regardless of the value of . Thus, we can naturally cast our structured change-point detection problem as the following composite hypothesis testing problem:
where and . Notice that the alternative can be written as the join over of disjoint composite alternatives of the form , .
To make our analysis meaningful, we measure the difficulty of the detection problem in terms of the energy parameter by assuming that, for some , . Thus, we can think of as the minimal degree of separation between the null and alternative hypotheses. Below we will analyze asymptotic conditions under which the hypothesis testing problem described above is feasible, in a sense made precise in the next definition, when the size of the graph increases unboundedly. To this end, we will further assume that the relevant parameters of the model, , , and change with as well, even though we will not make such dependence explicit in our notation for ease of readability. Our results establish conditions for asymptotic disinguishability as a function of the SNR and and the spectrum of the graph .
Let denote the distribution of induced by the model (1), where . For a given statistic and threshold , let be if and otherwise. We say that the hypotheses and are asymptotically distinguished by the test if
where the limit is taken as . We say that and are asymptotically indistinguishable if there does not exist any test for which the above limits hold.
Notation. We will need some mathematical terminology from algebraic graph theory (godsil2001algebraic). A central object to our analysis is the combinatorial Laplacian matrix , where is the adjacency matrix of the graph and is the diagonal matrix of node degrees, , . If the graph is weighted then reflects this. We will denote the eigenvalues of with , which we will always take in increasing order. Since is connected, the smaller eigenvalue , with corresponding eigenvector, . is known as the algebraic connectivity and is lower bounded by where is the diameter of the graph. Throughout this study we use Bachmann-Landau notation for asymptotic statements: if then and .
The hypothesis testing problem at hand presents two challenges: (1) the model contains an unbounded nuisance parameter and (2) the alternative hypothesis is comprised of a finite disjoint union of composite hypotheses indexed by . These features set our problem apart from virtually all existing work of structured normal means problems (see, e.g. castro:05; arias2008searching; arias2011detection; addario2010combinatorial), which does not consider nuisance parameters and relies on a simplified framework consisting of a simple null hypothesis and a composite hypothesis consisting of disjoint unions of simple alternatives. Having nuisance parameters and composite hypothesis require a more sophisticated analysis.
We will eliminate the interference caused by the nuisance parameter by considering test procedures that are independent of . The formal justification for this choice is based on the theory of optimal invariant hypothesis testing (see, e.g., lehmann2005testing) and of uniformly best constant power tests (see wald:43). Due to space limitations we will not provide the details and refer the reader to fouladirad2008optimal; fouladirad2005optimal; fillatre2007non; fillatre2012; scharf94; baygun.hero:95 and references therein for in depth-treatments of these issues related to the model a hand.
For the simpler problem of testing versus for some , the optimal test is based on the likelihood ratio (LR) statistic (see the proof of Lemma 2 below for a derivation)
where and is the Lebesgue density of . This test rejects for large values of . Optimality follows from the fact that the statistical model we consider has the monotone likelihood ratio property.
When testing against composite alternatives, like in our case, it is customary to consider instead the generalized likelihood ratio (GLR) statistic, which in our case reduces to
Through manipulations of the likelihoods, we find that the GLR statistic has a very convenient form which is tied to the spectral properties of the graph via its Laplacian.
Let and . Then
where is the combinatorial Laplacian of the graph .
The proof is provided in the appendix. The savvy reader will notice the connection between (7) and the graph sparsest cut program. By Lagrangian duality, we see that the program (7) is equivalent to (for some Lagrangian parameter )
the first term of which is precisely the sparsest cut objective, and the second term drives the solution to have positive within cluster empirical correlations. The sparsest cut program is known to be NP-hard, with poly-time algorithms known for trees and planar graphs(matula1990sparsest). Because of this fact, approximate algorithms have been proposed over the past two decades, most notably the uniform multicommodity flow approach of (leighton1988approximate; shmoys1997cut) and the semi-definite relaxation of the cut metric (arora2009expander). hagen1992new observed that the minimum cut sparsity is bounded by the algebraic connectivity (), suggesting the Fiedler vector (i.e. the second eignenvector of ) to be an appropriate relaxation of the characteristic vector of the cut. Moreover, the well known Cheeger inequality shows that the minimum cut sparsity (in a regular graph) is bounded by the algebraic connectivity (see chung2004discrete). We will follow the tradition of bounding sparsity with the algebraic connectivity, and provide a surrogate estimator to the scan statistic based on this simple spectral relaxation.
Define the Spectral Scan Statistic (SSS) as
Then the GLR statistic is bounded by the SSS: .
First let us notice that is the projection onto the subspace orthogonal to . Because is thus idempotent, , and we can rewrite
So, we have the following relaxation,
By Lagrangian duality and the Courant-Fischer theorem, the spectral scan statistic can be written as
where is the maximum non-zero eigenvalue of the matrix .
Notice that because the domain is symmetric around the origin, this is precisely the square of the solution to
where we have used the fact that because within . This previous formulation shows that the SSS is related to the supremum of a Gaussian process over . This fact will turn out to be extremely convenient, as we show next.
3 Theoretical Analysis
We first derive a simple condition for asymptotic indistinguishability based on testing the null versus a single component in the alternative. A more refined analysis of the lower bound for the general hypothesis (4) is beyond the scope of this article.
Suppose that there exists such that . Then and are asymptotically indistinguishable if .
The proof is in the appendix. We will analyze the performance of the SSS statistic by relying on its representation (8) as the square of the supremum of a Gaussian process. We draw heavily on the theory of the generic chaining, perfected in talagrand2005generic, which essentially reduces the problem of computing bounds on the expected supremum of Gaussian processes to geometric properties of its index space. Recall that, under alternative hypothesis, uniformly over .
The following hold with probability at least . Under the null
while the alternative
We use generic chaining to control the process appearing in the SSS. First, we notice that the index set is the intersection of an ellipsoid and the unit ball, which is the intuition behind the following lemma.
Let have spectrum . Then under ,
The proof is provided in the appendix. We then can use the well known phenomena, that the supremum of a Gaussian process concentrates around it’s expectation (see the appendix). Hence, by Lemma 14 the first statement in Theorem 6 holds. The second statement follows by applying standard concentration results to the univariate Gaussian and noticing that and under . ∎
As a corollary we will provide sufficient conditions for asymptotic distinguishability that depend on the spectrum of the Laplacian . As we will show in the next section, these conditions can be applied to a number of graph topologies whose spectral properties are known.
The null and alternative, as described in Thm. 6, are asymptotically distinguished by and if
Other stronger sufficient conditions are
if is large enough that .
Interestingly, there are no logarithmic terms in (9) that usually accompany uniform bounds of this type, which is attributed to the generic chaining. Notice that the left hand side of (9) is always less than , which we will see characterizes the performance of the naive estimator .
For comparison, we consider the performance of two naive procedure for detection: the energy detector, which reject if is too large and the edge thresholding detector, which reject if is large.
and are asymptotically distinguished by if and only if
The proof (given in the appendix) is a standard analysis. In sharpnacksparsistency the authors examined the problem of exact recovery of cluster boundaries in the graph-structured normal means problem by taking differences between observations corresponding to adjacent nodes. The following result stems from Theorem 2.1 of sharpnacksparsistency, and the fact that scales like up to a factor of .
and are asymptotically distinguished by if
If contains balanced clusters, i.e. bipartitions such that , then this result matches the scaling in Theorem 9 up to a log factor.
4 Specific Graph Models
In this section we demonstrate the power and flexibility of Theorem 6 by analyzing in detail the performance of the spectral scan statistic over three important graph topologies: balanced binary trees, the s-dimensional lattice and the Kronecker graphs (see leskovec2007scalable; leskovec2010kronecker).
4.1 Balanced Binary Trees
We begin the analysis of the spectral scan statistic by applying it to the balanced binary tree (BBT) of depth . The class of signals that we will consider have clusters of constant signal which are subtrees of size at least for . Hence, the cut size of the signals are and .
For the balanced binary tree with vertices, the spectral scan statistic can asymptotically distinguish from signals with if the SNR is stronger than
We simulate the probability of correct discovery of change-points (rejecting when the truth is ) versus the probability of false alarm (falsely rejecting ). These are given for the four estimators in Figure 1 and for the SSS as increases. In these simulations a subtree at level (of size ) was chosen as , the gap-to-noise ratio is fixed at , and . We see that even in the low regime, exploiting the graph structure is essential to improve the power of testing against . As increases with fixed the performance of the SSS dramatically increases.
We will analyze the performance guarantees of the SSS over the 2-dimensional lattice graph with vertices along each dimension (). We will assume that , as this is the cut sparsity of rectangles that have a low surface area to volume ratio. By a simple Fourier analysis (see sharpnack2010identifying), we know that the Laplacian eigenvalues are for all . We will appeal to (10). Because for , if we rewrite for then . Hence,
Then by choosing the term in the root of the LHS of (10) is bounded by, modulo lower order terms. We arrive at the following conclusion,
For the square lattice, the spectral scan statistic can asymptotically distinguish from signals with cut size if the SNR is stronger than,
4.3 Kronecker Graphs
Much of the research in complex networks has focused on observing statistical phenomena that is common across many data sources. The most notable of these are that the degree distribution obeys a power law (faloutsos1999power) and networks are often found to have small diameter (milgram1967small). A class of graphs that satisfy these, while providing a simple modelling platform are the Kronecker graphs (see leskovec2007scalable; leskovec2010kronecker). Let and be graphs on vertices with Laplacians and edge sets respectively. The Kronecker product, , is the graph over vertices such that there is an edge if and or and . We will construct graphs that have a multi-scale topology using the Kronecker product. Let the multiplication of a graph by a scalar indicate that we multiply each edge weight by that scalar. First let be a connected graph with vertices. Then the graph for levels is defined as
The choice of multipliers ensures that it is easier to make cuts at the more coarse scale. Notice that all of the previous results have held for weighted graphs.
For be the Kronecker product graph described above with vertices, the spectral scan statistic can asymptotically distinguish from signals with cuts within the coarsest scale (), if the SNR is stronger than,
The proof and an explanation of is in the appendix. Again, we demonstrate the improvement of the SSS over competing tests in Figure 1. For these simulations the base graph was chosen to be two triangles () connected by a single edge (p = 6). At the coarsest scale one of the subgraphs was chosen to be with .
We studied the heretofore unaddressed problem of how to tractably detect change-points in networks under Gaussian noise. To this end we developed the spectral scan statistic, suggesting it as a computationally feasible alternative to the GLRT. We completely characterized the performance of the SSS for any graph in terms of the spectrum of the combinatorial Laplacian. For comparison purposes, we developed theoretical guarantees for two simple estimators. We applied the main result to three graph models: binary balanced trees, the lattice and Kronecker graph. We see that not only is it statistically inadmissible to ignore graph structure, but for the balanced tree the SSS gives near optimal performance. This claim is backed by both simulation and theory.
This research is supported in part by AFOSR under grant FA9550-10-1-0382 and NSF under grant IIS-1116458.
6.1 Proofs in Section 2
Proof of Lemma 2.
To expedite the proof, we express the LR statistics in terms of the sufficient statistics and for and . Then, we obtain
where is the MLE under . (The likelihood under the alternative balances with the normalizing constant of the null likelihood.) Thus,
Now we let , making the statistic above
The result now follows by considering all the indicator functions corresponding to the sets in . ∎
6.2 Proofs in Section 3
Proof of Theorem 5.
Let the true be known. The performance of the optimal test with known, which by the Neyman-Pearson Lemma is based on , bounds the performance of that with unknown. To this end, note that, under , the LR statistic (6) has a , while under the alternative it has a distribution with non-centrality parameter
which is the square of the SNR. For fixed , asymptotically indistinguishable of versus follows by considering any threshold and noticing that the associated type 1 and type 2 errors are non-vanishing under the SNR scaling assumed in the statement. Since the risk of testing versus is no smaller than the risk of testing versus , the result follows. ∎
We remark that the proof of the previous result shows that when distinguishing from , the power of the test is maximal when for a fixed value of the SNR.
Proof of Lemma 7.
Without loss of generality, let . We recall that, since is connected, the combinatorial Laplacian is symmetric, its smallest eigenvalue is zero and the remaining eigenvalues are positive. By the spectral theorem, we can write , where is a diagonal matrix containing the positive eigenvalues of in increasing order and the columns of the matrix are the associated eigenvectors. Then, since each vector with can be written as for a unique vector , we have
where in the third identity we have used the fact that . Letting , we see that
where and denotes equality in distribution.
Next, we show that the set , which is the intersection of an ellipsoid with the unit ball in , is contained in an enlarged ellipsoid. The supremum of the Gaussian process over will then be bounded by the supremum of the same process over this larger but simpler set, which we will be able to bound using directly a result from talagrand2005generic based on chaining. To this end, let and . For for a vector set , , and . Then, we observe the following chain of implications, holding for vectors :
Hence, we have the bound
Recalling that , for , where is the th eigenvalue of , by Proposition 2.2.1 in talagrand2005generic the right hand side of the previous expression is bounded by . ∎
Supplement to the proof of Theorem 6.
The following property of Gaussian processes effectively reduces the study of their supremum to the study of its expectation. It was established by borell1975brunn and cirel1976norms and can be found in ledoux2001concentration.
Consider a Gaussian process where is compact with respect to metric
and let . We have that with probability at least
Notice that the natural distance is given by for . ∎
Proof of Theorem 9.
Recall that , where is the orthogonal projection matrix into the -dimensional linear subspace of vectors orthogonal to . Under , , and, therefore, , since . On the other hand, under for a fixed , , where is given in as in (2). Thus, under , , where the non-centrality parameter is given by
where the second identity is due to the fact that is symmetric and idempotent and the last inequality to our assumption on the minimal separation between and any of the alternatives. Thus, if , then . Hence, using standard chi-square tail bounds (see for example proposition 2 of MartinSSP12) and since the bound (12) holds uniformly over all , it follows that the null and alternate are asymptotically distinguishable using the test statistic if and only if . ∎
6.3 Proof in Section 4
Proof of Corollary 11.
The study of the spectra of trees really began in earnest with the work of fiedler1975eigenvectors. Notably, it became apparent that tree have eigenvalues with high multiplicities, particularly the eigenvalue . molitierno2000tight gave a tight bound on the algebraic connectivity of balanced binary trees (BBT). They found that for a BBT of depth , the reciprocal of the smallest eigenvalue () is
rojo2002spectrum gave a more exact characterization of the spectrum of a balanced binary tree, providing a decomposition of the Laplacian’s characteristic polynomial. Specifically, the characteristic polynomial of is given by
where is a polynomial of degree and are polynomials of degree with the smallest root satisfying the bound in (13) with replaced with . In rojo2005spectra, they extended this work to more general balanced trees.
By (14) we know that at most eigenvalues have reciprocals larger than . Let , then we have ensured that at most eigenvalues are smaller than . For large enough
Proof of Corollary 13.
The Kronecker product of two matrices is defined as such that . Some matrix algebra shows that if and are graphs on vertices with Laplacians then the Laplacian of their Kronecker product, , is given by (merris1998laplacian). Hence, if are eigenvectors, viz. and , then , where is the usual tensor product. This completely characterizes the spectrum of Kronecker products of graphs.
We should argue the choice of , by showing that it is the results of cuts at level . We say that an edge has scale if . Furthermore, a cut has scale if each of its constituent edges has scale at least . Each edge at scale has weight and there are such edges, so cuts at scale have total edge weight bounded by
Cuts at scale leave components of size intact, meaning that for large enough .
We now control the spectrum of the Kronecker graph. Let the eigenvalues of the base graph be in increasing order. The eigenvalues of are precisely the sums
for . The eigenvalue distribution stochastically bounds
where . Notice that if is chosen uniformly at random then has a geometric distribution with probability of success . Also if , so
This followed from the geometric probability mass function. We also know that the algebraic connectivity, , is bounded from below by , so the following result holds.