Sketching for Sequential Change-Point Detection
Abstract
We study sequential change-point detection procedures based on linear sketches of high-dimensional signal vectors, using generalized likelihood ratio (GLR) statistics. The GLR statistics allow for an unknown post-change mean that represents an anomaly or novelty. We consider both fixed and time-varying projections, and we derive theoretical approximations to two fundamental performance metrics: the average run length (ARL) and the expected detection delay (EDD); these approximations are shown by numerical simulations to be highly accurate. We further characterize the relative performance of the sketching procedure compared to that without sketching, and show that there can be little performance loss when the signal strength is sufficiently large and a sufficient number of sketches is used. Finally, we demonstrate the good performance of sketching procedures using simulations and real-data examples on solar flare detection and failure detection in power networks.
I Introduction
Online change-point detection from high-dimensional streaming data is a fundamental problem arising in applications such as real-time monitoring of sensor networks, computer network anomaly detection, and computer vision (e.g., [2, 3]). To reduce data dimensionality, a conventional approach is sketching (see, e.g., [4]), which performs a random projection of the high-dimensional data vectors into lower-dimensional ones. Sketching is now widely used in signal processing and machine learning to reduce dimensionality and algorithm complexity and to achieve various practical benefits [5, 6, 7, 8, 9, 10, 11].
We consider change-point detection using linear sketches of high-dimensional data vectors. Sketching reduces the computational complexity of the detection statistic from to , where is the original dimensionality and is the dimensionality of the sketches. Since we would like to perform real-time detection, any reduction in computational complexity (without incurring much performance loss) is highly desirable. Sketching also offers practical benefits. For instance, for large sensor networks, it reduces the burden of data collection and transmission. It may be impossible to collect data from all sensors and transmit them to a central hub in real time, but this can be done if we only select a small subset of sensors to collect data at each time. Sketching also reduces the data storage requirement. For instance, change-point detection using the generalized likelihood ratio statistic, although robust, is non-recursive, so one has to store historical data. Using sketching, we only need to store the much lower-dimensional sketches rather than the original high-dimensional vectors.
In this paper, we present a new sequential sketching procedure based on the generalized likelihood ratio (GLR) statistics. In particular, suppose we may choose an matrix with to project the original data: , . Assume the pre-change vector is zero-mean Gaussian distributed and the post-change vector is Gaussian distributed with an unknown mean vector while the covariance matrix is unchanged. Here we assume the mean vector is unknown since it typically represents an anomaly. The GLR statistic is formed by replacing the unknown with its maximum likelihood estimator (e.g., [12]). We then further generalize to the setting of time-varying projections of dimension . We demonstrate the good performance of our procedures through simulations, a real-data example of solar flare detection, and a synthetic example of power network failure detection with data generated using a real-world power network topology.
Our theoretical contributions are mainly in two aspects. First, we obtain analytic expressions for two fundamental performance metrics for the sketching procedures: the average run length (ARL) when there is no change and the expected detection delay (EDD) when there is a change-point, for both fixed and time-varying projections. Our approximations are shown to be highly accurate by simulations. These approximations are quite useful in determining the threshold of the detection procedure to control false alarms, without having to resort to onerous numerical simulations. Second, we characterize the relative performance of the sketching procedure compared to that without sketching. We examine the EDD ratio when the sketching matrix is either a random Gaussian matrix or a sparse 0-1 matrix (in particular, an expander graph). We find that, as also verified numerically, when the signal strength and are sufficiently large, the sketching procedure may have little performance loss. When the signal is weak, the performance loss can be large when is too small. In this case, our results can be used to find the minimum such that the performance loss is bounded, assuming a certain worst-case signal and a given target ARL value.
To the best of our knowledge, our work is the first to consider sequential change-point detection using the generalized likelihood ratio statistic with a post-change mean that is unknown, so as to represent an anomaly. The only other work [22] that considers change-point detection using linear projections assumes the post-change mean is known and, further, sparse. Our results are more general since we do not make such assumptions. Assuming the post-change mean to be unknown yields a more useful procedure, since in change-point detection the post-change setup is usually unknown. Moreover, [22] considers the Shiryaev–Roberts procedure, which is based on a different kind of detection statistic than the generalized likelihood ratio statistic considered here. The theoretical analyses therein consider slightly different performance measures, the probability of false alarm (PFA) and the average detection delay (ADD), and our analyses are completely different.
Our notation is standard: denotes the chi-square distribution with degrees of freedom; denotes an identity matrix of size ; denotes the pseudo-inverse of a matrix ; denotes the th coordinate of a vector ; denotes the th element of a matrix ; denotes the transpose of a vector or matrix .
The rest of the paper is organized as follows. We first review some related work. Section II sets up the formulation of the sketching problem for sequential change-point detection. Section III presents the sketching procedures. Sections IV and V contain the performance analysis of the sketching procedures. Sections VI and VII demonstrate the good performance of our sketching procedures using simulated and real-world examples. Section VIII concludes the paper. All proofs are relegated to the appendix.
I-A Related work
Change-point detection problems are closely related to industrial quality control and multivariate statistical process control (SPC) charts, where an observed process is assumed initially to be in control and at a change-point becomes out of control. The idea of using random projections for change detection has been explored for SPC in the pioneering work [13] based on a multivariate control chart, the follow-up work [14] for the cumulative sum (CUSUM) control chart and the exponentially weighted moving average (EWMA) schemes, and in [15, 16] based on the Hotelling statistic. These works provide a complementary perspective from SPC design, while our method takes a different approach based on sequential hypothesis testing, treating both the change-point location and the post-change mean vector as unknown parameters. By treating the change-point location as an unknown parameter when deriving the detection statistic, the sequential hypothesis testing approach overcomes the lack-of-memory drawback of some SPC methods, such as the Shewhart chart and the Hotelling chart, which cannot utilize the information embedded in the entire sequence [17]. Moreover, our sequential GLR statistic may be preferred over the CUSUM procedure in settings where it is difficult to specify the post-change mean vector. Besides the above distinctions from the SPC methods, other novel aspects of our method include: (1) we develop new theoretical results for the sequential GLR statistic; (2) we consider sparse 0-1 and time-varying projections (the sparse 0-1 projection corresponds to downsampling the dimensions); (3) we study the amount of dimensionality reduction that can be performed (i.e., the minimum ) such that there is little performance loss.
This paper extends our preliminary work reported in [1] in several important directions. We have added (1) time-varying sketching projections and their theoretical analysis; (2) extensive numerical examples to verify our theoretical results; (3) new real-data examples of solar flare detection and power failure detection.
Our work is related to compressive signal processing [18], where the problem of interest is to estimate or detect (in the fixed-sample setting) a sparse signal using compressive measurements. In [19], an offline test for a nonzero vector buried in Gaussian noise using linear measurements is studied; interestingly, a conclusion similar to ours is drawn, namely that the task of detection in this setting is much easier than the tasks of estimation and support recovery. Another related work is [20], which considers the problem of identifying a subset of data streams within a larger set, where the data streams in the subset follow a distribution (representing an anomaly) different from the original distribution; the problem considered therein is not a sequential change-point detection problem, as the “change-point” happens at the onset. In [21], an offline setting is considered, and the goal is to identify out of samples whose distributions are different from the normal distribution . They use a “temporal” mixing of the samples over the finite time horizon. This differs from our setting, since we project over the signal dimension at each time. Other related work includes the kernel methods of [23] and [24], which focus on offline change-point detection. Finally, detecting transient changes in power systems has been studied in [25].
Another common approach to dimensionality reduction is principal component analysis (PCA) [26], which reduces dimensionality by projecting the signal along the singular space of the leading singular values. In this case, or corresponds to the signal singular space. Our theoretical approximations for the ARL and EDD can also be applied in these settings. It may not be easy to find the signal singular space when the dimensionality is high, since computing the singular value decomposition can be expensive [27].
II Formulation
Consider a sequence of observations with an open time horizon , , where and is the signal dimension. Initially, the observations are due to noise. At some unknown change-point , the mean of the signal vector changes. Such a problem can be formulated as the following hypothesis test:
(1) 
where the unknown mean vector is defined as
Without loss of generality, we have assumed the noise variance is 1. Our goal is to detect the change-point as soon as possible after it occurs, subject to a false-alarm constraint. Here, we assume the covariance of the data is an identity matrix and the change only happens to the mean.
To reduce data dimensionality, we linearly project each observation into a lower-dimensional space, which we refer to as sketching. We aim to develop procedures that can detect a change-point using the low-dimensional sketches. In the following, we consider two types of linear sketching: fixed projections and time-varying projections.
Fixed projection. Choose an projection matrix with . We obtain low-dimensional sketches via:
(2) 
Then the hypothesis test for the original problem (1) becomes the following hypothesis test based on the sketches (2):
(3) 
Note that both mean and covariance structures are affected by the projections.
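As a minimal numerical sketch of forming the sketches in (2): since the extraction omits the paper's symbols, the names `A`, `x_t`, `y_t` and the dimensions below are our illustrative choices, with a Gaussian sketching matrix as one possible design.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 100, 20  # original dimension and number of sketches (illustrative values)
A = rng.standard_normal((M, N)) / np.sqrt(M)  # one possible sketching matrix

x_t = rng.standard_normal(N)  # one pre-change observation, zero-mean Gaussian
y_t = A @ x_t                 # low-dimensional sketch, as in (2)

# Both mean and covariance are affected by the projection: under the null the
# sketch has mean A @ 0 = 0 but covariance A A^T, which is not the identity.
cov_y = A @ A.T
```

Under the alternative, the sketch mean becomes the projection of the post-change mean, which is why the hypothesis test (3) involves both transformed mean and covariance.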
Time-varying projection. In certain applications, one may use a different sketching matrix at each time. The projections are denoted by , and the number of sketches can change as well. The hypothesis test for the sketches becomes:
(4) 
The above models also capture several special cases:
(i) (Pairwise comparison.)
In applications such as social network data analysis and computer vision, we are interested in pairwise comparisons of variables [28, 29]. This can be modeled as observing the difference between a pair of variables, i.e., at each time , the measurements are , for a set of . There are a total of possible comparisons, and we may select out of such comparisons to observe. The pairwise comparison model leads to a structured fixed projection with only entries, for instance:
(ii) (Missing data.) In various applications we may only observe a subset of entries at each time (e.g., due to sensor failure), and the locations of the observed entries also vary with time [30]. This corresponds to being a submatrix of an identity matrix by selecting rows from an index set at time .
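The two structured projections above can be sketched concretely. The pair indices and observed index set below are illustrative assumptions, not values from the text; what is grounded is the structure: a pairwise-comparison row has a single +1 and a single -1, and a missing-data projection is a row-submatrix of the identity.

```python
import numpy as np

N = 5

# Pairwise comparison: each row has one +1 and one -1, so (A x)_i = x_j - x_k
# for a chosen pair (j, k). The pairs below are illustrative.
pairs = [(0, 1), (2, 3), (1, 4)]
A_pair = np.zeros((len(pairs), N))
for i, (j, k) in enumerate(pairs):
    A_pair[i, j], A_pair[i, k] = 1.0, -1.0

# Missing data: the projection at time t is a submatrix of the identity,
# keeping only the rows indexed by the observed entries (illustrative set).
observed = [0, 2, 4]
A_miss = np.eye(N)[observed, :]

x = np.arange(N, dtype=float)  # x = [0, 1, 2, 3, 4]
diffs = A_pair @ x             # pairwise differences
subset = A_miss @ x            # observed entries only
```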
III Sketching procedure
III-A Fixed projection
III-A1 Derivation of the GLR statistic
We now derive the likelihood ratio statistic for the hypothesis test in (3). Define the sample mean within a window
(5) 
Since the observations are i.i.d. over time, for an assumed change-point , the log-likelihood ratio (log-LR) of the observations accumulated up to time for the hypothesis test in (3) can be shown to be
(6) 
where denotes the probability density function of under the null, i.e., and denotes the probability density function of under the alternative, i.e., .
Since is unknown, we replace it with its maximum likelihood estimator for fixed values of and in the likelihood ratio (6) to obtain the log generalized likelihood ratio (log-GLR) statistic. Taking the derivative of in (6) with respect to and setting it to zero, we obtain an equation that the maximum likelihood estimator of the post-change mean vector must satisfy:
(7) 
or equivalently
Note that since is of dimension by, this defines an underdetermined system of equations for the maximum likelihood estimator . In other words, any that satisfies
for a vector that lies in the null space of , , is a maximum likelihood estimator for the post-change mean. In particular, we may choose the zero vector ; then the corresponding maximum likelihood estimator satisfies the equation below:
(8) 
Substituting such a into (6), we form the log-GLR. Using (8), the first and second terms in (6) become, respectively,
Combining the above, from (6) we have that the log-GLR statistic is given by
(9) 
Since the change-point location is unknown, when forming the detection statistic we take the maximum over a set of possible locations, i.e., the most recent samples from to , where is the window size. We then define the sketching procedure, which is a stopping time that stops whenever the log-GLR statistic rises above a threshold :
(10) 
Here the role of the window is twofold: it reduces data storage when implementing the detection procedure, and it establishes a minimum level of change that we want to detect.
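The windowed stopping rule in (10) can be sketched as follows. This is a sketch under our reading of the derivation: the quadratic form whitens the post-change sample mean of the sketches by the pseudo-inverse of the sketch covariance, and the names `y`, `A`, `b`, `w` are our assumptions since the displayed equations are not reproduced here.

```python
import numpy as np

def sketch_glr_stop(y, A, b, w):
    """Windowed log-GLR stopping rule, in the spirit of (10).

    y : (T, M) array of sketches, one row per time step
    A : (M, N) fixed sketching matrix
    b : detection threshold, w : window length
    Returns the first time t (1-indexed) at which the statistic exceeds b,
    or None if it never does.
    """
    T, M = y.shape
    P = np.linalg.pinv(A @ A.T)  # whitening by the sketch covariance
    # csum[t] holds the sum of the first t rows of y.
    csum = np.vstack([np.zeros(M), np.cumsum(y, axis=0)])
    for t in range(1, T + 1):
        stat = 0.0
        for k in range(max(0, t - w), t):  # candidate change-points in window
            ybar = (csum[t] - csum[k]) / (t - k)  # sample mean over (k, t]
            stat = max(stat, 0.5 * (t - k) * ybar @ P @ ybar)
        if stat > b:
            return t
    return None
```

With `A` equal to the identity, the statistic reduces to the familiar windowed GLR for independent unit-variance streams, which is the equivalent form discussed in the next subsection.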
III-A2 Equivalent formulation of the fixed-projection sketching procedure
We can further simplify the log-GLR statistic in (9) using the singular value decomposition (SVD) of . This will facilitate the performance analysis in Section IV and lead to some insights about the structure of the log-GLR statistic. Let the SVD of be given by
(11) 
where , are the left and right singular spaces, and is a diagonal matrix containing all nonzero singular values. Then . Thus, we can write the log-GLR statistic (9) as
(12) 
Substitution of the sample average (5) into (12) results in
Now define transformed data
Since under the null hypothesis , we have . Similarly, under the alternative hypothesis , we have . Combining the above, we obtain the following equivalent form of the sketching procedure in (10):
(13) 
This form of the detection procedure has an intuitive explanation: the sketching procedure essentially projects the data to form (fewer than ) independent data streams, and then forms a log-GLR statistic for these independent data streams.
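The whitening step behind this equivalent form can be checked numerically: transforming the sketches through the SVD factors yields streams with identity covariance under the null. The names below are illustrative, since the paper's symbols are not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 10
A = rng.standard_normal((M, N))

# Thin SVD A = U diag(s) V^T, as in (11).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Transformed data z_t = diag(1/s) U^T y_t. Under the null, y_t = A x_t with
# Cov(x_t) = I_N, so Cov(y_t) = A A^T and the transformed covariance is
# diag(1/s) U^T (A A^T) U diag(1/s) = I_M: M independent unit-variance streams.
cov_y = A @ A.T
W = np.diag(1.0 / s) @ U.T
cov_z = W @ cov_y @ W.T
```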
III-B Time-varying projection
III-B1 GLR statistic
Similarly, we can derive the log-LR statistic for time-varying projections. For an assumed change-point , using all observations from to time , we find the log-likelihood ratio statistic, similar to (6):
(14) 
Similarly, we replace the unknown post-change mean vector by its maximum likelihood estimator using data in . Taking the derivative of in (14) with respect to and setting it to zero, we obtain an equation that the maximum likelihood estimator must satisfy:
(15) 
To solve from the above, one needs to consider the rank of the matrix on the left-hand side of (15). Define the SVD of with and being the left and right singular spaces, and being a diagonal matrix that contains all the singular values. We have that
(16) 
where and . Consider the rank of for the two cases below:

s are independent Gaussian random matrices. The columns within each in (16) are linearly independent with probability one. Moreover, the columns in different blocks are independent, since the s are independent and their entries are drawn as independent continuous random variables. Therefore, the columns of are linearly independent and rank() = with probability one. Hence, if , is rank deficient with probability one; if , is full rank with probability one.

s are independent random matrices with entries drawn from a discrete distribution. In this case, we can claim that if , is not full rank, and if , is full rank with high probability, though with probability less than one.
From the above we can see that this matrix is rank deficient when is small. However, this is generally the case, since we want to detect quickly. Because the matrix in (16) is non-invertible in general, we use its pseudo-inverse. Define
From (15), we obtain the maximum likelihood estimator for the post-change mean:
Substituting such a into (14), we obtain the log-GLR statistic for time-varying projections:
(17) 
III-B2 Time-varying 0-1 projection matrices
To further simplify the expression of the GLR in (17), we focus on the special case when has 0-1 entries only. This means that at each time we only observe a subset of the entries.
In this case, is an by identity matrix, and is a diagonal matrix. For a diagonal matrix with diagonal entries , the pseudo-inverse of is a diagonal matrix with diagonal entries if and with diagonal entries if . Let the index set of the observed entries at time be . Define indicator variables
(18) 
Then the log-GLR statistic in (17) becomes
(19) 
Hence, for 0-1 matrices, the sketching procedure based on the log-GLR statistic is given by
(20) 
where is the prescribed threshold and is the window length. Note that the log-GLR statistic essentially computes the sum of each entry within the time window and then averages the squared sums.
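A sketch of the 0-1 statistic under our reading of (19): for each candidate change-point, each coordinate contributes the squared sum of its observed values over the window, normalized by twice the number of times it was observed. The function and argument names are illustrative assumptions.

```python
import numpy as np

def glr01_stat(y, obs, t, w):
    """Log-GLR statistic for time-varying 0-1 projections (our reading of (19)).

    y   : (T, N) data array; y[t', n] is entry n at time t'+1
    obs : (T, N) boolean array; obs[t', n] indicates entry n was observed
    t   : current time (1-indexed); w : window length
    """
    best = 0.0
    for k in range(max(0, t - w), t):  # candidate change-points in the window
        s = np.where(obs[k:t], y[k:t], 0.0).sum(axis=0)  # per-coordinate sums
        cnt = obs[k:t].sum(axis=0)                       # times each was observed
        valid = cnt > 0                                  # skip never-observed entries
        best = max(best, np.sum(s[valid] ** 2 / (2.0 * cnt[valid])))
    return best
```

The procedure (20) then stops the first time this statistic exceeds the prescribed threshold.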
IV Performance analysis
In this section, we present theoretical approximations to two performance metrics: the average run length (ARL), which captures the false-alarm rate, and the expected detection delay (EDD), which captures the power of the detection statistic.
IV-A Performance metrics
We first introduce some necessary notation. Under the null hypothesis in (1), the observations are zero-mean. Denote the probability and expectation in this case by and , respectively. Under the alternative hypothesis, there exists a change-point such that the observations have mean for all . The probability and expectation in this case are denoted by and , respectively.
The choice of the threshold involves a trade-off between two standard performance metrics commonly used for analyzing change-point detection procedures [32]: (i) the ARL, defined to be the expected value of the stopping time when there is no change; and (ii) the EDD, defined to be the expected stopping time in the extreme case where a change occurs immediately at , which is denoted as .
The following argument from [33] explains why we consider . When there is a change at , we are interested in the expected delay until its detection, i.e., the conditional expectation , which is a function of . When the shift in the mean occurs only in the positive direction , it can be shown that . It is not obvious that this remains true when can be either positive or negative. However, since is certainly of interest and reasonably easy to analyze, it is common in the literature to consider , and we also adopt it as a surrogate.
IV-B Fixed projection
Define a special function (cf. [34], page 82)
where denotes the cumulative distribution function (CDF) of the standard Gaussian with zero mean and unit variance. For numerical purposes, a simple and accurate approximation is given by (cf. [35])
where denotes the probability density function (PDF) of the standard Gaussian. We obtain an approximation to the ARL of the sketching procedure with a fixed projection as follows:
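The special function and its numerical approximation can be sketched as follows. Since the displayed formula is not reproduced here, we use the widely cited approximation from the change-point literature (attributed to Siegmund and Yakir), which may be the form the text refers to; treat the exact expression as an assumption.

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):
    """Standard Gaussian PDF."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def nu(x):
    """Numerical approximation to the special function (cf. [34], [35]);
    assumed here to be the standard Siegmund-Yakir form."""
    if x <= 0:
        return 1.0  # the function tends to 1 as its argument tends to 0
    return ((2.0 / x) * (Phi(x / 2.0) - 0.5)) / ((x / 2.0) * Phi(x / 2.0) + phi(x / 2.0))
```

The function decreases from 1 toward 0 as its argument grows, which is the behavior the ARL and EDD approximations rely on.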
Theorem 1 (ARL, fixed projection).
Assume that , with and fixed. Then for for some positive integer , we have that the ARL of the sketching procedure defined in (10) is given by
(21) 
where
(22) 
This theorem gives an explicit expression for the ARL as a function of the threshold , the dimension of the sketches , and the window length . As we show below, the approximation to the ARL given by Theorem 1 is highly accurate. At a higher level, this theorem characterizes the mean of the stopping time when the detection statistic is driven by noise. The requirement that for some positive integer comes from [32], on which our results are based; it ensures the correct scaling when we pass to the limit. Essentially, it requires the window length to be large enough as the threshold increases. In practice, has to be large enough that it does not cause a missed detection: has to be longer than the anticipated expected detection delay, as explained in [32].
Moreover, we obtain an approximation to the EDD of the sketching procedure with a fixed projection as follows. Define
(23) 
Let be a random walk whose increments are independent and identically distributed with mean and variance . Since the random walk has a positive drift, its minimum is well defined, and its expected value is given by [35]
where , and the infinite series converges and can be evaluated easily numerically. Define a stopping time
Let . It can be shown that [32]
Theorem 2 (EDD, fixed projection).
Suppose with other parameters held fixed. Then for a given matrix with the right singular vectors , the EDD of the sketching procedure (10) when is given by
(24) 
The theorem gives an explicit expression for the EDD as a function of the threshold , the number of sketches , and the signal strength, captured through , which depends on the post-change mean vector and the projection subspace .
The proofs of the above two theorems utilize the equivalent form of in (13) and draw a connection between the sketching procedure and the so-called mixture procedure (cf. [32]) when sensors are affected by the change and the post-change mean vector is given by .
IV-B1 Accuracy of theoretical approximations
Consider a generated as a Gaussian random matrix, with entries i.i.d. . Using the expression in Theorem 1, we can find the threshold such that the corresponding ARL is equal to 5000. This can be done conveniently: since the ARL is an increasing function of the threshold , we can use bisection to find such a threshold . We then compare it with a threshold found by simulation.
As shown in Table I, the threshold found using Theorem 1 is very close to that obtained from simulation. Therefore, even though the theoretical ARL approximation is derived as tends to infinity, it is still applicable when is large but finite. Theorem 1 is quite useful in determining a threshold for a targeted ARL, as simulations for large and can be quite time-consuming, especially for a large ARL (e.g., 5000 or 10,000).
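The bisection search for the threshold can be sketched generically. Here `arl_fn` stands in for an implementation of the approximation (21) from Theorem 1, which is not reproduced here; the stand-in curve used in the usage example is purely illustrative.

```python
import math

def find_threshold(arl_fn, target, lo=0.0, hi=200.0, tol=1e-6):
    """Find the threshold b with arl_fn(b) == target by bisection.

    arl_fn : an increasing function of the threshold b, e.g. the ARL
             approximation (21) (not reproduced here).
    target : desired ARL, e.g. 5000.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if arl_fn(mid) < target:
            lo = mid  # ARL too small: raise the threshold
        else:
            hi = mid  # ARL too large: lower the threshold
    return 0.5 * (lo + hi)

# Illustrative use with a stand-in exponential ARL curve (not the true (21)):
b = find_threshold(lambda x: 10.0 * math.exp(0.1 * x), 5000.0)
```

Monotonicity of the ARL in the threshold is what makes this simple search valid.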
Moreover, we simulate the EDD for detecting a signal such that the post-change mean vector has all entries equal to a constant . As also shown in Table I, the approximation for the EDD using Theorem 2 is quite accurate.
We have also verified that the theoretical approximations are accurate for expander graphs; the details are omitted here since they are similar.
Number of sketches | Threshold (theo) | Threshold (simu) | EDD (theo) | EDD (simu)
100 | 84.65 | 84.44 | 3.4 | 4.3 (0.9)
70 | 64.85 | 64.52 | 4.0 | 5.1 (1.2)
50 | 51.04 | 50.75 | 4.8 | 5.9 (1.6)
30 | 36.36 | 36.43 | 7.7 | 7.6 (2.5)
10 | 19.59 | 19.63 | 19.8 | 17.4 (9.8)
IV-B2 Consequence
Remark 1.
For a fixed large ARL, when increases, the ratio is bounded between and . This property is quite useful for establishing the results in Section V. It is demonstrated numerically in Fig. 1 for , , and a fixed ARL of 5000, where the corresponding threshold is found using Theorem 1 as increases from 10 to 100. More precisely, Theorem 1 leads to the following corollary:
Corollary 1.
Assume a large constant . Let . For any large enough , the threshold such that the corresponding ARL satisfies . In other words, .
Note that is on the order of ; hence, the ARL can be very large while still being bounded above (that is, the corollary holds in a non-asymptotic regime).
Remark 2.
As , the first-order approximation to the EDD in Theorem 2 is given by , i.e., the threshold divided by the Kullback–Leibler (KL) divergence (see, e.g., [32], which shows that is the KL divergence between and ). This is consistent with our intuition, since the expected increment of the detection statistic is roughly the KL divergence of the test. For finite , especially when the signal strength is weak and the number of sketches is not large enough, the terms other than play a significant role in determining the EDD.
IV-C Time-varying 0-1 random projection matrices
Below, we obtain approximations to the ARL and EDD of , i.e., when 0-1 sketching matrices are used. We assume a fixed dimension . We also assume that at each time , we randomly select out of dimensions to observe. Hence, at each time, each signal dimension has a probability
to be observed. The sampling scheme is illustrated in Fig. 2, for and (the number of dots in each column is ), over 17 consecutive time periods from time to time .
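This sampling scheme can be sketched directly; the dimensions below are illustrative. Each coordinate's marginal probability of being observed at any time equals the ratio of observed dimensions to total dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_masks(N, M, T):
    """Observation masks for the time-varying 0-1 scheme: at each time,
    exactly M of the N coordinates are observed, chosen uniformly at random."""
    mask = np.zeros((T, N), dtype=bool)
    for t in range(T):
        mask[t, rng.choice(N, size=M, replace=False)] = True
    return mask

mask = sample_masks(N=20, M=5, T=1000)
# Each coordinate is observed with marginal probability M / N = 0.25, so its
# empirical observation frequency over many time steps should be close to that.
freq = mask.mean(axis=0)
```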
For such a sampling scheme, we have the following result:
Theorem 3 (ARL, time-varying 0-1 random projection).
Moreover, we can obtain an approximation to the EDD of , justified by the following arguments. First, relax the deterministic constraint that at each time we observe exactly out of entries. Instead, assume a random sampling scheme in which at each time we observe an entry of with probability , . Consider i.i.d. Bernoulli random variables with parameter for and . Define
Based on this, we define a procedure whose behavior is arguably similar to that of :
where is the prescribed threshold. Then, using the arguments in Appendix B, we can show that the approximation to EDD of this procedure is given by
(26) 
and we use this to approximate the EDD of .
Table II shows the accuracy of the approximations for the ARL in (25) and the EDD in (26) for various s when , , and all entries of . The results show that the thresholds obtained using the theoretical approximation and the EDD approximations are both very accurate.
Number of sketches | Threshold (theo) | Threshold (simu) | EDD (theo) | EDD (simu)
100 | 84.65 | 84.44 | 2.8 | 3.3 (0.8)
70 | 83.72 | 83.41 | 3.8 | 4.5 (1.2)
50 | 82.84 | 83.02 | 5.3 | 6.1 (1.5)
30 | 81.46 | 82.48 | 8.7 | 9.8 (2.4)
10 | 78.32 | 79.27 | 23.4 | 26.6 (6.4)
V Bounding relative performance
In this section, we characterize the relative performance of the sketching procedure compared to that without sketching (i.e., using the original log-GLR statistic). We show that the performance loss due to sketching can be small when the signal-to-noise ratio and are both sufficiently large. In the following, we focus on the fixed projection to illustrate this point.
V-A Relative performance metric
We consider a relative performance measure, the ratio of the EDD using the original data (denoted as EDD(), which corresponds to ) to the EDD using the sketches (denoted as EDD()):
We will show that this ratio depends critically on the following quantity
(27) 
which is the ratio of the KL divergences after and before sketching.
We start by deriving the relative performance measure using the theoretical approximations obtained in the last section. Recall the expression for the EDD approximation in (24). Define
(28) 
From Theorem 2, we obtain that the EDD of the sketching procedure is proportional to
Let and be the thresholds such that the corresponding ARLs are , for the procedure without sketching and with sketches, respectively. Define , and
(29) 
Using the definitions above, we have
(30) 
V-B Discussion of factors in (30)
We can show that for sufficiently large and large signal strength. This can be verified numerically, since all the quantities that depends on can be computed explicitly: the thresholds and can be found from Theorem 1 once we set a target ARL, and the function can be evaluated using (28), which depends explicitly on and . Fig. 3 shows the value of when and all the entries of the post-change mean vector are equal to a constant value that varies along the axis. Note that is less than one only when the signal strength is small and is small. Thus, we have
for sufficiently large and signal strength .
Using Corollary 1, we have that and , and hence a lower bound on the ratio is between and , for large or large signal strength.
Next, we bound when is a Gaussian matrix and when it is an expander graph.
V-C Bounding
V-C1 Gaussian matrix
Consider whose entries are i.i.d. Gaussian with zero mean and variance equal to . First, we have the following lemma:
Lemma 1 ([36]).
Let have i.i.d. entries. Then for any fixed vector , we have that
(31) 
More related results can be found in [37]. Since the distribution has a mean , we have that
We may also show that, provided and grow proportionally, converges to its mean value at a rate exponential in . Define to be
(32) 
We have the following result.
Theorem 4 (Gaussian ).
Let have entries i.i.d. . Let such that (32) holds. Then for , we have that
(33) 
at a rate exponential in . Hence, for Gaussian , is approximately with probability .
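The norm-preservation behavior behind Lemma 1 and Theorem 4 can be checked by Monte Carlo; the dimensions, seed, and number of repetitions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

N, M = 500, 100
x = rng.standard_normal(N)  # an arbitrary fixed vector

# With entries i.i.d. N(0, 1/M), the ratio ||A x||^2 / ||x||^2 has mean 1
# (a scaled chi-square with M degrees of freedom) and concentrates around 1
# as M grows, consistent with Lemma 1 and Theorem 4.
ratios = []
for _ in range(500):
    A = rng.standard_normal((M, N)) / np.sqrt(M)
    ratios.append(np.sum((A @ x) ** 2) / np.sum(x ** 2))
mean_ratio = np.mean(ratios)  # close to 1
```

The fluctuation of a single ratio scales like the square root of 2/M, which is why a larger number of sketches gives tighter concentration.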
V-C2 Sparse sketching matrix
We can show that for certain sparse 0-1 matrices (in particular, expander graphs), is also bounded. This holds for “one-sided” changes, i.e., when the post-change mean vector is element-wise positive. Such a scenario is encountered in environmental monitoring (see, e.g., [32, 38]). These sparse sketching matrices enable efficient sketching schemes, as each entry of the sketching vector only requires collecting information from a few dimensions of the original data vector.
Assume for all . Let consist of binary entries, which corresponds to a bipartite graph. We further consider a bipartite graph with regular left degree (i.e., the number of edges from each variable node is ) and regular right degree (i.e., the number of edges from each parity check node is ), as illustrated in Fig. 4. Hence, this requires .
Expander graphs satisfy the above requirements, and they have been used in compressed sensing to sense sparse vectors (e.g., [39]). In particular, a matrix corresponds to an expander graph with regular right degree if and only if each column of has exactly “1”s, and for any set of right nodes with , the set of neighbors among the left nodes has size . If, in addition, each row of has “1”s, we say corresponds to an expander with regular right degree and regular left degree . The existence of such expander graphs is established in [40]:
Lemma 2 ([40]).
For any fixed and , when is sufficiently large, there always exists an expander with a regular right degree and a regular left degree for some constants , and .
Theorem 5 (Expander ).
If corresponds to an expander with regular right degree and regular left degree , then for any nonnegative vector , we have that
Hence, for expander graphs, is approximately greater than , where is a small number.
V-D Consequence
Combining the results above, we have shown that in the regime where and the signal strength are sufficiently large, the performance loss can be small (as indeed observed in our numerical examples). In this regime, when is a Gaussian random matrix, the relative performance measure is a constant under the conditions of Corollary 1. Moreover, when is a sparse 0-1 matrix with nonzero entries in each row (in particular, an expander graph), the ratio (30) is lower bounded by for some small number , when Corollary 1 holds.
There is an intuitive explanation. Unlike in compressed sensing, where the goal is to recover a sparse signal and one needs the projection to preserve the norm up to a factor through the restricted isometry property (RIP) [41], our goal is to detect a nonzero vector in Gaussian noise, which is a much simpler task. Hence, even though the projection reduces the norm of the vector, detection remains possible as long as the projection does not diminish the signal norm below the noise floor.
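This intuition can be checked numerically. For a Gaussian random projection, whitening amounts to projecting onto the row space of $A$, which retains on average a fraction $M/N$ of the squared signal norm; the signal shrinks but does not vanish. A small Monte Carlo sketch (sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 500, 50
x = rng.standard_normal(N)          # stand-in for the post-change mean vector
ratios = []
for _ in range(200):
    A = rng.standard_normal((M, N))
    # Project x onto the row space of A (equivalent to whitening
    # the sketches by (A A^T)^{-1/2}).
    Q, _ = np.linalg.qr(A.T)        # orthonormal basis of the row space
    ratios.append(np.sum((Q.T @ x) ** 2) / np.sum(x ** 2))
print(np.mean(ratios))              # concentrates near M / N = 0.1
```

The retained fraction of squared signal norm concentrates tightly around $M/N$, so as long as $(M/N)\,\|\mu\|^2$ stays well above the noise level, detection remains feasible.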
On the other hand, when the signal is weak and $M$ is not large enough, there can be significant performance loss (as indeed observed in our numerical examples), and we cannot lower bound the relative performance measure. Fortunately, in this regime, we can use the theoretical results in Theorem 1 and Theorem 2 to design the number of sketches $M$ for an anticipated worst-case signal strength, or to determine the infeasibility of the problem, i.e., that it is better not to use sketching because the signal is too weak.
VI Numerical examples
In this section, we present numerical examples to demonstrate the performance of the sketching procedure. We focus on comparing the sketching procedure with the GLR procedure without sketching (obtained by letting $M = N$ in the sketching procedure). We also compare the sketching procedure with a standard multivariate CUSUM applied to the sketches.
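To make the comparison concrete, the following is a simplified sketch of the windowed GLR statistic for a mean shift, applied to sketches. It assumes the sketches have been whitened to identity covariance (here by orthonormalizing the rows of $A$) and scans a finite window of candidate change points; the sizes, threshold, and window are our illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def glr_stat(y, window):
    """Windowed log-GLR statistic for an unknown mean shift in i.i.d.
    N(0, I_M) observations y[0..t]:
        max over k of ||sum_{i=k+1}^t y_i||^2 / (2 (t - k))."""
    t = len(y)
    best, csum = 0.0, np.zeros(y.shape[1])
    for back in range(1, min(window, t) + 1):
        csum += y[t - back]
        best = max(best, csum @ csum / (2.0 * back))
    return best

rng = np.random.default_rng(2)
N, M, tau = 100, 20, 60               # dimension, sketches, true change point
A = rng.standard_normal((M, N))
Q, _ = np.linalg.qr(A.T)              # whitened projection: Q^T x has identity covariance
mu = 0.5 * np.ones(N)                 # hypothetical post-change mean
xs = rng.standard_normal((120, N))
xs[tau:] += mu
ys = xs @ Q                           # sketches
stats = [glr_stat(ys[: t + 1], window=40) for t in range(len(ys))]
print(int(np.argmax(np.array(stats) > 30)))   # first threshold crossing, shortly after tau
```

The detection time minus $\tau$ is the (one-run) detection delay; averaging over repetitions gives the EDD reported in the figures.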
In the subsequent examples, we set the ARL to 5000 to represent a low false detection rate (a similar choice has been made in other sequential change-point detection work such as [32]). In practice, however, the target ARL value depends on how frequently we can tolerate false detections (e.g., once a month or once a year). Below, we also report the EDD when $M = N$ (i.e., no sketching is used). All simulated results are obtained from repeated Monte Carlo runs. We also consider the minimum number of sketches
(34) 
such that the corresponding sketching procedure is only one sample slower than the full procedure. Below, we focus on this delay loss.
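Given a grid of simulated EDDs, the minimum number of sketches in (34) can be evaluated directly; the numbers below are illustrative placeholders, not results from the paper:

```python
# EDD for each candidate number of sketches M (illustrative values).
edd = {10: 38.2, 30: 21.5, 50: 16.9, 100: 14.1, 300: 13.4}
edd_full = 13.1                       # EDD with M = N (no sketching)

# Smallest M whose EDD is at most one sample above the full-data EDD.
m_min = min(m for m, e in edd.items() if e <= edd_full + 1.0)
print(m_min)                          # → 100
```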
VI-A Fixed projection, Gaussian random matrix
First consider a fixed Gaussian random matrix $A$ and different numbers of sketches $M$.
VI-A1 EDD versus signal magnitude
To simplify the discussion, assume the post-change mean vector has entries of equal magnitude $\delta$. Fig. 5(a) shows EDD versus an increasing signal magnitude $\delta$. Note that when $\delta$ and $M$ are sufficiently large, the sketching procedure can approach the performance of the procedure using the full data, as predicted by our theory. When the signal is weak, we have to use a much larger $M$ to prevent a significant performance loss (and when the signal is too weak we cannot use sketching). Table III shows the minimum number of sketches for each signal strength; we find that when $\delta$ is sufficiently large, little performance is lost even with far fewer than $N$ sketches. Note that here we do not require the signal to be sparse.
VI-A2 EDD versus signal sparsity
Now assume that the post-change mean vector is sparse, with $k$ entries equal to 1 and the remaining entries equal to 0. Fig. 5(b) shows EDD versus an increasing $k$. Note that as $k$ increases, the signal strength also increases, and thus the sketching procedure approaches the performance using the full data. The required minimum number of sketches is listed in Table IV; when $k$ is large enough, one can use far fewer than $N$ sketches with little performance loss.
Table III (minimum number of sketches for each signal magnitude): 300, 150, 100, 50, 30.
Table IV (minimum number of sketches for each sparsity level): 300, 200, 150, 100, 50.
VI-B Fixed projection, expander graph
Now assume $A$ is an expander graph, with different numbers of sketches $M$. We run the simulations with the same settings as in Section VI-A.
VI-B1 EDD versus signal magnitude
Assume the post-change mean vector has entries of equal magnitude $\delta$. Fig. 6(a) shows EDD with an increasing $\delta$. Note that the simulated EDDs are smaller than those for the Gaussian random projections in Fig. 5. A possible reason is that the expander graph is better at aggregating the signals when the entries $\mu_i$ are all positive. However, when the $\mu_i$ can be either positive or negative, the two choices of $A$ have similar performance, as shown in Fig. 7, where the $\mu_i$ are drawn i.i.d. from a uniform distribution.
VI-B2 EDD versus signal sparsity
Assume that the post-change mean vector has only $k$ entries equal to one and the other entries equal to zero. Fig. 6(b) shows the simulated EDD versus an increasing $k$. As $k$ grows, the sketching procedure approaches the performance using the full data.
VI-C Time-varying projections with 0-1 matrices
To demonstrate the performance of the procedure (20) using time-varying projections with 0-1 entries, we again consider two cases: a post-change mean vector with entries of equal magnitude, and a sparse post-change mean vector with only a few entries equal to one and the others equal to zero. The simulated EDDs are shown in Fig. 8. Note that the procedure can detect the change quickly with a small subset of observations. Although its EDDs are larger than those for the fixed projections in Fig. 5 and Fig. 6, this example shows that projections with 0-1 entries can incur little performance loss in some cases, and they remain a viable candidate since such a projection corresponds to a simpler measurement scheme.
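One simple instance of a time-varying 0-1 projection is pure subsampling: at each time step a fresh random subset of $M$ coordinates is observed. The sizes below are illustrative assumptions:

```python
import numpy as np

N, M = 100, 20

def sketch_at_time(x, rng):
    """Time-varying 0-1 projection: observe a fresh random subset of
    M coordinates of the N-dimensional snapshot x at each time step."""
    idx = rng.choice(N, size=M, replace=False)
    A_t = np.zeros((M, N), dtype=int)
    A_t[np.arange(M), idx] = 1       # one '1' per row: a subsampling matrix
    return A_t, A_t @ x

rng = np.random.default_rng(3)
x = np.arange(N, dtype=float)        # toy data snapshot
A_t, y = sketch_at_time(x, rng)
assert y.shape == (M,) and (A_t.sum(axis=1) == 1).all()
```

For a sensor network, this corresponds to polling only $M$ of the $N$ sensors at each time, which is exactly the practical benefit of the 0-1 measurement scheme.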
VI-D Comparison with multivariate CUSUM
We compare our sketching method with a benchmark adapted from the conventional multivariate CUSUM procedure [42] applied to the sketches. A main difference is that multivariate CUSUM requires a prescribed post-change mean vector (set to be an all-one vector in our example), rather than estimating it as the GLR statistic does. Hence, its performance may be affected by parameter misspecification. We compare the performance in two settings: when all entries of the post-change mean are equal to a constant, and when the entries of the post-change mean vector are positive valued. In Fig. 9, the log-GLR based sketching procedure performs much better than the multivariate CUSUM.
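For reference, one standard form of such a benchmark is the likelihood-ratio CUSUM recursion for a prescribed shift $\delta$ (a generic sketch; the exact benchmark in [42] may differ in details):

```python
import numpy as np

def mcusum(ys, delta):
    """CUSUM recursion for a *prescribed* post-change mean delta in
    i.i.d. N(0, I) data:
        W_t = max(0, W_{t-1} + delta'y_t - ||delta||^2 / 2).
    Unlike the GLR, a misspecified delta degrades this statistic."""
    w, path = 0.0, []
    half = 0.5 * delta @ delta
    for y in ys:
        w = max(0.0, w + delta @ y - half)
        path.append(w)
    return np.array(path)

rng = np.random.default_rng(4)
M, tau = 20, 50
delta = 0.3 * np.ones(M)             # assumed (possibly misspecified) shift
ys = rng.standard_normal((100, M))
ys[tau:] += 0.3                      # here the true shift happens to match
path = mcusum(ys, delta)
print(path[:tau].max() < path[-1])   # statistic grows after the change
```

When the true shift matches `delta`, the statistic drifts upward after the change; when it does not (e.g., a shift in a different direction), the drift term `delta @ y` can be small or negative, which is the misspecification effect discussed above.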
VII Examples for real applications
VII-A Solar flare detection
We use our method to detect a solar flare in a video sequence from the Solar Data Observatory (SDO) (the video can be found at http://nislab.ee.duke.edu/MOUSSE/; the Solar Object Locator for the original data is SOL20110430T214549L061C108). Each frame consists of a large number of pixels, which results in a high ambient dimension $N$. In this example, the normal frames are slowly drifting background sun surfaces, and the anomaly is a much brighter transient solar flare that emerges partway through the sequence. Fig. 10(a) is a snapshot of the original SDO data before the solar flare emerges, and Fig. 10(b) is a snapshot when the solar flare emerges as a brighter curve in the middle of the image. We preprocess the data by tracking and removing the slowly changing background with the MOUSSE algorithm [43] to obtain tracking residuals. The Gaussianity of the residuals, which corresponds to our pre-change model, is verified by the Kolmogorov-Smirnov test; the large p-value for the pre-change signal indicates that Gaussianity is a reasonable assumption.
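The Kolmogorov-Smirnov check on the residuals can be reproduced in a few lines; here we use synthetic stand-in residuals, since the SDO data themselves are not included:

```python
import math
import numpy as np

def ks_statistic(z):
    """One-sample Kolmogorov-Smirnov distance between the empirical
    CDF of z and the standard normal CDF."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(5)
residuals = rng.standard_normal(2000)   # stand-in for the MOUSSE residuals
d = ks_statistic(residuals)
# For Gaussian data the distance stays below the 5% critical value
# ~ 1.36 / sqrt(n) with high probability, so Gaussianity is not rejected.
print(d < 0.05)
```

In practice one would standardize the real residuals first and compare the statistic against the critical value (or the corresponding p-value) at the desired significance level.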
We apply the sketching procedure with a fixed projection to the MOUSSE residuals, choosing the sketching matrix $A$ to be an $M \times N$ Gaussian random matrix with i.i.d. entries. Note that the signal is deterministic in this case. To evaluate our method, we run the procedure repeatedly, each time using a different random Gaussian matrix as the fixed projection $A$. Fig. 11 shows the error bars of the EDDs over these runs. As $M$ increases, both the means and the standard deviations of the EDDs decrease. When $M$ is sufficiently large, the EDD becomes small, which means that our sketching procedure can reliably detect the solar flare with only a small number of sketches; this represents a significant dimensionality reduction.
VII-B Change-point detection for power systems
Finally, we present a synthetic example based on a real power network topology. We consider the Western States Power Grid of the United States, shown in Fig. 12 (the topology of the power network can be downloaded at http://networkdata.ics.uci.edu/data/power/ [44]). The nodes represent generators, transformers, and substations, and the edges represent high-voltage transmission lines between them [44]. Note that the graph is sparse and contains many "communities" corresponding to densely connected subnetworks.
In this example, we simulate power failures over this large network. Assume that at each time we may observe the real power injection at each edge. When the power system is in a steady state, the observation is the true state plus Gaussian observation noise [45]. We may estimate the true state (e.g., using techniques in [45]), subtract it from the observation vector, and treat the residual vector as our signal, which can be assumed to be i.i.d. standard Gaussian. When a failure happens in the power system, there will be a shift in the mean for a small number of affected edges, since in practice only a small part of the network is usually affected simultaneously.
To perform sketching, at each time we randomly choose $M$ nodes in the network and measure, for each chosen node, the sum of the quantities over all attached edges, as shown in Fig. 13. In this case, the projection matrix is a time-varying 0-1 matrix whose structure is constrained by the network topology. Our example is a simplified model for power networks and aims to shed some light on the potential of our method for monitoring real power networks.
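The node-sum measurement above can be sketched on a toy graph (the edges here are hypothetical, not the actual grid): each sketch row is simply a row of the 0-1 node-to-edge incidence matrix for a randomly chosen node.

```python
import numpy as np

# Toy graph: edges as (node, node) pairs; in the power-grid example
# these would be the transmission lines.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
n_nodes, n_edges = 4, len(edges)

# 0-1 node-to-edge incidence: row i has a '1' for every edge attached to node i.
B = np.zeros((n_nodes, n_edges), dtype=int)
for j, (u, v) in enumerate(edges):
    B[u, j] = B[v, j] = 1

# Sketch: pick m random nodes and observe the sums over their attached edges.
rng = np.random.default_rng(6)
m = 2
nodes = rng.choice(n_nodes, size=m, replace=False)
A_t = B[nodes]                    # 0-1 projection constrained by topology
x = np.ones(n_edges)              # edge measurements (all ones for illustration)
print(A_t @ x)                    # each entry equals the chosen node's degree
```

With all-ones edge measurements, each sketch entry reduces to the degree of the selected node, which makes the topology constraint on the projection explicit.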
In the following experiment, we assume that on average a small fraction of the edges in the network have their means shifted by $\delta$. We set the threshold such that the ARL is 5000. Fig. 14 shows the simulated EDD versus an increasing signal strength $\delta$. Note that the EDD from using a small number of sketches is quite small if $\delta$ is sufficiently large: the change can be detected, with the EDD increased by only one sample, while observing only a small fraction of the nodes, which is a significant dimensionality reduction.