# Sketching for Sequential Change-Point Detection

Yang Cao, Andrew Thompson, Meng Wang, and Yao Xie Yang Cao (Email: caoyang@gatech.edu) and Yao Xie (Email: yao.xie@isye.gatech.edu) are with the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA. Andrew Thompson (Email: thompson@maths.ox.ac.uk) is with the Mathematical Institute, University of Oxford, Oxford, UK. Meng Wang (Email: wangm7@rpi.edu) is with the Department of Electrical, Computer & Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA. Authors are listed alphabetically. This paper was presented in part at GlobalSIP 2015 [1].
###### Abstract

We study sequential change-point detection procedures based on linear sketches of high-dimensional signal vectors using generalized likelihood ratio (GLR) statistics. The GLR statistics allow for an unknown post-change mean that represents an anomaly or novelty. We consider both fixed and time-varying projections and derive theoretical approximations to two fundamental performance metrics, the average run length (ARL) and the expected detection delay (EDD); these approximations are shown to be highly accurate by numerical simulations. We further characterize the relative performance of the sketching procedure compared to that without sketching, and show that there can be little performance loss when the signal strength is sufficiently large and enough sketches are used. Finally, we demonstrate the good performance of sketching procedures using simulation and real-data examples on solar flare detection and failure detection in power networks.

## I Introduction

Online change-point detection from high-dimensional streaming data is a fundamental problem arising from applications such as real-time monitoring of sensor networks, computer network anomaly detection and computer vision (e.g., [2, 3]). To reduce data dimensionality, a conventional approach is sketching (see, e.g., [4]), which performs random projection of the high-dimensional data vectors into lower-dimensional ones. Sketching has now been widely used in signal processing and machine learning to reduce dimensionality and algorithm complexity, and achieve various practical benefits [5, 6, 7, 8, 9, 10, 11].

We consider change-point detection using linear sketches of high-dimensional data vectors. Sketching reduces the computational complexity of evaluating the detection statistic, since the cost depends on the dimension $M$ of the sketches rather than the original dimension $N \gg M$. Since we would like to perform real-time detection, any reduction in computational complexity (without incurring much performance loss) is highly desirable. Sketching also offers practical benefits. For instance, for large sensor networks, it reduces the burden of data collection and transmission: it may be impossible to collect data from all sensors and transmit them to a central hub in real time, but this can be done if we only select a small subset of sensors to collect data at each time. Sketching also reduces the data storage requirement. For instance, change-point detection using the generalized likelihood ratio statistic, although robust, is non-recursive, and one has to store historical data. Using sketching, we only need to store the much lower-dimensional sketches rather than the original high-dimensional vectors.

In this paper, we present a new sequential sketching procedure based on the generalized likelihood ratio (GLR) statistic. In particular, suppose we may choose an $M \times N$ matrix $A$ with $M \ll N$ to project the original data: $y_t = A x_t$, $t = 1, 2, \ldots$. Assume the pre-change vector is zero-mean Gaussian distributed, and the post-change vector is Gaussian distributed with an unknown mean vector $\mu$ while the covariance matrix is unchanged. Here we assume the mean vector is unknown since it typically represents an anomaly. The GLR statistic is formed by replacing the unknown $\mu$ with its maximum likelihood estimator (e.g., [12]). We then further generalize to the setting of time-varying projections $A_t$ of dimension $M_t \times N$. We demonstrate the good performance of our procedures by simulations, a real-data example of solar flare detection, and a synthetic example of power network failure detection with data generated using a real-world power network topology.

Our theoretical contributions are mainly two-fold. First, we obtain analytic expressions for two fundamental performance metrics of the sketching procedures: the average run length (ARL) when there is no change, and the expected detection delay (EDD) when there is a change-point, for both fixed and time-varying projections. Our approximations are shown to be highly accurate by simulations. These approximations are quite useful for determining the threshold of the detection procedure to control false alarms, without having to resort to onerous numerical simulations. Second, we characterize the relative performance of the sketching procedure compared to that without sketching. We examine the EDD ratio when the sketching matrix is either a random Gaussian matrix or a sparse 0-1 matrix (in particular, an expander graph). We find that, as also verified numerically, when the signal strength and the number of sketches $M$ are sufficiently large, the sketching procedure may incur little performance loss. When the signal is weak, the performance loss can be large if $M$ is too small. In this case, our results can be used to find the minimum $M$ such that the performance loss is bounded, assuming a certain worst-case signal and a given target ARL value.

To the best of our knowledge, our work is the first to consider sequential change-point detection using the generalized likelihood ratio statistic, assuming the post-change mean is unknown and represents an anomaly. The only other work [22] that considers change-point detection using linear projections assumes the post-change mean is known and, further, sparse. Our results are more general since we do not make such assumptions. Assuming the post-change mean to be unknown yields a more useful procedure since, in change-point detection, the post-change setup is usually unknown. Moreover, [22] considers the Shiryaev–Roberts procedure, which is based on a different kind of detection statistic than the generalized likelihood ratio statistic considered here. The theoretical analyses therein consider slightly different performance measures, the probability of false alarm (PFA) and the average detection delay (ADD), and our analyses are completely different.

Our notations are standard: $\chi^2_k$ denotes the chi-square distribution with $k$ degrees of freedom; $I_M$ denotes an identity matrix of size $M$; $A^\dagger$ denotes the pseudoinverse of a matrix $A$; $[x]_i$ denotes the $i$th coordinate of a vector $x$; $[A]_{ij}$ denotes the $(i,j)$th element of a matrix $A$; $X^\intercal$ denotes the transpose of a vector or matrix $X$.

The rest of the paper is organized as follows. We first review some related work. Section II sets up the formulation of the sketching problem for sequential change-point detection. Section III presents the sketching procedure. Section IV and Section V contain the performance analysis of the sketching procedures. Section VI and Section VII demonstrate the good performance of our sketching procedures using simulation and real-world examples. Section VIII concludes the paper. All proofs are relegated to the appendix.

### I-a Related work

Change-point detection problems are closely related to industrial quality control and statistical process control (SPC) charts, where an observed process is assumed to be initially in control and to become out of control at a change-point. The idea of using random projections for change detection has been explored for SPC in the pioneering work [13] based on multivariate control charts, in the follow-up work [14] for the cumulative sum (CUSUM) and exponentially weighted moving average (EWMA) schemes, and in [15, 16] based on the Hotelling statistic. These works provide a complementary perspective from SPC design, while our method takes a different approach based on sequential hypothesis testing, treating both the change-point location and the post-change mean vector as unknown parameters. By treating the change-point location as an unknown parameter when deriving the detection statistic, the sequential hypothesis testing approach overcomes the lack-of-memory drawback of some SPC methods, such as the Shewhart chart and the Hotelling chart, which cannot utilize the information embedded in the entire sequence [17]. Moreover, our sequential GLR statistic may be preferred over the CUSUM procedure in settings where it is difficult to specify the post-change mean vector. Besides the above distinctions from the SPC methods, other novel aspects of our work include: (1) we develop new theoretical results for the sequential GLR statistic; (2) we consider sparse 0-1 and time-varying projections (the sparse 0-1 projection corresponds to downsampling the dimensions); (3) we study how much dimensionality reduction can be performed (i.e., the minimum $M$) such that there is little performance loss.

This paper extends our preliminary work reported in [1] in several important ways. We have added (1) time-varying sketching projections and their theoretical analysis; (2) extensive numerical examples to verify our theoretical results; (3) new real-data examples of solar flare detection and power failure detection.

Our work is related to compressive signal processing [18], where the problem of interest is to estimate or detect (in the fixed-sample setting) a sparse signal using compressive measurements. In [19], an offline test for a non-zero vector buried in Gaussian noise using linear measurements is studied; interestingly, a conclusion similar to ours is drawn, namely that the task of detection in this setting is much easier than the tasks of estimation and support recovery. Another related work is [20], which considers the problem of identifying a subset of data streams within a larger set, where the data streams in the subset follow a distribution (representing an anomaly) different from the original distribution; the problem considered therein is not a sequential change-point detection problem, as the “change-point” happens at the onset ($\kappa = 0$). In [21], an offline setting is considered, and the goal is to identify the samples whose distributions differ from the nominal distribution. They use a “temporal” mixing of the samples over the finite time horizon. This differs from our setting since we project over the signal dimension at each time. Other related work includes the kernel methods of [23] and [24], which focus on offline change-point detection. Finally, detecting transient changes in power systems has been studied in [25].

Another common approach to dimensionality reduction is principal component analysis (PCA) [26], which achieves dimensionality reduction by projecting the signal onto the singular space of the leading singular values. In this case, the projection corresponds to the signal singular space. Our theoretical approximations for the ARL and EDD can also be applied in this setting. However, it may not be easy to find the signal singular space when the dimensionality is high, since computing the singular value decomposition can be expensive [27].

## Ii Formulation

Consider a sequence of observations $x_1, x_2, \ldots$ over an open time horizon, where $x_t \in \mathbb{R}^N$ and $N$ is the signal dimension. Initially, the observations are due to noise. There can be an unknown time $\kappa$ at which a change-point occurs and changes the mean of the signal vector. Such a problem can be formulated as the following hypothesis test:

 $$\begin{aligned} H_0 &: x_t \sim \mathcal{N}(0, I_N), \quad t = 1, 2, \ldots \\ H_1 &: x_t \sim \mathcal{N}(0, I_N), \quad t = 1, 2, \ldots, \kappa, \\ &\ \ \, x_t \sim \mathcal{N}(\mu, I_N), \quad t = \kappa+1, \kappa+2, \ldots \end{aligned} \qquad (1)$$

where the unknown mean vector is defined as

 $$\mu \triangleq [\mu_1, \ldots, \mu_N]^\intercal \in \mathbb{R}^N.$$

Without loss of generality, we have assumed the noise variance is 1. Our goal is to detect the change-point as soon as possible after it occurs, subject to a false-alarm constraint. Here, we assume the covariance of the data is an identity matrix and that the change only happens to the mean.

To reduce data dimensionality, we linearly project each observation into a lower dimensional space, which we refer to as sketching. We aim to develop procedures that can detect a change-point using the low-dimensional sketches. In the following, we consider two types of linear sketching: the fixed projection and the time-varying projection.

Fixed projection. Choose an $M \times N$ projection matrix $A$ with $M < N$. We obtain low-dimensional sketches via:

 $$y_t \triangleq A x_t, \quad t = 1, 2, \ldots \qquad (2)$$
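As a minimal illustration of (2), the sketches can be generated as below; the dimensions $N=100$, $M=20$ and the Gaussian sketching matrix are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 20  # illustrative ambient dimension and number of sketches

# A Gaussian sketching matrix (one common choice considered in the paper)
A = rng.standard_normal((M, N)) / np.sqrt(M)

x_t = rng.standard_normal(N)   # a pre-change observation, x_t ~ N(0, I_N)
y_t = A @ x_t                  # the sketch y_t = A x_t, as in (2)
```

Each sketch is an $M$-dimensional vector, so only $M$ values per time step need to be stored or transmitted.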

Then the hypothesis test for the original problem (1), becomes the following hypothesis test based on the sketches (2)

 $$\begin{aligned} H_0 &: y_t \sim \mathcal{N}(0, AA^\intercal), \quad t = 1, 2, \ldots \\ H_1 &: y_t \sim \mathcal{N}(0, AA^\intercal), \quad t = 1, 2, \ldots, \kappa, \\ &\ \ \, y_t \sim \mathcal{N}(A\mu, AA^\intercal), \quad t = \kappa+1, \kappa+2, \ldots \end{aligned} \qquad (3)$$

Note that both mean and covariance structures are affected by the projections.

Time-varying projection. In certain applications, one may use a different sketching matrix at each time. The projections are denoted by $A_t \in \mathbb{R}^{M_t \times N}$, where the number of sketches $M_t$ can change over time as well. The hypothesis test for the sketches becomes:

 $$\begin{aligned} H_0 &: y_t \sim \mathcal{N}(0, A_t A_t^\intercal), \quad t = 1, 2, \ldots \\ H_1 &: y_t \sim \mathcal{N}(0, A_t A_t^\intercal), \quad t = 1, 2, \ldots, \kappa, \\ &\ \ \, y_t \sim \mathcal{N}(A_t \mu, A_t A_t^\intercal), \quad t = \kappa+1, \kappa+2, \ldots \end{aligned} \qquad (4)$$

The above models also capture several cases:
(i) (Pairwise comparison.) In applications such as social network data analysis and computer vision, we are interested in pairwise comparisons of variables [28, 29]. This can be modeled as observing the difference between a pair of variables, i.e., at each time $t$, the measurements are $[x_t]_i - [x_t]_j$ for a set of pairs $(i, j)$. There are a total of $\binom{N}{2}$ possible comparisons, and we may select $M$ of them to observe. The pairwise comparison model leads to a structured fixed projection with only $0$ and $\pm 1$ entries, for instance:

 $$A = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & -1 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 1 & 0 & -1 \end{bmatrix} \in \mathbb{R}^{M \times N},$$
where each row contains exactly one $+1$ and one $-1$.
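Such a pairwise-comparison projection is easy to assemble programmatically; the helper below (the function name `pairwise_projection` is a hypothetical label, not from the paper) builds one row per selected pair $(i, j)$:

```python
import numpy as np

def pairwise_projection(pairs, N):
    """Build a 0/±1 projection matrix: row r observes [x]_i - [x]_j
    for the r-th selected pair (i, j)."""
    A = np.zeros((len(pairs), N))
    for row, (i, j) in enumerate(pairs):
        A[row, i] = 1.0
        A[row, j] = -1.0
    return A

# Example: observe x_0 - x_1 and x_1 - x_4 in a 5-dimensional signal
A = pairwise_projection([(0, 1), (1, 4)], N=5)
```

Applying this $A$ to a data vector returns exactly the selected pairwise differences.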

(ii) (Missing data.) In various applications we may only observe a subset of entries at each time (e.g., due to sensor failure), and the locations of the observed entries also vary with time [30]. This corresponds to $A_t$ being a submatrix of the identity matrix, formed by selecting the rows indexed by the set of observed entries at time $t$.

(iii) (PCA.) There are also approaches to change-point detection using principal component analysis (PCA) of the data streams (e.g., [26, 31]), which can be viewed as using a fixed projection $A$ given by the signal singular space associated with the leading singular values.

## Iii Sketching procedure

### Iii-a Fixed projection

#### Iii-A1 Derivation of GLR statistic

We now derive the likelihood ratio statistic for the hypothesis test in (3). Define the sample mean within a window

 $$\bar{y}_{k,t} = \frac{\sum_{i=k+1}^{t} y_i}{t-k}. \qquad (5)$$

Since the observations are i.i.d. over time, for an assumed change-point location $k$, the log-likelihood ratio (log-LR) of the observations accumulated up to time $t$ for the hypothesis test in (3) can be shown to be

 $$\begin{aligned} \ell(t,k,\mu) &= \log \frac{\prod_{i=1}^{k} f_0(y_i) \cdot \prod_{i=k+1}^{t} f_1(y_i)}{\prod_{i=1}^{t} f_0(y_i)} = \sum_{i=k+1}^{t} \log \frac{f_1(y_i)}{f_0(y_i)} \\ &= (t-k)\left[\bar{y}_{k,t}^\intercal (AA^\intercal)^{-1} A\mu - \frac{1}{2}\mu^\intercal A^\intercal (AA^\intercal)^{-1} A\mu\right], \end{aligned} \qquad (6)$$

where $f_0$ denotes the probability density function of $y_i$ under the null, i.e., $\mathcal{N}(0, AA^\intercal)$, and $f_1$ denotes the probability density function of $y_i$ under the alternative, i.e., $\mathcal{N}(A\mu, AA^\intercal)$.

Since $\mu$ is unknown, we replace it with its maximum likelihood estimator for fixed values of $k$ and $t$ in the likelihood ratio (6) to obtain the log-generalized likelihood ratio (log-GLR) statistic. Taking the derivative of $\ell$ in (6) with respect to $\mu$ and setting it to zero, we obtain an equation that the maximum likelihood estimator of the post-change mean vector needs to satisfy:

 $$A^\intercal (AA^\intercal)^{-1} A\mu^* = A^\intercal (AA^\intercal)^{-1} \bar{y}_{k,t}, \qquad (7)$$

or equivalently

 $$A^\intercal \left[(AA^\intercal)^{-1} A\mu^* - (AA^\intercal)^{-1} \bar{y}_{k,t}\right] = 0.$$

Note that since $A$ is of dimension $M$-by-$N$ with $M < N$, this defines an under-determined system of equations for the maximum likelihood estimator $\mu^*$. In other words, any $\mu^*$ that satisfies

 $$(AA^\intercal)^{-1} A\mu^* = (AA^\intercal)^{-1} \bar{y}_{k,t} + c,$$

for any vector $c$ that lies in the null space of $A^\intercal$, is a maximum likelihood estimator for the post-change mean. In particular, we may choose the zero vector $c = 0$; the corresponding maximum likelihood estimator then satisfies the equation below:

 $$(AA^\intercal)^{-1} A\mu^* = (AA^\intercal)^{-1} \bar{y}_{k,t}. \qquad (8)$$

Substituting such a $\mu^*$ into (6), we form the log-GLR. Using (8), the first and second terms in (6) become, respectively,

 $$\bar{y}_{k,t}^\intercal (AA^\intercal)^{-1} A\mu^* = \bar{y}_{k,t}^\intercal (AA^\intercal)^{-1} \bar{y}_{k,t}, \qquad \frac{1}{2}\mu^{*\intercal} A^\intercal (AA^\intercal)^{-1} A\mu^* = \frac{1}{2}\bar{y}_{k,t}^\intercal (AA^\intercal)^{-1} \bar{y}_{k,t}.$$

Combining the above with (6), we have that the log-GLR statistic is given by

 $$\ell(t,k,\mu^*) = \frac{t-k}{2}\,\bar{y}_{k,t}^\intercal (AA^\intercal)^{-1} \bar{y}_{k,t}. \qquad (9)$$

Since the change-point location is unknown, when forming the detection statistic we take the maximum over a set of possible locations, i.e., over the most recent $w$ samples, with $k$ ranging from $t-w$ to $t-1$, where $w$ is the window size. We then define the sketching procedure, which is a stopping time that stops whenever the log-GLR statistic rises above a threshold $b$:

 $$T = \inf\left\{t: \max_{t-w \le k < t} \ell(t,k,\mu^*) > b\right\}. \qquad (10)$$

Here the role of the window is two-fold: it reduces data storage when implementing the detection procedure, and it establishes a minimum level of change that we want to detect.
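To make the procedure concrete, here is a minimal sketch of (9)-(10) in Python; the function names and simulation parameters are illustrative, not from the paper. It maintains a sliding window of sketches, maximizes the log-GLR over the assumed change-point within the window, and stops at the first threshold crossing:

```python
import numpy as np

def log_glr(y_window, AAt_inv):
    """Max over assumed change-points of the log-GLR (9) in a window.
    y_window: shape (w, M), holding the most recent sketches."""
    w = y_window.shape[0]
    best = -np.inf
    for back in range(1, w + 1):          # back = t - k, the post-change span
        ybar = y_window[-back:].mean(axis=0)
        stat = 0.5 * back * ybar @ AAt_inv @ ybar
        best = max(best, stat)
    return best

def sketch_procedure(stream, A, b, w):
    """Stopping time as in (10): alarm once the windowed log-GLR exceeds b."""
    AAt_inv = np.linalg.inv(A @ A.T)
    buf = []
    for t, x in enumerate(stream, start=1):
        buf.append(A @ x)
        buf = buf[-w:]                     # keep only the last w sketches
        if log_glr(np.array(buf), AAt_inv) > b:
            return t
    return None                            # no alarm within the stream

# Illustrative run: change at t = 100, strong mean shift of 1 in every entry
rng = np.random.default_rng(1)
N, M = 50, 10
A = rng.standard_normal((M, N))
stream = np.vstack([rng.standard_normal((100, N)),
                    rng.standard_normal((100, N)) + 1.0])
T = sketch_procedure(stream, A, b=40.0, w=20)
```

With a strong shift like this, the alarm typically fires within a handful of samples after the change.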

#### Iii-A2 Equivalent formulation of fixed projection sketching procedure

We can further simplify the log-GLR statistic in (9) using the singular value decomposition (SVD) of $A$. This will facilitate the performance analysis in Section IV and leads to some insights about the structure of the log-GLR statistic. Let the SVD of $A$ be given by

 $$A = U \Sigma V^\intercal, \qquad (11)$$

where $U \in \mathbb{R}^{M \times M}$ and $V \in \mathbb{R}^{N \times M}$ are the left and right singular spaces, and $\Sigma \in \mathbb{R}^{M \times M}$ is a diagonal matrix containing all non-zero singular values. Then $(AA^\intercal)^{-1} = U \Sigma^{-2} U^\intercal$. Thus, we can write the log-GLR statistic (9) as

 $$\ell(t,k,\mu^*) = \frac{t-k}{2}\,\bar{y}_{k,t}^\intercal U \Sigma^{-2} U^\intercal \bar{y}_{k,t}. \qquad (12)$$

Substitution of the sample average (5) into (12) results in

 $$\ell(t,k,\mu^*) = \frac{\left\|\Sigma^{-1} U^\intercal \left(\sum_{i=k+1}^{t} y_i\right)\right\|_2^2}{2(t-k)}.$$

Now define transformed data

 $$z_i \triangleq \Sigma^{-1} U^\intercal y_i.$$

Since $y_i \sim \mathcal{N}(0, AA^\intercal)$ under the null hypothesis, we have $z_i \sim \mathcal{N}(0, I_M)$. Similarly, under the alternative hypothesis, we have $z_i \sim \mathcal{N}(V^\intercal \mu, I_M)$. Combining the above, we obtain the following equivalent form for the sketching procedure in (10):

 $$T' = \inf\left\{t: \max_{t-w \le k < t} \frac{\left\|\sum_{i=k+1}^{t} z_i\right\|_2^2}{2(t-k)} > b\right\}. \qquad (13)$$

This form of the detection procedure has an intuitive explanation: the sketching procedure essentially projects the data to form $M$ (fewer than $N$) independent data streams, and then forms a log-GLR statistic for these independent data streams.
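The equivalence between the original form (9) and the transformed-data form can be checked numerically; the snippet below (with illustrative dimensions) computes the log-GLR both ways and verifies that they agree:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, span = 5, 12, 8                          # span plays the role of t - k
A = rng.standard_normal((M, N))
Y = rng.standard_normal((span, M))             # sketches y_{k+1}, ..., y_t (rows)

# Original form (9): (t-k)/2 * ybar^T (A A^T)^{-1} ybar
ybar = Y.mean(axis=0)
glr_original = 0.5 * span * ybar @ np.linalg.inv(A @ A.T) @ ybar

# Equivalent form via the SVD A = U S V^T, with z_i = S^{-1} U^T y_i
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Z = (Y @ U) / s                                # row i is z_i^T
glr_svd = np.linalg.norm(Z.sum(axis=0)) ** 2 / (2 * span)
```

Both expressions evaluate the same quadratic form, since $(AA^\intercal)^{-1} = U\Sigma^{-2}U^\intercal$.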

### Iii-B Time-varying projection

#### Iii-B1 GLR statistic

Similarly, we can derive the log-LR statistic for time-varying projections. For an assumed change-point location $k$, using all observations from time $k+1$ to time $t$, we find a log-likelihood ratio statistic similar to (6):

 $$\ell(t,k,\mu) = \sum_{i=k+1}^{t} \left[ y_i^\intercal (A_i A_i^\intercal)^{-1} A_i \mu - \frac{1}{2}\mu^\intercal A_i^\intercal (A_i A_i^\intercal)^{-1} A_i \mu \right]. \qquad (14)$$

Similarly, we replace the unknown post-change mean vector by its maximum likelihood estimator using the data in $(k, t]$. Taking the derivative of $\ell$ in (14) with respect to $\mu$ and setting it to zero, we obtain an equation that the maximum likelihood estimator needs to satisfy:

 $$\left[\sum_{i=k+1}^{t} A_i^\intercal (A_i A_i^\intercal)^{-1} A_i\right] \mu^* = \sum_{i=k+1}^{t} A_i^\intercal (A_i A_i^\intercal)^{-1} y_i. \qquad (15)$$

To solve for $\mu^*$ from the above, one needs to consider the rank of the matrix on the left-hand side of (15). Define the SVD $A_i = U_i \Sigma_i V_i^\intercal$, with $U_i$ and $V_i$ being the left and right singular spaces and $\Sigma_i$ being a diagonal matrix that contains all the singular values. We have that

 $$\sum_{i=k+1}^{t} A_i^\intercal (A_i A_i^\intercal)^{-1} A_i = \sum_{i=k+1}^{t} V_i V_i^\intercal = QQ^\intercal, \qquad (16)$$

where $Q \triangleq [V_{k+1}, \ldots, V_t] \in \mathbb{R}^{N \times \sum_{i=k+1}^{t} M_i}$. Consider the rank of $Q$ for the two cases below:

1. The $A_i$'s are independent Gaussian random matrices. The columns within each $V_i$ in (16) are linearly independent with probability one. Moreover, the columns in different blocks are independent since the $A_i$'s are independent and their entries are drawn as independent continuous random variables. Therefore, the columns of $Q$ are linearly independent and $\mathrm{rank}(Q) = \min\{N, \sum_{i=k+1}^{t} M_i\}$ with probability one. Hence, if $\sum_{i=k+1}^{t} M_i < N$, then $QQ^\intercal$ is rank deficient with probability one; if $\sum_{i=k+1}^{t} M_i \ge N$, then $QQ^\intercal$ is full rank with probability one;

2. The $A_i$'s are independent random matrices with entries drawn from a discrete distribution. In this case, we can claim that if $\sum_{i=k+1}^{t} M_i < N$, then $QQ^\intercal$ is not full rank, and if $\sum_{i=k+1}^{t} M_i \ge N$, then $QQ^\intercal$ is full rank with high probability, though with probability less than one.

From the above we can see that this matrix is rank deficient when $t - k$ is small. However, this is generally the case of interest, since we want to detect a change quickly. Since the matrix in (16) is non-invertible in general, we use its pseudo-inverse. Define

 $$B_{k,t} \triangleq \left(\sum_{i=k+1}^{t} A_i^\intercal (A_i A_i^\intercal)^{-1} A_i\right)^{\dagger} \in \mathbb{R}^{N \times N}.$$

From (15), we obtain a maximum likelihood estimator for the post-change mean:

 $$\mu^* = B_{k,t} \sum_{i=k+1}^{t} A_i^\intercal (A_i A_i^\intercal)^{-1} y_i.$$

Substituting such a $\mu^*$ into (14), we obtain the log-GLR statistic for time-varying projections:

 $$\ell(t,k,\mu^*) = \frac{1}{2}\left(\sum_{i=k+1}^{t} A_i^\intercal (A_i A_i^\intercal)^{-1} y_i\right)^\intercal B_{k,t} \left(\sum_{i=k+1}^{t} A_i^\intercal (A_i A_i^\intercal)^{-1} y_i\right). \qquad (17)$$

#### Iii-B2 Time-varying 0-1 projection matrices

To further simplify the expression of the GLR in (17), we focus on a special case where $A_t$ has 0-1 entries only. This means that at each time we only observe a subset of the entries.

In this case, $A_i A_i^\intercal$ is an $M_i$-by-$M_i$ identity matrix, and $\sum_{i=k+1}^{t} A_i^\intercal A_i$ is a diagonal matrix. For a diagonal matrix with diagonal entries $d_n$, the pseudo-inverse is a diagonal matrix with diagonal entries $1/d_n$ if $d_n \neq 0$ and with diagonal entries $0$ if $d_n = 0$. Let the index set of the observed entries at time $t$ be $\Omega_t$. Define indicator variables

 $$I_n^t = \begin{cases} 1, & \text{if } n \in \Omega_t; \\ 0, & \text{otherwise.} \end{cases} \qquad (18)$$

Then the log-GLR statistic in (17) becomes

 $$\ell(t,k,\mu^*) = \frac{1}{2} \sum_{n=1}^{N} \frac{\left(\sum_{i=k+1}^{t} I_n^i\, [x_i]_n\right)^2}{\sum_{i=k+1}^{t} I_n^i}, \qquad (19)$$
where terms with $\sum_{i=k+1}^{t} I_n^i = 0$ are taken to be zero.

Hence, for 0-1 matrices, the sketching procedure based on log-GLR statistic is given by

 $$T_{\{0,1\}} = \inf\left\{t: \max_{t-w \le k < t} \ell(t,k,\mu^*) > b\right\}, \qquad (20)$$

where $b$ is the prescribed threshold and $w$ is the window length. Note that the log-GLR statistic essentially computes, for each entry, the sum of its observed values within the time window, and then combines the squared sums, each normalized by the number of times that entry is observed.
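A minimal sketch of this windowed 0-1 statistic is given below (the function name is hypothetical; entries never observed within the window are taken to contribute zero, as the pseudo-inverse convention dictates):

```python
import numpy as np

def log_glr_01(X_window, mask_window):
    """Windowed log-GLR for 0-1 (subsampling) sketches, maximized over
    the assumed change-point.
    X_window: (w, N) raw signal values; mask_window: (w, N) 0/1 indicators."""
    w, N = X_window.shape
    best = -np.inf
    for back in range(1, w + 1):                    # back = t - k
        obs = mask_window[-back:]                   # which entries were seen
        sums = (X_window[-back:] * obs).sum(axis=0) # per-entry observed sums
        counts = obs.sum(axis=0)
        seen = counts > 0                           # skip never-observed entries
        stat = 0.5 * np.sum(sums[seen] ** 2 / counts[seen])
        best = max(best, stat)
    return best
```

For instance, with a window of two all-ones observations that are fully observed in three dimensions, the statistic is maximized at the full window, giving $0.5 \cdot 3 \cdot 2^2/2 = 3$.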

## Iv Performance analysis

In this section, we present theoretical approximations to two performance metrics, the average-run-length (ARL), which captures the false-alarm-rate, and the expected detection delay (EDD), which captures the power of the detection statistic.

### Iv-a Performance metrics

We first introduce some necessary notation. Under the null hypothesis in (1), the observations are zero-mean. Denote the probability and expectation in this case by $\mathbb{P}_\infty$ and $\mathbb{E}_\infty$, respectively. Under the alternative hypothesis, there exists a change-point $\kappa$ such that the observations have mean $\mu$ for all $t > \kappa$. Probability and expectation in this case are denoted by $\mathbb{P}_\kappa$ and $\mathbb{E}_\kappa$, respectively.

The choice of the threshold $b$ involves a trade-off between two standard performance metrics that are commonly used for analyzing change-point detection procedures [32]: (i) the ARL, defined to be the expected value of the stopping time when there is no change, $\mathbb{E}_\infty\{T\}$; and (ii) the EDD, defined to be the expected stopping time in the extreme case where a change occurs immediately at $\kappa = 0$, which is denoted as $\mathbb{E}_0\{T\}$.

The following argument from [33] explains why we consider $\mathbb{E}_0\{T\}$. When there is a change at $\kappa$, we are interested in the expected delay until its detection, i.e., the conditional expectation $\mathbb{E}_\kappa\{T - \kappa \mid T > \kappa\}$, which is a function of $\kappa$. When the shift in the mean occurs only in the positive direction, it can be shown that the worst-case delay over $\kappa$ is attained at $\kappa = 0$. It is not obvious that this remains true when the mean shift can be either positive or negative. However, since $\mathbb{E}_0\{T\}$ is certainly of interest and reasonably easy to analyze, it is common to consider $\mathbb{E}_0\{T\}$ in the literature, and we also adopt it as a surrogate.

### Iv-B Fixed projection

Define a special function (cf. [34], page 82)

 $$\nu(u) = 2u^{-2}\exp\left[-2\sum_{i=1}^{\infty} i^{-1}\,\Phi\!\left(-|u|\,i^{1/2}/2\right)\right],$$

where $\Phi$ denotes the cumulative distribution function (CDF) of the standard Gaussian with zero mean and unit variance. For numerical purposes, a simple and accurate approximation is given by (cf. [35])

 $$\nu(u) \approx \frac{(2/u)\left[\Phi(u/2) - 0.5\right]}{(u/2)\,\Phi(u/2) + \phi(u/2)},$$

where $\phi$ denotes the probability density function (PDF) of the standard Gaussian. We obtain an approximation to the ARL of the sketching procedure with a fixed projection as follows:
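For numerical work, $\nu(u)$ can be evaluated directly from the approximation above using only the standard Gaussian CDF and PDF; a minimal implementation:

```python
import math

def Phi(x):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard Gaussian PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def nu(u):
    """Numerical approximation to the special function nu(u) (cf. [35])."""
    return (2.0 / u) * (Phi(u / 2.0) - 0.5) / ((u / 2.0) * Phi(u / 2.0) + phi(u / 2.0))
```

The function satisfies $\nu(u) \to 1$ as $u \to 0^+$ and decreases toward zero as $u$ grows, which provides a quick sanity check of the implementation.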

###### Theorem 1 (ARL, fixed projection).

Assume that $b \to \infty$ with $M$ fixed, and that $w$ grows proportionally with $b$ (with a positive integer proportionality constant). Then the ARL of the sketching procedure defined in (10) is given by

 $$\mathbb{E}_\infty\{T\} = \frac{2\sqrt{\pi}}{c(M,b,w)} \cdot \frac{1}{\sqrt{M}} \cdot \frac{1}{1-\frac{M}{2b}} \left(\frac{M}{2}\right)^{M/2} b^{-M/2}\, e^{\,b - M/2} + o(1), \qquad (21)$$

where

 $$c(M,b,w) = \int_{\sqrt{\frac{2b}{w}}\left(1-\frac{M}{2b}\right)}^{\sqrt{2b}\left(1-\frac{M}{2b}\right)} u\,\nu^2(u)\,du. \qquad (22)$$

This theorem gives an explicit expression for the ARL as a function of the threshold $b$, the dimension of the sketches $M$, and the window length $w$. As we will show below, the approximation to the ARL given by Theorem 1 is highly accurate. At a higher level, this theorem characterizes the mean of the stopping time when the detection statistic is driven by noise. The requirement relating $w$ to $b$ comes from [32], on which our results are based; it ensures the correct scaling when we pass to the limit, essentially requiring that the window length be large enough as the threshold increases. In practice, $w$ has to be large enough so that it does not cause a missed detection: $w$ has to be longer than the anticipated expected detection delay, as explained in [32].

Moreover, we obtain an approximation to the EDD of the sketching procedure with a fixed projection as follows. Define

 $$\Delta = \|V^\intercal \mu\|. \qquad (23)$$

Let $\tilde{S}_t$ be a random walk whose increments are independent and identically distributed with mean $\Delta^2/2$ and variance $\Delta^2$. Since the random walk has a positive drift, its minimum value is bounded, and its expected value is given by [35]

 $$\mathbb{E}\left\{\min_{t \ge 0} \tilde{S}_t\right\} = -\sum_{i=1}^{\infty} i^{-1}\,\mathbb{E}\{\tilde{S}_i^-\},$$

where $\tilde{S}_i^- \triangleq \max\{0, -\tilde{S}_i\}$, and the infinite series converges and can easily be evaluated numerically. Define a stopping time

 $$\tau = \min\{t: \tilde{S}_t > 0\}.$$

Let $\rho(\Delta)$ denote the associated overshoot-related constant. It can be shown that [32]

 $$\rho(\Delta) = \Delta^2/4 + 1 - \sum_{i=1}^{\infty} i^{-1}\,\mathbb{E}\{\tilde{S}_i^-\}.$$
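Since each $\tilde{S}_i$ is Gaussian, $\mathbb{E}\{\tilde{S}_i^-\}$ has a closed form, so the series above can be evaluated by truncation. The sketch below assumes (as stated above) increments with mean $\Delta^2/2$ and variance $\Delta^2$, so that $\tilde{S}_i \sim \mathcal{N}(i\Delta^2/2,\, i\Delta^2)$:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def neg_part_mean(m, sd):
    """E{X^-} for X ~ N(m, sd^2): E{max(-X, 0)} = sd*phi(m/sd) - m*Phi(-m/sd)."""
    return sd * phi(m / sd) - m * Phi(-m / sd)

def expected_min(delta, terms=200):
    """Truncated series for E{min_t S~_t} = -sum_i i^{-1} E{S~_i^-}."""
    total = 0.0
    for i in range(1, terms + 1):
        m = i * delta**2 / 2.0         # mean of S~_i
        sd = delta * math.sqrt(i)      # std of S~_i
        total += neg_part_mean(m, sd) / i
    return -total
```

The result is always negative, and it approaches zero as the drift $\Delta^2/2$ grows, since the walk then rarely dips below zero.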
###### Theorem 2 (EDD, fixed projection).

Suppose $b \to \infty$ with other parameters held fixed. Then, for a given matrix $A$ with right singular vectors $V$, the EDD of the sketching procedure (10) when $\kappa = 0$ is given by

 $$\mathbb{E}_0\{T\} = \frac{b + \rho(\Delta) - M/2 - \mathbb{E}\{\min_{t \ge 0}\tilde{S}_t\} + o(1)}{\Delta^2/2}. \qquad (24)$$

The theorem gives an explicit expression for the EDD as a function of the threshold $b$, the number of sketches $M$, and the signal strength captured through $\Delta$, which depends on the post-change mean vector $\mu$ and the projection subspace $V$.

The proofs of the above two theorems utilize the equivalent form of the procedure in (13) and draw a connection between the sketching procedure and the so-called mixture procedure (cf. [32]) in the case where all $M$ sensors are affected by the change and the post-change mean vector is given by $V^\intercal \mu$.

#### Iv-B1 Accuracy of theoretical approximations

Consider an $A$ generated as a Gaussian random matrix with i.i.d. zero-mean Gaussian entries. Using the expression in Theorem 1, we can find the threshold $b$ such that the corresponding ARL equals 5000. This can be done conveniently: since the ARL is an increasing function of the threshold $b$, we can use bisection to find such a threshold. We then compare it with a threshold found from simulation.

As shown in Table I, the threshold found using Theorem 1 is very close to that obtained from simulation. Therefore, even though the theoretical ARL approximation is derived as $b$ tends to infinity, it remains applicable when $b$ is large but finite. Theorem 1 is quite useful in determining a threshold for a targeted ARL, as simulations for large $M$ and $N$ can be quite time-consuming, especially for a large ARL (e.g., 5000 or 10,000).
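The bisection step can be sketched as follows. The ARL model used here is only a placeholder with the qualitative exponential growth in $b$ (the exact expression is given by Theorem 1 together with (22)); in practice one would substitute the full formula:

```python
import math

def find_threshold(arl, target, lo=1.0, hi=200.0, tol=1e-6):
    """Bisection for b with arl(b) == target, exploiting that the ARL
    is an increasing function of the threshold b."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if arl(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Placeholder ARL model (illustrative only): exponential growth in b,
# as in Theorem 1; replace with the full expression for real use.
M = 20
arl_model = lambda b: math.exp(b - M / 2) / max(b, 1.0)

b = find_threshold(arl_model, target=5000.0)
```

Because the model is strictly increasing on the bracket, the bisection converges to the unique threshold achieving the target ARL.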

Moreover, we simulate the EDD for detecting a signal whose post-change mean vector has all entries equal to a constant value. As also shown in Table I, the approximation for the EDD using Theorem 2 is quite accurate.

We have also verified that the theoretical approximations are accurate for expander graphs; the details are omitted here since they are similar.

#### Iv-B2 Consequence

Theorem 1 and Theorem 2 have the following consequence:

###### Remark 1.

For a fixed large ARL, as $M$ increases, the ratio $M/b$ remains bounded. This property is quite useful for establishing the results in Section V. It is demonstrated numerically in Fig. 1 for a fixed ARL of 5000: the corresponding threshold $b$ is found using Theorem 1 as $M$ increases from 10 to 100. More precisely, Theorem 1 leads to the following corollary:

###### Corollary 1.

Assume a large constant . Let . For any large enough , the threshold such that the corresponding ARL satisfies . In other words, .

Note that this quantity is exponentially large in the threshold; hence the ARL can be very large, yet it is still bounded above (this means that the corollary holds in a non-asymptotic regime).

###### Remark 2.

As $b \to \infty$, the first-order approximation to the EDD in Theorem 2 is given by $b/(\Delta^2/2)$, i.e., the threshold divided by the Kullback-Leibler (K-L) divergence (see, e.g., [32], which shows that $\Delta^2/2$ is the K-L divergence between $\mathcal{N}(V^\intercal\mu, I_M)$ and $\mathcal{N}(0, I_M)$). This is consistent with our intuition, since the expected increment of the detection statistic is roughly the K-L divergence of the test. For finite $b$, especially when the signal strength is weak and the number of sketches is not large enough, the terms other than $b$ will play a significant role in determining the EDD.

### Iv-C Time-varying 0-1 random projection matrices

Below, we obtain approximations to the ARL and EDD of $T_{\{0,1\}}$, i.e., when 0-1 sketching matrices are used. We assume a fixed sketch dimension $M_t = M$. We also assume that at each time $t$, we randomly select $M$ out of the $N$ dimensions to observe. Hence, at each time, each signal dimension has a probability

 $$r = M/N \in (0,1)$$

of being observed. The sampling scheme is illustrated in Fig. 2, where the number of dots in each column is $M$, over 17 consecutive time periods.

For such a sampling scheme, we have the following result:

###### Theorem 3 (ARL, time-varying 0-1 random projection).

When $b \to \infty$, for the procedure $T_{\{0,1\}}$ defined in (20) with a suitably defined threshold parameter $b'$, we have that

 $$\mathbb{E}_\infty\{T_{\{0,1\}}\} = \frac{2\sqrt{\pi}}{c(N,b',w)} \cdot \frac{1}{\sqrt{N}} \cdot \frac{1}{1-\frac{N}{2b'}} \left(\frac{N}{2}\right)^{N/2} (b')^{-N/2}\, e^{\,b' - N/2} + o(1), \qquad (25)$$

where $c(N,b',w)$ is defined by replacing $(M,b)$ with $(N,b')$ in (22).

Moreover, we can obtain an approximation to the EDD of $T_{\{0,1\}}$, as justified by the following arguments. First, relax the deterministic constraint that at each time we observe exactly $M$ out of $N$ entries. Instead, assume a random sampling scheme in which, at each time, each entry of $x_t$ is observed independently with probability $r$. Consider i.i.d. Bernoulli random variables $\xi_{n,i}$ with parameter $r$, for $n = 1, \ldots, N$ and $i = 1, 2, \ldots$. Define

 $$Z_{n,k,t} \triangleq \frac{\sum_{i=k+1}^{t} [x_i]_n\, \xi_{n,i}}{\sqrt{(t-k)\,r}}.$$

Based on this, we define a procedure whose behavior is arguably similar to :

 $$T'_{\{0,1\}} = \inf\left\{t \ge 1: \max_{t-w \le k < t} \frac{1}{2}\sum_{n=1}^{N} Z_{n,k,t}^2 > b\right\},$$

where $b$ is the prescribed threshold. Then, using the arguments in Appendix B, we can show that the approximation to the EDD of this procedure is given by

 $$\mathbb{E}_0\{T'_{\{0,1\}}\} = \left(\frac{2b - N}{\sum_{n=1}^{N}\mu_n^2} + o(1)\right) \cdot \frac{N}{M}, \qquad (26)$$

and we use this to approximate the EDD of .

Table II shows the accuracy of the approximations for the ARL in (25) and for the EDD in (26) for various values of $M$, with all entries of $\mu$ equal to a constant. The results show that the thresholds obtained using the theoretical ARL approximation and the EDD approximations are both very accurate.

## V Bounding relative performance

In this section, we characterize the relative performance of the sketching procedure compared to that without sketching (i.e., using the original log-GLR statistic). We show that the performance loss due to sketching can be small when the signal-to-noise ratio and $M$ are both sufficiently large. In the following, we focus on the fixed projection to illustrate this point.

### V-a Relative performance metric

We consider a relative performance measure, which is the ratio of the EDD using the original data (denoted as EDD($N$), which corresponds to $A = I_N$) to the EDD using the sketches (denoted as EDD($M$)):

 $$\frac{\mathrm{EDD}(N)}{\mathrm{EDD}(M)} \in (0, 1).$$

We will show that this ratio depends critically on the following quantity

 $$\Gamma \triangleq \frac{\|V^\intercal\mu\|^2}{\|\mu\|^2}, \qquad (27)$$

which is the ratio of the KL divergence after and before the sketching.

We start by deriving the relative performance measure using the theoretical approximations obtained in the last section. Recall the expression for the EDD approximation in (24). Define

 $$h(\Delta, M) = \rho(\Delta) - M/2 - \mathbb{E}\left\{\min_{t \ge 0}\tilde{S}_t\right\}. \qquad (28)$$

From Theorem 2, we obtain that the EDD of the sketching procedure is proportional to

 $$\frac{2b}{\|V^\intercal\mu\|^2} \cdot \left(1 + \frac{h(\|V^\intercal\mu\|, M)}{b}\right) \cdot (1 + o(1)).$$

Let $b_N$ and $b_M$ be the thresholds such that the corresponding ARLs are equal to a common target value, for the procedure without sketching and with $M$ sketches, respectively. Define $Q_N \triangleq N/b_N$, $Q_M \triangleq M/b_M$, and

 $$P = \frac{1 + h(\|\mu\|, N)/b_N}{1 + h(\|V^\intercal\mu\|, M)/b_M}. \qquad (29)$$

Using the definitions above, we have

 $$\frac{\mathrm{EDD}(N)}{\mathrm{EDD}(M)} = P \cdot \frac{b_N}{b_M} \cdot \frac{\|V^\intercal\mu\|^2}{\|\mu\|^2}\,(1 + o(1)) = P \cdot \frac{N}{M} \cdot \frac{Q_M}{Q_N} \cdot \Gamma\,(1 + o(1)). \qquad (30)$$

### V-B Discussion of factors in (30)

We can show that $P \ge 1$ for sufficiently large $M$ and large signal strength. This can be verified numerically, since all the quantities that $P$ depends on can be computed explicitly: the thresholds $b_N$ and $b_M$ can be found from Theorem 1 once we set a target ARL, and the function $h$ can be evaluated using (28), which depends explicitly on $\Delta$ and $M$. Fig. 3 shows the value of $P$ when all the entries of the post-change mean vector are equal to a constant value that varies along the $x$-axis. Note that $P$ is less than one only when the signal strength is small and $M$ is small. Thus, we have

 EDD(N)EDD(M)≥NM⋅QMQN⋅Γ(1+o(1)),

for sufficiently large and signal strength .

Using Corollary 1, the thresholds $b_N$ and $b_M$ (and hence $Q_N$ and $Q_M$) can be evaluated explicitly; this yields a lower bound on the ratio in terms of $\Gamma$, for large $M$ or large signal strength.

Next, we bound $\Gamma$ when $A$ is a Gaussian matrix and when $A$ is an expander graph, respectively.

### V-C Bounding Γ

#### V-C1 Gaussian matrix

Consider $A$ whose entries are i.i.d. Gaussian with zero mean and equal variance (the choice of variance does not affect $\Gamma$). First, we have the following lemma.

###### Lemma 1 ([36]).

Let $A \in \mathbb{R}^{M \times N}$ have i.i.d. Gaussian entries. Then for any fixed vector $\mu$, we have that

$$\Gamma \sim \operatorname{Beta}\left(\frac{M}{2}, \frac{N-M}{2}\right). \tag{31}$$

More related results can be found in [37]. Since this Beta distribution has mean $M/N$, we have that

$$\mathbb{E}\{\Gamma\} = \frac{M/2}{M/2 + (N-M)/2} = \frac{M}{N}.$$

We may also show that, provided $M$ and $N$ grow proportionally, $\Gamma$ converges to its mean value at a rate exponential in $N$. Define $\delta$ to be

$$\delta \triangleq \lim_{N\to\infty} \frac{M}{N}. \tag{32}$$

We have the following result.

###### Theorem 4 (Gaussian A).

Let $A$ have i.i.d. zero-mean Gaussian entries. Let $M, N \to \infty$ such that (32) holds. Then for any $\epsilon > 0$, we have that

$$\mathbb{P}\{\delta - \epsilon < \Gamma < \delta + \epsilon\} \to 1, \tag{33}$$

at a rate exponential in $N$. Hence, for Gaussian $A$, $\Gamma$ is approximately $M/N$ with high probability.
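Lemma 1 and Theorem 4 are easy to probe by simulation: drawing many Gaussian matrices and computing $\Gamma$ for a fixed direction $\mu$ should produce samples averaging close to $M/N$. The sizes and trial count below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, trials = 60, 12, 400
mu = rng.standard_normal(N)            # any fixed direction works for Lemma 1
mu /= np.linalg.norm(mu)

samples = []
for _ in range(trials):
    A = rng.standard_normal((M, N))    # the variance scaling does not affect Gamma
    V, _ = np.linalg.qr(A.T)           # orthonormal basis of the row space of A
    samples.append(np.linalg.norm(V.T @ mu) ** 2)
samples = np.array(samples)

# Lemma 1 says Gamma ~ Beta(M/2, (N-M)/2), whose mean is M/N = 0.2 here.
print(round(samples.mean(), 3))
```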

#### V-C2 Sparse sketching matrix A

We can show that for certain sparse 0-1 matrices (in particular, expander graphs), $\Gamma$ is also bounded. This holds for "one-sided" changes, i.e., when the post-change mean vector is element-wise positive. Such a scenario is encountered in environmental monitoring (see, e.g., [32, 38]). These sparse sketching matrices enable efficient sketching schemes, since each entry of the sketched vector only requires collecting information from a few dimensions of the original data vector.

Assume $\mu_i \geq 0$ for all $i$. Let $A$ consist of binary entries, which corresponds to a bipartite graph with $N$ variable (left) nodes and $M$ parity-check (right) nodes. We further consider a bipartite graph with regular left degree $d$ (i.e., the number of edges from each variable node is $d$) and regular right degree $Nd/M$ (i.e., the number of edges from each parity-check node is $Nd/M$), as illustrated in Fig. 4. Hence, this requires $M$ to divide $Nd$.

Expander graphs satisfy the above requirements, and they have been used in compressed sensing to sense sparse vectors (e.g., [39]). In particular, a matrix $A$ corresponds to a $(s, \epsilon)$-expander graph with regular left degree $d$ if and only if each column of $A$ has exactly $d$ "1"s, and for any set $S$ of variable nodes with $|S| \leq s$, the set of its neighbors has size at least $(1-\epsilon)d|S|$. If it further holds that each row of $A$ has $Nd/M$ "1"s, we say $A$ corresponds to a $(s, \epsilon)$-expander with regular left degree $d$ and regular right degree $Nd/M$. The existence of such expander graphs is established in [40]:

###### Lemma 2 ([40]).

For any fixed ratio $M/N$ and any fixed $\epsilon$, when $N$ is sufficiently large, there always exists such an expander with a regular left degree $d$ and a regular right degree $Nd/M$, for some constant $d$.

###### Theorem 5 (Expander A).

If $A$ corresponds to a $(s, \epsilon)$-expander with regular left degree $d$ and regular right degree $Nd/M$, then for any nonnegative vector $\mu$, we have that

$$\Gamma \geq \frac{M(1-\epsilon)}{dN}.$$

Hence, for expander graphs, $\Gamma$ is approximately greater than $M/(dN)$, since $\epsilon$ is a small number.
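As a minimal illustration of the 0-1 case, consider the simplest regular sketching matrix rather than a full expander: a disjoint "pooling" matrix with left degree $d = 1$, where each row sums one block of $N/M$ coordinates. For nonnegative $\mu$, $(\sum_i x_i)^2 \geq \sum_i x_i^2$ within each block, which gives $\Gamma \geq M/N$ directly; the code below checks this. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 100, 20                       # illustrative sizes; block size N/M = 5

# Disjoint pooling 0-1 matrix: left degree 1, right degree N/M.
A = np.zeros((M, N))
for j in range(M):
    A[j, j * (N // M):(j + 1) * (N // M)] = 1.0

mu = rng.random(N)                   # element-wise positive ("one-sided" change)

V, _ = np.linalg.qr(A.T)             # orthonormal basis of the row space of A
gamma = np.linalg.norm(V.T @ mu) ** 2 / np.linalg.norm(mu) ** 2
print(gamma >= M / N)                # True for any nonnegative mu
```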

### V-D Consequence

Combining the results above, we have shown that in the regime where $M$ and the signal strength are both sufficiently large, the performance loss due to sketching can be small (as indeed observed in our numerical examples). In this regime, when $A$ is a Gaussian random matrix, the relative performance measure is a constant, under the conditions in Corollary 1. Moreover, when $A$ is a sparse 0-1 matrix with regular left degree $d$ (in particular, an expander graph), the ratio (30) is lower bounded by $(1-\epsilon)/d$ for some small number $\epsilon$, when Corollary 1 holds.

There is an intuitive explanation. Unlike in compressed sensing, where the goal is to recover a sparse signal and one needs the projection to preserve the norm of the signal up to a constant factor through the restricted isometry property (RIP) [41], our goal is to detect a non-zero vector in Gaussian noise, which is a much simpler task than compressed sensing. Hence, detection incurs little loss even though the projection reduces the norm of the vector, as long as the projection does not diminish the signal norm below the noise floor.

On the other hand, when the signal is weak and $M$ is not large enough, there can be a significant performance loss (as indeed observed in our numerical examples), and we cannot lower bound the relative performance measure. Fortunately, in this regime, we can use our theoretical results in Theorem 1 and Theorem 2 to design the number of sketches $M$ for an anticipated worst-case signal strength, or to determine the infeasibility of the problem, i.e., that it is better not to use sketching since the signal is too weak.

## VI Numerical examples

In this section, we present numerical examples to demonstrate the performance of the sketching procedure. We focus on comparing the sketching procedure with the GLR procedure without sketching (obtained by letting $M = N$ in the sketching procedure). We also compare the sketching procedures with a standard multivariate CUSUM applied to the sketches.

In the subsequent examples, we select the ARL to be 5000 to represent a low false detection rate (a similar choice has been made in other sequential change-point detection work such as [32]). In practice, however, the target ARL value depends on how frequently we can tolerate false detections (e.g., once a month or once a year). Below, EDD$_o$ denotes the EDD when $M=N$ (i.e., no sketching is used). All simulated results are obtained from repeated Monte Carlo trials. We also consider the minimum number of sketches

$$M_{\min}: \quad \operatorname{EDD}(M_{\min}) \leq \delta + \operatorname{EDD}_o, \tag{34}$$

such that the corresponding sketching procedure is only $\delta$ samples slower than the full procedure. Below, we focus on a small delay loss $\delta$.

### VI-A Fixed projection, Gaussian random matrix

First consider a Gaussian $A$ with a fixed ambient dimension $N$ and different numbers of sketches $M$.

#### VI-A1 EDD versus signal magnitude

Assume the post-change mean vector has entries with equal magnitude, $\mu_i = \mu_0$ for all $i$, to simplify our discussion. Fig. 5(a) shows EDD versus an increasing signal magnitude $\mu_0$. Note that when $\mu_0$ and $M$ are sufficiently large, the sketching procedure can approach the performance of the procedure using the full data, as predicted by our theory. When the signal is weak, we have to use a much larger $M$ to prevent a significant performance loss (and when the signal is too weak we cannot use sketching). Table III shows $M_{\min}$ for each signal strength; we find that when $\mu_0$ is sufficiently large, we may use an $M_{\min}$ much smaller than $N$ and still suffer little performance loss. Note that here we do not require signals to be sparse.

#### VI-A2 EDD versus signal sparsity

Now assume that the post-change mean vector is sparse, with only a fraction of entries being 1 and the other entries being 0. Fig. 5(b) shows EDD versus an increasing number of non-zero entries. Note that as the number of non-zero entries increases, the signal strength also increases; thus the sketching procedure approaches the performance using the full data. Similarly, the required $M_{\min}$ is listed in Table IV. We find that when the signal is strong enough, one can use an $M_{\min}$ much smaller than $N$ with little performance loss.

### VI-B Fixed projection, expander graph

Now assume $A$ is an expander graph with the same fixed $N$ and different numbers of sketches $M$. We run the simulations with the same settings as those in Section VI-A.

#### VI-B1 EDD versus signal magnitude

Assume the post-change mean vector has equal entries $\mu_i = \mu_0$. Fig. 6(a) shows EDD with an increasing $\mu_0$. Note that the simulated EDDs are smaller than those for the Gaussian random projections in Fig. 5. A possible reason is that the expander graph is better at aggregating the signals when the $\mu_i$ are all positive. However, when the $\mu_i$ can be either positive or negative, the two choices of $A$ have similar performance, as shown in Fig. 7, where the $\mu_i$ are drawn i.i.d. uniformly from a symmetric interval.

#### VI-B2 EDD versus signal sparsity

Assume that the post-change mean vector has only a few entries being one and the other entries being zero. Fig. 6(b) shows the simulated EDD versus an increasing number of non-zero entries. As the number of non-zero entries grows, the sketching procedure approaches the performance using the full data.

### VI-C Time-varying projections with 0-1 matrices

To demonstrate the performance of the procedure (20) using time-varying projections with 0-1 entries, we again consider two cases: a post-change mean vector with equal entries, and a post-change mean vector with a few entries being one and the other entries being zero. The simulated EDDs are shown in Fig. 8. Note that the procedure can detect the change quickly with a small subset of observations. Although the EDDs are larger than those for the fixed projections in Fig. 5 and Fig. 6, this example shows that projection with 0-1 entries can have little performance loss in some cases, and it is still a viable candidate since such a projection means a simpler measurement scheme.

### VI-D Comparison with multivariate CUSUM

We compare our sketching method with a benchmark adapted from the conventional multivariate CUSUM procedure [42] applied to the sketches. A main difference is that the multivariate CUSUM requires a prescribed post-change mean vector (set to be an all-one vector in our example), rather than estimating it as the GLR statistic does. Hence, its performance may be affected by parameter misspecification. We compare the performance again in two settings: when all entries of the post-change mean vector are equal to a constant, and when the entries are positive valued. As shown in Fig. 9, the log-GLR based sketching procedure performs much better than the multivariate CUSUM.
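For concreteness, the benchmark idea can be sketched as a CUSUM recursion on whitened sketches with a prescribed post-change mean; this is a simplified reduction standing in for the multivariate CUSUM of [42], and all dimensions, thresholds, and the assumed mean below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
M, T, change = 20, 200, 100          # sketches, horizon, true change-point
delta = 0.5 * np.ones(M)             # prescribed post-change mean (assumed known)
b = 10.0                             # illustrative detection threshold

W, detect = 0.0, None
for t in range(T):
    mean = delta if t >= change else np.zeros(M)
    y = mean + rng.standard_normal(M)     # whitened sketch at time t
    # CUSUM recursion with the log-likelihood-ratio increment for a known shift.
    W = max(0.0, W + delta @ y - delta @ delta / 2.0)
    if W > b:
        detect = t
        break

print(detect)                        # fires shortly after the change-point
```

If the prescribed `delta` is far from the true shift, the drift of the increment shrinks and the delay grows, which is exactly the misspecification effect discussed above.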

## VII Examples for real applications

### VII-A Solar flare detection

We use our method to detect a solar flare in a video sequence from the Solar Data Observatory (SDO). (The video can be found at http://nislab.ee.duke.edu/MOUSSE/; the Solar Object Locator for the original data is SOL2011-04-30T21-45-49L061C108.) Each frame consists of a large number of pixels, which results in a high ambient dimension $N$. In this example, the normal frames are slowly drifting background sun surfaces, and the anomaly is a much brighter transient solar flare that emerges partway through the sequence. Fig. 10(a) is a snapshot of the original SDO data before the solar flare emerges, and Fig. 10(b) is a snapshot when the solar flare emerges as a brighter curve in the middle of the image. We preprocess the data by tracking and removing the slowly changing background with the MOUSSE algorithm [43] to obtain tracking residuals. The Gaussianity of the residuals, which correspond to our observations $x_t$, is verified by the Kolmogorov-Smirnov test: the p-value is large, which indicates that Gaussianity is a reasonable assumption.
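The Gaussianity check can be reproduced schematically with a one-sample Kolmogorov-Smirnov statistic; synthetic standard-normal residuals stand in for the MOUSSE residuals, which are not included here.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
residual = rng.standard_normal(500)   # stand-in for one frame's residuals

# One-sample KS statistic against the standard normal CDF.
x = np.sort(residual)
cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])
n = len(x)
ks_stat = max(np.max(np.arange(1, n + 1) / n - cdf),
              np.max(cdf - np.arange(0, n) / n))

# Rough 5%-level critical value for large n is 1.36 / sqrt(n).
critical = 1.36 / np.sqrt(n)
print(round(ks_stat, 3), round(critical, 3))
```

A statistic below the critical value means Gaussianity is not rejected, matching the large p-values reported for the real residuals.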

We apply the sketching procedure with fixed projection to the MOUSSE residuals, choosing the sketching matrix $A$ to be an $M$-by-$N$ Gaussian random matrix with i.i.d. entries. Note that the signal is deterministic in this case. To evaluate our method, we run the procedure repeatedly, each time using a different random Gaussian matrix as the fixed projection $A$. Fig. 11 shows the error bars of the EDDs from these runs. As $M$ increases, both the means and the standard deviations of the EDDs decrease. When $M$ is sufficiently large, the EDD is small, which means that our sketching detection procedure can reliably detect the solar flare with a number of sketches far smaller than the ambient dimension, a significant dimensionality reduction.

### VII-B Change-point detection for power systems

Finally, we present a synthetic example based on a real power network topology. We consider the Western States Power Grid of the United States, shown in Fig. 12. (The topology of the power network can be downloaded at http://networkdata.ics.uci.edu/data/power/ [44].) The nodes represent generators, transformers, and substations, and the edges represent high-voltage transmission lines between them [44]. Note that the graph is sparse and that there are many "communities" which correspond to densely connected subnetworks.

In this example, we simulate a power failure over this large network. Assume that at each time we may observe the real power injection at each edge. When the power system is in a steady state, the observation is the true state plus Gaussian observation noise [45]. We may estimate the true state (e.g., using techniques in [45]), subtract it from the observation vector, and treat the residual vector as our signal $x_t$, which can be assumed to be i.i.d. standard Gaussian. When a failure happens in a power system, there will be a shift in the mean for a small number of affected edges, since in practice, when there is a power failure, usually only a small part of the network is affected.

To perform sketching, at each time we randomly choose $M$ nodes in the network and measure, at each chosen node, the sum of the quantities over all attached edges, as shown in Fig. 13. This corresponds to time-varying projections $A_t$ with 0-1 entries and various $M$. Note that in this case our projection matrix is a 0-1 matrix whose structure is constrained by the network topology. Our example is a simplified model for power networks and aims to shed some light on the potential of our method for monitoring real power networks.
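The topology-constrained sketching scheme can be sketched as follows; a small random graph stands in for the power grid, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_nodes, n_edges, M = 30, 80, 10     # toy graph and number of sketched nodes

# Random edges standing in for transmission lines.
edges = [tuple(rng.choice(n_nodes, size=2, replace=False)) for _ in range(n_edges)]

# Each sketch sums the edge signal over all edges attached to a chosen node,
# so the projection matrix is 0-1 and constrained by the network topology.
chosen = rng.choice(n_nodes, size=M, replace=False)
A = np.zeros((M, n_edges))
for i, node in enumerate(chosen):
    for j, (u, v) in enumerate(edges):
        if node == u or node == v:
            A[i, j] = 1.0

x = rng.standard_normal(n_edges)     # residual signal on the edges
y = A @ x                            # M sketches observed at one time step
print(A.shape, y.shape)
```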

In the following experiment, we assume that on average a small fraction of the edges in the network experience a mean shift. We set the threshold to achieve the target ARL. Fig. 14 shows the simulated EDD versus an increasing signal strength. Note that the EDD from using a small number of sketches is quite small if the signal strength is sufficiently large; in that regime, one may detect the change, with the EDD increased by only about one sample, while observing only a small fraction of the network, which is a significant dimensionality reduction.