
# First-order optimal sequential subspace change-point detection

###### Abstract

We consider the sequential change-point detection problem of detecting changes that are characterized by a subspace structure. Such changes are frequent in high-dimensional streaming data altering the form of the corresponding covariance matrix. In this work we present a Subspace-CUSUM procedure and demonstrate its first-order asymptotic optimality properties for the case where the subspace structure is unknown and needs to be simultaneously estimated. To achieve this goal we develop a suitable analytical methodology that includes a proper parameter optimization for the proposed detection scheme. Numerical simulations corroborate our theoretical findings.


Liyan Xie (work supported by the US National Science Foundation under Grants CCF-1650913, CMMI-1538746, and CCF-1442635)                George V. Moustakides (work supported by the US National Science Foundation under Grant CIF-1513373, through Rutgers University)                Yao Xie
Georgia Institute of Technology, School of Industrial and Systems Engineering, Atlanta, GA, USA.
Rutgers University, Department of Computer Science, New Brunswick, NJ, USA.

## 1 Introduction

Detecting changes in the distribution of high-dimensional streaming data is a fundamental problem in various applications such as swarm behavior monitoring [1], sensor networks, and seismic event detection. In various scenarios, the change can be represented as a linear subspace which is captured through a change in the covariance structure.

Given a sequence of samples $x_1, x_2, \ldots$, where $x_t \in \mathbb{R}^k$ and $k$ is the signal dimension, there may be a change-point time $\tau$ at which the distribution of the data stream changes. Our goal is to detect this change as quickly as possible using on-line techniques. We are particularly interested in structured changes that occur in the signal covariance. We study two related settings, the emerging subspace: meaning that the change is a subspace emerging from a noisy background, and the switching subspace: meaning that the change is a switch in the direction of the subspace. The emerging subspace problem can arise in coherent weak-signal detection with seismic sensor arrays, and switching subspace detection can be used for principal component analysis of streaming data. In these settings, the change can be shown to be equivalent to a low-rank component added to the original covariance matrix.

Classical approaches to covariance change detection usually consider generic settings without assuming any structure. The CUSUM statistic can be derived if the pre-change and post-change distributions are known. For the multivariate case, the Hotelling control chart is the traditional way to detect covariance changes. The determinant of the sample covariance matrix was also used in [2] to detect a change in the determinant of the covariance matrix. A multivariate CUSUM based on likelihood functions of multivariate Gaussians is studied in [3], but it only considers a covariance change from $\Sigma$ to $c\Sigma$ for a constant $c$. Offline detection of a change in the covariance matrix is studied in [4] using the Schwarz information criterion [5], where the change-point location must satisfy certain regularity conditions to ensure the existence of the maximum likelihood estimator. Recently, [6] studied a hypothesis test, based on likelihood ratios, for detecting a shift in an off-diagonal sub-matrix planted in the covariance matrix.

In this paper, we propose the Subspace-CUSUM procedure by combining the CUSUM statistic with subspace estimation and proper parameter optimization. We prove that the resulting detector is first-order asymptotically optimal in the sense that the ratio of its expected detection delay to that of the optimum CUSUM (which has complete knowledge of the pre- and post-change statistics) tends to 1 as the average run length tends to infinity.

The rest of this paper is organized as follows. Section 2 details the two problems of emerging and switching subspaces. Section 3 presents the Subspace-CUSUM procedure. Section 4 considers the asymptotic analysis of the proposed scheme, along with parameter optimization and the proof of first-order asymptotic optimality. Finally, in Section 5 we present simulation results that corroborate our theoretical findings.

## 2 Subspace Change-point detection

Both settings, emerging and switching subspace, can be shown to be related to the so-called spiked covariance matrix [7]. For simplicity, we consider the rank-one spiked covariance matrix problem, which is given by

$$\Sigma = \sigma^2 I_k + \theta\, uu^\top,$$

where $I_k$ denotes the identity matrix of size $k$; $\theta > 0$ is the signal strength; $u$, with $\|u\| = 1$, represents a basis for the subspace; and $\sigma^2$ is the noise power. We can define the Signal-to-Noise Ratio (SNR) as $\rho = \theta/\sigma^2$.
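As a quick sanity check of this model, the following sketch (all parameter values are illustrative) builds $\Sigma$ for a random unit-norm $u$ and confirms that the largest eigenvalue equals $\sigma^2(1+\rho)$ while the remaining $k-1$ eigenvalues equal $\sigma^2$:

```python
import numpy as np

# Sketch: build the rank-one spiked covariance Sigma = sigma^2 I_k + theta u u^T
# and inspect its spectrum. The values of k, sigma2, theta are illustrative.
rng = np.random.default_rng(0)
k, sigma2, theta = 10, 1.0, 2.0
u = rng.standard_normal(k)
u /= np.linalg.norm(u)                     # unit-norm subspace basis
Sigma = sigma2 * np.eye(k) + theta * np.outer(u, u)

rho = theta / sigma2                       # SNR
evals = np.linalg.eigvalsh(Sigma)          # eigenvalues in ascending order
# Top eigenvalue is sigma^2 (1 + rho); all others equal sigma^2.
print(evals[-1], sigma2 * (1 + rho))
```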

In the emerging subspace problem the sequentially observed data are as follows

$$x_t \sim \begin{cases} \mathcal{N}(0,\, \sigma^2 I_k), & t \le \tau, \\ \mathcal{N}(0,\, \sigma^2 I_k + \theta\, uu^\top), & t > \tau, \end{cases} \tag{1}$$

where $\tau$ is the unknown change-point that we would like to detect as soon as possible. We assume that the subspace $u$ is unknown since it represents an anomaly or new information.

In the switching subspace problem the data satisfy

$$x_t \sim \begin{cases} \mathcal{N}(0,\, \sigma^2 I_k + \theta\, u_1 u_1^\top), & t \le \tau, \\ \mathcal{N}(0,\, \sigma^2 I_k + \theta\, u_2 u_2^\top), & t > \tau, \end{cases} \tag{2}$$

where $u_1$ and $u_2$ are the pre- and post-change subspaces. We assume that $u_1$ is completely known since it describes the statistical behavior under nominal conditions, while $u_2$ is considered unknown since, as before, it expresses an anomaly.

The switching subspace problem (2) can be easily reduced to the emerging subspace problem (1). Indeed, if we select any matrix $Q \in \mathbb{R}^{(k-1)\times k}$ with orthonormal rows that satisfies

$$Q u_1 = 0, \qquad QQ^\top = I_{k-1},$$

and project the observed data onto the space that is orthogonal to $u_1$, namely $z_t = Q x_t$, then $z_t$ is a zero-mean random vector with covariance matrix $\sigma^2 I_{k-1}$ before the change and $\sigma^2 I_{k-1} + \tilde{\theta}\, \tilde{u}\tilde{u}^\top$ after the change, where $\tilde{u} = Q u_2/\|Q u_2\|$, and

$$\tilde{\theta} = \theta\, \|Q u_2\|^2 = \theta\left[1 - (u_1^\top u_2)^2\right].$$

The data in (2) under this transformation become

$$z_t \sim \begin{cases} \mathcal{N}(0,\, \sigma^2 I_{k-1}), & t \le \tau, \\ \mathcal{N}(0,\, \sigma^2 I_{k-1} + \tilde{\theta}\, \tilde{u}\tilde{u}^\top), & t > \tau, \end{cases} \tag{3}$$

which is the emerging subspace problem in (1). We need however to emphasize that by projecting the observations onto a lower dimensional space we lose information, suggesting that the two versions of the problem are not equivalent. In particular, the optimum detector for the transformed data in (3) and the one for the original data in (2) do not coincide. This can be easily verified by computing the corresponding CUSUM tests and their (optimum) performance. Despite this difference, by adopting the transformed version (3) together with the results we present next, we obtain a computationally simple method for solving the original problem (2). Therefore, from now on, our analysis will focus solely on observations following the model depicted in (1).
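The reduction above is straightforward to implement. The sketch below (with illustrative $k$, $\theta$ and randomly drawn unit vectors $u_1$, $u_2$) builds a valid $Q$ from a QR factorization and checks the identity for $\tilde{\theta}$:

```python
import numpy as np

# Sketch of the reduction (2) -> (1): build a (k-1) x k matrix Q whose rows
# form an orthonormal basis of the orthogonal complement of u1, then compute
# the reduced signal strength theta_tilde. Parameter values are illustrative.
rng = np.random.default_rng(1)
k, theta = 6, 1.5
u1 = rng.standard_normal(k); u1 /= np.linalg.norm(u1)
u2 = rng.standard_normal(k); u2 /= np.linalg.norm(u2)

# Full QR of [u1 | random columns]: the last k-1 columns span u1's complement.
Qfull, _ = np.linalg.qr(np.column_stack([u1, rng.standard_normal((k, k - 1))]))
Q = Qfull[:, 1:].T                          # rows orthonormal, Q u1 = 0

theta_tilde = theta * np.linalg.norm(Q @ u2) ** 2
# Identity from the text: theta ||Q u2||^2 = theta [1 - (u1^T u2)^2].
print(theta_tilde, theta * (1 - (u1 @ u2) ** 2))
```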

## 3 Subspace CUSUM

The CUSUM test [8, 9], when the observations are i.i.d. before and after the change, is known to be exactly optimum [10] in the sense that it solves a very well defined constrained optimization problem introduced in [11]. If $f_\infty$ and $f_0$ denote the pre- and post-change probability density functions (pdfs) of the observations, respectively, then the CUSUM statistic and the corresponding CUSUM stopping time are defined [10] as follows:

$$S_t = (S_{t-1})^+ + \log\frac{f_0(x_t)}{f_\infty(x_t)}, \qquad T_{\mathrm{C}} = \inf\{t > 0 : S_t \ge b\}, \tag{4}$$

where $x^+ = \max\{x, 0\}$ and $b > 0$ denotes a constant threshold. We must of course point out that application of CUSUM is only possible if we have exact knowledge of the pre- and post-change pdfs.

For the data model depicted in (1) the log-likelihood ratio takes the special form

$$\log\frac{f_0(x_t)}{f_\infty(x_t)} = \frac{1}{2\sigma^2}\frac{\rho}{1+\rho}\left\{(u^\top x_t)^2 - \sigma^2\left(1 + \frac{1}{\rho}\right)\log(1+\rho)\right\}.$$

The multiplicative factor $\frac{1}{2\sigma^2}\frac{\rho}{1+\rho}$ can be omitted since it only performs a constant scaling of the test statistic. We can therefore define the CUSUM test statistic using the following recursion:

$$S_t = (S_{t-1})^+ + (u^\top x_t)^2 - \sigma^2\left(1 + \frac{1}{\rho}\right)\log(1+\rho). \tag{5}$$

Using a simple argument based on Jensen’s inequality, we can claim that the increment in (5) has a negative average under the nominal measure and a positive average under the alternative. Due to this property, the CUSUM statistic oscillates near zero before the change, and increases with a linear trend after the change.
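As a concrete illustration of recursion (5) and the stopping rule in (4), here is a minimal sketch assuming the subspace $u$ is known; the data, threshold, and change-point below are all illustrative:

```python
import numpy as np

# Minimal sketch of the CUSUM recursion (5) with the subspace u known.
# Returns the first time the statistic crosses threshold b (or None).
def cusum_known_u(x, u, sigma2, rho, b):
    drift = sigma2 * (1 + 1 / rho) * np.log(1 + rho)   # sigma^2 (1 + 1/rho) log(1+rho)
    S = 0.0
    for t, xt in enumerate(x, start=1):
        S = max(S, 0.0) + (u @ xt) ** 2 - drift
        if S >= b:
            return t
    return None

# Illustrative demo: change-point at time 200.
rng = np.random.default_rng(2)
k, sigma2, theta = 8, 1.0, 4.0
u = np.zeros(k); u[0] = 1.0
pre = rng.standard_normal((200, k)) * np.sqrt(sigma2)
L = np.linalg.cholesky(sigma2 * np.eye(k) + theta * np.outer(u, u))
post = rng.standard_normal((800, k)) @ L.T
T = cusum_known_u(np.vstack([pre, post]), u, sigma2, theta / sigma2, b=50.0)
print(T)   # a stopping time shortly after the change-point 200
```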

Since in our case we assume that the vector $u$ is unknown, we propose the following alternative to (5), with $u$ replaced by an estimate $\hat{u}_t$:

$$S_t = (S_{t-1})^+ + (\hat{u}_t^\top x_t)^2 - d. \tag{6}$$

The quantity $d$ is a constant that we would like to select properly so that the increment of (6) mimics the main property of the increment of the CUSUM statistic (5), that is, it has a negative mean under the nominal and a positive mean under the alternative probability measure. This requires

$$\mathsf{E}_\infty[(\hat{u}_t^\top x_t)^2] < d < \mathsf{E}_0[(\hat{u}_t^\top x_t)^2]. \tag{7}$$

The proposed CUSUM-like stopping time is then defined as

$$T_{\mathrm{SC}} = \inf\{t > 0 : S_t \ge b\}. \tag{8}$$

To be able to apply (6) we need to specify $d$ and, of course, the estimate $\hat{u}_t$. Regarding the latter, we propose using a sliding window of size $w$ and forming the sample covariance matrix

$$\Sigma_t = \sum_{i=t+1}^{t+w} x_i x_i^\top,$$

using the observations that lie in the future of $t$. Then $\hat{u}_t$ is simply the unit-norm eigenvector corresponding to the largest eigenvalue of $\Sigma_t$. The usage of observations from the future might seem somewhat awkward, but it is always possible by properly delaying the data. The main advantage of this idea is that it provides estimates that are independent from $x_t$. Of course, employing observations from times after $t$ affects the actual performance of our scheme. In particular, if with (8) we stop at time $T_{\mathrm{SC}}$, this implies that we used data from times up to $T_{\mathrm{SC}} + w$ and, consequently, $T_{\mathrm{SC}} + w$ is the true time we stop and not $T_{\mathrm{SC}}$.
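Putting (6), (8) and the sliding-window eigenvector estimate together, a minimal sketch of the resulting detector (the helper name and all parameter values are illustrative; $d$ is simply a value chosen between the two means in (7)):

```python
import numpy as np

# Sketch of the Subspace-CUSUM recursion (6): u is replaced by the top
# eigenvector of a sample covariance over the w samples *after* time t
# (implemented by delaying the decision by w).
def subspace_cusum(x, w, d, b):
    n = len(x)
    S, t = 0.0, 0
    while t + w < n:
        t += 1
        window = x[t:t + w]                  # x_{t+1}, ..., x_{t+w}
        Sigma_t = window.T @ window          # unnormalized sample covariance
        _, vecs = np.linalg.eigh(Sigma_t)
        u_hat = vecs[:, -1]                  # top unit-norm eigenvector
        S = max(S, 0.0) + (u_hat @ x[t - 1]) ** 2 - d
        if S >= b:
            return t + w                     # true stopping time includes the delay w
    return None

# Illustrative demo: change-point at time 300.
rng = np.random.default_rng(5)
k, sigma2, theta, w = 8, 1.0, 4.0, 40
u = np.zeros(k); u[0] = 1.0
L = np.linalg.cholesky(sigma2 * np.eye(k) + theta * np.outer(u, u))
x = np.vstack([rng.standard_normal((300, k)),          # pre-change
               rng.standard_normal((700, k)) @ L.T])   # post-change
T = subspace_cusum(x, w=w, d=2.5, b=50.0)
print(T)
```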

The independence between $\hat{u}_t$ and $x_t$ allows for the simple computation of the two expectations in (7). However, for this computation to be possible, especially under the alternative regime, it is necessary to be able to describe the statistical behavior of our estimate $\hat{u}_t$. We will assume that the window size $w$ is sufficiently large so that Central Limit Theorem type approximations are possible for $\hat{u}_t$, and we will consider that $\hat{u}_t$ is actually Gaussian with mean $u$ (the correct vector) and an (error) covariance matrix of the order of $1/w$ that can be specified analytically [12, 13]. Explicit formulas are given in the Appendix.

###### Lemma 1.

Adopting the Gaussian approximation for $\hat{u}_t$, we have the following two mean values under the pre- and post-change regimes:

$$\mathsf{E}_\infty[(\hat{u}_t^\top x_t)^2] = \sigma^2, \qquad \mathsf{E}_0[(\hat{u}_t^\top x_t)^2] = \sigma^2(1+\rho)\left[1 - \frac{k-1}{w\rho}\right].$$
###### Proof.

The proof is given in the Appendix. ∎

Lemma 1 also suggests that the window size $w$ and the drift $d$ must satisfy

$$\sigma^2 < d < \sigma^2(1+\rho)\left[1 - \frac{k-1}{w\rho}\right]. \tag{9}$$

A necessary condition for this to be true is that $w > (k-1)(1+\rho)/\rho^2$. Actually, this constraint is required for the Gaussian approximation to make sense. But in order for the approximation to be efficient we, in fact, need $w$ to be significantly larger than this lower bound. We can see that when the SNR is high ($\rho \gg 1$), then a relatively small window size can yield efficient estimates. When, on the other hand, the SNR is low ($\rho \ll 1$), far larger window sizes are necessary to guarantee the validity of the Gaussian approximation.
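The post-change mean value of Lemma 1 can be checked by direct Monte-Carlo simulation; a sketch with illustrative parameters (here $w$ comfortably exceeds the lower bound just discussed):

```python
import numpy as np

# Monte-Carlo sanity check of Lemma 1 (illustrative parameters): estimate
# E_0[(u_hat^T x_t)^2], with u_hat computed from an *independent* window of
# w post-change samples, against sigma^2 (1+rho)(1 - (k-1)/(w rho)).
rng = np.random.default_rng(3)
k, sigma2, theta, w, trials = 10, 1.0, 3.0, 200, 4000
rho = theta / sigma2
u = np.zeros(k); u[0] = 1.0
L = np.linalg.cholesky(sigma2 * np.eye(k) + theta * np.outer(u, u))

vals = []
for _ in range(trials):
    window = rng.standard_normal((w, k)) @ L.T      # post-change samples for u_hat
    _, vecs = np.linalg.eigh(window.T @ window)
    u_hat = vecs[:, -1]                             # top unit-norm eigenvector
    xt = L @ rng.standard_normal(k)                 # independent post-change sample
    vals.append((u_hat @ xt) ** 2)

empirical = np.mean(vals)
predicted = sigma2 * (1 + rho) * (1 - (k - 1) / (w * rho))
print(empirical, predicted)
```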

## 4 Asymptotic analysis

In this section we provide performance estimates for the optimum CUSUM test (which has all the information regarding the data) and the Subspace-CUSUM test proposed in the previous section. This will allow for the optimum design of the two parameters $d$ and $w$, and for demonstrating that the resulting detector is asymptotically optimum.

In sequential change detection there are two quantities that play a vital role in the performance of a detector: a) the average run length (ARL) and b) the expected detection delay (EDD). The ARL measures the average period between false alarms, while the EDD measures the (worst-case) average detection delay. It is known that CUSUM minimizes the latter while keeping the former above a prescribed level. Let us first compute these two quantities for the case of CUSUM given in (4).

### 4.1 Asymptotic performance

From [14, Pages 396–397] we have that the test depicted in (4) has the following performance:

$$\mathsf{E}_\infty[T_{\mathrm{C}}] = \frac{e^b}{K}\,(1+o(1)), \qquad \mathsf{E}_0[T_{\mathrm{C}}] = \frac{b}{I_0}\,(1+o(1)), \tag{10}$$

where $b$ is the constant threshold; $K$ is of the order of a constant, with its exact value being unimportant for the asymptotic analysis; finally, $I_0$ is the Kullback–Leibler information number $I_0 = \mathsf{E}_0[\log(f_0(x_t)/f_\infty(x_t))]$. We recall that the worst-case average detection delay of CUSUM is equal to $\mathsf{E}_0[T_{\mathrm{C}}]$, which is the reason we consider the computation of this quantity. If we now impose the constraint that the ARL is equal to $\gamma$ and, for the asymptotic analysis, let $\gamma \to \infty$, then we can compute the threshold that achieves this false alarm performance, namely $b = \log\gamma\,(1+o(1))$. Substituting this value of the threshold into the EDD we obtain

$$\mathsf{E}_0[T_{\mathrm{C}}] = \frac{\log\gamma}{I_0}\,(1+o(1)). \tag{11}$$

Applying this formula in our problem we end up with the following optimum performance

$$\mathsf{E}_0[T_{\mathrm{C}}] = \frac{2\log\gamma}{\rho - \log(1+\rho)}\,(1+o(1)). \tag{12}$$
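The first-order approximation (12) is easy to evaluate numerically; a minimal sketch (the $o(1)$ term is dropped, and the argument values are illustrative):

```python
import math

# First-order EDD approximation (12) for the exact CUSUM, as a function of
# the ARL level gamma and the SNR rho (the o(1) term is dropped).
def cusum_edd_approx(gamma, rho):
    return 2.0 * math.log(gamma) / (rho - math.log(1.0 + rho))

# Higher SNR means a larger KL number and hence a smaller delay.
print(cusum_edd_approx(1e4, 1.0))
print(cusum_edd_approx(1e4, 2.0))
```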

For the performance computation of Subspace-CUSUM, since the increment in (6) is not a log-likelihood ratio, we cannot use (11) directly. To compute the ARL of $T_{\mathrm{SC}}$ we first find $\delta_\infty > 0$ as the solution of the equation

$$\mathsf{E}_\infty\!\left[e^{\delta_\infty[(\hat{u}_t^\top x_t)^2 - d]}\right] = 1, \tag{13}$$

and then we note that $\delta_\infty[(\hat{u}_t^\top x_t)^2 - d]$ is the log-likelihood ratio between the two pdfs $f_\infty(x_t)\,e^{\delta_\infty[(\hat{u}_t^\top x_t)^2 - d]}$ and $f_\infty(x_t)$. This allows us to compute the threshold asymptotically as $b = \log\gamma/\delta_\infty\,(1+o(1))$ after assuming that the ARL is equal to $\gamma$. Similarly, we can find a $\delta_0$ and define the corresponding pdf so that $\delta_0[(\hat{u}_t^\top x_t)^2 - d]$ is a log-likelihood ratio under the alternative regime, leading to $\mathsf{E}_0[T_{\mathrm{SC}}] = b/\mathsf{E}_0[(\hat{u}_t^\top x_t)^2 - d]\,(1+o(1))$, with the dependence on $d$ being in the $o(1)$ term. Substituting the value of $b$ we obtain

$$\mathsf{E}_0[T_{\mathrm{SC}}] = \frac{\log\gamma}{\delta_\infty\,\mathsf{E}_0\!\left[(\hat{u}_t^\top x_t)^2 - d\right]}\,(1+o(1)) + w, \tag{14}$$

where the last term is added because we use data from the future of $t$, as explained before. The parameter $\delta_\infty$, defined in (13), is directly related to $d$. We show in the Appendix that $d$ can be expressed in terms of $\delta_\infty$ as follows:

$$d = -\frac{1}{2\delta_\infty}\log\left(1 - 2\sigma^2\delta_\infty\right). \tag{15}$$

After using Lemma 1 and (15) we obtain the following expression for the EDD:

$$\mathsf{E}_0[T_{\mathrm{SC}}] = \frac{\log\gamma\,(1+o(1))}{\sigma^2\delta_\infty(1+\rho)\left(1 - \frac{k-1}{w\rho}\right) + \frac{1}{2}\log\left(1 - 2\sigma^2\delta_\infty\right)} + w. \tag{16}$$

### 4.2 Parameter optimization and asymptotic optimality

Note that in the previous equation we have two parameters, $\delta_\infty$ and $w$, and the goal is to select them so as to minimize the EDD. Therefore, if we first fix the window size $w$, we can minimize over $\delta_\infty$ (which is equivalent to minimizing with respect to the drift $d$). We observe that the denominator is a concave function of $\delta_\infty$, therefore it exhibits a single maximum. The optimum can be computed by taking the derivative and equating it to zero, which leads to a particular $\delta_\infty^*$ (given in the Appendix). Substituting this optimal value we obtain the following minimum EDD:

$$\mathsf{E}_0[T_{\mathrm{SC}}] = \frac{2\log\gamma\,(1+o(1))}{(1+\rho)\left(1 - \frac{k-1}{w\rho}\right) - 1 - \log\left[(1+\rho)\left(1 - \frac{k-1}{w\rho}\right)\right]} + w. \tag{17}$$

Equation (17) involves only the target ARL level $\gamma$ and the window size $w$. If we keep $w$ constant, it is easy to verify that the ratio of the EDD of the proposed scheme over the EDD of the optimum CUSUM tends, as $\gamma \to \infty$, to a quantity which is strictly greater than 1. In order to make this ratio tend to 1, and therefore establish asymptotic optimality, we need to select the window size $w$ as a function of $\gamma$. Actually, we can perform this selection optimally by minimizing (17) with respect to $w$ for given $\gamma$. The following proposition identifies the optimum window size.

###### Proposition 1.

For each ARL level $\gamma$, the optimal window size that minimizes the corresponding EDD is given by

$$w^* = \sqrt{\log\gamma}\;\frac{\sqrt{2(k-1)}}{\rho - \log(1+\rho)}\,(1+o(1)),$$

resulting in an optimal drift

$$d^* = \sigma^2\,\frac{(1+\rho)\!\left(1 - \frac{k-1}{w^*\rho}\right)}{(1+\rho)\!\left(1 - \frac{k-1}{w^*\rho}\right) - 1}\;\log\!\left[(1+\rho)\!\left(1 - \frac{k-1}{w^*\rho}\right)\right].$$
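A small sketch evaluating the parameter choices of Proposition 1 (the $o(1)$ terms are dropped; all argument values are illustrative):

```python
import math

# Sketch evaluating the Proposition 1 parameter choices (o(1) terms dropped).
def optimal_window(gamma, k, rho):
    return math.sqrt(math.log(gamma)) * math.sqrt(2 * (k - 1)) / (rho - math.log(1 + rho))

def optimal_drift(sigma2, k, rho, w):
    A = (1 + rho) * (1 - (k - 1) / (w * rho))   # recurring shorthand
    return sigma2 * A / (A - 1) * math.log(A)

gamma, k, sigma2, theta = 1e4, 10, 1.0, 2.0
rho = theta / sigma2
w_star = optimal_window(gamma, k, rho)
d_star = optimal_drift(sigma2, k, rho, w_star)
print(w_star, d_star)
```

Note that the resulting drift indeed lands strictly between the two bounds in (9), as it must for the detector to behave like a CUSUM.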

Using these optimal parameter values it is straightforward to establish that the corresponding Subspace-CUSUM is first-order asymptotically optimum. This is summarized in our next theorem.

###### Theorem 1.

As the ARL level $\gamma \to \infty$, the corresponding EDD of the Subspace-CUSUM procedure, with the two parameters $d$ and $w$ optimized as above, satisfies

$$\lim_{\gamma\to\infty}\frac{\mathsf{E}_0[T_{\mathrm{SC}}]}{\mathsf{E}_0[T_{\mathrm{C}}]} = 1. \tag{18}$$
###### Proof.

As we pointed out, the proof is straightforward. Indeed, if we substitute the optimum $w^*$ and $d^*$ and then take the ratio with respect to the optimum CUSUM performance depicted in (12), we obtain

$$\frac{\mathsf{E}_0[T_{\mathrm{SC}}]}{\mathsf{E}_0[T_{\mathrm{C}}]} = 1 + \sqrt{\frac{2(k-1)}{\log\gamma}} + o(1) \to 1,$$

which proves the desired limit. Even though the ratio tends to 1, we note that the difference $\mathsf{E}_0[T_{\mathrm{SC}}] - \mathsf{E}_0[T_{\mathrm{C}}]$ grows without bound with $\gamma$. This is corroborated by our simulations (see Fig. 2, red curve). ∎

## 5 Numerical examples

We present simulations to illustrate the satisfactory performance of Subspace-CUSUM. For comparison, we consider two other detection procedures: one uses the largest eigenvalue of the sample covariance matrix as the test statistic while the other is the exact CUSUM assuming all parameters are known (ideal but unrealistic case).

The threshold for each detection procedure is determined through Monte-Carlo simulation so that they all have the same ARL. Fig. 2 depicts the EDD versus ARL, with the latter under a logarithmic scale; the dimension $k$, the SNR $\rho$, and the window length $w$ are fixed across methods. Exact CUSUM (black) is compared against Subspace-CUSUM (green) and the largest-eigenvalue scheme (blue). We see that Subspace-CUSUM has a much smaller EDD than the largest-eigenvalue procedure, while Subspace-CUSUM with optimized window size (red) is uniformly more efficient.

We also consider the EDD versus ARL for different values of $w$, and with $w$ numerically optimized so as to minimize the detection delay for each ARL level. The results appear in Fig. 2, which demonstrates that indeed the optimal $w$ increases with the ARL.

## References

• [1] Matthew Berger, Lee M Seversky, and Daniel S Brown, “Classifying swarm behavior via compressive subspace learning,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 5328–5335.
• [2] Frank B Alt, “Multivariate quality control,” Encyclopedia of Statistical Sciences, 2004.
• [3] John D Healy, “A note on multivariate CUSUM procedures,” Technometrics, vol. 29, no. 4, pp. 409–412, 1987.
• [4] Jie Chen and AK Gupta, “Statistical inference of covariance change points in Gaussian model,” Statistics, vol. 38, no. 1, pp. 17–28, 2004.
• [5] Gideon Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
• [6] Ery Arias-Castro, Sébastien Bubeck, and Gábor Lugosi, “Detection of correlations,” The Annals of Statistics, vol. 40, no. 1, pp. 412–435, 2012.
• [7] Iain M Johnstone, “On the distribution of the largest eigenvalue in principal components analysis,” The Annals of Statistics, vol. 29, no. 2, pp. 295–327, 2001.
• [8] Ewan S Page, “Continuous inspection schemes,” Biometrika, vol. 41, no. 1/2, pp. 100–115, 1954.
• [9] David Siegmund, Sequential analysis: tests and confidence intervals, Springer Science & Business Media, 1985.
• [10] George V Moustakides, “Optimal stopping times for detecting changes in distributions,” The Annals of Statistics, pp. 1379–1387, 1986.
• [11] Gary Lorden, “Procedures for reacting to a change in distribution,” The Annals of Mathematical Statistics, pp. 1897–1908, 1971.
• [12] Theodore Wilbur Anderson, “Asymptotic theory for principal component analysis,” The Annals of Mathematical Statistics, vol. 34, no. 1, pp. 122–148, 1963.
• [13] Debashis Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance model,” Statistica Sinica, pp. 1617–1642, 2007.
• [14] Alexander Tartakovsky, Igor Nikiforov, and Michele Basseville, Sequential analysis: Hypothesis testing and changepoint detection, Chapman and Hall/CRC, 2014.

## Appendix A Appendix

###### Proof of Lemma 1.

Using the independence between $\hat{u}_t$ and $x_t$ we can write

$$\mathsf{E}[(\hat{u}_t^\top x_t)^2] = \mathsf{E}\!\left[\hat{u}_t^\top\, \mathsf{E}[x_t x_t^\top]\, \hat{u}_t\right]. \tag{19}$$

Consequently, under the nominal regime

$$\mathsf{E}_\infty[(\hat{u}_t^\top x_t)^2] = \mathsf{E}_\infty[\hat{u}_t^\top\, \sigma^2 I_k\, \hat{u}_t] = \sigma^2,$$

with the last equality being true because $\hat{u}_t$ is of unit norm.

Under the alternative regime we are going to use Central Limit Theorem arguments [12, 13] that describe the statistical behavior of the estimator. We have that

$$\sqrt{w}\,(\omega_t - u) \to \mathcal{N}\!\left(0,\; \frac{1+\rho}{\rho^2}\,(I_k - uu^\top)\right),$$

where the limit is in distribution as $w \to \infty$ and $\omega_t$ denotes the un-normalized eigenvector. For large $w$ we can write $\omega_t = u + v_t$, where

$$v_t \sim \mathcal{N}\!\left(0,\; \frac{1+\rho}{w\rho^2}\,(I_k - uu^\top)\right).$$

Our estimator is related to $\omega_t$ through the normalization $\hat{u}_t = \omega_t/\|\omega_t\| = (u + v_t)/\|u + v_t\|$, and if we use this in (19), after recalling that under the alternative $\mathsf{E}_0[x_t x_t^\top] = \sigma^2(I_k + \rho\, uu^\top)$ and using repeatedly the fact that $u$ and $v_t$ are orthogonal, we have

$$\mathsf{E}_0[(\hat{u}_t^\top x_t)^2] = \sigma^2\,\mathsf{E}_0[\hat{u}_t^\top(I_k + \rho\, uu^\top)\hat{u}_t] = \sigma^2\left(1 + \rho\,\mathsf{E}_0[(\hat{u}_t^\top u)^2]\right) = \sigma^2\left(1 + \rho\,\mathsf{E}_0\!\left[\frac{1}{1 + \|v_t\|^2}\right]\right) \approx \sigma^2\left(1 + \rho\,\mathsf{E}_0\!\left[1 - \|v_t\|^2\right]\right) = \sigma^2(1+\rho)\left(1 - \frac{k-1}{w\rho}\right).$$

For the approximate equality we used the fact that, to a first-order approximation, we can write $(1 + \|v_t\|^2)^{-1} \approx 1 - \|v_t\|^2$, because $\|v_t\|^2$ is of the order of $1/w$ while the approximation error is of higher order. This completes the proof. ∎

###### Proof of Proposition 1.

Let us first evaluate the expectation in (13) to demonstrate the relationship between $d$ and $\delta_\infty$ depicted in (15). Using standard computations involving Gaussian random vectors we can write

$$\mathsf{E}_\infty\!\left[e^{\delta_\infty[(\hat{u}_t^\top x_t)^2 - d]}\right] = e^{-\delta_\infty d}\,\mathsf{E}_\infty\!\left[\mathsf{E}_\infty\!\left[e^{\delta_\infty(\hat{u}_t^\top x_t)^2}\,\middle|\,\hat{u}_t\right]\right] = e^{-\delta_\infty d}\,\mathsf{E}_\infty\!\left[\int e^{\delta_\infty x_t^\top(\hat{u}_t\hat{u}_t^\top)x_t}\,\frac{e^{-x_t^\top x_t/(2\sigma^2)}}{\sqrt{(2\pi)^k\sigma^{2k}}}\,dx_t\right] = \frac{e^{-\delta_\infty d}}{\sqrt{1 - 2\sigma^2\delta_\infty}}.$$

To compute the integral we used the standard technique of “completing the square” in the exponent and, with proper normalization, generated an alternative Gaussian pdf which integrates to 1. The interesting observation is that the result of the integration does not actually depend on $\hat{u}_t$.
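This Gaussian integral is also easy to verify by simulation: for any fixed unit vector $\hat{u}$ and $x \sim \mathcal{N}(0, \sigma^2 I_k)$, the scalar $\hat{u}^\top x$ is $\mathcal{N}(0, \sigma^2)$, so the conditional expectation equals $1/\sqrt{1 - 2\sigma^2\delta_\infty}$. A sketch with illustrative values:

```python
import numpy as np

# Numerical check of the Gaussian integral above: for a fixed unit vector
# u_hat and x ~ N(0, sigma^2 I_k), E[exp(delta (u_hat^T x)^2)] equals
# 1/sqrt(1 - 2 sigma^2 delta), independently of u_hat. Values illustrative.
rng = np.random.default_rng(4)
k, sigma2, delta = 5, 1.0, 0.2
u_hat = rng.standard_normal(k); u_hat /= np.linalg.norm(u_hat)
x = rng.standard_normal((200000, k)) * np.sqrt(sigma2)
empirical = np.mean(np.exp(delta * (x @ u_hat) ** 2))
predicted = 1.0 / np.sqrt(1.0 - 2.0 * sigma2 * delta)
print(empirical, predicted)
```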

If we use the optimum value for the threshold $b$ in terms of $\gamma$, then, as we argued in the text, we obtain for the EDD the expression appearing in (16). We can now fix $w$ and optimize the EDD with respect to $\delta_\infty$. This is a straightforward process since it amounts to maximizing the denominator. Taking the derivative and equating it to zero yields the optimum

$$\delta_\infty^* = \frac{1}{2\sigma^2}\left(1 - \frac{1}{(1+\rho)\left(1 - \frac{k-1}{w\rho}\right)}\right).$$

Substituting this value in (16) produces (17).

The next step consists in minimizing (17) with respect to $w$. Again, taking the derivative and equating it to zero, we can show that the optimum window size is the $w^*$ depicted in Proposition 1. ∎
