Changepoint detection in highdimensional covariance structure
Abstract
In this paper we introduce a novel approach for an important problem of break detection. Specifically, we are interested in detection of an abrupt change in the covariance structure of a highdimensional random process – a problem, which has applications in many areas e.g., neuroimaging and finance. The developed approach is essentially a testing procedure involving a choice of a critical level. To that end a nonstandard bootstrap scheme is proposed and theoretically justified under mild assumptions. Theoretical study features a result providing guaranties for break detection. All the theoretical results are established in a highdimensional setting (dimensionality ). Multiscale nature of the approach allows for a tradeoff between sensitivity of break detection and localization. The approach can be naturally employed in an online setting. Simulation study demonstrates that the approach matches the nominal level of false alarm probability and exhibits high power, outperforming a recent approach.
10.1214/154957804100000000 0000 \volume0 \firstpage0 \lastpage0 \startlocaldefs\endlocaldefs \startlocaldefs \endlocaldefs
Changepoint detection in highdimensional covariance structure
,
class=MSC] \kwd[Primary ]62M10, 62H15 \kwd[; secondary ]91B84, 62P10
multiscale \kwdbootstrap \kwdstructural change \kwdcritical value \kwdprecision matrix
1 Introduction
The analysis of high dimensional time series is crucial for many fields including neuroimaging and financial engineering. There, one often has to deal with processes involving abrupt structural changes which necessitate a corresponding adaptation of a model and/or a strategy. Structural break analysis comprises determining if an abrupt change is present in the given sample and if so, estimating the changepoint, namely the moment in time when it takes place. In literature both problems may be referred to as changepoint or break detection. In this study we will be using terms break detection and changepoint localization respectively in order to distinguish between them. The majority of approaches to the problem consider only a univariate process [13] [1]. However, in recent years the interest for multidimensional approaches has increased. Most of them cover the case of fixed dimensionality [27] [25] [2] [35] [36]. Some approaches [10, 24, 11] feature highdimensional theoretical guaranties but only the case of dimensionality polynomially growing in sample size is covered. The case of exponential growth has not been considered so far.
In order to detect a break, a test statistic is usually computed for each point (e.g. [27]). The break is detected if the maximum of these values exceeds a certain threshold. A proper choice of the latter may be a tricky issue. Consider a pair of plots (Figure 1) of the statistic defined in Section 2. It is rather difficult to see how many breaks are there, if any. The classic approach to the problem is based on the asymptotic behaviour of the statistic [13] [1] [2] [24] [6] [36]. As an alternative, permutation [24] [27] or parametric bootstrap may be used [24]. Clearly, it seems attractive to choose the threshold in a solely datadriven way as it is suggested in the recent paper [10], but a careful bootstrap validation is still an open question.
In the current study we are interested in a particular kind of a break – an abrupt transformation in the covariance matrix – which is motivated by applications to neuroimaging. The covariance structure of data in functional Magnetic Resonance Imaging has recently drawn a lot of interest, as it encodes socalled functional connectivity networks [34] which refer to the explicit influence among neural systems [18]. A rather popular approach to inferencing these networks is based on estimating inverse covariance or precision matrices [Allen2012]. The technique generally makes use of the observation that functional connectivity networks are of smallworld type [34], which makes sparsity assumptions feasible. Analysing the dynamics of these networks is particularly important for the research on neural diseases and also in the context of brain development with emphasis on characterizing the reconfiguration of the brain during learning [4].
A similar problem is found in finance: the dynamics of the covariance structure of a highdimensional process modelling exchange rates and market indexes is crucial for a proper asset allocation in a portfolio [12, 5, 14, 29].
One approach to the changepoint localization is developed in [25], the corresponding significance testing problem is considered in [2]. However, neither of these papers address the highdimensional case.
A widely used break detection approach (named CUSUM) [11, 2, 24] suggests to compute a statistic at a point as a distance of estimators of some parameter of the underlying distributions obtained using all the data before and after that point. This technique requires the whole sample to be known in advance, which prevents it from being used in online setting. In order to overcome this drawback we propose the following augmentation: choose a window size and compute parameter estimators using only points before and points after the central point (see Section 2 for formal definition). Window size is an important parameter and its choice is casespecific (see Section 4 for theoretical treatment of this issue). Using a small window results in high variability and low sensitivity, while a large window implies higher uncertainty in changepoint localization yielding the issue of a proper choice of window size. The multiscale nature of the proposed method enables us to incorporate the advantages of narrower and wider windows by considering multiple window sizes at once in order for wider windows to provide higher sensitivity while narrower ones improve changepoint localization.
The contribution of our study is the development of a novel break detection approach which is

highdimensional, allowing for up to exponential growth of the dimensionality with the window size

suitable for online setting

multiscale, attaining tradeoff between break detection sensitivity and changepoint localization accuracy

using a fully datadriven threshold selection algorithm rigorously justified under mild assumptions

featuring formal sensitivity guaranties in highdimensional setting
We consider the following setup. Let denote an independent sample of zeromean vectors (the online setting is discussed in Section 3) and we want to test a hypothesis
(1.1) 
versus an alternative suggesting the existence of a break:
(1.2) 
and localize the changepoint as precisely as possible or (in online setting) to detect a break as soon as possible.
The approach proposed in the paper focuses on applications in neuroimagin
In the current study it is also assumed that some subset of indices of size (possibly, ) is chosen. The threshold is chosen relying on the subsample while the teststatistic is computed based on the whole sample.
To this end we define a family of test statistics in Section 2.1 which is followed by Section 2.2 describing a datadriven (bootstrap) calibration scheme and Section 2.3 describing changepoint localization procedure. The theoretical part of the paper justifies the proposed procedure in a highdimensional setting. The result justifying the validity of the proposed calibration scheme is stated in Section 3. Section 4 is devoted to the sensitivity result yielding a bound for the window size necessary to reliably detect a break of a given extent and hence bounding the uncertainty of the changepoint localization (or the delay of detection in online setting). The theoretical study is supported by a comparative simulation study (described in Section 5) demonstrating conservativeness of the proposed test and higher sensitivity compared to the other algorithms. Appendix A contains a finitesample version of sensitivity result along with the proofs. Appendix B provides a a finitesample version of bootstrap sensitivity result which is followed by the proofs. Finally, Appendix H lists results which were essential for our theoretical study.
2 Proposed approach
This section describes the proposed approach along with a datadriven calibration scheme. Informally the proposed statistic can be described as follows. Provided that the break may happen only at moment , one could estimate some parameter of the distribution using datapoints to the left of , estimate it again using datapoints to the right and use the norm of their difference as a teststatistic . Yet, in practice one does not usually possess such knowledge, therefore we propose to maximize these statistics over all possible locations yielding . Finally, in order to attain a tradeoff between break detection sensitivity and changepoint localization accuracy we build a multiscale approach: consider a family of test statistics for multiple window sizes at once.
2.1 Definition of the test statistic
Now we present a formal definition of the test statistic. In order to detect a break we consider a set of window sizes . Denote the size of the widest window as and of the narrowest as . Given a sample of length , for each window size define a set of central points . Next, for all define a set of indices which belong to the window on the left side of the central point as and correspondingly for the window on the right side define . Denote the sum of numbers of central points for all window sizes as
(2.1) 
For each window size , each central point and each side we define a desparsified estimator of precision matrix [23] [22] as
(2.2) 
where
(2.3) 
and is a consistent estimator of precision matrix which can be obtained by graphical lasso [31] or nodewise procedure [22] (see Definition 3.1 and Appendix H.5 for details).
Now define a matrix of size with elements
(2.4) 
where for , stands for the th row of . Denote their variances as and introduce the diagonal matrix . Denote a consistent estimator (see Definition 3.1 for details) of the precision matrix obtained based on the subsample as . In practice, the variances are unknown, but under normality assumption one can plug in which have been proven to be consistent (uniformly for all and ) estimators of [23] [3]. If the nodewise procedure is employed, the uniform consistency of an empirical estimate of has been shown under some mild assumptions (not including normality) [22].
For each window size and a central point we define a statistic
(2.5) 
where we write for a vector composed of stacked columns of matrix . Finally we define our family of test statistics for all as
(2.6) 
Our approach heavily relies on the following expansion under
This expansion might have been used in order to investigate the asymptotic properties of and obtain the threshold, however we propose a datadriven scheme.
Remark 2.1.
A different test statistic can be defined as the maximum distance between elements of empirical covariance matrices and . However, application to neuroimaging motivates the search for a structural change in a functional connectivity network which is encoded by the structure of the corresponding precision matrix. Clearly, a change in the precision matrix also means a change in the covariance matrix, though we believe that the definition (2.5) increases the sensitivity to this kind of alternative.
2.2 Bootstrap calibration
Our approach rejects in favor of if at least one of statistics exceeds the corresponding threshold or formally if .
In order to properly choose the thresholds, we define bootstrap statistics in the following nonstandard way. Note, that we cannot use an ordinary scheme with replacement or weighted bootstrap since in a highdimensional case ( the covariance matrix of bootstrap distribution would be singular which would made inverse covariance matrix estimation procedures meaningless.
First, draw with replacement a sequence of indices from and denote
where stands for averaging over values of index belonging to e.g., . Denote the measure are distributed with respect to as . In accordance with (2.4) define
(2.9) 
and for technical purposes define
(2.10) 
Now for all central point define a bootstrap counterpart of
(2.11) 
which is intuitively reasonable due to expansion (2.7). And finally we define the bootstrap counterpart of as
(2.12) 
Now for each given we can define quantile functions such that
(2.13) 
Next for a given significance level we apply multiplicity correction choosing as
(2.14) 
and finally choose thresholds as .
Remark 2.2.
One can choose and use the whole given sample for calibration as well as for detection. In fact, it would improve the bounds in Theorem 3.1 and Theorem 4.1, since it effectively means . However, in practise such a decision might lead to reduction of sensitivity due to overestimation of the thresholds.
2.3 Changepoint localization
In order to localize a changepoint we have to assume that . Consider the narrowest window detecting a changepoint as :
(2.15) 
and the central point where this window detects a break for the first time as
(2.16) 
By construction of the family of test statistics we conclude (up to the confidence level ) that the changepoint is localized in the interval
(2.17) 
Clearly, if a nonmultiscale version of the approach is employed, i.e. , and the precision of localization (delay of the detection in online setting) equals .
3 Bootstrap validity
This section states and discusses the theoretical result demonstrating the validity of the proposed bootstrap scheme i.e.
(3.1) 
Our theoretical results require the tails of the underlying distributions to be light. Specifically, we impose SubGaussianity vector condition.
Assumption 3.1SubGaussianity vector condition.
(3.2) 
Naturally, in order to establish a theoretical result we have to assume that a method featuring theoretical guaranties was used for estimating the precision matrices. Such methods include graphical lasso [31], adaptive graphical lasso [37] and thresholded desparsified estimator based on nodewise procedure [22]. These approaches overcome the high dimensionality of the problem by imposing a sparsity assumption, specifically bounding the maximum number of nonzero elements in a row: . These approaches are guaranteed to yield a root consistent estimate revealing the sparsity pattern of the precision matrix [31, 3, 22] or formally
Definition 3.1.
Consider an i.i.d. sample . Denote their precision matrix as . Let and grow with . A positivedefinite matrix is a consistent estimator of the highdimensional precision matrix if
(3.3) 
and
(3.4) 
Graphical lasso and its adaptive versions impose an assumption, common for penalized approaches.
Assumption 3.2Irrepresentability condition.
Denote an active set
(3.5) 
and define a matrix where denotes Kronecker product. Irrepresentability condition holds if there exists such that
(3.6) 
The interpretation of irrepresentability condition under normality assumption is given in [23] [31]. Particularly, Assumption 3.2 requires low correlation between the elements of empirical covariance matrix from the active set and from its complement. The higher the constant is, the stricter upper bound is assumed.
These observations give rise to the two following assumptions.
Assumption 3.3.A.
Suppose, either graphical lasso or its adaptive version was used with regularization parameter and also impose Assumption 3.2.
Assumption 3.3.B.
Suppose, thresholded desparsified estimator based on nodewise procedure was used with regularization parameter .
Now we are ready to establish a result which guarantees that the suggested bootstrap procedure yields proper thresholds.
Theorem 3.1.
Assume holds and furthermore, let be i.i.d. Let Assumption 3.1 and either Assumption 3.3.A or Assumption 3.3.B hold. Also assume, the spectrum of is bounded. Allow the parameters grow with . Further let , and also impose the sparsity assumption
(3.7) 
Then
(3.8) 
The finitesample version of this result, namely, Theorem B.1, is given in Appendix B along with the proofs.
Bootstrap validity result discussion
Theorem 3.1 guarantees under mild assumptions (Assumption 3.2 seems to be the most restrictive one, yet it may be dropped if the nodewise procedure is employed) that the firsttype error rate meets the nominal level if the narrowest window size and the set are large enough. Clearly, the dependence on dimensionality is logarithmic which establishes applicability of the approach in a highdimensional setting. It is worth noticing that, unusually, the sparsity bound gets stricter with but the dependence is only logarithmic. Indeed, we gain nothing from longer samples, since we use only data points each time.
Online setting
As one can easily see, the theoretical result is stated in offline setting, when the whole sample of size is acquired in advance. In online setting we suggest to control the probability to raise a false alarm for at least one central point among data points (which differs from the classical techniques controlling the mean distance between false alarms [32]). Having and chosen one should acquire datapoints (the set ), use the proposed bootstrap scheme with bootstrap samples of length in order to obtain the thresholds. Next the approach can be naturally applied in online setting and Theorem 3.1 guarantees the capability of the proposed bootstrap scheme to control the aforementioned probability to raise a false alarm.
Proofs
The proof of the bootstrap validity result, presented in Appendix B, mostly relies on the highdimensional central limit theorems obtained in [9], [8]. These papers also present bootstrap justification results, yet do not include a comprehensive bootstrap validity result. The theoretical treatment is complicated by the randomness of . We overcome it by applying the socalled “sandwiching” proof technique (see Lemma C.1), initially used in [33].
4 Sensitivity result
Consider the following setting. Let there be index , such that are i.i.d. and are i.i.d. as well. Denote precision matrices and . Define the break extent as
(4.1) 
The question is, how large the window size should be in order to reliably reject and how firmly can we localize the changepoint.
Theorem 4.1.
Let Assumption 3.1 and either Assumption 3.3.A or Assumption 3.3.B hold. Also assume, the spectrums of and are bounded. Allow the parameters grow with and let decay with . Further let , ,
(4.2) 
and
(4.3) 
Then will be rejected with probability approaching .
This result is a direct corollary of the finitesample sensitivity result established and discussed in Appendix A.
The assumption is only technical. The result may be proven without relying on it by methodologically the same argument.
Sensitivity result discussion
Assumptions (4.2) and (4.3) are essentially a sparsity bound and a bound for the largest window size . Clearly, they do not yield a particular value necessary to detect a break, since it depends on the underlying distributions, however, the result includes dimensionality only under the sign of logarithm, which guarantees high sensitivity of the test in highdimensional setting.
Online setting
Theorem 4.1 is established in offline setting as well. In online setting it guarantees that the proposed approach can reliably detect a break of an extent not less than with a delay at most bounded by (4.3).
Changepoint localization guaranties
Theorem 4.1 implies by construction of statistic that the changepoint can be localized with precision up to . Hence condition (4.3) provides the bound for changepoint localization accuracy.
5 Simulation study
5.1 Design
In our simulation we test
versus an alternative
The alternative covariance matrix was generated in the following way. First we draw . The matrix is composed as a blockdiagonal matrix of matrices of size with ones on their diagonals and their offdiagonal element drawn uniformly from and an identity matrix of size . The dimensionality of the problem is chosen as , the length of the sample and we choose the set , e.g. . The absence of positive effect of large sample size is discussed in Sections 3 and 4. Moreover, in all the simulations under alternative the sample was generated with the change point in the middle: but the algorithm was oblivious about this as well as about either of the covariance matrices. The significance level was chosen. In all the experiments graphical lasso with penalization parameter was used in order to obtain . In the same way, graphical lasso with penalization parameter was used in order to obtain .
We have also come up with an approach to the same problem not involving bootstrap. The paper [26] defines a highdimensional twosample test for equality of matrices. Moreover, the authors prove asymptotic normality of their statistic which makes computing pvalue possible. We suggest to run this test for every and every , adjust the obtained pvalues using Holm method [19] and eventually compare them against .
The paper [27] suggests an approach based on comparing characteristic functions of random variables. The critical values were chosen with permutation test as proposed by the authors. In our experiments the method was allowed to consider all the sample at once. The Rpackage ecp [21] was used.
The first type error rate and power for our approach are reported in Table 1. As one can see, our approach allows to properly control first type error rate. As expected, its power is higher for larger windows and it is decreased by adding narrower windows into consideration which is the price to be paid for better localization of a change point.
In our study the approach proposed in [27] and the one based on the two sample test [26] turned out to be conservative, but neither of them exhibited power above .
Also, in order to justify application of multiscale approach (i. e. ) for the sake of better changepoint localization we report the distribution of the narrowest detecting window (defined by (2.15)) over in Figure 2. The Table 1 represents average precision of changepoint localization for various choices of set of window sizes . One can see, that multiscale approach significantly improves the precision of localization.
I type error rate  Power  Localization precision  

{70}  0.02  0.09  70 
{100}  0.00  0.37  100 
{140}  0.01  0.81  140 
{70, 140}  0.01  0.76  135 
{100, 140}  0.01  0.75  124 
{70, 100, 140}  0.01  0.74  123 
Appendix A Proof of sensitivity result
Proof of Theorem 4.1.
Theorem A.1.
Let . Let denote a symmetric estimator of s.t. for some it holds that
(A.1) 
and . Suppose Assumption 3.1 holds and there exists such that for all and on some set
(A.2) 
Moreover, let the residual defined in Lemma F.2 be bounded:
Discussion of finitesample sensitivity result
The assumption (A.4) is rather complicated. Here we note that if either graphical lasso [31], adaptive graphical lasso [37] or thresholded desparsified estimator based on nodewise procedure [22] with penalization parameter chosen as was used, given , , , and it boils down to
(A.7) 
for some positive constant independent of while the parameters , and may be chosen as in (B.12), (B.11), (B.10) (high probability of is ensured by Lemma G.1). At the same time the remainder can be bounded by (B.6).
As expected, the bound for sufficient window size decreases with growth of the break extent and the size of the set , but increases with dimensionality . It is worth noticing, that the latter dependence is only logarithmic. And again, in the same way as with Theorem 3.1, the bound increases with the sample size (only logarithmically) since we use only data points.
Proof of Theorem A.1.
Consider a pair of centered normal vectors
(A.8) 
(A.9) 
(A.10) 
(A.11) 
where vectors and are defined in proofs of Lemma E.2 and Lemma F.1 respectively. Lemma A.2 applies here and yields for all positive
(A.12) 
where and is the number of central points for window of size . Applying Lemma G.2 on a set of probability at least yields , and hence, due to the fact that by construction,
(A.13) 
Due to Lemma F.2 and continuity of Gaussian c.d.f.
(A.14) 
and due to Lemma F.2 along with the fact that , choosing as proposed by equation (A.5) we ensure that .
Now by assumption of the theorem and by construction of the test statistics
(A.15) 
Finally, we notice that due to assumption (A.4) and therefore, will be rejected. ∎
Lemma A.1.
Consider a centered random Gaussian vector with arbitrary covariance matrix . For any positive it holds that
(A.16) 
Proof.
By convexity we obtain the following chain of inequalities for any
(A.17) 
Chernoff bound yields for any
(A.18) 
Finally, optimization over yields the claim. ∎
As a trivial corollary, one obtains
Lemma A.2.
Consider a centered random Gaussian vector with arbitrary covariance matrix . For any positive it holds that
(A.19) 
Appendix B Proof of bootstrap validity result
Proof of Theorem 3.1.
Theorem B.1.
Assume holds and furthermore, let be i.i.d. Let denote a symmetric estimator of s.t. for some positive
(B.1) 
and . Suppose Assumption 3.1 holds and there exists such that for all , and on set
(B.2) 
Moreover, let
Discussion of finitesample bootstrap validity result
The terms , , and involved in the statement of Theorem B.1 are rather complicated. The exact expressions for them are provided by Lemma G.2, Lemma E.1, Lemma F.2 and Lemma C.1 respectively, 3rd and 4th moments and involved therein are bounded by Lemma E.4 and Lemma G.3 while asymptotic bounds for are provided in [22] (for nodewise procedure) and [23] (for graphical lasso). For the case of graphical lasso an explicit form of is given in [3].
Here we just note that if is a root consistent estimator, recovering sparsity pattern (graphical lasso [31], adaptive graphical lasso [37] or thresholded desparsified estimator based on nodewise procedure [22]), then for , , , and given the spectrum of is bounded
(B.6) 
If either graphical lasso, adaptive graphical lasso or nodewise procedure [28] is used with in order to obtain , then on set it holds that
(B.7) 
The high probability of may be ensured by means of Lemma G.1 e.g., choosing for . Further
(B.8) 
(B.9) 
Here are positive constants independent of , , , and . We also note that the proper choice of , and in (B.5) is
(B.10) 
(B.11) 
(B.12) 
which ensures the probability defined by (B.5) to be above . For exact expression of , and see Lemma G.2, Lemma G.1 and Lemma F.2.
Proof of Theorem B.1.
Appendix C Sandwiching lemma
Lemma C.1.
Consider a normal multivariate vector with a deterministic covariance matrix and a normal multivariate vector with a possibly random covariance matrix such that
(C.1) 
(C.2) 
(C.3) 
where and are subvectors of and respectively. Then
(C.4) 
Proof.
Let us introduce some notation. Denote multivariate cumulative distribution function of as respectively. Define the following sets for all
(C.5) 
(C.6) 
and their boundaries
(C.7) 
(C.8) 
Consider and denote sets Define a set of thresholds satisfying the confidence level
(C.9) 
here and below comparison of vectors should be understood elementwise. Notice that due to continuity of multivariate normal distribution and assumption (C.2)
(C.10) 
Now for all and for all it holds that
(C.11) 
where we have consequently used (C.3), (C.1), (C.7) and (C.10). In the same way one obtains for all and for all
(C.12) 
which implies that .
Now denote quantile functions of as :
(C.13) 
In exactly the same way define quantile functions of . Clearly for all ,
(C.14) 
and hence
(C.15) 
(C.16) 
Using Taylor expansion with Lagrange remainder term we obtain for some