Change-point detection in high-dimensional covariance structure


Valeriy Avanesov (avanesov@wias-berlin.de) and Nazar Buzun (buzun@wias-berlin.de)
WIAS, Mohrenstr. 39, 10117 Berlin, Germany
Abstract

In this paper we introduce a novel approach to the important problem of break detection. Specifically, we are interested in detecting an abrupt change in the covariance structure of a high-dimensional random process – a problem which has applications in many areas, e.g., neuroimaging and finance. The developed approach is essentially a testing procedure involving a choice of a critical level. To that end a non-standard bootstrap scheme is proposed and theoretically justified under mild assumptions. The theoretical study features a result providing guarantees for break detection. All the theoretical results are established in a high-dimensional setting (dimensionality growing up to exponentially with the window size). The multiscale nature of the approach allows for a trade-off between the sensitivity of break detection and the accuracy of change-point localization. The approach can be naturally employed in an on-line setting. A simulation study demonstrates that the approach matches the nominal level of false alarm probability and exhibits high power, outperforming a recent approach.



MSC 2010 subject classifications: Primary 62M10, 62H15; secondary 91B84, 62P10.

Keywords: multiscale, bootstrap, structural change, critical value, precision matrix.

1 Introduction

The analysis of high-dimensional time series is crucial for many fields including neuroimaging and financial engineering. There, one often has to deal with processes involving abrupt structural changes which necessitate a corresponding adaptation of a model and/or a strategy. Structural break analysis comprises determining whether an abrupt change is present in the given sample and, if so, estimating the change-point, namely the moment in time when it takes place. In the literature both problems may be referred to as change-point or break detection. In this study we use the terms break detection and change-point localization, respectively, in order to distinguish between them. The majority of approaches to the problem consider only a univariate process [13] [1]. However, in recent years the interest in multi-dimensional approaches has increased. Most of them cover the case of fixed dimensionality [27] [25] [2] [35] [36]. Some approaches [10, 24, 11] feature high-dimensional theoretical guarantees, but only the case of dimensionality growing polynomially in the sample size is covered. The case of exponential growth has not been considered so far.

In order to detect a break, a test statistic is usually computed for each point (e.g. [27]). The break is detected if the maximum of these values exceeds a certain threshold. A proper choice of the latter may be a tricky issue. Consider the pair of plots (Figure 1) of the statistic defined in Section 2: it is rather difficult to see how many breaks there are, if any. The classic approach to the problem is based on the asymptotic behaviour of the statistic [13] [1] [2] [24] [6] [36]. As an alternative, permutation [24] [27] or parametric bootstrap [24] may be used. Clearly, it seems attractive to choose the threshold in a solely data-driven way, as is suggested in the recent paper [10], but a careful bootstrap validation is still an open question.

In the current study we are interested in a particular kind of break – an abrupt transformation of the covariance matrix – which is motivated by applications to neuroimaging. The covariance structure of data in functional Magnetic Resonance Imaging has recently drawn a lot of interest, as it encodes so-called functional connectivity networks [34], which refer to the explicit influence among neural systems [18]. A rather popular approach to inferring these networks is based on estimating inverse covariance or precision matrices [Allen2012]. The technique generally makes use of the observation that functional connectivity networks are of small-world type [34], which makes sparsity assumptions feasible. Analysing the dynamics of these networks is particularly important for the research on neural diseases and also in the context of brain development, with emphasis on characterizing the re-configuration of the brain during learning [4].

A similar problem is found in finance: the dynamics of the covariance structure of a high-dimensional process modelling exchange rates and market indexes is crucial for a proper asset allocation in a portfolio [12, 5, 14, 29].

One approach to change-point localization is developed in [25]; the corresponding significance testing problem is considered in [2]. However, neither of these papers addresses the high-dimensional case.

A widely used break detection approach (known as CUSUM) [11, 2, 24] computes the statistic at a point as the distance between estimators of some parameter of the underlying distributions obtained using all the data before and all the data after that point. This technique requires the whole sample to be known in advance, which prevents it from being used in an online setting. In order to overcome this drawback we propose the following modification: choose a window size and compute the parameter estimators using only the points inside the window before and the window after the central point (see Section 2 for a formal definition). The window size is an important parameter and its choice is case-specific (see Section 4 for a theoretical treatment of this issue). Using a small window results in high variability and low sensitivity, while a large window implies higher uncertainty in change-point localization, which raises the issue of a proper choice of window size. The multiscale nature of the proposed method enables us to incorporate the advantages of both narrower and wider windows by considering multiple window sizes at once: the wider windows provide higher sensitivity while the narrower ones improve change-point localization.

The contribution of our study is the development of a novel break detection approach which is

  • high-dimensional, allowing for up to exponential growth of the dimensionality with the window size

  • suitable for on-line setting

  • multiscale, attaining trade-off between break detection sensitivity and change-point localization accuracy

  • using a fully data-driven threshold selection algorithm rigorously justified under mild assumptions

  • featuring formal sensitivity guarantees in a high-dimensional setting

Figure 1: Plots of the test statistic computed on synthetically generated data without a change-point (left) and with a single change-point (right). Clearly, the choice of a threshold is not obvious.

We consider the following setup. Let $X_1, \dots, X_N \in \mathbb{R}^p$ denote an independent sample of zero-mean vectors (the on-line setting is discussed in Section 3) and we want to test the hypothesis

(1.1)   $H_0 : \operatorname{Var}[X_1] = \operatorname{Var}[X_2] = \dots = \operatorname{Var}[X_N]$

versus an alternative suggesting the existence of a break:

(1.2)   $H_1 : \exists\, \tau \text{ such that } \operatorname{Var}[X_i] = \Sigma_1 \text{ for } i \le \tau \text{ and } \operatorname{Var}[X_i] = \Sigma_2 \neq \Sigma_1 \text{ for } i > \tau,$

and localize the change-point $\tau$ as precisely as possible or (in the online setting) detect the break as soon as possible.

The approach proposed in the paper focuses on applications in neuroimaging.

In the current study it is also assumed that some subset of indices is chosen. The threshold is chosen relying on the corresponding sub-sample, while the test statistic is computed based on the whole sample.

To this end, we define a family of test statistics in Section 2.1, which is followed by Section 2.2 describing a data-driven (bootstrap) calibration scheme and Section 2.3 describing the change-point localization procedure. The theoretical part of the paper justifies the proposed procedure in a high-dimensional setting. The result justifying the validity of the proposed calibration scheme is stated in Section 3. Section 4 is devoted to the sensitivity result, yielding a bound for the window size necessary to reliably detect a break of a given extent and hence bounding the uncertainty of the change-point localization (or the delay of detection in the online setting). The theoretical study is supported by a comparative simulation study (described in Section 5) demonstrating the conservativeness of the proposed test and its higher sensitivity compared to the other algorithms. Appendix A contains a finite-sample version of the sensitivity result along with the proofs. Appendix B provides a finite-sample version of the bootstrap validity result, which is followed by the proofs. Finally, Appendix H lists results which were essential for our theoretical study.

2 Proposed approach

This section describes the proposed approach along with a data-driven calibration scheme. Informally, the proposed statistic can be described as follows. Provided that the break may happen only at one known moment, one could estimate some parameter of the distribution using the data points to the left of that moment, estimate it again using the data points to the right, and use the norm of their difference as a test statistic. Yet in practice one does not usually possess such knowledge; therefore we propose to maximize these statistics over all possible locations. Finally, in order to attain a trade-off between break detection sensitivity and change-point localization accuracy, we build a multiscale approach: we consider a family of such test statistics for multiple window sizes at once. A schematic implementation of this scan is sketched below.
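The sketch is a minimal Python skeleton, assuming a generic plug-in estimator: the names `scan_statistics`, `estimate_param` and `dist` are ours, standing in for the de-sparsified precision estimator of Section 2.1 and the standardized sup-norm used there.

```python
import numpy as np

def scan_statistics(X, window_sizes, estimate_param, dist):
    """For each window size n, slide a central point t across the sample,
    estimate a distribution parameter on the n points to the left and on
    the n points to the right of t, and record the distance between the
    two estimates.  Returns {n: array of A(n, t) over central points t}."""
    N = len(X)
    stats = {}
    for n in window_sizes:
        centers = range(n, N - n + 1)          # t with full windows on both sides
        vals = []
        for t in centers:
            left = estimate_param(X[t - n:t])   # estimate from the left window
            right = estimate_param(X[t:t + n])  # estimate from the right window
            vals.append(dist(left, right))
        stats[n] = np.asarray(vals)
    return stats

# A(n) = max_t A(n, t); the break is detected if A(n) exceeds the
# bootstrap threshold x(n) for at least one window size n.
```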

2.1 Definition of the test statistic

Now we present a formal definition of the test statistic. In order to detect a break we consider a set of window sizes and denote the sizes of the widest and the narrowest windows accordingly. Given a sample of length $N$, for each window size we define a set of admissible central points. Next, for each central point we define the set of indices which belong to the window on its left side and, correspondingly, the set of indices which belong to the window on its right side. Denote the sum of the numbers of central points over all window sizes as

(2.1)

For each window size, each central point and each side of it we define a de-sparsified estimator of the precision matrix [23] [22] as

(2.2)   $\hat{T} = 2 \hat{\Theta} - \hat{\Theta} \hat{\Sigma} \hat{\Theta},$

where

(2.3)   $\hat{\Sigma} = \frac{1}{n} \sum_{i \in I} X_i X_i^{\top}$

is the empirical covariance matrix computed over the corresponding window (with index set $I$ and size $n$) and $\hat{\Theta}$ is a consistent estimator of the precision matrix, which can be obtained by graphical lasso [31] or the node-wise procedure [22] (see Definition 3.1 and Appendix H.5 for details).

Now define a matrix with elements

(2.4)

built from the entries of the de-sparsified estimators on the two sides of the central point. Denote the variances of these elements accordingly and introduce the diagonal matrix $\hat{D}$ of their standard deviations. Denote a consistent estimator (see Definition 3.1 for details) of the precision matrix obtained from the calibration sub-sample accordingly. In practice the variances are unknown, but under a normality assumption one can plug in empirical counterparts which have been proven to be consistent (uniformly over all entries and central points) estimators [23] [3]. If the node-wise procedure is employed, the uniform consistency of an empirical estimate of the variances has been shown under some mild assumptions (not including normality) [22].
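For illustration, here is a sketch of the estimator (2.2) with the graphical lasso plug-in; it relies on scikit-learn's GraphicalLasso, and the regularization parameter `alpha` is a placeholder of ours, not a value prescribed by the theory.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def desparsified_precision(X, alpha=0.1):
    """De-sparsified precision estimator T = 2*Theta - Theta @ S @ Theta
    of (2.2), with Theta the graphical-lasso estimate and S the empirical
    covariance (2.3) of the window.  X holds one zero-mean row per point."""
    S = X.T @ X / X.shape[0]
    theta = GraphicalLasso(alpha=alpha).fit(X).precision_
    return 2.0 * theta - theta @ S @ theta
```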

For each window size $n$ and each central point $t$ we define a statistic

(2.5)   $A(n, t) = \sqrt{n}\, \bigl\| \hat{D}^{-1} \operatorname{vec}\bigl( \hat{T}^{\,l}_{n,t} - \hat{T}^{\,r}_{n,t} \bigr) \bigr\|_{\infty},$

where $\hat{T}^{\,l}_{n,t}$ and $\hat{T}^{\,r}_{n,t}$ denote the estimators (2.2) computed over the left and right windows and $\operatorname{vec}(M)$ stands for a vector composed of the stacked columns of a matrix $M$. Finally, we define our family of test statistics for all window sizes $n$ as

(2.6)   $A(n) = \max_{t} A(n, t).$

Our approach heavily relies on the following expansion under $H_0$:

(2.7)   $\hat{T}^{\,l}_{n,t} - \hat{T}^{\,r}_{n,t} = -\Theta \bigl( \hat{\Sigma}^{\,l}_{n,t} - \hat{\Sigma}^{\,r}_{n,t} \bigr) \Theta + R_{n,t},$

where the residual term

(2.8)   $\max_{n, t} \bigl\| \operatorname{vec}(R_{n,t}) \bigr\|_{\infty}$

can be controlled under mild assumptions [23] [22] [3].

This expansion might have been used in order to investigate the asymptotic properties of the statistic and to obtain the threshold; however, we propose a data-driven scheme instead.

Remark 2.1.

A different test statistic could be defined as the maximum distance between the elements of the empirical covariance matrices computed over the left and right windows. However, the application to neuroimaging motivates the search for a structural change in a functional connectivity network, which is encoded by the structure of the corresponding precision matrix. Clearly, a change in the precision matrix also means a change in the covariance matrix, yet we believe that the definition (2.5) increases the sensitivity to this kind of alternative.

2.2 Bootstrap calibration

Our approach rejects $H_0$ in favor of $H_1$ if at least one of the statistics $A(n)$ exceeds the corresponding threshold $x(n)$, i.e. if $\exists n : A(n) > x(n)$.

In order to properly choose the thresholds, we define bootstrap statistics in the following non-standard way. Note that we cannot use an ordinary scheme with replacement or a weighted bootstrap, since in the high-dimensional case the covariance matrix of the bootstrap distribution would be singular, which would render the inverse covariance matrix estimation procedures meaningless.

First, draw with replacement a sequence of indices from the calibration sub-sample and form the corresponding bootstrap window averages. Denote the measure the bootstrap quantities are distributed with respect to as the bootstrap measure. In accordance with (2.4) define

(2.9)

and for technical purposes define

(2.10)

Now for each central point define a bootstrap counterpart $A^{\flat}(n, t)$ of the statistic (2.5):

(2.11)

which is intuitively reasonable due to the expansion (2.7). Finally, we define the bootstrap counterpart $A^{\flat}(n)$ of (2.6) as

(2.12)
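As an illustration, one bootstrap replication can be sketched as follows. We assume (our choice, not spelled out by the text) that the influence-function terms of the expansion (2.7) have been precomputed for the calibration sub-sample, one row per data point, and that the statistic is the sup-norm of the difference of window means scaled by the square root of the window size, as in (2.5).

```python
import numpy as np

def bootstrap_replication(scores, window_sizes, rng):
    """One bootstrap counterpart per window size: re-draw the rows of
    `scores` with replacement (the index sequence of Section 2.2) and
    re-compute the windowed max statistic on the re-drawn rows."""
    m = scores.shape[0]
    boot = scores[rng.integers(0, m, size=m)]   # draw indices with replacement
    out = {}
    for n in window_sizes:
        vals = [np.sqrt(n) * np.abs(boot[t - n:t].mean(axis=0)
                                    - boot[t:t + n].mean(axis=0)).max()
                for t in range(n, m - n + 1)]
        out[n] = max(vals)
    return out

# rng = np.random.default_rng(0); repeat B times to obtain {n: B values}.
```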

Now for each window size $n$ we can define quantile functions $q_n(\cdot)$ such that

(2.13)   $\mathbb{P}^{\flat}\bigl( A^{\flat}(n) > q_n(\alpha) \bigr) = \alpha.$

Next, for a given significance level $\alpha$ we apply a multiplicity correction, choosing the corrected level $\alpha'$ as

(2.14)   $\alpha' = \sup \bigl\{ \tilde{\alpha} : \mathbb{P}^{\flat}\bigl( \exists n : A^{\flat}(n) > q_n(\tilde{\alpha}) \bigr) \le \alpha \bigr\},$

and finally choose the thresholds as $x(n) = q_n(\alpha')$.
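A sketch of the calibration step under these definitions: given B bootstrap replications of the counterpart statistic for every window size, the helper below searches for the largest per-scale level whose joint (over all scales) empirical rejection probability stays within the nominal level, mimicking (2.14); the grid search is our simplification.

```python
import numpy as np

def calibrate_thresholds(bootstrap_stats, alpha):
    """bootstrap_stats: {n: array of B bootstrap values of A-flat(n)}.
    Returns {n: threshold x(n)} after the multiplicity correction (2.14)."""
    ns = sorted(bootstrap_stats)
    samples = np.stack([bootstrap_stats[n] for n in ns])   # shape (k, B)

    def joint_rejection(a):
        # empirical P(exists n : A-flat(n) > q_n(a))
        q = np.quantile(samples, 1.0 - a, axis=1, keepdims=True)
        return np.mean((samples > q).any(axis=0))

    # grid search for the largest per-scale level with joint level <= alpha;
    # the Bonferroni level alpha / k is used as a fallback
    k = len(ns)
    grid = np.linspace(alpha / k, alpha, 100)
    a_prime = max((a for a in grid if joint_rejection(a) <= alpha),
                  default=alpha / k)
    return {n: float(np.quantile(samples[i], 1.0 - a_prime))
            for i, n in enumerate(ns)}
```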

Remark 2.2.

One can choose the calibration sub-sample to be the whole given sample, i.e. use all the data for calibration as well as for detection. In fact, this would improve the bounds in Theorem 3.1 and Theorem 4.1, since it effectively enlarges the calibration set. However, in practice such a decision might lead to a reduction of sensitivity due to overestimation of the thresholds.

2.3 Change-point localization

In order to localize a change-point we have to assume that $H_1$ holds. Define the narrowest window detecting a change-point as

(2.15)

and the central point where this window detects a break for the first time as

(2.16)

By construction of the family of test statistics we conclude (up to the confidence level) that the change-point is localized in the interval

(2.17)

Clearly, if a non-multiscale version of the approach is employed, i.e. only a single window size is used, the precision of localization (the delay of detection in the online setting) equals that window size.
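Under these definitions the localization rule amounts to a few lines; a minimal sketch follows (the half-width of the interval (2.17) being equal to the size of the detecting window is our reading of the construction):

```python
import numpy as np

def localize(stats, thresholds):
    """stats: {n: array of A(n, t) over central points t = n, ..., N - n}.
    Returns (narrowest detecting window, first detecting central point,
    localization interval) as in (2.15)-(2.17), or None if no detection."""
    detecting = [n for n in sorted(stats) if (stats[n] > thresholds[n]).any()]
    if not detecting:
        return None
    n_hat = detecting[0]                                        # (2.15)
    exceed = np.flatnonzero(stats[n_hat] > thresholds[n_hat])
    t_hat = n_hat + int(exceed[0])                              # (2.16)
    return n_hat, t_hat, (t_hat - n_hat, t_hat + n_hat)         # (2.17)
```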

3 Bootstrap validity

This section states and discusses the theoretical result demonstrating the validity of the proposed bootstrap scheme, i.e.

(3.1)

Our theoretical results require the tails of the underlying distributions to be light. Specifically, we impose a sub-Gaussian vector condition.

Assumption 3.1 (Sub-Gaussian vector condition).
(3.2)

Naturally, in order to establish a theoretical result we have to assume that a method featuring theoretical guarantees was used for estimating the precision matrices. Such methods include graphical lasso [31], adaptive graphical lasso [37] and the thresholded de-sparsified estimator based on the node-wise procedure [22]. These approaches overcome the high dimensionality of the problem by imposing a sparsity assumption, specifically bounding the maximum number of non-zero elements in a row of the precision matrix. These approaches are guaranteed to yield a root-$n$ consistent estimate revealing the sparsity pattern of the precision matrix [31, 3, 22], or formally:

Definition 3.1.

Consider an i.i.d. sample with precision matrix $\Theta$. Let the dimensionality and the sparsity be allowed to grow with the sample size. A positive-definite matrix $\hat{\Theta}$ is a consistent estimator of the high-dimensional precision matrix if

(3.3)

and

(3.4)

Graphical lasso and its adaptive versions impose an assumption common for $\ell_1$-penalized approaches.

Assumption 3.2 (Irrepresentability condition).

Denote an active set

(3.5)

and define the corresponding matrix via the Kronecker product. The irrepresentability condition holds if there exists a constant such that

(3.6)

The interpretation of the irrepresentability condition under a normality assumption is given in [23] [31]. In particular, Assumption 3.2 requires low correlation between the elements of the empirical covariance matrix from the active set and those from its complement. The larger the constant is, the stricter the assumed upper bound.

These observations give rise to the two following assumptions.

Assumption 3.3.A.

Suppose either graphical lasso or its adaptive version was used with a properly chosen regularization parameter, and additionally impose Assumption 3.2.

Assumption 3.3.B.

Suppose the thresholded de-sparsified estimator based on the node-wise procedure was used with a properly chosen regularization parameter.

Now we are ready to establish a result which guarantees that the suggested bootstrap procedure yields proper thresholds.

Theorem 3.1.

Assume $H_0$ holds and, furthermore, let the sample be i.i.d. Let Assumption 3.1 and either Assumption 3.3.A or Assumption 3.3.B hold. Also assume the spectrum of the covariance matrix is bounded. Allow the parameters to grow with the sample size and impose the sparsity assumption

(3.7)

Then

(3.8)

The finite-sample version of this result, namely, Theorem B.1, is given in Appendix B along with the proofs.

Bootstrap validity result discussion

Theorem 3.1 guarantees, under mild assumptions (Assumption 3.2 seems to be the most restrictive one, yet it may be dropped if the node-wise procedure is employed), that the first-type error rate meets the nominal level if the narrowest window size and the calibration set are large enough. Clearly, the dependence on the dimensionality is logarithmic, which establishes the applicability of the approach in a high-dimensional setting. It is worth noticing that, unusually, the sparsity bound gets stricter with the sample size, but the dependence is only logarithmic. Indeed, we gain nothing from longer samples, since we use only the data points inside the windows each time.

On-line setting

As one can easily see, the theoretical result is stated in the off-line setting, when the whole sample is acquired in advance. In the on-line setting we suggest controlling the probability of raising a false alarm for at least one central point among the considered data points (which differs from the classical techniques controlling the mean distance between false alarms [32]). Having chosen the window sizes and the significance level, one should acquire the calibration data points and use the proposed bootstrap scheme in order to obtain the thresholds. Next, the approach can be naturally applied in the on-line setting, and Theorem 3.1 guarantees the capability of the proposed bootstrap scheme to control the aforementioned false alarm probability. A sketch of the online phase follows.
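This is a minimal sketch assuming pre-calibrated thresholds; `estimate_param` and `dist` are the same placeholder callables as in the earlier sketch (the de-sparsified estimator and the standardized sup-norm in the proposed approach).

```python
import numpy as np

def online_monitor(stream, window_sizes, estimate_param, dist, thresholds):
    """Feed points one by one; after each new point compute the statistic of
    the most recent admissible central point for every scale and raise an
    alarm as soon as one of them exceeds its (pre-calibrated) threshold."""
    buf = []
    n_max = max(window_sizes)
    for i, x in enumerate(stream):
        buf.append(x)
        if len(buf) > 2 * n_max:
            buf.pop(0)                       # keep what the widest window needs
        for n in window_sizes:
            if len(buf) < 2 * n:
                continue
            seg = np.asarray(buf[-2 * n:])
            a = dist(estimate_param(seg[:n]), estimate_param(seg[n:]))
            if a > thresholds[n]:
                return i, n                  # alarm: detection step and scale
    return None
```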

Proofs

The proof of the bootstrap validity result, presented in Appendix B, mostly relies on the high-dimensional central limit theorems obtained in [9], [8]. These papers also present bootstrap justification results, yet do not include a comprehensive bootstrap validity result. The theoretical treatment is complicated by the randomness of the covariance matrix of the bootstrap distribution. We overcome it by applying the so-called “sandwiching” proof technique (see Lemma C.1), initially used in [33].

4 Sensitivity result

Consider the following setting. Let there be an index $\tau$ such that the observations before it are i.i.d. and the observations after it are i.i.d. as well. Denote the corresponding precision matrices as $\Theta_1$ and $\Theta_2$. Define the break extent as

(4.1)

The question is how large the window size should be in order to reliably reject $H_0$, and how precisely we can localize the change-point.

Theorem 4.1.

Let Assumption 3.1 and either Assumption 3.3.A or Assumption 3.3.B hold. Also assume the spectra of $\Theta_1$ and $\Theta_2$ are bounded. Allow the parameters to grow with the sample size and let the break extent decay with it. Further let

(4.2)

and

(4.3)

Then $H_0$ will be rejected with probability approaching $1$.

This result is a direct corollary of the finite-sample sensitivity result established and discussed in Appendix A.

This assumption is only technical; the result may be proven without relying on it by methodologically the same argument.

Sensitivity result discussion

Assumptions (4.2) and (4.3) are essentially a sparsity bound and a bound for the largest window size. Clearly, they do not yield a particular window size necessary to detect a break, since it depends on the underlying distributions; however, the result includes the dimensionality only under the sign of the logarithm, which guarantees high sensitivity of the test in a high-dimensional setting.

Online setting

Theorem 4.1 is established in the offline setting as well. In the online setting it guarantees that the proposed approach can reliably detect a break of at least the given extent with a delay bounded by (4.3).

Change-point localization guarantees

Theorem 4.1 implies, by construction of the statistic, that the change-point can be localized with precision up to the size of the detecting window. Hence condition (4.3) provides the bound for the change-point localization accuracy.

5 Simulation study

5.1 Design

In our simulation we test the hypothesis of no break in the covariance structure versus an alternative with a single break in the covariance matrix.

The alternative covariance matrix was generated in the following way: it is composed as a block-diagonal matrix of small blocks with ones on their diagonals and their off-diagonal elements drawn uniformly at random, padded with an identity block up to the full dimensionality. The window sizes considered are 70, 100 and 140 (see Table 1). The absence of a positive effect of a larger sample size is discussed in Sections 3 and 4. Moreover, in all the simulations under the alternative the sample was generated with the change-point in the middle, but the algorithm was oblivious of this, as well as of either of the covariance matrices. A fixed significance level was chosen. In all the experiments graphical lasso (with fixed penalization parameters) was used in order to obtain the precision matrix estimates used by the statistic and by the calibration procedure. A sketch of the generator is given below.
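In this sketch the number of blocks, the block size and the sampling range of the off-diagonal entries are placeholders, since the exact values are elided in the text above.

```python
import numpy as np

def alternative_covariance(p, n_blocks, block_size, lo=0.4, hi=0.8, seed=0):
    """Block-diagonal alternative: correlated blocks with unit diagonal and a
    common off-diagonal entry drawn uniformly from [lo, hi], padded with an
    identity block up to dimension p."""
    rng = np.random.default_rng(seed)
    sigma = np.eye(p)
    for b in range(n_blocks):
        i = b * block_size
        rho = rng.uniform(lo, hi)
        block = np.full((block_size, block_size), rho)
        np.fill_diagonal(block, 1.0)
        sigma[i:i + block_size, i:i + block_size] = block
    return sigma

# Sample with the change-point in the middle of an N-point sample (the
# pre-break covariance is taken to be the identity -- an assumption of
# this sketch):
# rng = np.random.default_rng(1)
# X = np.vstack([rng.standard_normal((N // 2, p)),
#                rng.multivariate_normal(np.zeros(p),
#                                        alternative_covariance(p, 3, 5),
#                                        N - N // 2)])
```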

We have also come up with an approach to the same problem not involving bootstrap. The paper [26] defines a high-dimensional two-sample test for the equality of covariance matrices. Moreover, the authors prove asymptotic normality of their statistic, which makes computing p-values possible. We suggest running this test for every window size and every central point, adjusting the obtained p-values using the Holm method [19] and eventually comparing them against the significance level, as sketched below.
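A sketch of this baseline, assuming the per-(window, central point) p-values have already been computed by the two-sample test of [26]; the Holm step uses statsmodels.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def holm_baseline(pvalues, alpha=0.05):
    """Reject H0 if any Holm-adjusted p-value over all (window size,
    central point) pairs falls below the significance level alpha."""
    reject, p_adjusted, _, _ = multipletests(np.ravel(pvalues),
                                             alpha=alpha, method='holm')
    return bool(reject.any()), p_adjusted
```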

The paper [27] suggests an approach based on comparing characteristic functions of random variables. The critical values were chosen by a permutation test, as proposed by the authors. In our experiments the method was allowed to consider the whole sample at once. The R package ecp [21] was used.

The first type error rate and power of our approach are reported in Table 1. As one can see, our approach properly controls the first type error rate. As expected, its power is higher for larger windows, and it decreases when narrower windows are added into consideration, which is the price to be paid for better localization of a change-point.

In our study the approach proposed in [27] and the one based on the two-sample test [26] turned out to be conservative, but neither of them exhibited noteworthy power.

Also, in order to justify the application of the multiscale approach (i.e. the use of several window sizes at once) for the sake of better change-point localization, we report the distribution of the narrowest detecting window (defined by (2.15)) in Figure 2. Table 1 also reports the average precision of change-point localization for various choices of the set of window sizes. One can see that the multiscale approach significantly improves the precision of localization.

Window sizes      I type error rate   Power   Localization precision
{70}                     0.02          0.09            70
{100}                    0.00          0.37           100
{140}                    0.01          0.81           140
{70, 140}                0.01          0.76           135
{100, 140}               0.01          0.75           124
{70, 100, 140}           0.01          0.74           123
Table 1: First type error rate, power and precision of change-point localization of the proposed approach for various sets of window sizes
Figure 2: Pie charts representing the distribution of the narrowest detecting window and the precision of localization for the multiscale sets of window sizes considered

Appendix A Proof of sensitivity result

Proof of Theorem 4.1.

The proof consists in applying the finite-sample Theorem A.1. Its applicability is guaranteed by the consistency results given in [31, 3, 22] and by the results from [23, 22, 3] bounding the residual term. The high probability of the required set is ensured by Lemma G.1. ∎

Theorem A.1.

Let a symmetric estimator of the precision matrix be given such that for some constant it holds that

(A.1)

Suppose Assumption 3.1 holds and there exists a constant such that, for all windows and central points, on some set

(A.2)

Moreover, let the residual defined in Lemma F.2 be bounded:

(A.3)

Also let

(A.4)

where

(A.5)

and the constant is defined in Lemma G.2. Then on this set, with probability at least

(A.6)

where the constant is defined in Lemma G.2, $H_0$ will be rejected.

Discussion of finite-sample sensitivity result

The assumption (A.4) is rather complicated. Here we note that if either graphical lasso [31], adaptive graphical lasso [37] or the thresholded de-sparsified estimator based on the node-wise procedure [22] was used with a properly chosen penalization parameter, it boils down to

(A.7)

for some positive constant independent of the parameters, while the remaining quantities may be chosen as in (B.12), (B.11), (B.10) (the high probability of the required set is ensured by Lemma G.1). At the same time the remainder can be bounded by (B.6).

As expected, the bound for the sufficient window size decreases with the growth of the break extent and the size of the calibration set, but increases with the dimensionality. It is worth noticing that the latter dependence is only logarithmic. And again, in the same way as with Theorem 3.1, the bound increases with the sample size (only logarithmically), since we use only the data points inside the windows.

Proof of Theorem A.1.

Consider a pair of centered normal vectors

(A.8)
(A.9)
(A.10)
(A.11)

where the corresponding vectors are defined in the proofs of Lemma E.2 and Lemma F.1 respectively. Lemma A.2 applies here and yields for all positive

(A.12)

where the latter denotes the number of central points for a window of the given size. Applying Lemma G.2 on a set of high probability yields the corresponding bound, and hence, due to the fact that by construction,

(A.13)

Due to Lemma F.2 and the continuity of the Gaussian c.d.f.,

(A.14)

and due to Lemma F.2, choosing the constant as proposed by equation (A.5), we ensure the required bound.

Now, by the assumption of the theorem and by construction of the test statistics,

(A.15)

Finally, we notice that, due to assumption (A.4), the statistic exceeds the threshold and therefore $H_0$ will be rejected. ∎

Lemma A.1.

Consider a centered Gaussian random vector $\eta \in \mathbb{R}^d$ with an arbitrary covariance matrix and let $\sigma^2 := \max_{i \le d} \operatorname{Var} \eta_i$. For any positive $x$ it holds that

(A.16)   $\mathbb{P}\Bigl( \max_{i \le d} \eta_i \ge x \Bigr) \le d \exp\Bigl( -\frac{x^2}{2 \sigma^2} \Bigr).$
Proof.

By convexity we obtain the following chain of inequalities for any $\lambda > 0$:

(A.17)   $\exp\Bigl( \lambda\, \mathbb{E} \max_{i \le d} \eta_i \Bigr) \le \mathbb{E} \exp\Bigl( \lambda \max_{i \le d} \eta_i \Bigr) = \mathbb{E} \max_{i \le d} \exp( \lambda \eta_i ) \le \sum_{i=1}^{d} \mathbb{E} \exp( \lambda \eta_i ) \le d \exp\Bigl( \frac{\lambda^2 \sigma^2}{2} \Bigr).$

The Chernoff bound yields for any $\lambda > 0$

(A.18)   $\mathbb{P}\Bigl( \max_{i \le d} \eta_i \ge x \Bigr) \le \exp( -\lambda x )\, \mathbb{E} \exp\Bigl( \lambda \max_{i \le d} \eta_i \Bigr) \le d \exp\Bigl( \frac{\lambda^2 \sigma^2}{2} - \lambda x \Bigr).$

Finally, optimization over $\lambda$ (at $\lambda = x / \sigma^2$) yields the claim. ∎

As a trivial corollary, one obtains

Lemma A.2.

Consider a centered Gaussian random vector $\eta \in \mathbb{R}^d$ with an arbitrary covariance matrix and let $\sigma^2 := \max_{i \le d} \operatorname{Var} \eta_i$. For any positive $x$ it holds that

(A.19)   $\mathbb{P}\bigl( \| \eta \|_{\infty} \ge x \bigr) \le 2 d \exp\Bigl( -\frac{x^2}{2 \sigma^2} \Bigr).$

Appendix B Proof of bootstrap validity result

Proof of Theorem 3.1.

The proof consists in applying the finite-sample Theorem B.1. Its applicability is guaranteed by the consistency results given in [31, 3, 22] and by the results from [23, 22, 3] bounding the residual term. The high probability of the required set is ensured by Lemma G.1. ∎

Theorem B.1.

Assume $H_0$ holds and, furthermore, let the sample be i.i.d. Let a symmetric estimator of the precision matrix be given such that for some positive constant

(B.1)

Suppose Assumption 3.1 holds and there exists a constant such that, for all windows and central points, on some set

(B.2)

Moreover, let

(B.3)

where the remainders are defined in Lemma E.1, Lemma F.2 and Lemma C.1 respectively, and the mis-tie involved in the definition comes from Lemma G.2. Then on this set it holds that

(B.4)

where

(B.5)

and the remaining terms are defined in Lemma G.2, Lemma G.1 and Lemma F.2 respectively.

Discussion of finite-sample bootstrap validity result

The terms involved in the statement of Theorem B.1 are rather complicated. The exact expressions for them are provided by Lemma G.2, Lemma E.1, Lemma F.2 and Lemma C.1 respectively; the 3rd and 4th moments involved therein are bounded by Lemma E.4 and Lemma G.3, while asymptotic bounds for the estimation error are provided in [22] (for the node-wise procedure) and [23] (for graphical lasso). For the case of graphical lasso an explicit form of the bound is given in [3].

Here we just note that if a root-$n$ consistent estimator recovering the sparsity pattern is used (graphical lasso [31], adaptive graphical lasso [37] or the thresholded de-sparsified estimator based on the node-wise procedure [22]), then, given the spectrum of the covariance matrix is bounded,

(B.6)

If either graphical lasso, adaptive graphical lasso or the node-wise procedure [28] is used in order to obtain the precision matrix estimate, then on the same set it holds that

(B.7)

The high probability of this set may be ensured by means of Lemma G.1. Further,

(B.8)
(B.9)

Here the constants are positive and independent of the model parameters. We also note that the proper choice of the terms in (B.5) is

(B.10)
(B.11)
(B.12)

which ensures that the probability defined by (B.5) stays above the nominal level. For the exact expressions see Lemma G.2, Lemma G.1 and Lemma F.2.

Proof of Theorem B.1.

The proof consists in the application of Lemma F.1, Lemma E.2 and Lemma D.1, which justify the applicability of Lemma C.1. ∎

Appendix C Sandwiching lemma

Lemma C.1.

Consider a normal multivariate vector with a deterministic covariance matrix and a normal multivariate vector with a possibly random covariance matrix such that

(C.1)
(C.2)
(C.3)

where and are sub-vectors of and respectively. Then

(C.4)
Proof.

Let us introduce some notation. Denote the multivariate cumulative distribution functions of the two vectors respectively. Define the following sets for all thresholds:

(C.5)
(C.6)

and their boundaries

(C.7)
(C.8)

Consider the corresponding index sets and define a set of thresholds satisfying the confidence level

(C.9)

here and below, comparison of vectors should be understood element-wise. Notice that due to the continuity of the multivariate normal distribution and assumption (C.2),

(C.10)

Now for all and for all it holds that

(C.11)

where we have successively used (C.3), (C.1), (C.7) and (C.10). In the same way one obtains

(C.12)

which implies the desired relation.

Now denote the quantile functions of the first vector as follows:

(C.13)

In exactly the same way define the quantile functions of the second vector. Clearly, for all levels,

(C.14)

and hence

(C.15)
(C.16)

Using a Taylor expansion with the Lagrange remainder term we obtain, for some intermediate point,