Covariance and precision matrix estimation for high-dimensional time series

# Covariance and precision matrix estimation for high-dimensional time series

[ [    [ [    [ [ University of Illinois at Urbana-Champaign, University of Chicago and University of Chicago X. Chen
Department of Statistics
University of Illinois at Urbana-Champaign
725 S. Wright Street
Champaign, Illinois 61820
USA
M. Xu
W. B. Wu
Department of Statistics
University of Chicago
5734 S. University Avenue
Chicago, Illinois 60637
USA
\smonth4 \syear2013\smonth10 \syear2013
\smonth4 \syear2013\smonth10 \syear2013
\smonth4 \syear2013\smonth10 \syear2013
###### Abstract

We consider estimation of covariance matrices and their inverses (a.k.a. precision matrices) for high-dimensional stationary and locally stationary time series. In the latter case the covariance matrices evolve smoothly in time, thus forming a covariance matrix function. Using the functional dependence measure of Wu [Proc. Natl. Acad. Sci. USA 102 (2005) 14150–14154 (electronic)], we obtain the rate of convergence for the thresholded estimate and illustrate how the dependence affects the rate of convergence. Asymptotic properties are also obtained for the precision matrix estimate which is based on the graphical Lasso principle. Our theory substantially generalizes earlier ones by allowing dependence, by allowing nonstationarity and by relaxing the associated moment conditions.

[
\kwd
\doi

10.1214/13-AOS1182 \volume41 \issue6 2013 \firstpage2994 \lastpage3021 \newproclaimdefnDefinition[section] \newproclaimconjConjecture[section] \newproclaimexmpExample[section] \newproclaimrmkRemark \newproclaimnoteNote \newproclaimcaseCase

\runtitle

High-dimensional covariance estimation for time series

{aug}

a]\fnmsXiaohui \snmChen\correflabel=e1]xhchen@illinois.edu, b]\fnmsMengyu \snmXulabel=e2]mengyu@galton.uchicago.edu and b]\fnmsWei Biao \snmWulabel=e3]wbwu@galton.uchicago.edu

class=AMS] \kwd[Primary ]62H12 \kwd[; secondary ]62M10 High-dimensional inference \kwdsparsity \kwdcovariance matrix \kwdprecision matrix \kwdthresholding \kwdLasso \kwddependence \kwdfunctional dependence measure \kwdconsistency \kwdNagaev inequality \kwdnonstationary time series \kwdspatial–temporal processes

## 1 Introduction

Estimation of covariance matrices and their inverses (a.k.a. precision matrices) is of fundamental importance in almost every aspect of statistics, ranging from the principal component analysis [Johnstone and Lu (2009)], graphical modeling [Meinshausen and Bühlmann (2006), Ravikumar et al. (2011), Yuan (2010)], classification based on the linear or quadratic discriminant analysis [Bickel and Levina (2004)], and real-world applications such as portfolio selection [Ledoit and Wolf (2003), Talih (2003)] and wireless communication [Guerci (1999), Ward (1994), Li, Stocia and Wang (2003), Abrahamsson, Selen and Stoica (2007)]. Suppose we have temporally observed -dimensional vectors , with having mean zero and covariance matrix whose dimension is . Our goal is to estimate the covariance matrices and their inverses based on the data matrix . In the classical situation where is fixed, and are mean zero independent and identically distributed (i.i.d.) random vectors, it is well known that the sample covariance matrix

 ^Σn=n−1n∑i=1ziz⊤i (1)

is a consistent and well behaved estimator of , and is a natural and good estimator of . See Anderson (1958) for a detailed account. However, when the dimensionality grows with , random matrix theory asserts that is no longer a consistent estimate of in the sense that its eigenvalues do not converge to those of ; see, for example, the Marčenko–Pastur law [Marčenko and Pastur (1967)] or the Tracy–Widom law [Johnstone (2001)]. Moreover, it is clear that is not defined when is not invertible in the high-dimensional case with .

During the last decade, various special cases of the above covariance matrix estimation problem have been studied. In most of the previous papers it is assumed that the vectors are i.i.d. and thus the covariance matrix is time-invariant. See, for example, Bickel and Levina (2008a, 2008b), Cai, Zhang and Zhou (2010), Cai and Zhou (2012, 2013), where consistency and rates of convergence are established for various regularized (banded, tapered or thresholded) estimates of covariance matrices and their inverses. As an alternative regularized estimate for sparse precision matrix, one can adopt the Lasso-type entry-wise 1-norm penalized likelihood approach; see Rothman et al. (2008), Friedman, Hastie and Tibshirani (2008), Banerjee, El Ghaoui and d’Aspremont (2008), Ravikumar et al. (2011), Fan, Feng and Wu (2009). Other estimates include the Cholesky decomposition based method [Wu and Pourahmadi (2003), Huang et al. (2006)], neighborhood selection for sparse graphical models [Liu and Luo (2012), Yuan (2010), Meinshausen and Bühlmann (2006)], regularized likelihood approach [Lam and Fan (2009), Fan, Feng and Wu (2009)] and the sparse matrix transform [Cao, Bachega and Bouman (2011)]. Xiao and Wu (2012) considered covariance matrix estimation for univariate stationary processes.

The assumption that are i.i.d. is quite restrictive for situations that involve temporally observed data. In Zhou, Lafferty and Wasserman (2010) and Kolar and Xing (2011) the authors considered time-varying Gaussian graphical models where the sampling distribution can change smoothly over time. However, they assume that the underlying random vectors are independent. Using nonparametric smoothing techniques, they estimate the time-vary covariance matrices in terms of covariance matrix functions. Their asymptotic theory critically depends on the independence assumption.

The importance of estimating covariance matrices for dependent and nonstationary processes has been increasingly seen across a wide variety of research areas. In modeling spatial–temporal data, Wikle and Hooten (2010) proposed quadratic nonlinear dynamic models to accommodate the interactions between the processes which are useful for characterizing dynamic processes in geophysics [Kondrashov et al. (2005)]. Zheng, Chen and Blasch (2007) considered non-Gaussian clutter and noise processes in space–time adaptive processing, where the space–time covariance matrix is important for detecting airborne moving targets in the nonstationary clutter environment [Ward (1994), Guerci (1999)]. In finance, Jacquier, Polson and Rossi (2004) considered multivariate stochastic volatility models parametrized by time-varying covariance matrices with heavy tails and correlated errors. Talih (2003) investigated the Markowitz portfolio selection problem for optimal returns of a large number of stocks with hidden and heterogeneous Gaussian graphical model structures. In essence, those real-world problems pose a number of challenges: (i) nonlinear dynamics of data generating systems, (ii) temporally dependent and nonstationary observations, (iii) high-dimensionality of the parameter space and (iv) non-Gaussian distributions. Therefore, the combination of more flexible nonlinear and nonstationary components in the models and regularized covariance matrix estimation are essential to perform related statistical inference.

In contrast to the longstanding progresses and extensive research that have been made in terms of heuristics and methodology, theoretical work on estimation of covariance matrices based on high-dimensional time series data is largely untouched. In this paper we shall substantially relax the i.i.d. assumption by establishing an asymptotic theory that can have a wide range of applicability. We shall deal with the estimation of covariance and precision matrices for high-dimensional stationary processes in Sections 2 and 3, respectively. Section 2 provides a rate of convergence for the thresholded estimator, and Section 3 concerns the graphical Lasso estimator for precision matrices. For locally stationary processes, an important class of nonstationary processes, we shall study in Section 4 the estimation of time-varying covariance and precision matrices. This generalization allows us to consider time-varying covariance and precision matrix estimation under temporal dependence; hence our results significantly extend previous ones by Zhou, Lafferty and Wasserman (2010) and Kolar and Xing (2011). Furthermore, by assuming a mild moment condition on the underlying processes, we can relax the multivariate Gaussian assumption that was imposed in Zhou, Lafferty and Wasserman (2010) and Kolar and Xing (2011) [and also by Bickel and Levina (2008a, 2008b) in the i.i.d. setting]. Specifically, we shall show that, thresholding on the kernel smoothed sample covariance matrices, estimators based on the localized graphical Lasso procedure are consistent estimators for time-varying covariance and precision matrices.

To deal with temporal dependence, we shall use the functional dependence measure of Wu (2005). With the latter, we are able to obtain explicit rates of convergence for the thresholded covariance matrix estimates and illustrate how the dependence affects the rates. In particular, we show that, based on the moment condition of the underlying process, there exists a threshold value. If the dependence of the process does not exceed that threshold, then the rates of convergence will be the same as those obtained under independence. On the other hand, if the dependence is stronger, then the rates of convergence will depend on the dependence. This phase transition phenomenon is of independent interest.

We now introduce some notation. We shall use to denote positive constants whose values may differ from place to place. Those constants are independent of the sample size and the dimension . For some quantities and , which may depend on and , we write if holds for some constant that is independent of and and if there exists a constant such that . We use and . For a vector , we write and for a matrix , , , and . For a random vector , write , , if .

## 2 Covariance matrix estimation for high-dimensional stationary processes

In this section we shall assume that is a -dimensional stationary process of the form

 zi=g(Fi), (2)

where is an -valued measurable function, is a shift process and are i.i.d. random vectors. Following Wu (2005), we can view and as the input and the output of a physical system, respectively, and is the transform representing the underlying physical mechanism. The framework (2) is quite general. Some examples are presented in Wu (2011). It can also be conveniently extended to locally stationary processes; see Section 4.

Write and , the data matrix observed at time points . Here we shall consider estimation of the covariance matrix based on the realization , while Section 3 concerns estimation of its inverse. We consider Frobenius and spectral norm convergence of the thresholded estimator

 Tu(^Σn)=(^σjkI(|^σjk|≥u))1≤j,k≤p, (3)

where is the sample covariance matrix defined in (1); see Bickel and Levina (2008a). It was shown in the latter paper that, with a properly chosen , is a consistent estimator when [see (45)] and are i.i.d. sub-Gaussian. Our rates of convergence depend on the dependence of the process and the moment conditions, which can be quite mild. Our main theoretical result is given in Section 2.1. To obtain a consistent estimate for , we need to impose regularization conditions. In particular, we shall assume that is weakly dependent in that most of its entries are small, by providing a bound on the tail empirical process of covariances. Some examples are provided in Section 2.3 with applications to spatial–temporal processes.

### 2.1 Asymptotic results

To establish a convergence theory for covariance matrix estimates, we shall use the functional dependence measure of Wu (2005). Recall that , , where is the th coordinate projection of the -valued measurable function . For , the functional dependence measure of is defined by

 θi,w,j=∥∥Zji−Z′ji∥∥w=(E∣∣Zji−Z′ji∣∣w)1/w, (4)

where , and is such that , , are i.i.d. In other words, is a coupled version of with in the latter replaced by an i.i.d. copy . In Wu (2011) functional dependence measures were computed for some commonly used linear and nonlinear stationary processes. We shall assume that the short-range dependence (SRD) condition holds,

 Θm,w=max1≤j≤p∞∑l=mθl,w,j<∞. (5)

If (5) fails, the process may exhibit long-range dependence, and the asymptotic behavior can be quite different. A nonlinear process satisfying (5) is given in Example 2.1, while Example 2.1 concerns linear processes. Theorems 2.1 and 2.3 provide rates of convergence under the normalized Frobenius norm and the spectral norm for the thresholded estimate , respectively. The constants  therein are independent of , and .

###### Theorem 2.1

Assume that there exist , , and a positive constant such that and for all . Let and . Define

and

 D(u)=1p2p∑j,k=1(u2∧σ2jk). (8)

Then there exists a constant , independent of , and , such that

 E|Tu(^Σn)−Σ|2Fp2≲D(u)+min(1n,u2−qnq/2,H(u)+G(Cu)). (9)
{rmk}

If , elementary calculations indicate that. Hence the right-hand side of (9) is . The term is needed if .

By Theorem 2.1, if , then . Better convergence rates can be achieved if by choosing a larger threshold; see cases (i)–(iii) in Corollary 2.2 below.

###### Corollary 2.2

Assume that the conditions of Theorem 2.1 hold. Let ; let if and if .

Let be the unique solution to the equation . (i) If , then there is a fixed constant such that for all . (ii) If and , let solve , then . (iii) If , and , let be the solution to the equation over the interval , then . (iv) If , then the right-hand side of (9) is for all and .

Theorem 2.1 and Corollary 2.2 describe how the Frobenius rate of convergence depends on the sample size , the dimension , the smallness measure quantified by the function and the heaviness of tails (moment conditions) and strength of dependence which are characterized by and , respectively. It suggests the interesting dichotomy phenomenon: under the weaker dependence condition , the thresholded estimate has the same convergence rates as those obtained under independence. However, the convergence becomes slower under stronger temporal dependence with . The phase transition occurring at . The theorem also provides information about the optimal threshold , as revealed in its proof. The optimal threshold balances the bias or the smallness function , the tail function and the variance component which roughly corresponds to the Gaussian-type function . Under different conditions, the optimal threshold assumes different forms; see Corollaries 2.4 and 2.5.

{pf*}

Proof of Theorem 2.1 We first assume . Note that

 E∣∣Tu(^Σn)−Σ∣∣2F = p∑j,k=1E[^σjkI(|^σjk|≥u)−σjk]2 ≤ 2p∑j,k=1E(W2jk)+2B(u/2),

where and

 B(u)=p∑j,k=1σ2jkI(|σjk|

Let events , and , . Observe that

 Wjk=WjkI(A1jk)+WjkI(A2jk)+WjkI(A3jk).

We shall consider these three terms separately. Write .

Case I: on the event , since the functional dependence measure for the product process , , satisfies

 ∥∥ZjiZki−Z′jiZ′ki∥∥q ≤ ≤ μ(θi,2q,j+θi,2q,k),

it follows from the moment inequality Theorem 2.1 in Wu (2007) that

 ∥ξjk∥q≤cqn−1/2μΘ0,2q, (13)

where is a constant only depending on . Let . Then

 E{W2jkI(A1jk)}≤Eξ2jkI(|σjk|≥u/2)≤C1I(|σjk|≥u/2)n. (14)

Case II: on the event , we observe that

 E{W2jkI(A2jk)} = E[σ2jkI(|σjk|≥u/2,|^σjk|

Case III: on the event , let

 Δjk = E[ξ2jkI(|^σjk|≥u,|σjk|u/2)] ≤ E[ξ2jkI(|ξjk|>u/2)].

Then

 E{W2jkI(A3jk)} = E[^σ2jkI(|^σjk|≥u,|σjk|

Since the functional dependence measure for the product process satisfies (2.1), under the decay condition , , we have by Theorem 2(ii) in Liu, Xiao and Wu (2013) that

 P(|ξjk|>v)≤C2n(nv)q+C3e−C4nv2 (18)

holds for all . Using integration by parts, we obtain

 = v2P(|ξjk|>v)+∫∞v2P(|ξjk|>√w)dw ≤ v2[C2n(nv)q+C3e−C4nv2] +∫∞v2[C2n(n√w)q+C3e−C4nw]dw = C5n1−qv2−q+C3((C4n)−1+v2)e−C4nv2,

where . By (13), we also have

 E[ξ2jkI(|ξjk|>v)]≤min(∥ξjk∥22,∥ξjk∥qqvq−2)≲min(1n,v2−qnq/2). (20)

Combining cases I, II and III, by (11) and (14)–(20), we have

 E|Tu(^Σn)−Σ|2Fp2 ≲ B(u/2)p2+1+nu2np2p∑j,k=1I(|σjk|≥u/2) +min(1n,u2−qnq/2,H(u)+G(Cu))=:M0(u),

where , and the constant of is independent of , and . If , then (9) clearly follows from the inequality . If , we also have (9) since in this case and the right-hand side of (9) has the same order of magnitude .

The other cases with and can be similarly handled. The key difference is that, instead of (18), we shall now use the following versions of Nagaev inequalities which can allow stronger dependence:

{pf*}

Proof of Corollary 2.2 Let be the term on the right-hand side of (9). We now minimize over . Let

 M2(v)=D(v)+min(1n,v2−qnq/2,max(H(v),G(v))).

Then . Clearly, . Let . If , then for some constant , we have . Also we have . Hence

 infv≥n−1/2M2(v)≍infv≥n−1/2max[D(v),H(v),G(v)]. (22)

Note that the equation has a unique solution on , and the function is decreasing over . A plot of the function in (22) is given in Figure 2(a). Let be the minimizer of the right-hand side of (22). For (i), assume for some . Then satisfies , which implies , and hence (i) follows. Note that (ii) follows in view of and . Similarly we have (iii) since . The last case (iv) is straightforward since for all .

If , assume , and then (22) still holds with therein replaced by . A plot for this case is given in Figure 2(b). Note that if . Then we can similarly have (i)–(iv).

{rmk}

From the proof of Corollary 2.2, if , in case (iii), we can actually have the following dichotomy: let be the solution to the equation . Then the minimizer if and if . For , (22) indicates that is not needed; see also Remark 2.1.

Using the argument for Theorem 2.1, we can similarly establish a spectral norm convergence rate. Bickel and Levina (2008a) considered the special setting with i.i.d. vectors. Our Theorem 2.3 is a significant improvement by relaxing the independence assumption, by obtaining a sharper rate and by presenting a moment bound. As in Theorem 2.1, we also have the phase transition at . Note that Bickel and Levina (2008a) only provides a probabilistic bound.

###### Theorem 2.3

Let the moment and the dependence conditions in Theorem 2.1 be satisfied. Let and , for , and , respectively. Define

 (23)

and . Then there exists a constant , independent of , and , such that

 ≲ D∗(u)+M∗(u/2)

where and are given in (6) and (7), respectively.

{pf}

We shall only deal with the weaker dependent case with . The other cases similarly follow. Recall the proof of Theorem 2.1 for , and . Let matrices . Similar to (11), let . Then

 ∣∣ρ(Tu(^Σn)−Σ)∣∣≤B∗(u/2)+3∑l=1∣∣ρ(Vl)∣∣. (25)

Let and , where is a large constant. Since , by (18),

 ∥ρ(V1)∥222 ≤ ∫∞0zP(Q≥z)dz ≲ z2u2+∫∞zuzpSu[n(nz/Su)q+e−C4nz2S−2u]dz ≲ M2∗(u/2),

where . Similar to (2.1), since on ,

 ∣∣ρ(V2)∣∣≤Q+uSu≤Q+2D∗(u). (27)

Using the idea of (2.1), we have

 ρ2(V3) ≤ ≤ 2∑j,kξ2jkI(|ξjk|>u/2)+2B2∗(u/2).

By (2.1)–(20) and (25)–(2.1), we have (2.3) since .

The bounds in Theorems 2.1 and 2.3 depend on the smallness measures, the moment order , the dependence parameter , the dimension and the sample size . The problem of selecting optimal thresholds is highly nontrivial. Our numeric experiments show that the cross-validation based method has a reasonably good performance. However, we are unable to provide a theoretical justification of the latter method, and pose it as an open problem.

{exmp}

[(Stationary Markov chains)] We consider the nonlinear process  defined by the iterated random function

 zi=g(zi−1,ei), (29)

where ’s are i.i.d. innovations, and is an -valued and jointly measurable function, which satisfies the following two conditions: (i) there exists some such that and (ii)

 L=supx≠x′∥g(x,e0)−g(x′,e0)∥2q|x−x′|<1. (30)

Then, it can be shown that defined in (29) has a stationary ergodic distribution and, in addition, has the geometric moment contraction (GMC) property; see Wu and Shao (2004) for details. Therefore, we have and Theorems 2.1 and 2.3 with and can be applied.

{exmp}

[(Stationary linear processes)] An important special class of (2) is the vector linear process

 zi=∞∑m=0Amei−m, (31)

where , are matrices, and are i.i.d. mean zero random vectors with finite covariance matrix . Then exists almost surely with covariance matrix if the latter converges. Assume that the innovation vector , where are i.i.d. with mean zero, variance and , , and the coefficient matrices satisfy , . By Rosenthal’s inequality, the functional dependence measure , and hence by (5) . By Theorem 2.1, the normalized Frobenius norm of the thresholded estimator has a convergence rate established in (9) with , and . Note that our moment condition relaxes the commonly assumed sub-Gaussian condition in previous literature [Rothman et al. (2008), Lam and Fan (2009), Zhou, Lafferty and Wasserman (2010)]. For the vector AR(1) process , where is a real matrix with spectral norm , it is of form (31) with , and the functional dependence measure . The rates of convergence established in (9) hold with and .

### 2.2 Positive-definitization

The thresholded estimate may not be positive definite. Here we shall propose a simple modification that is positive definite and has the same rate of convergence. Let be its eigen-decomposition, where is an orthonormal matrix and is a diagonal matrix. For , consider

 ~Sv=p∑j=1(^λj∨v)qjq⊤j, (32)

where and is the rate of convergence in (9). Let be the diagonal elements of . Then we have by Theorem 2.1 that , and consequently

 |~Sv−Σ|2F ≤ 2∣∣~Sv−Tu(^Σn)∣∣2F+2∣∣Tu(^Σn)−Σ∣∣2F ≤ 2p∑j=1(^λj−(^λj∨v))2+2