Adaptive covariance matrix estimation through block thresholding
Abstract
Estimation of large covariance matrices has drawn considerable recent attention, and the theoretical focus so far has mainly been on developing a minimax theory over a fixed parameter space. In this paper, we consider adaptive covariance matrix estimation where the goal is to construct a single procedure which is minimax rate optimal simultaneously over each parameter space in a large collection. A fully datadriven block thresholding estimator is proposed. The estimator is constructed by carefully dividing the sample covariance matrix into blocks and then simultaneously estimating the entries in a block by thresholding. The estimator is shown to be optimally rate adaptive over a wide range of bandable covariance matrices. A simulation study is carried out and shows that the block thresholding estimator performs well numerically. Some of the technical tools developed in this paper can also be of independent interest.
10.1214/12AOS999 \volume40 \issue4 2012 \firstpage2014 \lastpage2042 \newproclaimdefinitionDefinition \newproclaimremarkRemark \newproclaimassumptionAssumption
Adaptive covariance matrix estimation
a]\fnmsT. Tony \snmCai\thanksreft1label=e1]tcai@wharton.upenn.edu and b]\fnmsMing \snmYuan\corref\thanksreft2label=e2]myuan@isye.gatech.edu \thankstextt1Supported in part by NSF FRG Grant DMS0854973. \thankstextt2Supported in part by NSF Career Award DMS0846234.
class=AMS] \kwd[Primary ]62H12 \kwd[; secondary ]62F12 \kwd62G09 Adaptive estimation \kwdblock thresholding \kwdcovariance matrix \kwdFrobenius norm \kwdminimax estimation \kwdoptimal rate of convergence \kwdspectral norm
1 Introduction
Covariance matrix estimation is of fundamental importance in multivariate analysis. Driven by a wide range of applications in science and engineering, the highdimensional setting, where the dimension can be much larger than the sample size , is of particular current interest. In such a setting, conventional methods and results based on fixed and large are no longer applicable, and in particular, the commonly used sample covariance matrix and normal maximum likelihood estimate perform poorly.
A number of regularization methods, including banding, tapering, thresholding and minimization, have been developed in recent years for estimating a large covariance matrix or its inverse. See, for example, Ledoit and Wolf (2004), Huang et al. (2006), Yuan and Lin (2007), Banerjee, El Ghaoui and d’Aspremont (2008), Bickel and Levina (2008a, 2008b), El Karoui (2008), Fan, Fan and Lv (2008), Friedman, Hastie and Tibshirani (2008), Rocha, Zhao and Yu (2008), Rothman et al. (2008), Lam and Fan (2009), Rothman, Levina and Zhu (2009), Cai, Zhang and Zhou (2010), Yuan (2010), Cai and Liu (2011) and Cai, Liu and Luo (2011), among many others.
Let be independent copies of a dimensional Gaussian random vector . The goal is to estimate the covariance matrix and its inverse based on the sample . It is now well known that the usual sample covariance matrix
where , is not a consistent estimator of the covariance matrix when , and structural assumptions are required in order to estimate consistently.
One of the most commonly considered classes of covariance matrices is the “bandable” matrices, where the entries of the matrix decay as they move away from the diagonal. More specifically, consider the following class of covariance matrices introduced in Bickel and Levina (2008a):
Such a family of covariance matrices naturally arises in a number of settings, including temporal or spatial data analysis. See Bickel and Levina (2008a) for further discussions. Several regularization methods have been introduced for estimating a bandable covariance matrix . Bickel and Levina (2008a) suggested banding the sample covariance matrix and estimating by where is a banding matrix
and represents the Schur product, that is, for two matrices of the same dimensions. See Figure 1(a) for an illustration. Bickel and Levina (2008a) proposed to choose and showed that the resulting banding estimator attains the rate of convergence
(2) 
uniformly over , where stands for the spectral norm. This result indicates that even when , it is still possible to consistently estimate , so long as .
Cai, Zhang and Zhou (2010) established the minimax rate of convergence for estimation over and introduced a tapering estimator where the tapering matrix is given by
with . See Figure 1(b) for an illustration. It was shown that the tapering estimator with achieves the rate of convergence
(3) 
uniformly over , which is always faster than the rate in (2). This implies that the rate of convergence given in (2) for the banding estimator with is in fact suboptimal. Furthermore, a lower bound argument was given in Cai, Zhang and Zhou (2010) which showed that the rate of convergence given in (3) is indeed optimal for estimating the covariance matrices over .
(a) Weighting matrix for banding  (b) Weighting matrix for tapering 
The minimax rate of convergence in (3) provides an important benchmark for the evaluation of the performance of covariance matrix estimators. It is, however, evident from its construction that the rate optimal tapering estimator constructed in Cai, Zhang and Zhou (2010) requires explicit knowledge of the decay rate which is typically unknown in practice. It is also clear that a tapering estimator designed for a parameter space with a given decay rate performs poorly over another parameter space with a different decay rate. The tapering estimator mentioned above is thus not very practical.
This naturally leads to the important question of adaptive estimation: Is it possible to construct a single estimator, not depending on the decay rate , that achieves the optimal rate of convergence simultaneously over a wide range of the parameter spaces ? We shall show in this paper that the answer is affirmative. A fully datadriven adaptive estimator is constructed and is shown to be simultaneously rate optimal over the collection of the parameter spaces for all . That is,
In many applications, the inverse covariance matrix is of significant interest. We introduce a slightly modified version of and show that it adaptively attains the optimal rate of convergence for estimating .
The adaptive covariance matrix estimator achieves its adaptivity through block thresholding of the sample covariance matrix . The idea of adaptive estimation via block thresholding can be traced back to nonparametric function estimation using Fourier or wavelet series. See, for example, Efromovich (1985) and Cai (1999). However, the application of block thresholding to covariance matrix estimation poses new challenges. One of the main difficulties in dealing with covariance matrix estimation, as opposed to function estimation or sequence estimation problems, is the fact that the spectral norm is not separable in its entries. Another practical challenge is due to the fact that the covariance matrix is “twodirectional” where one direction is along the rows and another along the columns. The blocks of different sizes need to be carefully constructed so that they fit well in the sample covariance matrix and the risk can be assessed based on their joint effects rather than their individual contributions. There are two main steps in the construction of the adaptive covariance matrix estimator. The first step is the construction of the blocks. Once the blocks are constructed, the second step is to estimate the entries of the covariance matrix in groups and make simultaneous decisions on all the entries within a block. This is done by thresholding the sample covariance matrix block by block. The threshold level is determined by the location, block size and corresponding spectral norms. The detailed construction is given in Section 2.
We shall show that the proposed block thresholding estimator is simultaneously rateoptimal over every for all . The theoretical analysis of the estimator requires some new technical tools that can be of independent interest. One is a concentration inequality which shows that although the sample covariance matrix is not a reliable estimator of , its submatrices could still be a good estimate of the corresponding submatrices of . Another useful tool is a socalled norm compression inequality which reduces the analysis on the whole matrix to a matrix of much smaller dimensions, whose entries are the spectral norms of the blocks.
In addition to the analysis of the theoretical properties of the proposed adaptive block thresholding estimator, a simulation study is carried out to investigate the finite sample performance of the estimator. The simulations show that the proposed estimator enjoys good numerical performance when compared with nonadaptive estimators such as the banding and tapering estimators.
Besides bandable matrices considered in the present paper, estimating sparse covariance matrices and sparse precision matrices has also been actively studied in the recent literature. Bickel and Levina (2008b) proposed a thresholding estimator for sparse covariance matrices and obtained the rate of convergence. Cai and Zhou (2011) developed a new general minimax lower bound technique and established the minimax rate of convergence for estimating sparse covariance matrices under the spectral norm and other matrix operator norms. Cai and Liu (2011) introduced an adaptive thresholding procedure for estimating sparse covariance matrices that automatically adjusts to the variability of individual entries. Estimation of sparse precision matrices has also drawn considerable attention due to its close connections to Gaussian graphical model selection. See Yuan and Lin (2007), Yuan (2010), Ravikumar et al. (2011) and Cai, Liu and Luo (2011). The optimal rate of convergence for estimating sparse inverse covariance matrices was established in Cai, Liu and Zhou (2011).
The rest of the paper is organized as follows. Section 2 presents a detailed construction of the datadriven block thresholding estimator . The theoretical properties of the estimator are investigated in Section 3. It is shown that the estimator achieves the optimal rate of convergence simultaneously over each for all . In addition, it is also shown that a slightly modified version of is adaptively rateoptimal for estimating over the collection . Simulation studies are carried out to illustrate the merits of the proposed method, and the numerical results are presented in Section 4. Section 5 discusses extension to subguassian noise, adaptive estimation under the Frobenius norm and other related issues. The proofs of the main results are given in Section 6.
2 Block thresholding
In this section we present in detail the construction of the adaptive covariance matrix estimator. The main strategy in the construction is to divide the sample covariance matrix into blocks and then apply thresholding to each block according to their sizes and dimensions. We shall explain these two steps separately in Sections 2.1 and 2.2.
2.1 Construction of blocks
As mentioned in the Introduction, the application of block thresholding to covariance matrix estimation requires more care than in the conventional sequence estimation problems such as those from nonparametric function estimation. We begin with the blocking scheme for a general symmetric matrix. A key in our construction is to make blocks larger for entries that are farther away from the diagonal and take advantage of the approximately banding structure of the covariance matrices in . Before we give a precise description of the construction of the blocks, it is helpful to graphically illustrate the construction in the following plot.
Due to the symmetry, we shall focus only on the upper half for brevity. We start by constructing blocks of size along the diagonal as indicated by the darkest squares in Figure 2. Note that the last block may be of a smaller size if is not a divisor of . Next, new blocks are created successively toward the top right corner. We would like to increase the block sizes along the way. To this end, we extend to the right from the diagonal blocks by either two or one block of the same dimensions () in an alternating fashion. After this step, as exhibited in Figure 2, the odd rows of blocks will have three blocks, and the even rows will have two in the upper half. Next, the size of new blocks is doubled to . Similarly to before, the last block may be of smaller size if is not a divisor of , and for the most part, we shall neglect such a caveat hereafter for brevity. The same procedure is then followed. We extend to the right again by three or two blocks of the size . Afterwards, the block size is again enlarged to and we extend to the right by three or two blocks of size . This procedure will continue until the whole upper half of the matrix is covered. For the lower half, the same construction is followed to yield a symmetric blocking of the whole matrix.
The initial block size can take any value as long as . In particular, we can take . The specific choice of does not impact the rate of convergence, but in practice it may be beneficial sometimes to use a value different from . In what follows, we shall keep using for the sake of generality.
For notational purposes, hereafter we shall refer to the collection of index sets for the blocks created in this fashion as where for some subintervals . It is clear that forms a partition of , that is,
For a matrix and an index set , we shall also write , a submatrix of . Hence is uniquely determined by and the partition . With slight abuse of notation, we shall also refer to an index set as a block when no confusion occurs, for the sake of brevity.
Denote by the dimension of , that is,
Clearly by construction, most of the blocks in are necessarily square in that . The exceptions occur when the block sizes are not divisors of , which leaves the blocks along the last row and column in rectangles rather than squares. We opt for the more general definition of to account for these rectangle blocks.
2.2 Block thresholding
Once the blocks are constructed, the next step is to estimate the entries of the covariance matrix , block by block, through thresholding the corresponding blocks of the sample covariance matrix based on the location, block size and corresponding spectral norms.
We now describe the procedure in detail. Denote by the block thresholding estimator, and let . The estimate of the block is defined as follows: {longlist}[(a)]
keep the diagonal blocks: if is on the diagonal, that is, ;
“kill” the large blocks: if ;
threshold the intermediate blocks: For all other blocks , set
(4) 
where is a turning parameter. Our theoretical development indicates that the resulting block thresholding estimator is optimally rate adaptive whenever is a sufficiently large constant. In particular, it can be taken as fixed at . In practice, a datadriven choice of could potentially lead to further improved finite sample performance.
It is clear from the construction that the block thresholding estimate is fully datadriven and does not require the knowledge of . The choice of the thresholding constant comes from our theoretical and numerical studies. See Section 5 for more discussions on the choice of .
We should also note that, instead of the hard thresholding operator , more general thresholding rules can also be applied in a similar blockwise fashion. In particular, one can use block thresholding rules where
and is a univariate thresholding rule. Typical examples include the soft thresholding rule and the socalled adaptive lasso rule for some , among others. Rothman, Levina and Zhu (2009) considered entrywise universal thresholding for estimating sparse covariance matrix. In particular, they investigate the class of univariate thresholding rules such that (a) ; (b) if ; and (c) . Although we will focus on the hard thresholding rule in the present paper for brevity, all the theoretical results developed here apply to the more general class of block thresholding rules as well.
3 Adaptivity
We now study the properties of the proposed block thresholding estimator and show that the estimator simultaneously achieves the minimax optimal rate of convergence over the full range of for all . More specifically, we have the following result.
Theorem 3.1
Let be the block thresholding estimator of as defined in the Section 2. Then
(5) 
for all , where is a positive constant not depending on and .
Comparing with the minimax rate of convergence given in Cai, Zhang and Zhou (2010), this shows that the block thresholding estimator is optimally rate adaptive over for all .
The block thresholding estimator is positive definite with high probability, but it is not guaranteed to be positive definite. A simple additional step, as was done in Cai and Zhou (2011), can make the final estimator positive semidefinite and still achieve the optimal rate of convergence. Write the eigendecomposition of as , where ’s and ’s are, respectively, the eigenvalues and eigenvectors of . Let be the positive part of , and define
Then is positive semidefinite, and it can be shown easily that attains the same rate as . See Cai and Zhou (2011) for further details. If a strictly positive definite estimator is desired, one can also set for some small positive value , say , and the resulting estimator is then positive definite and attains the optimal rate of convergence.
The inverse of the covariance matrix, , is of significant interest in many applications. An adaptive estimator of can also be constructed based on our proposed block thresholding estimator. To this end, let be its eigendecomposition, that is, is an orthogonal matrix, and is a diagonal matrix. We propose to estimate by
where is the th diagonal element of . The truncation of is needed to deal with the case where is near singular. The result presented above regarding can be used to show that adaptively achieves the optimal rate of convergence for estimating .
Theorem 3.2
Let be defined as above. Then
for all , where is a constant not depending on and .
The proof of the adaptivity results is somewhat involved and requires some new technical tools. The main ideas in the theoretical analysis can be summarized as follows:

The different can be decomposed into a sum of matrices such that each matrix in the sum only consists of blocks in that are of the same size. The individual components in the sum are then bounded separately according to their block sizes.

Although the sample covariance matrix is not a reliable estimator of , its submatrix, , could still be a good estimate of . This is made precise through a concentration inequality.

The analysis on the whole matrix is reduced to the analysis of a matrix of much smaller dimensions, whose entries are the spectral norms of the blocks, through the application of a socalled norm compression inequality.

With high probability, large blocks in , which correspond to negligible parts of the true covariance matrix , are all shrunk to zero because by construction they are necessarily far away from the diagonal.
We shall elaborate below these main ideas in our analysis and introduce some useful technical tools. The detailed proof is relegated to Section 6.
3.1 Main strategy
Recall that is the collection of blocks created using the procedure in Section 2.1, and it forms a partition of . We analyze the error by first decomposing it into a sum of matrices such that each matrix in the sum only consists of blocks in that are of the same size. More precisely, for a matrix , define to be a matrix whose entry equals that of if belongs to a block of dimension , and zero otherwise. In other words,
With this notation, is decomposed as
This decomposition into the sum of blocks of different sizes is illustrated in Figure 3 below.
(a)  (b) 
We shall first separate the blocks into two groups, one for big blocks and another for small blocks. See Figure 4 for an illustration. By the triangle inequality, for any ,
(6) 
The errors on the big blocks will be bounded as a whole, and the errors on the small blocks will be bounded separately according to block sizes. With a careful choice of the cutoff value , it can be shown that there exists a constant not depending on and such that for any and ,
(7) 
and
(8) 
which then implies Theorem 3.1 because
The choice of the cutoff value depends on and and different approaches are taken to establish (7) and (8). In both cases, a key technical tool we shall use is a concentration inequality on the deviation of a block of the sample covariance matrix from its counterpart of the true covariance matrix, which we now describe.
(a) Small blocks  (b) Large blocks 
3.2 Concentration inequality
The rationale behind our block thresholding approach is that although the sample covariance matrix is not a reliable estimator of , its submatrix, , could still be a good estimate of . This observation is formalized in the following theorem.
Theorem 3.3
There exists an absolute constant such that for all ,
In particular, we can take .
Theorem 3.3 enables one to bound the estimation error block by block. Note that larger blocks are necessarily far away from the diagonal by construction. For bandable matrices, this means that larger blocks are necessarily small in the spectral norm. From Theorem 3.3, if , with overwhelming probability,
for blocks with sufficiently large sizes. As we shall show in Section 6, and in the above inequality can be replaced by their respective sample counterparts. This observation suggests that larger blocks are shrunken to zero with our proposed block thresholding procedure, which is essential in establishing (8).
The treatment of smaller blocks is more complicated. In light of Theorem 3.3, blocks of smaller sizes can be estimated well, that is, is close to for of smaller sizes. To translate the closeness in such a blockwise fashion into the closeness in terms of the whole covariance matrix, we need a simple yet useful result based on a matrix norm compression transform.
3.3 Norm compression inequality
We shall now present a socalled norm compression inequality which is particularly useful for analyzing the properties of the block thresholding estimators. We begin by introducing a matrix norm compression transform.
Let be a symmetric matrix, and let be positive integers such that . The matrix can then be partitioned in a block form as
where is a submatrix. We shall call such a partition of the matrix a regular partition and the blocks regular blocks. Denote by a norm compression transform
The following theorem shows that such a norm compression transform does not decrease the matrix norm.
Theorem 3.4 ((Norm compression inequality))
For any matrix and block sizes such that ,
Together with Theorems 3.3 and 3.4 provides a very useful tool for bounding . Note first that Theorem 3.4 only applies to a regular partition, that is, the divisions of the rows and columns are the same. It is clear that corresponds to regular blocks of size with the possible exception of the last row and column which can be of a different size, that is, . Hence, Theorem 3.4 can be directly applied. However, this is no longer the case when .
To take advantage of Theorem 3.4, a new blocking scheme is needed for . Consider the case when . It is clear that does not form a regular blocking. But we can form new blocks with , that is, half the size of the original blocks in . Denote by the collection of the new blocks . It is clear that under this new blocking scheme, each block of size consists of four elements from . Thus
Applying Theorem 3.4 to the regular blocks yields
which can be further bounded by
where stands for the matrix norm. Observe that each row or column of has at most 12 nonzero entries, and each entry is bounded by
because implies . This property suggests that can be controlled in a blockbyblock fashion. This can be done using the concentration inequalities given in Section 3.2.
The case when can be treated similarly. Let and for It is not hard to see that each block in of size occupies up to four blocks in this regular blocking. And following the same argument as before, we can derive bounds for .
4 Numerical results
The block thresholding estimator proposed in Section 2 is easy to implement. In this section we turn to the numerical performance of the estimator. The simulation study further illustrates the merits of the proposed block thresholding estimator. The performance is relatively insensitive to the choice of , and we shall focus on throughout this section for brevity.
We consider two different sets of covariance matrices. The setting of our first set of numerical experiments is similar to those from Cai, Zhang and Zhou (2010). Specifically, the true covariance matrix is of the form
where the value of is set to be to ensure positive definiteness of all covariance matrices, and are independently sampled from a uniform distribution between and .
The second settings are slightly more complicated, and the covariance matrix is randomly generated as follows. We first simulate a symmetric matrix whose diagonal entries are zero and offdiagonal entries () are independently generated as . Let be its smallest eigenvalue. The covariance matrix is then set to be to ensure its positive definiteness.
For each setting, four different combinations of and are considered, and , and for each combination, 200 simulated datasets are generated. On each simulated dataset, we apply the proposed block thresholding procedure with . For comparison purposes, we also use the banding estimator of Bickel and Levina (2008a) and tapering estimator of Cai, Zhang and Zhou (2010) on the simulated datasets. For both estimators, a tuning parameter needs to be chosen. The two estimators perform similarly for the similar values of . For brevity, we report only the results for the tapering estimator because it is known to be rate optimal if is appropriately selected based on the true parameter space. It is clear that for both our settings, with . But such knowledge would be absent in practice. To demonstrate the importance of knowing the true parameter space for these estimators and consequently the necessity of an adaptive estimator such as the one proposed here, we apply the estimators with five different values of , and . We chose for the tapering estimator following Cai, Zhang and Zhou (2010).The performance of these estimators is summarized in Figures 5 and 6 for the two settings, respectively.
It can be seen in both settings that the numerical performance of the tapering estimators critically depends on the specification of the decay rate . Misspecifying could lead to rather poor performance by the tapering estimators. It is perhaps not surprising to observe that the tapering estimator with performed the best among all estimators since it correctly specifies the true decay rate and therefore, in a certain sense, made use of the information that may not be known a priori in practice. In contrast, the proposed block thresholding estimator yields competitive performance while not using such information.
5 Discussion
In this paper we introduced a fully datadriven covariance matrix estimator by blockwise thresholding of the sample covariance matrix. The estimator simultaneously attains the optimal rate of convergence for estimating bandable covariance matrices over the full range of the parameter spaces for all . The estimator also performs well numerically.
As noted in Section 2.2, the choice of the thresholding constant is based on our theoretical and numerical studies. Similar to wavelet thresholding in nonparametric function estimation, in principle other choices of can also be used. For example, the adaptivity results on the block thresholding estimator holds as long as where the value comes from the concentration inequality given in Theorem 3.3. Our experience suggests the performance of the block thresholding estimator is relatively insensitive to a small change of . However, numerically the estimator can sometimes be further improved by using datadependent choices of .
Throughout the paper, we have focused on the Gaussian case for ease of exposition and to allow for the most clear description of the block thresholding estimator. The method and the results can also be extended to more general subgaussian distributions. Suppose that the distribution of the ’s is subgaussian in the sense that there exists a constant such that
(9) 
Let denote the collection of distributions satisfying both (1) and (9). Then for any given , the block thresholding estimator adaptively attains the optimal rate of convergence over for all , and whenever is chosen sufficiently large.
In this paper we have focused on estimation under the spectral norm. The block thresholding procedure, however, can be naturally extended to achieve adaption under other matrix norms. Consider, for example, the Frobenius norm. In this case, it is natural and also necessary to threshold the blocks based on their respective Frobenius norms instead of the spectral norms. Then following a similar argument as before, it can be shown that this Frobenius norm based block thresholding estimator can adaptively achieve the minimax rate of convergence over every for all . It should also be noted that adaptive estimation under the Frobenius norm is a much easier problem because the squared Frobenius norm is entrywise decomposable, and the matrix can then be estimated well row by row or column by column. For example, applying a suitable block thresholding procedure for sequence estimation to the sample covariance matrix, rowbyrow would also lead to an adaptive covariance matrix estimator.
The block thresholding approach can also be used for estimating sparse covariance matrices. A major difference in this case from that of estimating bandable covariance matrices is that the block sizes cannot be too large. With suitable choices of the block size and thresholding level, a fully datadriven block thresholding estimator can be shown to be rateoptimal for estimating sparse covariance matrices. We shall report the details of these results elsewhere.
6 Proofs
In this section we shall first prove Theorems 3.3 and 3.4 and then prove the main results, Theorems 3.1 and 3.2. The proofs of some additional technical lemmas are given at the end of the section.
6.1 Proof of Theorem 3.3
The proof relies the following lemmas.
Lemma 1
Let be a random matrix following the Wishart distribution where
Then
where .
Let and be independent copies of . Let
be its sample covariance matrix. It is clear that . Hence
Note that
Therefore,
Observe that
Similarly,
The proof is now complete.
Lemma 2
Let . There exists an absolute constant such that for any ,
In particular, we can take .
Without loss of generality, assume that . Let be a matrix, and where is the unit sphere in the dimensional Euclidean space. Observe that
where as before, we use to represent the spectral norm for a matrix and norm for a vector. As shown by Böröczky and Wintsche [(2005), e.g., Corollary 1.2], there exists an cover set of such that
for some absolute constant . Note that
(10) 
In other words,
(11) 
Now consider . Let and . Then
where
Similarly, . Therefore,
Clearly the distributional properties of are invariant to the mean of . We shall therefore assume without loss of generality that in the rest of the proof.
For any fixed , we have
where , , and similarly, , . It is not hard to see that
and is simply the difference between the sample and population covariance of and . We now appeal to the following lemma:
Applying Lemma 1, we obtain
where . By the tail bound for random variables, we have
See, for example, Lemma 1 of Laurent and Massart (2000). In summary,
Now an application of union bound then yields