Minimax Estimation of Large Precision Matrices with Bandable Cholesky Factor
Last decade witnesses significant methodological and theoretical advances in estimating large precision matrices. In particular, there are scientific applications such as longitudinal data, meteorology and spectroscopy in which the ordering of the variables can be interpreted through a bandable structure on the Cholesky factor of the precision matrix. However, the minimax theory has still been largely unknown, as opposed to the well established minimax results over the corresponding bandable covariance matrices. In this paper, we focus on two commonly used types of parameter spaces, and develop the optimal rates of convergence under both the operator norm and the Frobenius norm. A striking phenomenon is found: two types of parameter spaces are fundamentally different under the operator norm but enjoy the same rate optimality under the Frobenius norm, which is in sharp contrast to the equivalence of corresponding two types of bandable covariance matrices under both norms. This fundamental difference is established by carefully constructing the corresponding minimax lower bounds. Two new estimation procedures are developed: for the operator norm, our optimal procedure is based on a novel local cropping estimator targeting on all principle submatrices of the precision matrix while for the Frobenius norm, our optimal procedure relies on a delicate regression-based block-thresholding rule. We further establish rate optimality in the nonparanormal model. Numerical studies are carried out to confirm our theoretical findings.
eq(LABEL:#1) \newrefformatchapChapter LABEL:#1 \newrefformatsecSection LABEL:#1 \newrefformatalgoAlgoritheorem LABEL:#1 \newrefformatfigFigure LABEL:#1 \newrefformattabTable LABEL:#1 \newrefformatrmkRemark LABEL:#1 \newrefformatclmClaim LABEL:#1 \newrefformatdefDefinition LABEL:#1 \newrefformatcorCorollary LABEL:#1 \newrefformatlmmLemma LABEL:#1 \newrefformatpropProposition LABEL:#1 \newrefformatappAppendix LABEL:#1 \newrefformatexExample LABEL:#1 \newrefformatexerExercise LABEL:#1 \newrefformatsolnSolution LABEL:#1 \newrefformatcondCondition LABEL:#1
Keywords: optimal rate of convergence, precision matrix, local cropping, Cholesky factor, minimax lower bound, block-thresholding, operator norm, Frobenius norm
AMS 2000 Subject Classification: Primary 62H12; secondary 62F12, 62C20, 62G09.
Covariance matrix plays a fundamental role in many important multivariate statistical problems. They include the principal component analysis, linear and quadratic discriminant analysis, clustering analysis, regression analysis and conditional dependence relationship studies in graphical models. During the last two decades, with the advances of technology, it is very common that the datasets are high-dimensional (the dimension can be much larger than the sample size ) in many applications such as genomics, fMRI data, astrophysics, spectroscopic imaging, risk management, portfolio allocation and numerical weather forecasting (heyer1997application, ; eisen1998cluster, ; hamill2001distance, ; ledoit2003improved, ; schafer2005shrinkage, ; padmanabhan2016estimating, ). It has been well-known that the sample covariance matrix performs poorly and can yield to invalid conclusions in the high-dimensional settings. For example, see wachter1976probability ; wachter1978strong ; johnstone2001distribution ; karoui2003largest ; paul2007asymptotics ; johnstone2009consistency for details on the limiting behaviors of the spectra of sample covariance matrices when both and increase.
To avoid the curse of dimensionality, certain structural assumptions are almost necessary in order to estimate the covariance matrix or its inverse, the precision matrix, consistently. In this paper, we consider large precision matrix estimation with bandable Cholesky factor. Its connection with other structures are discussed at the end of introduction. Both the operator norm loss () and the Frobenius norm loss () are investigated.
We begin with introducing the bandable Cholesky factor of the precision matrix. Assume that is a centered -variate random vector with covariance matrix . Let be the coefficients of the population regression of on its previous variables . In other words, is the linear projection of on in population (Define ). Set as the lower triangular matrix with zeros on the diagonal and zero-padded coefficients arranged in the rows. Denote the residual and . The regression theory implies the residuals are uncorrelated, and thus the matrix is diagonal. The modified Cholesky decomposition of is
where is the Cholesky factor of . There is a natural order on the variables based on the above Cholesky decomposition. Indeed, the well-known AR() model can be characterized by the -banded Cholesky factor of the precision matrix in which if . Inspired by the auto-regression model, we consider the bandable structures imposed on the Cholesky factor. More specifically, for , we define the parameter space of precision matrices by
Here, , are the maximum and minimum eigenvalues of and the index set . We follow the convention that the sum over an empty set of indices is equal to zero when . This parameter space was first proposed in (bickel2008regularized, ). The parameter specifies how fast the sequence decays to zero as goes away from . The covariance matrix estimation problem has been extensively studied when a similar bandable structure is imposed on the covariance matrix (e.g., (bickel2008regularized, ; cai2010optimal, )). Unlike the order in these bandable covariance matrices, in which large distance implies nearly independence, the order in bandable Cholesky factor encodes nearly conditional independence in the sense that the coefficients is close to zero when is large.
Although several approaches have been developed to estimate the precision matrix with bandable Cholesky factor, the optimality question remains mostly open, partially due to the following two reasons. (i) Intuitively, one would expect the minimax rate of convergence over under the operator norm to be the same as that over the class of bandable covariance matrices with the same decay parameter . Under sub-Gaussian assumptions, cai2010optimal established the optimal rate of convergence uniformly for all bandable covariance matrices with bounded spectra such that , . To establish such a rate of convergence for , lee2017estimating provided a lower bound with the matching rate. However, we show a surprising result in this paper that estimation over is a much harder task than that over bandable covariance matrices. Therefore, the lower bound in lee2017estimating is sub-optimal, and all attempts on showing the same rate of convergence intrinsically cannot succeed. (ii) From the methodological aspect, due to the regression interpretation of the Cholesky decomposition (1), almost all existing methods rely on an intermediate estimator of obtained by running regularized regression of each variable against its previous variables . For instance, bickel2008regularized estimated each row of by fitting the banded regression model with some bandwidth . wu2003nonparametric used an AIC or BIC penalty to pick the best bandwidth . In addition, huang2006covariance proposed adding a Lasso or Ridge penalty while levina2008sparse proposed using a nested Lasso penalty to the regression. See, for instance, banerjee2014posterior ; lee2017estimating for Bayesian approaches following the similar idea. The typical analysis for those estimation procedures in a row-wise fashion is to bound the operator norm by its matrix / norm. Although this analysis may provide optimal rates of convergence under the operator norm loss for some sparsity structure (see, i.e., cai2012optimal ; cai2016estimating2 for sparse covariance and precision matrices estimation), it might be sub-optimal for the bandable structure as seen in bandable covariance matrix estimation cai2010optimal ; bickel2008regularized . Therefore, in order to obtain rate-optimality over , a novel analysis or even a new estimation approach is expected.
Main Results. With regard to the above two issues, we provide satisfactory solutions in this paper. We at the first time show that the rate of convergence under the operator norm over is intrinsically slower than that over the counterpart class of bandable covariance matrices. This is achieved via a novel minimax lower bound construction. Moreover, in order to obtain a rate-optimal estimator, we propose a novel local cropping estimator which does not rely on any estimator of , and thus requires a new analysis. Our local cropping approach targets on accurate estimation of principal submatrices of the precision matrix under the operator norm, which results in a tradeoff between one variance term and two bias terms. The name comes after the idea of estimating each principal submatrix of the precision matrix, which is to crop the center by submatrix of the inverse of by sample covariance matrix using their neighbors in two directions of the same size. (During the finalizing process of this paper, we realized that a similar estimator is independently proposed to estimate precision matrices with a different structure (hu2017minimax, ).) Since our procedure does not directly explore the structure on each row of , the analysis of bias terms is much more involved, requiring a block-wise partition strategy. More details are discussed in Sections 2.1 and 3.1. In the end, besides , a similar type of classes of parameter spaces with bandable Cholesky factor is considered as well,
We further establish another surprising result: the optimal rates of convergence of two spaces, namely and , are different under the operator norm. This remarkable distinction is different from the comparison of two similar types of parameter spaces for bandable covariance matrices in cai2010optimal and bandable Toeplitz covariance matrices in cai2013optimal . The contrast of minimax results on and is summarized in Theorem 1 below. We mainly focus on the high-dimensional setting, assuming that and . Otherwise, one can always obtain trivial constant minimax rate (i.e., inconsistency) or the minimax rate as the smaller of and the one in Theorem 1.
Under normality assumption, the minimax risk of estimating the precision matrix over the parameter space with given in \prettyrefeq: def paraspp under the operator norm satisfies
The minimax risk of estimating the precision matrix over the parameter space with given in \prettyrefeq: def paraspq under the operator norm satisfies
Moreover, we also consider the minimax rates of convergence of precision matrix estimation under the Frobenius norm loss over and . This time, we prove that two types of spaces enjoy the same optimal rate of convergence. Together with the different rates of convergence under the operator norm loss, we demonstrate the intrinsic difference between operator norm and Frobenius norm. The Frobenius norm of a by matrix is defined as the vector norm of all entries. Driven by this fact, our estimation approach is naturally obtained by optimally estimating and in (1) separately. Due to the decay structure in , which is defined in terms of nested norm of each row of , our estimator is based on regression with a delicate block-thresholding rule. The minimax procedure is motivated by wavelet nonparametric function estimation, although the space cannot be exactly described by any Besov ball (cai2012minimax ; delyon1996minimax ). We summarize the optimality result under the Frobenius norm in Theorem 2 below.
Under normality assumption, the minimax risk of estimating the precision matrix over and given in \prettyrefeq: def paraspp and \prettyrefeq: def paraspq satisfies
Related Literature. During the last decade, various structural assumptions are imposed in the literature of high-dimensional statistics in order to estimate the covariance/precision matrix consistently under various loss functions. While mostly driven by the specific scientific applications, popular structures include ordered sparsity (bandable covariance matrices, precision matrices with bandable Cholesky factor), unordered sparsity (sparse covariance matrices, sparse precision matrices) and other more complicated ones such as certain combination of sparsity and low-rankness (spike covariance matrices, covariance with tensor product, latent graphical models). Many estimation procedures have been proposed accordingly to estimate high-dimensional covariance/precision matrices via taking advantages of these specific structures. For example, banding (wu2009banding ; bickel2008regularized ; xiao2014theoretic ; bien2016convex ) and tapering methods (furrer2007estimation ; cai2010optimal ) were developed to estimate bandable covariance matrices or precision matrices with bandable Cholesky factor; thresholding procedures were used in bickel2008covariance ; karoui2008operator ; cai2011adaptive to estimate sparse covariance matrices; penalized likelihood estimation (huang2006covariance ; yuan2007model ; d2008first ; banerjee2008model ; rothman2008sparse ; lam2009sparsistency ; ravikumar2011high ) and penalized regression methods (meinshausen2006high ; yuan2010high ; cai2011constrained ; sun2013sparse ; ren2015asymptotic ) are designed for sparse precision matrix estimation.
The fundamental difficulty of various covariance/precision matrices estimation problems have been carefully investigated in terms of the minimax risks under the operator norm loss among other losses, especially for those ordered and unordered sparsity structures. Specifically, for unordered structures, cai2012optimal considered the problems of optimal estimation of sparse covariance while cai2016estimating2 (see ren2015asymptotic as well) established the optimality results for estimating sparse precision matrices. For ordered structures, cai2010optimal established the optimal rates of convergence over two types of bandable covariance matrices. In addition, with an extra Toeplitz structure, cai2013optimal studied optimal estimation of two types of bandable Toeplitz covariance matrices. However, it was still largely unknown about the optimality results on estimating precision matrices with bandable Cholesky factor. See an exposure paper with discussion cai2016estimating and references therein on minimax results of covariance/precision matrix estimation under some other losses. In this paper, we provide a solution to this open problem by establishing the optimal rates of convergence over two types of precision matrices with bandable Cholesky factor. Thus, this paper completes the minimaxity results of all four sparsity structures commonly considered in literature.
Organization of the Paper. The rest of the paper is organized as follows. First, we propose our estimation procedures for precision matrix estimation in Section 2. In particular, local cropping estimators and regression-based block-thresholding estimators are designed for estimating precision matrices under the operator norm and the Frobenius norm respectively. Section 3 establishes the optimal rates of convergence under the operator norm for two commonly used types of parameter spaces and . A striking difference between two spaces are revealed when considering operator norm loss. Section 4 considers rate-optimal estimation under the Frobenius norm The results reveal that the fundamental difficulty of estimation for two parameter spaces are the same when considering Frobenius norm loss. In Section 5, we extend the results to nonparanormal models for inverse correlation matrix estimation by applying local cropping procedure to rank-based estimators. Section 6 presents the numerical performance of our local cropping procedure to illustrate the difference between two parameter spaces by simulation studies. We also demonstrate the sub-optimality of banding estimators. All technical lemmas used in proofs of main results are relegated to the supplement.
Notation. we introduce some basic notations that will be used in the rest of the paper. indicates the indicator function while indicates the all-ones vector. indicates the sign function. represents the largest integer which is no more than . represents the smallest integer which is no less than . Define if there is a constant independent of such that . For any vector , indicates its norm. For any by matrix , we use to denote its transpose. The matrix norm is define as . The matrix norm is also called the the operator norm or the spectral norm, and denoted as . The Frobenius norm is defined as . and are the largest and smallest singular values of when is not symmetric. When is a real symmetric matrix, and denote its largest and smallest eigenvalues. and indicate the -th row and column of matrix . denotes the index set . is short for the set . For the random vector and the data matrix , and indicates the -th columns of and . For any square matrix , denotes the diagonal matrix with diagonal entries being those on the main diagonal of while for any vector , denotes the diagonal matrix with diagonal entries being . In the estimation procedure under the operator norm, we use the matrix notation in the form of to facilitate the proof, where is always a square matrix, indicates the location information, and indicates that the size of is . Throughout the paper we denote by a generic positive constant which may vary from place to place but only depends on , , and possibly some sub-Gaussian distribution constant in (17).
In this section, we introduce our methodologies for estimating precision matrices over and under both the operator norm and the Frobenius norm. Assume that , a -variate random vector with mean zero and precision matrix . Our estimation procedures are based on its i.i.d. copies . We write , where each consists of i.i.d. copies of . Our estimation procedures are different under the operator norm and the Frobenius norm.
2.1 Estimation Procedure under the Operator Norm
We focus on the estimation problem under the operator norm first. As we discussed in the introduction, almost all existing methodologies (wu2003nonparametric ; huang2006covariance ; bickel2008regularized ) directly appeal the Cholesky decomposition of the precision matrix. They first estimate the Cholesky factor and by auto-regression and then estimate the precision matrix according to . The corresponding analysis in the row-wise fashion may not suitable for the operator norm loss. In this paper, we propose a novel local cropping estimator, which focuses on the estimation of directly.
To facilitate the illustration of the estimation procedure, we define two matrix operators. The cropping operator is designed to crop the center block out of the matrix. For a by matrix , we define the matrix , where , with
The parameter indicates the location and indicates the dimension. It is clear that is a principal submatrix of . The expanding operator is designed to put a small matrix onto a large zero matrix. For a by matrix, , define the matrix , where , with
The parameter indicates the location and indicates the dimension. Note that for a by matrix , we have . An illustration of two operators is provided in Figure 1.
In addition, for technical reasons (of obtaining rates of convergence in expectation rather than in probability), we introduce a projection operator. For a real square matrix , let the singular value decomposition of be with , and . Let , where , then define
For a symmetric matrix , we modify a little bit and define , where is its eigen-decomposition. Since all eigenvalues of are in the interval , is always invertible and positive definite.
We are ready to construct the local cropping estimator with bandwidth . At a high level, we first propose an estimator of each principal submatrix of size and in using cropping and extending operators. Then we arrange over those local estimators to estimate . Since the core idea of estimating those local estimators in our procedure is to crop the inverse of sample covariance matrix with a relatively larger size, we call in (12) the local cropping estimator.
Specifically, we first define an estimator of the principal submatrix at each location . To this end, we select the sample covariance matrix with a relative larger size, in this case, . Let the modified local sample covariance matrix be
Note that the operator guarantees to be invertible. Then we use the center part of its inverse to estimate , i.e.,
Similarly, we can define local estimators of via replacing by . Arranging over these estimators in the form of weighted sum, we obtain the estimator of , that is,
The operator makes these local estimators in the correct places. The final step \prettyrefeq: def est op is motivated by the analysis of optimal bandable covariance matrix estimation procedure proposed in cai2010optimal . Indeed, the optimal tapering estimator in cai2010optimal can be rewritten as a sum of many principal submatrices of the sample covariance matrix in a similar way as \prettyrefeq: def est op. In contrast, our estimator is not in a form of tapering the sample covariance matrix. However, in the analysis of our local cropping estimator in Section 3, the direct target of is a certain tapered population precision matrix with bandwidth . There are natural bias and variance terms involved in the distance of and its direct target. Together with the bias of the tapered population precision matrix, our analysis involves two bias terms and one variance term, which critically determine the optimal choice of bandwidth.
In Section 3, we show that the local cropping estimator with an optimal choice of bandwidth would achieve the minimax risk under the operator norm over parameter spaces in \prettyrefeq: def paraspp and in \prettyrefeq: def paraspq. However, the optimal choices of bandwidth are fundamentally distinct between and . Specifically, we show that the optimal bandwidth over is while that one over is .
Of note, the estimator depends on . The index of variable is clear most of the time, while we need to be careful when it is close to the boundary. When the index is beyond the index set , we shrink the size of the corresponding block by discarding the data with meaningless indexes.
2.2 Estimation Procedure under the Frobenius Norm
Under the Frobenius norm, our estimation procedure is based on the Cholesky decomposition of the precision matrix \prettyrefeq: def cd of omega. More specifically, we estimate the matrix and respectively by auto-regression, and then combine them together to construct the estimator of . The following estimation procedure applies to both the parameter space and as we will show that they enjoy the same optimal rate of convergence in Section 4.
Our estimator of the -th row of is based on the regression of against its previous variables. Unlike those existing methods (wu2003nonparametric ; huang2006covariance ; bickel2008regularized ) which rely on certain banding or penalized approaches for such a regression problem, we apply a block-thresholding procedure due to the decay structure in which is defined in terms of nested norm. To this end, we first regress against with bandwidth with some sufficiently large . Recall that the matrix consists of observations of , and the matrix represents observations of . The empirical regression coefficients are
We then further threshold the coefficients by taking advantages of the bandable structure of the Cholesky factor . Specifically, we define with coordinate as follows,
where , , and . Note that we keep the last coefficients and apply a block-thresholding rule in which the size of each block doubles backwards sequentially for the remaining coefficients in (14). Our procedure is inspired by the optimal estimation procedure over Besov balls for many nonparametric function estimation problems, or equivalently, the corresponding Gaussian sequence models (See cai2012minimax the reference therein). We emphasize that any linear estimator of the coefficients cannot yield to the optimal rates of convergence in our setting under the Frobenius norm.
Our estimator of can be constructed by arranging zero-padded , accordingly with an identity matrix. Specifically, set the -th entry of as when , , otherwise as zero. We also need to bound the singular values of . To this end, we define
as our estimator of , where is defined in \prettyrefeq: projection.
The estimation of is based on the sample variances of those empirical residuals in the regression of against . For each , the sample variance of the empirical residual is
where . Let , where . We define as our estimator of .
Finally, define our estimator of as
For the parameter space , a much simpler banding estimation scheme on the empirical regression coefficients is able to achieve the minimax risk. Set . We use the empirical residuals and coefficients obtained by regressing each against to directly construct the estimators of and . It can be proved that this estimator achieves the minimax risk over the parameter space .
3 Rate Optimality under the Operator Norm
In this section, we establish the optimal rates of convergence for estimating the precision matrix over the parameter spaces and given in \prettyrefeq: def paraspp and \prettyrefeq: def paraspq under the operator norm. We first derive the risk upper bound of the local cropping estimator in \prettyrefsec: up op l over parameter space . We provide a matching risk lower bound by applying the Assouad’s lemma and the Le Cam’s method in \prettyrefsec: low op l over . The establishment of the rate optimality over the parameter space is similar to the one over , which is provided in \prettyrefsec: risk op e.
Throughout this section, we assume that follows certain sub-Gaussian distribution with constant , that is,
for all and all unit vectors .
3.1 Minimax Upper Bound under the Operator Norm over
In this section, we develop the following upper bound of our estimation procedure proposed in \prettyrefsec: est op.
When , the local cropping estimator defined in \prettyrefeq: def est op of the precision matrix over with given in \prettyrefeq: def paraspp satisfies
When , we have
The optimal choice of is due to the bias-variance tradeoff. Combining Theorem 3 with the minimax lower bound derived in \prettyrefsec: low op l, we immediately obtain that the local cropping estimator is rate optimal.
As we discussed in \prettyrefsec: est op, the direct target of our local cropping estimator is certain tapered population precision matrix with bandwidth , which can be written as a weighted sum of many principal submatrices of the population precision matrix. We construct this corresponding tapered population precision matrix as follows. Denote the precision matrix . We define such that for ,
The following lemma elucidates the decomposition of this tapered precision matrix .
The defined in \prettyrefeq: tp def can be written as
It is easy to check . Since the eigenvalues of are in the interval , the operator would not increase the risk much. Indeed, according to \prettyrefeq: proj 1 in \prettyreflmm: projection, we have
The following lemma bounds the bias between our direct target and the population precision matrix.
For in the parameter space defined in \prettyrefeq: def paraspp with , defined in \prettyrefeq: tp def, we have
Unlike existing methods, our procedure does not directly utilize the decay structure of Cholesky factor. Consequently, the proof of Lemma 2 is involved and requires a block-wise partition strategy.
Then we turn to the analysis of .
These two terms can be bounded in the same way, we only focus on the second term.
where we further have variance term and bias term of local estimators. For the variance term in \prettyrefeq: up lop 2, we further have
The last two inequalities hold because of the fact that the eigenvalues of and are in the interval , and \prettyreflmm: projection. The following concentration inequality of sample covariance matrix facilitates our proof.
For the observations following certain sub-Gaussian distribution with constant and precision matrix , we have
Combining \prettyreflmm: sample cov max, \prettyrefeq: up lop 2 and \prettyrefeq: up lop 3, we have
We turn to bounding the bias term of local estimator in (21).
Assume that defined in \prettyrefeq: def paraspp with . Then we have
lmm: bias in block up lop, together with \prettyrefeq: up lop 4, \prettyrefeq: up lop 2 and \prettyrefeq: up lop 1, implies that
Plugging \prettyreflmm: tp omega close and \prettyrefeq: up lop 6 into \prettyrefeq: up lop 0, we finish the proof of \prettyrefthm: up op 1. ∎
3.2 Minimax Lower Bound under the Operator Norm over\prettyref
thm: up op 1 in \prettyrefsec: up op l proves that the local cropping estimator defined in \prettyrefeq: def est op attains the convergence rate of . In this section, we establish the following matching lower bound, which proves the rate optimality of the local cropping estimator.
The minimax risk of estimating the precision matrix over defined in \prettyrefeq: def paraspp under the operator norm with satisfies
The lower bound of parameter space can be established by the lower bounds over its subsets. We construct two subsets and and calculate the lower bound over those two subsets separately. Let be a positive constant which is less than .
First, we construct . Set . Set the index set , i.e., for any , each is either or . Then we define the matrix with and
We then define as the collection of matrices indexed by ,
Next, we construct as the collection of the diagonal matrices in the following equation,
and are subsets of .