Adaptive estimation of covariance matrices via Cholesky decomposition

# Adaptive estimation of covariance matrices via Cholesky decomposition

\fnmsNicolas \snmVerzelen\thanksreft1label=e1]nicolas.verzelen@supagro.inra.fr [ INRA and SUPAGRO INRA, UMR 729 MISTEA,
F-34060 Montpellier, France
SUPAGRO, UMR 729 MISTEA,
F-34060 Montpellier, France
###### Abstract

This paper studies the estimation of a large covariance matrix. We introduce a novel procedure called ChoSelect based on the Cholesky factor of the inverse covariance. This method uses a dimension reduction strategy by selecting the pattern of zero of the Cholesky factor. Alternatively, ChoSelect can be interpreted as a graph estimation procedure for directed Gaussian graphical models. Our approach is particularly relevant when the variables under study have a natural ordering (e.g. time series) or more generally when the Cholesky factor is approximately sparse. ChoSelect achieves non-asymptotic oracle inequalities with respect to the Kullback-Leibler entropy. Moreover, it satisfies various adaptive properties from a minimax point of view. We also introduce and study a two-stage procedure that combines ChoSelect with the Lasso. This last method enables the practitioner to choose his own trade-off between statistical efficiency and computational complexity. Moreover, it is consistent under weaker assumptions than the Lasso. The practical performances of the different procedures are assessed on numerical examples.

[
\kwd
\runtitle

Covariance estimation

{aug}\thankstext

t1Research mostly carried out at Univ Paris-Sud (Laboratoire de Matématiques, CNRS-UMR 8628)

class=AMS] \kwd[Primary ]62H12 \kwd[; secondary ]62F35, 62J05

Covariance matrix \kwdbanding \kwdCholesky decomposition \kwddirected graphical models \kwdpenalized criterion \kwdminimax rate of estimation

## 1 Introduction

The problem of estimating large covariance matrices has recently attracted a lot of attention. On the one hand, there is an inflation of high-dimensional data in many scientific areas: gene arrays, functional magnetic resonance imaging (fMRI), image classification, and climate studies. On the other hand, many data analysis tools require an estimation of the covariance matrix . This is for instance the case for principal component analysis (PCA), for linear discriminant analysis (LDA), or for establishing independences or conditional independences between the variables. It is known for a long time that the simplest estimator, the sample covariance matrix performs poorly when the size of the vector is larger than the number of observations (see for instance Johnstone [16]).

Depending on the objectives of the analysis and on the applications, different approaches are used for estimating high-dimensional covariance matrices. Indeed, if one wants to perform PCA or to establish independences between the covariates, then it is advised to estimate directly the covariance matrix . In contrast, performing LDA further relies on the inverse of the covariance matrix. In the sequel, we call this matrix the precision matrix and note it . Sparse precision matrices are also of interest because of their connection with graphical models and conditional independence. The pattern of zero in indeed corresponds to the graph structure of the distribution (see for instance Lauritzen [20] Sect.5.1.3).

Most of the methods based on direct covariance matrix estimation amount to regularize the empirical covariance matrix. Let us mention the work of Ledoit and Wolf [21] who propose to replace the sample covariance with its linear combination with the identity matrix. However, these shrinkage methods are known to provide an inconsistent estimation of the eigenvectors [17]. Applying recent results on random matrix theory, El Karoui [11] and Bickel and Levina [5] have studied thresholding estimators of . The resulting estimator is sparse and is proved (for instance [5]) to be consistent with respect to the operator norm under mild conditions as long as goes to . These results are particularly of interest for performing PCA since they imply a consistent estimation of the eigenvalues and the eigenvectors. Observe that all these methods are invariant under permutation of the variables. Yet, in many applications (for instance times series, spectroscopy, climate data), there exists a natural ordering in the data. In such a case, one should use other procedures to obtain faster rates of convergence. Among other, Furrer and Bentgsson [14] and Bickel and Levina [6] use banded or tapering estimators. Again, the consistency of such estimators is proved. Moreover, all these methods share an attractive computational cost. We refer to the introduction of [5] for a more complete review.

The estimation procedures of the precision matrix fall into three categories depending whether there exists an ordering on the variables and to what extent this ordering is important. If there is not such an ordering, d’Aspremont et al. [3] and Yuan and Lin [33] have adopted a penalized likelihood approach by applying a penalty to the entries of the precision matrix. It has also been discussed by Rothman et al. [27] and Friedman et al. [13] and extended by Lam and Fan et al. [19] or Fan et al. [12] to other penalization methods. These estimators are known to converge with respect to the Frobenius norm (for instance [27]) when the underlying precision matrix is sparse enough.

When there is a natural ordering on the covariates, the regularization is introduced via the Cholesky decomposition:

 Ω=T∗S−1T ,

where is a lower triangular matrix with a unit diagonal and is a diagonal matrix with positive entries. The elements of the -th row can be interpreted as regression coefficient of -th component given its predecessors. This will be further explained in Section 2.1. For time series or spectroscopy data, it is more likely that the relevant covariates for this regression of the -th component are its closest predecessors. In other word, it is expected that the matrix is approximately banded. With this in mind, Wu and Pourahmadi [31] introduce a -banded estimator of the matrix by smoothing along the first subdiagonals and setting the rest to 0. The choice of is made by applying AIC (Akaike [1]). They prove element-wise consistency of their estimator but did not provide any high-dimensional result with respect to a loss function such as Kullback or Frobenius. Bickel and Levina [6] also consider -banded estimator of and are able to prove rates of convergence in the matrix operator norm. Moreover, they introduce a cross-validation approach for choosing a suitable , but they do not prove that the selection method achieves adaptiveness. More recently, Levina et al. [22] propose a new banding procedure based on a nested Lasso penalty. Unlike the previous methods, they allow the number used for banding to depend on the line of . They do not state any theoretical result, but they exhibit numerical evidence of its efficiency. In the sequel, we call the issue of estimating by banding the matrix the banding problem.

Between the first approach based on precision matrix regularization and the second one which relies on banding the Cholesky factor, there exists a third one which is not permutation invariant, but does not assume that the matrix is approximately banded. It consists in approximating by a sparse lower triangular matrix (i.e. most of the entries are set to ).

When is it interesting to adopt this approach? If we consider a directed graphical model whose graph is sparse and compatible with the ordering of the variables, then the Cholesky factor is sparse. Indeed, its pattern of zero is related to the directed acyclic graph (DAG) of the directed graphical model associated to this ordering (see Section 2.1 for a definition). More generally, it may be worth using this strategy even if one does not know a “good” ordering on the variables. On the one hand, most of the procedures based on the estimation of are computationally faster than their counterpart based on the estimation of . This is due to the decomposition of the likelihood into independent terms explained in Section 3. On the other hand, there exist examples of sparse Cholesky factor such that the precision matrix is not sparse at all. Consider for instance a matrix which is zero except on the diagonal and on the last line. Admittedly, it is not completely satisfying to apply a method that depends on the ordering of the variables when we do not know a good ordering. There are indeed examples of sparse precision matrices such that for a bad ordering, the Cholesky factor is not sparse at all (see [27] Sect.4). Nevertheless, if sparse precision matrices and sparse Cholesky factors have different approximation capacities, it remains still unclear which one should be favored.

In the sequel, we call the issue of estimating in the class of sparse lower triangular matrices the complete graph selection problem by analogy to the complete variable estimation problem in regression problems. In this setting, Huang et al. [15] propose to add an penalty on the elements of . More recently, Lam and Fan [19] have extended the method to other types of penalty and have proved its consistency in the Frobenius norm if the matrix is exactly sparse. To finish, let us mention that Wagaman and Levina [30] have developed a data-driven method based on the isomap algorithm for picking a “good” ordering on the variables.

In this paper, we consider both the banding problem and the complete graph selection problem. We introduce a general penalization method based on maximum likelihood for estimating the matrices and . We exhibit a non-asymptotic oracle inequality with respect to the Kullback loss without any assumption on the target .

For the adaptive banding issue, our method is shown to achieve the optimal rate of convergence and is adaptive to the rate of decay of the entries of when one moves away from the diagonal. Corresponding minimax lower bounds are also provided. We also compute asymptotic rates of convergence in the Frobenius norm. Contrary to the penalization methods, we explicitly provide the constant for tuning the penalty. Finally, the method is computationally efficient.

For complete graph selection, we prove that our estimator non-asymptotically achieves the optimal rates of convergence when is sparse. We also provide the corresponding minimax lower bounds. To our knowledge, this minimax lower bounds with respect to the Kullback discrepansy are also new. Moreover, our method is flexible and allows to integrate some prior knowledge on the graph. However, this procedure is computationally intensive which makes it infeasible for larger than . This is why we introduce in Section 7 a computationally faster version of the estimator by applying a two-stage procedure. This method inherits some of the good properties of the previous method and applies for arbitrarily large . Moreover, it is shown to select consistently the pattern of zeros under weaker assumptions than the Lasso. These theoretical results are corroborated by a simulation study.

Since data analysis methods like LDA are based on likelihood we find more relevant to obtain rates of convergence with respect to the Kullback-Leibler loss than Frobenius rates of convergence. Moreover, considering Kullback loss allows us to obtain rates of convergence which are free of hidden dependency on parameter such as the largest eigenvalue of . In this sense, we argue that this loss function is more natural for the statistical problem under consideration.

The paper is organized as follows. Section 2 gathers some preliminaries about the Cholesky decomposition and introduces the main notations. In Section 3, we describe the procedure and provide an algorithm for computing the estimator . In Section 4, we state the main result of the paper, namely a general non-asymptotic oracle type inequality for the risk of . In Section 5, we specify our result to the problem of adaptive banding. Moreover, we prove that our so-defined estimator is minimax adaptive to the decay of the off-diagonal coefficients of the matrix . Asymptotic rates of convergence with respect to the Frobenius norm are also provided. In Section 6, we investigate the complete graph selection issue. We first derive a non-asymptotic oracle inequality and then derive that our procedure is minimax adaptive to the unknown sparsity of the Cholesky factor . As previously, we provide asymptotic rates of convergence with respect to the Frobenius loss function. Moreover, we introduce a computationally feasible estimation procedure in Section 7 and we derive an oracle-type inequality and sufficient condition for consistent selection of the graph. In Section 8, the performances of the procedure are assessed on numerical examples for both the banding and the complete graph selection problem. We make a few concluding remarks in Section 9. Sketch of the proof are in Section 10, while the details are postponed to the technical Appendix [29].

## 2 Preliminaries

### 2.1 Link with conditional regression and graphical models

In this subsection, we review basic properties about Cholesky factors and explain their connection with directed graphical models.

We consider the estimation of the vector of size which follows a centered normal distribution with covariance matrix . We always assume that is non-singular. We recall that the precision matrix uniquely decomposes as where is a lower triangular matrix with unit diagonal and is a diagonal matrix. Let us first emphasize the connection between the modified Cholesky factor and conditional regressions. For any between and we note the vector of size made of the -th first elements of the th-line of . By convention is the vector of null size. Besides, we note the -th diagonal element of the matrix . Let us define the vector of size as . By standard Gaussian properties, the covariance matrix of is . Since the diagonal of is one, it follows that for any

 X[i]=i−1∑j=1−ti[j]X[j]+ϵi , (1)

where and the are independent.

Let be a directed acyclic graph who vertex set is . We assume that the direction of the edges is compatible with the natural ordering of . In other words, we assume that any edge in satisfies . Given a vertex , the set of its parents is defined by:

 pa→G(i):={j

Then, the vector is said to be a directed Gaussian graphical model with respect to if for any such that , is independent of conditionally to . This means that only the variables are relevant for predicting among the variables . There are several definitions of directed Gaussian graphical model (see Lauritzen [20]), which are all equivalent when is non-singular.

There exists a correspondence between the graph and the Cholesky factor of the precision matrix . If is a directed graphical model with respect to , then for any such that . Conversely, is a directed graphical model with respect to the graph defined by if and only . Hence, it is equivalent to estimate the pattern of zero of and the minimal graph compatible with the ordering.

These definitions and properties depend on a particular ordering of the variables. It is beyond the scope of this paper to discuss the graph estimation when the ordering is not fixed. We refer the interested reader to Kalisch and Bühlmann [18].

### 2.2 Notations

For any set , stands for its cardinality. We are given independent observations of the random vector . We always assume that follows a centered Gaussian distribution . In the sequel, we note the matrix of the observations. Moreover, for any and any subset of , and respectively refer to the vector of the observations of and to the matrix of the observations of .

In the sequel, stands for the Kullback divergence between the centered normal distribution with covariance and the centered normal distribution with covariance . We shall also sometimes assess the performance of the procedures using the Frobenius norm and the operator norm. This is why we respectively define and as the Frobenius norm and the operator norm of the matrix . For any matrix , stands for the largest eigenvalue of . Finally, , , , denote universal constants that can vary from line to line. The notation specifies the dependency on some quantities.

## 3 Description of the procedure

In this section, we introduce our procedure for estimating given a -sample of the vector . For any between and , stands for a subset of . By convention, . In terms of directed graphs, stands for the set of parents of . Besides, we call any set of the form a model. This model is one to one with a directed graph whose ordering is compatible with the natural ordering of . We shall sometimes call a graph in order to emphasize the connection with graphical models.

Given a model , we define as the affine space of lower triangular matrices with unit diagonal such for any between and , the support (i.e. the non-zero coefficients) of is included in . We note the set of all diagonal matrices with positive entries on the diagonal. The matrices and are then defined as the maximum likelihood estimators of and

 (ˆTm,ˆSm)=argminT′∈Tm, S′∈Diag(p)Ln(T,S):=12tr[T∗S−1T¯¯¯¯¯¯¯¯¯¯¯¯X∗X]+12log|S| (2)

Here, stands for the negative -likelihood. Hence, the estimated precision matrix is . This matrix is the maximum likelihood estimator of among the precision matrices which correspond to directed graphical models with respect to the graph .

For any between and , refers to a collection of subsets of and we call a collection of models (or graphs). The choice of the collection depends on the estimation problem under consideration. For instance, we shall use a collection corresponding to banded matrices when we will consider the banding problems. The collections are specified for the banding problem and the complete graph selection problem in Sections 5 and 6.

Our objective is to select a model such that the Kullback-Leibler risk is as small as possible. We achieve it through penalization. For any , is a positive function that we shall explicitly define later. The penalty function is defined as . Then, we select a model that minimizes the following criterion

 ˆm:=argminm∈M2Ln(ˆTm,ˆSm)+pen(m)=argminm∈Mtr[ˆΩm¯¯¯¯¯¯¯¯¯¯¯¯X∗X]−log|ˆΩm|+pen(m)

For short, we write , , and .

As mentioned earlier, the idea underlying the use of the matrices and lies in the regression models (1). Indeed, these regressions naturally appear when deriving the negative log-likelihood (2):

 2Ln(T′,S′)=p∑i=1s′−1i∥Xi+X

where stands for the Euclidean norm in divided by . By definition of and , we easily derive that the -th row vector of and the -th diagonal element of respectively equal

 (3)

for any . Here, stands for the support of . Hence, the row vector is the least-squares estimator of in the regression model (1) and is the empirical conditional variance of given . There are two main consequences: first, Expression (3) emphasizes the connection between covariance estimation and linear regression in a Gaussian design. Second, it highly simplifies the computational cost of our procedure. Indeed, the negative log-likelihood now writes

 Ln(ˆTm,ˆSm)=12p∑i=1[log(ˆsi,mi)+1] .

and it follows that . This is why we suggest to compute and as follows. Assume we are given a collection of graphs and penalty functions .

###### Algorithm 3.1.
Computation of and . For going from to , Compute for each model . Take . Set and built by gathering the estimators . Take .

In what follows, we refer to this method as ChoSelect. In order to select , one needs to compute all for any and any model . Hence, the complexity of the procedure is proportional to . We further discuss computational issues and we provide a faster procedure in Section 7.

## 4 Risk analysis

In this section, we first provide a bias-variance decomposition for the Kullback risk of the parametric estimator . Afterwards, we state a general non-asymptotic risk bound for .

### 4.1 Parametric estimation

Let be model in . Let us define the matrix as the best approximation of that corresponds to the model . The matrices and are defined as the minimizers in and of the Kullback loss with

 (Tm,Sm):=argminT′∈Tm, S′∈Diag(p)K(Ω;T′∗S′−1T′)

We note .

We define the conditional Kullback-Leibler divergence of the distribution of given by

 K(ti,si;t′i,s′i):=E{K[Pti,si(Xi|X

where stands for the conditional distribution of given with parameters . Applying the chain rule, we obtain that . Consequently, we analyze the Kullback risk by controlling each conditional risk . Let us define and as the projections of on the space associated to the model with respect to the Kullback divergence . In other words, and satisfy

Applying the chain rule, we check that corresponds to -th first elements of the -th line of and is the -th diagonal element of . Thanks to the previous property, we derive a bias-variance decomposition for the Kullback risk .

###### Proposition 4.1.

Assume that is smaller than . The Kullback risk of decomposes as follows

 E[K(ti,si;ˆti,mi,ˆsi,mi)]=K(ti,si;ti,mi,si,mi)+Rn,|mi| , (5)

where is defined by

 Rn,d:=d+1n−d−2+d(d+1)2(n−d−1)(n−d−2)+12[Ψ(n−d)+log(1−dn)] ,

and . Besides, is bounded as follows

 d+12(n−d−2)≤Rn,d≤d+1n−d−2+12[d+1n−d−2]2 and Rn,d=d+12(n−d−2)+O(d+1n)2 .

An explicit expression of is provided in the proof. Applying the chain rule, we then derive a bias-variance decomposition for the maximum likelihood estimator .

###### Corollary 4.2.

Let be a model such that the size of each submodel is smaller than . Then, the Kullback risk of the maximum likelihood estimator decomposes into

 E[K(Ω;ˆΩm)]=K(Ω;Ωm)+p∑i=1Rn,|mi| .

If the size of each submodels is small with respect to , the variance term is of the order . For other loss functions such as the Frobenius norm or the operator norm between and , there is no such bias-variance decomposition with a variance term that does not depend on the target.

### 4.2 Main result

In this subsection, we state a general non-asymptotic oracle inequality for the Kullback-Leibler risk of the estimator . We shall consider two types of penalty function : the first one only takes into account the complexity of the model collection while the second is based on a prior probability on the model collection.

###### Definition 4.3.

For any integer between and , the complexity function is defined by

 Hi(d):=1dlog|{m∈Mi,|mi|=d}| ,

where is any integer larger or equal to . Besides, is set to for any between and .

These functions are analogous to the complexity measures introduced in [9] Sect.1.3 or in [28] Sect.3.2. We shall obtain an oracle inequality for complexity-based penalties under the following assumption.

Assumption : Given and , the collection and the number satisfy

 ∀ 2≤i≤p ,∀mi∈Mi ,|mi|n−|mi|[1+√2Hi(|mi|)]2≤η<η(K) , (6)

where is defined as . The function is positive and increases to one with . This condition requires that the size of the collection is not too large. Assumption () is similar to the assumption made in [28] Sect 3.1 for obtaining an oracle inequality in the linear regression with Gaussian design framework. We further discuss in Sections 5 and 6 when considering the particular problems of ordered and complete variable selection.

###### Theorem 4.4.

Let and let . Assume that is larger than some quantity only depending on and that the collection satisfies . If the penalty is lower bounded as follows

 peni(mi)≥K|mi|n−|mi|(1+√2Hi(|mi|))2for any 1≤i≤p and mi∈Mi , (7)

then the risk of is upper bounded by

 E[K(Ω;˜Ω)]≤LK,ηinfm∈M[K(Ω;Ωm)+pen(m)+pn]+τn , (8)

where is defined by

and is positive. Here, stands for the identity matrix of size .

###### Remark 4.1.

This theorem tells us that performs almost as well as the best trade-off between the bias term and the penalty term . The term is unavoidable since it is of the same order as the variance term for the null model by Corollary 4.2. The error term is considered as negligible since converges exponentially fast to with .

###### Remark 4.2.

The result is non-asymptotic and holds for arbitrary large as longs is larger than the quantity (independent of ). There is no hidden dependency on except in the complexity functions and Assumption that we shall discuss for particular cases in Sections 5.1 and 6.1. Besides, we are not performing any assumption on the true precision matrix except that it is invertible. In particular, we do not assume that it is sparse and we give a rate of convergence that only depends on a bias variance trade-off. Besides, there is no hidden constant that depends on (except for ).

###### Remark 4.3.

Finally, the penalty introduced in this theorem only depends on the collection and on a number . One chooses the parameter depending on how conservative one wants the procedure to be. We further discuss the practical choice of in Sections 5 and 6. In any case, the main point is that we do not need any additional method to calibrate the penalty.

### 4.3 Penalties based on a prior distribution

The penalty defined in Theorem 4.4 only depends on the models through their cardinality. However, the methodology developed in the proof easily extend to the case where the user has some prior knowledge of the relevant models.

Suppose we are give a prior probability measure on the collection . For any non-empty model , we define by

 ∀ 2≤i≤p ,∀mi∈Mi ,l(i)mi:=−log(πMi(mi))|mi| . (9)

By convention, we set to . We define in the next proposition penalty functions based on the quantity that allow to get non-asymptotic oracle inequalities.

Assumption : Given and , the collection , the numbers and the number satisfy

 ∀ 2≤i≤p ,∀mi∈Mi ,|mi|n−|mi|[1+√2l(i)mi]2≤η<η(K) , (10)

where is defined as in .

###### Proposition 4.5.

Let and let . Assume that and that Assumption is fulfilled. If the penalty is lower bounded as follows

 peni(mi)≥K|mi|n−|mi|(1+√2l(i)mi)2for any 1≤i≤p and any mi∈Mi , (11)

then the risk of is upper bounded by

 E[K(Ω;˜Ω)]≤LK,ηinfm∈M[K(Ω;Ωm)+pen(m)+pn]+τn , (12)

where and are the same as in Theorem 4.4.

The proof is postponed to the technical Appendix [29].

###### Remark 4.4.

In this proposition, the penalty (11) as well as the risk bound (12) depend on the prior distribution . In fact, the bound (12) means that achieves the trade-off between the bias and some prior weight, which is of the order This emphasizes that favours models with a high prior probability. Similar risk bounds are obtained in the fixed design regression framework in Birgé and Massart [8].

###### Remark 4.5.

Roughly speaking, Assumption () requires that the prior probabilities are not exponentially small with respect to .

In this section, we apply our method ChoSelect to the adaptive banding problem and we investigate its theoretical properties.

### 5.1 Oracle inequalities

Let be some fixed positive integer which stands for the largest dimension of the models . For any , we consider the ordered collections

 Mdi,ord:={∅,{1},{1,2},…,{1∧(i−d),…,i−1}} ,

and . A model in the collection corresponds to the set of matrices such that on each line of , only the closest entries to the diagonal are possibly non-zero. This collection of models is suitable when the matrix is approximately banded.

For any and any model in we fix the penalty

 peni(mi)=K|mi|n−|mi| . (13)

We write for the estimator defined with the collection and the penalty (13).

###### Corollary 5.1.

Let , smaller than . Assume that . If is larger than some quantity , then

 E[K(Ω;˜Ωd\emph{ord})]≤LK,ηinfm∈Md\emph{ord}E[K(Ω;ˆΩm)]+τn(Ω,K,η,n,p) . (14)

This bound is a direct application of Theorem 4.4.

###### Remark 5.1.

The term is defined in Theorem 4.4 and is considered as negligible since it converges to exponentially fast towards . Hence, the penalized estimator achieves an oracle inequality without any assumption on the target .

###### Remark 5.2.

This oracle inequality is non-asymptotic and holds for any and any larger than . Moreover, by choosing a constant large enough, one can consider a maximal dimension of model up to the order of , because converges to one when increases.

Choice of the parameters and . Setting to gives a criterion close to (see for instance [24]). Besides, Verzelen [28] (Prop.3.2) has justified in a close framework this choice of is asymptotically optimal. A choice of is advised if one wants a more conservative procedure. We have stated Corollary 5.1 for models of size smaller than . In practice, taking the size yields rather good results even if it is not completely ensured by the theory.

Computational cost. The procedure is fast in this setting. Indeed, its complexity is the same as times the complexity of an ordered variable selection in a classical regression framework. From numerical comparisons, it seems to be slightly faster than the methods of Bickel and Levina [6] and Levina et al. [22] which require cross-validation type strategies.

### 5.2 Adaptiveness with respect to ellipsoids

We now state that the estimator is simultaneously minimax over a large class of sets that we call ellipsoids.

###### Definition 5.2.

Let be a non-increasing sequence of positive numbers such that and let be a positive number. Then, the set is made of all the non-singular matrices where is in and is a lower triangular matrix with unit diagonal that satisfies the following property

 (15)

By convention, we set . The sequence measures the rate of decay of each line of when one moves away the diagonal. Observe that in this definition, every line of decreases the same rate. To the price of more technicity, we can also allow different rates of decay for each line of . We shall restrict ourselves to covariance matrices with eigenvalues that lie in a compact when considering the ellipsoid

 (16)
###### Proposition 5.3.

For any ellipsoid , the minimax rates of estimation is lower bounded by

 infˆΩsupΩ∈E(a,R,p)E[K(Ω;ˆΩ)]≥Lpsupk=1,…,⌊√n⌋(R2a2k∧k+1n) . (17)

Let us consider the estimator defined in Section 5.1 with and the penalty (13). We also fix . If the sequence and also satisfy and , then

 supΩ∈E(a,R,p)∩Bop(γ)E[K(Ω;˜Ωd\emph{Co% })]≤LK,η,β,γinfˆΩsupΩ∈E(a,R,p)∩Bop(γ)E[K(Ω;ˆΩ)] , (18)

if is larger than

###### Remark 5.3.

The minimax rates of convergence over in the lower bound (17) is similar to the one obtained for classical ellipsoids in the Gaussian fixed design regression setting (see for instance [23] Th. 4.9). We conclude from the second result that our estimator is minimax adaptive to the ellipsoids that are not degenerate (i.e. ) and whose rates does not converge too slowly towards zero (i.e. ). Note that all the sequences such that satisfy the last assumption.

###### Remark 5.4.

However, the estimator is not adaptive to the parameter since the constant in (18) depends on . This is not really surprising. Indeed, the oracle inequality (14) is expressed in terms of the Kullback loss while the ellipsoids are defined in terms of the entries of . If we would have considered the minimax rates of estimation over sets analogous to but defined in terms of the decay of the Kullback bias, then we would have obtained minimax adaptiveness without any condition on the eigenvalues.

We are also able to prove asymptotic rates of convergence and asymptotic minimax properties with respect to the Frobenius loss function. For any , we define the ellipsoid as the ellipsoid with the sequence .

###### Corollary 5.4.

If and is smaller than then uniformly over the set ,

 ∥Ω−˜Ωd\emph{ord}∥2F=OP(∑pni=1ki+pnn) (19)

If , then uniformly over the set , the estimator satisfies

 ∥Ω−˜Ωd\emph{ord}∥2F=OP⎡⎣pn⎛⎝(Rns)22s+1∧pnn⎞⎠⎤⎦ . (20)

Moreover, these two rates are optimal from a minimax point of view.

The estimator achieves the minimax rates of estimation over special cases of ellipsoids. However, all these results depend on and are of asymptotic nature.

## 6 Complete graph selection

We now turn to the complete Cholesky factor estimation problem. First, we adapt the model selection procedure ChoSelect to this setting. Then, we derive an oracle inequality for the Kullback loss. Afterwards, we state that the procedure is minimax adaptive to the unknown sparsity both with respect to the Kullback entropy and the Frobenius norm. Finally, we discuss the computational complexity and we introduce a faster two-stage procedure.

### 6.1 Oracle inequalities

Again, is a positive integer that stands for the maximal size of the models . We consider the collections of models that contain all the subsets of of size smaller or equal to . A model corresponds to a pattern of zero in the Cholesky factors . As explained in Section 2, such a model is also in correspondence with an ordered graph which is compatible with the ordering. Hence, the collection is in correspondence with the set of ordered graphs of degree smaller than which are compatible with the natural ordering of .

For any and any model in we fix the penalty

 peni(mi)=log⎡⎢⎣1+K|mi|n−|mi|⎧⎨⎩1+ ⎷2[1+log(i−1|mi|)]⎫⎬⎭2⎤⎥⎦ , (21)

where . In the sequel, corresponds to the estimator ChoSelect with the collection and the penalty (21).

###### Corollary 6.1.

Let and (defined in the proof). Assume that

 d≤ηn1+[log(p/d)∨0] . (22)

If is larger than some quantity , then satisfies

 E[K(Ω;˜Ωd\emph{co})] ≤ LK,ηinfm∈Md\emph{co}{K(Ω;Ωm)+p∑i=2|mi|n−|mi|[1+log(i−1|mi|)]+pn} (23) + τ′n ,

where the remaining term is of the same order as in Theorem 4.4.

A proof is provided in Section 10.3. We get an oracle inequality up to logarithms factors, but we prove in Section 6.2 that these terms are in fact unavoidable. For the sake of clarity, we straightforwardly derive from (23) the less sharp but more readable upper bound

 E[K(Ω;˜Ωdco)]≤LK,ηinfm∈Mdco{K(Ω;Ωm)+p+|m|logpn}+τn(Ω,K,η,n,p) ,

where .

###### Remark 6.1.

As for the previous results, we do not perform any assumption on the target and the obtained upper bound is non-asymptotic. By Condition (22), we can consider dimension up to the order . If is much larger than , the maximal dimension has to be smaller than the order . This is not really surprising since it is also the case for linear regression with Gaussian design as stated in [28] Sect. 3.2. There is no precise results that proves that this bound is optimal but we believe that it is unimprovable. If is of the same order as , it is possible to consider dimensions up to the same order as .

###### Remark 6.2.

The same bound (23) holds if we use the penalty

 pen′i(mi)=K|mi|n−|mi|⎧⎨⎩1+ ⎷2[1+log(i−1|mi|)]⎫⎬⎭2 .

For a given , observe that . Hence, these two penalties are equivalent when is large. In Corollary 6.1, we have privileged a logarithmic penalty, because this penalty gives slightly better results in practice.

Choice of and . In practice, we set the maximal dimension to . Concerning the choice of , we advise to use the value , if the goal is to minimize risk. When the goal is to estimate the underlying graph, one should use a larger value of like in order to decrease the proportion of falsely discovered vertices.

### 6.2 Adaptiveness to unknown sparsity

In this section, we state that the estimator achieves simultaneously the minimax rates of estimation for sparsity of the matrix . In the sequel, stands for the set of positive square matrices of size such that its Cholesky factor contains at most non-zero off-diagonal coefficients on each line. The set contains the precision matrices of the directed Gaussian graphical models whose underlying directed acyclic graph satisfies the two following properties:

• It is compatible with the ordering on the variables.

• Each node of has at most parents.

We shall also consider the set that contains positive square matrices whose whose Cholesky factor is -sparse (i.e. contains at most non-zero elements). Hence, the set corresponds to the precision matrices of the directed Gaussian graphical models whose underlying directed acyclic graph is compatible with the ordering on the variables and has at most edges. When belongs to with “small”, we say that the underlying Cholesky factors are ultra-sparse.

For deriving the minimax rates of estimation, we shall restrict ourselves to precision matrices whose Kullback divergence with the identity is not too large. This is why we define

 BK(r):={Ω s.t. K(Ω;Ip)≤pr} ,

for any positive number .

###### Proposition 6.2.

Let and be two positive integers such that . The minimax rates of estimation over the sets and are lower bounded as follows

 infˆΩsupΩ∈U1[k,p]EΩ[K(Ω;ˆΩ)] ≥ Lkp1+log(p/k)n , if n≥Lk2[1+log(p/k)] , (24) infˆΩsupΩ∈U2[k,p]EΩ[K(Ω;ˆΩ)] ≥ Lp+klog(p)n , if k≤p. (25)

Consider , , and . Assume that