Abstract
We propose a methodology for estimation of sparse precision matrices and for statistical inference about their low-dimensional parameters in a high-dimensional setting where the number of parameters can be much larger than the sample size. We show that the novel estimator achieves minimax rates in supremum norm and that the low-dimensional components of the estimator have a Gaussian limiting distribution. These results hold uniformly over the class of precision matrices with row sparsity of small order and uniformly bounded spectrum, under a sub-Gaussian tail assumption on the margins of the true underlying distribution. Consequently, our results lead to uniformly valid confidence regions for low-dimensional parameters of the precision matrix. Thresholding the estimator leads to variable selection without imposing irrepresentability conditions. The performance of the method is demonstrated in a simulation study and on real data.
\keywords{precision matrix, sparsity, inference, asymptotic normality, confidence regions}
\subclass{62J07, 62F12}
\section{Introduction}
We consider the problem of estimation of the inverse covariance matrix in a high-dimensional setting, where the number of parameters $p$ can significantly exceed the sample size $n$.
Suppose that we are given an $n \times p$ design matrix $X$, where the rows $X_1,\dots,X_n$ of $X$ are $p$-dimensional i.i.d. random vectors from an unknown distribution with mean zero and covariance matrix $\Sigma_0$.
We denote the precision matrix by $\Theta_0 := \Sigma_0^{-1}$, assuming the inverse of $\Sigma_0$ exists.
The problem of estimating the precision matrix arises in a wide range of applications. Precision matrix estimation in particular plays an important role in graphical models that have become a popular tool for representing dependencies within large sets of variables.
Suppose that we associate the variables $X_{1,1},\dots,X_{1,p}$ with the vertex set $V = \{1,\dots,p\}$ of an undirected graph $G = (V,E)$ with an edge set $E$. A graphical
model represents the conditional dependence relationships between the variables: every pair of variables not contained in the edge set is conditionally independent given all the remaining variables. If the vector $X_1$ is normally distributed, each
edge corresponds to a nonzero entry in the precision matrix \citep{lauritzen}.
Practical examples of applications of graphical modeling include modeling of brain connectivity
based on fMRI brain analysis \citep{fmri}, genetic networks, financial data processing, social network analysis
and climate data analysis.
A lot of work has been done on methodology for point estimation of precision matrices. We discuss some of the approaches below, but a selected list of papers includes for instance \citet{buhlmann,glasso,bickel2,yuan2,cai,sunzhang}.
A common approach assumes that the precision matrix is sufficiently sparse and employs the $\ell_1$ penalty to induce a sparse estimator. The main goal of these works is to show that, under some regularity conditions, the sparse estimator behaves almost as well as the oracle estimator that has knowledge of the true sparsity pattern.
Our primary interest in this paper lies not in point estimation; rather, we aim to quantify the uncertainty of estimation by providing interval estimates for the entries of the precision matrix.
The challenge of this problem arises because the asymptotic distribution of regularized estimators, which are the main tool in high-dimensional estimation, is not easily tractable \citep{knight2000}, as opposed to the classical setting where the dimension of the unknown parameter is fixed.
\subsection{Overview of related work}
Methodology for inference in high-dimensional models has mostly been studied in the context of linear and generalized linear regression models.
From the work on linear regression models, we mention the paper [zhang], where a semiparametric projection approach using the Lasso methodology \citep{lasso} was proposed,
which was further developed and studied in [vdgeer13]. The approach leads to asymptotically normal estimation of the regression coefficients, and an extension of the method to generalized linear models is given in [vdgeer13].
The method requires sparsity of small order in the high-dimensional parameter vector and uses the $\ell_1$-norm error bound of the Lasso.
Further alternative methods for inference in the linear model have been proposed and studied in [stanford1] and [belloni1], and a bootstrapping approach was suggested in [bootstrap1] and [bootstrap2].
Other lines of work on inference for high-dimensional models suggest post-model-selection procedures, where in the first step a regularized estimator is used for model selection and in the second step e.g. a maximum likelihood estimator is applied to the selected model.
In the linear model, simple post-model-selection methods have been proposed e.g. in [gausslasso] and [candes2007].
These approaches are, however, only guaranteed to work under irrepresentability
and beta-min conditions (see [hds]). Especially in view of inference, beta-min conditions, which assume that the nonzero parameters are sufficiently large in absolute value, should be avoided.
In this paper, we consider estimation of precision matrices, which is a problem related to linear regression; however, it is a nonlinear problem and thus requires a more involved treatment.
One approach to precision matrix estimation is based on $\ell_1$-penalized maximum likelihood estimation. This approach is typically referred to as the graphical Lasso and has been studied in detail in several papers, see e.g. [glasso], [rothman], [ravikumar] and [yuan].
Another common approach to precision matrix estimation is based on projections. This approach reduces the problem to a series of regression problems and estimates each column of the precision matrix using a Lasso estimator or the Dantzig selector \citep{candes2007}. The idea was first introduced in [buhlmann] as neighbourhood selection for Gaussian graphical models and was further studied in [yuan2], [cai] and [sunzhang].
Methodology leading to statistical inference for the precision matrix has been studied only recently. The work [zhou] proposes a more involved variation of the regression approach to obtain an estimator which leads to statistical inference. This approach yields an estimator of the precision matrix which is element-wise asymptotically normal, under row sparsity of small order, bounded spectrum of the true precision matrix and a Gaussian distribution of the sample. The paper [jvdgeer14] proposes a method for statistical inference based on the graphical Lasso: it introduces a desparsified estimator which is also shown to be element-wise asymptotically normal.
\subsection{Contributions and outline}
We propose a methodology leading to honest confidence intervals and tests for low-dimensional parameters of the precision matrix, without requiring irrepresentability or beta-min conditions to hold.
Our work is motivated by the semiparametric approach in [vdgeer13] and is a follow-up to the work [jvdgeer14].
Compared to the previous work on statistical inference for precision matrices, this methodology has several advantages.
Firstly, the estimator
we propose is a simple modification of the nodewise Lasso estimator proposed in [buhlmann]. Hence the estimator is easy to implement and efficient computational implementations are available.
Secondly, the novel estimator enjoys a range of optimality properties and leads to statistical inference under mild conditions.
In particular, the asymptotic distribution of low-dimensional components of the estimator is shown to be Gaussian.
This holds uniformly over the class of precision matrices with row sparsity of small order, uniformly bounded spectrum and sub-Gaussian margins of the underlying distribution.
This results in honest confidence regions \citep{li} for low-dimensional parameters.
The proposed estimator achieves rate optimality as shown in Section Document.
Moreover, the desparsified estimator may be thresholded to guarantee variable selection without imposing irrepresentability
conditions. The computational cost of the method is of the order of $p$ Lasso regressions for estimation of all parameters, and two Lasso regressions for a single parameter.
The paper is organized as follows.
Section 2 introduces the methodology. Section 3 contains the main theoretical results for estimation and inference, and in Section 3.4 the suggested method is applied to variable selection.
Section 4 provides a comparison with related work.
Section 5 illustrates the theoretical results in a simulation study.
In Section 6, we analyze two real datasets and apply our method to variable selection. Section 7 contains a brief summary of the results. Finally, the proofs are deferred to the Supplementary material.
Notation. For a vector $v$ and $q \geq 1$, we use $\|v\|_q$ to denote the $\ell_q$-norm of $v$ in the classical sense. We denote $\|v\|_\infty := \max_j |v_j|$. For a matrix $A$ we use analogous notation for the corresponding entrywise and induced matrix norms.
The symbol $\mathrm{vec}(A)$ denotes the vectorized version of a matrix $A$, obtained by stacking the rows of $A$ on top of each other.
By $e_j$ we denote a $p$-dimensional vector of zeros with a one at position $j$.
For real sequences $f_n$ and $g_n$, we write $f_n = \mathcal{O}(g_n)$ if $|f_n| \leq C |g_n|$ for some constant $C > 0$ independent of $n$ and all $n$.
We write $f_n \asymp g_n$
if both $f_n = \mathcal{O}(g_n)$ and $g_n = \mathcal{O}(f_n)$ hold.
Finally, $f_n = o(g_n)$ if $f_n/g_n \rightarrow 0$. Furthermore, for a sequence of random variables $X_n$ we write $X_n = \mathcal{O}_P(g_n)$
if $X_n/g_n$ is bounded in probability, and we write $X_n = o_P(g_n)$ if $X_n/g_n$ converges in probability to zero.
Let $\rightsquigarrow$ denote convergence in distribution and $\stackrel{P}{\rightarrow}$ convergence in probability.
Let $\Phi$ denote the cumulative distribution function of a standard normal random variable.
By $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ we denote the minimum and maximum eigenvalue of a matrix $A$, respectively.
Let , denote respectively.
We use the letters $c$ and $C$ to denote universal constants; these are used repeatedly in the proofs and may denote different constants at different occurrences.
\section{Desparsified nodewise Lasso}
Our methodology is a simple modification of the nodewise Lasso estimator proposed in [buhlmann]. The idea is to remove the bias term which arises in the nodewise Lasso estimator due to the penalty regularization. This approach is inspired by the literature on semiparametric statistics [bickelbook, vdv]. We note that several papers have used this idea in the context of high-dimensional sparse estimation, see [zhang, vdgeer13, vdgeer14, stanford1, jvdgeer14].

We first summarize the nodewise Lasso method introduced in [buhlmann] and discuss some of its properties. This method estimates an unknown precision matrix using the idea of projections to approximately invert the sample covariance matrix. For each $j = 1,\dots,p$ we define the vector $\gamma_j \in \mathbb{R}^{p-1}$ as follows
\begin{equation}
\gamma_j := \arg\min_{\gamma \in \mathbb{R}^{p-1}} \mathbb{E}\big(X_{1,j} - X_{1,-j}\gamma\big)^2,
\end{equation}
and denote the entries of $\gamma_j$ by $\gamma_{j,k}$, $k \neq j$, and the noise level by $\tau_j^2 := \mathbb{E}\big(X_{1,j} - X_{1,-j}\gamma_j\big)^2$. We define the column vector $C_j \in \mathbb{R}^p$ by $C_{j,j} := 1$ and $C_{j,k} := -\gamma_{j,k}$ for $k \neq j$. Then one may show
\begin{equation}
\Theta_{0,j} = C_j / \tau_j^2,
\end{equation}
where $\Theta_{0,j}$ is the $j$-th column of $\Theta_0$. Hence the precision matrix may be recovered from the partial correlations $\gamma_j$ and from the noise levels $\tau_j^2$. In our problem, we are only given the design matrix $X$. The idea of the nodewise Lasso is to estimate the partial correlations and the noise levels by doing a projection of every column of the design matrix on all the remaining columns. In low-dimensional settings, this procedure would simply recover the inverse of the sample covariance matrix $\hat\Sigma := X^T X / n$. However, due to the high-dimensionality of our setting, the matrix $\hat\Sigma$ is not invertible and we can only do approximate projections. If we assume sparsity in the precision matrix (and thus also in the partial correlations), this idea can be effectively carried out using the Lasso. Hence, for each $j = 1,\dots,p$ we define the estimators of the regression coefficients, $\hat\gamma_j$, as follows
\begin{equation}\label{lasso.nw}
\hat\gamma_j := \arg\min_{\gamma \in \mathbb{R}^{p-1}} \Big( \|X_j - X_{-j}\gamma\|_2^2 / n + 2\lambda_j \|\gamma\|_1 \Big).
\end{equation}
We further define the column vectors $\hat C_j \in \mathbb{R}^p$ by $\hat C_{j,j} := 1$ and $\hat C_{j,k} := -\hat\gamma_{j,k}$ for $k \neq j$,
and estimators of the noise level
\[
\hat\tau_j^2 := \|X_j - X_{-j}\hat\gamma_j\|_2^2 / n + \lambda_j \|\hat\gamma_j\|_1,
\]
for $j = 1,\dots,p$. Finally, we define the $j$-th column of the nodewise Lasso estimator $\hat\Theta$ as
\begin{equation}
\hat\Theta_j := \hat C_j / \hat\tau_j^2 .
\end{equation}
The estimator $\hat\Theta$ of the precision matrix was studied in several papers (following [buhlmann]) and has been shown to enjoy oracle properties under mild conditions on the model. These conditions include a bounded spectrum of the precision matrix, row sparsity of small order and a sub-Gaussian distribution of the rows of $X$ (alternatively to sub-Gaussianity, one may assume that the covariates are bounded, as in [vdgeer13]). Our approach uses the nodewise Lasso estimator as an initial estimator. The next step involves debiasing or desparsifying, which may be viewed as one step of the Newton-Raphson scheme for numerical optimization. This is equivalent to ``inverting'' the Karush-Kuhn-Tucker (KKT) conditions by the inverse of the Fisher information, as in [vdgeer13]. A challenge then also comes from the need to estimate the Fisher information matrix, which is itself high-dimensional. We show that the nodewise Lasso estimator can be used in a certain way to create a surrogate of the inverse Fisher information matrix. Since the estimator $\hat\gamma_j$ can be characterized by its KKT conditions, it is convenient to work with these conditions to derive the new desparsified estimator. Consider hence the KKT conditions for the optimization problem \eqref{lasso.nw}:
\begin{equation}
X_{-j}^T (X_j - X_{-j}\hat\gamma_j)/n = \lambda_j \hat\kappa_j
\end{equation}
for $j = 1,\dots,p$, where $\hat\kappa_j$ is an element of the subdifferential of the function $\gamma \mapsto \|\gamma\|_1$ at $\hat\gamma_j$, i.e.
$\hat\kappa_{j,k} = \mathrm{sign}(\hat\gamma_{j,k})$ if $\hat\gamma_{j,k} \neq 0$ and $\hat\kappa_{j,k} \in [-1,1]$ otherwise. If we define $\tilde\kappa_j \in \mathbb{R}^p$ to be the vector with $\tilde\kappa_{j,j} := 0$ and $\tilde\kappa_{j,k} := \hat\kappa_{j,k}$ for $k \neq j$,
then the KKT conditions may be equivalently stated as follows
\begin{equation}\label{mkkt}
\hat\Sigma \hat\Theta_j - e_j = \lambda_j \tilde\kappa_j / \hat\tau_j^2 ,
\end{equation}
where $\hat\Sigma := X^T X / n$ is the sample covariance matrix.
This is shown in Lemma 11 in the Supplementary material.
Consequently, the KKT conditions \eqref{mkkt} imply the bound $\|\hat\Sigma \hat\Theta_j - e_j\|_\infty \leq \lambda_j / \hat\tau_j^2$ for each $j$, which will be useful later.
Note that the KKT conditions may be equivalently summarized in matrix form as
\[
\hat\Sigma \hat\Theta - I = \tilde K \hat D^{-2},
\]
where the columns of $\tilde K$ are given by $\lambda_j \tilde\kappa_j$ for $j = 1,\dots,p$ and $\hat D^2$ is a diagonal matrix with elements $\hat\tau_1^2,\dots,\hat\tau_p^2$.
Multiplying the KKT conditions in this matrix form by $\hat\Theta^T$, we obtain
\[
\hat\Theta^T \hat\Sigma \hat\Theta - \hat\Theta^T = \hat\Theta^T \tilde K \hat D^{-2}.
\]
Then we note that adding $\hat\Theta - \Theta_0$ to both sides and rearranging, we get
\[
\hat\Theta + \hat\Theta^T - \hat\Theta^T \hat\Sigma \hat\Theta - \Theta_0 = (\hat\Theta - \Theta_0) - \hat\Theta^T \tilde K \hat D^{-2},
\]
where the right-hand side can be further decomposed into a pivot term $-\Theta_0(\hat\Sigma - \Sigma_0)\Theta_0$ and a remainder which can be shown to be of small order under certain conditions (Lemma Document). Hence we define the desparsified nodewise Lasso estimator
\begin{equation}\label{desparsified}
\hat T := \hat\Theta + \hat\Theta^T - \hat\Theta^T \hat\Sigma \hat\Theta .
\end{equation}
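To make the construction concrete, the following is a minimal sketch of the nodewise Lasso and the desparsifying step, assuming a generic Lasso solver (here scikit-learn's \texttt{Lasso} for illustration); the function names and the solver choice are ours and are not part of the methodology itself.
\begin{verbatim}
# Minimal sketch (our own illustration): nodewise Lasso columns Theta_j = C_j / tau_j^2
# and the desparsifying step T = Theta + Theta^T - Theta^T Sigma_hat Theta.
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso(X, lambdas):
    n, p = X.shape
    theta_hat = np.zeros((p, p))
    for j in range(p):
        X_minus_j = np.delete(X, j, axis=1)
        # scikit-learn minimizes ||y - Z b||_2^2 / (2 n) + alpha * ||b||_1,
        # so alpha = lambda_j matches the penalty level used in the text.
        gamma_j = Lasso(alpha=lambdas[j], fit_intercept=False).fit(X_minus_j, X[:, j]).coef_
        resid = X[:, j] - X_minus_j @ gamma_j
        tau2_j = resid @ resid / n + lambdas[j] * np.abs(gamma_j).sum()
        c_j = np.insert(-gamma_j, j, 1.0)       # C_j: one at position j, -gamma_j elsewhere
        theta_hat[:, j] = c_j / tau2_j
    return theta_hat

def desparsify(theta_hat, X):
    sigma_hat = X.T @ X / X.shape[0]            # sample covariance matrix
    return theta_hat + theta_hat.T - theta_hat.T @ sigma_hat @ theta_hat

# Example usage: T_hat = desparsify(nodewise_lasso(X, lambdas), X)
\end{verbatim}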
\section{Theoretical results}
In this part, we inspect the asymptotic behaviour of the desparsified nodewise Lasso estimator \eqref{desparsified}. In particular, we consider the limiting distribution of individual entries of $\hat T$ and show that the convergence to the Gaussian distribution is uniform over the considered model.
For construction of confidence intervals, we consider estimators of the asymptotic variance of the proposed estimator, both for Gaussian and subGaussian design.
We derive convergence rates of the method in supremum norm and consider application to variable selection.
For completeness, we summarize the convergence rates of the nodewise Lasso estimator in a lemma in the Supplementary material. The result is essentially the same as Theorem 2.4 in [vdgeer13], so the proof of the common parts is omitted.
Recall that $\Theta_0 = \Sigma_0^{-1}$, and we define
the row sparsity by $s_j := |\{k \neq j : \Theta_{0,jk} \neq 0\}|$, the maximum row sparsity by $s := \max_{1 \leq j \leq p} s_j$ and the set of coordinates of nonzero entries of the precision matrix by $S := \{(j,k) : \Theta_{0,jk} \neq 0\}$.
For the analysis below, we will need the following conditions.

\begin{enumerate}[label=(A\arabic*), start=1]
\item (Bounded spectrum) The inverse covariance matrix $\Theta_0 = \Sigma_0^{-1}$ exists and there exists a universal constant $L \geq 1$ such that $1/L \leq \Lambda_{\min}(\Sigma_0) \leq \Lambda_{\max}(\Sigma_0) \leq L$.
\item (Sparsity)
\item (Sub-Gaussianity condition) Suppose that the design matrix $X$ has uniformly sub-Gaussian rows $X_i$, i.e. there exists a universal constant $K$ such that $\sup_{\|\alpha\|_2 \leq 1} \mathbb{E}\, e^{(\alpha^T X_i)^2/K^2} \leq 2$.
\end{enumerate}
The lower bound in (A1) guarantees that the noise level $\tau_j^2$ does not diverge. The upper bound (equivalently, a lower bound on the eigenvalues of $\Theta_0$) guarantees that the compatibility condition (see [hds]) is satisfied for the matrix $\Sigma_{0,-j,-j}$, which is the true covariance matrix without the $j$-th row and $j$-th column. The sub-Gaussianity condition (A3) is used to obtain concentration results which are crucial to our analysis. Condition (A3) is also used to ensure that the compatibility condition is satisfied for its sample counterpart with high probability (see [hds]). Conditions (A1), (A2) and (A3) are the same conditions as used in [vdgeer13] to obtain rates of convergence for the nodewise regression estimator. Define the parameter set
The following lemma shows that the proposed estimator can be decomposed into a pivot term and a term which is of small order with high probability. {lema} Suppose that $\hat\Theta$ is the nodewise Lasso estimator with regularization parameters $\lambda_j \asymp \sqrt{\log p / n}$, uniformly in $j$, for some sufficiently large constant. Suppose that Document and Document are satisfied. Then for each $j,k \in \{1,\dots,p\}$ it holds that
\begin{equation}
\hat T_{jk} - \Theta_{0,jk} = -\Theta_{0,j}^T(\hat\Sigma - \Sigma_0)\Theta_{0,k} + \mathrm{rem}_{jk},
\end{equation}
where there exists a constant $c_1 > 0$ such that, with high probability, $|\mathrm{rem}_{jk}| \leq c_1\, s \log p / n$ uniformly in $j$ and $k$.
From Lemma Document it follows that we need to assume a stronger sparsity condition than (A2) for the remainder term to be negligible after normalization by $\sqrt{n}$. This is in accordance with other literature on the topic, see [vdgeer13], [zhou]. Hence we introduce the following strengthened sparsity condition.

\begin{enumerate}[label=(A\arabic*), start=2]
\item (Strengthened sparsity) $\max_{1 \leq j \leq p} s_j = o\big(\sqrt{n}/\log p\big)$.
\end{enumerate}
The next result shows that the elements of $\hat T$ are indeed asymptotically normal. To this end, we further define the asymptotic variance
\[
\sigma_{jk}^2 := \mathrm{Var}\big(X_1^T\Theta_{0,j}\, X_1^T\Theta_{0,k}\big).
\]
In some of the results to follow, we shall assume a universal lower bound on $\sigma_{jk}^2$ as follows.

\begin{enumerate}[label=(A\arabic*), start=4]
\item There exists a universal constant $c_\sigma > 0$ such that $\sigma_{jk}^2 \geq c_\sigma$.
\end{enumerate}
Assumption (A4) is satisfied e.g. under Gaussian design and (A1). Denote a parameter set
{theorem}[Asymptotic normality] Suppose that $\hat\Theta$ is the nodewise Lasso estimator with regularization parameters $\lambda_j \asymp \sqrt{\log p / n}$ uniformly in $j$. Suppose that Document and Document are satisfied. Then for every $j,k \in \{1,\dots,p\}$ and every $z \in \mathbb{R}$ it holds that
\[
\sup_{\Theta_0} \Big| \mathbb{P}\big(\sqrt{n}\,(\hat T_{jk} - \Theta_{0,jk})/\sigma_{jk} \leq z\big) - \Phi(z) \Big| \rightarrow 0,
\]
where the supremum is taken over the parameter set above.
To construct confidence intervals, a consistent estimator of the asymptotic variance is required. Consistent estimators of $\sigma_{jk}^2$ are discussed in Section Document (see Lemmas Document and Document). Hence Theorem Document implies uniformly valid asymptotic confidence intervals
\[
\Big[\hat T_{jk} - \Phi^{-1}(1-\alpha/2)\,\hat\sigma_{jk}/\sqrt{n},\ \ \hat T_{jk} + \Phi^{-1}(1-\alpha/2)\,\hat\sigma_{jk}/\sqrt{n}\Big],
\]
i.e. their coverage of $\Theta_{0,jk}$ converges to the nominal level $1-\alpha$ uniformly over the considered class of precision matrices.
The result also enables testing hypotheses about individual elements of the precision matrix. For testing multiple hypotheses simultaneously, we may use standard procedures such as the Bonferroni-Holm procedure (see [vdgeer13]).
\subsection{Variance estimation}
For the case of Gaussian observations, we may easily calculate the theoretical variance $\sigma_{jk}^2 = \Theta_{0,jj}\Theta_{0,kk} + \Theta_{0,jk}^2$ and plug in the estimate in place of the unknown $\Theta_0$, as displayed in Lemma Document below. {lema} Suppose that assumptions Document and Document are satisfied and assume that the rows of the design matrix $X$ are independent $\mathcal{N}(0,\Sigma_0)$ distributed. Let $\hat\Theta$ be the nodewise Lasso estimator and let $\lambda_j \asymp \sqrt{\log p / n}$ uniformly in $j$. Then for $j,k = 1,\dots,p$ we have
for some constants. Lemma Document implies that, under the stated conditions, we obtain a rate of convergence for this variance estimator. If Gaussianity is not assumed, we may replace the estimator of the variance by its empirical version, plugging in $\hat\Theta$ in place of the unknown $\Theta_0$. Thus we take the following estimator of $\sigma_{jk}^2$, where $\hat\Theta$ is the nodewise regression estimator
\begin{equation}\label{var.gene}
\hat\sigma_{jk}^2 := \frac{1}{n}\sum_{i=1}^n \big(X_i^T \hat\Theta_j \, X_i^T \hat\Theta_k\big)^2 - \big(\hat\Theta_j^T \hat\Sigma \hat\Theta_k\big)^2 .
\end{equation}
The following lemma justifies this procedure under Document, Document, Document. {lema} Suppose that the assumptions Document and Document are satisfied and for some , it holds that . Let $\hat\Theta$ be the nodewise Lasso estimator and let $\lambda_j \asymp \sqrt{\log p / n}$ uniformly in $j$. Let $\hat\sigma_{jk}^2$ be the estimator defined in \eqref{var.gene}. Then for all $j,k = 1,\dots,p$,
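As an illustration of how the two variance estimates and the resulting interval for a single entry $(j,k)$ could be computed, consider the following sketch; the Gaussian plug-in uses the product formula above, the empirical version uses the sample variance of the products $X_i^T\hat\Theta_j\, X_i^T\hat\Theta_k$, and all names are ours.
\begin{verbatim}
# Sketch (our own illustration) of the two variance estimates and a confidence
# interval for a single entry (j, k). 'est' is a plug-in estimate of Theta_0,
# e.g. the nodewise Lasso or the desparsified estimator.
import numpy as np
from scipy.stats import norm

def sigma2_gaussian(est, j, k):
    # Gaussian design: Var = Theta_jj * Theta_kk + Theta_jk^2, with estimates plugged in.
    return est[j, j] * est[k, k] + est[j, k] ** 2

def sigma2_empirical(theta_hat, X, j, k):
    # Sub-Gaussian design: empirical variance of the products X_i' Theta_j * X_i' Theta_k.
    w = (X @ theta_hat[:, j]) * (X @ theta_hat[:, k])
    return w.var()

def confidence_interval(T_hat, sigma2, n, j, k, alpha=0.05):
    half = norm.ppf(1 - alpha / 2) * np.sqrt(sigma2 / n)
    return T_hat[j, k] - half, T_hat[j, k] + half
\end{verbatim}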
\subsection{Rates of convergence}
The desparsified estimator achieves optimal rates of convergence in supremum norm. Observe first that for the nodewise regression estimator $\hat\Theta$ it holds, by Lemma 1 and Lemma 10 in the Supplementary material, that
By Hölder’s inequality and the KKT conditions it follows
Consequently, for the rates of convergence of the nodewise Lasso in supremum norm we find
Desparsifying the estimator as in \eqref{desparsified} removes the additional term in the above rates. {theorem}[Rates of convergence] Assume that Document and Document are satisfied. Let $\hat T$ be the desparsified nodewise Lasso estimator with regularization parameters $\lambda_j \asymp \sqrt{\log p / n}$, uniformly in $j$, for some sufficiently large constant. Then there exist constants $c_1, c_2 > 0$ such that
and
We compare the results of Theorem Document with results on optimal rates of convergence derived for Gaussian graphical models in [zhou]. Suppose that the observations are Gaussian, i.e. $X_i \sim \mathcal{N}(0, \Sigma_0)$. For sparsity levels in a suitable range it holds (see [zhou])
and
where the constants involved are positive and depend on the model constants only. As follows from Theorem Document, the desparsified nodewise Lasso attains this lower bound on the rates and thus is in this sense optimal (considering the Gaussian setting).
\subsection{Other desparsified estimators}
The desparsification may work for other estimators of the precision matrix, provided that certain conditions are satisfied. This is formulated in Lemma Document below. A particular example of interest is the square-root nodewise Lasso estimator, which will be discussed below. This estimator has the advantage that it is self-scaling in the variance, similarly to the square-root Lasso \citep{sqrtlasso} on which it is based. {lema} Assume that for some estimator it holds
(\theequation) 
Then for it holds under Document, Document, Document
Moreover, we briefly consider nodewise regression with the square-root Lasso as an example. The square-root Lasso estimators may be defined via
\[
\tilde\gamma_j := \arg\min_{\gamma \in \mathbb{R}^{p-1}} \Big( \|X_j - X_{-j}\gamma\|_2/\sqrt{n} + \lambda_j \|\gamma\|_1 \Big),
\]
for $j = 1,\dots,p$. Define $\tilde C_j$ and $\tilde\tau_j^2$ analogously to the nodewise Lasso. The nodewise square-root Lasso is then given by $\tilde\Theta_j := \tilde C_j / \tilde\tau_j^2$, where $\tilde\Theta_j$ is its $j$-th column.
Note that, compared to the nodewise Lasso, the difference lies in the estimation of the partial correlations, where we use the square-root Lasso, which ``removes the square'' from the squared loss. The Karush-Kuhn-Tucker conditions, similarly as in Lemma 11 in the Supplementary material, give
where the corresponding subgradient vector is an element of the subdifferential of the $\ell_1$-norm evaluated at $\tilde\gamma_j$. The paper [sqrtlasso] further shows that the rates for the square-root Lasso satisfy condition \eqref{con1}. The desparsified estimator may then be defined in the same way as in \eqref{desparsified}. Then the conditions of Lemma Document are satisfied, which implies that the desparsified nodewise square-root Lasso achieves the same rates as the desparsified nodewise Lasso and thus is also rate-optimal.
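For illustration, one way to compute the square-root Lasso regressions is the equivalent scaled-Lasso iteration, which alternates a Lasso fit with an update of the noise level; this is a sketch under that assumption and not necessarily the algorithm used in [sqrtlasso].
\begin{verbatim}
# Sketch (our assumption): square-root Lasso for one nodewise regression via the
# scaled-Lasso iteration, alternating a Lasso fit and a noise-level update.
import numpy as np
from sklearn.linear_model import Lasso

def sqrt_lasso(y, Z, lam, n_iter=20):
    n = Z.shape[0]
    sigma = max(np.std(y), 1e-10)               # initial noise level
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        beta = Lasso(alpha=lam * sigma, fit_intercept=False).fit(Z, y).coef_
        sigma = max(np.linalg.norm(y - Z @ beta) / np.sqrt(n), 1e-10)
    return beta, sigma
\end{verbatim}
At a fixed point of this iteration the KKT conditions of the square-root Lasso are satisfied, which is why the two formulations coincide.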
\subsection{The thresholded estimator and variable selection}
The desparsified estimator can be used for variable selection without imposing irrepresentability conditions. Under mild conditions, the procedure leads to exact recovery of the coefficients that are sufficiently large in absolute value relative to the noise level. The following corollary is implied by Theorem Document and Lemmas Document and Document. {cor} Let $\hat\Theta$ be obtained using the nodewise Lasso and let $\hat T$ be defined as in \eqref{desparsified} with tuning parameters $\lambda_j \asymp \sqrt{\log p / n}$ uniformly in $j$, for some sufficiently large constant. Assume that conditions Document, Document, Document and Document are satisfied. Let $\hat\sigma_{jk}$ be the estimator from Lemma Document and assume that for some . Then there exists some constant such that
If, in addition, the rows of $X$ are $\mathcal{N}(0,\Sigma_0)$ distributed and $\hat\sigma_{jk}$ is instead the estimator from Lemma Document, then there exist constants such that
Corollary Document implies that we may define the resparsified estimator
\[
\hat T^{\mathrm{thr}}_{jk} := \hat T_{jk} \, 1\big\{ |\hat T_{jk}| \geq \tau_{jk} \big\}, \qquad j,k = 1,\dots,p,
\]
where the threshold level $\tau_{jk}$ is defined as in Corollary Document. Denote the support of the resparsified estimator by $\hat S := \{(j,k) : \hat T^{\mathrm{thr}}_{jk} \neq 0\}$. Denote by $S_0 := \{(j,k) : \Theta_{0,jk} \neq 0\}$ the true support and by $S_0^{\tau} \subseteq S_0$ the set of entries of $\Theta_0$ which are sufficiently large in absolute value relative to the threshold level. Then it follows directly from Corollary Document that with high probability
\[
S_0^{\tau} \subseteq \hat S \subseteq S_0 .
\]
The inclusion $S_0^{\tau} \subseteq \hat S$ represents that $\hat T^{\mathrm{thr}}$ correctly identifies all the nonzero parameters which are above the noise level. The inclusion $\hat S \subseteq S_0$ means that there are no false positives. If for all $(j,k) \in S_0$ it holds
(\theequation) 
then we have exact recovery, i.e. with high probability $\hat S = S_0$.
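A minimal sketch of the thresholding step described above is given below; the entrywise threshold is passed in as an input (in practice it would be chosen as in the corollary, possibly varying with $(j,k)$ through the variance estimate), and the helper names are ours.
\begin{verbatim}
# Sketch (our own illustration): resparsify the desparsified estimate by thresholding
# and read off the selected edge set.
import numpy as np

def resparsify(T_hat, tau):
    T_thr = T_hat.copy()
    T_thr[np.abs(T_thr) < tau] = 0.0            # tau may be a scalar or a (p, p) array
    return T_thr

def selected_edges(T_thr):
    p = T_thr.shape[0]
    return [(j, k) for j in range(p) for k in range(j + 1, p) if T_thr[j, k] != 0.0]
\end{verbatim}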
\section{Further comparison to previous work}
Closely related is the paper [jvdgeer14], where asymptotically normal estimation of elements of the concentration matrix is considered based on the graphical Lasso.
While the analysis follows the same principles, the estimation method used here does not require the irrepresentability condition \citep{ravikumar} that is assumed in [jvdgeer14]. Hence we are able to show that our results hold uniformly over the considered class.
Furthermore, regarding the computational cost, our method uses $p$ Lasso regressions, which can be implemented using fast algorithms as in [lars].
In comparison, the graphical Lasso method presents a more challenging computational problem, for more details see e.g. [glasso.comp].
Another related work is the paper [zhou]. This paper suggests an estimator for the precision matrix which is shown to have one-dimensional asymptotically normal components with asymptotic variance equal to $\Theta_{0,jj}\Theta_{0,kk} + \Theta_{0,jk}^2$.
The assumptions and results of that paper are essentially identical to our assumptions and theoretical results in the present paper. However, there are some differences. The paper [zhou] assumes Gaussianity of the underlying distribution, while we only require sub-Gaussianity of the margins. Another difference is in the construction of the estimators. Both approaches use regression to estimate the elements of the precision matrix, but the paper [zhou] concentrates on estimation of the joint distribution of each pair of variables. Thus it is computationally more intensive, as it requires of the order of $p^2$ high-dimensional regressions (see [zhou]), while our methodology only requires $p$.
\section{Simulation results}
In this section we report on the performance of our method on simulated data and provide a comparison to another methodology. The random sample $X_1,\dots,X_n$ has mean zero and covariance matrix $\Sigma_0 = \Theta_0^{-1}$, where the precision matrix $\Theta_0$ is defined by
We consider the settings and . The second setting is further adjusted by randomly perturbing each nonzero off-diagonal element of $\Theta_0$ by adding a realization from the uniform distribution on the interval . We denote this new perturbed model by . Hence the second precision matrix was chosen randomly. The sparsity assumption requires . We have chosen the sample sizes for the numerical experiments according to the sparsity assumption (for this purpose, we ignored possible constants in the sparsity restriction), i.e. .
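Since the exact matrices used in this section are not reproduced here, the following sketch only indicates how precision matrices of this type (a sparse base model and a randomly perturbed variant) could be generated; the banded structure, the off-diagonal value and the perturbation scale are our placeholder choices.
\begin{verbatim}
# Sketch with placeholder values (not the matrices used in the paper): a banded
# precision matrix and a variant with randomly perturbed nonzero off-diagonal entries.
import numpy as np

def banded_precision(p, off=0.3):
    return np.eye(p) + off * (np.eye(p, k=1) + np.eye(p, k=-1))

def perturbed_precision(theta0, scale=0.1, rng=None):
    rng = np.random.default_rng(1) if rng is None else rng
    theta = theta0.copy()
    noise = rng.uniform(-scale, scale, size=theta.shape)
    noise = (noise + noise.T) / 2               # keep the perturbation symmetric
    mask = (theta != 0) & ~np.eye(theta.shape[0], dtype=bool)
    theta[mask] += noise[mask]
    return theta
\end{verbatim}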
\subsection{Asymptotic normality and confidence intervals for individual parameters}
\subsubsection{The Gaussian setting}
In this section, we consider normally distributed observations,
$X_i \sim \mathcal{N}(0, \Sigma_0)$ for $i = 1,\dots,n$.
In Figure 1
we display histograms of $\sqrt{n}\,(\hat T_{jk} - \Theta_{0,jk})/\hat\sigma_{jk}$ for several entries $(j,k)$, where $\hat T$ is defined in \eqref{desparsified} and the empirical variance is estimated as suggested by Lemma Document.
Superimposed is the density of a standard normal random variable.
Secondly, we investigate the properties of confidence intervals constructed using the desparsified nodewise Lasso. For comparison, we also provide results for confidence intervals based on the desparsified graphical Lasso introduced in [jvdgeer14]. The coverage and length of the confidence intervals were
estimated by their empirical versions,
respectively, using repeated random samples. For a set $S$ of entries, we define the average coverage over the set (and analogously the average length) as
We report average coverages over the set $S_0$ of nonzero entries of the precision matrix and over its complement $S_0^c$; these are denoted by avgcov $S_0$ and avgcov $S_0^c$, respectively.
Similarly, we calculate average lengths of confidence intervals for each parameter over the repetitions and report avglength $S_0$ and avglength $S_0^c$.
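Schematically, the reported quantities could be computed as in the following sketch, where \texttt{simulate} draws one data set, \texttt{ci\_fn} returns the confidence interval for a given entry, and the averaging follows the description above; the structure and names are ours.
\begin{verbatim}
# Sketch (our own illustration): average empirical coverage and length over a set S
# of entries, based on N simulation repetitions.
import numpy as np

def average_coverage_and_length(theta0, simulate, ci_fn, S, N=100):
    cover = np.zeros(len(S))
    length = np.zeros(len(S))
    for _ in range(N):
        X = simulate()
        for idx, (j, k) in enumerate(S):
            lo, hi = ci_fn(X, j, k)
            cover[idx] += (lo <= theta0[j, k] <= hi)
            length[idx] += hi - lo
    return cover.mean() / N, length.mean() / N
\end{verbatim}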
The results of the simulations are shown in Tables 1 and 2.
The target coverage level is 95\%.
The following methodology for the choice of the tuning parameters (see [zhou]) was used for both methods:
(\theequation) 
where denotes the quantile of a distribution with degrees of freedom.
Gaussian setting: Estimated coverage probabilities and lengths

\begin{tabular}{llccccc}
\hline
\multicolumn{2}{l}{Setting} & Method & avgcov & avglength & avgcov & avglength \\
\hline
 & & DS NW & 0.945 & 0.302 & 0.963 & 0.262 \\
 & & DS GL & 0.931 & 0.293 & 0.974 & 0.254 \\
 & & DS NW & 0.947 & 0.267 & 0.963 & 0.232 \\
 & & DS GL & 0.928 & 0.254 & 0.976 & 0.220 \\
 & & DS NW & 0.949 & 0.238 & 0.965 & 0.220 \\
 & & DS GL & 0.928 & 0.236 & 0.977 & 0.205 \\
 & & DS NW & 0.948 & 0.246 & 0.965 & 0.230 \\
 & & DS GL & 0.925 & 0.228 & 0.981 & 0.223 \\
\hline
\end{tabular}
Gaussian setting: Estimated coverage probabilities and lengths

\begin{tabular}{llccccc}
\hline
\multicolumn{2}{l}{Setting} & Method & avgcov & avglength & avgcov & avglength \\
\hline
 & & DS NW & 0.896 & 0.164 & 0.975 & 0.146 \\
 & & DS GL & 0.781 & 0.153 & 0.980 & 0.137 \\
 & & DS NW & 0.868 & 0.142 & 0.976 & 0.126 \\
 & & DS GL & 0.729 & 0.133 & 0.982 & 0.119 \\
 & & DS NW & 0.863 & 0.131 & 0.976 & 0.117 \\
 & & DS GL & 0.712 & 0.124 & 0.984 & 0.110 \\
 & & DS NW & 0.859 & 0.125 & 0.976 & 0.111 \\
 & & DS GL & 0.709 & 0.118 & 0.984 & 0.105 \\
\hline
\end{tabular}
\subsubsection{A sub-Gaussian setting}
In this section, we consider a design matrix with rows having a sub-Gaussian distribution other than the Gaussian distribution. Let $U$ be an $n \times p$ matrix with jointly independent entries generated from a continuous uniform distribution on the interval $[-\sqrt{3}, \sqrt{3}]$, so that the entries have mean zero and unit variance. Further consider a matrix $A$ with $A^T A = \Sigma_0$ and let $U_i$ denote the $i$-th row of $U$. Then we define
\[
X_i := A^T U_i
\]
for $i = 1,\dots,n$.
Then the expectation of $X_i$ is zero, the covariance matrix of $X_i$ is exactly $\Sigma_0$ and the precision matrix is $\Theta_0 = \Sigma_0^{-1}$.
It follows by Hoeffding's inequality that $X_i$ defined as above is sub-Gaussian with a universal constant.
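For illustration, such a design can be generated as in the following sketch; the Cholesky factor is one possible choice of $A$ with $A^T A = \Sigma_0$, and the uniform interval $[-\sqrt{3},\sqrt{3}]$ makes the entries of $U$ unit-variance, consistent with the construction above.
\begin{verbatim}
# Sketch (our own illustration) of the sub-Gaussian design: X_i = A^T U_i with
# independent uniform entries of U scaled to have mean zero and unit variance.
import numpy as np

def subgaussian_design(n, Sigma0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    p = Sigma0.shape[0]
    A = np.linalg.cholesky(Sigma0).T            # any A with A^T A = Sigma_0 works
    U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, p))
    return U @ A                                # rows have covariance exactly Sigma_0
\end{verbatim}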
A further difference compared to the simulations in Section Document is that we now estimate the variance of the desparsified estimator using the formula \eqref{var.gene} proposed for sub-Gaussian settings:
\[
\hat\sigma_{jk}^2 := \frac{1}{n}\sum_{i=1}^n \big(X_i^T \hat\Theta_j \, X_i^T \hat\Theta_k\big)^2 - \big(\hat\Theta_j^T \hat\Sigma \hat\Theta_k\big)^2,
\]
where $\hat\Theta$ is the nodewise Lasso estimator. The regularization parameters for the nodewise Lasso are chosen in accordance with \eqref{tp}. Figure 2 again displays the histograms related to several entries of the desparsified nodewise Lasso. Results related to the constructed confidence intervals are summarized in Table 3. The results demonstrate that the desparsified nodewise Lasso performs relatively well even under this non-Gaussian setting.
Sub-Gaussian setting: Estimated coverage probabilities and lengths

\begin{tabular}{llccccc}
\hline
\multicolumn{2}{l}{Setting} & Method & avgcov & avglength & avgcov & avglength \\
\hline
 & & DS NW & 0.906 & 0.234 & 0.949 & 0.249 \\
 & & DS GL & 0.811 & 0.190 & 0.944 & 0.216 \\
 & & DS NW & 0.909 & 0.203 & 0.950 & 0.217 \\
 & & DS GL & 0.791 & 0.165 & 0.946 & 0.187 \\
 & & DS NW & 0.911 & 0.189 & 0.950 & 0.202 \\
 & & DS GL & 0.765 & 0.152 & 0.947 & 0.173 \\
 & & DS NW & 0.911 & 0.180 & 0.951 & 0.192 \\
 & & DS GL & 0.740 & 0.143 & 0.947 & 0.164 \\
\hline
\end{tabular}
\subsection{Variable selection}
For variable selection as suggested in Corollary Document, we compare the desparsified nodewise Lasso and the desparsified graphical Lasso. The setting is again as in Section Document. Average true positives and false positives over 100 repetitions are reported. The choice of the tuning parameters is according to \eqref{tp} and the thresholding level is given by
(\theequation) 
taking for the desparsified nodewise regression, for the desparsified graphical Lasso. We take as in Lemma Document. The results of this simulation experiment are summarized in Table 4.
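For completeness, the reported true and false positives could be computed from the selected support as in the following sketch (off-diagonal entries of the upper triangle only); the names are ours.
\begin{verbatim}
# Sketch (our own illustration): true and false positives of the thresholded estimate
# against the true support of Theta_0.
import numpy as np

def tp_fp(T_thr, theta0):
    upper = np.triu_indices(theta0.shape[0], k=1)
    selected = T_thr[upper] != 0.0
    truth = theta0[upper] != 0.0
    return int(np.sum(selected & truth)), int(np.sum(selected & ~truth))
\end{verbatim}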