Bayesian estimation of a sparse precision matrix

# Bayesian estimation of a sparse precision matrix

## Abstract

We consider the problem of estimating a sparse precision matrix of a multivariate Gaussian distribution, including the case where the dimension is large. Gaussian graphical models provide an important tool in describing conditional independence through presence or absence of the edges in the underlying graph. A popular non-Bayesian method of estimating a graphical structure is given by the graphical lasso. In this paper, we consider a Bayesian approach to the problem. We use priors which put a mixture of a point mass at zero and certain absolutely continuous distribution on off-diagonal elements of the precision matrix. Hence the resulting posterior distribution can be used for graphical structure learning. The posterior convergence rate of the precision matrix is obtained. The posterior distribution on the model space is extremely cumbersome to compute. We propose a fast computational method for approximating the posterior probabilities of various graphs using the Laplace approximation approach by expanding the posterior density around the posterior mode, which is the graphical lasso by our choice of the prior distribution. We also provide estimates of the accuracy in the approximation.

Keywords : Graphical Lasso; Graphical models; Laplace approximation; Posterior convergence; Precision matrix.

## 1Introduction

Statistical inference on large covariance or precision matrix (inverse of covariance matrix) is a topic of growing interest in recent times. Often the dimension grows with the sample size and even can be bigger than . Data of this type are frequently encountered in fMRI, spectroscopy, gene array expressions and so on. Estimation of the covariance or precision matrix is of special interest because of their importance in methods like principal component analysis (PCA), linear discriminant analysis (LDA), etc. In cases where , the sample covariance matrix is necessarily singular, and hence an estimator of the precision matrix cannot be obtained by inverting it. Therefore we need to resort to other techniques for handling the high-dimensional problems.

Regularization methods for estimation of the sample covariance or precision matrix have been proposed and studied in recent literature for high-dimensional problems. These include banding, thresholding, tapering and penalization based methods; for example, see [24]. The primary goal of these regularization based methods is to impose a sparsity structure in the matrix. Most of these methods are applicable to situations where there is a natural ordering in the underlying variables, for example in data from time series, spatial data, etc., so that variables which are far off from each other have smaller correlations or partial correlations. In high-dimensional situations for data arising from genetics or econometrics, a natural ordering of the underlying variables may not always be readily available and hence estimation methods which are invariant to the ordering of the variables are desirable.

For estimation of a sparse inverse covariance matrix, graphical models [22] provide an excellent tool, as the conditional dependence between the component variables is captured an undirected graph; see [13]. There are several methods in the frequentist literature for the estimation of the precision matrix through graphical models. These methods include minimization of the penalized log-likelihood of the data with a lasso type penalty on the elements of the precision matrix. Several algorithms have been developed in the literature to solve the above optimization problem, including coordinate descent based algorithm for the lasso, which is popularly known as the graphical lasso [27]. Other methods include the Sparse Permutation Invariant Covariance Estimator (SPICE) [31].

Frequentist behavior of Bayesian methods in the context of high dimensional covariance matrix estimation have been studied only by a few authors. [15] studied asymptotic normality of posterior distributions for exponential families, which include the normal model with unknown covariance matrix, when the dimension , but restricting to . Recently, [29] considered sparse Bayesian factor models for dimensionality reduction in high dimensional problems and showed consistency in the -operator norm (also known as the spectral norm) by using a point mass mixture prior on the factor loadings, assuming such a factor model representation of the true covariance matrix.

Bayesian methods for inference using graphical models have also been developed, as in [33]. A conjugate family of priors, known as the -Wishart prior [33] have been developed for incomplete decomposable graphs. The equivalent prior on the covariance matrix is termed as the hyper inverse Wishart distribution in [12]. [25] introduced a more general family of conjugate priors for the precision matrix, known as the -Wishart family of distributions, which also has the conjugacy property. The properties of this family of distributions, including expressions for the Bayes estimators were further explored in [30]. Recently [3] studied posterior convergence rates for a -Wishart prior inducing a banding structure, where the true precision matrix need not have the banding structure.

[34] developed a Bayesian version of the graphical lasso, putting Laplace priors on the off-diagonal elements of the precision matrix and exponential priors on the diagonals. Similar in lines with the Bayesian lasso [28], the posterior mode in this case coincides with the graphical lasso estimate. A block Gibbs sampler is also developed for sampling from the resulting posterior. However, the Bayesian graphical lasso does not introduce any sparsity in the graphical structure because of the absence of a point mass at zero in the prior distribution for the off-diagonal elements. On the other hand, if point masses are introduced, the resulting posterior distribution on the structure of the graph becomes extremely difficult to compute based on the traditional reversible jump Markov chain Monte Carlo method.

In this paper, we derive posterior convergence rates for the Bayesian graphical lasso prior in terms of the Frobenius norm under appropriate sparsity conditions. For computing the posterior distribution, we propose a Laplace approximation based method to compute the posterior probability of different graphical structures. Such Laplace approximations based methods have been developed for variable selection in regression models; for example, see [36]. The lasso type penalty on the elements lead to non-differentiability of the integrand, when the graphical lasso sets an off-diagonal entry to zero, but the model includes that off-diagonal entry as a free variable. We shall call such models non-regular following the terminology used by [36] for variable selection in linear regression models. We show that the posterior probability of non-regular models are substantially smaller than their regular counterparts and hence in comparison may be ignored from consideration. We also estimate the error in the Laplace approximation for regular models.

The paper is organized as follows. In the next section, we introduce notations and discuss preliminaries on graphical models required for the other sections of the paper. In Section 3, we state model assumptions and specify the prior distribution on the underlying parameters, derive the form of the posterior and obtain the posterior convergence rate using the general theory developed in [16]. In Section 4, we develop the approximation of the posterior probabilities for different graphical models and discuss the issue of non-regular graphical models. We also show that the error in approximation of the posterior probabilities using the Laplace approximation is asymptotically negligible under appropriate conditions. A simulation study is performed in the Section 5 followed by a real data example in Section 6. Proofs of main results and additional lemmas are included in the Appendix.

## 2Notations and preliminaries

An undirected graph comprises of a non-empty set of vertices indexing the components of a -dimensional random vector along with an edge-set defined by . Let be , where the precision matrix is such that implies . We then say that follows a Gaussian graphical model (GGM) with respect to the graph . Since the absence of an edge between and implies conditional independence of and given , a GGM serve as an excellent tool in representing the sparsity structure in the precision matrix. Following the notation in [25], the canonical parameter is restricted to , where is the cone of positive definite symmetric matrices of order having zero entry corresponding to each missing edge in . We also denote the linear space of symmetric matrices of order by , and to be the cone of positive definite matrices of order . Corresponding to each GGM , we define the set .

By (respectively, ), we mean that is bounded (respectively, as ). For a random sequence , (respectively, ) means that for some constant (respectively, for all ). For numerical sequences and , by (or, we mean that , while by we mean that . By we mean that . The indicator function is denoted by . Vectors are represented in bold lowercase English or Greek letters with the components of a vector by the corresponding non-bold letters, that is, for , . For a vector , we define the following vector norms: , . Matrices are denoted in bold uppercase English or Greek letters, like , where stands for the th entry of . The identity matrix of order will be denoted by . If is a symmetric matrix, let stand for its eigenvalues and let the trace of be denoted by . Viewing as a vector in , we define and -norms on matrices as

Note that , the Frobenius norm. Viewing an operator from to , where , we can also define, We refer to the norm as the -operator norm. This gives the -operator norm of as

For symmetric matrices, . For symmetric matrices and of order , we have the following:

stands for the unique positive definite square root of a positive definite matrix . For two matrices and , we say that (respectively, ) if is nonnegative definite (respectively, positive definite). Thus for a positive definite matrix , where stands for the zero matrix. We denote sets in non-bold uppercase English letters. The cardinality of a set , that is, the number of elements in is denoted by . We define the symmetric matrix .

The Hellinger distance between two probability densities and is given by .

For a subset of a metric space , denote the -covering number of with respect to , that is, the minimum number of -balls of size is needed to cover .

## 3Model, prior and posterior concentration

Consider independent random samples from , where is nonsingular and the precision matrix is sparse. The problem is to estimate and to learn the underlying graphical structure. We denote the natural unbiased estimator of by .

The graphical lasso produces sparse solutions for the precision matrix, in similar lines to that of the lasso in case of linear regression. The graphical lasso estimator minimizes two times the penalized negative log-likelihood

over the class of positive definite matrices, and acts as the penalty parameter. [31] derived frequentist convergence rates of the penalized estimator under some sparsity assumptions on the true precision matrix. More specifically, consider the following class of positive definite matrices of order :

Though [31] considered penalizing only the off-diagonal elements of , some modification of the proof of their result leads to the same convergence rate for the graphical lasso estimator, obtained by additionally penalizing the diagonal elements. Let us denote as the graphical lasso estimator based on a sample of size from a -dimensional Gaussian distribution with precision matrix , where is given by (Equation 2). Then, it follows from Theorem 1 in [31] that the rate of convergence of is . By the triangle inequality,

Also, the triangle inequality and sub-multiplicative property for matrix operator norms gives,

Thus, we get,

Now, we have, by assumption, and it follows from Theorem 1 in [31] that as . Noting that , we get,

In the Bayesian context, [34] introduced the graphical lasso prior, which uses exponential distributions on diagonal elements and Laplace density on off-diagonal elements, all independently of each other, and finally imposes a positive definiteness constraint. The graphical lasso prior has a drawback that it puts absolutely continuous priors on the elements of the precision matrix, and hence the posterior probabilities of the event is always exactly zero.

[34] also mentioned an extension of the graphical lasso by putting an additional level of prior on the underlying graphical model structure using point mass priors on the events corresponding to the absence of an edge in the edge-set , although did not develop the method. We put point-mass prior on the events to make posterior inference about the sparse structure of the underlying graphical model. Define to be a vector of edge-inclusion indicator, that is,

Similar to the Bayesian graphical lasso prior, given the underlying graphical structure, we put a Laplace prior on the non-zero off-diagonal elements of the precision matrix and for the diagonal elements we have a exponential prior, overall maintaining the positive definiteness of the parameter. Then the joint prior density on is given by,

We propose two different priors on the graphical structure indicator . The edge indicators are considered to be independent and identically distributed (i.i.d) Bernoulli random variables, and conditioned to the restriction that the model size does not exceed . For some , the prior distribution on is assumed to satisfy

This prior is similar to that used by [10], which chooses the model size first according to a distribution with a similar tail decay and then subsets are selected randomly with equal probability. We can also specify the individual priors on the same as above, but now truncating the model size to some fixed , where is chosen so as to satisfy the metric entropy condition required for posterior convergence.

Thus, in the first situation, the prior on the graphical structure indicator , given , is given by,

In the second case, the prior on is simply given by

Smaller values of prefer graphical models with fewer number of edges, hence inducing more sparsity in the precision matrix.

Due to the positive definiteness constraint on the parameter , the normalizing constant corresponding posterior distribution of the graphical model becomes intractable and hence was not explored in [34]. One possible solution is to employ a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm, which jumps from models of varying dimensions to evaluate the posterior probabilities. As there are as many as possible models, the posterior model probabilities estimated by RJMCMC visits are extremely unreliable. We consider a radically different approach to posterior computation based on Laplace approximations, elaborated in the next section.

Under the above prior specifications, the joint posterior distribution of and given the data is given by

Thus,

where

The following result gives posterior convergence rate as . We assume that the true model is sparse, as given by the class of positive definite matrices in (Equation 2).

The proof uses the general theory of posterior convergence of [16] and will be given in the appendix. The above posterior convergence rate matches exactly with the frequentist convergence rate of the penalized estimator obtained in [31].

Note that, Theorem ? gives with posterior probability tending to one in probability and from [31] it follows that, , where is the graphical lasso estimate. Hence, by the triangle inequality, with posterior probability tending to one in probability. This gives,

## 4Posterior Computation

The marginal posterior density of the graphical structure indicator can be obtained by integrating out elements of the precision matrix in the joint posterior density in (Equation 8), to get

where

Note that is minimized at , the graphical lasso estimate corresponding to the penalty parameter . The marginal posterior of is, however, intractable. We give an approximate method for the posterior probability computations of various models using Laplace approximation. The Laplace approximation requires expanding the integrand in (Equation 10) around the maximum, which in this case, coincides with the graphical lasso solution.

### 4.1Approximating model posterior probabilities

Define , where is the graphical lasso solution corresponding to the underlying graphical model structure and penalty parameter . Then,

where is

Clearly is minimized at by the definition of , so the first derivative of vanishes at , provided that it is differentiable at . Define the matrix , where

Using standard matrix calculus (for example, see Section 15.9 of [18]), we can find that the Hessian of is the matrix , whose th entry for is given by

Thus the Laplace approximation to the posterior probability is given by

The approximation in (Equation 13) is meaningful only if all the graphical lasso estimates of the off-diagonal elements corresponding to the graph generated by are non-zero; otherwise the derivative of does not exist. A similar situation arises in the context of regression models; see [36] and [11]. In the next section, we show that such “non-regular models” can essentially be ignored for the purpose of posterior probability evaluation.

### 4.2Ignorability of non-regular models

As discussed in the previous section, the objective function of the graphical lasso problem is not differentiable if the graphical lasso solution is zero for at least one pair . These models are referred to as non-regular models. This essentially means that given a fixed graphical structure index , the graphical lasso solution is for at least one . Let us assume, for notational simplicity, that the first elements of are 1 and the rest are 0. Also, among those 1’s, the last of them have corresponding graphical lasso solution equal to zero. For such a non-regular model, we argue that the submodel , with first 1’s and rest 0’s, provides the same graphical lasso solution for the non-zero elements as the bigger model . This means that for such that , the graphical lasso solution corresponding to , given by is identical with that corresponding to , given by . We refer to such a submodel as the regular submodel of the non-regular model .

We give a proof of the above lemma in the appendix. For notational convenience, let us denote the precision matrix corresponding to the structure indicator by , and corresponding matrix is defined by . We denote the graphical lasso solution in the non-regular model and the corresponding regular submodel by . The ratio of the posterior model probabilities of the two model is given by,

The following result shows the ignorability of the non-regular models.

Using (Equation 9), we have,

Now, note that for such that , we have,

Hence, using Lemma ?, we get

The last inequality follows from the fact that if the prior as in (Equation 6) is used, then since . For the other prior as in (Equation 7), the inequality follows trivially as it involves the ratio of two indicator variables only.

For , the above ratio is less than . This completes the proof.

The above result is particularly important in the sense that we can focus on the regular models only, ignoring the non-regular ones especially if is chosen to be small. While approximating the posterior probabilities of the regular models, we re-normalize the values considering the regular models only.

### 4.3Error in Laplace approximation

The approximation in the posterior probability of the graphical model is based on a Taylor series expansion of the function around the graphical lasso solution . Let , and denote the vectorized version of , but excluding entries corresponding to the missing edges in the underlying graphical model. Thus is a vector of dimension corresponding to the graphical structure indicator . If the graphical model is -sparse, that is, there are edges present in the graph, then . The following result gives the bound on the remainder term of the Taylor series expansion under the above assumptions.

This result can be used to find a bound for the error in Laplace approximation of the posterior probabilities of the graphical model structures. The following result gives the condition for which the error in approximation is asymptotically negligible.

The proof of the above result depends on several additional results, including Lemma ? involving the bound on the remainder term in the Taylor series expansion of . We give a proof of the above result along with these additional results in the appendix.

## 5Simulation results

We perform a simulation study to assess the performance of the Bayesian method for graphical structure learning. We use 4 different models for our simulations, and we specify these models in terms of the elements of the covariance matrix or the precision matrix , as follows:

1. Model 1: AR(1) model, .

2. Model 2: AR(2) model, .

3. Model 3: Star model, where every node is connected to the first node, and , and otherwise.

4. Model 4: Circle model, .

Corresponding to each model, we generate samples of size and dimension . The penalty parameter for the graphical lasso algorithm is chosen to be 0.5 and the value of appearing in the prior of the graphical structure indicator to be 0.4. We run 100 replications for each of the models and find the median probability model for each replication. To assess the performance of the median probability model (denoted by ‘MPP’), we compute the specificity, sensitivity and Matthews Correlation Coefficient (MCC) averaged across the replications as defined below and also compute the same for the graphical lasso (denoted by ‘GL’). The results are presented in Table ?.

where TP, TN, FP and FN respectively denote the true positives, true negatives, false positives and false negatives in the selected model, which in our case is the median probability model.

## 6Illustration with real data

In this section we illustrate the Bayesian graphical structure learning method with the stock price data from Yahoo! Finance. Description of the data set can be found in [26] and available in the huge package on CRAN [38] as stockdata. The data set consists of closing prices of stocks that were consistently included in the S&P 500 index in the time period January 1, 2003 to January 1, 2008 for a total of 1258 days. The stocks are also categorized into 10 Global Industry Classification Standard (GICS) sectors, namely, “Health Care”, “Materials”, “Industrials”, “Consumer Staples”, “Consumer Discretionary”, “Utilities”, “Information Technology”, “Financials”, “Energy”, “Telecommunication Services”.

Denoting as the closing stock price for the th stock on day , we construct the data matrix with entries . For analysis, we construct the data matrix by standardizing , so that each stock has mean zero and standard deviation one. We find the median probability model as selected by the Bayesian graphical structure learning method. The corresponding graphical structure is displayed in Figure ?. The vertices of the graph are colored corresponding to the different GICS sectors. We find that stocks from the same sectors tend to be related with other members from that category, and generally not related across different sectors, though there are some connections, which may be due to some other possible latent factors affecting all of them. The grouping of the stocks corresponding to their sectors is expected, implying that the stock prices for a particular sector are conditionally independent of those of other sectors.

We also individually study data pertaining to some of the specific sectors to have a closer look at the strength of the groupings where perturbations due to latent factors is least expected. For this, we consider the sectors “Utilities” and “Information Technology”. The graphical structure is displayed in Figure ?. The stock prices for the two sectors clearly separate as desired.

## AProofs

We now give a proof of the result on posterior convergence rate of the precision matrix.

In order to establish the rates of convergence of the posterior distribution, we first need to check the prior concentration rate, that is,

where . Note that, for and a symmetric matrix , we have,

We use the above result to find the expressions for and . Denoting the eigenvalues of the matrix by , using and (Equation 15), we get,

Now, , and from an argument in Lemma ? it implies that if , then . Hence we can expand in the powers of to get

Also,

Thus,

Now, using the assumptions on the true precision matrix and the matrix norm relations given by (Equation 1), we have,

Hence, equations (Equation 16) and (Equation 17), along with (Equation 1) give,

The components of are not independently distributed, but a truncation applies because of the positive definite restriction. However, as the true lies in the set of positive definite matrices which is open, the truncation can only increase concentration in a small ball centered at the truth, so we can pretend componentwise independence for the purpose of lower bounding the above prior probability. This gives

The prior concentration rate condition thus gives,

so as to get

Next, we need to construct tests for against the alternative . Let denote the joint density of the observations and let be the prior. From the results of [6] and [23], we know that for a probability measure and any convex set of probability measures, there exists tests such that

where

Let , where is the constant appearing in Lemma ? and