Two New Algorithms for Solving Covariance Graphical Lasso Based on Coordinate Descent and ECM

Hao Wang
Department of Statistics, University of South Carolina,
Columbia, South Carolina 29208, U.S.A.
haowang@sc.edu
(First version: May 17, 2012)

Abstract

Covariance graphical lasso applies a lasso penalty to the elements of the covariance matrix. This method is useful because it not only produces sparse estimates of the covariance matrix but also discovers marginal independence structures by generating zeros in the covariance matrix. We propose and explore two new algorithms for solving the covariance graphical lasso problem. Our new algorithms are based on coordinate descent and ECM. We show that these two algorithms are more attractive than the only existing competing algorithm of Bien & Tibshirani (2011) in terms of simplicity, speed and stability. We also discuss convergence properties of our algorithms.

Key words: Coordinate descent; Covariance graphical lasso; Covariance matrix estimation; $\ell_1$ penalty; ECM algorithm; Marginal independence; Regularization; Shrinkage; Sparsity
1 Introduction
Bien & Tibshirani (2011) proposed a covariance graphical lasso procedure for simultaneously estimating a covariance matrix and its marginal dependence structure. Let $S$ be the sample covariance matrix such that $S = Y'Y/n$, where $Y$ is the $n \times p$ data matrix of $p$ variables and $n$ samples. A basic version of their covariance graphical lasso problem is to minimize the following objective function:

$$ \log\det\Sigma + \mathrm{tr}(S\Sigma^{-1}) + \rho\|\Sigma\|_1 \qquad (1) $$

over the space $M^+$ of positive definite matrices, with $\rho > 0$ being the shrinkage parameter. Here, $\Sigma = (\sigma_{jk})_{p \times p}$ is the covariance matrix and $\|\Sigma\|_1 = \sum_{j,k}|\sigma_{jk}|$ is the $\ell_1$ norm of $\Sigma$. A general version of the covariance graphical lasso in Bien & Tibshirani (2011) allows different shrinkage parameters for different elements of $\Sigma$. To ease exposition, we describe our methods in the context of the simple case of a common shrinkage parameter $\rho$ in (1). All of our results can be extended to the general version with different shrinkage parameters with little difficulty.
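For concreteness, a direct R implementation of the objective function (1) might look like the following sketch; the function name and argument conventions are ours and are not part of any existing package.

```r
# Objective function (1): log det(Sigma) + tr(S Sigma^{-1}) + rho * ||Sigma||_1,
# where Sigma and S are p x p symmetric matrices and rho is the scalar shrinkage parameter.
cov_glasso_objective <- function(Sigma, S, rho) {
  R <- chol(Sigma)                          # fails if Sigma is not positive definite
  log_det    <- 2 * sum(log(diag(R)))       # log det(Sigma) via the Cholesky factor
  trace_term <- sum(diag(S %*% chol2inv(R)))
  penalty    <- rho * sum(abs(Sigma))       # l1 norm over all elements of Sigma
  log_det + trace_term + penalty
}
```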
Because of the $\ell_1$ norm term, the covariance graphical lasso can set some of the off-diagonal elements of $\Sigma$ exactly equal to zero at the minimum point of (1). Zeros in $\Sigma$ encode marginal independence structures among the components of a multivariate normal random vector with covariance matrix $\Sigma$. This is distinctly different from concentration graphical models (also referred to as covariance selection models, due to Dempster 1972), where the zeros are in the concentration matrix $\Sigma^{-1}$ and are associated with conditional independence.
The objective function (1) is not convex, which poses computational challenges for its minimization. Bien & Tibshirani (2011) proposed a complicated majorize-minimize approach that iteratively solves a convex approximation to the original nonconvex problem. In the current paper, we develop two alternative algorithms for minimizing (1): a coordinate descent algorithm and an Expectation/Conditional Maximization (ECM) algorithm. We discuss their convergence properties and investigate their computational efficiency through simulation studies. In comparison with Bien & Tibshirani (2011)'s algorithm, our new algorithms are much simpler to implement, substantially faster to run and numerically more stable in our tests.
Neither the coordinate descent algorithm nor the ECM algorithm is new to model fitting for regularized problems. The coordinate descent algorithm has been shown to be very competitive for fitting convex and even some nonconvex penalized regression models (Friedman et al., 2007; Wu & Lange, 2008; Breheny & Huang, 2011) as well as concentration graphical models (Friedman et al., 2008). The ECM algorithm (or its variants) has been developed in the Bayesian framework for finding the maximum a posteriori (MAP) estimates of regularized linear regression models (Polson & Scott, 2011; Armagan et al., 2011). However, the application of these two algorithms to covariance graphical lasso models is new. In this sense, our work documents that the coordinate descent algorithm and the ECM algorithm are also quite powerful for solving covariance graphical lasso models.
2 Two proposed algorithms
2.1 Coordinate descent algorithm
Our first algorithm for minimizing the objective function (1) uses the simple idea of coordinate descent methods. We show how to update one column and row at a time while holding all the remaining elements of $\Sigma$ fixed. Without loss of generality, we focus on the last column and row. Partition $\Sigma$ and $S$ as follows:

$$ \Sigma = \begin{pmatrix} \Sigma_{11} & \sigma_{12} \\ \sigma_{12}' & \sigma_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}' & s_{22} \end{pmatrix}, \qquad (2) $$

where (a) $\Sigma_{11}$ and $S_{11}$ are the covariance matrix and the sample covariance matrix of the first $p-1$ variables, respectively; (b) $\sigma_{12}$ and $s_{12}$ are the covariances and the sample covariances between the first $p-1$ variables and the last variable, respectively; and (c) $\sigma_{22}$ and $s_{22}$ are the variance and the sample variance of the last variable, respectively.
Let

$$ \beta = \sigma_{12}, \qquad \gamma = \sigma_{22} - \sigma_{12}'\Sigma_{11}^{-1}\sigma_{12}, $$

and apply the block matrix inversion to $\Sigma$ using the blocks $(\Sigma_{11}, \beta, \gamma)$:

$$ \Sigma^{-1} = \begin{pmatrix} \Sigma_{11}^{-1} + \dfrac{\Sigma_{11}^{-1}\beta\beta'\Sigma_{11}^{-1}}{\gamma} & -\dfrac{\Sigma_{11}^{-1}\beta}{\gamma} \\[6pt] -\dfrac{\beta'\Sigma_{11}^{-1}}{\gamma} & \dfrac{1}{\gamma} \end{pmatrix}. \qquad (3) $$
After removing constants not involving $(\beta, \gamma)$, the three terms in (1) can be expressed as functions of $(\beta, \gamma)$:

$$ \log\det\Sigma = \log\gamma + c_1, \qquad \mathrm{tr}(S\Sigma^{-1}) = \frac{\beta'\Sigma_{11}^{-1}S_{11}\Sigma_{11}^{-1}\beta - 2 s_{12}'\Sigma_{11}^{-1}\beta + s_{22}}{\gamma} + c_2, $$
$$ \rho\|\Sigma\|_1 = 2\rho\|\beta\|_1 + \rho(\beta'\Sigma_{11}^{-1}\beta + \gamma) + c_3, \qquad (4) $$

where $c_1$, $c_2$ and $c_3$ are constants not depending on $(\beta, \gamma)$, and we have used $\sigma_{22} = \gamma + \beta'\Sigma_{11}^{-1}\beta$. These lead to the following objective function with respect to $(\beta, \gamma)$:

$$ \log\gamma + \frac{\beta'\Sigma_{11}^{-1}S_{11}\Sigma_{11}^{-1}\beta - 2 s_{12}'\Sigma_{11}^{-1}\beta + s_{22}}{\gamma} + \rho\beta'\Sigma_{11}^{-1}\beta + \rho\gamma + 2\rho\|\beta\|_1. \qquad (5) $$
For $\gamma$, removing terms in (5) that do not depend on $\gamma$ gives

$$ \log\gamma + \frac{a}{\gamma} + \rho\gamma, $$

where $a = \beta'\Sigma_{11}^{-1}S_{11}\Sigma_{11}^{-1}\beta - 2 s_{12}'\Sigma_{11}^{-1}\beta + s_{22}$. Setting the derivative $1/\gamma - a/\gamma^2 + \rho$ to zero gives the quadratic $\rho\gamma^2 + \gamma - a = 0$, whose positive root is the minimizer:

$$ \hat\gamma = \frac{-1 + \sqrt{1 + 4\rho a}}{2\rho}. \qquad (6) $$
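As a quick sanity check on (6), the closed form can be compared against a one-dimensional numerical search. The following R fragment uses illustrative values and function names of our own choosing.

```r
# Closed-form minimizer (6) of log(gamma) + a/gamma + rho*gamma (assumes rho > 0 and a >= 0)
update_gamma <- function(a, rho) (-1 + sqrt(1 + 4 * rho * a)) / (2 * rho)

# Numerical check over an interval that contains the minimizer
a <- 2.5; rho <- 0.3
f <- function(g) log(g) + a / g + rho * g
c(closed_form = update_gamma(a, rho),
  numerical   = optimize(f, interval = c(1e-6, 100))$minimum)
```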
For $\beta$, removing terms in (5) that do not depend on $\beta$ gives

$$ \min_{\beta} \;\; \beta' V \beta - 2 u'\beta + 2\rho\|\beta\|_1, \qquad (7) $$

where $V = \Sigma_{11}^{-1}S_{11}\Sigma_{11}^{-1}/\gamma + \rho\Sigma_{11}^{-1}$ and $u = \Sigma_{11}^{-1}s_{12}/\gamma$. The problem in (7) is a lasso problem and can be efficiently solved by fast coordinate descent algorithms (Friedman et al., 2007; Wu & Lange, 2008). Specifically, for $j = 1, \dots, p-1$, the minimum point of (7) along the coordinate direction in which $\beta_j$ varies is:

$$ \hat\beta_j = \frac{\mathrm{soft}\bigl(u_j - \sum_{k \neq j} V_{kj}\beta_k, \; \rho\bigr)}{V_{jj}}, \qquad (8) $$

where $\mathrm{soft}$ is the soft-threshold operator:

$$ \mathrm{soft}(x, \rho) = \mathrm{sign}(x)\,(|x| - \rho)_+ . $$
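To illustrate, a minimal R sketch of the soft-threshold operator and of one full cycle of the update (8) for the lasso subproblem (7) is given below; the names (soft, lasso_cycle, V, u) follow the notation above but are otherwise our own.

```r
# Soft-threshold operator: soft(x, rho) = sign(x) * max(|x| - rho, 0)
soft <- function(x, rho) sign(x) * pmax(abs(x) - rho, 0)

# One full cycle of coordinate updates (8) for the lasso problem (7):
#   minimize  beta' V beta - 2 u' beta + 2 rho ||beta||_1
lasso_cycle <- function(beta, V, u, rho) {
  for (j in seq_along(beta)) {
    partial <- u[j] - sum(V[j, -j] * beta[-j])  # coordinate j with the others held fixed
    beta[j] <- soft(partial, rho) / V[j, j]
  }
  beta
}
```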
The update (8) is iterated over $j = 1, \dots, p-1$ until convergence. We then update the last column and row of $\Sigma$ as $\sigma_{12} = \hat\beta$ and $\sigma_{22} = \hat\gamma + \hat\beta'\Sigma_{11}^{-1}\hat\beta$, followed by cycling through all columns until convergence. This algorithm can be viewed as a block coordinate descent method with $p$ blocks of $\beta$'s and another $p$ blocks of $\gamma$'s. The algorithm is summarized as follows:
Coordinate descent algorithm. Given input $(S, \rho)$, start with $\Sigma^{(0)}$; at the $i$th iteration ($i = 1, 2, \dots$), cycle through the columns $j = 1, \dots, p$ and, for each column:

(1) Partition $\Sigma$ and $S$ as in (2), with column $j$ playing the role of the last column.

(2) Compute $a$ and update $\gamma$ by (6); then solve the lasso problem (7) by iterating (8) until convergence to update $\beta$.

(3) Set $\sigma_{12} = \hat\beta$ and $\sigma_{22} = \hat\gamma + \hat\beta'\Sigma_{11}^{-1}\hat\beta$.

After all $p$ columns have been updated, let $\Sigma^{(i)} = \Sigma$ and repeat (1)-(3) until convergence.
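Putting the pieces together, a minimal self-contained R sketch of the coordinate descent algorithm might look as follows. The function name, the default starting value $\Sigma^{(0)} = S + \rho I$ and the fixed number of inner lasso sweeps are our own choices for illustration; the sketch assumes $\rho > 0$ and stops when the change in the objective (1) falls below a tolerance.

```r
# Coordinate descent sketch for the covariance graphical lasso (assumes rho > 0).
cov_glasso_cd <- function(S, rho, Sigma0 = S + rho * diag(nrow(S)),
                          tol = 1e-4, max_iter = 100) {
  soft <- function(x, rho) sign(x) * pmax(abs(x) - rho, 0)
  obj  <- function(Sigma) {                       # objective function (1)
    R <- chol(Sigma)
    2 * sum(log(diag(R))) + sum(diag(S %*% chol2inv(R))) + rho * sum(abs(Sigma))
  }
  p <- nrow(S); Sigma <- Sigma0; f_old <- obj(Sigma)
  for (it in 1:max_iter) {
    for (j in 1:p) {
      idx   <- setdiff(1:p, j)                    # column j plays the role of the last column
      Sig11 <- Sigma[idx, idx]; S11 <- S[idx, idx]
      s12   <- S[idx, j];       s22 <- S[j, j]
      inv11 <- chol2inv(chol(Sig11))              # Sigma_11^{-1}
      beta  <- Sigma[idx, j]
      # gamma update (6)
      a     <- drop(t(beta) %*% inv11 %*% S11 %*% inv11 %*% beta -
                    2 * t(s12) %*% inv11 %*% beta + s22)
      gamma <- (-1 + sqrt(1 + 4 * rho * a)) / (2 * rho)
      # lasso subproblem (7): V, u and coordinate updates (8), run for a fixed number of sweeps
      V <- inv11 %*% S11 %*% inv11 / gamma + rho * inv11
      u <- drop(inv11 %*% s12) / gamma
      for (sweep in 1:50) {
        for (k in seq_along(beta)) {
          beta[k] <- soft(u[k] - sum(V[k, -k] * beta[-k]), rho) / V[k, k]
        }
      }
      # write the updated column and row back into Sigma
      Sigma[idx, j] <- Sigma[j, idx] <- beta
      Sigma[j, j]   <- gamma + drop(t(beta) %*% inv11 %*% beta)
    }
    f_new <- obj(Sigma)
    if (abs(f_new - f_old) < tol) break
    f_old <- f_new
  }
  Sigma
}
```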
2.2 ECM algorithm
The second algorithm for finding the estimator utilizes the representation of the solution to (1) as the MAP estimator of the posterior distribution:

$$ p(\Sigma \mid Y) \;\propto\; \exp\Bigl[-\frac{n}{2}\bigl\{\log\det\Sigma + \mathrm{tr}(S\Sigma^{-1}) + \rho\|\Sigma\|_1\bigr\}\Bigr], \qquad \Sigma \in M^+. $$

We propose to use the ECM algorithm to iteratively compute its mode. The key to the ECM algorithm is the choice of latent variables. Note that, for any off-diagonal element $\sigma_{jk}$ ($j < k$), we have the following well-known scale-mixture representation of the double exponential term:

$$ \exp(-n\rho|\sigma_{jk}|) \;\propto\; \int_0^\infty \frac{1}{\sqrt{\tau_{jk}}}\exp\Bigl(-\frac{\sigma_{jk}^2}{2\tau_{jk}} - \frac{n^2\rho^2\tau_{jk}}{2}\Bigr)\, d\tau_{jk}, \qquad (9) $$

where $\tau_{jk} > 0$ is the latent scale parameter. This suggests that the posterior distribution $p(\Sigma \mid Y)$ is the marginal distribution of the following complete posterior distribution of $(\Sigma, \tau)$, where $\tau = \{\tau_{jk} : 1 \le j < k \le p\}$:

$$ p(\Sigma, \tau \mid Y) \;\propto\; |\Sigma|^{-n/2}\exp\Bigl\{-\frac{n}{2}\mathrm{tr}(S\Sigma^{-1})\Bigr\}\prod_{j<k}\frac{1}{\sqrt{\tau_{jk}}}\exp\Bigl(-\frac{\sigma_{jk}^2}{2\tau_{jk}} - \frac{n^2\rho^2\tau_{jk}}{2}\Bigr)\prod_{j=1}^p\exp\Bigl(-\frac{n\rho}{2}\sigma_{jj}\Bigr)\, 1_{\{\Sigma \in M^+\}}. \qquad (10) $$

Consequently, the covariance graphical lasso estimator from (1) is equivalent to the marginal mode for $\Sigma$ in (10). We now describe how to compute this marginal mode for $\Sigma$ using the ECM algorithm (Meng & Rubin, 1993).
To perform the E-step, we calculate the expected complete log-posterior by replacing $\tau_{jk}^{-1}$ with its conditional expectation given the current iterate $\Sigma^{(l)}$. Following standard results for the inverse-Gaussian distribution, we have

$$ E\bigl(\tau_{jk}^{-1} \mid \sigma_{jk}^{(l)}, Y\bigr) = \frac{n\rho}{|\sigma_{jk}^{(l)}|}, $$

which leads to the following criterion function after removing the terms that do not depend on $\Sigma$:

$$ Q\bigl(\Sigma \mid \Sigma^{(l)}\bigr) = -\frac{n}{2}\Bigl\{\log\det\Sigma + \mathrm{tr}(S\Sigma^{-1}) + \rho\sum_{j<k}\frac{\sigma_{jk}^2}{|\sigma_{jk}^{(l)}|} + \rho\sum_{j=1}^p \sigma_{jj}\Bigr\}. \qquad (11) $$
The CM-step then maximizes (11) along each column (row) of $\Sigma$. Again, without loss of generality, we focus on the last column and row. Partition $\Sigma$ and $S$ as in (2) and consider the same transformation from $(\sigma_{12}, \sigma_{22})$ to $(\beta, \gamma)$. After dropping terms not involving $(\beta, \gamma)$, the four terms inside the braces of (11) can be written as functions of $(\beta, \gamma)$:

$$ \log\det\Sigma = \log\gamma + c, \qquad \mathrm{tr}(S\Sigma^{-1}) = \frac{\beta'\Sigma_{11}^{-1}S_{11}\Sigma_{11}^{-1}\beta - 2 s_{12}'\Sigma_{11}^{-1}\beta + s_{22}}{\gamma} + c, $$
$$ \rho\sum_{j<k}\frac{\sigma_{jk}^2}{|\sigma_{jk}^{(l)}|} = \rho\,\beta'\Lambda^{-1}\beta + c, \qquad \rho\sum_{j=1}^p\sigma_{jj} = \rho(\beta'\Sigma_{11}^{-1}\beta + \gamma) + c, $$

where $\Lambda = \mathrm{diag}\bigl(|\beta_1^{(l)}|, \dots, |\beta_{p-1}^{(l)}|\bigr)$ collects the absolute values of the current off-diagonal elements of the last column, and $c$ denotes generic constants not depending on $(\beta, \gamma)$. Holding all but $(\beta, \gamma)$ fixed, we can then rewrite (11) as

$$ Q(\beta, \gamma) = -\frac{n}{2}\Bigl\{\log\gamma + \frac{\beta'\Sigma_{11}^{-1}S_{11}\Sigma_{11}^{-1}\beta - 2 s_{12}'\Sigma_{11}^{-1}\beta + s_{22}}{\gamma} + \rho\,\beta'(\Sigma_{11}^{-1} + \Lambda^{-1})\beta + \rho\gamma\Bigr\} + c. \qquad (12) $$

For $\gamma$, it is easy to derive from (12) that the conditional maximum point given $\beta$ is the same as in (6). On the other hand, given $\gamma$, (12) can be written as a function of $\beta$,

$$ -\frac{n}{2}\bigl\{\beta'(V + \rho\Lambda^{-1})\beta - 2u'\beta\bigr\} + c, $$

where $V$ and $u$ are defined in (7). This implies that the conditional maximum point of $\beta$ is

$$ \hat\beta = (V + \rho\Lambda^{-1})^{-1}u. \qquad (13) $$
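In R, the CM update (13) amounts to a single ridge-type linear solve, as in the hedged fragment below; V and u are as defined after (7), beta_old denotes the current iterate of the column, and the function name is our own.

```r
# Ridge-type CM update (13) for beta, given gamma and the current iterate beta_old.
# Requires all entries of beta_old to be nonzero (see Section 2.3).
update_beta_cm <- function(V, u, beta_old, rho) {
  Lambda_inv <- diag(1 / abs(beta_old), length(beta_old))  # Lambda^{-1}
  drop(solve(V + rho * Lambda_inv, u))
}
```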
Cycling through every column generates a sequence of CM-steps. The total number of CM-steps is $2p$ because there are $p$ columns and, for each column, the CM involves two steps: one for $\gamma$ and the other for $\beta$. The ECM algorithm can be summarized as follows:
ECM algorithm. Given input $(S, \rho)$, start with $\Sigma^{(0)}$ whose off-diagonal elements are all nonzero; at the $i$th iteration ($i = 1, 2, \dots$):

(1) E-step: compute $E\bigl(\tau_{jk}^{-1} \mid \sigma_{jk}^{(i-1)}, Y\bigr) = n\rho/|\sigma_{jk}^{(i-1)}|$ for all $j < k$, which determines the criterion function (11).

(2) CM-steps: cycle through the columns $j = 1, \dots, p$; for each column, partition $\Sigma$ and $S$ as in (2).

(3) Update $\gamma$ by (6) and update $\beta$ by (13).

(4) Set $\sigma_{12} = \hat\beta$ and $\sigma_{22} = \hat\gamma + \hat\beta'\Sigma_{11}^{-1}\hat\beta$.

(5) Let $\Sigma^{(i)} = \Sigma$ and repeat (1)-(4) until convergence.
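Under the same notational assumptions, a minimal self-contained R sketch of the ECM iteration might look as follows. It must be started at a matrix whose off-diagonal entries are all nonzero (see Section 2.3); the function name and stopping rule are again our own choices for illustration.

```r
# ECM sketch for the covariance graphical lasso (assumes rho > 0 and a starting value
# Sigma0 whose off-diagonal entries are all nonzero).
cov_glasso_ecm <- function(S, rho, Sigma0, tol = 1e-4, max_iter = 100) {
  obj <- function(Sigma) {                        # objective function (1)
    R <- chol(Sigma)
    2 * sum(log(diag(R))) + sum(diag(S %*% chol2inv(R))) + rho * sum(abs(Sigma))
  }
  p <- nrow(S); Sigma <- Sigma0; f_old <- obj(Sigma)
  for (it in 1:max_iter) {
    Sigma_old <- Sigma       # E-step: the weights 1/|sigma_jk| come from the current iterate
    for (j in 1:p) {
      idx   <- setdiff(1:p, j)
      Sig11 <- Sigma[idx, idx]; S11 <- S[idx, idx]
      s12   <- S[idx, j];       s22 <- S[j, j]
      inv11 <- chol2inv(chol(Sig11))              # Sigma_11^{-1}
      beta  <- Sigma[idx, j]
      # CM-step for gamma: same closed form as (6)
      a     <- drop(t(beta) %*% inv11 %*% S11 %*% inv11 %*% beta -
                    2 * t(s12) %*% inv11 %*% beta + s22)
      gamma <- (-1 + sqrt(1 + 4 * rho * a)) / (2 * rho)
      # CM-step for beta: ridge-type closed form (13) with Lambda = diag(|beta^(l)|)
      V    <- inv11 %*% S11 %*% inv11 / gamma + rho * inv11
      u    <- drop(inv11 %*% s12) / gamma
      lam  <- abs(Sigma_old[idx, j])
      beta <- drop(solve(V + rho * diag(1 / lam, length(lam)), u))
      Sigma[idx, j] <- Sigma[j, idx] <- beta
      Sigma[j, j]   <- gamma + drop(t(beta) %*% inv11 %*% beta)
    }
    f_new <- obj(Sigma)
    if (abs(f_new - f_old) < tol) break
    f_old <- f_new
  }
  Sigma
}
```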
2.3 Algorithm convergence
The convergence of the proposed (block) coordinate descent algorithm can be addressed by the theory of block coordinate descent methods for nondifferentiable minimization developed by Tseng (2001). The key to applying the general theory to our algorithm is the separability of the nondifferentiable penalty terms in (1). First, from (6) and (8), the objective function has a unique minimum point in each coordinate block. This satisfies the conditions of Part (c) of Theorem 4.1 in Tseng (2001) and hence implies that the algorithm converges to a coordinatewise minimum point. Second, because all directional derivatives exist, by Lemma 3.1 of Tseng (2001), each coordinatewise minimum point is a stationary point. A closely related argument was given by Breheny & Huang (2011) to show the convergence of coordinate descent algorithms for nonconvex penalized regression models.
The convergence of the proposed ECM algorithm follows directly from the results of Meng & Rubin (1993). First, the analytical solutions (6) and (13) of the CM-steps are unique for any current iterate $\Sigma^{(l)}$ with nonzero off-diagonal elements. Second, it is easily seen that the construction of the CM-steps satisfies the "space filling" condition for any $\Sigma^{(l)}$ because the CM-steps cycle through the whole parameter space. Thus, the two conditions of Theorem 3 of Meng & Rubin (1993) hold for the proposed ECM, and so all limit points of the ECM sequence are stationary points of (1).
Using the latent variables makes the EM-type optimization simpler than direct optimization; however, it has some drawbacks. The ECM must be initialized at $\sigma_{jk}^{(0)} \neq 0$ for all $j < k$. Otherwise, from (11), the criterion function is undefined for any element with $\sigma_{jk}^{(l)} = 0$, and the algorithm will then be stuck at that zero. Another related issue is that, although the sequence $\sigma_{jk}^{(l)}$ may converge to zero, the iterates can never be identically equal to zero. We want to point out that these issues are not unique to the covariance graphical lasso models. Existing EM-type algorithms based on latent scale parameters for fitting regularized linear models (e.g., Polson & Scott 2011; Armagan et al. 2011) have the same problems.
3 Comparison of algorithms
We compare the computational aspects of the two algorithms in Sections 2.1 and 2.2 with Bien & Tibshirani (2011)'s algorithm. We consider two scenarios:

A sparse model taken from Bien & Tibshirani (2011), in which a specified subset of the elements of $\Sigma$ takes a common nonzero value and all other off-diagonal elements are zero; the nonzero value is chosen so that $\Sigma$ has a prescribed condition number.

A dense model in which all diagonal elements of $\Sigma$ are equal to a common value and all off-diagonal elements $\sigma_{jk}$, $j \neq k$, are equal to a common nonzero value.
The algorithm of Bien & Tibshirani (2011) is coded in R with its built-in functions. To be comparable to it, we implemented the ECM and the coordinate descent algorithms in R without writing any functions in a compiled language. All computations were done on an Intel Xeon X5680 3.33GHz processor.
For each of the sparse and dense models, we first generated two datasets of dimension $p = 100$ and $p = 200$, respectively. Then, for each of the four datasets, we applied the three algorithms for a range of values of the shrinkage parameter $\rho$. All computations were initialized at the sample covariance matrix, i.e., $\Sigma^{(0)} = S$. For Bien & Tibshirani (2011)'s algorithm, we followed the default settings of the tuning parameters provided by the spcov package (http://cran.r-project.org/web/packages/spcov/index.html). For the ECM and the coordinate descent algorithms, we used the same criterion as Bien & Tibshirani (2011)'s algorithm to stop the iterations: the procedure stops when the change in the objective function is smaller than a fixed tolerance.
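As an illustration only (not the exact design of the study described above), the sketches from Section 2 could be run over a grid of shrinkage values as follows; cov_glasso_cd and cov_glasso_ecm refer to the hypothetical functions sketched earlier, and the data-generating step is a placeholder rather than the sparse and dense models used in the paper.

```r
set.seed(1)
n <- 200; p <- 50
Y <- matrix(rnorm(n * p), n, p)            # placeholder data
S <- crossprod(Y) / n                      # sample covariance matrix S = Y'Y/n
rho_grid <- c(0.05, 0.1, 0.2, 0.4)

results <- t(sapply(rho_grid, function(rho) {
  t_cd  <- system.time(fit_cd  <- cov_glasso_cd(S, rho))["elapsed"]
  t_ecm <- system.time(fit_ecm <- cov_glasso_ecm(S, rho, Sigma0 = S))["elapsed"]
  c(rho         = rho,
    cpu_cd      = unname(t_cd),
    cpu_ecm     = unname(t_ecm),
    nonzero_cd  = mean(fit_cd[upper.tri(fit_cd)] != 0),    # fraction of nonzero off-diagonals
    nonzero_ecm = mean(fit_ecm[upper.tri(fit_ecm)] != 0))
}))
results
```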
First, we report on numerical stability. In the case of the sparse model with $p = 200$, Bien & Tibshirani (2011)'s algorithm would not run. A careful inspection of the algorithm details reveals that the Newton-Raphson step used there fails in this case. The same problem occurs quite often when we apply Bien & Tibshirani (2011)'s algorithm to different datasets generated from different $\Sigma$'s. This suggests that Bien & Tibshirani (2011)'s algorithm may lack numerical stability. It may be possible to reduce this numerical problem by calibrating some of Bien & Tibshirani (2011)'s tuning parameters, such as the initial value for the Newton-Raphson method. In any event, it seems safe to conclude that Bien & Tibshirani (2011)'s algorithm either requires very careful tuning or lacks stability in certain cases.
Now, we compare computing speed. The four panels in Figure 1 display the CPU times of the three algorithms for each of the four datasets. CPU time in seconds is plotted against the total number of off-diagonal nonzero elements estimated by the coordinate descent algorithm. The ECM and the coordinate descent algorithms are much faster than Bien & Tibshirani (2011)'s algorithm, except for the dense model with the larger dimension, where the ECM is comparable with Bien & Tibshirani (2011). Moreover, the coordinate descent algorithm seems to be particularly attractive for larger $\rho$, as its run time generally decreases as the total number of estimated nonzero elements decreases.
Next, we examine the ability of the algorithms to find minimum points. To do so, we compared the minimum values of the objective function achieved by each algorithm. For each dataset and each $\rho$, we calculated relative minimum values of the objective function, defined as the difference from the value achieved by the ECM algorithm:

$$ \Delta_{\mathrm{BT}} = g(\hat\Sigma_{\mathrm{BT}}) - g(\hat\Sigma_{\mathrm{ECM}}), \qquad \Delta_{\mathrm{CD}} = g(\hat\Sigma_{\mathrm{CD}}) - g(\hat\Sigma_{\mathrm{ECM}}), \qquad (14) $$

where $g$ denotes the objective function in (1), and $\hat\Sigma_{\mathrm{BT}}$, $\hat\Sigma_{\mathrm{CD}}$ and $\hat\Sigma_{\mathrm{ECM}}$ are the minimum points found by Bien & Tibshirani (2011)'s algorithm, the coordinate descent algorithm and the ECM algorithm, respectively. Thus, a negative value indicates that an algorithm finds better points than the ECM procedure, and a smaller relative value indicates a better relative performance of the method. The four panels in Figure 2 display the relative minimum values of the objective function for each of the four datasets. The relative minimum values are plotted against the total number of nonzero elements estimated by the coordinate descent algorithm. As can be seen, the coordinate descent algorithm performs best, as it always returns the lowest objective function among the three methods. The ECM converges to points that have slightly higher objective function values than Bien & Tibshirani (2011) in the dense scenario (Panels (c) and (d)); however, the difference is small (less than 0.05). Bien & Tibshirani (2011) seems to find points that are far less optimal than those of the coordinate descent or ECM algorithms in certain cases, as suggested by the spikes and dips in Panels (a) and (b), where the differences in the objective function are larger than 1.
[Figure 1: CPU time in seconds of the three algorithms, plotted against the number of estimated nonzero off-diagonal elements, for the four datasets; panels (a)-(d).]

[Figure 2: Relative minimum values of the objective function, plotted against the number of estimated nonzero off-diagonal elements, for the four datasets; panels (a)-(d).]
Finally, it is well known that, for nonconvex problems, no optimization algorithm is guaranteed to converge to a global minimum, and it is often recommended to run algorithms from multiple initial values. We therefore wish to compare the performance of the algorithms under different initial values. In the previous experiments, all computations were initialized at the full sample covariance matrix $\Sigma^{(0)} = S$. As an alternative, it is natural to initialize them at the other extreme, a diagonal matrix. However, this initial value can be used for the coordinate descent algorithm but not for the ECM: as discussed before, the ECM iteration will be stuck and never move when off-diagonal elements are exactly equal to zero. To avoid this problem, we start the ECM at a diagonal matrix perturbed by small nonzero off-diagonal entries. For each of the four datasets, we picked three values of $\rho$ representing three different levels of sparsity based on the previous experiment. For each of the four datasets and each of the three $\rho$'s, we ran the three algorithms starting from this new set of initial values.
We recorded the CPU time, the sparsity of the minimum points and the minimum value of the objective function; Table 1 reports these measures. Three things are worth noting. First, Bien & Tibshirani (2011)'s algorithm seems to get stuck at the initial value of a diagonal matrix in all cases. In contrast, the proposed algorithms work fine and find reasonable minimum points, because the minimum values of the objective function and the levels of sparsity are quite close to those obtained from starting at the full sample covariance matrix. We also tried other diagonal-type initial values for Bien & Tibshirani (2011)'s algorithm but found that it still gets stuck after a few iterations. It may be possible to improve the poor performance of Bien & Tibshirani (2011)'s algorithm by adjusting some of its tuning parameters. However, it seems safe to conclude that Bien & Tibshirani (2011)'s algorithm either requires very careful tuning or performs badly at this important initial value. Second, initial values indeed matter. Comparing the results between the full and diagonal initial values, we see substantial differences in all three measures. In particular, the limit points from the diagonal initial matrices are sparser than those from the full initial matrices. This is not surprising because of the drastic difference in sparsity between these two starting points. Third, comparing the minimum values of the objective function achieved by the three algorithms (last two columns), we see that the coordinate descent algorithm often reaches the lowest minimum values regardless of the initial points. The few exceptions (e.g., the sparse model with $p = 100$ and $\rho = 0.01$) where the coordinate descent is not the best seem to occur when $\rho$ is small and the fraction of nonzero elements is large.
Table 1. CPU time (seconds), fraction of nonzero off-diagonal elements (Nonzero) and minimum value of the objective function (Objective) for the three algorithms (BT: Bien & Tibshirani 2011; ECM; CD: coordinate descent) under two initial values: the full sample covariance matrix (Full) and the diagonal initial value (Diag). Each model is labeled by type, dimension p and shrinkage parameter rho. A dash indicates that the algorithm failed to run.

                                  CPU Time          Nonzero           Objective
Model (type, p, rho)   Method    Full    Diag     Full    Diag      Full      Diag
Sparse, 100, 0.01      BT         113      28    0.919   0.000     2.042    79.022
                       ECM         14      21    0.918   0.912     2.030     2.030
                       CD         135     102    0.920   0.918     2.033     2.032
Sparse, 100, 0.24      BT          67      26    0.303   0.000    39.455    79.022
                       ECM         26     122    0.316   0.256    39.463    39.504
                       CD          23      19    0.304   0.302    39.454    39.454
Sparse, 100, 1.11      BT         354      25    0.032   0.000    85.309    79.022
                       ECM        114       5    0.032   0.000    85.310    79.022
                       CD          21       0    0.032   0.000    85.306    79.022
Sparse, 200, 0.01      BT           -     221        -   0.000         -   156.840
                       ECM        300     368    0.880   0.877     4.407     4.407
                       CD        2230    1452    0.883   0.886     4.403     4.404
Sparse, 200, 0.15      BT           -     173        -   0.000         -   156.840
                       ECM        507    1486    0.311   0.281    54.941    54.971
                       CD         536     371    0.309   0.307    54.931    54.931
Sparse, 200, 1.00      BT           -     170        -   0.000         -   156.840
                       ECM        427      60    0.036   0.000   160.100   156.840
                       CD          58       4    0.035   0.000   160.090   156.840
Dense, 100, 0.02       BT          27      19    0.909   0.000    96.782   167.165
                       ECM         14      17    0.912   0.890    96.784    96.785
                       CD          45      18    0.909   0.880    96.783    96.782
Dense, 100, 0.19       BT          55      25    0.313   0.000   152.914   167.165
                       ECM         35      55    0.355   0.328   152.924   152.930
                       CD          11      11    0.311   0.303   152.911   152.911
Dense, 100, 0.32       BT         149      10    0.027   0.000   166.929   167.165
                       ECM         73      19    0.057   0.000   166.956   167.169
                       CD          23      17    0.024   0.009   166.910   166.953
Dense, 200, 0.01       BT         330     147    0.928   0.000   180.274   342.279
                       ECM        260     325    0.924   0.914   180.277   180.280
                       CD         342     222    0.928   0.927   180.275   180.274
Dense, 200, 0.15       BT         657     207    0.294   0.000   303.813   342.279
                       ECM        610     764    0.321   0.317   303.821   303.823
                       CD          95      95    0.292   0.294   303.803   303.803
Dense, 200, 0.29       BT         890     139    0.034   0.000   341.231   342.279
                       ECM       1154     263    0.060   0.000   341.255   342.285
                       CD         171     225    0.032   0.026   341.221   341.223
4 Discussion
We have developed two alternative algorithms for fitting sparse covariance graphical lasso models under the $\ell_1$ penalty. These two algorithms are shown to be much easier to implement, significantly faster to run and numerically more stable than the algorithm of Bien & Tibshirani (2011). Both MATLAB and R implementations of the new algorithms for solving covariance graphical lasso models are freely available from the author's website.
References
Armagan, A., Dunson, D. & Lee, J. (2011). Generalized double Pareto shrinkage. Statistica Sinica (forthcoming).

Bien, J. & Tibshirani, R. J. (2011). Sparse estimation of a covariance matrix. Biometrika 98, 807–820.

Breheny, P. & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics 5, 232–253.

Dempster, A. (1972). Covariance selection. Biometrics 28, 157–175.

Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1, 302–332.

Friedman, J., Hastie, T. & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.

Meng, X.-L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80, 267–278.

Polson, N. G. & Scott, J. G. (2011). Data augmentation for non-Gaussian regression models using variance-mean mixtures. arXiv e-prints.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109, 475–494.

Wu, T. T. & Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics 2, 224–244.