Covariance graphical lasso applies a lasso penalty on the elements of the covariance matrix. This method is useful because it not only produces sparse estimation of covariance matrix but also discovers marginal independence structures by generating zeros in the covariance matrix. We propose and explore two new algorithms for solving the covariance graphical lasso problem. Our new algorithms are based on coordinate descent and ECM. We show that these two algorithms are more attractive than the only existing competing algorithm of Bien & Tibshirani (2011) in terms of simplicity, speed and stability. We also discuss convergence properties of our algorithms.
Two New Algorithms for Solving Covariance Graphical Lasso Based on Coordinate Descent and ECM
Department of Statistics, University of South Carolina,
Columbia, South Carolina 29208, U.S.A.
(First version: May 17, 2012)
Key words: Coordinate descent; Covariance graphical lasso; Covariance matrix estimation; penalty; ECM algorithm; Marginal independence; Regularization; Shrinkage; Sparsity
Bien & Tibshirani (2011) proposed a covariance graphical lasso procedure for simultaneously estimating covariance matrix and marginal dependence structures. Let be the sample covariance matrix such that where is the data matrix of variables and samples. A basic version of their covariance graphical lasso problem is to minimize the following objective function:
over the space of positive definite matrices with being the shrinkage parameter. Here, is the covariance matrix and is the -norm of . A general version of the covariance graphical lasso in Bien & Tibshirani (2011) allows different shrinkage parameters for different elements in . To ease exposition, we describe our methods in the context of the simple case of common shrinkage parameter in (1). All of our results can be extended to the general version of different shrinkage parameters with little difficulty.
Because of the -norm term, the covariance graphical lasso is possible to set some of the off-diagonal elements of exactly equal to zero in its minimum point of (1). Zeros in encode marginal independence structures among the components of a multivariate normal random vector with covariance matrix . It is distinctly different from the concentration graphical models (also referred to as covariance selection models due to Dempster 1972) where zeros are in the concentration matrix and are associated with conditional independence.
The objective function (1) is not convex, imposing computational challenges for minimizing it. Bien & Tibshirani (2011) proposed a complicated majorize-minimize approach that iteratively solves convex approximation to the original nonconvex problem. In the current paper, we develop two alternative algorithms for minimizing (1): the coordinate descent algorithm and the Expectation/Conditional Maximization (ECM) algorithm. We discuss their convergence properties and investigate their computational efficiency through simulation studies. In comparison with Bien & Tibshirani (2011)’s algorithm, our new algorithms are much simpler to implement, substantially faster to run and numerically more stable in our tests.
Neither the coordinate descent algorithm nor the ECM algorithm is new to the model fitting for regularized problems. The coordinate descent algorithm is shown to be very competitive for fitting convex and even some non-convex penalized regression models (Friedman et al., 2007; Wu & Lange, 2008; Breheny & Huang, 2011) as well as the concentration graphical models (Friedman et al., 2008). The ECM algorithm (or its variants) has been developed in the Bayesian framework for finding the maximum a posteriori (MAP) estimation of regularized linear regression models (Polson & Scott, 2011; Armagan et al., 2011). However, our application of these two algorithms to covariance graphical lasso models is new and unexplored before. In this sense, our work documents that the coordinate descent algorithm and the ECM algorithm are also quite powerful in solving covariance graphical lasso models.
2 Two proposed algorithms
2.1 Coordinate descent algorithm
Our first algorithm to minimize the objective function (1) uses the simple idea of coordinate descent methods. We show how to update one column and row at a time while holding all of the rest elements in fixed. Without loss of generality, we focus on the last column and row. Partition and as follows:
where (a) and are the covariance matrix and the sample covariance matrix of the first variables, respectively; (b) and are the covariances and the sample covariances between the first variables and the last variable, respectively; and (c) and are the variance and the sample variance of the last variable, respectively.
and apply the block matrix inversion to using blocks :
After removing some constants not involving , the three terms in (1) can be expressed as a function of :
which lead to the following objective function with respect to :
For , removing terms in (5) that do not depend on gives
where . Clearly, it is solved as follows:
For , removing terms in (5) that do not depend on gives
where , The problem in (7) is a lasso problem and can be efficiently solved by fast coordinate descent algorithms (Friedman et al., 2007; Wu & Lange, 2008). Specifically, for , the minimum point of along the coordinate direction in which varies is:
where is the soft-threshold operator:
The update (8) is iterated for until convergence. We then update the column as followed by cycling through all columns until convergence. This algorithm can been viewed as a block coordinate descent method with blocks of s and another blocks of s. The algorithm is summarized as follows:
Coordinate descent algorithm Given input , start with , at the th iteration ()
Let and repeat (1)-(3) until convergence.
2.2 ECM algorithm
The second algorithm to find the estimator utilizes representation of the solution to (1) as a MAP estimator for the posterior distribution: We propose to use the ECM algorithm for iterative compute its modes. The key to the ECM algorithm is the choice of latent variables. Note that, for any off-diagonal element , we have the following well-known representation:
where the latent scale parameter. This suggests that the posterior distribution is the marginal distribution of the following complete posterior distribution of where :
Consequently, the covariance graphical lasso estimator from (1) is equivalent to the marginal mode for in (2.2). We now describe how to compute this marginal mode for using the ECM algorithm (Meng & Rubin, 1993).
To perform the E-step, we calculate the expected complete log-posterior by replacing with their conditional expectations given the current . Following the standard results of the inverse-Gaussian distribution, we have
which leads to the following criterion function after removing the terms that do not depend on ,
The CM-step then maximizes (2.2) along each column (row) of . Again, without loss of generality, we focus on the last column and row. Partition and as in (2) and consider the same transformation from to . After dropping of terms not involving , the four terms in (2.2) can be written as functions of
where . Holding all but fixed, we can then rewrite (2.2) as
where and are defined in (7). This implies that the conditional maximum point of is
Cycling through every column generates a sequence of CM steps. The total number of CM-steps is because there are columns and for each column the CM involves two steps: one for and the other for . The ECM algorithm can be summarized as follows:
ECM algorithm Given input , start with , at the th iteration,
E-step: Compute as in (2.2).
Let and repeat (1)-(4) until convergence.
2.3 Algorithm convergence
The convergence of the proposed (block) coordinate descent algorithm can be addressed by the theoretical innovation for block coordinate descent methods for non-differentiable minimization by Tseng (2001). The key to the application of the general theory there to our algorithm is the separability of the non-differentiable penalty terms in (1). First, from (6) and (8), the objective function has a unique minimum point in each coordinate block. This satisfies the conditions of Part (c) of Theorem 4.1 in Tseng (2001) and hence implies that the algorithm converges to a coordinatewise minimum point. Second, because all directional derivatives exist, by Lemma 3.1 of Tseng (2001), each coordinatewise minimum point is a stationary point. A close-related argument has been given by Breheny & Huang (2011) to show the convergence of coordinate decent algorithm for nonconvex penalized regression models.
The convergence of the proposed ECM algorithm follows directly from the results of Meng & Rubin (1993). First, the analytical solutions (6) and (13) of CM-steps are unique for any . Second, it can be easily seen that the construction of the CM-steps satisfies the “space filling” condition for any because the CM-steps cycle through the whole parameter space. Thus, the two conditions of Theorem 3 of Meng & Rubin (1993) hold for the proposed ECM and so all limiting points of the ECM sequence are stationary points of (1).
Using the latent variables makes the EM-type optimization simpler than direct optimization, however, it has some issues. The ECM must be initialized at for all . Otherwise, from (2.2), the criterion function , the algorithm will then stuck. Another related issue is that, although the sequence will possibly converge to zero, they cannot be identically equal to zero. We want to point out that these issues are not unique to the covariance graphical models. Existing EM-type algorithms based on the latent scale parameters for fitting regularized linear models (e.g., Polson & Scott 2011; Armagan et al. 2011) have the same problems.
3 Comparison of algorithms
A sparse model taken from Bien & Tibshirani (2011) with and zero otherwise. Here, is chosen so that the condition number of is .
A dense model with and for .
The algorithm of Bien & Tibshirani (2011) is coded in R with its built-in functions. To be comparable to it, we implemented the ECM and the coordinate descent algorithms in R without writing any functions in a compiled language. All computations were done on a Intel Xeon X5680 3.33GHz processor.
For either the sparse or the dense model, we first generated two datasets of dimension and , respectively. Then, for each of the four datasets, we applied the three algorithms for a range of values. All computations were initialized at the sample covariance matrix, i.e., . For Bien & Tibshirani (2011)’s algorithm, we followed the default setting of tuning parameters provided by the “spcov” package (http://cran.r-project.org/web/packages/spcov/index.html). For the ECM and the coordinate descent algorithms, we used the same criterion as Bien & Tibshirani (2011)’s algorithm to stop the iterations: The procedure stops when the change of the objective function is less than .
First, we report the performance of numerical stability. In the case of sparse models and , Bien & Tibshirani (2011)’s algorithm would not run. A careful inspection of the algorithm details reveals that the Newton-Raphson step there to find fails in this case. The same problem occurs quite often when we apply Bien & Tibshirani (2011)’s algorithm to different datasets generated from different . This suggests that Bien & Tibshirani (2011)’s algorithm may lack numerical stability. It may be possible to reduce this numerical problem by calibrating some of Bien & Tibshirani (2011)’s tuning parameters such as the initial value for the Newton-Raphson method. In any event, it may be safe to conclude that Bien & Tibshirani (2011)’s algorithm requires either very careful tuning or lacks stability in certain cases.
Now, we compare the computing speed. The four panels in Figure 1 display the CPU times of the three algorithms for each of the four datasets respectively. CPU time in seconds is plotted against the total number of off-diagonal non-zero elements estimated by the coordinate descent algorithm. The ECM and the coordinate descent algorithms are much faster than the Bien & Tibshirani (2011)’s algorithm except for the dense model and in which the ECM is comparable with Bien & Tibshirani (2011). Moreover, the coordinate descent algorithm seems to be particularly attractive for larger as its run time generally decreases when the total number of estimated non-zero elements decreases.
Next, we examine the ability of the algorithms to search for minimum points. To do so, we compared the minimum values of the objective functions achieved by each algorithm. For each dataset and each , We calculated relative minimum values of objective functions defined as:
where , and are the minimum points found by Bien & Tibshirani (2011), the coordinate descent and the ECM algorithms, respectively. Thus, a negative value indicates that the algorithm finds better points than the ECM procedure and a smaller relative objective function indicates a better relative performance of the method. The four panels in Figure 2 display the relative minimum values of the objective functions for each of the four datasets, respectively. The relative minimum values are plotted as functions of the total number of non-zero elements estimated by the coordinate descent algorithm. As can be seen, the coordinate descent algorithm performs best as it always returns the lowest objective function among the three methods. The ECM converges to points that have slightly higher objective function than Bien & Tibshirani (2011) in dense scenario (Panel (c) and (d)), however, the difference is small (less than 0.05). Bien & Tibshirani (2011) seems to find points that are far less optimal than the coordinate descent or ECM algorithms in certain cases as is suggested by the spikes and dips in panel (a) and (b) where the differences in the objective function are larger than 1.
Finally, it is known that, for nonconvex problems, any optimization algorithms are not guaranteed to converge to a global minimum. It is often recommended to run algorithms at multiple initial values. We then wish to compare the performance of the algorithms under different initial values. In the previous experiments, all computations were initialized at the full sample covariance matrix . To be different, it is natural to initialize them at the other extreme case in which . However, this initial value can only be used for the coordinate descent but not for ECM as discussed before, since the ECM iteration will stuck at and never move when the off-diagonal elements are exactly equal to zero. To avoid this problem, we start the ECM at to avoid exact zeros on the off-diagonal elements. For each of the four datasets, we picked three different values of so that they represent three different levels of sparsity based on the experiment before. For each of the four datasets and each of the three s, we ran the three algorithms starting from the new set of initial values:
We recorded the CPU time, sparsity of the minimum points and the minimum value of the objective function. Table 1 reports these measures. Three things are worth noting. First, Bien & Tibshirani (2011)’s algorithm seems to get stuck at the initial value of a diagonal matrix in all cases. In contrast, the proposed algorithms work fine and find reasonable minimum points of because the minimum values of the objective function and level of sparsity are quite close to those obtained from starting at the full sample covariance matrix. We have tried to initialize Bien & Tibshirani (2011)’s algorithm at but found that it still gets stuck after a few iterations. It may be possible to improve the poor performance of Bien & Tibshirani (2011)’s algorithm by adjusting some of its tuning parameters. However, it may be safe to conclude that Bien & Tibshirani (2011)’s algorithm requires either very careful tuning or performs badly at this important initial value. Second, initial values indeed matter. Comparing the results between full and diagonal initial values, we see substantial differences in all three measures. In particular, the limiting points from the diagonal initial matrices are sparser than those from the full initial matrices. This is not surprising because of the drastic difference in sparsity between these two starting points. Third, comparing the minimum values of the objective function achieved by the three algorithms (last two columns), we see that the coordinate descent often reaches the lowest minimum values regardless of the initial points. The few exceptions (e.g., sparse , ) when the coordinate descent is not the best seems to be the case that is small and the fraction of the number of non-zero elements is large.
|CPU Time||Nonzero||Objective Func|
|Sparse, 100, 0.01||BT||113||28||0.919||0.000||2.042||79.022|
|Sparse, 100, 0.24||BT||67||26||0.303||0.000||39.455||79.022|
|Sparse, 100, 1.11||BT||354||25||0.032||0.000||85.309||79.022|
|Sparse, 200, 0.01||BT||-||221||-||0.000||-||156.840|
|Sparse, 200, 0.15||BT||-||173||-||0.000||-||156.840|
|Sparse, 200, 1.00||BT||-||170||-||0.000||-||156.840|
|Dense, 100, 0.02||BT||27||19||0.909||0.000||96.782||167.165|
|Dense, 100, 0.19||BT||55||25||0.313||0.000||152.914||167.165|
|Dense, 100, 0.32||BT||149||10||0.027||0.000||166.929||167.165|
|Dense, 200, 0.01||BT||330||147||0.928||0.000||180.274||342.279|
|Dense, 200, 0.15||BT||657||207||0.294||0.000||303.813||342.279|
|Dense, 200, 0.29||BT||890||139||0.034||0.000||341.231||342.279|
We have developed two alternative algorithms for fitting sparse covariance graphical lasso models using penalty. These two algorithms are shown to be much easier to implement, significantly faster to run and numerically more stable than the algorithm of Bien & Tibshirani (2011). Both MATLAB and R software packages implementing the new algorithms for solving covariance graphical models are freely available from the author’s the website of the paper.
- Armagan, A., Dunson, D. & Lee, J. (2011). Generalized double Pareto shrinkage. Statistica Sinica (forthcoming) .
- Bien, J. & Tibshirani, R. J. (2011). Sparse estimation of a covariance matrix. Biometrika 98, 807–820.
- Breheny, P. & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics , 232–253.
- Dempster, A. (1972). Covariance selection. Biometrics 28, 157–75.
- Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1, pp. 302–332.
- Friedman, J., Hastie, T. & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
- Meng, X.-L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ecm algorithm: A general framework. Biometrika 80, 267–278.
- Polson, N. G. & Scott, J. G. (2011). Data augmentation for non-Gaussian regression models using variance-mean mixtures. ArXiv e-prints .
- Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109, 475–494.
- Wu, T. T. & Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics 2, pp. 224–244.