A multilevel framework for sparse optimization with application to inverse covariance estimation and logistic regression. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 623212, MC Multiscale Inversion. This research was also funded (in part) by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).


Eran Treister Department of Earth and Ocean Sciences, University of British Columbia, Vancouver, Canada. eran@cs.technion.ac.il.    Javier S. Turek Intel Labs, 2111 NE 25th Ave., Hillsboro, OR 97124, javier.turek@intel.com.    Irad Yavneh Department of Computer Science, Technion—Israel Institute of Technology, Haifa, Israel. irad@cs.technion.ac.il.
Abstract

Solving $\ell_1$-regularized optimization problems is common in the fields of computational biology, signal processing and machine learning. Such regularization is utilized to find sparse minimizers of convex functions. A well-known example is the LASSO problem, where the $\ell_1$ norm regularizes a quadratic function. A multilevel framework is presented for solving such regularized sparse optimization problems efficiently. We take advantage of the expected sparseness of the solution, and create a hierarchy of problems of similar type, which is traversed in order to accelerate the optimization process. This framework is applied for solving two problems: (1) the sparse inverse covariance estimation problem, and (2) $\ell_1$-regularized logistic regression. In the first problem, the inverse of an unknown covariance matrix of a multivariate normal distribution is estimated, under the assumption that it is sparse. To this end, an $\ell_1$-regularized log-determinant optimization problem needs to be solved. This task is challenging especially for large-scale datasets, due to time and memory limitations. In the second problem, the $\ell_1$-regularization is added to the logistic regression classification objective to reduce overfitting to the data and obtain a sparse model. Numerical experiments demonstrate the efficiency of the multilevel framework in accelerating existing iterative solvers for both of these problems.

Key words. Sparse optimization, Covariance selection, Sparse inverse covariance estimation, Proximal Newton, Block Coordinate Descent, Multilevel methods, $\ell_1$-regularized logistic regression.

AMS subject classifications. 90C06, 90C25, 90C22

1 Introduction

Sparse solutions of optimization problems are often sought in various fields such as signal processing, machine learning, computational biology, and others [55, 42]. Particular applications include sparse modelling of signals [12], compressed sensing [11, 6], speech recognition [4], gene network analysis [10], and brain connectivity [28]. To promote sparsity, an $\ell_1$ norm regularization term is often introduced, generally leading to convex optimization problems of the form

\[
x^* = \operatorname*{argmin}_{x\in\mathbb{R}^n} F(x) = \operatorname*{argmin}_{x\in\mathbb{R}^n} f(x) + \lambda \|x\|_1, \tag{1.1}
\]

where the function $f$ is smooth (continuously differentiable) and convex, and $\lambda$ is a positive scalar parameter that balances between sparsity and adherence to minimizing $f$. A larger parameter $\lambda$ tends to produce a sparser minimizer $x^*$, but also a higher value of $f(x^*)$. Problem (1.1) is convex but non-smooth due to the $\ell_1$ regularizer, and traditional optimization methods such as gradient descent tend to converge slowly. For this reason, in many cases special methods are developed for specific instances of (1.1). A well-known special case is the LASSO problem [43], where $f$ is a quadratic function,

\[
x^* = \operatorname*{argmin}_{x\in\mathbb{R}^n} \tfrac{1}{2} x^T H x + x^T g + \lambda \|x\|_1, \tag{1.2}
\]

with $g\in\mathbb{R}^n$, and $H\in\mathbb{R}^{n\times n}$ a positive semi-definite matrix. This problem can often be solved quite efficiently by so-called “iterative shrinkage” or “iterative soft thresholding (IST)” methods [15, 29, 8, 13, 5, 49, 47, 54] and other methods [16, 18, 50, 40, 36].
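To make the iterative-shrinkage idea concrete, the following minimal NumPy sketch applies a constant-step soft-thresholding iteration to the LASSO objective (1.2). The function names and the fixed iteration count are our illustrative choices, not part of the paper.

```python
import numpy as np

def soft_shrink(t, thresh):
    """Soft thresholding: shrink |t| by thresh, set to zero below it."""
    return np.sign(t) * np.maximum(0.0, np.abs(t) - thresh)

def ista_lasso(H, g, lam, n_iters=500):
    """Iterative shrinkage for min 0.5*x'Hx + x'g + lam*||x||_1.

    Uses the constant diagonal D = c*I with c an upper bound on the
    spectral radius of H (the SSF-style choice)."""
    n = H.shape[0]
    x = np.zeros(n)
    c = np.linalg.norm(H, 2)   # spectral norm bounds the Hessian
    for _ in range(n_iters):
        grad = H @ x + g       # gradient of the smooth quadratic part
        x = soft_shrink(x - grad / c, lam / c)
    return x
```

The returned point can be checked against the standard optimality conditions of (1.2): entries in the support satisfy $(Hx+g)_i + \lambda\,\mathrm{sign}(x_i)=0$, and zero entries satisfy $|(Hx+g)_i|\le\lambda$.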

As a rule, iterative methods that have been developed for solving the quadratic problem (1.2) can be generalized and used to solve the general problem (1.1). This can be done by rather straightforward approaches: at each iteration the smooth part $f$ in (1.1) is approximated by some quadratic function, whereas the non-smooth $\ell_1$ term remains intact. Then, a descent direction is computed by approximately solving the resulting $\ell_1$-regularized quadratic minimization problem of the form (1.2) by iterative shrinkage methods. This approximation can be applied using only first order information (i.e., the gradient $\nabla f$), or second order information (both $\nabla f$ and the Hessian $\nabla^2 f$). The former approach results in a gradient-descent-like method, while the latter approach, on which we will focus in this paper, is called the “proximal Newton” method [32]. We give a precise description in the next section.

In this work we propose a multilevel framework for accelerating existing solvers for problem (1.1), based on the work of [45] which introduced a similar framework for the LASSO problem. In this framework, the convergence of existing iterative methods is accelerated using a nested hierarchy of successively smaller versions of the problem. Exploiting the sparsity of the sought approximation, the dimension of problem (1.1) is reduced by temporarily ignoring ostensibly irrelevant unknowns, which remain zero for the duration of the multilevel iteration. That is, each reduced problem is defined by (1.1), restricted to a specially chosen subset of variables. This yields a nested hierarchy of problems. Subspace corrections are performed by applying iterative methods to each of the low dimensional problems in succession, with the aim of accelerating the optimization process. Under suitable conditions, this algorithm converges to a global minimizer of (1.1).

In the second and third parts of the paper, we apply our framework to the solution of (1) the sparse inverse covariance estimation problem, and (2) the $\ell_1$-regularized logistic regression problem. Both of these have the form of (1.1), and we focus on the first one in more detail because it is significantly more complicated and challenging. In this problem, the inverse of the covariance matrix of a multivariate normal distribution is estimated from a relatively small set of samples, assuming that it is sparse. Its estimation is performed by solving an $\ell_1$-regularized log-determinant optimization problem, on which we elaborate later in this paper. Many methods were recently developed for solving the covariance selection problem [3, 2, 7, 19, 22, 24, 25, 26, 33, 35, 44], and a few of those [26, 25, 35, 44] involve a proximal Newton approach. In the present work we mostly focus on large-scale instances of this problem, which are required in fMRI [25] and gene expression analysis [10, 23] applications, for example. Such large scale problems are very challenging, primarily because of memory limitations, and the only two published methods that are capable of handling them are [25, 44]. In this paper we review the method of [44] and accelerate it by our multilevel framework. Moreover, we show that our framework is more efficient than other acceleration strategies for this problem. These ideas can be exploited for other large-scale instances of log-determinant sparse optimization problems such as [48, 37, 27]. Following this, we briefly describe the $\ell_1$-regularized logistic regression problem, and accelerate the two methods [51] and [52] using our multilevel framework.

2 First and second order methods for $\ell_1$-regularized sparse optimization

As mentioned above, problem (1.1) can be solved by adapting iterated shrinkage methods for (1.2). To achieve that, at iteration $k$ the smooth function $f$ in (1.1) is replaced by a quadratic approximation around the current iterate $x^{(k)}$ to obtain a descent direction $z^{(k)}$. More specifically, $z^{(k)}$ is obtained by approximately solving

\[
z^{(k)} = \operatorname*{argmin}_{z\in\mathbb{R}^n} \tilde{F}(x^{(k)}+z) = \operatorname*{argmin}_{z\in\mathbb{R}^n} f(x^{(k)}) + \langle g^{(k)}, z\rangle + \tfrac{1}{2}\langle z, H^{(k)} z\rangle + \lambda \|x^{(k)}+z\|_1, \tag{2.1}
\]

where $g^{(k)} = \nabla f(x^{(k)})$, and $H^{(k)}$ is a positive definite matrix that is method specific. This problem is similar to (1.2) and can be solved using the shrinkage methods mentioned earlier. The role of $H^{(k)}$ in (2.1) is to either incorporate second order information ($H^{(k)} = \nabla^2 f(x^{(k)})$, the Hessian of $f$) or mimic it in a simpler and cheaper way. If $H^{(k)} = \nabla^2 f(x^{(k)})$, then this results in the “proximal Newton” method [32], and in that case, if (2.1) is solved to sufficient accuracy, the proximal Newton method has a superlinear asymptotic convergence rate [32]. Once $z^{(k)}$ is found, the next iterate is obtained by

\[
x^{(k+1)} = x^{(k)} + \alpha z^{(k)}, \tag{2.2}
\]

where $\alpha$ is a scalar which may be chosen a priori or determined by a line-search procedure. For example, one may use the Armijo rule [1] or an exact linesearch if possible [47].
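The proximal Newton loop can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the quadratic model (2.1) is minimized with plain shrinkage iterations, and a simplified sufficient-decrease backtracking test stands in for the Armijo rule; all names are ours.

```python
import numpy as np

def soft(t, a):
    return np.sign(t) * np.maximum(0.0, np.abs(t) - a)

def prox_newton(grad_f, hess_f, F, x0, lam, outer=20, inner=50):
    """Proximal Newton sketch: build the model (2.1) at x, solve it
    approximately by inner shrinkage iterations, then take a damped
    step (2.2) chosen by backtracking on F."""
    x = x0.copy()
    for _ in range(outer):
        g, Hm = grad_f(x), hess_f(x)
        c = np.linalg.norm(Hm, 2)      # step bound for the inner solver
        z = np.zeros_like(x)
        for _ in range(inner):         # shrinkage on the model around x
            gm = g + Hm @ z            # gradient of the quadratic model
            z = soft(x + z - gm / c, lam / c) - x
        alpha = 1.0                    # backtracking sufficient decrease
        while F(x + alpha * z) > F(x) - 1e-4 * alpha * (z @ z):
            alpha *= 0.5
            if alpha < 1e-8:
                break
        x = x + alpha * z
    return x
```

For a quadratic $f$ the model equals the objective, so the first full step already lands near the minimizer; for general smooth convex $f$ the loop mirrors the scheme (2.1)-(2.2) described above.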

A simpler approach for defining (2.1) is to include only first order information, and to use a simpler $H^{(k)}$, often chosen to be a diagonal positive definite matrix, denoted $D^{(k)}$. In this case, the problem (2.1) has a closed-form solution

\[
z^{(k)} = S_{\lambda (D^{(k)})^{-1}}\left( x^{(k)} - (D^{(k)})^{-1} g^{(k)} \right) - x^{(k)}, \tag{2.3}
\]

where

\[
S_\lambda(t) = \operatorname{sign}(t)\cdot\max(0, |t|-\lambda) \tag{2.4}
\]

is the “soft shrinkage” function, which reduces the absolute value of $t$ by $\lambda$, or sets it to zero if $|t| \le \lambda$. This results in a first order shrinkage method, which is similar to gradient descent or quasi-Newton methods, and can be seen as a generalization of existing shrinkage methods for (1.2). More specifically, a generalization of the separable surrogate functionals (SSF) method of [8] can be obtained by defining $D^{(k)} = cI$, where $I$ is the identity matrix and $c$ is an upper bound on the spectral radius of the Hessian. Similarly, a generalization of the parallel coordinate descent (PCD) method [13] for (1.1) reads $D^{(k)} = \operatorname{diag}(\nabla^2 f(x^{(k)}))$ (the diagonal part of the Hessian). Such generalizations, which are mentioned in [13] for example, are quite straightforward. Moreover, in [49] for example, the general problem (1.1) is addressed rather than the quadratic (1.2), which is the actual target of this work. We note that the convergence guarantee of such methods for (1.1) requires modest assumptions on the smooth function $f$ (i.e., the level set $\{x : F(x) \le F(x^{(0)})\}$ is compact, and the Hessian is bounded: $\|\nabla^2 f(x)\| \le M$). Generally, such methods converge linearly, but can be accelerated by subspace methods like SESOP [13, 54] and non-linear conjugate gradients [54, 45].

Although the shrinkage methods are generally more efficient than other traditional methods, they can be further accelerated. A main problem of these methods is that, during the iterations, the iterates may be denser than the final solution. In particular, if we start from a zero initial guess, we would typically get several initial iterates which are far less sparse than the final minimizer $x^*$. The rest of the iterates will gradually become sparser, until the sparsity pattern converges to that of $x^*$. The main drawback is that those initial iterations with the denser iterates may be significantly more expensive than the later iterations with the sparse iterates. For example, when solving (1.2) a matrix-vector multiplication needs to be applied at each shrinkage iteration (2.3). If the matrix $H$ is given explicitly (i.e., not as a fast operator) and $x^{(k)}$ is sparse, then computations can be saved by not multiplying the columns of $H$ that correspond to the zero entries of $x^{(k)}$.

One of the most common and simplest ways to address this phenomenon is by a continuation procedure [47, 49]. In this procedure a sequence of problems (1.1) that correspond to a sequence of decreasing regularization parameters $\lambda$ is solved. This sequence starts with a relatively large value of $\lambda$ which is gradually decreased, and at each stage the new initial guess is given by the approximate solution to (1.1) obtained with the previous, bigger, $\lambda$. Since a bigger $\lambda$ yields a sparser solution, this continuation procedure may decrease the number of non-zeros in the iterates at the expense of applying additional iterations for the larger $\lambda$'s.
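A minimal sketch of the continuation idea for the quadratic case follows; the geometric $\lambda$-schedule (factor 4, four stages) and the helper names are illustrative choices, not from the paper.

```python
import numpy as np

def shrink_solve(H, g, lam, x0, iters=300):
    """Basic iterative shrinkage for the LASSO objective (1.2),
    warm-started at x0."""
    c = np.linalg.norm(H, 2)
    x = x0.copy()
    for _ in range(iters):
        u = x - (H @ x + g) / c
        x = np.sign(u) * np.maximum(0.0, np.abs(u) - lam / c)
    return x

def continuation(H, g, lam_target, n_stages=4, factor=4.0, iters=300):
    """Solve a sequence of problems with decreasing lambda,
    warm-starting each stage from the previous solution, so the
    early (large-lambda) iterates stay sparse."""
    lam = lam_target * factor ** (n_stages - 1)
    x = np.zeros(H.shape[0])
    for _ in range(n_stages):
        x = shrink_solve(H, g, lam, x, iters=iters)
        lam = max(lam / factor, lam_target)
    return x
```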

3 A multilevel framework for $\ell_1$-regularized sparse optimization

We begin this section by providing some definitions and motivation that will be useful in the remainder of the paper. Let

\[
\operatorname{supp}(x) = \{\, i : x_i \neq 0 \,\} \tag{3.1}
\]

denote the support of the vector $x$: the set of indices of its non-zero elements. The key challenge of sparse optimization is to identify the best (small) support for minimizing the objective $F$. Additionally, we must calculate the values of the non-zeros of this minimizer. In many cases, if we were initially given the support of the minimizer, it would make the solution of (1.1) easier. (In such cases, where a support is known or assumed, the $\ell_1$ regularization may be dropped from the problem. However, here we focus on the solution of (1.1) for an unknown support, and use the known-support case only as a motivation. Therefore, in this discussion we keep the $\ell_1$ regularization even if the support is known.) For example, the LASSO problem (1.2) has $n$ variables; however, if the support $C^*$ of its minimizer is known and consists of only about $1\%$ of the entries, we can ignore all entries not in $C^*$ and solve the problem

\[
\min_{x_{C^*}\in\mathbb{R}^{|C^*|}} \tfrac{1}{2} x_{C^*}^T H_{C^*} x_{C^*} + x_{C^*}^T g_{C^*} + \lambda \|x_{C^*}\|_1, \tag{3.2}
\]

where $x_{C^*}$, $g_{C^*}$ and $H_{C^*}$ are the same components $x$, $g$, and $H$ from (1.2), restricted to the entries in $C^*$. This is the same problem as (1.2), but is about 100 times smaller and therefore much cheaper to solve. This motivates our approach.
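The saving can be sketched as follows: restrict $H$ and $g$ to a candidate support, run shrinkage on the small problem (3.2), and embed the result back into $\mathbb{R}^n$. The helper name and iteration counts are ours.

```python
import numpy as np

def solve_restricted(H, g, lam, support, iters=500):
    """Solve the LASSO (1.2) restricted to a support set, as in (3.2):
    only entries in `support` may become non-zero."""
    idx = np.asarray(sorted(support))
    if len(idx) == 0:
        return np.zeros(len(g))
    Hs = H[np.ix_(idx, idx)]      # |C| x |C| restricted matrix
    gs = g[idx]
    c = np.linalg.norm(Hs, 2)
    xs = np.zeros(len(idx))
    for _ in range(iters):        # shrinkage on the small problem
        u = xs - (Hs @ xs + gs) / c
        xs = np.sign(u) * np.maximum(0.0, np.abs(u) - lam / c)
    x = np.zeros(len(g))          # embed back into R^n
    x[idx] = xs
    return x
```

When the support contains the minimizer's support, the restricted solve reaches the same objective value as the full solve at a fraction of the per-iteration cost.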

In our multilevel framework, the convergence of common iterative methods is accelerated using a nested hierarchy of smaller versions of the problem, referred to as coarse problems. At each multilevel iteration, denoted by “ML-cycle”, we define a hierarchy of such coarse problems using the sparsity of the iterated approximations $x^{(k)}$. Each coarse problem is defined by (1.1), restricted to a subset of the variables, while keeping the other variables as zeros. This process is repeated several times, yielding a nested hierarchy of problems. In each ML-cycle we traverse the entire hierarchy of levels, from the coarsest to the finest, applying iterations using methods like (2.3) or proximal Newton over each of the coarse problems in turn. We henceforth refer to such iterations as relaxations. These aim to activate the variables that comprise the support of a minimizer. We iteratively repeat these ML-cycles until some convergence criterion is satisfied.

3.1 Definition of the coarse problems and multilevel cycle

As noted, at each level $l$ we define a reduced coarse problem by limiting problem (1.1) to a subset of entries, denoted by $\mathcal{C}_l \subseteq \{1,\dots,n\}$. In this subsection we assume that the subset $\mathcal{C}_l$ is given, and defer the discussion on how it is chosen to the next section. Given $\mathcal{C}_l$, the coarse problem for level $l$ is defined by

\[
\min_{x\in\mathbb{R}^n,\ \operatorname{supp}(x)\subseteq \mathcal{C}_l} F(x), \tag{3.3}
\]

where $F$ is the same objective as in (1.1), and $\operatorname{supp}(x)$ is defined in (3.1). Effectively, (3.3) has only $|\mathcal{C}_l|$ unknowns, hence it is of lower dimension. Furthermore, if $\mathcal{C}_l$ contains the support of a minimizer $x^*$ of (1.1), i.e., $\operatorname{supp}(x^*) \subseteq \mathcal{C}_l$, then $x^*$ is also a solution of (3.3). Otherwise, the solutions of the two problems are not identical.

We define our multilevel hierarchy by choosing nested subsets of variables $\mathcal{C}_l$. Given the current iterate $x^{(k)}$, we define the hierarchy

\[
\{1,\dots,n\} = \mathcal{C}_0 \supset \mathcal{C}_1 \supset \dots \supset \mathcal{C}_L = \operatorname{supp}(x^{(k)}). \tag{3.4}
\]

In the ML-cycle, we treat the levels from $l=L$ to $l=0$ in succession by applying relaxations for the reduced problem (3.3) corresponding to each subset $\mathcal{C}_l$. We typically apply only one or two relaxations on each level $l < L$. On the coarsest level $L$, more relaxations may be applied because $|\mathcal{C}_L|$ is typically small and they are not expensive. The multilevel cycle procedure is presented in detail in Algorithm 1.

To define an iterative relaxation for (3.3) at each of the levels in (3.4), one can adapt most iterative relaxations that are suitable for the finest problem (1.1) by allowing only the elements in $\mathcal{C}_l$ to vary and fixing the other elements to zero. It is important to choose a relaxation whose cost is at worst proportional to the number of unknowns $|\mathcal{C}_l|$. As an example, consider again the relation between the original LASSO problem (1.2) and one restricted to a much smaller set of variables, (3.2). If the matrices are given explicitly in memory, then each shrinkage iteration for (3.2) is significantly cheaper than a shrinkage iteration for (1.2), with cost roughly proportional to the size of $C^*$. We note that any specific problem of the form (1.1), for which there exists some relaxation whose cost is proportional to $|\mathcal{C}_l|$ when applied to the restricted problem (3.3), is suitable to be accelerated with our ML approach.

3.2 Choosing the coarse variables

Given $\mathcal{C}_{l-1}$, our task is to select a subset of indices, $\mathcal{C}_l \subset \mathcal{C}_{l-1}$, that is significantly smaller than $\mathcal{C}_{l-1}$ (see below), and is deemed most likely to contain the support of the ultimate solution we are seeking, $\operatorname{supp}(x^*)$. We begin by choosing $\mathcal{C}_l$ to include the indices that are in the support of the current iterate, $\operatorname{supp}(x^{(k)})$. To these we add indices not in $\operatorname{supp}(x^{(k)})$ that are estimated to be relatively likely to end up in $\operatorname{supp}(x^*)$, namely, indices corresponding to variables with a relatively large absolute value of the current gradient, $|(\nabla f(x^{(k)}))_i|$. To motivate this choice, consider the minimization problem (2.1) restricted to a single element of $x$, denoted $x_i$, and assume that $x_i = 0$. We get the scalar minimization problem

\[
z_i^{\mathrm{opt}} = \operatorname*{argmin}_{z_i} \tfrac{1}{2} a z_i^2 + b z_i + c + \lambda |z_i|, \tag{3.5}
\]

where $a = H^{(k)}_{ii} > 0$, $b = (\nabla f(x^{(k)}))_i$, and $c$ is a constant. A closed-form solution of (3.5) is given by

\[
z_i^{\mathrm{opt}} =
\begin{cases}
\frac{-b+\lambda}{a} & \text{if } b > \lambda \\[2pt]
\frac{-b-\lambda}{a} & \text{if } b < -\lambda \\[2pt]
0 & \text{otherwise},
\end{cases} \tag{3.6}
\]

which corresponds to element $i$ in (2.3). This indicates that if $|b| = |(\nabla f(x^{(k)}))_i|$ is relatively large, then we have a relatively good chance that the variable $x_i$ will become non-zero and will enter the support in the next iterate $x^{(k+1)}$.
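The closed-form solution (3.6) is easy to check numerically; the sketch below (helper name is ours) compares it against a brute-force grid search over the scalar objective (3.5).

```python
import numpy as np

def scalar_l1_min(a, b, lam):
    """Closed-form minimizer (3.6) of 0.5*a*z^2 + b*z + lam*|z|, a > 0."""
    if b > lam:
        return (-b + lam) / a
    if b < -lam:
        return (-b - lam) / a
    return 0.0
```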

To summarize, for a given $\mathcal{C}_{l-1}$ we first decide on the size of $\mathcal{C}_l$. In this work we choose $|\mathcal{C}_l| = \max\{\lceil \tfrac12 |\mathcal{C}_{l-1}|\rceil, |\operatorname{supp}(x^{(k)})|\}$, and terminate the coarsening (setting $L = l$) when $|\mathcal{C}_l| = |\operatorname{supp}(x^{(k)})|$. Then, to populate $\mathcal{C}_l$, we first choose to include $\operatorname{supp}(x^{(k)})$, and then add the indices of the additional variables with the largest values of $|(\nabla f(x^{(k)}))_i|$. The choice of the coarsening ratio of approximately $\tfrac12$ turns out to strike a good balance between ML-cycle cost and efficacy. If the cost of the relaxation at level $l$ is proportional to $|\mathcal{C}_l|$, then the cost of a ML-cycle with $\nu$ relaxations per level is approximately equal to $2\nu$ relaxations on the finest level. This means that, although we include a relatively large fraction of the variables when we coarsen to the next level, the cost of the entire cycle remains relatively small.
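The selection rule of this subsection can be sketched as follows, assuming the iterate's support and the current gradient are available; the function names, the tie handling, and the termination guard are our choices.

```python
import numpy as np

def choose_coarse(prev_set, supp, grad, ratio=0.5):
    """Select C_l from C_{l-1}: always keep supp(x), then fill up to
    roughly ratio*|C_{l-1}| indices with the largest-|gradient| rest."""
    target = max(int(np.ceil(ratio * len(prev_set))), len(supp))
    chosen = set(supp)
    rest = [i for i in prev_set if i not in chosen]
    rest.sort(key=lambda i: -abs(grad[i]))   # likeliest support entrants
    for i in rest[: target - len(chosen)]:
        chosen.add(i)
    return chosen

def build_hierarchy(n, supp, grad, ratio=0.5):
    """Nested sets (3.4): {1..n} = C_0 > C_1 > ... > C_L = supp(x)."""
    levels = [set(range(n))]
    while len(levels[-1]) > len(supp):
        nxt = choose_coarse(levels[-1], supp, grad, ratio)
        if len(nxt) == len(levels[-1]):      # guard: no further shrink
            break
        levels.append(nxt)
    return levels
```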

Finally, we note that in practice we define the nested hierarchy using the gradient from the relaxation on the finest level of the previous cycle, which includes all the variables. This relaxation is also used for monitoring convergence, as it is the only place where the gradient is calculated for all variables.

4 Theoretical results

In this section we state some theoretical observations regarding the relaxation methods defined by (2.1)-(2.2), and our multilevel framework in Algorithm 1.

4.1 Theoretical results for the relaxation methods

We show that under suitable conditions any relaxation method defined by (2.1)-(2.2) is monotonically decreasing and convergent. We first prove the following lemmas.

Lemma 4.1

(Monotonicity of the relaxation.) Assume that the Hessian is bounded, $\|\nabla^2 f(x)\| \le M$, and that $z^{(k)}$ is defined by (2.1)-(2.2), with $H^{(k)} \succeq \gamma_{\min} I$ and $0 < \alpha < \gamma_{\min}/M$, where $\gamma_{\min}$ is a positive constant. Then

\[
F(x^{(k)}) - F(x^{(k)} + \alpha z^{(k)}) \ge K \alpha \|z^{(k)}\|^2, \tag{4.1}
\]

where $K$ is a positive constant. Furthermore, the linesearch parameter $\alpha$ in (2.2) can be chosen to be bounded away from zero, i.e., $\alpha \ge \bar\alpha > 0$.

Proof. The following analysis is inspired by [26] and [49]. Let us drop the superscript $k$, and write, for any $x$, $z$, and $\alpha \in [0,1]$,

\[
\|x+\alpha z\|_1 = \|\alpha(x+z) + (1-\alpha)x\|_1 \le \alpha\|x+z\|_1 + (1-\alpha)\|x\|_1. \tag{4.2}
\]

Next, since the Hessian is bounded, we can write, for any $x$, $z$, and $\alpha$,

\[
f(x+\alpha z) \le f(x) + \alpha z^T \nabla f + \tfrac{1}{2}\alpha^2 M \|z\|^2. \tag{4.3}
\]

Now, let us assume that $z$ was yielded by (2.1). We obtain

\[
\begin{aligned}
F(x+\alpha z) &= f(x+\alpha z) + \lambda\|x+\alpha z\|_1 \\
&\le f(x) + \alpha z^T\nabla f + \tfrac{1}{2}\alpha^2 M\|z\|^2 + \lambda\|x+\alpha z\|_1 \\
&\le f(x) + \alpha z^T\nabla f + \tfrac{1}{2}\alpha^2 M\|z\|^2 + \alpha\lambda\|x+z\|_1 + (1-\alpha)\lambda\|x\|_1 \\
&\le F(x) - \tfrac{1}{2}\alpha\gamma_{\min}\|z\|^2 + \tfrac{1}{2}\alpha^2 M\|z\|^2,
\end{aligned} \tag{4.4}
\]

where the first and second inequalities are obtained by (4.3) and (4.2), respectively, and the third inequality follows from the fact that $z$ achieves a better objective in (2.1) than $0$ does, together with $H \succeq \gamma_{\min} I$. Now, because we assume that $\alpha < \gamma_{\min}/M$ is used in the relaxation, then following (4.4) we write

\[
F(x) - F(x+\alpha z) \ge \tfrac{1}{2}(\gamma_{\min} - \alpha M)\alpha\|z\|^2 = K\cdot\alpha\|z\|^2, \tag{4.5}
\]

which is always positive for any $0 < \alpha < \gamma_{\min}/M$. This proves (4.1) with $K = \tfrac{1}{2}(\gamma_{\min} - \alpha M) > 0$.

For the following results we use the notion of sub-gradients. The sub-differential of $F$ at $x$, denoted $\partial F(x)$, is the set

\[
\partial F(x) = \left\{ \nabla f(x) + \lambda t \ :\ t_i = \operatorname{sign}(x_i) \text{ if } x_i \neq 0,\ \ |t_i| \le 1 \text{ if } x_i = 0 \right\}. \tag{4.6}
\]

A vector $x^*$ is a minimizer of (1.1) if and only if $0 \in \partial F(x^*)$ [17]. We now extend Lemma 2 in [49] to any relaxation of type (2.1)-(2.2). This lemma shows that, under suitable conditions, any point $x$ is either a stationary point of $F$, or else the result of the relaxation is a substantial distance away from $x$. The proof for this lemma is similar to the proof in [49].

Lemma 4.2

(No stagnation of the relaxation.) Let $\{x^{(k)}\}$ be a series of points produced by the relaxation defined by (2.1)-(2.2) with $0 < \alpha < \gamma_{\min}/M$. Let $\{x^{(k_j)}\}$ be any infinite and converging subseries of $\{x^{(k)}\}$, and let $\bar{x}$ denote its limit. Then $\bar{x}$ is a stationary point of $F$ in (1.1).

Proof. Since the subseries $\{x^{(k_j)}\}$ converges to $\bar{x}$, then $\{F(x^{(k_j)})\}$ converges to $F(\bar{x})$. Following Lemma 4.1, the full series $\{F(x^{(k)})\}$ is monotone and hence convergent because $\{F(x^{(k_j)})\}$ is convergent. Therefore $F(x^{(k)}) - F(x^{(k+1)}) \to 0$, which implies, following (4.5), that $\alpha\|z^{(k)}\|^2 \to 0$, and since $\alpha \ge \bar\alpha > 0$, also $z^{(k)} \to 0$. By (2.1), the condition

\[
0 \in \partial\tilde{F} = \left\{ \nabla f(x^{(k_j)}) + H^{(k_j)} z^{(k_j)} + \lambda t \right\} \tag{4.7}
\]

is satisfied for $z^{(k_j)}$, where $t \in \partial\|x^{(k_j)} + z^{(k_j)}\|_1$. Then, by (2.2) we have $x^{(k_j+1)} = x^{(k_j)} + \alpha z^{(k_j)}$. Because $\alpha$ is bounded away from zero by Lemma 4.1, $z^{(k)} \to 0$ implies $z^{(k_j)} \to 0$, which, due to the upper bound on $H^{(k_j)}$, leads to $H^{(k_j)} z^{(k_j)} \to 0$. Now as in [49], by taking the limit as $j \to \infty$, and using outer semicontinuity of $\partial\|\cdot\|_1$, we have that $0 \in \partial F(\bar{x})$ defined in (4.6), hence $\bar{x}$ is a stationary point of $F$.

Next, we show that a relaxation of type (2.1)-(2.2) converges to a stationary point of (1.1), which is also a minimum because $F$ is convex. Our proof follows the convergence proofs in [13, 45].

Theorem 4.3

(Convergence of the relaxation.) Assume that the level set $\{x : F(x) \le F(x^{(0)})\}$ is compact, and the Hessian is bounded: $\|\nabla^2 f(x)\| \le M$. Let $\{x^{(k)}\}$ be a series of points produced by the relaxation defined by (2.1)-(2.2), with $0 < \alpha < \gamma_{\min}/M$, starting from an initial guess $x^{(0)}$. Then any limit point $\bar{x}$ of the sequence $\{x^{(k)}\}$ is a stationary point of $F$ in (1.1), i.e., $0 \in \partial F(\bar{x})$, and $\{F(x^{(k)})\}$ converges to $F(\bar{x})$.

Proof. By Lemma 4.1, the series $\{F(x^{(k)})\}$ is monotonically decreasing. Since the objective in (1.1) is bounded from below, the series converges to a limit. Because the level set is compact by assumption, we have that $\{x^{(k)}\}$ is bounded, and therefore there exists a sub-series $\{x^{(k_j)}\}$ converging to a limit point $\bar{x}$. By Lemma 4.2, the point $\bar{x}$ is a stationary point of $F$. Since $F$ is continuous, $x^{(k_j)} \to \bar{x}$ yields $F(x^{(k_j)}) \to F(\bar{x})$. The limit of $\{F(x^{(k)})\}$ equals that of any of its sub-series, specifically $F(\bar{x})$, and thus $\{F(x^{(k)})\}$ converges to $F(\bar{x})$.

The results above are intriguing because they show that we can use any positive definite $H^{(k)}$ in (2.1), and the resulting method converges. In particular, the analysis shows that one can use a positive definite inexact Hessian as in [44, 25], and the method still converges. In a way, this is similar to the property of preconditioners when solving linear systems. Now, one may wonder if it is possible to generate preconditioners for $\nabla^2 f$ which are “easily invertible” in the sense of minimizing (2.1), and solve (1.1) more efficiently this way.

4.2 Theoretical results for the multilevel framework

From the definitions of (1.1) and (3.3), we know that reducing $F$ on any of the coarser levels also reduces $F$ for the fine level. Therefore, if Algorithm 1 is used with a monotonically decreasing relaxation, then it is also monotonically decreasing. In addition, we have the following properties.

Lemma 4.4

(Coarse solution correspondence.) Let $\mathcal{C}$ be a subset of the variables $\{1,\dots,n\}$ such that $\operatorname{supp}(x^*) \subseteq \mathcal{C}$, where $x^*$ is a solution of (1.1). Let $\bar{x}$ be a solution of problem (3.3) restricted to $\mathcal{C}$. Then $\bar{x}$ is also a solution of (1.1).

Proof. Because $\operatorname{supp}(x^*) \subseteq \mathcal{C}$, then $x^*$ is a feasible point of (3.3). Since $\bar{x}$ is a solution of (3.3), then $F(\bar{x}) \le F(x^*)$. Therefore, $\bar{x}$ is also a solution of (1.1), because otherwise we contradict the optimality of $x^*$.

For the next two properties we assume that the coarsest problem is solved exactly in Algorithm 1. From Lemma 4.4, the following corollary immediately holds.

Corollary 4.5

If $\operatorname{supp}(x^*) \subseteq \mathcal{C}_L$ at the $k$-th cycle of Algorithm 1, then problem (1.1) is solved at that cycle.

Theorem 4.6

(No stagnation of the ML-cycle.) Assume that the conditions of Lemma 4.1 hold for the relaxation method used in Algorithm 1. Let $x$ be the solution of the coarsest-level problem at the coarsest-level step of Algorithm 1. If $x \neq x^*$, then at least one iterated shrinkage relaxation on one of the levels $l < L$ must change $x$.

Proof. Because $x$ is a minimizer of the coarsest problem (3.3), then for all $j \in \mathcal{C}_L$

\[
(\nabla f(x))_j + \lambda\operatorname{sign}(x_j) = 0 \ \ \text{if } x_j \neq 0, \qquad |(\nabla f(x))_j| \le \lambda \ \ \text{if } x_j = 0. \tag{4.8}
\]

Now, since $x \neq x^*$, $x$ is not a minimizer of the unrestricted problem (1.1), so $0 \notin \partial F(x)$. Therefore, there exists at least one variable $x_j$ with $j \notin \mathcal{C}_L$ for which (4.8) is violated. Suppose that (4.8) holds for all levels coarser than $\hat{l}$, such that $\hat{l}$ is the coarsest level in the multilevel hierarchy that includes such a variable. Because the variables of the levels $l > \hat{l}$ satisfy (4.8), $x$ is a stationary point of all the relaxations (2.3) on those levels. However, on level $\hat{l}$, (4.8) is violated, and the relaxation yields a direction fulfilling

\[
0 \in (\partial\tilde{F})_{\mathcal{C}_{\hat{l}}} = \left\{ (\nabla f(x))_{\mathcal{C}_{\hat{l}}} + H_{\mathcal{C}_{\hat{l}}} z_{\mathcal{C}_{\hat{l}}} + \lambda t_{\mathcal{C}_{\hat{l}}} \right\}, \tag{4.9}
\]

where $t_{\mathcal{C}_{\hat{l}}} \in \partial\|x_{\mathcal{C}_{\hat{l}}} + z_{\mathcal{C}_{\hat{l}}}\|_1$, and $(\nabla f(x))_{\mathcal{C}_{\hat{l}}}$ and $H_{\mathcal{C}_{\hat{l}}}$ are the gradient and the Hessian approximation of the relaxation restricted to the entries in $\mathcal{C}_{\hat{l}}$. Since (4.8) is violated on level $\hat{l}$ and the linesearch parameter satisfies $\alpha > 0$, then $z_{\mathcal{C}_{\hat{l}}} \neq 0$, and hence, following Lemma 4.1, $F(x + \alpha z) < F(x)$. Now, if the support of $x$ did not change following this relaxation, i.e., $\operatorname{supp}(x + \alpha z) = \operatorname{supp}(x)$, this would contradict the optimality of $x$ with respect to the levels $l > \hat{l}$.

Our last theorem proves that Algorithm 1 converges when used with a suitable relaxation method. The algorithm falls into the block coordinate descent framework in [46], where the blocks are the sets $\mathcal{C}_l$ on all levels. In particular, since $\mathcal{C}_0 = \{1,\dots,n\}$, then all multilevel cycles end with a relaxation that includes all the variables, and hence the Gauss-Seidel rule in [46] is satisfied at most every $L+1$ relaxations.

Theorem 4.7

(Convergence of Algorithm 1.) Assume that the conditions of Theorem 4.3 hold for $f$ and $H^{(k)}$. Let $\{x^{(k)}\}$ be a series of points produced by Algorithm 1 with $0 < \alpha < \gamma_{\min}/M$, starting from an initial guess $x^{(0)}$. Then any limit point $\bar{x}$ of the sequence is a stationary point of $F$ in (1.1), and $\{F(x^{(k)})\}$ converges to $F(\bar{x})$.

Proof. Let us now define $\{y^{(i)}\}$ to be the series of points generated by all the relaxations that are performed within the cycles for producing $\{x^{(k)}\}$. Lemma 4.1 and the relation between the problems (1.1) and (3.3) imply that $\{F(y^{(i)})\}$ is monotonically non-increasing. Similarly to the proof of Theorem 4.3, since $F$ in (1.1) is bounded from below, the series $\{F(y^{(i)})\}$ converges to a limit, and therefore there exists a sub-series $\{y^{(i_j)}\}$ converging to a limit point $\bar{x}$. Because we apply the same type of relaxation on all levels, then following (4.5), $F(y^{(i)}) - F(y^{(i+1)}) \to 0$ implies that the relaxation directions vanish, and consecutive points approach each other. By the definition of Algorithm 1, a full relaxation is applied on one of the points of every cycle (i.e., a relaxation that includes all variables in $\mathcal{C}_0$). This means that at least one of the subseries includes an infinite subseries of fine-level points converging to $\bar{x}$, and each direction obtained by the corresponding relaxation satisfies (4.7). By the same arguments that follow Equation (4.7) in the proof of Lemma 4.2, $\bar{x}$ is a stationary point of $F$. Since $F$ is continuous, $y^{(i_j)} \to \bar{x}$ yields $F(y^{(i_j)}) \to F(\bar{x})$. The limit of $\{F(y^{(i)})\}$ equals that of any of its sub-series, specifically $F(\bar{x})$. Thus $\{F(x^{(k)})\}$, which is a subseries of $\{F(y^{(i)})\}$, converges to $F(\bar{x})$.

Organization and notation. Until now we described a general framework for $\ell_1$-regularized convex optimization. The remainder of the paper is devoted to two specific problems: one is the sparse inverse covariance estimation, on which we focus extensively, and the other is $\ell_1$-regularized logistic regression. For the first problem it is natural to consider the unknowns as a matrix (the estimated inverse of the covariance matrix), and hence we revert to the familiar matrix notation $A$, instead of $x$. This is the only difference in notation between the first and second parts (e.g., we minimize over a vector in sections 1-4, over a matrix $A$ in sections 5-8, and again over a vector in sections 9-10. In all cases, $F$ is the $\ell_1$-regularized non-smooth objective).

5 The sparse inverse covariance estimation problem

Estimating the parameters of a multivariate Gaussian (Normal) distribution is a common problem in many applications in machine learning, computational biology, and other fields [2]. Given a set of samples $\{y_i\}_{i=1}^m$, where $y_i \in \mathbb{R}^n$, the objective is to estimate the mean $\mu$, and either the covariance matrix $\Sigma$ or its inverse $\Sigma^{-1}$, which is also called the precision matrix. In particular, the inverse of the covariance matrix, which represents the underlying graph of a Gaussian Markov random field (GMRF), is useful in many applications [38].

Both the mean and the covariance are often estimated using the maximum likelihood estimator (MLE), given the samples $\{y_i\}$. The MLE aims to maximize the probability of sampling $\{y_i\}$ given the parameters. In the Gaussian case, this leads to the maximization of the density function of the Normal distribution

\[
\hat{\mu}, \hat{\Sigma}_{\mathrm{MLE}} = \operatorname*{argmax}_{\Sigma,\mu} \prod_{i=1}^m P(y_i\,|\,\Sigma,\mu) = \operatorname*{argmax}_{\Sigma,\mu} \prod_{i=1}^m \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\left( -\tfrac{1}{2}(y_i-\mu)^T \Sigma^{-1} (y_i-\mu) \right). \tag{5.1}
\]

This yields $\hat{\mu} = \frac{1}{m}\sum_{i=1}^m y_i$ as the estimation for the mean, and (Equation (5.2) is the standard MLE estimator, derived from (5.1). However, sometimes the unbiased estimation is preferred, where $m-1$ replaces $m$ in the denominator.)

\[
S \stackrel{\triangle}{=} \hat{\Sigma}_{\mathrm{MLE}} = \frac{1}{m}\sum_{i=1}^m (y_i - \hat{\mu})(y_i - \hat{\mu})^T, \tag{5.2}
\]
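Computing (5.2) is a one-liner in practice; a small sketch (with rows of $Y$ holding the samples $y_i$, and the biased $1/m$ normalization of the text; the function name is ours):

```python
import numpy as np

def empirical_covariance(Y):
    """MLE mean and empirical covariance (5.2) from samples Y (m x n)."""
    mu = Y.mean(axis=0)            # sample mean, eq. preceding (5.2)
    Z = Y - mu                     # centered samples
    S = (Z.T @ Z) / Y.shape[0]     # divide by m (use m-1 for unbiased)
    return mu, S
```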

which is also called the empirical covariance matrix. More specifically, by applying $-\log$ to the MLE objective in (5.1) and minimizing it over the inverse covariance matrix $A = \Sigma^{-1}$, we get that $\Sigma^{-1}$ is estimated by solving the optimization problem

\[
\min_{A\succ 0} f(A) \stackrel{\triangle}{=} \min_{A\succ 0} -\log(\det A) + \operatorname{tr}(SA). \tag{5.3}
\]

However, if the number of samples is smaller than the problem dimension, i.e., $m < n$, then $S$ in (5.2) is rank deficient and not invertible, whereas the true $\Sigma$ is assumed to be full-rank and positive definite. Nevertheless, in this case one can reasonably estimate $\Sigma^{-1}$ by adding further assumptions. It can be observed in the probability density function in (5.1) that if $(\Sigma^{-1})_{ij} = 0$, then the random variables in the $i$-th and $j$-th entries of a vector $y$ are conditionally independent, given that the rest of the variables are known [9]. Therefore, one may look at $\Sigma^{-1}$ as a direct dependency matrix, where each of its off-diagonal non-zeros indicates a direct dependency between two variables. For this reason, many applications adopt the notion of estimating a sparse inverse of the covariance, $\Sigma^{-1}$. (Note that in most cases $\Sigma$ itself remains dense.) For this purpose, we follow [2, 3, 7], and minimize (5.3) with a sparsity-promoting $\ell_1$ prior:

\[
\min_{A \succ 0} F(A) \,\stackrel{\triangle}{=}\, \min_{A \succ 0} \; f(A) + \lambda \|A\|_1. \tag{5.4}
\]

Here, $f(A)$ is the MLE objective defined in (5.3), $\|A\|_1 = \sum_{i,j} |A_{ij}|$, and $\lambda > 0$ is a regularization parameter. The sparsity assumption is justified when most variables are directly statistically dependent on only a small number of variables, and thus conditionally independent of the rest. Problem (5.4) is also called Covariance Selection [9], and it has a unique solution [2, 7]. It is an instance of (LABEL:eq:FX), so it is non-smooth and convex, but unlike (LABEL:eq:FX) it is also constrained to the positive definite domain.
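As a concrete reference point, the objective of (5.4) can be evaluated directly with NumPy. The following is a minimal sketch (the function name and the random test data are ours, not from the paper); a Cholesky factorization both verifies $A \succ 0$ and yields $\log\det A = 2\sum_i \log L_{ii}$:

```python
import numpy as np

def mle_l1_objective(A, S, lam):
    """Evaluate F(A) = -log det(A) + tr(S A) + lam * ||A||_1 for A > 0."""
    L = np.linalg.cholesky(A)                 # raises LinAlgError if A is not SPD
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -logdet + np.trace(S @ A) + lam * np.abs(A).sum()

# Empirical covariance S from samples y_1..y_m (columns of Y), as in (5.2):
rng = np.random.default_rng(0)
n, m = 5, 200
Y = rng.standard_normal((n, m))
mu = Y.mean(axis=1, keepdims=True)
S = (Y - mu) @ (Y - mu).T / m

F = mle_l1_objective(np.eye(n), S, lam=0.1)
```

For the identity iterate, the log-determinant term vanishes and $F = \operatorname{tr}(S) + \lambda n$, which makes the sketch easy to sanity-check.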

Many methods were recently developed for solving (5.4); see [3, 2, 7, 19, 22, 24, 25, 26, 33, 35, 44] and references therein. However, as mentioned earlier, in this work we are interested in efficiently solving large-scale instances of (5.4), where $n$ is so large that $n \times n$ dense matrices cannot fit in memory (we assume that the data samples do fit in memory). This makes the solution of (5.4) particularly challenging, since the gradient of $f$ includes $A^{-1}$, which is a dense matrix, coming from the $\log\det$ term. Because of this, most of the existing methods cannot be used to solve (5.4), as they use the full gradient of $f$. The same applies for the strategies of [2, 19] that target the dense covariance matrix itself rather than its inverse, using the dual formulation of (5.4). Two exceptions are (1) BigQUIC, a proximal Newton approach in [25], which was made suitable for large-scale matrices by treating the Newton direction problem in blocks, and (2) a Block-Coordinate-Descent for Inverse Covariance Estimation (BCD-IC) method [44] that directly treats (5.4) in blocks. In the following sections we briefly describe the proximal Newton approach for (5.4), and review the BCD-IC method of [44]. Following that, we describe how to accelerate BCD-IC by our multilevel framework, and show improvements for it in the case where problem (5.4) is solved for a given support, similarly to problem (LABEL:eq:FX_coarse) which is constrained to a given support.

5.1 Proximal Newton methods for sparse inverse covariance estimation

A few of the current state-of-the-art methods [24, 25, 26, 35] for (5.4) involve the “proximal Newton” approach described earlier in Section LABEL:sec:relaxations_general. To obtain the Newton descent direction, the smooth part $f$ in (5.4) is replaced by a second-order Taylor expansion, while the non-smooth $\ell_1$ term remains intact. This requires computing the gradient and Hessian of $f$, which are given by

\[
\nabla f(A) = S - A^{-1}, \qquad \nabla^2 f(A) = A^{-1} \otimes A^{-1}, \tag{5.5}
\]

where $\otimes$ is the Kronecker product. The presence of $A^{-1}$ in the gradient not only imposes memory problems at large scales, it is also expensive to compute. Therefore, the advantage of the proximal Newton approach here is the low overhead: by calculating the $A^{-1}$ in $\nabla f(A)$, we also get the information needed to apply the Hessian [25, 26, 35].

Similarly to (LABEL:eq:F_tilde_general), the Newton direction $\Delta^{(k)}$ is the solution of the LASSO problem

\[
\Delta^{(k)} = \operatorname*{arg\,min}_{\Delta \in \mathbb{R}^{n\times n}} \tilde{F}(A^{(k)}+\Delta) = \operatorname*{arg\,min}_{\Delta \in \mathbb{R}^{n\times n}} \; f(A^{(k)}) + \operatorname{tr}\left(\Delta(S-W)\right) + \tfrac{1}{2}\operatorname{tr}\left(\Delta W \Delta W\right) + \lambda\|A^{(k)}+\Delta\|_1, \tag{5.6}
\]

where $W = (A^{(k)})^{-1}$. The gradient and Hessian of $f$ in (5.5) are featured in the second and third terms of (5.6), respectively. Once the direction is computed, it is added to $A^{(k)}$ employing a linesearch procedure to sufficiently reduce the objective in (5.4) while ensuring positive definiteness. To this end, the updated iterate is $A^{(k+1)} = A^{(k)} + \alpha^* \Delta^{(k)}$, where $\alpha^*$ may be obtained using Armijo’s rule [26].
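For illustration, such a linesearch can be sketched as standard Armijo backtracking that rejects steps leaving the positive definite cone (a hypothetical helper, not the exact procedure of [26]; loss of positive definiteness is detected via a failed Cholesky factorization inside the objective):

```python
import numpy as np

def armijo_linesearch(A, Delta, objective, grad, beta=0.5, sigma=1e-4, max_tries=30):
    """Backtracking linesearch for A <- A + alpha * Delta.

    `objective` must raise np.linalg.LinAlgError (e.g., via a Cholesky
    factorization) when its argument is not positive definite, so such
    steps are rejected. `grad` is the gradient of the smooth part at A.
    """
    f0 = objective(A)
    slope = np.sum(grad * Delta)              # <grad f(A), Delta>
    alpha = 1.0
    for _ in range(max_tries):
        try:
            if objective(A + alpha * Delta) <= f0 + sigma * alpha * slope:
                return alpha
        except np.linalg.LinAlgError:         # step left the PD cone
            pass
        alpha *= beta
    return 0.0

# Toy usage with f(A) = -log det(A) + tr(A), i.e., S = I:
def f(A):
    L = np.linalg.cholesky(A)                 # raises LinAlgError if A is not PD
    return -2.0 * np.log(np.diag(L)).sum() + np.trace(A)

A = 2.0 * np.eye(3)
g = np.eye(3) - np.linalg.inv(A)              # S - A^{-1}
alpha = armijo_linesearch(A, -g, f, g)
```

In this toy setup the full negative-gradient step already satisfies the Armijo condition, so the unit step is accepted.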

5.2 Restricting the updates to free sets

In addition, [26] introduced a crucial step: restricting the Newton direction in (5.6) to a “free set” of variables, while keeping the rest as zeros. The free set of a matrix $A$ is defined as

\[
\mathcal{F}(A) = \left\{(i,j) \,:\, A_{ij} \neq 0 \;\vee\; \left|(S - A^{-1})_{ij}\right| > \lambda\right\}. \tag{5.7}
\]

If one solves (5.6) with respect only to the variables outside this free set, they all remain zero, suggesting that it is worthwhile to (temporarily) restrict (5.6) only to the variables in this set [26]. This reduces the computational complexity of most LASSO solvers: given the matrix $W$, the Hessian term in (5.6) can be calculated in $O(sn)$ operations instead of $O(n^3)$, where $s = |\mathcal{F}(A)|$. This saves significant computations in each Newton update, and at the same time does not significantly increase the number of iterations needed for convergence [44].
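A direct NumPy transcription of the free set (5.7), assuming $A^{-1}$ (or an approximation of it) is already available; the function name is ours:

```python
import numpy as np

def free_set_mask(A, S, Ainv, lam):
    """Boolean mask of the free set (5.7): an entry (i, j) is free if
    A_ij != 0 or the gradient magnitude |(S - A^{-1})_ij| exceeds lambda."""
    return (A != 0) | (np.abs(S - Ainv) > lam)

# On a diagonal iterate with a weak empirical covariance, only the
# diagonal is free:
n = 3
mask = free_set_mask(np.eye(n), np.zeros((n, n)), np.eye(n), lam=0.1)
```

Restricting a LASSO sweep to the entries of `mask` is precisely what reduces the per-update cost.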

6 Block coordinate descent for sparse inverse covariance estimation (BCD-IC)

In this section we review the iterative Block Coordinate Descent method [44] for solving large-scale instances of (5.4). In this method, we iteratively update the solution in blocks of matrix variables, where each block is defined as the free set of variables within a relatively small subset of columns of $A$. We iterate over all blocks, and in turn minimize (5.4) restricted to each block by using a quadratic approximation, while the other matrix entries remain fixed. Since we consider one sub-problem at a time, we can fully store the gradient and Hessian for each block, assuming that the blocks are chosen to be small enough.

We limit our blocks to subsets of columns because this way, the corresponding portion of the gradient (5.5) can be computed as solutions of linear systems. Because the matrix is symmetric, the corresponding rows are updated simultaneously. Figure LABEL:fig:BCD_Sweep shows an example of a BCD iteration where the subsets of columns are chosen in sequential order. In practice, these subsets can be non-contiguous and vary between the BCD iterations. We elaborate later on how to partition the columns, and on some advantages of this block-partitioning. Partitioning the matrix into small blocks enables our method to solve (5.4) in high dimensions (up to millions of variables), requiring $O(n^2/p)$ additional memory, where $p$ is the number of blocks (that is in addition to the memory needed for storing the iterated solution itself, and the data samples).

6.1 BCD-IC iteration

We now describe a BCD-IC iteration, in which we divide the matrix into blocks, and iteratively update the solution matrix block by block. Assume that the set of columns $\{1,\dots,n\}$ is divided into $p$ subsets $\{I_j\}_{j=1}^{p}$, where $I_j$ contains the indices of the columns that comprise the $j$-th block. We denote the updated matrix after treating the $j$-th block at iteration $k$ by $A^{(k,j)}$, and the next iterate is defined once we finish treating all blocks, i.e., $A^{(k+1)} = A^{(k,p)}$. However, for simplicity of notation, let us denote the updated matrix, before treating block $j$ at iteration $k$, by $\tilde{A}$.

To update block $j$, we form and minimize a quadratic approximation of problem (5.4), restricted to the rows/columns in $I_j$:

\[
\min_{\Delta_j} \tilde{F}(\tilde{A} + \Delta_j), \tag{6.1}
\]

where $\tilde{F}$ is the quadratic approximation of (5.4) around $\tilde{A}$, similarly to (5.6), and $\Delta_j$ has non-zero entries only in the rows/columns in $I_j$. In addition, we restrict the non-zeros of $\Delta_j$ to the free set defined in (5.7). That is, $\Delta_j$ in (6.1) is restricted to the free set

\[
\mathcal{F}_{I_j}(\tilde{A}) = \mathcal{F}(\tilde{A}) \cap \left\{(i,k) \,:\, i \in I_j \;\vee\; k \in I_j\right\}, \tag{6.2}
\]

while all other entries in $\Delta_j$ are fixed to zero. To calculate (6.2), we check the condition in (5.7) only in the columns $I_j$, which requires the gradient (5.5) for block $j$. For that, we calculate the columns $I_j$ of $W = \tilde{A}^{-1}$ by solving linear systems, with the canonical vectors $e_i$ as right-hand-sides, i.e., $\tilde{A} w_i = e_i$ for each $i \in I_j$. The solution of these linear systems is one of the main computational tasks of our algorithm, and can be achieved in various ways. For large dimensions, iterative methods such as Conjugate Gradients (CG) are usually preferred, possibly with preconditioning [39].
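A sketch of this step with SciPy's CG solver, assuming $\tilde{A}$ is stored as a sparse SPD matrix (no preconditioning, for brevity; the function name is ours):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def inverse_columns(A, idx):
    """Columns `idx` of W = A^{-1}: solve A w_i = e_i by Conjugate
    Gradients for each i in idx (A sparse, symmetric positive definite)."""
    n = A.shape[0]
    W = np.zeros((n, len(idx)))
    for col, i in enumerate(idx):
        e = np.zeros(n)
        e[i] = 1.0
        w, info = cg(A, e, atol=1e-12)
        assert info == 0, "CG did not converge"
        W[:, col] = w
    return W

# Diagonal example: columns 0 and 2 of the inverse of diag(2, 4, 8).
A = sp.diags([2.0, 4.0, 8.0]).tocsr()
W = inverse_columns(A, [0, 2])
```

In practice each right-hand side is independent, so these solves parallelize trivially across the columns of the block.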

6.1.1 Treating a block-subproblem by Newton’s method

To get the Newton direction for the $j$-th block, we solve the LASSO problem (6.1) by using PCD accelerated by non-linear Conjugate Gradients (PCD-CG) [54, 45]. For a complete and detailed description of this algorithm see the Appendix of [44].

Let us denote $W = \tilde{A}^{-1}$. To apply a PCD-CG iteration, we need to calculate the objective of (6.1) and its gradient efficiently. For that, we need to calculate the matrices $S$ and $W$, and the term $W\Delta_j W$, only at the entries in (6.2), where $\Delta_j$ is non-zero. We compute only the columns $I_j$ of these matrices, because the rows are obtained by symmetry. The main computational task here involves the “Hessian-vector product” $W\Delta_j W$. For that, we reuse the columns $I_j$ of $W$ calculated for obtaining (6.2), denoted now by $W_j$. Since we only need the result in the columns $I_j$, we observe that $(W\Delta_j W)_{I_j} = W\Delta_j W_j$, and the product $\Delta_j W_j$ can be computed efficiently because $\Delta_j$ is sparse.
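The identity $(W\Delta_j W)_{I_j} = W\Delta_j W_j$ can be checked on a small dense example as follows (a sketch; here $W$ is formed explicitly only for the test, whereas at large scale only the needed columns would be available):

```python
import numpy as np
import scipy.sparse as sp

def hessian_product_block(W, Wj, Delta):
    """Columns I_j of W * Delta * W, using the cached columns
    Wj = W[:, I_j]; Delta @ Wj is cheap because Delta is sparse."""
    return W @ (Delta @ Wj)

# Small dense check of the identity (W Delta W)[:, I_j] = W Delta W_j:
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4.0 * np.eye(4)                 # SPD iterate
W = np.linalg.inv(A)
Delta = sp.csr_matrix(([0.3, 0.3], ([0, 2], [2, 0])), shape=(4, 4))
Ij = [1, 3]
block = hessian_product_block(W, W[:, Ij], Delta)
```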

In order to compute $W\Delta_j W_j$ for the entries in (6.2), we follow the idea of [25] and use the rows (or columns) of $W$ that are represented in (6.2). Besides the columns $I_j$ of $W$, we also need the “neighborhood” of $I_j$, defined as

\[
N_j = \left\{\, i \notin I_j \;:\; \exists k \in I_j, \; (i,k) \in \mathcal{F}_{I_j}(A) \,\right\}. \tag{6.3}
\]

The size of this set will determine the number of additional columns of $W$ that we need, and therefore we wish it to be as small as possible. To achieve that, we follow [25] and define the blocks using clustering methods, which aim to partition the columns/rows into disjoint subsets, such that there are as few non-zero entries as possible outside the diagonal blocks of the matrix that correspond to each subset. In our notation, we aim for the size of $N_j$ to be as small as possible for every block $j$, which is chosen to be relatively small. We use METIS [30], but other methods may be used instead. Note that only a subset of the entries of these additional columns of $W$ is necessary for computing the relevant entries of $W\Delta_j W_j$. However, there might be situations where the matrix has a few dense columns, resulting in some sets $N_j$ of size $O(n)$. Computing the corresponding columns of $W$ for those sets is not possible because of memory limitations. This case is treated separately; see [44] for details.
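Under our reading of (6.3), the neighborhood can be extracted from the free set as follows (a sketch; the free set is represented here as a plain set of index pairs, and by symmetry we gather indices coupled to $I_j$ on either side of a pair):

```python
def neighborhood(free_set, I_j):
    """Indices outside I_j whose rows/columns are coupled to block j
    through the free set: these determine the extra columns of W needed."""
    I = set(I_j)
    N = {i for (i, k) in free_set if i not in I and k in I}
    N |= {k for (i, k) in free_set if k not in I and i in I}
    return sorted(N)

# Entries (0,1) and (1,2) couple block I_j = [1] to rows/columns 0 and 2:
Nj = neighborhood({(0, 1), (1, 2), (3, 3)}, [1])
```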

6.1.2 Updating the solution with line-search

Denote the solution of the Newton direction problem (6.1) by $\Delta_j^{(k)}$. Now we wish to update the matrix $\tilde{A} \leftarrow \tilde{A} + \alpha^* \Delta_j^{(k)}$, where $\alpha^*$ is obtained by a linesearch procedure, which requires evaluating the objective of (5.4) for several values of $\alpha$.

First, for any sparse matrix $A$, the cost of computing the $\operatorname{tr}(SA)$ and $\lambda\|A\|_1$ terms in (5.4) is proportional to the number of non-zero entries in $A$ (and the number of sample vectors). However, calculating the determinant of a general sparse matrix for evaluating the $\log\det$ term of (5.4) may be costly. This may be done by using a sparse Cholesky factorization, but here we assume that $n$ is too large for that. In our case, however, since $\Delta_j^{(k)}$ has a special block structure, we can reduce the $\log\det$ term to a log-determinant of a small dense matrix, and compute it efficiently.
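For example, the $\operatorname{tr}(SA)$ term can be evaluated directly from the $m$ centered samples without ever forming the dense $S$ (a sketch; with sparse $A$ the cost is proportional to the number of non-zeros of $A$ times $m$):

```python
import numpy as np

def trace_SA(Y, A):
    """tr(S A) for S = (1/m) sum_i (y_i - mu)(y_i - mu)^T, computed from
    the samples: tr(S A) = (1/m) sum_i (y_i - mu)^T A (y_i - mu)."""
    Z = Y - Y.mean(axis=1, keepdims=True)     # center the samples (columns)
    return np.sum(Z * (A @ Z)) / Y.shape[1]

rng = np.random.default_rng(2)
Y = rng.standard_normal((4, 100))
A = np.diag([1.0, 2.0, 3.0, 4.0])
t = trace_SA(Y, A)
```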

Let us introduce a partitioning of any matrix into blocks, according to a subset of indices $I_j$. Assume without loss of generality that the matrix has been permuted such that the columns/rows with indices in $I_j$ appear first, and let

\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \tag{6.4}
\]

be a partitioning of $A$. The sub-matrix $A_{11}$ corresponds to the elements in rows and columns in $I_j$. According to the Schur complement [39], for any invertible matrix $A$ and block-partitioning as above, the following holds:

\[
\log\det(A) = \log\det(A_{22}) + \log\det\!\left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right). \tag{6.5}
\]

Furthermore, for any symmetric matrix $A$ the following applies:

\[
A \succ 0 \;\Longleftrightarrow\; A_{22} \succ 0 \;\text{ and }\; A_{11} - A_{12}A_{22}^{-1}A_{21} \succ 0. \tag{6.6}
\]
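These two identities are easy to verify numerically; the sketch below checks (6.5) and the positive definiteness characterization (6.6) on a random SPD matrix, with the first two indices playing the role of $I_j$:

```python
import numpy as np

# Numerical check of the Schur-complement identities on a random SPD matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5.0 * np.eye(5)                 # SPD by construction

A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]
schur = A11 - A12 @ np.linalg.solve(A22, A21)

lhs = np.linalg.slogdet(A)[1]
rhs = np.linalg.slogdet(A22)[1] + np.linalg.slogdet(schur)[1]
```

Both `lhs` and `rhs` agree to machine precision, and the eigenvalues of `schur` are all positive, as (6.6) predicts for an SPD matrix.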

Using the above notation and partitioning for $\tilde{A}$ and $\Delta_j^{(k)}$, we write, using (6.5):

\[
\log\det\!\left(\tilde{A} + \alpha\Delta_j^{(k)}\right) = \log\det(\tilde{A}_{22}) + \log\det\!\left(B_0 + \alpha B_1 + \alpha^2 B_2\right), \tag{6.7}
\]

where $B_0 = \tilde{A}_{11} - \tilde{A}_{12}\tilde{A}_{22}^{-1}\tilde{A}_{21}$, $B_1 = \Delta_{11} - \Delta_{12}\tilde{A}_{22}^{-1}\tilde{A}_{21} - \tilde{A}_{12}\tilde{A}_{22}^{-1}\Delta_{21}$, and $B_2 = -\Delta_{12}\tilde{A}_{22}^{-1}\Delta_{21}$, using that $\Delta_{22} = 0$ by the block structure of the update. (Note that here we replaced $\Delta_j^{(k)}$ by $\Delta$ to simplify notation.) If the set $I_j$ is relatively small, then so are the matrices in (6.7), and given these matrices we can easily compute the objective for any value of $\alpha$. Furthermore, following (6.6), the constra