Sparse Inverse Covariance Selection via Alternating Linearization Methods

# Sparse Inverse Covariance Selection via Alternating Linearization Methods

Katya Scheinberg
Department of ISE
Lehigh University
katyas@lehigh.edu
&Shiqian Ma,  Donald Goldfarb
Department of IEOR
Columbia University
{sm2756,goldfarb}@columbia.edu
###### Abstract

Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an -optimal solution in iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms.

Sparse Inverse Covariance Selection via
Alternating Linearization Methods

Katya Scheinberg Department of ISE Lehigh University katyas@lehigh.edu Shiqian Ma,  Donald Goldfarb Department of IEOR Columbia University {sm2756,goldfarb}@columbia.edu

## 1 Introduction

In multivariate data analysis, graphical models such as Gaussian Markov Random Fields provide a way to discover meaningful interactions among variables. Let be an -dimensional random vector following an -variate Gaussian distribution , and let be a Markov network representing the conditional independence structure of . Specifically, the set of vertices corresponds to the set of variables in , and the edge set contains an edge if and only if is conditionally dependent on given all remaining variables; i.e., the lack of an edge between and denotes the conditional independence of and , which corresponds to a zero entry in the inverse covariance matrix (). Thus learning the structure of this graphical model is equivalent to the problem of learning the zero-pattern of . To estimate this sparse inverse covariance matrix, one can solve the following sparse inverse covariance selection (SICS) problem: where denotes the set of positive definite matrices, is the number of nonzeros in , is the sample covariance matrix, is the sample mean and is the -th random sample of . This problem is NP-hard in general due to the combinatorial nature of the cardinality term (). To get a numerically tractable problem, one can replace the cardinality term by , the envelope of over the set (see ). This results in the convex optimization problem (see e.g., [4, 5, 6, 7]):

 minX∈Sn++−logdet(X)+⟨^Σ,X⟩+ρ∥X∥1. (1)

Note that (1) can be rewritten as where is the largest absolute value of the entries of . By exchanging the order of max and min, we obtain the dual problem which is equivalent to

 maxW∈Sn++{logdetW+n:∥W−^Σ∥∞≤ρ}. (2)

Both the primal and dual problems have strictly convex objectives; hence, their optimal solutions are unique. Given a dual solution , is primal feasible resulting in the duality gap

 gap:=⟨^Σ,W−1⟩+ρ∥W−1∥1−n. (3)

The primal and the dual SICS problems (1) and (2) are semidefinite programming problems and can be solved via interior point methods (IPMs) in polynomial time. However, the per-iteration computational cost and memory requirements of an IPM are prohibitively high for the SICS problem. Although an approximate IPM has recently been proposed for the SICS problem , most of the methods developed for it are first-order methods. Banerjee et al.  proposed a block coordinate descent (BCD) method to solve the dual problem (2). Their method updates one row and one column of in each iteration by solving a convex quadratic programming problem by an IPM. The method of Friedman et al.  is based on the same BCD approach as in , but it solves each subproblem as a LASSO problem by yet another coordinate descent (CD) method . Sun et al.  proposed solving the primal problem (1) by using a BCD method. They formulate the subproblem as a min-max problem and solve it using a prox method proposed by Nemirovski . The SINCO method proposed by Scheinberg and Rish  is a greedy CD method applied to the primal problem. All of these BCD and CD approaches lack iteration complexity bounds. They also have been shown to be inferior in practice to gradient based approaches. A projected gradient method for solving the dual problem (2) that is considered to be state-of-the-art for SICS was proposed by Duchi et al. . However, there are no iteration complexity results for it either. Variants of Nesterov’s method [14, 15] have been applied to solve the SICS problem. d’Aspremont et al.  applied Nesterov’s optimal first-order method to solve the primal problem (1) after smoothing the nonsmooth term, obtaining an iteration complexity bound of for an -optimal solution, but the implementation in  was very slow and did not produce good results. Lu  solved the dual problem (2), which is a smooth problem, by Nesterov’s algorithm, and improved the iteration complexity to . However, since the practical performance of this algorithm was not attractive, Lu gave a variant (VSM) of it that exhibited better performance. The iteration complexity of VSM is unknown. Yuan  proposed an alternating direction method based on an augmented Lagrangian framework (see the ADAL method (8) below). This method also lacks complexity results. The proximal point algorithm proposed by Wang et al. in  requires a reformulation of the problem that increases the size of the problem making it impractical for solving large-scale problems. Also, there is no iteration complexity bound for this algorithm. The IPM in  also requires such a reformulation.

Our contribution. In this paper, we propose an alternating linearization method (ALM) for solving the primal SICS problem. An advantage of solving the primal problem is that the penalty term in the objective function directly promotes sparsity in the optimal inverse covariance matrix.

Although developed independently, our method is closely related to Yuan’s method . Both methods exploit the special form of the primal problem (1) by alternatingly minimizing one of the terms of the objective function plus an approximation to the other term. The main difference between the two methods is in the construction of these approximations. As we will show, our method has a theoretically justified interpretation and is based on an algorithmic framework with complexity bounds, while no complexity bound is available for Yuan’s method. Also our method has an intuitive interpretation from a learning perspective. Extensive numerical test results on both synthetic data and real problems have shown that our ALM algorithm significantly outperforms other existing algorithms, such as the PSM algorithm proposed by Duchi et al.  and the VSM algorithm proposed by Lu . Note that it is shown in  and  that PSM and VSM outperform the BCD method in  and in .

Organization of the paper. In Section 2 we briefly review alternating linearization methods for minimizing the sum of two convex functions and establish convergence and iteration complexity results. We show how to use ALM to solve SICS problems and give intuition from a learning perspective in Section 3. Finally, we present some numerical results on both synthetic and real data in Section 4 and compare ALM with PSM algorithm  and VSM algorithm .

## 2 Alternating Linearization Methods

We consider here the alternating linearization method (ALM) for solving the following problem:

 minF(x)≡f(x)+g(x), (4)

where and are both convex functions. An effective way to solve (4) is to “split” and by introducing a new variable, i.e., to rewrite (4) as

 minx,y{f(x)+g(y):x−y=0}, (5)

and apply an alternating direction augmented Lagrangian method to it. Given a penalty parameter , at the -th iteration, the augmented Lagrangian method minimizes the augmented Lagrangian function

 L(x,y;λ):=f(x)+g(y)−⟨λ,x−y⟩+12μ∥x−y∥22,

with respect to and , i.e., it solves the subproblem

 (xk,yk):=argminx,yL(x,y;λk), (6)

and updates the Lagrange multiplier via:

 λk+1:=λk−(xk−yk)/μ. (7)

Since minimizing with respect to and jointly is usually difficult, while doing so with respect to and alternatingly can often be done efficiently, the following alternating direction version of the augmented Lagrangian method (ADAL) is often advocated (see, e.g., [20, 21]):

 ⎧⎪ ⎪⎨⎪ ⎪⎩xk+1:=argminxL(x,yk;λk)yk+1:=argminyL(xk+1,y;λk)λk+1:=λk−(xk+1−yk+1)/μ. (8)

If we also update after we solve the subproblem with respect to , we get the following symmetric version of the ADAL method.

 ⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩xk+1:=argminxL(x,yk;λky)λk+1x:=λky−(xk+1−yk)/μyk+1:=argminyL(xk+1,y;λk+1x)λk+1y:=λk+1x−(xk+1−yk+1)/μ. (9)

Algorithm (9) has certain theoretical advantages when and are smooth. In this case, from the first-order optimality conditions for the two subproblems in (9), we have that:

 λk+1x=∇f(xk+1)andλk+1y=−∇g(yk+1). (10)

Substituting these relations into (9), we obtain the following equivalent algorithm for solving (4), which we refer to as the alternating linearization minimization (ALM) algorithm.

Algorithm 1 can be viewed in the following way: at each iteration we construct a quadratic approximation of the function at the current iterate and minimize the sum of this approximation and . The approximation is based on linearizing (hence the name ALM) and adding a “prox” term . When is small enough (, where is the Lipschitz constant for ) this quadratic function, is an upper approximation to , which means that the reduction in the value of achieved by minimizing in Step 1 is not smaller than the reduction achieved in the value of itself. Similarly, in Step 2 we build an upper approximation to at , and minimize the sum of it and .

Let us now assume that is in the class with Lipschitz constant , while is simply convex. Then from the first-order optimality conditions for the second minimization in (9), we have , the subdifferential of at . Hence, replacing in the definition of by in (9), we obtain the following modified version of (9).

Algorithm 2 is identical to the symmetric ADAL algorithm (9) as long as at each iteration (and to Algorithm 1 if is in and ). If this condition fails, then the algorithm simply sets . Algorithm 2 has the following convergence property and iteration complexity bound. For a proof see the Appendix.

###### Theorem 2.1.

Assume is Lipschitz continuous with constant . For where , Algorithm 2 satisfies

 F(yk)−F(x∗)≤∥x0−x∗∥22μ(k+kn),∀k, (11)

where is an optimal solution of (4) and is the number of iterations until the for which . Thus Algorithm 2 produces a sequence which converges to the optimal solution in function value, and the number of iterations needed is for an -optimal solution.

If is also a smooth function in the class with Lipschitz constant , then Theorem 2.1 also applies to Algorithm 1 since in this case (i.e., no “skipping” occurs). Note that the iteration complexity bound in Theorem 2.1 can be improved. Nesterov [22, 15] proved that one can obtain an optimal iteration complexity bound of , using only first-order information. His acceleration technique is based on using a linear combination of previous iterates to obtain a point where the approximation is built. This technique has been exploited and extended by Tseng , Beck and Teboulle , Goldfarb et al.  and many others. A similar technique can be adopted to derive a fast version of Algorithm 2 that has an improved complexity bound of , while keeping the computational effort in each iteration almost unchanged. However, we do not present this method here, since when applied to the SICS problem, it did not work as well as Algorithm 2.

## 3 ALM for SICS

The SICS problem

 minX∈Sn++F(X)≡f(X)+g(X), (12)

where and , is of the same form as (4). However, in this case neither nor have Lipschitz continuous gradients. Moreover, is only defined for positive definite matrices while is defined everywhere. These properties of the objective function make the SICS problem especially challenging for optimization methods. Nevertheless, we can still apply (9) to solve the problem directly. Moreover, we can apply Algorithm 2 and obtain the complexity bound in Theorem 2.1 as follows.

The term in implicitly requires that and the gradient of , which is given by , is not Lipschitz continuous in . Fortunately, as proved in Proposition 3.1 in , the optimal solution of (12) , where Therefore, if we define , the SICS problem (12) can be formulated as:

 minX,Y{f(X)+g(Y):X−Y=0,X∈C,Y∈C}. (13)

We can include constraints in Step 1 and in Step 3 of Algorithm 2. Theorem 2.1 can then be applied as discussed in . However, a difficulty now arises when performing the minimization in . Without the constraint , only a matrix shrinkage operation is needed, but with this additional constraint the problem becomes harder to solve. Minimization in with or without the constraint is accomplished by performing an SVD. Hence the constraint can be easily imposed.

Instead of imposing constraint we can obtain feasible solutions by a line search on . We know that the constraint is not tight at the solution. Hence if we start the algorithm with and restrict the step size to be sufficiently small then the iterates of the method will remain in .

Note however, that the bound on the Lipschitz constant of the gradient of is and hence can be very large. It is not practical to restrict in the algorithm to be smaller than , since determines the step size at each iteration. Hence, for a practical approach we can only claim that the theoretical convergence rate bound holds in only a small neighborhood of the optimal solution. We now present a practical version of our algorithm applied to the SICS problem.

We now show how to solve the two optimization problems in Algorithm 3. The first-order optimality conditions for Step 1 in Algorithm 3, ignoring the constraint are:

 ∇f(X)−Λk+(X−Yk)/μk+1=0. (14)

Consider - the spectral decomposition of and let

 γi=(di+√d2i+4μk+1)/2,i=1,…,n. (15)

Since , it is easy to verify that satisfies (14). When the constraint is imposed, the optimal solution changes to with We observe that solving (14) requires approximately the same effort () as is required to compute . Moreover, from the solution to (14), is obtained with only a negligible amount of additional effort, since .

The first-order optimality conditions for Step 2 in Algorithm 3 are:

 0∈∇f(Xk+1)+(Y−Xk+1)/μk+1+∂g(Y). (16)

Since , it is well known that the solution to (16) is given by

 Yk+1=shrink(Xk+1−μk+1(^Σ−(Xk+1)−1),μk+1ρ),

where the “shrinkage operator” updates each element of the matrix by the formula

The complexity of Step 1, which requires a spectral decomposition, dominates the complexity of Step 2 which requires a simple shrinkage. There is no closed-form solution for the subproblem corresponding to when the constraint is imposed. Hence, we neither impose this constraint explicitly nor do so by a line search on , since in practice this degrades the performance of the algorithm substantially. Thus, the resulting iterates may not be positive definite, while the iterates remain so. Eventually due to the convergence of and , the iterates become positive definite and the constraint is satisfied.

Let us now remark on the learning based intuition behind Algorithm 3. We recall that . The two steps of the algorithm can be written as

 Xk+1:=argminX∈C{f(X)+12μk+1∥X−(Yk+μk+1Λk)∥2F} (17)

and

 Yk+1:=argminY{g(Y)+12μk+1∥Y−(Xk+1−μk+1(^Σ−(Xk+1)−1))∥2F}. (18)

The SICS problem is trying to optimize two conflicting objectives: on the one hand it tries to find a covariance matrix that best fits the observed data, i.e., is as close to as possible, and on the other hand it tries to obtain a sparse matrix . The proposed algorithm address these two objectives in an alternating manner. Given an initial “guess” of the sparse matrix we update this guess by a subgradient descent step of length : . Recall that . Then problem (17) seeks a solution that optimizes the first objective (best fit of the data) while adding a regularization term which imposes a Gaussian prior on whose mean is the current guess for the sparse matrix: . The solution to (17) gives us a guess for the inverse covariance . We again update it by taking a gradient descent step: . Then problem (18) seeks a sparse solution while also imposing a Gaussian prior on whose mean is the guess for the inverse covariance matrix . Hence the sequence of ’s is a sequence of positive definite inverse covariance matrices that converge to a sparse matrix, while the sequence of ’s is a sequence of sparse matrices that converges to a positive definite inverse covariance matrix.

An important question is how to pick . Theory tells us that if we pick a small enough value, then we can obtain the complexity bounds. However, in practice this value is too small. We discuss the simple strategy that we use in the next section.

## 4 Numerical Experiments

In this section, we present numerical results on both synthetic and real data to demonstrate the efficiency of our SICS ALM algorithm. Our codes for ALM were written in MATLAB. All numerical experiments were run in MATLAB 7.3.0 on a Dell Precision 670 workstation with an Intel Xeon(TM) 3.4GHZ CPU and 6GB of RAM.

Since , ; hence is a feasible solution to the dual problem (2) as long as it is positive definite. Thus the duality gap at the -th iteration is given by:

 Dgap:=−logdet(Xk)+⟨^Σ,Xk⟩+ρ∥Xk∥1−logdet(^Σ−Λk)−n. (19)

We define the relative duality gap as: where and are respectively the objective function values of the primal problem (12) at point , and the dual problem (2) at . Defining , we measure the relative changes of objective function value and the iterates and as follows:

 Frel:=|F(Xk)−F(Xk−1)|dk(|F(X)|), Xrel:=∥Xk−Xk−1∥Fdk(∥X∥F), Yrel:=∥Yk−Yk−1∥Fd(∥Y∥F).

We terminate ALM when either

 (i)Dgap≤ϵgap or (ii)max{Frel,Xrel,Yrel}≤ϵrel. (20)

Note that in (19), computing is easy since the spectral decomposition of is already available (see (14) and (15)), but computing requires another expensive spectral decomposition. Thus, in practice, we only check (20)(i) every iterations. We check (20)(ii) at every iteration since this is inexpensive.

A continuation strategy for updating is also crucial to ALM. In our experiments, we adopted the following update rule. After every iterations, we set ; i.e., we simply reduce by a constant factor every iterations until a desired lower bound on is achieved.

We compare ALM (i.e., Algorithm 3 with the above stopping criteria and updates), with the projected subgradient method (PSM) proposed by Duchi et al. in  and implemented by Mark Schmidt 111The MATLAB can be downloaded from http://www.cs.ubc.ca/schmidtm/Software/PQN.html and the smoothing method (VSM) 222The MATLAB code can be downloaded from http://www.math.sfu.ca/zhaosong proposed by Lu in , which are considered to be the state-of-the-art algorithms for solving SICS problems. The per-iteration complexity of all three algorithms is roughly the same; hence a comparison of the number of iterations is meaningful. The parameters used in PSM and VSM are set at their default values. We used the following parameter values in ALM: where is the initial which is set according to ; specifically, in our experiments, if , if , and if .

### 4.1 Experiments on synthetic data

We randomly created test problems using a procedure proposed by Scheinberg and Rish in . Similar procedures were used by Wang et al. in  and Li and Toh in . For a given dimension , we first created a sparse matrix with nonzero entries equal to -1 or 1 with equal probability. Then we computed as the true covariance matrix. Hence, was sparse. We then drew iid vectors, , from the Gaussian distribution by using the function in MATLAB, and computed a sample covariance matrix We compared ALM with PSM  and VSM  on these randomly created data with different . The PSM code was terminated using its default stopping criteria, which included (20)(i) with . VSM was also terminated when . Since PSM and VSM solve the dual problem (2), the duality gap which is given by (3) is available without any additional spectral decompositions. The results are shown in Table 1. All CPU times reported are in seconds.

From Table 1 we see that on these randomly created SICS problems, ALM outperforms PSM and VSM in both accuracy and CPU time with the performance gap increasing as increases. For example, for and , ALM achieves in about 1 hour and 15 minutes, while PSM and VSM need about 3 hours and 25 minutes and 10 hours and 23 minutes, respectively, to achieve similar accuracy.

### 4.2 Experiments on real data

We tested ALM on real data from gene expression networks using the five data sets from  provided to us by Kim-Chuan Toh: (1) Lymph node status; (2) Estrogen receptor; (3) Arabidopsis thaliana; (4) Leukemia; (5) Hereditary breast cancer. See  and references therein for the descriptions of these data sets. Table 2 presents our test results. As suggested in , we set . From Table 2 we see that ALM is much faster and provided more accurate solutions than PSM and VSM.

### 4.3 Solution Sparsity

In this section, we compare the sparsity patterns of the solutions produced by ALM, PSM and VSM. For ALM, the sparsity of the solution is given by the sparsity of . Since PSM and VSM solve the dual problem, the primal solution , obtained by inverting the dual solution , is never sparse due to floating point errors. Thus it is not fair to measure the sparsity of or a truncated version of . Instead, we measure the sparsity of solutions produced by PSM and VSM by appealing to complementary slackness. Specifically, the -th element of the inverse covariance matrix is deemed to be nonzero if and only if . We give results for a random problem () and the first real data set in Table 3. For each value of , the first three rows show the number of nonzeros in the solution and the last three rows show the number of entries that are nonzero in the solution produced by one of the methods but are zero in the solution produced by the other method. The sparsity of the ground truth inverse covariance matrix of the synthetic data is 6.76%.

From Table 3 we can see that when is relatively large (), all three algorithms produce solutions with exactly the same sparsity patterns. Only when is very small, are there slight differences. We note that the ROC curves depicting the trade-off between the number of true positive elements recovered versus the number of false positive elements as a function of the regularization parameter are also almost identical for the three methods.

## Acknowledgements

We would like to thank Professor Kim-Chuan Toh for providing the data set used in Section 4.2. The research reported here was supported in part by NSF Grants DMS 06-06712 and DMS 10-16571, ONR Grant N00014-08-1-1118 and DOE Grant DE-FG02-08ER25856.

## References

•  S. Lauritzen. Graphical Models. Oxford University Press, 1996.
•  B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.
•  J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer-Verlag, New York, 1993.
•  M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
•  J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.
•  M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using -regularized logistic regression. NIPS, 19:1465–1472, 2007.
•  O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian for binary data. Journal of Machine Learning Research, 9:485–516, 2008.
•  L. Li and K.-C. Toh. An inexact interior point method for -regularized sparse covariance selection. preprint, 2010.
•  R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58(1):267–288, 1996.
•  L. Sun, R. Patel, J. Liu, K. Chen, T. Wu, J. Li, E. Reiman, and J. Ye. Mining brain region connectivity for alzheimer’s disease study via sparse inverse covariance estimation. KDD’09, 2009.
•  A. Nemirovski. Prox-method with rate of convergence for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2005.
•  K. Scheinberg and I. Rish. Sinco - a greedy coordinate ascent method for sparse inverse covariance selection problem. 2009. Preprint available at http://www.optimization-online.org/DB_HTML/2009/07/2359.html.
•  J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussian. Conference on Uncertainty in Artificial Intelligence (UAI 2008), 2008.
•  Y. E. Nesterov. Smooth minimization for non-smooth functions. Math. Program. Ser. A, 103:127–152, 2005.
•  Y. E. Nesterov. Introductory lectures on convex optimization. 87:xviii+236, 2004. A basic course.
•  A. D’Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and its Applications, 30(1):56–66, 2008.
•  Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM J. Optim., 19(4):1807–1827, 2009.
•  X. Yuan. Alternating direction methods for sparse covariance selection. 2009. Preprint available at http://www.optimization-online.org/DB_HTML/2009/09/2390.html.
•  C. Wang, D. Sun, and K.-C. Toh. Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. preprint, 2009.
•  M. Fortin and R. Glowinski. Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems. North-Holland Pub. Co., 1983.
•  R. Glowinski and P. Le Tallec. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. SIAM, Philadelphia, Pennsylvania, 1989.
•  Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence . Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
•  P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM J. Optim., 2008.
•  A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
•  D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Technical report, Department of IEOR, Columbia University, 2010.

## Appendix A Appendix

We show in the following that the iteration complexity of Algorithm 2 is to get an -optimal solution. First, we need the following definitions and a lemma which is a generalization of Lemma 2.3 in . Let and be convex functions and define

 Qψ(u,v):=ϕ(u)+ψ(v)+⟨γψ(v),u−v⟩+12μ∥u−v∥22,

where is any subgradient in the subdifferential of at the point , and

 pψ(v):=argminuQψ(u,v). (21)
###### Lemma A.1.

Let . For any , if

 Φ(pψ(v))≤Qψ(pψ(v),v), (22)

then for any ,

 2μ(Φ(u)−Φ(pψ(v)))≥∥pψ(v)−u∥2−∥v−u∥2. (23)
###### Proof.

From (22), we have

 Φ(u)−Φ(pψ(v))≥Φ(u)−Qψ(pψ(v),v)=Φ(u)−(ϕ(pψ(v))+ψ(v)+⟨γψ(v),pψ(v)−v⟩+12μ∥pψ(v)−v∥22). (24)

Now since and are convex we have

 ϕ(u)≥ϕ(pψ(v))+⟨u−pψ(v),γϕ(pψ(v))⟩, (25)
 ψ(u)≥ψ(v)+⟨u−v,γψ(v)⟩, (26)

where is a subgradient of and satisfies the first-order optimality conditions for (21), i.e.,

 γϕ(pψ(v))+γψ(v)+1μ(pψ(v)−v)=0. (27)

Summing (25) and (26) yields

 Φ(u)≥ϕ(pψ(v))+⟨u−pψ(v),γϕ(pψ(v))⟩+ψ(v)+⟨u−v,γψ(v)⟩. (28)

Therefore, from (24), (27) and (28) it follows that

 Φ(u)−Φ(pψ(v)) ≥⟨γψ(v)+γϕ(pψ(v)),u−pψ(v)⟩−12μ∥pψ(v)−v∥22 =⟨−1μ(pψ(v)−v),u−pψ(v)⟩−12μ∥pψ(v)−v∥22 =12μ(∥pψ(v)−u∥2−∥u−v∥2).

###### Proof of Theorem 2.1.

Let be the set of all iteration indices until -st for which no skipping occurs and let be its complement. Let . It follows that for all .

For we can apply Lemma A.1 to obtain the following inequalities. In (23), by letting , , and , we get , and

 2μ(F(x∗)−F(yn+1))≥∥yn+1−x∗∥2−∥xn+1−x∗∥2. (29)

Similarly, by letting , , and in (23) we get , and

 2μ(F(x∗)−F(xn+1))≥∥xn+1−x∗∥2−∥yn−x∗∥2. (30)

Taking the summation of (29) and (30) we get

 2μ(2F(x∗)−F(xn+1)−F(yn+1))≥∥yn+1−x∗∥2−∥yn−x∗∥2. (31)

For , (29) holds, and we get

 2μ(F(x∗)−F(yn+1))≥∥yn+1−x∗∥2−∥yn−x∗∥2, (32)

due to the fact that in this case.

Summing (31) and (32) over we get

 2μ((2|I|+|Ic|)F(x∗)−∑n∈IF(xn+1)−k−1∑n=0F(yn+1)) (33) ≥ k−1∑n=0(∥yn+1−x∗∥2−∥yn−x∗∥2) = ∥yk−x∗∥2−∥y0−x∗∥2 ≥ −∥x0−x∗∥2.

For any , since Lemma A.1 holds for any , letting instead of we get from (29) that

 2μ(F(xn+1)−F(yn+1))≥∥yn+1−xn+1∥2≥0, (34)

or, equivalently,

 2μ(F(xn)−F(yn))≥∥yn−xn∥2≥0. (35)

Similarly, for by letting instead of we get from (30) that

 2μ(F(yn)−F(xn+1))≥∥xn+1−yn∥2≥0. (36)

On the other hand, for , (36) also holds because , and hence holds for all .

Adding (34) and (36) we obtain

 2μ(F(yn)−F(yn+1))≥0. (37)

and adding (35) and (36) we obtain

 2μ(F(xn)−F(xn+1))≥0. (38)

(37) and (38) show that the sequences and are non-increasing. Thus we have,

 k−1∑n=0F(yn+1)≥kF(yk) and ∑n∈IF(xn+1)≥knF(xk). (39)

Combining (33) and (39) yields

 2μ((k+kn)F(x∗)−knF(xk)−kF(yk))≥−∥x0−x∗∥2. (40)

From (35) we know that . Thus (40) implies that

 2μ(k+kn)(F(yk)−F(x∗))≤∥x0−x∗∥2,

which gives us the desired result (11).

Also, for any given , as long as , we have from (11) that ; i.e., the number of iterations needed is for an -optimal solution. ∎  