
# Inexact alternating direction multiplier methods for separable convex optimization ††thanks: April 8, 2016. The authors gratefully acknowledge support by the National Science Foundation under grants 1522629 and 1522654, and by the Office of Naval Research under grant N00014-15-1-2048.

William W. Hager hager@math.ufl.edu, http://people.clas.ufl.edu/hager/, PO Box 118105, Department of Mathematics, University of Florida, Gainesville, FL 32611-8105. Phone (352) 294-2308. Fax (352) 392-8357.    Hongchao Zhang hozhang@math.lsu.edu, Department of Mathematics, Louisiana State University, Baton Rouge, LA 70803-4918. Phone (225) 578-1982. Fax (225) 578-4276.
###### Abstract

Inexact alternating direction multiplier methods (ADMMs) are developed for solving general separable convex optimization problems with a linear constraint and with an objective that is the sum of smooth and nonsmooth terms. The approach involves linearized subproblems, a back substitution step, and either gradient or accelerated gradient techniques. Global convergence is established. The methods are particularly useful when the ADMM subproblems do not have a closed-form solution or when the solution of the subproblems is expensive. Numerical experiments based on image reconstruction problems show the effectiveness of the proposed methods.

Key words. Separable convex optimization, Alternating direction method of multipliers, ADMM, Multiple blocks, Inexact solve, Global convergence

AMS subject classifications. 90C06, 90C25, 65Y20

## 1 Introduction

We consider a convex separable linearly constrained optimization problem

$$\min\ \Phi(x) \quad \text{subject to} \quad Ax = b, \tag{1.1}$$

where $x \in \mathbb{R}^n$, $b \in \mathbb{R}^p$, and $A$ is $p$ by $n$. By a separable convex problem, we mean that the objective function is a sum of independent parts, and the matrix $A$ is partitioned compatibly as in

$$\Phi(x) = \sum_{i=1}^m f_i(x_i) + h_i(x_i) \quad \text{and} \quad Ax = \sum_{i=1}^m A_i x_i. \tag{1.2}$$

Here $f_i : \mathbb{R}^{n_i} \to \mathbb{R}$ is convex and Lipschitz continuously differentiable, $h_i : \mathbb{R}^{n_i} \to \mathbb{R} \cup \{\infty\}$ is a proper closed convex function (possibly nonsmooth), $A_i$ is $p$ by $n_i$ with $n_1 + \cdots + n_m = n$, and the columns of $A_i$ are linearly independent for $i \ge 2$. Constraints of the form $x_i \in \mathcal{X}_i$, where $\mathcal{X}_i$ is a closed convex set, can be incorporated in the optimization problem by setting $h_i(x_i) = \infty$ when $x_i \notin \mathcal{X}_i$. The problem (1.1)–(1.2) has attracted extensive research due to its importance in areas such as image processing, statistical learning, and compressed sensing. See the recent survey [3] and its references.
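For instance (a sketch of this device with hypothetical data, not an example from the paper), when $\mathcal{X}_i$ is a box, the resulting nonsmooth term has a proximal map equal to the Euclidean projection onto the box:

```python
import numpy as np

def prox_box_indicator(x, lo, hi):
    """Proximal map of the indicator of the box [lo, hi]^n:
    the indicator is 0 inside the box and +infinity outside,
    so its prox is simply the componentwise projection."""
    return np.clip(x, lo, hi)

x = np.array([-1.5, 0.3, 2.0])
p = prox_box_indicator(x, 0.0, 1.0)  # components clipped into [0, 1]
```

This is why the indicator-function formulation of set constraints is compatible with proximal-type subproblem solvers: the nonsmooth step reduces to a projection.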

Let $\mathcal{L}$ be the Lagrangian given by

$$\mathcal{L}(x, \lambda) = \Phi(x) + \langle \lambda, Ax - b\rangle,$$

where $\lambda \in \mathbb{R}^p$ is the Lagrange multiplier for the linear constraint and $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product. It is assumed that there exists a solution $x^*$ to (1.1)–(1.2) and an associated Lagrange multiplier $\lambda^*$ such that $\mathcal{L}(\cdot, \lambda^*)$ attains a minimum at $x^*$, or equivalently, the following first-order optimality conditions hold: $Ax^* = b$, and for $1 \le i \le m$ and all $u \in \mathbb{R}^{n_i}$, we have

$$\langle \nabla f_i(x_i^*) + A_i^{\mathsf{T}}\lambda^*,\ u - x_i^*\rangle + h_i(u) \ge h_i(x_i^*). \tag{1.3}$$

A popular strategy for solving (1.1)–(1.2) is the alternating direction multiplier method (ADMM) [15, 16] given by

$$\left\{\begin{array}{l} x_i^{k+1} = \displaystyle\arg\min_{x_i \in \mathbb{R}^{n_i}} L(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \ldots, x_m^k, \lambda^k), \quad i = 1, \ldots, m, \\[6pt] \lambda^{k+1} = \lambda^k + \rho(Ax^{k+1} - b), \end{array}\right. \tag{1.4}$$

where $L$, the augmented Lagrangian, is defined by

$$L(x, \lambda) = \mathcal{L}(x, \lambda) + \frac{\rho}{2}\|Ax - b\|^2. \tag{1.5}$$

Here $\rho > 0$ is the penalty parameter. Early ADMMs only considered problem (1.1)–(1.2) with $m = 2$, corresponding to a 2-block structure. In this case, results on global convergence and complexity can be found in [12, 25]. When $m \ge 3$, the ADMM strategy (1.4), a natural extension of the 2-block ADMM, is not necessarily convergent [5], although its practical efficiency has been observed in many recent applications [34, 35].
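To make the iteration (1.4)–(1.5) concrete, the sketch below runs the classical 2-block ADMM on a small equality-constrained quadratic program with $h_i = 0$. All problem data are hypothetical, and the closed-form subproblem solves are available only because each $f_i$ is quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 2, 4
sizes = [3, 3]
# f_i(x_i) = 0.5 * x_i^T Q_i x_i (strongly convex), h_i = 0
Q = [np.eye(3), 2.0 * np.eye(3)]
A = [rng.standard_normal((p, ni)) for ni in sizes]
b = sum(Ai @ rng.standard_normal(ni) for Ai, ni in zip(A, sizes))  # feasible b

rho = 1.0
x = [np.zeros(ni) for ni in sizes]
lam = np.zeros(p)
for _ in range(2000):
    for i in range(m):
        # x_i update in (1.4): minimize the augmented Lagrangian over x_i
        # with the other blocks fixed; it is quadratic, so solve a linear system.
        c = sum(A[j] @ x[j] for j in range(m) if j != i) - b
        x[i] = np.linalg.solve(Q[i] + rho * A[i].T @ A[i],
                               -A[i].T @ (lam + rho * c))
    lam = lam + rho * (sum(A[i] @ x[i] for i in range(m)) - b)  # multiplier update

residual = np.linalg.norm(sum(A[i] @ x[i] for i in range(m)) - b)
```

For $m = 2$ this iteration is the convergent classical scheme; the point of the paper is precisely that for $m \ge 3$, and for subproblems without such closed-form solves, extra machinery (linearization, inexact solves, back substitution) is needed.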

Recently, much research has focused on modifications of ADMM that ensure convergence when $m \ge 3$. References include [4, 6, 7, 11, 17, 21, 23, 24, 29, 30, 31]. One approach [4, 7, 11, 21] assumes that some of the functions in the objective are strongly convex and the penalty parameter is sufficiently small. Linear convergence results under additional conditions are obtained in [31]. An analysis of a randomly permuted ADMM which allows for nonseparable variables is given in [6]. Another approach, first developed in [23, 24], involves a back substitution step to complement the ADMM forward substitution step. The algorithms developed in our paper utilize this back substitution step.

The dominant computation in an iteration of ADMM is the solution of the subproblems in (1.4). The efficiency depends on our ability to solve these subproblems inexactly while still maintaining global convergence, especially when no closed formula exists for the subproblem solutions [36]. One line of research solves the subproblems to an accuracy based on an absolutely summable error criterion [9, 12, 18]. In [28], the authors combine an adaptive error criterion with the absolutely summable error criterion for a 2-block ADMM with logarithmic-quadratic proximal regularization, and use further correction steps to modify the solutions generated from the ADMM subproblems. In [14], the authors develop a 2-block ADMM with a relative error stopping condition for the subproblems, motivated by [13], based on the total subgradient error. Another line of research adds proximal terms to make the subproblems strongly convex [8, 22] and relatively easy to solve. However, this approach often requires an accurate solution of the proximal subproblems. When $m = 1$, ADMM reduces to the standard augmented Lagrangian method (ALM), for which practical relative error criteria for solving the subproblems have been developed and encouraging numerical results have been obtained [13, 33]. In this paper, motivated by our recent work on variable stepsize Bregman operator splitting methods (BOSVS), by recent complexity results for gradient and accelerated gradient methods for convex optimization, and by the adaptive relative error strategy used in ALM, we develop new inexact approaches for solving the ADMM subproblems. To the best of our knowledge, these are the first ADMMs for the general separable convex optimization problem (1.1)–(1.2) that are based on an adaptive accuracy condition, that do not employ an absolutely summable error criterion, and that guarantee global convergence, even when $m \ge 3$.

To guarantee global convergence, a block Gaussian back substitution strategy is used to make corrections to the approximate subproblem solutions. In the special case $m = 2$, the method reduces to a 2-block ADMM without back substitution. The idea of using block Gaussian back substitution was first proposed in [23, 24]. The method in this earlier work requires the exact solution of the subproblems to obtain global convergence, while our new approach allows an inexact solution. More recently, a linearly convergent ADMM was developed in [26]. This algorithm linearizes the subproblems to achieve an inexact solution, and requires that the smooth and nonsmooth functions in the objective satisfy certain “local error bound” conditions. In addition, to ensure linear convergence, the stepsize in (1.4) must be sufficiently small, which could significantly deteriorate the practical performance.

In this paper, we focus on problems where the minimization of the augmented Lagrangian over one or more of the primal variables is nontrivial, and the accuracy of an inexact minimizer needs to be taken into account. On the other hand, when these minimizations are simple enough, it is practical to minimize the Lagrangian over the primal variables and compute the dual function. This leads to a possibly nonsmooth dual function, which can be approached through smoothing techniques as in [27], or through active set techniques as in [20].

Our paper is organized as follows. In Section 3, we first generalize the BOSVS algorithm [10, 19] to handle multiple blocks. The original BOSVS algorithm was tailored to the two-block case, but used an adaptive stepsize when solving the subproblem; consequently, it achieved much better overall efficiency than Bregman operator splitting (BOS) type algorithms based on a fixed, smaller stepsize. In subsequent sections, more adaptive stopping criteria for the subproblems are proposed. The adaptive criteria for bounding the accuracy of the ADMM subproblem solutions are based on both the current and the accumulated iteration change in the subproblem. These stopping criteria are motivated by the complexity analysis of gradient methods for convex optimization, and by the relative accuracy strategy often used in inexact augmented Lagrangian methods for nonlinear programming. Although our analysis is carried out with vector variables, the results could be extended to matrix variables, which may broaden the potential applications.

### 1.1 Notation

The set of solution/multiplier pairs for (1.1) is denoted $\mathcal{W}^*$, while $(x^*, \lambda^*)$ is a generic solution/multiplier pair. For $x$, $y \in \mathbb{R}^n$, $\langle x, y\rangle = x^{\mathsf{T}}y$ is the standard inner product, where the superscript $\mathsf{T}$ denotes transpose. The Euclidean norm, denoted $\|\cdot\|$, is defined by $\|x\| = \sqrt{\langle x, x\rangle}$, and $\|x\|_Q^2 = \langle x, Qx\rangle$ for a positive definite matrix $Q$. $\mathbb{R}_+$ denotes the set of nonnegative real numbers, while $\mathbb{R}_{++}$ denotes the set of positive real numbers. For a differentiable function $f$, $\nabla f(x)$ is the gradient of $f$ at $x$, a column vector. More generally, $\partial f(x)$ denotes the subdifferential of $f$ at $x$. If $x$ is a vector, then $x_+$ denotes the subvector obtained by dropping the first block of variables from $x$. Thus if $x = (x_1, \ldots, x_m)$ with $x_i \in \mathbb{R}^{n_i}$ for $1 \le i \le m$, then $x_+ = (x_2, \ldots, x_m)$.

## 2 Algorithm Structure

Three related inexact ADMMs are developed, called generalized, multistep, and accelerated BOSVS. They differ in the details of the formula for the new iterate $x^{k+1}$, but the overall structure of the algorithms is the same. Both multistep and accelerated BOSVS typically represent a more exact ADMM iteration when compared to generalized BOSVS, while the accelerated BOSVS subiterations often converge more rapidly than those of multistep BOSVS. The common elements of these three algorithms appear in the common framework described in this section.

The algorithms generate three sequences $x^k$, $y^k$, and $z^k$. In Step 1 of the common framework there may be more than one ADMM subiteration, as determined by an adaptive stopping criterion. The iterate $z^k$ is the final iterate generated in the ADMM (forward substitution) subproblems, $y^k$ is generated by the back substitution process in Step 3, and $x^k$ is an average of the iterates in the ADMM subproblems of Step 1. In generalized BOSVS, $x^{k+1} = z^k$ since there is only one ADMM subiteration, while multistep and accelerated BOSVS typically perform more than one subiteration, and $x^{k+1}$ is obtained by a nontrivial averaging process. The matrix $M$ in Step 3 is the $(m-1)$ by $(m-1)$ block lower triangular matrix defined by

$$M_{ij} = \begin{cases} A_{i+1}^{\mathsf{T}} A_{j+1} & \text{if } 1 \le j \le i, \\ 0 & \text{otherwise.} \end{cases}$$

The matrix $H$ is the $(m-1)$ by $(m-1)$ block diagonal matrix whose diagonal blocks match those of $M$. The matrices $M$ and $H$ are invertible since the columns of $A_i$ are linearly independent for $i \ge 2$, which implies that $A_i^{\mathsf{T}}A_i$ is invertible for $i \ge 2$.
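The matrices $M$ and $H$ can be assembled directly from the blocks $A_2, \ldots, A_m$. The sketch below (with random hypothetical blocks) builds both and illustrates why full column rank of the $A_i$ makes them invertible:

```python
import numpy as np

rng = np.random.default_rng(1)
p, ni, m = 5, 2, 4
# With 0-based lists, A[k] plays the role of A_{k+1} in the paper's notation.
A = [rng.standard_normal((p, ni)) for _ in range(m)]  # A_1, ..., A_m

zero = np.zeros((ni, ni))
# (i, j) block of M is A_{i+1}^T A_{j+1} for j <= i, and zero above the diagonal.
blocks = [[A[i + 1].T @ A[j + 1] if j <= i else zero
           for j in range(m - 1)] for i in range(m - 1)]
M = np.block(blocks)
# H keeps only the diagonal blocks A_{i+1}^T A_{i+1} of M.
H = np.block([[blocks[i][j] if i == j else zero
               for j in range(m - 1)] for i in range(m - 1)])
```

Since $M$ is block lower triangular, its determinant is the product of the determinants of the diagonal blocks $A_{i+1}^{\mathsf{T}}A_{i+1}$, each of which is nonsingular when $A_{i+1}$ has full column rank; the same diagonal blocks make $H$ invertible.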

## 3 Generalized BOSVS

Our first algorithm is a generalization of the BOSVS algorithm developed in [10, 19] for a two-block optimization problem. Let $\Phi_i^k$ be defined by

$$\Phi_i^k(u, v, \delta) = f_i(v) + \langle \nabla f_i(v), u - v\rangle + \frac{\delta}{2}\|u - v\|^2 + h_i(u) + \frac{\rho}{2}\|A_i u - b_i^k + \lambda^k/\rho\|^2,$$

where

$$b_i^k = b - \sum_{j<i} A_j z_j^k - \sum_{j>i} A_j y_j^k. \tag{3.1}$$

The function $\Phi_i^k$ corresponds to the part of the augmented Lagrangian associated with the $i$-th component of $x$, but with the smooth term $f_i$ linearized around $v$ and with a proximal term added to the objective. The inner loop which follows is Step 1 of the common framework for generalized BOSVS. Throughout the paper, the generalized BOSVS algorithm refers to the common framework with this Step 1 inner loop.

Although the back substitution iterate does not appear explicitly in the inner loop, it enters through the $b_i^k$ term (3.1) of $\Phi_i^k$. The iterate $z_i^k$ is obtained by minimizing the function $\Phi_i^k$ and checking the line search condition of Step 1b. In Step 1a, mid denotes median, and the initial stepsize of Step 1a is a safeguarded version of the Barzilai-Borwein formula [2].
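A standalone sketch of a safeguarded Barzilai-Borwein initialization in the spirit of Step 1a (the function and parameter names are ours, and the precise BB variant used in Step 1a may differ):

```python
import numpy as np

def mid(a, b, c):
    """Median of three scalars."""
    return sorted([a, b, c])[1]

def bb_stepsize(x_new, x_old, g_new, g_old, delta_min, delta_max):
    """Safeguarded Barzilai-Borwein curvature estimate
    delta = <s, y> / <s, s>, with s = x_new - x_old and y = g_new - g_old,
    clipped into [delta_min, delta_max] by taking the median."""
    s = x_new - x_old
    y = g_new - g_old
    ss = float(s @ s)
    bb = float(s @ y) / ss if ss > 0 else delta_min
    return mid(delta_min, bb, delta_max)
```

For a quadratic $f(x) = \frac{c}{2}\|x\|^2$ the estimate recovers $c$ exactly; the safeguards keep the stepsize parameter inside $[\delta_{\min}, \delta_{\max}]$ even when the raw BB ratio is tiny, huge, or negative.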

Let $\zeta_i$ denote the Lipschitz constant for $\nabla f_i$. By a Taylor expansion of $f_i$ around $v$, we see that the line search condition of Step 1b is satisfied whenever $(1 - \sigma)\delta_i^k \ge \zeta_i$, or equivalently, when

$$\delta_i^k \ge \zeta_i/(1 - \sigma). \tag{3.2}$$

Since $\eta > 1$, $\delta_i^k$ increases as the line search proceeds, and consequently, (3.2) holds for $\delta_i^k$ sufficiently large. Hence, if the line search increases $\delta_i^k$ beyond its starting value, then at termination $\delta_i^k \le \eta\zeta_i/(1 - \sigma)$. If the line search terminates at the starting value, then $\delta_i^k \le \delta_{\max}$. In summary, we have

$$\delta_{\min} \le \delta_i^k \le \max\{\eta\zeta_i/(1 - \sigma),\ \delta_{\max}\} \quad \text{for all } k. \tag{3.3}$$

Since the starting value of $\delta_i^k$ in Step 1a is at least $\delta_{\min}$, the line search condition holds at the starting value whenever $\delta_{\min} \ge \zeta_i/(1 - \sigma)$. In Step 1d, $\delta_{\min}$ is increased by the factor $\eta$ whenever the line search increases $\delta_i^k$ above its starting value. Hence, after a finite number of iterations where $\delta_{\min}$ is increased, we have $\delta_{\min} \ge \zeta_i/(1 - \sigma)$, which implies that the line search terminates at its starting value with $\delta_i^k \le \delta_i^{k-1}$. We conclude that

$$\delta_i^k \le \delta_i^{k-1} \quad \text{for } k \text{ sufficiently large}, \tag{3.4}$$

where $\delta_i^k$ denotes the final value accepted in Step 1b. Note that the inequality in (3.4) cannot be replaced by equality, since the number of iterations where $\delta_{\min}$ is increased may not be enough to yield $\delta_{\min} \ge \zeta_i/(1 - \sigma)$.

In the BOS algorithm, the line search is essentially eliminated by taking $\delta_i^k$ larger than the Lipschitz constant $\zeta_i$. Taking $\delta_i^k$ large, however, causes the steps to be small due to the proximal term in the objective associated with $\delta_i^k$. These small steps lead to slower convergence than what is achieved with BOSVS, where $\delta_i^k$ is adjusted by the line search criterion in order to achieve a small, but acceptable, choice for $\delta_i^k$.
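The interplay between the line search condition (3.6) and the factor $\eta$ can be sketched as a simple backtracking-style loop on $\delta$ (a hypothetical standalone version; in the algorithm the trial point also depends on the subproblem data):

```python
import numpy as np

def find_delta(f, grad, v, step_for_delta, delta0, sigma=0.5, eta=2.0, max_tries=50):
    """Increase delta by the factor eta until the descent-type condition
        f(v) + <grad f(v), u - v> + (1 - sigma)*(delta/2)*||u - v||^2 >= f(u)
    holds, where u = step_for_delta(delta) is the trial point for that delta."""
    delta = delta0
    for _ in range(max_tries):
        u = step_for_delta(delta)
        d = u - v
        gap = f(v) + grad(v) @ d + (1 - sigma) * (delta / 2) * np.dot(d, d) - f(u)
        if gap >= 0:
            return delta, u
        delta *= eta
    return delta, step_for_delta(delta)

# Example: f(x) = ||x||^2, whose gradient has Lipschitz constant 2, with a
# gradient-style trial step u = v - grad(v)/delta.
f = lambda x: float(x @ x)
grad = lambda x: 2 * x
v = np.array([1.0, -2.0])
delta, u = find_delta(f, grad, v, lambda d: v - grad(v) / d, delta0=0.1)
```

For this quadratic the condition first holds once $(1 - \sigma)\delta \ge 2$, i.e. $\delta \ge 4$, so starting from $0.1$ and doubling, the loop stops at $\delta = 6.4$, consistent with the bound (3.3).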

In our analysis of generalized BOSVS, we first observe that when $x^{k+1} = x^k$, we have reached a solution of (1.1)–(1.2).

###### Lemma 3.1

If $x^{k+1} = x^k$ in the generalized BOSVS algorithm, then $x^k$ solves (1.1)–(1.2) and $\lambda^{k+1} = \lambda^k$.

Proof. Let $x^*$ denote $x^k$. If $x^{k+1} = x^k$, then $x_i^{k+1} = x_i^k$ for each $i$. For generalized BOSVS, we set $x^{k+1} = z^k$ in Step 1b. Since $x^{k+1} = x^k$, it follows that $z^k = x^k = x^*$. These identities also imply that $y^k = z^k$, so that the back substitution of Step 3 yields $y^{k+1} = y^k$. Since $y^k = z^k$, Step 2 of the common framework implies that $Az^k = b$; hence $\lambda^{k+1} = \lambda^k$ and $Ax^* = b$. Consequently, we have

$$b_i^k = b - \sum_{j<i} A_j z_j^k - \sum_{j>i} A_j y_j^k = b - \sum_{j \ne i} A_j x_j^* = A_i x_i^*. \tag{3.5}$$

By Step 1b, $z_i^k$ is the minimizer of $\Phi_i^k(\cdot, x_i^k, \delta_i^k)$. Since $z_i^k = x_i^k$, it follows that $x_i^k$ is the minimizer of $\Phi_i^k(\cdot, x_i^k, \delta_i^k)$. After taking into account (3.5), the first-order optimality condition associated with the minimizer of $\Phi_i^k$ is exactly the same as (1.3), but with $\lambda^*$ replaced by $\lambda^k$. Hence, $(x^*, \lambda^k) \in \mathcal{W}^*$. $\square$

Two lemmas are needed for the convergence of the generalized BOSVS algorithm.

###### Lemma 3.2

Suppose that $u$ and $v \in \mathbb{R}^{n_i}$ satisfy

$$f_i(v) + \langle \nabla f_i(v), u - v\rangle + \frac{(1 - \sigma)\delta}{2}\|u - v\|^2 \ge f_i(u) \tag{3.6}$$

for some $\sigma \in [0, 1)$ and $\delta > 0$, where $u$ minimizes $\Phi_i^k(\cdot, v, \delta)$. Then, for any $w \in \mathbb{R}^{n_i}$ we have

$$L_i^k(w) - L_i^k(u) \ge \frac{\delta}{2}\left(\|w - u\|^2 - \|w - v\|^2\right) + \frac{\rho}{2}\|A_i(w - u)\|^2 + \frac{\sigma\delta}{2}\|u - v\|^2, \tag{3.7}$$

where $L_i^k$ is given by

$$L_i^k(u) = f_i(u) + h_i(u) + \frac{\rho}{2}\|A_i u - b_i^k + \lambda^k/\rho\|^2. \tag{3.8}$$

Proof. Adding $h_i(u) + \frac{\rho}{2}\|A_i u - b_i^k + \lambda^k/\rho\|^2$ to each side of the inequality (3.6) and rearranging, we obtain

$$\Phi_i^k(u, v, \delta) - \frac{\sigma\delta}{2}\|u - v\|^2 \ge L_i^k(u).$$

Adding $L_i^k(w)$ to each side of this inequality and rearranging gives

$$L_i^k(w) - L_i^k(u) \ge L_i^k(w) - \Phi_i^k(u, v, \delta) + \frac{\sigma\delta}{2}\|u - v\|^2. \tag{3.9}$$

Utilizing the convexity inequality $f_i(w) \ge f_i(v) + \langle \nabla f_i(v), w - v\rangle$, we have

$$L_i^k(w) - \Phi_i^k(u, v, \delta) \ge \frac{\rho}{2}\left(\|A_i w - b_i^k + \lambda^k/\rho\|^2 - \|A_i u - b_i^k + \lambda^k/\rho\|^2\right) + \langle \nabla f_i(v), w - u\rangle - \frac{\delta}{2}\|u - v\|^2 + h_i(w) - h_i(u).$$

Expand the quadratic terms involving $A_i$ on the right side in a Taylor series around $u$ to obtain

$$L_i^k(w) - \Phi_i^k(u, v, \delta) \ge \langle g_i^k, w - u\rangle + h_i(w) - h_i(u) + \frac{\rho}{2}\|A_i(w - u)\|^2 - \frac{\delta}{2}\|u - v\|^2,$$

where $g_i^k = \nabla f_i(v) + \rho A_i^{\mathsf{T}}(A_i u - b_i^k + \lambda^k/\rho)$. Since $\Phi_i^k$ is the sum of smooth and nonsmooth terms, the first-order optimality condition for the minimizer $u$ of $\Phi_i^k(\cdot, v, \delta)$ can be expressed

$$\langle g_i^k + \delta(u - v), w - u\rangle + h_i(w) \ge h_i(u)$$

for all $w \in \mathbb{R}^{n_i}$, which implies that

$$\langle g_i^k, w - u\rangle + h_i(w) - h_i(u) \ge -\delta\langle u - v, w - u\rangle$$

for all $w \in \mathbb{R}^{n_i}$. Utilizing this inequality, we have

$$L_i^k(w) - \Phi_i^k(u, v, \delta) \ge -\delta\langle u - v, w - u\rangle + \frac{\rho}{2}\|A_i(w - u)\|^2 - \frac{\delta}{2}\|u - v\|^2.$$

Insert this in (3.9). Since

$$2\langle u - v, w - u\rangle + \|u - v\|^2 = \|w - v\|^2 - \|w - u\|^2,$$
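This identity is just an expansion of squares, $\|w - v\|^2 = \|(w - u) + (u - v)\|^2$; a quick numerical sanity check (ours, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
u, v, w = rng.standard_normal((3, 5))  # three random vectors in R^5
lhs = 2 * np.dot(u - v, w - u) + np.dot(u - v, u - v)
rhs = np.dot(w - v, w - v) - np.dot(w - u, w - u)
# lhs and rhs agree up to floating-point roundoff for any u, v, w
```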

the proof is complete.

We use Lemma 3.2 to establish a decay property that is key to the convergence analysis.

###### Lemma 3.3

Let $(x^*, \lambda^*)$ be any solution/multiplier pair for (1.1)–(1.2), let $x^k$, $y^k$, $z^k$, and $\lambda^k$ be the iterates of the generalized BOSVS algorithm, and define

$$E_k = \rho\|y_+^k - x_+^*\|_P^2 + \frac{1}{\rho}\|\lambda^k - \lambda^*\|^2 + \alpha\sum_{i=1}^m \delta_i^k\|x_i^k - x_i^*\|^2,$$

where $P$ is a fixed positive definite matrix and $\alpha > 0$ is a constant. Then for $k$ large enough that the monotonicity condition (3.4) holds for all $i$, we have

$$E_k \ge E_{k+1} + c_1\|x^{k+1} - x^k\|^2 + c_2\rho\left(\|y_+^k - z_+^k\|_H^2 + \|Az^k - b\|^2\right),$$

where $c_1$ and $c_2$ are positive constants.

Proof. By the inequality (3.7) of Lemma 3.2 with $u = z_i^k$, $v = x_i^k$, $w = x_i^*$, and $\delta = \delta_i^k$, we have

$$\begin{aligned} L_i^k(x_i^*) - L_i^k(z_i^k) - \frac{\rho}{2}\|A_i z_{e,i}^k\|^2 &\ge \frac{\delta_i^k}{2}\left(\|z_{e,i}^k\|^2 - \|x_{e,i}^k\|^2\right) + \frac{\sigma\delta_i^k}{2}\|z_i^k - x_i^k\|^2 \\ &= \frac{\delta_i^k}{2}\left(\|x_{e,i}^{k+1}\|^2 - \|x_{e,i}^k\|^2\right) + \frac{\sigma\delta_i^k}{2}\|x_i^{k+1} - x_i^k\|^2, \end{aligned} \tag{3.10}$$

where $z_{e,i}^k = z_i^k - x_i^*$, $x_{e,i}^k = x_i^k - x_i^*$, and $z^k = x^{k+1}$ by Step 1b of generalized BOSVS. Since $x^*$ minimizes the augmented Lagrangian $L(\cdot, \lambda^*)$ and since the augmented Lagrangian is the sum of smooth and nonsmooth terms, the first-order optimality condition implies that for each $i$,

$$\langle g_i^*, u - x_i^*\rangle + h_i(u) \ge h_i(x_i^*) \tag{3.11}$$

for all $u \in \mathbb{R}^{n_i}$, where $g_i^*$ is the gradient of the smooth part of the objective evaluated at $x_i^*$:

$$g_i^* = \nabla f_i(x_i^*) + \rho A_i^{\mathsf{T}}\Big(\sum_{j=1}^m A_j x_j^* - b + \lambda^*/\rho\Big) = \nabla f_i(x_i^*) + A_i^{\mathsf{T}}\lambda^*,$$

since $Ax^* = b$.

We take $u = z_i^k$ in (3.11) and combine the resulting bound on $h_i(x_i^*) - h_i(z_i^k)$ with the expansion of $L_i^k(x_i^*) - L_i^k(z_i^k)$. After much cancellation, we obtain the relation

$$\begin{aligned} L_i^k(x_i^*) - L_i^k(z_i^k) - \frac{\rho}{2}\|A_i z_{e,i}^k\|^2 &\le f_i(x_i^*) - f_i(z_i^k) + \nabla f_i(x_i^*)^{\mathsf{T}} z_{e,i}^k - \rho\Big\langle \sum_{j\le i} A_j z_{e,j}^k + \sum_{j>i} A_j y_{e,j}^k + \lambda_e^k/\rho,\ A_i z_{e,i}^k \Big\rangle \\ &\le -\rho\Big\langle \sum_{j\le i} A_j z_{e,j}^k + \sum_{j>i} A_j y_{e,j}^k + \lambda_e^k/\rho,\ A_i z_{e,i}^k \Big\rangle, \end{aligned} \tag{3.12}$$

where $y_{e,j}^k = y_j^k - x_j^*$, $\lambda_e^k = \lambda^k - \lambda^*$, and the last inequality is due to the convexity of $f_i$. We combine this upper bound with the lower bound (3.10) to obtain

$$-\rho\Big\langle A_i z_{e,i}^k,\ \sum_{j\le i} A_j z_{e,j}^k + \sum_{j>i} A_j y_{e,j}^k + \lambda_e^k/\rho \Big\rangle \ge \frac{\delta_i^k}{2}\left(\|x_{e,i}^{k+1}\|^2 - \|x_{e,i}^k\|^2\right) + \frac{\sigma\delta_i^k}{2}\|x_i^{k+1} - x_i^k\|^2. \tag{3.13}$$

Focusing on the left side of (3.13), observe that

$$\sum_{j\le i} A_j z_{e,j}^k + \sum_{j>i} A_j y_{e,j}^k = \sum_{j=1}^m A_j(z_j^k - x_j^*) + \sum_{j>i} A_j(y_j^k - z_j^k) = Az^k - b + \sum_{j>i} A_j(y_j^k - z_j^k), \tag{3.14}$$

since $Ax^* = b$. Let $\tau_i^k$ denote the right side of (3.13):

$$\tau_i^k = \frac{\delta_i^k}{2}\left(\|x_{e,i}^{k+1}\|^2 - \|x_{e,i}^k\|^2\right) + \frac{\sigma\delta_i^k}{2}\|x_i^{k+1} - x_i^k\|^2.$$

With this notation and with the simplification (3.14), (3.13) becomes

$$-\rho\Big\langle A_i z_{e,i}^k,\ Az^k - b + \lambda_e^k/\rho + \sum_{j>i} A_j(y_j^k - z_j^k) \Big\rangle \ge \tau_i^k. \tag{3.15}$$

We will sum the inequality (3.15) over $i$ between 1 and $m$. Since

$$\sum_{i=1}^m A_i z_{e,i}^k = \sum_{i=1}^m A_i(z_i^k - x_i^*) = Az^k - b =: r^k,$$

it follows that in (3.15),

$$\sum_{i=1}^m \langle A_i z_{e,i}^k,\ r^k + \lambda_e^k/\rho\rangle = \langle r^k,\ r^k + \lambda_e^k/\rho\rangle. \tag{3.16}$$

Also, observe that

$$\sum_{j>i} A_j(y_j^k - z_j^k) = \sum_{j=2}^m A_j(y_j^k - z_j^k) - \sum_{j=2}^i A_j(y_j^k - z_j^k),$$

with the convention that the sum from $j = 2$ to $i$ is 0 when $i = 1$. Take the inner product of this identity with $A_i z_{e,i}^k$ and sum over $i$ to obtain

$$\sum_{i=1}^m \Big\langle A_i z_{e,i}^k,\ \sum_{j>i} A_j(y_j^k - z_j^k)\Big\rangle = \Big\langle r^k,\ \sum_{j=2}^m A_j(y_j^k - z_j^k)\Big\rangle - (z_+^k - x_+^*)^{\mathsf{T}} M (y_+^k - z_+^k), \tag{3.17}$$

where $M$ is the block lower triangular matrix defined in Section 2. We sum (3.15) over $i$ between 1 and $m$ and utilize (3.16) and (3.17) to obtain

$$(y_+^k - x_+^*)^{\mathsf{T}} M w - \frac{1}{\rho}\Big(\langle r^k, \lambda_e^k\rangle + \sum_{i=1}^m \tau_i^k\Big) \ge w^{\mathsf{T}} M w + \Big\langle r^k,\ r^k + \sum_{j=2}^m A_j w_{j-1}\Big\rangle, \tag{3.18}$$

where $w = y_+^k - z_+^k$.

Observe that

$$w^{\mathsf{T}} M w = \frac{1}{2} w^{\mathsf{T}}(M + M^{\mathsf{T}})w = \frac{1}{2} w^{\mathsf{T}}(M + M^{\mathsf{T}} - H)w + \frac{1}{2} w^{\mathsf{T}} H w = \frac{1}{2}\Big\|\sum_{i=2}^m A_i w_{i-1}\Big\|^2 + \frac{1}{2} w^{\mathsf{T}} H w$$

since, by the definition of $M$ and $H$, the $(i, j)$ block of $M + M^{\mathsf{T}} - H$ is $A_{i+1}^{\mathsf{T}} A_{j+1}$. With this substitution, the right side of (3.18) becomes a sum of squares:

$$w^{\mathsf{T}} M w + \Big\langle r^k,\ r^k + \sum_{j=2}^m A_j w_{j-1}\Big\rangle = \frac{1}{2}\Big(w^{\mathsf{T}} H w + \|r^k\|^2 + \Big\|r^k + \sum_{i=2}^m A_i w_{i-1}\Big\|^2\Big).$$

Hence, it follows from (3.18) that

$$(y_+^k - x_+^*)^{\mathsf{T}} M w - \frac{1}{\rho}\Big(\langle r^k, \lambda_e^k\rangle + \sum_{i=1}^m \tau_i^k\Big) \ge \frac{1}{2}\Big(w^{\mathsf{T}} H w + \|r^k\|^2 + \Big\|r^k + \sum_{i=2}^m A_i w_{i-1}\Big\|^2\Big).$$