# Randomized Primal-Dual Proximal Block Coordinate Updates

Xiang Gao (gaoxx460@umn.edu), Department of Industrial & Systems Engineering, University of Minnesota; Yangyang Xu (yangyang.xu@ua.edu), Department of Mathematics, University of Alabama; Shuzhong Zhang (zhangs@umn.edu), Department of Industrial & Systems Engineering, University of Minnesota
July 3, 2019
###### Abstract

In this paper we propose a randomized primal-dual proximal block coordinate updating framework for a general multi-block convex optimization model with a coupled objective function and linear constraints. Assuming mere convexity, we establish an $O(1/t)$ convergence rate in terms of the objective value and a feasibility measure, where $t$ is the iteration number. The framework includes several existing algorithms as special cases, such as a primal-dual method for bilinear saddle-point problems (PD-S), the proximal Jacobian ADMM (Prox-JADMM), and a randomized variant of the ADMM for multi-block convex optimization. Our analysis recovers and/or strengthens the convergence properties of several existing algorithms. For example, for PD-S our result leads to the same order of convergence rate without the previously assumed boundedness condition on the constraint sets, and for Prox-JADMM the new result provides a convergence rate in terms of the objective value and the feasibility violation. It is well known that the original ADMM may fail to converge when the number of blocks exceeds two. Our result shows that if an appropriate randomization procedure is invoked to select the updating blocks, then a sublinear rate of convergence in expectation can be guaranteed for multi-block ADMM, without assuming any strong convexity. The new approach is also extended to problems where only a stochastic approximation of the (sub-)gradient of the objective is available, and we establish an $O(1/\sqrt{t})$ convergence rate of the extended approach for solving stochastic programs.

Keywords: primal-dual method, alternating direction method of multipliers (ADMM), randomized algorithm, iteration complexity, first-order stochastic approximation.

Mathematics Subject Classification: 90C25, 90C06, 68W20.

## 1 Introduction

In this paper, we consider the following multi-block structured convex optimization model

$$\begin{array}{rl}
\min\limits_{x,y} & f(x_1,\dots,x_N)+\sum\limits_{i=1}^N u_i(x_i)+g(y_1,\dots,y_M)+\sum\limits_{j=1}^M v_j(y_j)\\[4pt]
\text{s.t.} & \sum\limits_{i=1}^N A_i x_i+\sum\limits_{j=1}^M B_j y_j=b,\\[4pt]
& x_i\in X_i,\ i=1,\dots,N;\quad y_j\in Y_j,\ j=1,\dots,M,
\end{array}\tag{1}$$

where the variables $x=(x_1,\dots,x_N)$ and $y=(y_1,\dots,y_M)$ are naturally partitioned into $N$ and $M$ blocks respectively, the $A_i$'s and $B_j$'s are block matrices, the $X_i$'s and $Y_j$'s are some closed convex sets, $f$ and $g$ are smooth convex functions, and the $u_i$'s and $v_j$'s are proper closed convex (possibly nonsmooth) functions.

### 1.1 Motivating examples

Optimization problems in the form of (1) have many emerging applications from various fields. For example, the constrained lasso (classo) problem that was first studied by James et al. [26] as a generalization of the lasso problem, can be formulated as

$$\min_x\ \tfrac12\|Ax-b\|_2^2+\tau\|x\|_1 \quad \text{s.t.}\ Cx\le d, \tag{2}$$

where $A$ and $b$ are the observed data, and $C$ and $d$ are the predefined constraint matrix and vector. Many widely used statistical models can be viewed as special cases of (2), including monotone curve estimation, fused lasso, generalized lasso, and so on [26]. By partitioning the variable into $K$ blocks as $x=(x_1,\dots,x_K)$, partitioning the columns of the matrices in (2) correspondingly, and introducing a slack variable $y$, the classo problem can be transformed to

$$\min_{x,y}\ \tfrac12\Big\|\sum_{i=1}^K A_i x_i-b\Big\|_2^2+\tau\sum_{i=1}^K\|x_i\|_1 \quad \text{s.t.}\ \sum_{i=1}^K C_i x_i+y=d,\ y\ge 0, \tag{3}$$

which is in the form of (1).
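For concreteness, the block-partitioned reformulation (3) can be sanity-checked numerically. The sketch below (Python with NumPy; the sizes, data, and the choice $\tau=1$ are hypothetical, for illustration only) partitions $x$ and the columns of $A$ and $C$ into $K$ blocks, forms the slack variable $y=d-Cx$, and verifies that the objective and constraints of (2) and (3) agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy classo data (hypothetical sizes)
m_obs, p, q, K = 8, 6, 4, 3          # K blocks of size p/K each
A = rng.standard_normal((m_obs, p))
b = rng.standard_normal(m_obs)
C = rng.standard_normal((q, p))
x = rng.standard_normal(p)
d = C @ x + rng.random(q)            # choose d so that C x <= d holds

# Partition x (and the columns of A, C) into K blocks
blocks = np.split(np.arange(p), K)
A_blk = [A[:, idx] for idx in blocks]
C_blk = [C[:, idx] for idx in blocks]
x_blk = [x[idx] for idx in blocks]

# Slack variable turning C x <= d into sum_i C_i x_i + y = d, y >= 0
y = d - C @ x

# The block-partitioned objective and constraint match the original ones (tau = 1)
obj_orig = 0.5 * np.linalg.norm(A @ x - b) ** 2 + np.sum(np.abs(x))
obj_blk = 0.5 * np.linalg.norm(sum(Ai @ xi for Ai, xi in zip(A_blk, x_blk)) - b) ** 2 \
          + sum(np.abs(xi).sum() for xi in x_blk)
assert np.isclose(obj_orig, obj_blk)
assert np.all(y >= 0)
assert np.allclose(sum(Ci @ xi for Ci, xi in zip(C_blk, x_blk)) + y, d)
```
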

Another interesting example is the extended linear-quadratic programming [40] that can be formulated as

$$\min_x\ \tfrac12 x^\top P x+a^\top x+\max_{s\in S}\Big\{(d-Cx)^\top s-\tfrac12 s^\top Q s\Big\} \quad \text{s.t.}\ Ax\le b, \tag{4}$$

where $P$ and $Q$ are symmetric positive semidefinite matrices, and $S$ is a polyhedral set. Apparently, (4) includes quadratic programming as a special case. In general, its objective is a piecewise linear-quadratic convex function. Let $g(s)=\frac12 s^\top Q s+\iota_S(s)$, where $\iota_S$ denotes the indicator function of $S$. Then

$$\max_{s\in S}\Big\{(d-Cx)^\top s-\tfrac12 s^\top Q s\Big\}=g^*(d-Cx),$$

where $g^*$ denotes the convex conjugate of $g$. Replacing the max-term in (4) by $g^*(y)$ with $y=d-Cx$ and introducing a slack variable $z$, we can equivalently write (4) in the form of (1):

$$\min_{x,y,z}\ \tfrac12 x^\top P x+a^\top x+g^*(y) \quad \text{s.t.}\ Ax+z=b,\ z\ge 0,\ Cx+y=d, \tag{5}$$

for which one can further partition the $x$-variable into a number of disjoint blocks.
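To see why the max-term above is a convex conjugate, note that the indicator $\iota_S$ absorbs the constraint $s\in S$ into an unconstrained maximization; a two-line derivation from the definition $g^*(w)=\sup_s\{\langle w,s\rangle-g(s)\}$:

```latex
g^*(d-Cx)\;=\;\sup_{s}\Big\{\langle d-Cx,\,s\rangle-\tfrac12 s^\top Q s-\iota_S(s)\Big\}
\;=\;\max_{s\in S}\Big\{(d-Cx)^\top s-\tfrac12 s^\top Q s\Big\},
```

since $\iota_S(s)=0$ for $s\in S$ and $+\infty$ otherwise, and the supremum is attained on the polyhedral set $S$ because the objective is concave in $s$.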

Many other interesting applications in various areas can be formulated as optimization problems in the form of (1), including those arising from signal processing, image processing, machine learning and statistical learning; see [24, 8, 5, 16] and the references therein.

Finally, we mention that computing a point on the central path for a generic convex program in block variables $x=(x_1,\dots,x_N)$:

$$\min_x\ f(x_1,\dots,x_N) \quad \text{s.t.}\ \sum_{i=1}^N A_i x_i\le b,\ x_i\ge 0,\ i=1,2,\dots,N$$

boils down to

$$\min_{x,y}\ f(x_1,\dots,x_N)-\mu e^\top\ln x-\mu e^\top\ln y \quad \text{s.t.}\ \sum_{i=1}^N A_i x_i+y=b,$$

where $\mu>0$ is the barrier parameter and $e^\top\ln x$ indicates the sum of the logarithms of all the components of $x$. This model is again in the form of (1).

### 1.2 Related works in the literature

Our work relates to two recently very popular topics: the Alternating Direction Method of Multipliers (ADMM) for multi-block structured problems and the first-order primal-dual method for bilinear saddle-point problems. Below we review the two methods and their convergence results. More complete discussion on their connections to our method will be provided after presenting our algorithm.

#### Multi-block ADMM and its variants

One well-known approach for solving a linearly constrained problem in the form of (1) is the augmented Lagrangian method, which iteratively updates the primal variables $(x,y)$ by minimizing the augmented Lagrangian function in (7), and then the multiplier $\lambda$ through dual gradient ascent. However, since the linear constraint couples all the blocks of $x$ and $y$ together, it can be very expensive to minimize the augmented Lagrangian function simultaneously with respect to all block variables. Utilizing the multi-block structure of the problem, the multi-block ADMM updates the block variables sequentially, one at a time with the others fixed to their most recent values, followed by an update of the multiplier. Specifically, it performs the following updates iteratively (assuming the absence of the coupled functions $f$ and $g$):

$$\left\{\begin{array}{l}
x_1^{k+1}=\operatorname*{arg\,min}_{x_1\in X_1} L_\rho(x_1,x_2^k,\dots,x_N^k,y^k,\lambda^k),\\[2pt]
\quad\vdots\\[2pt]
x_N^{k+1}=\operatorname*{arg\,min}_{x_N\in X_N} L_\rho(x_1^{k+1},\dots,x_{N-1}^{k+1},x_N,y^k,\lambda^k),\\[2pt]
y_1^{k+1}=\operatorname*{arg\,min}_{y_1\in Y_1} L_\rho(x^{k+1},y_1,y_2^k,\dots,y_M^k,\lambda^k),\\[2pt]
\quad\vdots\\[2pt]
y_M^{k+1}=\operatorname*{arg\,min}_{y_M\in Y_M} L_\rho(x^{k+1},y_1^{k+1},\dots,y_{M-1}^{k+1},y_M,\lambda^k),\\[2pt]
\lambda^{k+1}=\lambda^k-\rho(Ax^{k+1}+By^{k+1}-b),
\end{array}\right.\tag{6}$$

where the augmented Lagrangian function is defined as:

$$L_\rho(x,y,\lambda)=f(x)+\sum_{i=1}^N u_i(x_i)+g(y)+\sum_{j=1}^M v_j(y_j)-\lambda^\top(Ax+By-b)+\frac{\rho}{2}\|Ax+By-b\|^2.\tag{7}$$

When there are only two blocks, i.e., $N=M=1$, the update scheme in (6) reduces to the classic 2-block ADMM [18, 14]. The convergence properties of the ADMM for solving 2-block separable convex problems have been studied extensively. Since the 2-block ADMM can be viewed as a manifestation of some kind of operator splitting, its convergence follows from that of the so-called Douglas-Rachford operator splitting method; see [17, 13]. Moreover, the convergence rate of the 2-block ADMM has been established recently by many authors; see e.g. [23, 34, 12, 31, 25, 1].
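As a concrete instance of the 2-block scheme (6), the following self-contained sketch (Python/NumPy; the toy problem and the choice $\rho=1$ are our own, not from the paper) runs ADMM on $\min\ \frac12\|x-a\|^2+\frac12\|y-c\|^2$ s.t. $x+y=b$, where both block subproblems have closed-form solutions:

```python
import numpy as np

# Toy 2-block problem: min 0.5||x-a||^2 + 0.5||y-c||^2  s.t.  x + y = b
n = 5
rng = np.random.default_rng(1)
a, c, b = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)

rho = 1.0
x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
for _ in range(200):
    # x-update: argmin_x 0.5||x-a||^2 - lam^T x + (rho/2)||x + y - b||^2
    x = (a + lam + rho * (b - y)) / (1 + rho)
    # y-update: argmin_y 0.5||y-c||^2 - lam^T y + (rho/2)||x + y - b||^2
    y = (c + lam + rho * (b - x)) / (1 + rho)
    # dual ascent on the multiplier
    lam = lam - rho * (x + y - b)

# Closed-form solution of this equality-constrained problem
x_star, y_star = (a + b - c) / 2, (c + b - a) / 2
assert np.allclose(x, x_star, atol=1e-6) and np.allclose(y, y_star, atol=1e-6)
```

For this strongly convex quadratic the iteration error contracts geometrically, so the iterates match the closed-form solution to high accuracy.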

In [24], the authors propose a block successive upper bound minimization method of multipliers (BSUMM) to solve problem (1) without the $y$-variable. Essentially, at every iteration, the BSUMM replaces the nonseparable part $f$ by an upper-bound function and works on that modified function in an ADMM manner. Under some error bound conditions and a diminishing dual stepsize assumption, the authors are able to show that the iterates produced by the BSUMM algorithm converge to the set of primal-dual optimal solutions. Along a similar direction, Cui et al. [8] introduce a quadratic upper-bound function for the nonseparable function to solve 2-block problems; they show that their algorithm has an $O(1/t)$ convergence rate, where $t$ is the number of total iterations. Very recently, [16] has proposed a set of variants of the ADMM by adding some proximal terms into the algorithm; the authors have managed to prove an $O(1/t)$ convergence rate for the 2-block case, and the same result applies to the general multi-block case under some strong convexity assumptions. Moreover, [5] shows the convergence of the ADMM for 2-block problems by imposing a quadratic structure on the coupled function, and also the convergence of RP-ADMM for the multi-block case where all separable functions vanish.

#### Primal-dual method for bilinear saddle-point problems

Recently, the work [9] generalizes the first-order primal-dual method in [3] to a randomized method for solving a class of saddle-point problems in the following form:

$$\min_{z\in Z}\Big\{h(z)+\max_{x\in X}\ \Big\langle z,\sum_{i=1}^N A_i x_i\Big\rangle-\sum_{i=1}^N u_i(x_i)\Big\}, \tag{8}$$

where $x=(x_1,\dots,x_N)$ and $X=X_1\times\cdots\times X_N$. Let $h(z)=-b^\top z$ and let $Z$ be the whole space. Then it is easy to see that (8) is a saddle-point reformulation of the multi-block structured optimization problem

$$\min_{x\in X}\ \sum_{i=1}^N u_i(x_i) \quad \text{s.t.}\ \sum_{i=1}^N A_i x_i=b,$$

which is a special case of (1) without the $y$-variable or the coupled function $f$.

At each iteration, the algorithm in [9] chooses one block of the $x$-variable uniformly at random and performs a proximal update to it, followed by another proximal update to the $z$-variable. More precisely, it iteratively performs the updates:

$$\begin{aligned}
x_i^{k+1}&=\begin{cases}\operatorname*{arg\,min}_{x_i\in X_i}\ \langle-\bar z^k,A_i x_i\rangle+u_i(x_i)+\frac{\tau}{2}\|x_i-x_i^k\|_2^2, & \text{if } i=i_k,\\[2pt] x_i^k, & \text{if } i\ne i_k,\end{cases} &\text{(9a)}\\
z^{k+1}&=\operatorname*{arg\,min}_{z\in Z}\ h(z)+\langle z,Ax^{k+1}\rangle+\frac{\eta}{2}\|z-z^k\|_2^2, &\text{(9b)}\\
\bar z^{k+1}&=q(z^{k+1}-z^k)+z^{k+1}, &\text{(9c)}
\end{aligned}$$

where $i_k$ is a randomly selected block, and $\tau$, $\eta$ and $q$ are certain parameters.¹ When there is only one block of the $x$-variable, i.e., $N=1$, the scheme in (9) becomes exactly the primal-dual method in [3]. Assuming the boundedness of the constraint sets $X$ and $Z$, [9] shows that under weak convexity an $O(1/t)$ convergence rate of the scheme can be established by choosing appropriate parameters, and if the $u_i$'s are all strongly convex, the scheme can be accelerated to an $O(1/t^2)$ convergence rate by adapting the parameters.

¹Actually, [9] presents its algorithm in a more general way with the parameters adaptive to the iteration. However, its convergence result assumes constant values of these parameters for the weak convexity case.
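The scheme (9) can be sketched on a toy separable quadratic with $h(z)=-b^\top z$, so that (8) encodes the linear constraint. The parameter choices below ($\tau$, $\eta$, $q$) are purely illustrative and are not the ones prescribed in [9]; the final assertion only checks that the constraint residual has decreased from its starting value:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy instance of  min_x sum_i 0.5||x_i - c_i||^2  s.t.  sum_i A_i x_i = b
N, blk, p = 3, 2, 4                      # 3 blocks of size 2, 4 constraints
A = rng.standard_normal((p, N * blk))
c = rng.standard_normal(N * blk)
b = A @ rng.standard_normal(N * blk)     # guarantees feasibility
A_blk = [A[:, i * blk:(i + 1) * blk] for i in range(N)]

# Illustrative, conservative parameters (an assumption, not the paper's choice)
tau = eta = 5.0 * (np.linalg.norm(A, 2) + 1.0)
q = N                                    # extrapolation weight, set to N by analogy

x = np.zeros(N * blk)
z = zbar = np.zeros(p)
for _ in range(5000):
    i = rng.integers(N)                  # uniformly random block, as in (9a)
    sl = slice(i * blk, (i + 1) * blk)
    # proximal update of the selected block only (closed form for quadratic u_i)
    x[sl] = (c[sl] + tau * x[sl] + A_blk[i].T @ zbar) / (1 + tau)
    # z-update (9b) with h(z) = -b^T z, solved in closed form
    z_new = z + (b - A @ x) / eta
    # extrapolation step (9c)
    zbar = q * (z_new - z) + z_new
    z = z_new

assert np.linalg.norm(A @ x - b) < np.linalg.norm(b)   # residual decreased
```
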

### 1.3 Contributions and organization of this paper

• We propose a randomized primal-dual coordinate update algorithm to solve problems in the form of (1). The key feature is to introduce randomization, as done in (9), into the multi-block ADMM framework (6). Unlike the random permutation scheme previously investigated in [42, 5], we simply choose a subset of blocks of variables uniformly at random. In addition, we perform a proximal update to the selected subset of variables. With appropriate proximal terms (e.g., the setting in (22)), the selected block variables can be decoupled, and thus the updates can be done in parallel.

• More generally than (6), our algorithm can accommodate coupled terms in the objective function by linearizing them. By imposing a Lipschitz continuity condition on the partial gradients of the coupled functions $f$ and $g$ and using proximal terms, we show that our method has an expected $O(1/t)$ convergence rate for solving problem (1) under a mere convexity assumption.

• We show that our algorithm includes several existing methods as special cases, such as the scheme in (9) and the proximal Jacobian ADMM in [11]. Our result indicates that the convergence rate of the scheme in (9) can be shown without assuming boundedness of the constraint sets. In addition, the same order of convergence rate can be established for the proximal Jacobian ADMM in terms of a better measure.

• Furthermore, the linearization scheme allows us to deal with a stochastic objective function, for instance, when the function $f$ is given in the form of an expectation $f(x)=\mathbb{E}_\xi f(x;\xi)$, where $\xi$ is a random vector. As long as an unbiased estimator of the (sub-)gradient of $f$ is available, we can extend our method to the stochastic problem, and an expected $O(1/\sqrt{t})$ convergence rate is achievable.

The rest of the paper is organized as follows. In Section 2, we introduce our algorithm and present some preliminary results. In Section 3, we present the sublinear convergence rate results of the proposed algorithm. Depending on the multi-block structure of $y$, different conditions and parameter settings are presented in Subsections 3.1, 3.2 and 3.3, respectively. In Section 4, we present an extension of our algorithm to the case where the objective function is not even exactly computable, and only a first-order stochastic approximation is available; the convergence analysis is extended to this setting accordingly. Numerical results are shown in Section 5. In Section 6, we discuss the connections of our algorithm to other well-known methods in the literature. Finally, we conclude the paper in Section 7. The proofs of the technical lemmas are presented in Appendix A, and the proofs of the main theorems are in Appendix B.

## 2 Randomized Primal-Dual Block Coordinate Update Algorithm

In this section, we first present some notations and then introduce our algorithm as well as some preliminary lemmas.

### 2.1 Notations

We denote $x=(x_1,\dots,x_N)$ and $y=(y_1,\dots,y_M)$. For any symmetric positive semidefinite matrix $W$, we define $\|z\|_W=\sqrt{z^\top W z}$. Given an integer $\ell$, $[\ell]$ denotes the set $\{1,2,\dots,\ell\}$. We use $I$ and $J$ as index sets, while $I$ is also used to denote the identity matrix; we believe the intention is evident from the context. Given $I=\{i_1,\dots,i_n\}\subseteq[N]$, we denote:

• Block-indexed variable: $x_I=(x_{i_1},\dots,x_{i_n})$;

• Block-indexed set: $X_I=X_{i_1}\times\cdots\times X_{i_n}$;

• Block-indexed function: $u_I(x_I)=\sum_{i\in I}u_i(x_i)$;

• Block-indexed matrix: $A_I=(A_{i_1},\dots,A_{i_n})$.

### 2.2 Algorithm

Our algorithm is rather general. Its major ingredients are randomization in selecting block variables, linearization of the coupled functions $f$ and $g$, and the addition of proximal terms. Specifically, at each iteration $k$, it first randomly samples a subset $I_k$ of blocks of $x$ and then a subset $J_k$ of blocks of $y$, according to the uniform distribution over the indices. The randomized sampling rule is as follows:

Randomization Rule (U): For given integers $n$ and $m$, it chooses index sets $I_k\subseteq[N]$ with $|I_k|=n$ and $J_k\subseteq[M]$ with $|J_k|=m$ uniformly at random; i.e., for any subsets $\{i_1,\dots,i_n\}\subseteq[N]$ and $\{j_1,\dots,j_m\}\subseteq[M]$, the following holds:

$$\operatorname{Prob}\big[I_k=\{i_1,i_2,\dots,i_n\}\big]=1\Big/\binom{N}{n},\qquad
\operatorname{Prob}\big[J_k=\{j_1,j_2,\dots,j_m\}\big]=1\Big/\binom{M}{m}. \tag{10}$$
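Rule (U) is simply uniform sampling without replacement. A minimal sketch (Python standard library; block counts are illustrative) draws index sets this way and empirically confirms that a fixed subset occurs with frequency close to $1/\binom{N}{n}$:

```python
import math
import random

N, n = 6, 2     # select n of the N x-blocks per iteration (illustrative sizes)

def sample_blocks(num_blocks, size, rng):
    """Draw a uniformly random index set, as in rule (U)."""
    return frozenset(rng.sample(range(num_blocks), size))

# Empirically, every particular subset occurs with probability 1/binom(N, n)
rng = random.Random(0)
trials = 100_000
hits = sum(sample_blocks(N, n, rng) == frozenset({0, 1}) for _ in range(trials))
assert abs(hits / trials - 1 / math.comb(N, n)) < 0.01
```
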

After those subsets have been selected, it performs a prox-linear update to those selected blocks based on the augmented Lagrangian function, followed by an update of the Lagrangian multiplier. The details of the method are summarized in Algorithm 1 below.

In Algorithm 1, $P_k$ and $Q_k$ are predetermined positive semidefinite matrices of appropriate dimensions. For the selected blocks in $I_k$ and $J_k$, instead of exactly minimizing the augmented Lagrangian function, we perform a block proximal gradient update. In particular, before minimization, we first linearize the coupled functions $f$ and $g$, and add some proximal terms. Note that one can always select all blocks, i.e., $I_k=[N]$ and $J_k=[M]$. Empirically, however, the block coordinate update method usually outperforms the full coordinate update method if the problem possesses certain structures; see [37] for an example. In addition, by choosing appropriate $P_k$ and $Q_k$, the problems (11) and (13) can be made separable with respect to the selected blocks, and thus one can update the variables in parallel.

### 2.3 Preliminaries

Let $w$ be the aggregated primal-dual variable and $H$ the primal-dual linear mapping; namely,

$$w=\begin{pmatrix}x\\ y\\ \lambda\end{pmatrix},\qquad H(w)=\begin{pmatrix}-A^\top\lambda\\ -B^\top\lambda\\ Ax+By-b\end{pmatrix}, \tag{16}$$

and also let

$$\begin{aligned}
&u(x)=\sum_{i=1}^N u_i(x_i),\qquad v(y)=\sum_{j=1}^M v_j(y_j),\\
&F(x)=f(x)+u(x),\qquad G(y)=g(y)+v(y),\qquad \Phi(x,y)=F(x)+G(y).
\end{aligned}$$

The point $(x^*,y^*)$ is a solution to (1) if and only if there exists $\lambda^*$ such that

$$\begin{aligned}
&\Phi(x,y)-\Phi(x^*,y^*)+(w-w^*)^\top H(w^*)\ge 0,\quad\forall (x,y)\in X\times Y,\ \forall\lambda, &\text{(17a)}\\
&Ax^*+By^*=b, &\text{(17b)}\\
&x^*\in X,\ y^*\in Y. &\text{(17c)}
\end{aligned}$$

The following lemmas will be used in our subsequent analysis, whose proofs are elementary and thus are omitted here.

###### Lemma 2.1

For any two vectors $w$ and $\tilde w$, it holds that

$$(w-\tilde w)^\top H(w)=(w-\tilde w)^\top H(\tilde w). \tag{18}$$
###### Lemma 2.2

For any two vectors $u$, $v$ and any positive semidefinite matrix $W$:

$$u^\top W v=\tfrac12\big(\|u\|_W^2+\|v\|_W^2-\|u-v\|_W^2\big). \tag{19}$$
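Since $W\succeq 0$ defines the seminorm $\|z\|_W=\sqrt{z^\top W z}$, (19) is just the polarization identity; a quick numerical check (Python/NumPy, random data):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
W = M.T @ M                      # a positive semidefinite matrix
u, v = rng.standard_normal(5), rng.standard_normal(5)

norm2 = lambda z: z @ W @ z      # squared W-norm ||z||_W^2
assert np.isclose(u @ W @ v, 0.5 * (norm2(u) + norm2(v) - norm2(u - v)))
```
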
###### Lemma 2.3

For any nonzero positive semidefinite matrix $W$, it holds for any $z$ and $\hat z$ of appropriate size that

$$\|z-\hat z\|^2\ \ge\ \frac{1}{\|W\|_2}\,\|z-\hat z\|_W^2, \tag{20}$$

where $\|W\|_2$ denotes the matrix operator norm of $W$.

The following lemma presents a useful property of $H$, which essentially follows from (18).

###### Lemma 2.4

For any vectors $w^0,\dots,w^t$, $w$ and any sequence of positive numbers $\beta_0,\dots,\beta_t$, it holds that

$$\left(\frac{\sum_{k=0}^t\beta_k w^k}{\sum_{k=0}^t\beta_k}-w\right)^{\!\top} H\!\left(\frac{\sum_{k=0}^t\beta_k w^k}{\sum_{k=0}^t\beta_k}\right)=\frac{1}{\sum_{k=0}^t\beta_k}\sum_{k=0}^t\beta_k\,(w^k-w)^\top H(w^k). \tag{21}$$
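Lemmas 2.1 and 2.4 both rest on the observation that the linear part of $H$ is skew-symmetric, so $(w-\tilde w)^\top H(w)-(w-\tilde w)^\top H(\tilde w)$ vanishes identically. A numerical sanity check of (18) and (21) (Python/NumPy, random data):

```python
import numpy as np

rng = np.random.default_rng(3)
p, nx, ny = 3, 4, 2
A = rng.standard_normal((p, nx))
B = rng.standard_normal((p, ny))
b = rng.standard_normal(p)

def H(w):
    """The primal-dual mapping (16); its linear part is skew-symmetric."""
    x, y, lam = w[:nx], w[nx:nx + ny], w[nx + ny:]
    return np.concatenate([-A.T @ lam, -B.T @ lam, A @ x + B @ y - b])

dim = nx + ny + p
w, wt = rng.standard_normal(dim), rng.standard_normal(dim)
assert np.isclose((w - wt) @ H(w), (w - wt) @ H(wt))      # Lemma 2.1, eq. (18)

# Lemma 2.4, eq. (21): both sides agree for a random weighted average
ws = rng.standard_normal((4, dim))
beta = rng.random(4) + 0.1
wavg = (beta[:, None] * ws).sum(axis=0) / beta.sum()
lhs = (wavg - w) @ H(wavg)
rhs = sum(bk * (wk - w) @ H(wk) for bk, wk in zip(beta, ws)) / beta.sum()
assert np.isclose(lhs, rhs)
```
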

## 3 Convergence Rate Results

In this section, we establish sublinear convergence rate results of Algorithm 1 for three different cases. We differentiate the cases based on whether or not the $y$-variable in problem (1) also has a multi-block structure. In the first case, where $y$ is a multi-block variable, it requires $\frac{n}{N}=\frac{m}{M}$, where $n$ and $m$ are the cardinalities of the subsets of $x$-blocks and $y$-blocks selected by our algorithm, respectively. Since the analysis only requires weak convexity, we can ensure this condition by adding zero component functions if necessary, in such a way that $\frac{n}{N}=\frac{m}{M}$ holds for the chosen $n$ and $m$. The second case treats $y$ as a single-block variable, which is reflected in our algorithm by simply selecting all $y$-blocks every time, i.e., $m=M$. The third case assumes no $y$-variable at all. Although it is covered by the first and second cases, we discuss it separately since weaker conditions suffice to guarantee the same convergence rate.

Throughout our analysis, we choose the matrices $P_k$ and $Q_k$ in Algorithm 1 as follows:

$$P_k=\hat P_{I_k}-\rho_x A_{I_k}^\top A_{I_k},\qquad Q_k=\hat Q_{J_k}-\rho_y B_{J_k}^\top B_{J_k}, \tag{22}$$

where $\hat P$ and $\hat Q$ are given symmetric positive semidefinite block diagonal matrices, and $\hat P_{I_k}$ (resp. $\hat Q_{J_k}$) denotes the diagonal blocks of $\hat P$ (resp. $\hat Q$) indexed by $I_k$ (resp. $J_k$). Note that this choice of $P_k$ and $Q_k$ decouples the selected block variables $x_{I_k}$ in (11) and $y_{J_k}$ in (13), and thus both updates can be computed in parallel. In addition, we make the following assumptions:

###### Assumption 1 (Convexity)

For (1), the $X_i$'s and $Y_j$'s are closed convex sets, $f$ and $g$ are smooth convex functions, and the $u_i$'s and $v_j$'s are proper closed convex functions.

###### Assumption 2 (Existence of a solution)

There is at least one point $(x^*,y^*,\lambda^*)$ satisfying the conditions in (17).

###### Assumption 3 (Lipschitz continuous partial gradient)

There exist constants $L_f$ and $L_g$ such that for any subset $I$ of $[N]$ with $|I|=n$ and any subset $J$ of $[M]$ with $|J|=m$, it holds that

$$\|\nabla_I f(x+U_I\tilde x)-\nabla_I f(x)\|\le L_f\|\tilde x_I\|,\quad\forall x,\tilde x, \tag{23a}$$
$$\|\nabla_J g(y+U_J\tilde y)-\nabla_J g(y)\|\le L_g\|\tilde y_J\|,\quad\forall y,\tilde y, \tag{23b}$$

where $U_I\tilde x$ keeps the blocks of $\tilde x$ indexed by $I$ and is zero elsewhere.

Before presenting the main convergence rate result, we first establish a few key lemmas.

###### Lemma 3.1 (One-step analysis)

Let $\{(x^k,y^k,\lambda^k)\}$ be the sequence generated from Algorithm 1 with matrices $P_k$ and $Q_k$ defined as in (22). Then the following inequalities hold:

$$\begin{aligned}
&\mathbb{E}_{I_k}\Big[F(x^{k+1})-F(x)+(x^{k+1}-x)^\top(-A^\top\lambda^{k+1})+(\rho_x-\rho)(x^{k+1}-x)^\top A^\top r^{k+1}\\
&\qquad-\rho_x(x^{k+1}-x)^\top A^\top B(y^{k+1}-y^k)\Big]
+\mathbb{E}_{I_k}(x^{k+1}-x)^\top(\hat P-\rho_x A^\top A)(x^{k+1}-x^k)-\frac{L_f}{2}\mathbb{E}_{I_k}\|x^k-x^{k+1}\|^2\\
&\le\Big(1-\frac{n}{N}\Big)\Big[F(x^k)-F(x)+(x^k-x)^\top(-A^\top\lambda^k)+\rho_x(x^k-x)^\top A^\top r^k\Big],
\end{aligned} \tag{26}$$

and

$$\begin{aligned}
&\mathbb{E}_{J_k}\Big[G(y^{k+1})-G(y)+(y^{k+1}-y)^\top(-B^\top\lambda^{k+1})+(\rho_y-\rho)(y^{k+1}-y)^\top B^\top r^{k+1}\Big]\\
&\qquad+\mathbb{E}_{J_k}(y^{k+1}-y)^\top(\hat Q-\rho_y B^\top B)(y^{k+1}-y^k)-\frac{L_g}{2}\mathbb{E}_{J_k}\|y^k-y^{k+1}\|^2\\
&\qquad-\Big(1-\frac{m}{M}\Big)\rho_y(y^k-y)^\top B^\top A(x^{k+1}-x^k)\\
&\le\Big(1-\frac{m}{M}\Big)\Big[G(y^k)-G(y)+(y^k-y)^\top(-B^\top\lambda^k)+\rho_y(y^k-y)^\top B^\top r^k\Big],
\end{aligned} \tag{30}$$

where $\mathbb{E}_{I_k}$ (resp. $\mathbb{E}_{J_k}$) denotes the expectation over $I_k$ (resp. $J_k$), conditional on all previous history.

Note that for any feasible point $(x,y)$ (namely, $Ax+By=b$ with $x\in X$ and $y\in Y$),

$$Ax^{k+1}-Ax=\frac1\rho(\lambda^k-\lambda^{k+1})-(By^{k+1}-b)+(By-b)=\frac1\rho(\lambda^k-\lambda^{k+1})-B(y^{k+1}-y) \tag{32}$$

and

$$By^k-By=\frac1\rho(\lambda^{k-1}-\lambda^k)-(Ax^k-b)+(Ax-b)=\frac1\rho(\lambda^{k-1}-\lambda^k)-A(x^k-x). \tag{34}$$

Then, using (19) we have the following result.

###### Lemma 3.2

For any feasible point $(x,y)$ and integer $t\ge 0$, it holds that

$$\sum_{k=0}^t(x^{k+1}-x)^\top A^\top B(y^{k+1}-y^k)=\frac1\rho\sum_{k=0}^t(\lambda^k-\lambda^{k+1})^\top B(y^{k+1}-y^k)-\frac12\Big(\|y^{t+1}-y\|_{B^\top B}^2-\|y^0-y\|_{B^\top B}^2+\sum_{k=0}^t\|y^{k+1}-y^k\|_{B^\top B}^2\Big)$$

and

$$\sum_{k=0}^t(y^k-y)^\top B^\top A(x^{k+1}-x^k)=\frac1\rho\sum_{k=0}^t(\lambda^{k-1}-\lambda^k)^\top A(x^{k+1}-x^k)+\frac12\Big(\|x^0-x\|_{A^\top A}^2-\|x^{t+1}-x\|_{A^\top A}^2+\sum_{k=0}^t\|x^{k+1}-x^k\|_{A^\top A}^2\Big)$$
###### Lemma 3.3

Given a continuous function $h$ of $w$ and a random vector $\hat w=(\hat x,\hat y,\hat\lambda)$, if for any feasible point $(x,y)$ and any $\lambda$, where $w=(x,y,\lambda)$ may depend on $\hat w$, we have

$$\mathbb{E}\big[\Phi(\hat x,\hat y)-\Phi(x,y)+(\hat w-w)^\top H(w)\big]\le \mathbb{E}[h(w)], \tag{38}$$

then for any $\gamma>0$ and any optimal solution $(x^*,y^*)$ to (1), we also have

$$\mathbb{E}\big[\Phi(\hat x,\hat y)-\Phi(x^*,y^*)+\gamma\|A\hat x+B\hat y-b\|\big]\le \sup_{\|\lambda\|\le\gamma} h(x^*,y^*,\lambda).$$

Noting

$$\Phi(x,y)-\Phi(x^*,y^*)+(w-w^*)^\top H(w^*)=\Phi(x,y)-\Phi(x^*,y^*)-(\lambda^*)^\top(Ax+By-b),$$

we can easily show the following lemma by the optimality of $(x^*,y^*)$ and the Cauchy-Schwarz inequality.

###### Lemma 3.4

Assume $(x^*,y^*,\lambda^*)$ satisfies (17). Then for any point $(\hat x,\hat y)\in X\times Y$, we have

$$\Phi(\hat x,\hat y)-\Phi(x^*,y^*)\ge-\|\lambda^*\|\cdot\|A\hat x+B\hat y-b\|. \tag{39}$$

The following lemma shows a connection between different convergence measures, and it can be simply proved by using (39). If both $\hat x$ and $\hat y$ are deterministic, it reduces to Lemma 2.4 in [15].

###### Lemma 3.5

Suppose that

$$\mathbb{E}\big[\Phi(\hat x,\hat y)-\Phi(x^*,y^*)+\gamma\|A\hat x+B\hat y-b\|\big]\le\epsilon.$$

Then, we have

$$\mathbb{E}\|A\hat x+B\hat y-b\|\le\frac{\epsilon}{\gamma-\|\lambda^*\|},\qquad
-\frac{\|\lambda^*\|\,\epsilon}{\gamma-\|\lambda^*\|}\le\mathbb{E}\big[\Phi(\hat x,\hat y)-\Phi(x^*,y^*)\big]\le\epsilon,$$

where $(x^*,y^*,\lambda^*)$ satisfies the optimality conditions in (17), and we assume $\gamma>\|\lambda^*\|$.
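The premise combines with (39) in two short steps; writing $\hat r=A\hat x+B\hat y-b$ and assuming $\gamma>\|\lambda^*\|$, a sketch of the argument:

```latex
-\|\lambda^*\|\,\mathbb{E}\|\hat r\|+\gamma\,\mathbb{E}\|\hat r\|
\;\overset{(39)}{\le}\;
\mathbb{E}\big[\Phi(\hat x,\hat y)-\Phi(x^*,y^*)\big]+\gamma\,\mathbb{E}\|\hat r\|
\;\le\;\epsilon
\;\Longrightarrow\;
\mathbb{E}\|\hat r\|\le\frac{\epsilon}{\gamma-\|\lambda^*\|},
```

```latex
\mathbb{E}\big[\Phi(\hat x,\hat y)-\Phi(x^*,y^*)\big]\le\epsilon-\gamma\,\mathbb{E}\|\hat r\|\le\epsilon,
\qquad
\mathbb{E}\big[\Phi(\hat x,\hat y)-\Phi(x^*,y^*)\big]\overset{(39)}{\ge}-\frac{\|\lambda^*\|\,\epsilon}{\gamma-\|\lambda^*\|}.
```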

The convergence analysis of Algorithm 1 requires slightly different parameter settings under different structures; in fact, the underlying analysis and results also differ. To account for the differences, we present the corresponding convergence results in the next three subsections. The first one assumes there is no $y$-part at all; the second assumes a single block on the $y$-side; the last deals with the general case, where the ratio $\frac{n}{N}$ is assumed to be equal to $\frac{m}{M}$.

### 3.1 Multiple x blocks and no y variable

We first consider a special case with no $y$-variable, namely, the $y$-blocks and the functions $g$ and $v_j$ are absent from (1). This case has its own importance: it is a parallel block coordinate update version of the linearized augmented Lagrangian method (ALM).

###### Theorem 3.6 (Sublinear ergodic convergence I)

Assume there is no $y$-variable in (1). Let $\{(x^k,\lambda^k)\}$ be the sequence generated from Algorithm 1. Assume that

$$P_k\succeq L_f I,\quad\forall k. \tag{40}$$

Let

$$\hat x^t=\frac{x^{t+1}+\theta\sum_{k=1}^t x^k}{1+\theta t}. \tag{41}$$

Then, under Assumptions 1, 2 and 3, we have

$$\max\Big\{\big|\mathbb{E}[F(\hat x^t)-F(x^*)]\big|,\ \mathbb{E}\|A\hat x^t-b\|\Big\}\le\frac{1}{1+\theta t}\Big[(1-\theta)\big(F(x^0)-F(x^*)+\rho_x\|r^0\|^2\big)+\frac12\|x^0-x^*\|_{\tilde P}^2+\frac{\max\{(1+\|\lambda^*\|)^2,\,4\|\lambda^*\|^2\}}{2\rho_x}\Big],$$

where $\theta=\frac{n}{N}$ and $(x^*,\lambda^*)$ is an arbitrary primal-dual solution.

If there is no coupled function $f$, (40) indicates that we can even choose $P_k=0$ for all $k$, i.e., no proximal terms at all. Some caution is required here: in that case the algorithm is not the Jacobian ADMM as discussed in [20], since the selected block variables are still coupled in the augmented Lagrangian function. To make the updates parallelizable, a proximal term is needed. Then our result recovers the convergence of the proximal Jacobian ADMM introduced in [11]. In fact, the above theorem strengthens the convergence result in [11] by establishing an $O(1/t)$ rate of convergence in terms of the feasibility measure and the objective value.
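The Jacobian-type proximal update just discussed can be sketched as follows: all blocks are renewed simultaneously from the iterate $x^k$ via a prox-linear step, with a scalar proximal weight chosen conservatively. The toy problem, data, and parameters below are our own illustrative choices, not those of [11]:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy problem: min sum_i 0.5||x_i - c_i||^2  s.t.  sum_i A_i x_i = b
N, blk, p = 3, 2, 4
A = rng.standard_normal((p, N * blk))
c = rng.standard_normal(N * blk)
b = A @ rng.standard_normal(N * blk)     # guarantees a feasible point exists

rho = 1.0
p_hat = 1.0 + rho * np.linalg.norm(A, 2) ** 2   # conservative proximal weight
x, lam = np.zeros(N * blk), np.zeros(p)
for _ in range(5000):
    r = A @ x - b                                # current residual
    x_new = np.empty_like(x)
    for i in range(N):                           # all blocks; parallel in principle
        sl = slice(i * blk, (i + 1) * blk)
        Ai = A[:, sl]
        # prox-linear step: linearized augmented term plus proximal term
        x_new[sl] = (c[sl] + Ai.T @ lam - rho * Ai.T @ r + p_hat * x[sl]) / (1 + p_hat)
    x = x_new
    lam = lam - rho * (A @ x - b)

assert np.linalg.norm(A @ x - b) < 1e-3 * np.linalg.norm(b)
```

The proximal weight mimics the positive semidefiniteness requirement on $P_k$ in (22): with $\hat p\ge\rho\|A\|_2^2$, each block step is a damped gradient step on the augmented Lagrangian, which keeps the parallel updates stable.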

### 3.2 Multiple x blocks and a single y block

When the $y$-variable is simple to update, it can be beneficial to renew the whole of it at every iteration, as in problem (3). In this subsection, we consider the case of multiple $x$-blocks but a single $y$-block (or equivalently, $m=M$), and we establish a sublinear convergence rate result with a different technique for dealing with the $y$-variable.

###### Theorem 3.7 (Sublinear ergodic convergence II)

Let $\{(x^k,y^k,\lambda^k)\}$ be the sequence generated from Algorithm 1 with $m=M=1$. Assume

$$\hat P\succeq L_f I+\rho_x A^\top A,\qquad \hat Q\succeq \frac{L_g}{\theta}I+\Big(\frac{\rho\theta}{4}-\frac{\rho\theta}{2}+\rho_y\Big)B^\top B. \tag{43}$$

Let

$$\hat x^t=\frac{x^{t+1}+\theta\sum_{k=1}^t x^k}{1+\theta t},\qquad \hat y^t=\frac{\tilde y^{t+1}+\theta\sum_{k=1}^t y^k}{1+\theta t}, \tag{44}$$

where

$$\tilde y^{t+1}=\operatorname*{arg\,min}_{y\in Y}\ \langle\nabla g(y^t)-B^\top\lambda^t,\,y\rangle+v(y)+\frac{\rho_x}{2}\|Ax^{t+1}+By-b\|^2+\frac{\theta}{2}\|y-y^t\|_{\hat Q-\rho_y B^\top B}^2. \tag{45}$$

Then, under Assumptions 1, 2 and 3, we have

$$\begin{aligned}
\max\Big\{\big|\mathbb{E}[\Phi(\hat x^t,\hat y^t)-\Phi(x^*,y^*)]\big|,\ \mathbb{E}\|A\hat x^t+B\hat y^t-b\|\Big\}
\le\frac{1}{1+\theta t}\Big[&(1-\theta)\Big(\Phi(x^0,y^0)-\Phi(x^*,y^*)+\frac{\rho_x}{2}\|r^0\|^2\Big)+\frac12\|x^0-x^*\|_{\hat P}^2\\
&+\frac12\|y^0-y^*\|_{\theta\tilde Q+\rho_x B^\top B}^2+\frac{\max\{(1+\|\lambda^*\|)^2,\,4\|\lambda^*\|^2\}}{2\rho_x}\Big],
\end{aligned}$$

where $\theta=\frac{n}{N}$ and $(x^*,y^*,\lambda^*)$ is an arbitrary primal-dual solution.

###### Remark 3.1

It is easy to see that if $\theta=1$ (i.e., $n=N$), the result in Theorem 3.7 becomes exactly the same as that in Theorem 3.8 below. In general, the two results differ because the conditions in (43) on $\hat P$ and $\hat Q$ are different from those in (48a).

### 3.3 Multiple x and y blocks

In this subsection, we consider the most general case, where both $x$ and $y$ have a multi-block structure. Assuming