Randomized Primal-Dual Proximal Block Coordinate Updates
Abstract
In this paper we propose a randomized primal-dual proximal block coordinate updating framework for a general multi-block convex optimization model with a coupled objective function and linear constraints. Assuming mere convexity, we establish its convergence rate in terms of the objective value and a feasibility measure. The framework includes several existing algorithms as special cases, such as a primal-dual method for bilinear saddle-point problems (PDS), the proximal Jacobian ADMM (Prox-JADMM), and a randomized variant of the ADMM for multi-block convex optimization. Our analysis recovers and/or strengthens the convergence properties of several existing algorithms. For example, for PDS our result leads to the same order of convergence rate without the previously assumed boundedness condition on the constraint sets, and for Prox-JADMM the new result provides a convergence rate in terms of the objective value and the feasibility violation. It is well known that the original ADMM may fail to converge when the number of blocks exceeds two. Our result shows that if an appropriate randomization procedure is invoked to select the updating blocks, then a sublinear rate of convergence in expectation can be guaranteed for the multi-block ADMM, without assuming any strong convexity. The new approach is also extended to solve problems where only a stochastic approximation of the (sub)gradient of the objective is available, and we establish a convergence rate of the extended approach for solving stochastic programming problems.
Keywords: primal-dual method, alternating direction method of multipliers (ADMM), randomized algorithm, iteration complexity, first-order stochastic approximation.
Mathematics Subject Classification: 90C25, 90C06, 68W20.
1 Introduction
In this paper, we consider the following multi-block structured convex optimization model
(1)  
s.t.  
where the variables and are naturally partitioned into and blocks respectively, and are block matrices, ’s and ’s are some closed convex sets, and are smooth convex functions, and ’s and ’s are proper closed convex (possibly nonsmooth) functions.
1.1 Motivating examples
Optimization problems in the form of (1) arise in many emerging applications from various fields. For example, the constrained lasso (classo) problem, first studied by James et al. [26] as a generalization of the lasso problem, can be formulated as
(2) 
where , are the observed data, and , are the predefined data matrix and vector. Many widely used statistical models can be viewed as special cases of (2), including monotone curve estimation, the fused lasso, the generalized lasso, and so on [26]. By partitioning the variable into blocks (and partitioning the other matrices and vectors in (2) correspondingly), and introducing a slack variable , the classo problem can be transformed to
(3) 
which is in the form of (1).
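The display equations (2) and (3) did not survive extraction. As a hedged sketch only, the constrained lasso of James et al. [26] and its slack-variable reformulation are commonly written as follows; the symbols here are our reconstruction and need not match the paper's exact notation:

```latex
% Constrained lasso (classo), cf. (2) -- a reconstruction:
\min_{\beta}\ \tfrac{1}{2}\|X\beta - y\|_2^2 + \mu\|\beta\|_1
\quad\text{s.t.}\quad A\beta \le b.
% Slack-variable reformulation in the form of (1), cf. (3):
\min_{\beta,\,s}\ \tfrac{1}{2}\|X\beta - y\|_2^2 + \mu\|\beta\|_1
\quad\text{s.t.}\quad A\beta + s = b,\quad s \ge 0.
```

The inequality constraint becomes a linear equality constraint coupling the blocks of $\beta$ with the slack block $s$, which is the structure (1) requires.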
Another interesting example is the extended linear-quadratic programming problem [40], which can be formulated as
(4) 
where and are symmetric positive semidefinite matrices, and is a polyhedral set. Clearly, (4) includes quadratic programming as a special case. In general, its objective is a piecewise linear-quadratic convex function. Let , where denotes the indicator function of . Then
where denotes the convex conjugate of . Replacing by and introducing a slack variable , we can equivalently write (4) in the form of (1):
(5) 
for which one can further partition the variable into a number of disjoint blocks.
Many other interesting applications in various areas can be formulated as optimization problems in the form of (1), including those arising from signal processing, image processing, machine learning and statistical learning; see [24, 8, 5, 16] and the references therein.
Finally, we mention that computing a point on the central path for a generic convex program in block variables :
boils down to
where and indicates the sum of the logarithm of all the components of . This model is again in the form of (1).
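The two displays above are missing from the extracted text. Based on the surrounding description (a log-barrier term summing the logarithms of the components of a slack variable), a plausible reconstruction, under our own notation, is:

```latex
% Generic block-structured convex program -- a reconstruction:
\min_{x}\ \sum_{i=1}^{N} f_i(x_i)
\quad\text{s.t.}\quad \sum_{i=1}^{N} A_i x_i \le b,\ \ x_i \in X_i.
% Central-path (log-barrier) subproblem with barrier parameter \mu > 0:
\min_{x,\,s}\ \sum_{i=1}^{N} f_i(x_i) - \mu\,\mathbf{1}^\top \ln(s)
\quad\text{s.t.}\quad \sum_{i=1}^{N} A_i x_i + s = b,
```

where $\ln(s)$ acts componentwise, so $\mathbf{1}^\top \ln(s)$ is the sum of the logarithms of the components of $s$. The barrier term is separable in $s$, so the model again fits the form (1).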
1.2 Related works in the literature
Our work relates to two recently popular topics: the Alternating Direction Method of Multipliers (ADMM) for multi-block structured problems and the first-order primal-dual method for bilinear saddle-point problems. Below we review the two methods and their convergence results. A more complete discussion of their connections to our method will be provided after we present our algorithm.
Multi-block ADMM and its variants
One well-known approach for solving a linearly constrained problem in the form of (1) is the augmented Lagrangian method, which iteratively updates the primal variables by minimizing the augmented Lagrangian function in (7), and then updates the multiplier through dual gradient ascent. However, since the linear constraint couples and all together, it can be very expensive to minimize the augmented Lagrangian function simultaneously with respect to all block variables. Utilizing the multi-block structure of the problem, the multi-block ADMM updates the block variables sequentially, one at a time with the others fixed at their most recent values, followed by an update of the multiplier. Specifically, it performs the following updates iteratively (assuming the absence of the coupled functions and ):
(6) 
where the augmented Lagrangian function is defined as:
(7) 
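Displays (6) and (7) did not survive extraction. For concreteness, a standard multi-block augmented Lagrangian and the Gauss-Seidel ADMM sweep it induces can be sketched as follows; this is our reconstruction under assumed notation ($N$ blocks, penalty $\beta$, dual stepsize $\rho$), consistent with the text but not necessarily identical to the paper's display:

```latex
% Augmented Lagrangian, cf. (7) -- a reconstruction:
\mathcal{L}_\beta(x_1,\dots,x_N,\lambda)
 = \sum_{i=1}^{N} f_i(x_i)
 + \lambda^\top\Big(\sum_{i=1}^{N} A_i x_i - b\Big)
 + \frac{\beta}{2}\Big\|\sum_{i=1}^{N} A_i x_i - b\Big\|^2.
% Multi-block ADMM sweep, cf. (6):
x_i^{k+1} = \operatorname*{arg\,min}_{x_i \in X_i}\
  \mathcal{L}_\beta\big(x_1^{k+1},\dots,x_{i-1}^{k+1},\,x_i,\,
  x_{i+1}^{k},\dots,x_N^{k},\,\lambda^k\big),
  \quad i = 1,\dots,N,
\qquad
\lambda^{k+1} = \lambda^k + \rho\Big(\sum_{i=1}^{N} A_i x_i^{k+1} - b\Big).
```

Each block minimization uses the most recent values of the preceding blocks, which is exactly the sequential (Gauss-Seidel) updating described above.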
When there are only two blocks, i.e., , the update scheme in (6) reduces to the classic 2-block ADMM [18, 14]. The convergence properties of the ADMM for solving 2-block separable convex problems have been studied extensively. Since the 2-block ADMM can be viewed as a manifestation of a certain operator splitting, its convergence follows from that of the so-called Douglas-Rachford operator splitting method; see [17, 13]. Moreover, the convergence rate of the 2-block ADMM has recently been established by many authors; see e.g. [23, 34, 12, 31, 25, 1].
Although the multi-block ADMM scheme in (6) performs very well for many instances encountered in practice (e.g. [36, 43]), it may fail to converge for some instances if there are more than two block variables, i.e., . In particular, an example was presented in [4] showing that the ADMM may even diverge with three blocks of variables when solving a linear system of equations. Thus, some additional assumptions or modifications must be in place to ensure convergence of the multi-block ADMM. In fact, by incorporating some extra correction steps or changing the Gauss-Seidel updating rule, [11, 22, 20, 21, 45] show that convergence can still be achieved for the multi-block ADMM. Moreover, if some part of the objective function is strongly convex or the objective has a certain regularity property, then convergence holds under various conditions; see [6, 31, 2, 30, 28, 19, 44]. Using other conditions, including an error bound condition together with small dual stepsizes, or by adding perturbations to the original problem, the authors of [25, 32] establish rate-of-convergence results even without strong convexity. Beyond linearly constrained problems, the multi-block ADMM has been extended in [7, 29, 41] to solve convex linear/quadratic conic programming problems. In a very recent work [42], Sun, Luo and Ye propose a randomly permuted ADMM (RP-ADMM), which chooses a random permutation of the block indices and performs the ADMM updates according to the order of indices in that permutation; they show that RP-ADMM converges in expectation when solving a nonsingular square linear system of equations.
In [24], the authors propose a block successive upper-bound minimization method of multipliers (BSUMM) to solve problem (1) without the variable. Essentially, at every iteration, BSUMM replaces the nonseparable part by an upper-bound function and works on that modified function in an ADMM manner. Under some error bound conditions and a diminishing dual stepsize assumption, the authors show that the iterates produced by BSUMM converge to the set of primal-dual optimal solutions. Along a similar direction, Cui et al. [8] introduce a quadratic upper-bound function for the nonseparable function to solve 2-block problems; they show that their algorithm has a convergence rate, where is the total number of iterations. Very recently, [16] has proposed a set of variants of the ADMM by adding proximal terms to the algorithm; the authors prove a convergence rate for the 2-block case, and the same results apply to the general multi-block case under some strong convexity assumptions. Moreover, [5] shows the convergence of the ADMM for 2-block problems by imposing a quadratic structure on the coupled function, as well as the convergence of RP-ADMM for the multi-block case where all separable functions vanish (i.e. ).
Primal-dual method for bilinear saddle-point problems
Recently, the work [9] generalizes the first-order primal-dual method in [3] to a randomized method for solving a class of saddle-point problems of the following form:
(8) 
where and . Let and . Then it is easy to see that (8) is a saddle-point reformulation of the multi-block structured optimization problem
which is a special case of (1) without variable or the coupled function .
At each iteration, the algorithm in [9] chooses one block of the variable uniformly at random and performs a proximal update to it, followed by another proximal update to the variable. More precisely, it iteratively performs the updates:
(9a)  
(9b)  
(9c) 
where is a randomly selected block, and and are certain parameters.¹ When there is only one block of the variable, i.e., , the scheme in (9) becomes exactly the primal-dual method in [3]. Assuming boundedness of the constraint sets and , [9] shows that under weak convexity a convergence rate of the scheme can be established by choosing appropriate parameters, and that if the ’s are all strongly convex, the scheme can be accelerated by adapting the parameters.

¹ Actually, [9] presents its algorithm in a more general way, with the parameters adaptive to the iteration; however, its convergence result assumes constant values of these parameters in the weak convexity case.
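Since displays (9a)-(9c) are missing from the extracted text, the following is a hedged sketch of a randomized primal-dual iteration of this type, in our own notation (separable functions $u_i$, dual function $h$, stepsizes $\eta,\tau$, extrapolation weight $\theta$); [9] may order or scale the steps differently:

```latex
% A reconstruction of a randomized primal-dual update, cf. (9):
x_i^{k+1} =
\begin{cases}
\operatorname*{arg\,min}_{x_i \in X_i}\ u_i(x_i)
  + \langle A_i x_i,\, y^k\rangle
  + \tfrac{1}{2\eta}\|x_i - x_i^k\|^2, & i = i_k,\\[2pt]
x_i^k, & i \neq i_k,
\end{cases}
\qquad
\bar{x}^{k+1} = x^{k+1} + \theta\,(x^{k+1} - x^k),
\qquad
y^{k+1} = \operatorname*{arg\,max}_{y}\ \langle A\bar{x}^{k+1},\, y\rangle
  - h(y) - \tfrac{1}{2\tau}\|y - y^k\|^2 .
```

Only the sampled primal block $i_k$ moves; the dual variable is then updated against an extrapolated primal point, as in the deterministic method of [3].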
1.3 Contributions and organization of this paper
- We propose a randomized primal-dual coordinate update algorithm to solve problems in the form of (1). The key feature is to introduce randomization, as done in (9), into the multi-block ADMM framework (6). Unlike the random permutation scheme previously investigated in [42, 5], we simply choose a subset of blocks of variables uniformly at random. In addition, we perform a proximal update to the selected subset of variables. With appropriate proximal terms (e.g., the setting in (22)), the selected block variables can be decoupled, and thus the updates can be done in parallel.
- More generally than (6), our algorithm can accommodate coupled terms in the objective function by linearizing such terms. By imposing a Lipschitz continuity condition on the partial gradients of the coupled functions and and using proximal terms, we show that our method has an expected convergence rate for solving problem (1) under the mere convexity assumption.
- We show that our algorithm includes several existing methods as special cases, such as the scheme in (9) and the proximal Jacobian ADMM in [11]. Our result indicates that the convergence rate of the scheme in (9) can be established without assuming boundedness of the constraint sets. In addition, the same order of convergence rate of the proximal Jacobian ADMM can be established in terms of a better measure.
- Furthermore, the linearization scheme allows us to deal with a stochastic objective function, for instance, when the function is given in the form of an expectation where is a random vector. As long as an unbiased estimator of the (sub)gradient of is available, we can extend our method to the stochastic problem, and an expected convergence rate is achievable.
The rest of the paper is organized as follows. In Section 2, we introduce our algorithm and present some preliminary results. In Section 3, we present the sublinear convergence rate results of the proposed algorithm. Depending on the multi-block structure of , different conditions and parameter settings are presented in Subsections 3.1, 3.2 and 3.3, respectively. In Section 4, we present an extension of our algorithm to the setting where the objective function is not even assumed to be exactly computable; instead, only a first-order stochastic approximation is available. The convergence analysis is extended to this setting accordingly. Numerical results are shown in Section 5. In Section 6, we discuss the connections of our algorithm to other well-known methods in the literature. Finally, we conclude the paper in Section 7. The proofs of the technical lemmas are presented in Appendix A, and the proofs of the main theorems in Appendix B.
2 Randomized Primal-Dual Block Coordinate Update Algorithm
In this section, we first present some notations and then introduce our algorithm as well as some preliminary lemmas.
2.1 Notations
We denote and . For any symmetric positive semidefinite matrix , we define . Given an integer , denotes the set . We use and as index sets, while is also used to denote the identity matrix; the intended meaning will be clear from the context. Given , we denote:
- Block-indexed variable: ;
- Block-indexed set: ;
- Block-indexed function: ;
- Block-indexed gradient: ;
- Block-indexed matrix: .
2.2 Algorithm
Our algorithm is rather general. Its major ingredients are randomization in selecting block variables, linearization of the coupled functions and , and adding proximal terms. Specifically, at each iteration , it first randomly samples a subset of blocks of , and then a subset of blocks of according to the uniform distribution over the indices. The randomized sampling rule is as follows:
Randomization Rule (U): For the given integers and , it randomly chooses index sets with and with uniformly; i.e., for any subsets and , the following holds
(10)
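Rule (U) simply draws the two index sets independently and uniformly over all subsets of the prescribed cardinalities, so that each particular pair occurs with probability given by the product of the two inverse binomial coefficients in (10). A minimal Python sketch (function and argument names are ours, not the paper's):

```python
import random
from math import comb

def sample_blocks(N, m, N_tilde, m_tilde, rng=random):
    """Randomization Rule (U): draw I subset of {1,...,N} with |I| = m and
    J subset of {1,...,N_tilde} with |J| = m_tilde, each uniformly over all
    subsets of the given cardinality, independently of each other."""
    I = sorted(rng.sample(range(1, N + 1), m))
    J = sorted(rng.sample(range(1, N_tilde + 1), m_tilde))
    # Any particular I has probability 1 / comb(N, m); likewise for J,
    # so the pair (I, J) has probability 1 / (comb(N, m) * comb(N_tilde, m_tilde)).
    return I, J
```

Because `random.sample` selects without replacement and every size-`m` subset is equally likely, this realizes exactly the uniform distribution required by (10).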
After those subsets have been selected, the algorithm performs a prox-linear update to the selected blocks based on the augmented Lagrangian function, followed by an update of the Lagrangian multiplier. The details of the method are summarized in Algorithm 1 below.
(11)  
(12) 
(13)  
(14) 
(15) 
In Algorithm 1, and are predetermined positive semidefinite matrices of appropriate dimensions. For the selected blocks in and , instead of exactly minimizing the augmented Lagrangian function, we perform a block proximal gradient update. In particular, before the minimization, we first linearize the coupled functions , , and add some proximal terms. Note that one can always select all blocks, i.e., and . Empirically, however, the block coordinate update method usually outperforms the full coordinate update method if the problem possesses certain structures; see [37] for an example. In addition, by choosing appropriate and , problems (11) and (13) become separable with respect to the selected blocks, and thus one can update the variables in parallel.
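The block prox-linear step can be illustrated with a small self-contained sketch: only the sampled blocks move, and each takes a proximal gradient step built from the gradient at the *old* iterate, which is why the selected blocks can be updated in parallel. Everything here is our own illustration (we use the l1-norm proximal operator as a stand-in for the nonsmooth block functions; the paper's update (11) additionally involves the augmented Lagrangian and the matrices and ):

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, componentwise.

    Used here purely as an example of a simple block prox."""
    return [max(abs(a) - t, 0.0) * (1 if a > 0 else -1) for a in v]

def block_prox_linear_update(x_blocks, selected, grad_blocks, eta):
    """One prox-linear sweep: only the blocks in `selected` move, each via
    a proximal gradient step with stepsize eta.  Every step uses only the
    old iterate, so the selected blocks are decoupled and could run in
    parallel."""
    new_blocks = [list(b) for b in x_blocks]  # unselected blocks carried over
    for i in selected:
        step = [xi - eta * gi for xi, gi in zip(x_blocks[i], grad_blocks[i])]
        new_blocks[i] = soft_threshold(step, eta)
    return new_blocks
```

For instance, with two blocks and only block 0 selected, block 1 is returned unchanged while block 0 is shrunk toward zero by the l1 prox.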
2.3 Preliminaries
Let be the aggregated primal-dual variables and the primal-dual linear mapping; namely
(16) 
and also let
The point is a solution to (1) if and only if there exists such that
(17a)  
(17b)  
(17c) 
The following lemmas will be used in our subsequent analysis; their proofs are elementary and thus omitted.
Lemma 2.1
For any two vectors and , it holds
(18) 
Lemma 2.2
For any two vectors and a positive semidefinite matrix :
(19) 
Lemma 2.3
For any nonzero positive semidefinite matrix , it holds for any and of appropriate sizes that
(20) 
where denotes the matrix operator norm of .
The following lemma presents a useful property of , which essentially follows from (18).
Lemma 2.4
For any vectors , and sequence of positive numbers , it holds that
(21) 
3 Convergence Rate Results
In this section, we establish sublinear convergence rate results for Algorithm 1 in three different cases. We distinguish the cases based on whether or not in problem (1) also has a multi-block structure. In the first case, where is a multi-block variable, our analysis requires , where and are the cardinalities of the subsets of and selected by our algorithm, respectively. Since the analysis only requires weak convexity, we can ensure this condition by adding zero component functions if necessary, so that , and then choosing . The second case treats as a single-block variable, which is reflected in our algorithm by simply selecting all blocks at every iteration, i.e., . The third case assumes there is no variable at all. Although it is subsumed by the first two cases, we discuss it separately since weaker conditions suffice to guarantee the same convergence rate.
Throughout our analysis, we choose the matrices and in Algorithm 1 as follows:
(22) 
where and are given symmetric positive semidefinite block diagonal matrices, and denotes the diagonal blocks of indexed by . Note that this choice of and decouples the selected block variables in (11) and in (13), so both updates can be computed in parallel. In addition, we make the following assumptions:
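Display (22) is missing from the extracted text. One standard construction matching its description (a reconstruction under our own notation, not necessarily the paper's exact choice) subtracts the quadratic coupling introduced by the augmented term, so that the effective Hessian of each subproblem becomes block diagonal:

```latex
% A reconstruction consistent with the description of (22):
P^k \;=\; D_{I_k} \;-\; \beta\, A_{I_k}^\top A_{I_k},
\qquad
Q^k \;=\; E_{J_k} \;-\; \beta\, B_{J_k}^\top B_{J_k},
```

where $D, E$ are block diagonal positive semidefinite matrices chosen with $D \succeq \beta A^\top A$ and $E \succeq \beta B^\top B$ so that $P^k, Q^k \succeq 0$. Then the quadratic part of the $x$-update is governed by $\beta A_{I_k}^\top A_{I_k} + P^k = D_{I_k}$, which is block diagonal, so the selected blocks decouple and can be updated in parallel.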
Assumption 1 (Convexity)
For (1), the ’s and ’s are closed convex sets, and are smooth convex functions, and the ’s and ’s are proper closed convex functions.
Assumption 2 (Existence of a solution)
There is at least one point satisfying the conditions in (17).
Assumption 3 (Lipschitz continuous partial gradient)
There exist constants and such that for any subset of with and any subset of with , it holds that
(23a)  
(23b) 
where keeps the blocks of that are indexed by and zero elsewhere.
Before presenting the main convergence rate result, we first establish a few key lemmas.
Lemma 3.1 (One-step analysis)
Note that for any feasible point (namely, and ),
(32)  
(33) 
and
(34)  
(35) 
Then, using (19), we have the following result.
Lemma 3.2
For any feasible point and integer , it holds
and
Lemma 3.3
Given a continuous function , for a random vector , if for any feasible point that may depend on , we have
(38) 
then for any and any optimal solution to (1) we also have
Noting
we can easily show the following lemma by the optimality of and the Cauchy-Schwarz inequality.
Lemma 3.4
Assume satisfies (17). Then for any point , we have
(39) 
The following lemma shows a connection between different convergence measures; it can be proved simply by using (39). If both and are deterministic, it reduces to Lemma 2.4 in [15].
Lemma 3.5
The convergence analysis of Algorithm 1 requires slightly different parameter settings under different structures; in fact, the underlying analysis and results also differ. To account for these differences, we present the corresponding convergence results in the next three subsections. The first assumes there is no part at all; the second assumes a single block on the side; the last deals with the general case, where the ratio is assumed to equal .
3.1 Multiple blocks and no variable
We first consider the special case with no variable, namely and in (1). This case has its own importance: it yields a parallel block coordinate update version of the linearized augmented Lagrangian method (ALM).
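To make the linearized ALM connection concrete, here is a minimal single-block sketch in our own notation (Algorithm 1 adds block sampling and proximal terms on top of this): each iteration takes one gradient step on the augmented Lagrangian in the primal variable, then a dual ascent step on the multiplier. The stepsizes and the tiny test problem are our illustrative choices, not the paper's:

```python
import numpy as np

def linearized_alm(A, b, grad_f, x0, eta=0.1, beta=1.0, iters=300):
    """Linearized ALM for min f(x) s.t. Ax = b (a sketch):
    x-step: gradient descent on L_beta(x, lam) = f(x) + lam'(Ax-b)
            + (beta/2)||Ax - b||^2;
    lam-step: dual ascent with stepsize beta."""
    x, lam = x0.copy(), np.zeros(A.shape[0])
    for _ in range(iters):
        r = A @ x - b                            # constraint residual
        g = grad_f(x) + A.T @ (lam + beta * r)   # gradient of L_beta in x
        x = x - eta * g
        lam = lam + beta * (A @ x - b)           # multiplier update
    return x, lam
```

On the toy problem min (1/2)||x||^2 subject to x1 + x2 = 1, the iterates converge to x* = (0.5, 0.5) with multiplier lam* = -0.5, which satisfies the KKT condition x* + A' lam* = 0.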
Theorem 3.6 (Sublinear ergodic convergence I)
If there is no coupled function , (40) indicates that we can even choose for all , i.e., no proximal terms at all. Some caution is required here: in that case, when , the algorithm is not the Jacobian ADMM discussed in [20], since the block variables are still coupled in the augmented Lagrangian function. To make it parallelizable, a proximal term is needed. Our result then recovers the convergence of the proximal Jacobian ADMM introduced in [11]. In fact, the above theorem strengthens the convergence result in [11] by establishing a rate of convergence in terms of the feasibility measure and the objective value.
3.2 Multiple blocks and a single block
When the variable is simple to update, it can be beneficial to renew the whole of it at every iteration, as in problem (3). In this subsection, we consider the case where there are multiple blocks but a single block (or equivalently, ), and we establish a sublinear convergence rate result with a different technique for handling the variable.
Theorem 3.7 (Sublinear ergodic convergence II)
Let be the sequence generated from Algorithm 1 with and , where Assume
3.3 Multiple and blocks
In this subsection, we consider the most general case, where both and have a multi-block structure. Assuming