A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non Separable Functions
This paper introduces a coordinate descent version of the Vũ-Condat algorithm. By coordinate descent, we mean that only a subset of the coordinates of the primal and dual iterates is updated at each iteration, the other coordinates being maintained to their past value. Our method allows us to solve optimization problems with a combination of differentiable functions, constraints as well as non-separable and non-differentiable regularizers.
We show that the sequences generated by our algorithm converge to a saddle point of the problem at stake, for a wider range of parameter values than previous methods. In particular, the condition on the step-sizes depends on the coordinate-wise Lipschitz constant of the differentiable function’s gradient, which is a major feature allowing classical coordinate descent to perform so well when it is applicable.
We illustrate the performances of the algorithm on a total-variation regularized least squares regression problem and on large scale support vector machine problems.
We consider the optimization problem
where is a Euclidean space, is a linear operator onto a second Euclidean space ; functions , and are assumed proper, closed and convex; the function is moreover assumed differentiable. We assume that and are product spaces of the form and for some integers . For any , we use the notation to represent the (block of) coordinates of (similarly for in ). Problem (1) has numerous applications e.g. in machine learning [CBS14], image processing [CCC10] or distributed optimization [BPC11].
Under the standard qualification condition (where and respectively stand for domain and relative interior), a point is a minimizer of (1) if and only if there exists such that is a saddle point of the Lagrangian function
where is the inner product and is the Fenchel-Legendre transform of . There is a rich literature on primal-dual algorithms searching for a saddle point of (see [TDC14] and references therein). In the special case where , the alternating direction method of multipliers (ADMM) proposed by Glowinsky and Marroco [GM75], Gabay and Mercier [GM76] and the algorithm of Chambolle and Pock [CP11] are amongst the most celebrated ones. In order to encompass the presence of a nonzero smooth function , Combettes and Pesquet proposed a primal-dual splitting algorithm which, in the case of Problem (1), involves two calls of the gradient of at each iteration [CP12]. Hence, function is handled explicitely in the sense that the algorithm does not involve, for instance, the call of a proximity operator associated with . Based on an elegant idea also used in [HY12], Vũ [Vũ13] and Condat [Con13] separately proposed a primal-dual algorithm allowing as well to handle explicitely, and requiring one evaluation of the gradient of at each iteration. A convergence rate analysis is provided in [CP14a] (see also [TDC14]). A related splitting method has been recently introduced by [DY15].
This paper introduces a coordinate descent (CD) version of the Vũ-Condat algorithm. By coordinate descent, we mean that only a subset of the coordinates of the primal and dual iterates is updated at each iteration, the other coordinates being maintained to their past value. Coordinate descent was historically used in the context of coordinate-wise minimization of a unique function in a Gauss-Seidel sense [War63] [BT89] [TM01]. Tseng et al. [LT02] [TY] [TY10] and Nesterov [Nes12] developped CD versions of the gradient descent. In [Nes12] as well as in this paper, the updated coordinates are randomly chosen at each iteration. The algorithm of [Nes12] has at least two interesting features. Not only it is often easier to evaluate a single coordinate of the gradient vector rather than the whole vector, but the conditions under which the CD version of the algorithm is provably convergent are generally weaker than in the case of standard gradient descent. The key point is that the step size used in the algorithm when updating a given coordinate can be chosen to be inversely proportional to the coordinate-wise Lipschitz constant of along its th coordinate, rather than the global Lipschitz constant of (as would be the case in a standard gradient descent). Hence, the introduction of coordinate descent allows to use larger step sizes which potentially results in a more attractive performance. The random CD gradient descent of [Nes12] was later generalized by Richtárik and Takáč [RT14] to the minimization of a sum of two convex functions (that is, in problem (1)). The algorithm of [RT14] is analyzed under the additional assumption that function is separable in the sense that for each , for some functions . Accelerated and parallel versions of the algorithm have been later developed by [RT12] [RT15] [FR13] always assuming the separability of .
In the literature, several papers seek to apply the principle of coordinate descent to primal-dual algorithms. In the case where , is separable and smooth and is strongly convex, Zhang and Xiao [ZX14] introduce a stochastic CD primal-dual algorithm and analyze its convergence rate (see also [Suz14] for related works). In 2013, Iutzeler et al. [IBCH13] proved that random coordinate descent can be successfully applied to fixed point iterations of firmly non-expansive (FNE) operators. Due to [Gab83], the ADMM can be written as a fixed point algorithm of a FNE operator, which led the authors of [IBCH13] to propose a coordinate descent version of ADMM with application to distributed optimization. The key idea behind the convergence proof of [IBCH13] is to establish the so-called stochastic Fejèr monotonicity of the sequence of iterates as noted by [CP14b]. In a more general setting than [IBCH13], Combettes et al. in [CP14b] and Bianchi et al. [BHI14] extend the proof to the so-called -averaged operators, which include FNE operators as a special case. This generalization allows to apply the coordinate descent principle to a broader class of primal-dual algorithms which is no longer restricted to the ADMM or the Douglas Rachford algorithm. For instance, Forward-Backward splitting is considered in [CP14b] and the Vũ-Condat algorithm is considered in [BHI14, PR14]. However, the approach of [IBCH13], [CP14b], [BHI14] which is based on stochastic Féjer monotonicity, has a major drawback: the convergence conditions are identical to the ones of the brute method, the one without coordinate descent. These conditions involve the global Lipschitz constant of the gradient instead than its coordinate-wise Lipschitz constants. In practice, it means that the application of coordinate descent to primal-dual algorithm as suggested by [CP14b] and [BHI14] is restricted to the use of potentially small step sizes. One of the major benefits of coordinate descent is lost.
In this paper, we introduce a CD primal-dual algorithm with a broader range of admissible step sizes. The algorithm is introduced in Section 2. At each iteration , an index is randomly chosen w.r.t. the uniform distribution in where is, as we recall, the number of primal coordinates. The coordinate of the current primal iterate is updated, as well as a set of associated dual iterates. Under some assumptions involving the coordinate-wise Lipschitz constants of , the primal-dual iterates converges to a saddle point of the Lagrangian. As a remarkable feature, our CD algorithm makes no assumption of separability of the functions , or . In the special case where and is separable, the algorithm reduces to the CD proximal gradient algorithm of [RT14]. The convergence proof is provided in Section 3. It is worth noting that, under the stated assumption on the step-size, the stochastic Fejèr monotonicity of the sequence of iterates, which is the key idea in [IBCH13], [CP14b], [BHI14], does not hold (a counter-example is provided). Our proof relies on the introduction of an adequate Lyapunov function. In Section 4, the proposed algorithm is instanciated to the case of total-variation regularization and support vector machines. Numerical results performed on real IRM and text data establish the attractive behavior of the proposed algorithm and emphasize the importance of using primal-dual CD with large step sizes.
2 Coordinate Descent Primal-Dual Algorithm
We note where are the block components of . For each , we introduce the set
Otherwise stated, the th component of vector only depends on through the coordinates such that . We denote by
the number of such coordinates. Without loss of generality, we assume that for all . For all , we define
Note that for every pair , the statements and are equivalent.
Recall that . For every , we use the notation . An arbitrary element in will be represented by . We define . An arbitrary element in will be represented as . These notations are recalled in Table 1 below.
If is an integer, is a collection of positive real numbers and is an arbitrary product of Euclidean spaces, we introduce the weighted norm on given by for every where stand for the norm on . If denotes a convex proper lower-semicontinuous function, we introduce the proximity operator defined for any by
where we use the notation . We denote by the th coordinate mapping of that is, for any . The notation (or simply when no ambiguity occurs) stands for the diagonal operator on given by for every .
Finally, the adjoint of a linear operator is denoted . The spectral radius of a square matrix is denoted by .
2.2 Main Algorithm
Consider Problem (1). Let and be two tuples of positive real numbers. Consider an independent and identically distributed sequence with uniform distribution on 222The results of this paper easily extend to the selection of several primal coordinates at each iteration with a uniform samplings of the coordinates, using the techniques introduced in [RT15].. The proposed primal-dual CD algorithm consists in updating four sequences , , and . It is provided in Algorithm 1 below.
Initialization: Choose , .
For all , set .
For all , set .
Iteration : Define:
For and for each , update:
Otherwise, set , , and .
In Algorithm 1, it is worth noting that quantities do not need to be explicitely calculated. At iteration , only the coordinates
are needed to perform the update. When is separable, it can be easily checked that other coordinates do not need to be computed. From a computational point of view, it is often the case that the evaluation of the above coordinates is less demanding than the computation of the whole vectors , . Practical examples are provided in Section 4.
For every , we denote by the linear operator given by i.e., all coordinates of are zero except the th coordinate which coincides with . Our convergence result holds under the following assumptions.
The functions , , are closed proper and convex.
The function is differentiable on .
For every , there exists such that for any , any ,
The random sequence is independent with uniform distribution on .
For every ,
We denote by the set of saddle points of the Lagrangian function . Otherwise stated, a couple lies in if and only if it satisfies the following inclusions
We shall also refer to elements of as primal-dual solutions.
2.3 Special cases
2.3.1 The case
We consider the special case . Otherwise stated, the linear operator has a single nonzero component per row . This case happens for instance in the context of distributed optimization [BHI14].
Initialization: Choose , .
Iteration : Define:
For and for each , update:
Otherwise, set ,.
2.3.2 The case
Instanciating Algorithm 1 in the special case , it boils down to the following CD forward-backward algorithm:
As a consequence, Algorithm 1 allows to recover the CD proximal gradient algorithm of [RT14] with the notable difference that we do not assume the separability of . On the other hand, Assumption 1(e) becomes whereas in the separable case, [RT14] assumes . This remark leads us to conjecture that, even though Assumption 1(e) generally allows for the use of larger step sizes as the ones suggested by the approach of [CP14b] [BHI14], one might be able to use even larger step sizes than the ones allowed by Theorem 1.
Note that a similar CD forward-backward algorithm can be found in [CP14b] with no need to require the separability of . However, the algorithm of [CP14b] assumes that the step size (there assumed to be independent of ) is less than where is the global Lipschitz constant of . As discussed in the introduction, an attractive feature of our algorithm is the fact that our convergence condition only involves the coordinate-wise Lipschitz constant of .
2.4 Failure of stochastic Fejér monotonicity
As discussed in the introduction, an existing approach to prove convergence of CD algorithm in a general setting (that is, not restricted to and separable ) is to establish the stochastic Fejér monotonicity of the iterates. The idea was used in [IBCH13] and extended by [CP14b] and [BHI14] to a more general setting. Unfortunately, this approach implies to select a “small” step size as noticed in the previous section. The use of small step size is unfortunate in practice, as it may significantly affect the convergence rate.
It is natural to ask whether the existing convergence proof based on stochastic Fejér monotonicity can be extended to the use of larger step sizes. The answer is negative, as shown by the following example.
Consider the toy problem
that is we take and . One of the minimizers is . The global Lispchitz constant of is equal to and the coordinate-wise Lipschitz constants are equal to 1. The CD proximal gradient algorithm (4) writes
where we used for simplicity. By Theorem 1, converges almost surely to whenever . Setting , one has . It is immediately seen that where represents the expectation. In particular, as soon as . Therefore, the sequence is not decreasing. This example shows that the proof techniques based on monotone operators and Fejér monotonicity are not directly applicable in the case of long stepsizes.
3 Proof of Theorem 1
3.1 Preliminary Lemma
For every , we define
The inclusions (3) also read
Setting and in the above inequalities, we obtain
By definition of the proximal operator,
Setting , we obtain
and using (6), we finally have
Similarly, Equality (8) implies that for any ,
We set . This yields
Using moreover Inequality (5), we obtain
hence, rearranging the terms,
Summing the above inequality with (9),
This completes the proof of the lemma thanks to the definition of . ∎
3.2 Study of Algorithm 2
We first prove Theorem 1 in the special case . In that case, Algorithm 1 boils down to Algorithm 2. We recall that in this case, the vector is reduced to a single value where is the unique index such that . We simply denote this value by .
We denote by the filtration generated by the random variable (r.v.) . We denote by the conditional expectation w.r.t. .
The first equality is immediate.
Consider the second one. which coincides with and the second equality is proved.
Similarly for the third equality, and for every ,
, implies that and the third equality is proved.
Consider the fourth equality. Note that
For any pair such that , the conditional expectation of each term in the sum is equal to which in turn implies the fourth equality in the Lemma. ∎
Assume that for each . Define for every ,