A better convergence analysis of the block coordinate descent method for large scale machine learning
This paper considers the problem of unconstrained minimization of large-scale smooth convex functions having block-coordinate-wise Lipschitz continuous gradients. The block coordinate descent (BCD) method is among the first optimization schemes suggested for solving such problems . We obtain a new bound (to the best of our knowledge the smallest currently known) on the information-based complexity of the BCD method, which is  times smaller than the best previously known bound, based on an effective technique called the Performance Estimation Problem (PEP), recently proposed by Drori and Teboulle  for analyzing the performance of first-order black-box optimization methods. Numerical tests confirm our analysis.
1 Introduction and problem statement
In this work, we consider block coordinate descent (BCD) algorithms for solving large-scale problems of the following form:
where  is a smooth convex function (not necessarily strongly convex), and it is assumed throughout this work that:
The gradients of  are block-coordinate-wise Lipschitz continuous with constants :
where  is a decomposition of the identity matrix into column submatrices , and the space  is decomposed into  subspaces: , while  is the block of partial derivatives . We denote the set of functions satisfying this condition by , where  stands for , and  stands for .
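For reference, the block-coordinate-wise Lipschitz condition described above is standardly written as follows (a reconstruction following the usual formulation in the coordinate-descent literature, with the symbols matching those defined above):

```latex
\left\| \nabla_i f(x + U_i h_i) - \nabla_i f(x) \right\| \;\le\; L_i \, \| h_i \| ,
\qquad \forall\, x \in \mathbb{R}^N,\; h_i \in \mathbb{R}^{N_i},\; i = 1, \dots, p .
```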
The optimal set is nonempty, i.e., the Problem (1.1) is solvable.
Block coordinate descent (BCD) methods have recently gained popularity for solving Problem (1.1), both in theoretical optimization and in many applications such as machine learning, signal processing, and communications. These problems are of very large scale, and the computation per iteration of BCD methods is simple and cheap, yielding computational efficiency. If solutions of moderate accuracy are sufficient for the target application, BCD methods are often the best option for solving Problem (1.1) in a reasonable time. For convex optimization problems, there exists an extensive literature on the development and analysis of BCD methods, but most of it focuses on randomized BCD methods [7, 5, 8, 4, 6], where blocks are chosen at random in each iteration. In contrast, the existing literature on cyclic BCD methods is rather limited [1, 3], and the latter  is focused on strongly convex functions. In this paper, we focus on the theoretical performance analysis of cyclic BCD methods for unconstrained minimization with an objective function known to satisfy the assumptions in Problem (1.1) over the Euclidean space , although the function itself is not known.
We consider finding a minimizer over of a cost function belonging to the set . The class of standard and popular cyclic algorithms of interest generates a sequence of points using the following scheme:
Input: start point .
1: repeat for
2: Set , and generate recursively
3: Update: .
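A minimal Python sketch of this cyclic scheme (the names `grad_block`, `L`, and `blocks` are illustrative, not from the paper; `grad_block(x, i)` stands for the block partial gradient ∇ᵢf(x)):

```python
import numpy as np

def cyclic_bcd(grad_block, L, x0, blocks, num_cycles):
    """Cyclic block coordinate descent with constant stepsize 1/L[i].

    grad_block(x, i) -- partial gradient of f w.r.t. block i (illustrative API)
    L[i]             -- block-coordinate-wise Lipschitz constant of block i
    blocks[i]        -- index array selecting the i-th block of variables
    """
    x = x0.copy()
    for _ in range(num_cycles):
        # one cycle sweeps over all blocks in a fixed cyclic order
        for i, idx in enumerate(blocks):
            # gradient step on block i only; all other blocks held fixed
            x[idx] -= grad_block(x, i) / L[i]
    return x
```

For instance, on the separable quadratic f(x) = ½‖x‖² (where each block Lipschitz constant is 1), a single cycle already reaches the minimizer.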
The update step at the -th iterate performs a gradient step with constant stepsize  with respect to a different block of variables, taken in cyclic order. Evaluating the convergence bound of such BCD algorithms is essential. The sequence  is known to satisfy the bound :
for , which to the best of our knowledge is the previously best known analytical bound for the cyclic BCD method for unconstrained smooth convex minimization. Here  and  are the maximal and minimal block Lipschitz constants,
 is the Lipschitz constant of , that is,
for every , and  is defined by
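For concreteness, the previously best known bound referred to here (due to Beck and Tetruashvili [1]) has, to the best of our recollection of the literature, the following form; it should be checked against the original statement:

```latex
f(x_k) - f^* \;\le\; \frac{4 L_{\max}\left(1 + p\, L^2 / L_{\min}^2\right) \|x_0 - x^*\|^2}{k + 8/p},
\qquad k \ge 0 .
```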
In practice, however, BCD converges much faster. As can be seen in Figure 1 in Section 4, there is a big gap between the currently best known bound and the convergence observed in practice. This work tries to close this gap. Recently, Drori and Teboulle  considered the Performance Estimation Problem (PEP) approach to bounding the decrease of a cost function . Following this excellent work, we can formulate the worst-case performance bound of the BCD method over all smooth convex functions as the solution of the following constrained optimization problem:
In the sequel, we often need to estimate from above the differences between two block partial gradients. For that it is convenient to use the following simple lemma:
Let , then we have
for every .
We also need the following lemma (similar to, but different from, Lemma 3.1 of ) to simplify a quadratic function of a matrix variable into a function of a vector variable.
Let be a quadratic function, where , , and . Then
The proofs of these two lemmas are given in the appendix for completeness.
2 Relaxations of the PEP
Since Problem (P) involves an unknown function as a variable, the PEP is infinite-dimensional. Nevertheless, it can be relaxed by using the properties of the functions belonging to .
for and , that is
where we use the fact .
In this paper we deal with a standard case to gain insight: we assume that all the block partial Lipschitz constants are equal, that is, . We define
for every . In view of Algorithm (1), since for , obviously we have
In view of the above notation, Problem (P) can now be relaxed by discarding the constraints , yielding the following form:
To relax the above problem further, set  and ; then we have
As in , Problem (P2) is invariant under the transformation  for any orthogonal transformation . We can therefore assume without loss of generality that , where  is any given unit vector in . Therefore, we have
In order to simplify notation, we denote  as  in the following. Now we can remove some constraints from Problem (P2) to further simplify the analysis:
where . It is obvious that from the definition (1.7).
Let  denote the matrix whose rows are , and let  be , the ( )-th standard unit vector, for  and . Then we have
for any . Let
With the above definitions, Problem (P3) can be transformed into a more compact form in terms of  and :
where, for convenience, we have recast the above as a minimization problem and omitted the fixed term  from the objective.
Attaching the dual multipliers
to the first and second set of inequalities respectively, and using the notation
, we get that the Lagrangian of this problem is given as a sum of two separable functions in the variables :
The dual objective function is then defined by
and the dual problem of Problem (P4) is then given by
Since is linear in , we have whenever
According to Lemma 1.2, we have
Let  be , so that . Therefore, for any  satisfying (2.2), we obtain that the dual objective is bounded above by
If all the blocks have equal size, that is,  for every , then we get
Now we obtain an upper bound for the optimal value of Problem ((P3)):
3 New bound of BCD
According to Appendix C, if we set
we have . Thus we obtain the following new upper bound on the complexity of BCD:
Let and let be generated by Algorithm 1 with and . Then we have
From the above theorem, we notice that our bound is  times smaller than the known bound (1.4) (with ).
4 Numerical test
Consider the least squares problem
where , . A is a nonsingular matrix, so the optimal solution of the problem is the vector  and the optimal value is . We consider the partition of the variables into  blocks, each with  variables (we assume that  divides ). We will also use the notation
where is the submatrix of A comprising the columns corresponding to the -th block, that is, columns .
We consider , and four choices of : 2, 5, 20, and 100. The results, together with the classical bound on the convergence rate of the sequence  of the BCD method, are summarized in Figure 1.
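A minimal sketch of this experiment on a small, hypothetical instance (the dimensions and random data here are illustrative; the block Lipschitz constants are taken as L_i = ‖A_i‖², the standard choice for the least squares objective):

```python
import numpy as np

def bcd_least_squares(A, b, p, num_cycles):
    """Cyclic BCD on f(x) = 0.5 * ||A x - b||^2 with p equal-size blocks."""
    n = A.shape[1]
    m = n // p                        # block size (assumes p divides n)
    blocks = [np.arange(i * m, (i + 1) * m) for i in range(p)]
    # block Lipschitz constants: L_i = ||A_i||_2^2, the largest eigenvalue of A_i^T A_i
    L = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]
    x = np.zeros(n)
    history = []
    for _ in range(num_cycles):
        for i, idx in enumerate(blocks):
            # partial gradient for block i: A_i^T (A x - b)
            g = A[:, idx].T @ (A @ x - b)
            x[idx] -= g / L[i]
        history.append(0.5 * np.linalg.norm(A @ x - b) ** 2)
    return x, history

# small illustrative instance: a well-conditioned nonsingular A, so f* = 0
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20)) + 20 * np.eye(20)
b = rng.standard_normal(20)
x, history = bcd_least_squares(A, b, p=4, num_cycles=50)
```

Since each block step is a gradient step with stepsize 1/L_i, the recorded objective values decrease monotonically, and the gap between this observed decrease and the analytical bound is what Figure 1 illustrates.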
This paper provides a novel and tighter analytical convergence bound, which is  times as small as the previous best, for the sequence  of BCD methods for unconstrained smooth convex minimization. Extending this approach to more general settings, such as randomized BCD-type methods or stochastic gradient methods, is important future work. In a broader context, we believe that the current paper could serve as a basis for applying the PEP approach to various BCD-related methods.
Appendix A Proof of Lemma 1.1
In this appendix, we give the proof of Lemma 1.1.
For all and , we have
where the second inequality follows from the Cauchy-Schwarz inequality and the third inequality follows from (1.2). In short, from the above we have
Then consider the function . The gradient of  is , which is obviously block-coordinate-wise Lipschitz continuous with constants ; that is,  belongs to the class , the same as , and  is one of its optimal points. Therefore, in view of (A.1), we get
Letting  in (A.2), we have