
# A better convergence analysis of the block coordinate descent method for large scale machine learning

## Abstract

This paper considers the unconstrained minimization of large scale smooth convex functions having block-coordinate-wise Lipschitz continuous gradients. Block coordinate descent (BCD) methods are among the first optimization schemes suggested for solving such problems [5]. We obtain a new bound on the information-based complexity of the BCD method (to the best of our knowledge the smallest currently known) that is roughly $16p^3$ times smaller than the best known one. The analysis is based on an effective technique called the Performance Estimation Problem (PEP), recently proposed by Drori and Teboulle [2] for analyzing the performance of first-order black-box optimization methods. A numerical test confirms our analysis.

## 1 Introduction and problem statement

In this work, we consider block coordinate descent (BCD) algorithms for solving large scale problems of the following form:

$$\min_{x\in\mathbb{R}^D} f(x), \tag{1.1}$$

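where $f:\mathbb{R}^D\to\mathbb{R}$ is a smooth convex function (not necessarily strongly convex), and it is assumed throughout this work that

• The gradient of $f$ is block-coordinate-wise Lipschitz continuous with constants $L_i$:

$$\|\nabla_i f(x+U_i h_i)-\nabla_i f(x)\| \le L_i\|h_i\|, \tag{1.2}$$

where $(U_1,\dots,U_p)$ is a decomposition of the $D\times D$ identity matrix into column submatrices $U_i\in\mathbb{R}^{D\times D_i}$, so that the space $\mathbb{R}^D$ is decomposed into $p$ subspaces, $\mathbb{R}^D=\mathbb{R}^{D_1}\times\cdots\times\mathbb{R}^{D_p}$, while $\nabla_i f(x)=U_i^\top\nabla f(x)$ is the $i$th block of partial derivatives. We denote the set of functions satisfying this condition by $\mathcal{F}_{L,U}(\mathbb{R}^D)$, where $L$ stands for $(L_1,\dots,L_p)$ and $U$ stands for $(U_1,\dots,U_p)$.

• The optimal set $X_*(f)$ is nonempty, i.e., Problem (1.1) is solvable.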

Block coordinate descent (BCD) methods have recently gained popularity for solving Problem (1.1), both in theoretical optimization and in many applications such as machine learning, signal processing, and communications. These problems are of very large scale, and each BCD iteration is simple and cheap to compute, which yields computational efficiency. If moderate-accuracy solutions are sufficient for the target application, BCD methods are often the best option for solving Problem (1.1) in a reasonable time. For convex optimization problems there is an extensive literature on the development and analysis of BCD methods, but most of it focuses on randomized BCD methods [7, 5, 8, 4, 6], where blocks are chosen randomly in each iteration. In contrast, the existing literature on cyclic BCD methods is rather limited [1, 3], and the latter [3] focuses on strongly convex functions. In this paper, we focus on the theoretical performance analysis of cyclic BCD methods for unconstrained minimization over the Euclidean space $\mathbb{R}^D$ of an objective function that is known to satisfy the assumptions of Problem (1.1), although the function itself is not known.

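We consider finding a minimizer over $\mathbb{R}^D$ of a cost function $f$ belonging to the set $\mathcal{F}_{L,U}(\mathbb{R}^D)$. The class of standard and popular cyclic algorithms of interest generates a sequence of points $\{x_k^i\}$ using the following scheme (Algorithm 1), written here in the form in which it appears in the constraints of Problem (P):

$$x_k^i = x_k^{i-1}-\frac{1}{L_i}U_i\nabla_i f(x_k^{i-1}),\quad i=1,\dots,p, \qquad x_{k+1}^0 = x_k^p. \tag{1.3}$$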
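The cyclic scheme can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cyclic_bcd`, the gradient oracle `grad`, and the toy separable quadratic are all assumptions introduced here.

```python
import numpy as np

def cyclic_bcd(grad, x0, blocks, lipschitz, n_cycles):
    """Run n_cycles full cycles of cyclic block coordinate gradient descent.

    grad      -- callable returning the full gradient of f at x (hypothetical oracle)
    blocks    -- list of index arrays forming a partition of the coordinates
    lipschitz -- block Lipschitz constants L_i, one per block
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_cycles):
        for idx, L_i in zip(blocks, lipschitz):
            # gradient step of size 1/L_i restricted to block i
            x[idx] -= grad(x)[idx] / L_i
    return x

# Toy separable quadratic f(x) = 0.5 * sum(c_j * x_j**2), minimized at the origin.
c = np.array([1.0, 2.0, 4.0, 8.0])
blocks = [np.array([0, 1]), np.array([2, 3])]
lipschitz = [2.0, 8.0]  # largest curvature inside each block
x_final = cyclic_bcd(lambda x: c * x, np.ones(4), blocks, lipschitz, n_cycles=50)
```

Each inner step touches only one block of coordinates, which is what keeps the per-iteration cost low. For clarity the sketch evaluates the full gradient and slices it; a practical implementation would compute only the needed block.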

The update step at the $i$th inner iteration performs a gradient step with constant stepsize $1/L_i$ with respect to a different block of variables, taken in cyclic order. Evaluating the convergence bound of such BCD algorithms is essential. The sequence $\{x_k\}$ is known to satisfy the bound [1]:

$$f(x_k)-f(x_*) \le 4L_{\max}\left(1+\frac{pL^2}{L_{\min}^2}\right)R^2(x_0)\,\frac{1}{k+8/p} \tag{1.4}$$

for $k=0,1,2,\dots$, which to the best of our knowledge is the best previously known analytical bound for the cyclic BCD method in unconstrained smooth convex minimization. Here $L_{\max}$ and $L_{\min}$ are the maximal and minimal block Lipschitz constants

$$L_{\max}=\max_{i=1,\dots,p}L_i \quad\text{and}\quad L_{\min}=\min_{i=1,\dots,p}L_i, \tag{1.5}$$

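$L$ is the Lipschitz constant of $\nabla f$, that is

$$\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\| \tag{1.6}$$

for every $x,y\in\mathbb{R}^D$, and $R(x_0)$ is defined by

$$R(x_0) := \max_{x\in\mathbb{R}^D}\,\max_{x_*\in X_*(f)}\{\|x-x_*\| : f(x)\le f(x_0)\} \tag{1.7}$$

as in [5, 1].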

In practice, however, BCD converges much faster. Figure 1 in Section 4 shows that there is a big gap between the best currently known bound and the convergence observed in practice. This work tries to fill this gap. Recently, Drori and Teboulle [2] introduced the Performance Estimation Problem (PEP) approach to bounding the decrease of a cost function $f$. Following this excellent work, we can formulate the worst-case performance bound of the BCD method over all smooth convex functions $f\in\mathcal{F}_{L,U}(\mathbb{R}^D)$ as the solution of the following constrained optimization problem:

$$
\begin{aligned}
\max_{f\in\mathcal{F}_{L,U}(\mathbb{R}^D)}\ \max_{x_0^0,\dots,x_k^i,\dots,x_N^p,\,x_*\in\mathbb{R}^D}\quad & f(x_N^p)-f(x_*)\\
\text{s.t.}\quad & x_k^i = x_k^{i-1}-\frac{1}{L_i}U_i\nabla_i f(x_k^{i-1}),\quad k=0,\dots,N,\ i=1,\dots,p,\\
& x_k^p = x_{k+1}^0,\quad k=0,\dots,N-1,\\
& x_*\in X_*(f).
\end{aligned} \tag{P}
$$

### 1.1 Lemmas

In the sequel, we often need to bound from above the difference between two block partial gradients. For that it is convenient to use the following simple lemma:

###### Lemma 1.1.

Let $f\in\mathcal{F}_{L,U}(\mathbb{R}^D)$. Then we have

$$\frac{1}{2L_i}\|\nabla_i f(y)-\nabla_i f(x)\|^2 \le f(y)-f(x)-\langle\nabla f(x),y-x\rangle \tag{1.8}$$

for every $x,y\in\mathbb{R}^D$ and $i=1,\dots,p$.
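As a sanity check, inequality (1.8) can be tested numerically on a random least-squares function. The sketch below is only illustrative: the test function $f(x)=\tfrac12\|Ax-b\|^2$, the random data, the equal-size block partition, and the helper name `lemma_gap` are assumptions made here, with $L_i$ taken as the largest eigenvalue of $A_i^\top A_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, p = 12, 3
A = rng.standard_normal((D, D))
b = rng.standard_normal(D)
blocks = np.split(np.arange(D), p)  # equal-size blocks of coordinates

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
# L_i = largest eigenvalue of A_i^T A_i = squared spectral norm of A_i
L = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]

def lemma_gap(x, y, i):
    """Right-hand side minus left-hand side of (1.8) for block i."""
    idx = blocks[i]
    lhs = np.sum((grad(y)[idx] - grad(x)[idx]) ** 2) / (2.0 * L[i])
    rhs = f(y) - f(x) - grad(x) @ (y - x)
    return rhs - lhs

gaps = [lemma_gap(rng.standard_normal(D), rng.standard_normal(D), i)
        for i in range(p) for _ in range(100)]
min_gap = min(gaps)  # should be nonnegative up to rounding
```

A nonnegative `min_gap` on random samples does not prove the lemma, but a negative value would expose a wrong constant or a sign error in the statement.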

We also need the following lemma (similar to, but different from, Lemma 3.1 of [2]) to reduce a quadratic function of a matrix variable to a function of a vector variable.

###### Lemma 1.2.

Let be a quadratic function, where , , and . Then

$$\inf_{X\in\mathbb{R}^{n\times m}} f(X) = \inf_{\xi\in\mathbb{R}^n} f(\xi b^\top).$$

The proofs of these two lemmas are given in the appendix for completeness.

## 2 Relaxations of the PEP

Since Problem (P) involves an unknown function $f$ as a variable, the PEP is infinite-dimensional. Nevertheless, it can be relaxed by using the properties of the functions belonging to $\mathcal{F}_{L,U}(\mathbb{R}^D)$.

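Let $\{x_k^i\}$ be the sequence generated by Algorithm 1. Applying (1.8) to the points $x_m^i$, $x_n^j$, and $x_*$, and noting (1.3), we get

$$
\begin{aligned}
\frac{1}{2L_t}\|\nabla_t f(x_m^i)-\nabla_t f(x_n^j)\|^2 &\le f(x_m^i)-f(x_n^j)-\langle\nabla f(x_n^j),x_m^i-x_n^j\rangle,\\
\frac{1}{2L_t}\|\nabla_t f(x_m^i)-\nabla_t f(x_*)\|^2 &\le f(x_m^i)-f(x_*)-\langle\nabla f(x_*),x_m^i-x_*\rangle,\\
\frac{1}{2L_t}\|\nabla_t f(x_*)-\nabla_t f(x_n^j)\|^2 &\le f(x_*)-f(x_n^j)-\langle\nabla f(x_n^j),x_*-x_n^j\rangle,
\end{aligned}
$$

for all iterates $x_m^i$, $x_n^j$ and every block $t=1,\dots,p$, that is

$$
\begin{aligned}
\frac{1}{2L_t}\|U_t^\top\nabla f(x_m^i)-U_t^\top\nabla f(x_n^j)\|^2 &\le f(x_m^i)-f(x_n^j)-\langle\nabla f(x_n^j),x_m^i-x_n^j\rangle,\\
\frac{1}{2L_t}\|U_t^\top\nabla f(x_m^i)\|^2 &\le f(x_m^i)-f(x_*),\\
\frac{1}{2L_t}\|U_t^\top\nabla f(x_n^j)\|^2 &\le f(x_*)-f(x_n^j)-\langle\nabla f(x_n^j),x_*-x_n^j\rangle,
\end{aligned}
$$

where we use the fact that $\nabla f(x_*)=0$.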

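In this paper we deal with a standard case to gain insight: we assume that all the block Lipschitz constants are equal, that is $L_i=L_c$ for $i=1,\dots,p$. We define

$$
\begin{aligned}
\delta_{k,i} &:= \frac{1}{pL_cR^2(x_0)}\bigl(f(x_k^i)-f(x_*)\bigr), & g_{k,i} &:= \frac{1}{pL_cR(x_0)}\nabla f(x_k^i),\\
\delta_* &:= \frac{1}{pL_cR^2(x_0)}\bigl(f(x_*)-f(x_*)\bigr)=0, & g_* &:= \frac{1}{pL_cR(x_0)}\nabla f(x_*)=0,
\end{aligned}
$$

for every $k=0,\dots,N$ and $i=0,\dots,p$. In view of Algorithm 1, since $x_k^p=x_{k+1}^0$ for $k=0,\dots,N-1$, we obviously have

$$\delta_{k,p}=\delta_{k+1,0} \quad\text{and}\quad g_{k,p}=g_{k+1,0}. \tag{2.1}$$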

In view of the above notation, Problem (P) can now be relaxed, by discarding constraints, to the following form

[Display equation (P1) could not be recovered from the source.]

We try to relax the above problem, if we set and , then we have

[Display equation (P2) could not be recovered from the source.]

As in [2], Problem (P2) is invariant under orthogonal transformations of its variables. We can therefore assume without loss of generality that $x_*-x_0=\|x_*-x_0\|\nu$, where $\nu$ is any given unit vector in $\mathbb{R}^D$. Therefore, we have

$$\frac{p}{2}\|U_t^\top g_{k,i}\|^2 \le -\delta_{k,i}-\frac{\langle g_{k,i},\ \|x_*-x_0\|\nu+x_0-x_k^i\rangle}{R(x_0)}$$

and

$$x_0-x_k^i = pR(x_0)\sum_{\{k',i'\,:\,k'p+i'\le kp+i\}} U_{i'}U_{i'}^\top g_{k',i'-1}.$$
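The telescoping identity for $x_0-x_k^i$ can be checked numerically. The following sketch assumes a least-squares objective, equal-size blocks, a common stepsize $1/L_c$, and an arbitrary positive placeholder value for $R(x_0)$ (it cancels in the identity); none of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
D, p, N = 8, 4, 3
A = rng.standard_normal((D, D))
b = rng.standard_normal(D)
blocks = np.split(np.arange(D), p)
grad = lambda x: A.T @ (A @ x - b)
L_c = max(np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks)
R = 2.5  # placeholder scale, assumed; any positive value works here

x0 = rng.standard_normal(D)
x = x0.copy()
accum = np.zeros(D)  # running sum of U_i U_i^T g_{k,i-1}
max_err = 0.0
for k in range(N + 1):
    for idx in blocks:
        g_prev = grad(x) / (p * L_c * R)  # normalized gradient g_{k,i-1}
        proj = np.zeros(D)
        proj[idx] = g_prev[idx]           # U_i U_i^T g_{k,i-1}
        accum += proj
        x[idx] -= grad(x)[idx] / L_c      # x_k^i obtained from x_k^{i-1}
        # compare x_0 - x_k^i with p * R * (accumulated projected gradients)
        max_err = max(max_err, np.linalg.norm((x0 - x) - p * R * accum))
```

The identity is what allows the iterates $x_k^i$ to be eliminated from the relaxed problem, leaving only the normalized gradients and function values as variables.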

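In order to simplify notation, we denote as in the following. Now we can remove some constraints from Problem (P2) to further simplify the analysis:

$$
\begin{aligned}
\max_{\substack{x_k^i\in\mathbb{R}^D,\ g_{k,i}\in\mathbb{R}^D,\ \delta_{k,i}\in\mathbb{R},\\ k=0,\dots,N,\ i=1,\dots,p}}\quad & pL_cR^2(x_0)\,\delta_{N,p}\\
\text{s.t.}\quad & \frac{p}{2}\|U_t^\top g_{k,i-1}-U_t^\top g_{k,i}\|^2 \le \delta_{k,i-1}-\delta_{k,i}-p\langle g_{k,i},U_iU_i^\top g_{k,i-1}\rangle,\\
& \frac{p}{2}\|U_t^\top g_{k,i}\|^2 \le -\delta_{k,i}-\Bigl\langle g_{k,i},\ \alpha\nu+p\sum_{k'p+i'\le kp+i}U_{i'}U_{i'}^\top g_{k',i'-1}\Bigr\rangle,\\
& \frac{p}{2}\|U_t^\top g_{0,0}\|^2 \le -\delta_{0,0}-\langle g_{0,0},\alpha\nu\rangle,\\
& \delta_{k,p}=\delta_{k+1,0},\quad g_{k,p}=g_{k+1,0},\\
& k=0,\dots,N,\quad i,t=1,\dots,p,
\end{aligned} \tag{P3}
$$

where $\alpha:=\|x_*-x_0\|/R(x_0)$. It is obvious from the definition (1.7) that $\alpha\le 1$.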

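Let $G$ denote the $((N+1)p+1)\times D$ matrix whose rows are $g_{k,i}^\top$, and let $u_{k,i}$ be $e_{kp+i+1}$, the $(kp+i+1)$th standard unit vector in $\mathbb{R}^{(N+1)p+1}$, for $k=0,\dots,N$ and $i=0,\dots,p$. Then we have

$$U_t^\top g_{k,i}=U_t^\top G^\top u_{k,i},\qquad \operatorname{tr}\bigl(U_t^\top G^\top u_{m,i}u_{n,j}^\top GU_t\bigr)=\langle U_t^\top g_{m,i},U_t^\top g_{n,j}\rangle,\qquad\text{and}\qquad \langle G^\top u_{k,i},\nu\rangle=\langle g_{k,i},\nu\rangle$$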

for all indices $k,m,n$, $i,j$, and $t$. Problem (P3) can then be transformed into a more compact form in terms of $G$ and $\delta$:

[Display equation (P4) could not be recovered from the source.]

where, for convenience, we recast the above as a minimization problem and omit the fixed factor $pL_cR^2(x_0)$ from the objective.
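The matrix identities used to rewrite (P3) in terms of $G$ are plain linear algebra and can be verified directly. In this sketch the zero-based flat index `k * p + i` for the row holding $g_{k,i}$, the random data, and the helper names `u` and `g` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
D, p, N = 6, 3, 2
M = (N + 1) * p + 1                  # number of stacked gradients g_{0,0}, ..., g_{N,p}
G = rng.standard_normal((M, D))      # row k*p+i holds g_{k,i}^T (zero-based, assumed)
blocks = np.split(np.arange(D), p)
U = [np.eye(D)[:, idx] for idx in blocks]  # column submatrices U_t of the identity
nu = rng.standard_normal(D)
nu /= np.linalg.norm(nu)             # arbitrary unit vector

def u(k, i):
    """Standard unit vector selecting row (k, i) of G."""
    e = np.zeros(M)
    e[k * p + i] = 1.0
    return e

def g(k, i):
    return G[k * p + i]

t = 0
lhs_vec = U[t].T @ G.T @ u(1, 2)
rhs_vec = U[t].T @ g(1, 2)
lhs_trace = np.trace(U[t].T @ G.T @ np.outer(u(1, 2), u(2, 1)) @ G @ U[t])
rhs_inner = (U[t].T @ g(1, 2)) @ (U[t].T @ g(2, 1))
lhs_nu = (G.T @ u(1, 2)) @ nu
rhs_nu = g(1, 2) @ nu
```

These identities turn every inner product of block gradients into a trace expression that is quadratic in $G$, which is the form required by Lemma 1.2.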

Attaching the dual multipliers

$$\lambda:=(\lambda_{0,1},\dots,\lambda_{k,i},\dots,\lambda_{N,p})^\top\in\mathbb{R}_+^{(N+1)p}$$

and

$$\tau:=(\tau_{0,0},\dots,\tau_{k,i},\dots,\tau_{N,p})^\top\in\mathbb{R}_+^{(N+1)p+1}$$

to the first and second sets of inequalities respectively, and using the notation $\delta=(\delta_{0,0},\dots,\delta_{k,i},\dots,\delta_{N,p})$, we get that the Lagrangian of this problem is a sum of two separable functions in the variables $\delta$ and $G$:

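$$
\begin{aligned}
L(G,\delta,\lambda,\tau) ={}& -\delta_{N,p}+\sum_{k=0}^{N}\sum_{i=1}^{p}\lambda_{k,i}(\delta_{k,i}-\delta_{k,i-1})+\sum_{k=0}^{N}\sum_{i=1}^{p}\tau_{k,i}\delta_{k,i}+\tau_{0,0}\delta_{0,0}\\
&+\frac{p}{2}\sum_{k=0}^{N}\sum_{i=1}^{p}\lambda_{k,i}\operatorname{tr}\bigl(U_iU_i^\top G^\top(u_{k,i-1}u_{k,i-1}^\top+u_{k,i}u_{k,i}^\top)G\bigr)\\
&+\sum_{k=0}^{N}\sum_{i=1}^{p}\tau_{k,i}\Bigl[\frac{p}{2}\operatorname{tr}\bigl(U_iU_i^\top G^\top u_{k,i}u_{k,i}^\top G\bigr)+\operatorname{tr}\bigl(\alpha\nu u_{k,i}^\top G\bigr)\\
&\qquad+\frac{p}{2}\sum_{k'p+i'\le kp+i}\operatorname{tr}\bigl(U_{i'}U_{i'}^\top G^\top(u_{k',i'-1}u_{k,i}^\top+u_{k,i}u_{k',i'-1}^\top)G\bigr)\Bigr]\\
&+\tau_{0,0}\Bigl[\frac{p}{2}\operatorname{tr}\bigl(U_iU_i^\top G^\top u_{0,0}u_{0,0}^\top G\bigr)+\operatorname{tr}\bigl(\alpha\nu u_{0,0}^\top G\bigr)\Bigr]\\
\equiv{}& L_1(\delta,\lambda,\tau)+L_2(G,\lambda,\tau).
\end{aligned}
$$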

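The dual objective function is then defined by

$$H(\lambda,\tau)=\min_{G,\delta}L(G,\delta,\lambda,\tau)=\min_{\delta}L_1(\delta,\lambda,\tau)+\min_{G}L_2(G,\lambda,\tau),$$

and the dual problem of Problem (P4) is then given by

$$\max\bigl\{H(\lambda,\tau):\lambda\in\mathbb{R}_+^{(N+1)p},\ \tau\in\mathbb{R}_+^{(N+1)p+1}\bigr\}.$$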

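Since $L_1(\delta,\lambda,\tau)$ is linear in $\delta$, we have $\min_\delta L_1(\delta,\lambda,\tau)=0$ whenever

$$
\begin{aligned}
-\lambda_{0,1}+\tau_{0,0} &= 0,\\
\lambda_{k,i}-\lambda_{k,i+1}+\tau_{k,i} &= 0\quad (k=1,\dots,N,\ i=1,\dots,p-1),\\
-1+\lambda_{N,p}+\tau_{N,p} &= 0,
\end{aligned} \tag{2.2}
$$

and $-\infty$ otherwise.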

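According to Lemma 1.2, we have

$$\min_{G\in\mathbb{R}^{((N+1)p+1)\times D}}L_2(G,\lambda,\tau)=\min_{w\in\mathbb{R}^{(N+1)p+1}}L_2(w\nu^\top,\lambda,\tau).$$

Let $\nu$ be a unit vector with $\|U_i^\top\nu\|^2=D_i/D$ for every $i$ (for example $\nu=D^{-1/2}(1,\dots,1)^\top$). Therefore, for any $(\lambda,\tau)$ satisfying (2.2), the dual objective is upper bounded by

$$
\begin{aligned}
H(\lambda,\tau) \le{}& \min_{w\in\mathbb{R}^{(N+1)p+1}}L_2(w\nu^\top,\lambda,\tau)\\
={}& \min_{w\in\mathbb{R}^{(N+1)p+1}}\Bigl\{\frac{p}{2}\sum_{k=0}^{N}\sum_{i=1}^{p}\frac{D_i}{D}\lambda_{k,i}\,w^\top(u_{k,i-1}u_{k,i-1}^\top+u_{k,i}u_{k,i}^\top)w\\
&\quad+\sum_{k=0}^{N}\sum_{i=1}^{p}\tau_{k,i}\Bigl[\frac{p}{2}\frac{D_i}{D}w^\top u_{k,i}u_{k,i}^\top w+\alpha u_{k,i}^\top w+\frac{p}{2}\sum_{k'p+i'\le kp+i}\frac{D_{i'}}{D}w^\top(u_{k,i}u_{k',i'-1}^\top+u_{k',i'-1}u_{k,i}^\top)w\Bigr]\\
&\quad+\tau_{0,0}\Bigl[\frac{p}{2}\frac{D_i}{D}w^\top u_{0,0}u_{0,0}^\top w+\alpha u_{0,0}^\top w\Bigr]\Bigr\}\\
={}& \max_{t\in\mathbb{R}}\Bigl\{-\frac12 t : w^\top Aw+\alpha\tau^\top w\ge-\frac12 t,\ \forall w\in\mathbb{R}^{(N+1)p+1}\Bigr\}\\
={}& \max_{t\in\mathbb{R}}\Bigl\{-\frac12 t : \begin{pmatrix}A & \frac12\tau\\ \frac12\tau^\top & \frac12 t\end{pmatrix}\succeq 0\Bigr\},
\end{aligned}
$$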

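where

$$
\begin{aligned}
A ={}& \frac{p}{2}\sum_{k=0}^{N}\sum_{i=1}^{p}\frac{D_i}{D}\lambda_{k,i}(u_{k,i-1}u_{k,i-1}^\top+u_{k,i}u_{k,i}^\top)\\
&+\sum_{k=0}^{N}\sum_{i=1}^{p}\tau_{k,i}\Bigl[\frac{p}{2}\frac{D_i}{D}u_{k,i}u_{k,i}^\top+\frac{p}{2}\sum_{k'p+i'\le kp+i}\frac{D_{i'}}{D}(u_{k,i}u_{k',i'-1}^\top+u_{k',i'-1}u_{k,i}^\top)\Bigr]+\tau_{0,0}\frac{p}{2}\frac{D_i}{D}u_{0,0}u_{0,0}^\top.
\end{aligned}
$$

If all the blocks have equal size, that is $D_i=D/p$ for every $i$, then we get

$$
\begin{aligned}
A ={}& \frac12\sum_{k=0}^{N}\sum_{i=1}^{p}\lambda_{k,i}(u_{k,i-1}u_{k,i-1}^\top+u_{k,i}u_{k,i}^\top)\\
&+\sum_{k=0}^{N}\sum_{i=1}^{p}\tau_{k,i}\Bigl[\frac12u_{k,i}u_{k,i}^\top+\sum_{k'p+i'\le kp+i}\frac12(u_{k,i}u_{k',i'-1}^\top+u_{k',i'-1}u_{k,i}^\top)\Bigr]+\tau_{0,0}\frac12u_{0,0}u_{0,0}^\top.
\end{aligned}
$$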

Now we obtain an upper bound for the optimal value of Problem ((P3)):

[Display equation (D) could not be recovered from the source.]

## 3 New bound of BCD

Noting (2.2) and (2), we have

$$\tau=(\tau_{0,0},\tau_{0,1},\dots,\tau_{k,i},\dots,\tau_{N,p})^\top=(\lambda_{0,1},\lambda_{0,2}-\lambda_{0,1},\dots,\lambda_{k,i+1}-\lambda_{k,i},\dots,1-\lambda_{N,p})^\top$$

and becomes

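$$
\begin{pmatrix}
2\lambda_{0,1} & \lambda_{0,2}-\lambda_{0,1} & \cdots & \lambda_{k,i+1}-\lambda_{k,i} & \cdots & \lambda_{N,p}-\lambda_{N,p-1} & 1-\lambda_{N,p}\\
\lambda_{0,2}-\lambda_{0,1} & 2\lambda_{0,2} & & \lambda_{k,i+1}-\lambda_{k,i} & & \lambda_{N,p}-\lambda_{N,p-1} & 1-\lambda_{N,p}\\
\vdots & & \ddots & & & & \vdots\\
\lambda_{k,i+1}-\lambda_{k,i} & \lambda_{k,i+1}-\lambda_{k,i} & & 2\lambda_{k,i} & & \lambda_{N,p}-\lambda_{N,p-1} & 1-\lambda_{N,p}\\
\vdots & & & & \ddots & & \vdots\\
\lambda_{N,p}-\lambda_{N,p-1} & \lambda_{N,p}-\lambda_{N,p-1} & & \lambda_{N,p}-\lambda_{N,p-1} & & 2\lambda_{N,p} & 1-\lambda_{N,p}\\
1-\lambda_{N,p} & 1-\lambda_{N,p} & \cdots & 1-\lambda_{N,p} & \cdots & 1-\lambda_{N,p} & 1
\end{pmatrix}. \tag{3.1}
$$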

According to Appendix C, if we set

$$\lambda_{k,i}=\frac{kp+i}{2(N+1)p+1-kp-i},\quad k=0,\dots,N,\ i=1,\dots,p,\qquad t=\frac{1}{2(N+1)p+1},$$

we obtain the following new upper bound on the complexity of BCD:

###### Theorem 3.1.

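Let $f\in\mathcal{F}_{L,U}(\mathbb{R}^D)$ and let $x_0,\dots,x_N$ be generated by Algorithm 1 with $L_i=L_c$ and $D_i=D/p$ for $i=1,\dots,p$. Then we have

$$f(x_N)-f(x_*)\le\frac{1}{4(N+1)p+2}\,pL_cR^2(x_0). \tag{3.2}$$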
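The multipliers chosen before Theorem 3.1 can be tabulated to confirm that they are dual feasible in the sense of (2.2), that is, that the induced $\tau$ is nonnegative. The sketch below flattens the double index into $j=kp+i$; the function name `multipliers` and the test values of $N$ and $p$ are assumptions made here.

```python
import numpy as np

def multipliers(N, p):
    """Return (lam, tau, t) for the choice lam_{k,i} = (kp+i) / (2(N+1)p + 1 - kp - i)."""
    M = (N + 1) * p
    j = np.arange(1, M + 1)        # flattened index j = k*p + i, running from 1 to (N+1)p
    lam = j / (2.0 * M + 1.0 - j)
    # tau from (2.2): tau_{0,0} = lam_{0,1}, then successive differences of lam,
    # and finally tau_{N,p} = 1 - lam_{N,p}
    tau = np.concatenate(([lam[0]], np.diff(lam), [1.0 - lam[-1]]))
    t = 1.0 / (2.0 * M + 1.0)
    return lam, tau, t

lam, tau, t = multipliers(N=4, p=3)
```

Because $\lambda$ is increasing in $kp+i$ and stays below 1, every entry of $\tau$ is nonnegative, and the entries telescope to a total of 1.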
###### Remark 1.

From the above theorem, we notice that our bound is roughly $16p^3$ times smaller than the known bound (1.4) (with $L_{\max}=L_{\min}=L_c$ and $L=pL_c$)

$$f(x_k)-f(x_*)\le 4L_c(1+p^3)R^2(x_0)\,\frac{1}{k+8/p}.$$
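The size of the improvement can be read off by evaluating both bounds for a large iteration count. In this sketch the function names are hypothetical, and comparing (1.4) at $k=N$ with (3.2) at the same $N$ is an assumption about how the two iteration counters line up; $L_c$ and $R(x_0)$ are set to 1 since they cancel in the ratio.

```python
def known_bound(k, p, L_c, R):
    """Bound (1.4) specialized to L_max = L_min = L_c and L = p * L_c."""
    return 4.0 * L_c * (1.0 + p ** 3) * R ** 2 / (k + 8.0 / p)

def new_bound(N, p, L_c, R):
    """Bound (3.2)."""
    return p * L_c * R ** 2 / (4.0 * (N + 1) * p + 2.0)

p, N = 5, 10 ** 6
ratio = known_bound(N, p, 1.0, 1.0) / new_bound(N, p, 1.0, 1.0)
relative_to_16p3 = ratio / (16 * p ** 3)
```

For large $N$ the ratio tends to $16(1+p^3)$, which is close to $16p^3$ once $p$ is moderately large.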

## 4 Numerical test

Consider the least squares problem

 minx∈RD12∥Ax−b∥2, (4.1)

where $A\in\mathbb{R}^{D\times D}$ and $b\in\mathbb{R}^D$. $A$ is a nonsingular matrix, so obviously the optimal solution of the problem is the vector $A^{-1}b$ and the optimal value is $0$. We consider a partition of the variables into $p$ blocks, each with $D/p$ variables (we assume that $p$ divides $D$). We will also use the notation

$$A=(A_1\ A_2\ \dots\ A_p),$$

where $A_i$ is the submatrix of $A$ comprising the columns corresponding to the $i$-th block, that is, columns $(i-1)D/p+1,\dots,iD/p$.

We consider the following choices of $p$: 2, 5, 20, and 100. The results, together with the classical bound on the convergence rate of the sequence generated by the BCD method, are summarized in Figure 1.
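An experiment of this kind can be sketched as follows. The dimension $D=100$, the Gaussian data, the starting point $x_0=0$, the number of cycles, and the use of a common stepsize $1/L_c$ with $L_c=\max_i\lambda_{\max}(A_i^\top A_i)$ are all assumptions made for illustration, not the settings behind Figure 1. For this quadratic the level set in (1.7) is an ellipsoid, so $R(x_0)=\|Ax_0-b\|/\sigma_{\min}(A)$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_cycles = 100, 200                # assumed sizes, not taken from the text
A = rng.standard_normal((D, D))
b = rng.standard_normal(D)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)  # optimal value is 0 for nonsingular A
L_full = np.linalg.norm(A, 2) ** 2             # Lipschitz constant L of the full gradient
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]

def run(p):
    """Cyclic BCD with p equal blocks; returns f(x_k) and the classical bound (1.4)."""
    blocks = np.split(np.arange(D), p)
    L_c = max(np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks)
    x0 = np.zeros(D)
    R = np.linalg.norm(A @ x0 - b) / sigma_min  # R(x0) for this quadratic
    x, values = x0.copy(), [f(x0)]
    for _ in range(n_cycles):
        for idx in blocks:
            # block gradient step with the common stepsize 1/L_c
            x[idx] -= A[:, idx].T @ (A @ x - b) / L_c
        values.append(f(x))
    k = np.arange(n_cycles + 1)
    # bound (1.4) with L_max = L_min = L_c
    classical = 4 * L_c * (1 + p * L_full ** 2 / L_c ** 2) * R ** 2 / (k + 8 / p)
    return np.array(values), classical

results = {p: run(p) for p in (2, 5, 20, 100)}
```

Plotting `values` against `classical` on a logarithmic scale for each $p$ gives a picture of the gap between the analytical bound and the observed convergence.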

## 5 Conclusion

This paper provides a novel and better analytical convergence bound, roughly $16p^3$ times smaller than the previous best, for the sequence generated by BCD methods for unconstrained smooth convex minimization. Extending this approach to more general settings, such as randomized BCD-type methods or stochastic gradient methods, is important future work. In a broader context, we believe that the current paper could serve as a basis for applying the PEP approach to various BCD-related methods.

## Appendix A Proof of Lemma 1.1

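In this appendix, we give the proof of Lemma 1.1.

For all $x\in\mathbb{R}^D$ and $h_i\in\mathbb{R}^{D_i}$, we have

$$
\begin{aligned}
f(x+U_ih_i) &= f(x)+\int_0^1\langle\nabla f(x+\theta U_ih_i),U_ih_i\rangle\,d\theta\\
&= f(x)+\int_0^1\langle U_i^\top\nabla f(x+\theta U_ih_i),h_i\rangle\,d\theta\\
&= f(x)+\langle\nabla_if(x),h_i\rangle+\int_0^1\langle\nabla_if(x+\theta U_ih_i)-\nabla_if(x),h_i\rangle\,d\theta\\
&\le f(x)+\langle\nabla_if(x),h_i\rangle+\int_0^1\bigl|\langle\nabla_if(x+\theta U_ih_i)-\nabla_if(x),h_i\rangle\bigr|\,d\theta\\
&\le f(x)+\langle\nabla_if(x),h_i\rangle+\int_0^1\|\nabla_if(x+\theta U_ih_i)-\nabla_if(x)\|\,\|h_i\|\,d\theta\\
&\le f(x)+\langle\nabla_if(x),h_i\rangle+\int_0^1\theta L_i\|h_i\|^2\,d\theta\\
&= f(x)+\langle\nabla_if(x),h_i\rangle+\frac{L_i}{2}\|h_i\|^2,
\end{aligned}
$$

where the second inequality follows from the Cauchy-Schwarz inequality and the third inequality follows from (1.2). In short, from the above we have

$$f(x+U_ih_i)\le f(x)+\langle\nabla_if(x),h_i\rangle+\frac{L_i}{2}\|h_i\|^2. \tag{A.1}$$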
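The block-wise upper bound (A.1) can be checked numerically in the same spirit. The least-squares test function, the random data, and the helper name `upper_bound_gap` are assumptions introduced for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
D, p = 12, 4
A = rng.standard_normal((D, D))
b = rng.standard_normal(D)
blocks = np.split(np.arange(D), p)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]  # block Lipschitz constants

def upper_bound_gap(x, i, h):
    """Right-hand side of (A.1) minus f(x + U_i h) for block i."""
    idx = blocks[i]
    x_new = x.copy()
    x_new[idx] += h  # x + U_i h_i: only block i moves
    bound = f(x) + grad(x)[idx] @ h + 0.5 * L[i] * (h @ h)
    return bound - f(x_new)

worst = min(upper_bound_gap(rng.standard_normal(D), i, rng.standard_normal(D // p))
            for i in range(p) for _ in range(100))
```

For a quadratic the gap equals $\tfrac12\bigl(L_i\|h_i\|^2-\|A_ih_i\|^2\bigr)$, so `worst` should be nonnegative up to rounding.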

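Then consider the function $\phi(y)=f(y)-\langle\nabla f(x_0),y\rangle$. The gradient of $\phi$ is $\nabla\phi(y)=\nabla f(y)-\nabla f(x_0)$, which is obviously block-coordinate-wise Lipschitz continuous with constants $L_i$; that is, $\phi$ belongs to the class $\mathcal{F}_{L,U}(\mathbb{R}^D)$, the same as $f$, and $x_0$ is one of its optimal points. Therefore, in view of (A.1), we get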

$$\phi(y+U_ih_i)\le\phi(y)+\langle\nabla_i\phi(y),h_i\rangle+\frac{L_i}{2}\|h_i\|^2. \tag{A.2}$$

Letting $h_i=-\frac{1}{L_i}\nabla_i\phi(y)$ in (A.2), we have

 ϕ(x0) ≤ ϕ(y+Ui(−1Li∇iϕ(y))) ≤ ϕ(y)+⟨∇iϕ(y),−1Li∇iϕ(y)⟩+L