A globally convergent algorithm for nonconvex optimization based on block coordinate update^{†}^{†}thanks: The work is supported in part by NSF DMS1317602 and ARO MURI W911NF0910383.
Abstract
Nonconvex optimization arises in many areas of computational science and engineering. However, most nonconvex optimization algorithms are only known to have local convergence or subsequence convergence properties. In this paper, we propose an algorithm for nonconvex optimization and establish its global convergence (of the whole sequence) to a critical point. In addition, we give its asymptotic convergence rate and numerically demonstrate its efficiency.
In our algorithm, the variables of the underlying problem are either treated as one block or multiple disjoint blocks. It is assumed that each nondifferentiable component of the objective function, or each constraint, applies only to one block of variables. The differentiable components of the objective function, however, can involve multiple blocks of variables together.
Our algorithm updates one block of variables at a time by minimizing a certain proxlinear surrogate, along with an extrapolation to accelerate its convergence. The order of update can be either deterministically cyclic or randomly shuffled for each cycle. In fact, our convergence analysis only needs that each block be updated at least once in every fixed number of iterations. We show its global convergence (of the whole sequence) to a critical point under fairly loose conditions including, in particular, the KurdykaŁojasiewicz (KL) condition, which is satisfied by a broad class of nonconvex/nonsmooth applications. These results, of course, remain valid when the underlying problem is convex.
We apply our convergence results to the coordinate descent iteration for nonconvex regularized linear regression, as well as a modified rankone residue iteration for nonnegative matrix factorization. We show that both applications have global convergence. Numerically, we tested our algorithm on nonnegative matrix and tensor factorization problems, where random shuffling clearly improves to chance to avoid lowquality local solutions.
onconvex optimization, nonsmooth optimization, block coordinate descent, KurdykaŁojasiewicz inequality, proxlinear, whole sequence convergence
1 Introduction
In this paper, we consider (nonconvex) optimization problems in the form of
(1) 
where variable has blocks, , function is continuously differentiable, functions , , are proximable^{1}^{1}1A function is proximable if it is easy to obtain the minimizer of for any input and . but not necessarily differentiable. It is standard to assume that both and are closed and proper and the sets are closed and nonempty. Convexity is not assumed for , , or . By allowing to take the value, can incorporate the constraint since enforcing the constraint is equivalent to minimizing the indicator function of , and can remain proper and closed. Therefore, in the remainder of this paper, we do not include the constraints . The functions can incorporate regularization functions, often used to enforce certain properties or structures in , for example, the nonconvex quasinorm, , which promotes solution sparsity.
Special cases of (1) include the following nonconvex problems: quasinorm () regularized sparse regression problems [40, 10, 32], sparse dictionary learning [1, 38, 57], matrix rank minimization [47], matrix factorization with nonnegativity/sparsity/orthogonality regularization [45, 33, 27], (nonnegative) tensor decomposition [53, 29], and (sparse) higherorder principal component analysis [2].
Due to the lack of convexity, standard analysis tools such as convex inequalities and Fejérmonotonicity cannot be applied to establish the convergence of the iterate sequence. The case becomes more difficult when the problem is nonsmooth. In these cases, convergence analysis of existing algorithms is typically limited to objective convergence (to a possibly nonminimal value) or the convergence of a certain subsequence of iterates to a critical point. (Some exceptions will be reviewed below.) Although wholesequence convergence is almost always observed, it is rarely proved. This deficiency abates some widely used algorithms. For example, KSVD [1] only has nonincreasing monotonicity of its objective sequence, and iterative reweighted algorithms for sparse and lowrank recovery in [17, 39, 32] only has subsequence convergence. Some other methods establish whole sequence convergence by assuming stronger conditions such as local convexity (on at least a part of the objective) and either unique or isolated limit points, which may be difficult to satisfy or to verify. In this paper, we aim to establish whole sequence convergence with conditions that are provably satisfied by a wide class of functions.
Block coordinate descent (BCD) (more precisely, block coordinate update) is very general and widely used for solving both convex and nonconvex problems in the form of (1) with multiple blocks of variables. Since only one block is updated at a time, it has a low periteration cost and small memory footprint. Recent literature [42, 48, 50, 35, 8, 26] has found BCD as a viable approach for “big data” problems.
1.1 Proposed algorithm
In order to solve (1), we propose a block proxlinear (BPL) method, which updates a block of variables at each iteration by minimizing a proxlinear surrogate function. Specifically, at iteration , a block is selected and is updated as follows:
(2) 
where is a stepsize and is the extrapolation
(3) 
where is an extrapolation weight and is the value of before it was updated to . The framework of our method is given in Algorithm 1. At each iteration , only the block is updated.
While we can simply set , appropriate can speed up the convergence; we will demonstrate this in the numerical results below. We can set the stepsize with any , where is the Lipschitz constant of about . When is unknown or difficult to bound, we can apply backtracking on under the criterion:
Special cases
When there is only one block, i.e., , Algorithm 1 reduces to the wellknown (accelerated) proximal gradient method (e.g., [41, 7, 22]). When the update block cycles from 1 through , Algorithm 1 reduces to the cyclic block proximal gradient (CycBPG) method in [56, 8]. We can also randomly shuffle the blocks at the beginning of each cycle. We demonstrate in section 3 that random shuffling leads to better numerical performance. When the update block is randomly selected following the probability , where , Algorithm 1 reduces to the randomized block coordinate descent method (RBCD) (e.g., [42, 48, 35, 36]). Unlike these existing results, we do not assume convexity.
In our analysis, we impose an essentially cyclic assumption — each block is selected for update at least once within every consecutive iterations — otherwise the order is arbitrary. Our convergence results apply to all the above special cases except RBCD, whose convergence analysis requires different strategies; see [42, 48, 35] for the convex case and [36] for the nonconvex case.
1.2 KurdykaŁojasiewicz property
To establish whole sequence convergence of Algorithm 1, a key assumption is the KurdykaŁojasiewicz (KL) property of the objective function .
A lot of functions are known to satisfy the KL property. Recent works [4, section 4] and [56, section 2.2] give many specific examples that satisfy the property, such as the (quasi)norm with , any piecewise polynomial functions, indicator functions of polyhedral set, orthogonal matrix set, and positive semidefinite cone, matrix rank function, and so on. {definition}[KurdykaŁojasiewicz property] A function satisfies the KL property at point if there exist , a neighborhood , and a concave function for some and such that the KL inequality holds
(4) 
where and .
The KL property was introduced by Łojasiewicz [34] for real analytic functions. Kurdyka [31] extended it to functions of the minimal structure. Recently, the KL inequality (4) was further extended to nonsmooth subanalytic functions [11]. The work [12] characterizes the geometric meaning of the KL inequality.
1.3 Related literature
There are many methods that solve general nonconvex problems. Methods in the papers [21, 15, 18, 6], the books [9, 43], and in the references therein, do not break variables into blocks. They usually have the properties of local convergence or subsequence convergence to a critical point, or global convergence in the terms of the violation of optimality conditions. Next, we review BCD methods.
BCD has been extensively used in many applications. Its original form, block coordinate minimization (BCM), which updates a block by minimizing the original objective with respect to that block, dates back to the 1950’s [24] and is closely related to the GaussSeidel and SOR methods for linear equation systems. Its convergence was studied under a variety of settings (cf. [23, 51, 46] and the references therein). The convergence rate of BCM was established under the strong convexity assumption [37] for the multiblock case and under the general convexity assumption [8] for the twoblock case. To have even cheaper updates, one can update a block approximately, for example, by minimizing an approximate objective like was done in (2), instead of sticking to the original objective. The work [52] is a block coordinate gradient descent (BCGD) method where taking a block gradient step is equivalent to minimizing a certain proxlinear approximation of the objective. Its whole sequence convergence and local convergence rate were established under the assumptions of a socalled local Lipschitzian error bound and the convexity of the objective’s nondifferentiable part. The randomized block coordinate descent (RBCD) method in [42, 36] randomly chooses the block to update at each iteration and is not essentially cyclic. Objective convergence was established [42, 48], and the violation of the firstorder optimization condition was shown to converge to zero [36]. There is no iterate convergence result for RBCD.
Some special cases of Algorithm 1 have been analyzed in the literature. The work [56] uses cyclic updates of a fixed order and assumes blockwise convexity; [13] studies two blocks without extrapolation, namely, and in (2). A more general result is [5, Lemma 2.6], where three conditions for whole sequence convergence are given and are met by methods including averaged projection, proximal point, and forwardbackward splitting. Algorithm 1, however, does not satisfy the three conditions in [5].
The extrapolation technique in (3) has been applied to accelerate the (block) proxlinear method for solving convex optimization problems (e.g., [41, 7, 48, 35]). Recently, [56, 22] show that the (block) proxlinear iteration with extrapolation can still converge if the nonsmooth part of the problem is convex, while the smooth part can be nonconvex. Because of the convexity assumption, their convergence results do not apply to Algorithm 1 for solving the general nonconvex problem (1).
1.4 Contributions
We summarize the main contributions of this paper as follows.

We propose a block proxlinear (BPL) method for nonconvex smooth and nonsmooth optimization. Extrapolation is used to accelerate it. To our best knowledge, this is the first work of proxlinear acceleration for fully nonconvex problems (where both smooth and nonsmooth terms are nonconvex) with a convergence guarantee. However, we have not proved any improved convergence rate.

Assuming essentially cyclic updates of the blocks, we obtain the whole sequence convergence of BPL to a critical point with rate estimates, by first establishing subsequence convergence and then applying the KurdykaŁojasiewicz (KL) property. Furthermore, we tailor our convergence analysis to several existing algorithms, including nonconvex regularized linear regression and nonnegative matrix factorization, to improve their existing convergence results.

We numerically tested BPL on nonnegative matrix and tensor factorization problems. At each cycle of updates, the blocks were randomly shuffled. We observed that BPL was very efficient and that random shuffling avoided local solutions more effectively than the deterministic cyclic order.
1.5 Notation and preliminaries
We restrict our discussion in equipped with the Euclidean norm, denoted by . However, all our results extend to general of primal and dual norm pairs. The lowercase letter is reserved for the number of blocks and for various Lipschitz constants. is short for , for , and for . We simplify to . The distance of a point to a set is denoted by
Since the update may be aperiodic, extra notation is used for when and how many times a block is updated. Let denote the set of iterations in which the th block has been selected to update till the th iteration:
(5) 
and let
which is the number of times the th block has been updated till iteration . For we have and .
Let be the value of after the th iteration, and for each block , be the value of after its th update. By letting , we have .
The extrapolated point in (2) (for ) is computed from the last two updates of the same block:
(6) 
for some weight . We partition the set of Lipschitz constants and the extrapolation weights into disjoint subsets as
(7a)  
(7b) 
Hence, for each block , we have three sequences:
(8a)  
(8b)  
(8c) 
For simplicity, we take stepsizes and extrapolation weights as follows
(9) 
However, if the problem (1) has more structures such as block convexity, we can use larger and ; see Remark 2.2. Table 1 summarizes the notation. In addition, we initialize .
Notion  Definition 

the total number of blocks  
the update block selected at the th iteration  
the set of iterations up to in which is updated; see (5)  
: the number of updates to within the first iterations  
the value of after the th iteration  
the value of after its th update; see (8a)  
the gradient Lipschitz constant of the update block at the th iteration; see (11)  
the gradient Lipschitz constant of block at its th update; see (7a) and (8b)  
the extrapolation weight used at the th iteration  
the extrapolation weight used at the th update of ; see (7b) and (8c) 
We make the following definitions, which can be found in [49].
[Limiting Fréchet subdifferential[30]] A vector is a Fréchet subgradient of a lower semicontinuous function at if
The set of Fréchet subgradient of at is called Fréchet subdifferential and denoted as . If , then .
The limiting Fréchet subdifferential is denoted by and defined as
If is differentiable^{2}^{2}2A function on is differentiable at point if there exists a vector such that at , then ; see [30, Proposition 1.1] for example, and if is convex, then We use the limiting subdifferential for general nonconvex nonsmooth functions. For problem (1), it holds that (see [4, Lemma 2.1] or [49, Prop. 10.6, pp. 426])
(10) 
where denotes the Cartesian product of and . {definition}[Critical point] A point is called a critical point of if .
[Proximal mapping] For a proper, lower semicontinuous function , its proximal mapping is defined as
As is nonconvex, is generally setvalued. Using this notation, the update in (2) can be written as (assume )
1.6 Organization
2 Convergence analysis
In this section, we analyze the convergence of Algorithm 1. Throughout our analysis, we make the following assumptions.
Assumption 1
is proper and lower bounded in , is continuously differentiable, and is proper lower semicontinuous for all . Problem (1) has a critical point , i.e., .
Assumption 2
Let . has Lipschitz continuity constant with respect to , i.e.,
(11) 
and there exist constants such that for all .
Assumption 3 (Essentially cyclic block update)
In Algorithm 1, within any consecutive iterations, every block is updated at least one time.
Our analysis proceeds with several steps. We first estimate the objective decrease after every iteration (see Lemma 2) and then establish a square summable result of the iterate differences (see Proposition 2.1). Through the square summable result, we show a subsequence convergence result that every limit point of the iterates is a critical point (see Theorem 2.1). Assuming the KL property (see Definition 1.2) on the objective function and the following monotonicity condition, we establish whole sequence convergence of our algorithm and also give estimate of convergence rate (see Theorems 2.2 and 2.2).
Condition 2.1 (Nonincreasing objective)
The weight is chosen so that .
We will show that a range of nontrivial always exists to satisfy Condition 2.1 under a mild assumption, and thus one can backtrack to ensure . Maintaining the monotonicity of can significantly improve the numerical performance of the algorithm, as shown in our numerical results below and also in [44, 55]. Note that subsequence convergence does not require this condition.
We begin our analysis with the following lemma. The proofs of all the lemmas and propositions are given in Appendix A.
Remark 2.1
We can relax the choices of and in (9). For example, we can take , and for any and some . Then, (12) and (13) hold with . In addition, if (not necessary ), (12) holds with positive and , and the extrapolation weights satisfy for some , then all our convergence results below remain valid.
Note that for and . Adopting the convention that when , we can write (13) into
(14) 
which will be used in our subsequent convergence analysis.
Remark 2.2
If is block multiconvex, i.e., it is convex with respect to each block of variables while keeping the remaining variables fixed, and is convex for all , then taking , we have (12) holds with and ; see the proof in Appendix A. In this case, we can take for some , and all our convergence results can be shown through the same arguments.
2.1 Subsequence convergence
Using Lemma 2, we can have the following result, through which we show subsequence convergence of Algorithm 1. {proposition}[Square summable] Let be generated from Algorithm 1 with and taken from (9). We have
(15) 
[Subsequence convergence] Under Assumptions 1 through 3, let be generated from Algorithm 1 with and taken from (9). Then any limit point of is a critical point of (1). If the subsequence converges to , then
(16) 
Remark 2.3
The existence of finite limit point is guaranteed if is bounded, and for some applications, the boundedness of can be satisfied by setting appropriate parameters in Algorithm 1; see examples in section 3. If ’s are continuous, (16) immediately holds. Since we only assume lower semicontinuity of ’s, may not converge to as , so (16) is not obvious.
Assume is a limit point of . Then there exists an index set so that the subsequence converging to . From (15), we have and thus for any . Define
Take an arbitrary . Note is an infinite set according to Assumption 3. Taking another subsequence if necessary, converges to some as . Note that since , for any ,
(17) 
Note from (15) and (6) that as . Since is continuously differentiable and is lower semicontinuous, letting in (17) yields
Hence,
and satisfies the firstorder optimality condition:
(18) 
Since (18) holds for arbitrary , is a critical point of (1).
In addition, (17) implies
Taking limit superior on both sides of the above inequality over gives Since is lower semicontinuous, it holds , and thus
Noting that is continuous, we complete the proof.
2.2 Whole sequence convergence and rate
In this subsection, we establish the whole sequence convergence and rate of Algorithm 1 by assuming Condition 2.1. We first show that under mild assumptions, Condition 2.1 holds for certain .
Let . Assume is singlevalued near and
(19) 
namely, progress can still be made by updating the th block. Then, there is such that for any , we have .
By this proposition, we can find through backtracking to maintain the monotonicity of . All the examples in section 3 satisfy the assumptions of Proposition 2.2. The proof of Proposition 2.2 involves the continuity of and is deferred to Appendix A.4.
Under Condition 2.1 and the KL property of (Definition 1.2), we show that the sequence converges as long as it has a finite limit point. We first establish a lemma, which has its own importance and together with the KL property implies Lemma 2.6 of [5].
The result in Lemma 2.2 below is very general because we need to apply it to Algorithm 1 in its general form. To ease understanding, let us go over its especial cases. If , and , then (21) below with and reduces to , which together with Young’s inequality gives . Hence, if is summable, so will be . This result can be used to analyze the proxlinear method. The more general case of , and applies to the cyclic block proxlinear method. In this case, (21) reduces to which together with the Young’s inequality implies
(20) 
where is sufficiently large so that . Less obviously but still, if is summable, so will be . Finally, we will need in (21) to analyze the accelerated block proxlinear method. {lemma} For nonnegative sequences , and , if
and
(21) 
where , and are nonnegative integer sequences satisfying: , for some integer . Then we have
(22) 
In addition, if , , and (21) holds for all , then we have
(23) 
The proof of this lemma is given in Appendix A.5
Remark 2.4
To apply (21) to the convergence analysis of Algorithm 1, we will use for and relate to Lipschitz constant . The second term in the bracket of the left hand side of (21) is used to handle the extrapolation used in Algorithm 1, and we require such that the first term can dominate the second one after summation.
We also need the following result.{proposition} Let be generated from Algorithm 1. For a specific iteration , assume for some and . If for each , is Lipschitz continuous with constant within with respect to , i.e.,
then
(24) 
We are now ready to present and show the whole sequence convergence of Algorithm 1. {theorem}[Whole sequence convergence] Suppose that Assumptions 1 through 3 and Condition 2.1 hold. Let be generated from Algorithm 1. Assume

has a finite limit point ;

satisfies the KL property (4) around with parameters , and .

For each , is Lipschitz continuous within with respect to .
Then
Remark 2.5
Before proving the theorem, let us remark on the conditions 1–3. The condition 1 can be guaranteed if has a bounded subsequence. The condition 2 is satisfied for a broad class of applications as we mentioned in section 1.2. The condition 3 is a weak assumption since it requires the Lipschitz continuity only in a bounded set.