
On the Sublinear Convergence of Randomly Perturbed Alternating Gradient Descent to Second Order Stationary Solutions

Abstract

Alternating gradient descent (AGD) is a simple but popular algorithm that has been applied to problems in optimization, machine learning, data mining, and signal processing, among others. The algorithm updates two blocks of variables in an alternating manner: a gradient step is taken on one block while the other block is kept fixed. When the objective function is nonconvex, it is well known that AGD converges to first-order stationary solutions at a global sublinear rate.

In this paper, we show that a variant of AGD-type algorithms will not be trapped by “bad” stationary solutions such as saddle points and local maximum points. In particular, we consider a smooth unconstrained optimization problem and propose a perturbed AGD (PA-GD) which converges (with high probability) to the set of second-order stationary solutions (SS2) at a global sublinear rate. To the best of our knowledge, this is the first alternating-type algorithm that reaches SS2 with high probability with an iteration complexity whose dependence on the problem dimension is only polylogarithmic [where polylog denotes a polynomial of the logarithm of the dimension of the problem].

1 Introduction

In this paper, we consider the smooth and unconstrained nonconvex optimization problem

$\min_{\mathbf{z} \in \mathbb{R}^{d}} \; f(\mathbf{z}), \qquad \mathbf{z} \triangleq (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}, \qquad\qquad (1)$

where f is twice differentiable.

There are many ways of solving problem (1), such as gradient descent (GD) and accelerated gradient methods. When the problem dimension is large, it is natural to split the variables into multiple blocks and solve the resulting smaller subproblems individually. The block coordinate descent (BCD) algorithm and many of its variants, such as block coordinate gradient descent (BCGD) and alternating gradient descent (AGD) Bertsekas [1999]; Li and Liang [2017], are among the most powerful tools for solving large-scale convex/nonconvex optimization problems Nesterov [2012]; Beck and Tetruashvili [2013]; Razaviyayn et al. [2013]; Hong et al. [2017]. BCD-type algorithms partition the optimization variables into multiple small blocks and optimize the blocks one by one following a certain block selection rule, such as the cyclic rule Tseng [2001] or the Gauss-Southwell rule Tseng and Yun [2009]. One cyclic sweep of BCGD is sketched below.
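As a generic illustration of this update pattern (the block partition and the partial-gradient oracle below are placeholders rather than a specific problem instance), one cyclic BCGD sweep can be sketched as follows:

```python
import numpy as np

def bcgd_sweep(grad_block, x, blocks, eta):
    """One cyclic block-coordinate gradient-descent sweep (generic sketch).

    grad_block(x, i) is a placeholder oracle returning the partial gradient
    of the objective with respect to block i, with the other blocks fixed;
    `blocks` is a list of index arrays partitioning the variables in x.
    """
    x = x.copy()                              # keep the caller's iterate intact
    for i, idx in enumerate(blocks):          # cyclic block selection rule
        x[idx] -= eta * grad_block(x, i)      # gradient step on block i only
    return x
```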

In recent years, BCD-type algorithms have found many applications in machine learning and data mining, such as matrix factorization Zhao et al. [2015]; Lu et al. [2017a, b], tensor decomposition, matrix completion/decomposition Xu and Yin [2013]; Jain et al. [2013], and training deep neural networks (DNNs) Zhang and Brand [2017]. Under relatively mild conditions, the convergence of BCD-type algorithms to first-order stationary solutions (SS1) has been broadly investigated for nonconvex and non-differentiable optimization Tseng [2001]; Grippo and Sciandrone [2000]. In particular, it is known that under mild conditions these algorithms also achieve global sublinear rates Razaviyayn et al. [2014]. However, despite their popularity and significant recent progress in understanding their behavior, it remains unclear whether BCD-type algorithms can converge to the set of second-order stationary solutions (SS2) with a provable global rate, even for the simplest problem with two blocks of variables.

1.1 Motivation

Algorithms that can escape from strict saddle points – stationary points at which the Hessian has a negative eigenvalue – have wide applications. Many recent works have analyzed the saddle points arising in machine learning problems Kawaguchi [2016]. For example, when learning shallow networks, the stationary points are either global minimum points or strict saddle points. In two-layer porcupine neural networks (PNNs), it has been shown that most local optima of PNN optimization problems are also global optima Feizi et al. [2017]. Previous work Ge et al. [2015] has shown that the saddle points in tensor decomposition are indeed strict saddle points. It has also been shown, both theoretically and numerically, that all saddle points are strict in dictionary learning and phase retrieval problems Sun et al. [2015, 2017]; Wang et al. [2017b, a]. More recently, Ge et al. [2017] proposed a unified analysis of saddle points for a broad class of low-rank matrix factorization problems and proved that these saddle points are strict.

1.2 Related Work

Many recent works have focused on the performance analysis and/or design of algorithms with convergence guarantees to local minimum points/SS2 for nonconvex optimization problems. These include the trust region method Conn et al. [2000], the cubic regularized Newton's method Nesterov and Polyak [2006]; Carmon and Duchi [2016], and mixed first-order/second-order approaches Reddi et al. [2017]. However, these algorithms typically require second-order information, and therefore incur high computational complexity when the problem dimension becomes large.

There has been a line of work on stochastic gradient descent algorithms in which properly scaled Gaussian noise is added to the gradient iterates at each step [also known as stochastic gradient Langevin dynamics (SGLD)]. Theoretical works have pointed out that SGLD not only converges to local minimum points asymptotically but also may escape from local minima Zhang et al. [2017]; Raginsky et al. [2017]. Unfortunately, these algorithms require a large number of iterations to reach an optimal point. There are also fruitful results showing that some carefully designed algorithms can escape from strict saddle points efficiently, such as negative-curvature-originated-from-noise (Neon) Xu and Yang [2017], Neon2 Allen-Zhu and Li [2017], the related method of Xu et al. [2017], and gradient descent with one-step escaping (GOSE) Yu et al. [2017]. Neon-type algorithms utilize stochastic first-order updates to find a negative-curvature direction, while GOSE saves computation by performing a single negative-curvature descent step (with an eigenvector calculation) only when the iterates are near a saddle point.

On the other hand, there is also a line of work analyzing deterministic GD-type methods. With random initialization, GD has been shown to converge only to SS2 for unconstrained smooth problems Lee et al. [2016]. More recently, block coordinate descent, block mirror descent, and proximal block coordinate descent have been proven to almost always converge to SS2 under random initialization Lee et al. [2017], but no convergence rate was reported. Unfortunately, a follow-up study showed that GD can require exponential time to escape from saddle points for certain pathological problems Du et al. [2017]. Occasionally adding noise to the iterates of the algorithm is another way of finding the negative curvature. A perturbed version of GD (PGD) has been proposed with convergence guarantees to SS2 Jin et al. [2017a], with a faster provable convergence rate than ordinary gradient descent with random initialization. Furthermore, an accelerated version (PAGD) was proposed in Jin et al. [2017b], which has the fastest known convergence rate among Hessian-free algorithms.


Algorithm    Iterations to reach an ε-SS2
SGD Ge et al. [2015]
SGLD Zhang et al. [2017]
Neon+SGD Xu and Yang [2017]
Neon+Natasha Xu and Yang [2017]
Neon2+SGD Allen-Zhu and Li [2017]
Xu et al. [2017]
PGD Jin et al. [2017a]
PAGD Jin et al. [2017b]
PA-GD/PA-PP (This work)
Table 1: Convergence rates of algorithms that reach SS2 using only first-order information, where the notation hides a polylog factor in the problem dimension.

1.3 Scope of This Paper

In this work, we consider a smooth unconstrained optimization problem and develop a perturbed AGD algorithm (PA-GD) which converges (with high probability) to the set of SS2 at a global sublinear rate. Our work is inspired by Jin et al. [2017a]; Ge et al. [2015], which developed novel perturbed GD methods that escape from strict saddle points. Similarly to Jin et al. [2017a], we divide the iterates into three types of points: those whose gradients are large, those that are (approximate) local minima, and those that are strict saddle points. At a given point, when the gradient is large enough, we simply apply the ordinary AGD update. When the gradient norm is small, so that the point may be either a strict saddle point or a local minimum, a perturbation is added to the iterates to help them escape from saddle points.

As discussed above, many works have been developed to make use of negative-curvature information around saddle points. Unfortunately, these techniques cannot be directly applied to BCD/AGD-type algorithms. The key challenge is that at each iteration only part of the variables is updated; therefore we only have access to partial second-order information at the points of interest. For example, consider a quadratic objective function as shown in Figure 1. While fixing one block, the problem is strongly convex with respect to the other block, yet the overall problem is nonconvex. Even if the iterates converge, within each block, to the block-wise minimizers, the resulting stationary point can still be a saddle point of the overall objective function. Therefore, analyzing how AGD-type algorithms can exploit negative curvature is one of the main tasks of this paper.
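As a minimal instance of this phenomenon (chosen here purely for illustration), consider

$f(x, y) = \tfrac{1}{2}x^{2} + \tfrac{1}{2}y^{2} + 2xy,$

which is strongly convex in x for any fixed y and vice versa, yet its Hessian has eigenvalues 3 and −1, so the origin is a strict saddle point of the joint problem.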

To the best of our knowledge, there is no prior work that modifies AGD algorithms to escape from strict saddle points with any provable convergence rate. The main contributions of this work are as follows.

1.4 Contributions of This Work

In this paper, we design and analyze a perturbed AGD (PA-GD) algorithm for solving unconstrained nonconvex problems. Through the perturbation of the AGD iterates, the algorithm is guaranteed to converge to the set of SS2 of a nonconvex problem with high probability. By utilizing matrix perturbation theory, the convergence rate of the proposed algorithm is also established, showing that the algorithm reaches an ε-SS2 with high probability at a global sublinear rate. Moreover, given the strong relation between GD and the proximal point algorithm, we also study a perturbed alternating proximal point (PA-PP) algorithm with random perturbation. By leveraging the techniques proposed in this paper, we show that PA-PP, which may not need to calculate the gradient at each step, converges as fast as PA-GD in order. A comparison of algorithms that use only first-order information to escape from strict saddle points is summarized in Table 1.

The main contributions of the paper are highlighted below:

  1. To the best of our knowledge, this is the first convergence analysis showing that variants of AGD (using only first-order information) can converge to SS2 for nonconvex optimization problems.

  2. The convergence rate of the perturbed AGD algorithm is analyzed, where the choice of the step size depends only on the maximum block-wise Lipschitz constant rather than on the Lipschitz constant over all variables. This is one of the major differences between GD and AGD.

  3. By further extending the analysis in this paper, we show that PA-PP can also escape from strict saddle points efficiently, at the same rate in order.

2 Preliminaries

2.1 Notation

Bold upper-case letters without subscripts denote matrices and bold lower-case letters without subscripts denote vectors. A subscripted bold lower-case letter denotes the corresponding block of a vector. We use ∇_x f and ∇_y f to denote the partial gradients with respect to the blocks x and y, respectively, while the remaining block is fixed. B(z, r) denotes a ball centered at z with radius r, and λ_min(·), λ_max(·) denote the smallest and largest eigenvalues of a matrix, respectively.

2.2 Definitions

The objective function has the following properties.

Definition 1.

A differentiable function f is L-smooth (i.e., its gradient is uniformly Lipschitz continuous with constant L) if the difference of its gradients at any two points is bounded by L times the distance between those points.

The function is called block-wise smooth if each partial gradient is Lipschitz continuous with respect to its own block variable while the other block is held fixed, with block-wise gradient Lipschitz constants denoted here by L_x and L_y.

Further, let L_max ≜ max{L_x, L_y}.
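Written out explicitly in the two-block notation of problem (1), these standard conditions read

$\|\nabla f(\mathbf{z}_1) - \nabla f(\mathbf{z}_2)\| \le L\, \|\mathbf{z}_1 - \mathbf{z}_2\|, \qquad \forall\, \mathbf{z}_1, \mathbf{z}_2,$

$\|\nabla_{\mathbf{x}} f(\mathbf{x}_1, \mathbf{y}) - \nabla_{\mathbf{x}} f(\mathbf{x}_2, \mathbf{y})\| \le L_{\mathbf{x}}\, \|\mathbf{x}_1 - \mathbf{x}_2\|, \qquad \|\nabla_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}_1) - \nabla_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}_2)\| \le L_{\mathbf{y}}\, \|\mathbf{y}_1 - \mathbf{y}_2\|,$

where the block constants L_x and L_y are named here for concreteness.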

Definition 2.

For a differentiable function f, if ∇f(z) = 0, then z is a first-order stationary point (SS1). If ‖∇f(z)‖ ≤ ε, then z is an ε-first-order stationary point.

Definition 3.

For a differentiable function f, if z is an SS1 and there exists ε > 0 such that f(z) ≤ f(z′) for any z′ in the ε-neighborhood of z, then z is a local minimum. A saddle point is an SS1 that is not a local minimum. If λ_min(∇²f(z)) < 0, then z is a strict (non-degenerate) saddle point.

Definition 4.

A twice-differentiable function f is ρ-Hessian Lipschitz if

$\|\nabla^2 f(\mathbf{z}_1) - \nabla^2 f(\mathbf{z}_2)\| \le \rho\, \|\mathbf{z}_1 - \mathbf{z}_2\|, \quad \forall\, \mathbf{z}_1, \mathbf{z}_2. \qquad\qquad (2)$
Definition 5.

For a ρ-Hessian Lipschitz function f, z is a second-order stationary point (SS2) if ∇f(z) = 0 and λ_min(∇²f(z)) ≥ 0. If the following holds

(3)

for a threshold determined by ε and ρ, then z is an ε-SS2.
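One standard instantiation of this condition, following Nesterov and Polyak [2006]; Jin et al. [2017a], is

$\|\nabla f(\mathbf{z})\| \le \epsilon \qquad \text{and} \qquad \lambda_{\min}\!\big(\nabla^2 f(\mathbf{z})\big) \ge -\sqrt{\rho\,\epsilon},$

stated here for concreteness; the thresholds in (3) are parameterized by ε and ρ in the same spirit.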

Assumption 1.

The function f is L-smooth, block-wise smooth with the gradient Lipschitz constants of Definition 1, and ρ-Hessian Lipschitz.

Input: initial point, step size, perturbation radius, and the gradient-norm, objective-decrease, and iteration-count thresholds
for each iteration do
     if the gradient norm is below its threshold and no perturbation has been added within the last threshold number of iterations then
          record the current iterate and its objective value
          add a perturbation drawn uniformly from the ball of the given radius centered at the current iterate
     end if
     if the threshold number of iterations has elapsed since the last perturbation and the objective value has not decreased sufficiently then
          return the iterate recorded before the last perturbation
     end if
     for each block do
          take a gradient step on this block while keeping the other block fixed, as in (4)
     end for
end for
Algorithm 1 Perturbed Alternating Gradient Descent (PA-GD)

3 Perturbed Alternating Gradient Descent

3.1 Algorithm Description

AGD is a classical algorithm that optimizes the variables of an optimization problem in an alternating manner Bertsekas [1999], meaning that when one block of variables is updated, the remaining block is fixed at its previous value. Mathematically, the iterates of AGD are updated by the following rule

$\mathbf{x}^{r+1} = \mathbf{x}^{r} - \eta\, \nabla_{\mathbf{x}} f(\mathbf{x}^{r}, \mathbf{y}^{r}), \qquad \mathbf{y}^{r+1} = \mathbf{y}^{r} - \eta\, \nabla_{\mathbf{y}} f(\mathbf{x}^{r+1}, \mathbf{y}^{r}), \qquad\qquad (4)$

where the superscript r denotes the iteration counter and η > 0 is the step size. AGD can be considered as a special case of block coordinate gradient descent Nesterov [2012]; Beck and Tetruashvili [2013].

Our proposed algorithm is based on AGD, modified in a way similar to the recent work [Jin et al., 2017a], which adds occasional noise to the GD iterates (PGD). The details of the implementation of PA-GD are shown in Algorithm 1, whose constants are determined by the step-size condition, the gap between the objective value at the initial point and the global optimal value, and a predefined target error.

Figure 1: Contours of the objective value and the trajectory (in pink) of PA-GD initialized near a strict saddle point. The objective function is a block-wise convex but jointly nonconvex quadratic, and the lengths of the arrows indicate the strength of the corresponding projections onto the two block coordinate directions.

In each update of the variables, one block gradient descent step is taken, and then the algorithm proceeds to the next block. If the algorithm achieves a sufficient decrease of the objective value, this indicates that it is converging to a good solution. Otherwise, a perturbation may be needed to help the iterates escape from saddle points. If, after the perturbation, the objective value does not decrease sufficiently within a prescribed number of further iterations, the algorithm terminates and returns the iterate saved before the last perturbation.

To illustrate the practical behavior of the algorithm, we provide an example showing the trajectory of AGD after a small perturbation at a stationary point. In Figure 1, the marked point is clearly an SS1 and also a strict saddle point, since the Hessian there has one positive and one negative eigenvalue. When one block is fixed, the function is convex with respect to the other block and vice versa; however, the overall objective function is nonconvex. It can be observed that PA-GD escapes from the strict saddle point efficiently. A Python sketch of the procedure is given below.
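To make the procedure concrete, the following is a minimal Python sketch of the PA-GD template in Algorithm 1, patterned after the perturbed-GD scheme of Jin et al. [2017a]; the threshold names and values (g_thres, f_thres, t_thres) and the perturbation radius below are illustrative placeholders, not the tuned constants of Theorem 1.

```python
import numpy as np

def pa_gd(grad_x, grad_y, f, x, y, eta, r_pert, g_thres, f_thres, t_thres,
          max_iter=10_000, seed=0):
    """Minimal sketch of perturbed alternating gradient descent (PA-GD).

    grad_x, grad_y : callables returning the two partial gradients of f
    eta            : step size (e.g., of the order of 1/L_max)
    r_pert         : radius of the perturbation ball
    g_thres        : gradient-norm threshold that triggers a perturbation
    f_thres        : required objective decrease after a perturbation round
    t_thres        : number of iterations allowed after a perturbation
    All thresholds are illustrative, not the constants of Theorem 1.
    """
    rng = np.random.default_rng(seed)
    t_noise = -t_thres - 1                  # iteration of the last perturbation
    x_tilde, y_tilde, f_tilde = x, y, f(x, y)
    for t in range(max_iter):
        g = np.concatenate([grad_x(x, y), grad_y(x, y)])
        # Small gradient and no recent perturbation: save the iterate and
        # perturb it uniformly inside a ball of radius r_pert.
        if np.linalg.norm(g) <= g_thres and t - t_noise > t_thres:
            x_tilde, y_tilde, f_tilde, t_noise = x, y, f(x, y), t
            d = rng.standard_normal(x.size + y.size)
            d *= rng.uniform() ** (1.0 / d.size) * r_pert / np.linalg.norm(d)
            x, y = x + d[:x.size], y + d[x.size:]
        # No sufficient decrease t_thres iterations after the perturbation:
        # return the iterate saved just before that perturbation.
        if t - t_noise == t_thres and f(x, y) - f_tilde > -f_thres:
            return x_tilde, y_tilde
        # One alternating (Gauss-Seidel) gradient sweep, cf. (4).
        x = x - eta * grad_x(x, y)
        y = y - eta * grad_y(x, y)
    return x, y
```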

3.2 Convergence Rate Analysis

Although PA-GD updates the variables in a different way from GD, we show that, with suitable perturbations, it can still escape from strict saddle points with high probability. The main theorem is presented as follows.

Theorem 1.

Under Assumption 1, there exists an absolute constant such that, for any target accuracy ε and any choice of the algorithm constants as specified in Algorithm 1, with high probability the iterates generated by PA-GD reach an ε-SS2 satisfying the conditions in Definition 5

in the following number of iterations:

(5)

where f* denotes the global minimum value of the objective function and the remaining constants are as defined above.

Remark 1. When a smaller step size is used, the convergence rate of PA-GD becomes

(6)

This shows that with a smaller step size the convergence rate of PA-GD improves (with smaller constants), since the linear dependencies of the corresponding terms in (5) both disappear. This property is consistent with the known behavior of BCD in convex optimization, i.e., a smaller step size can lead to a better rate; see, e.g., [Sun and Hong, 2015, Theorem 2.1].

4 Perturbed Alternating Proximal Point

In many applications, AGD may not be efficient, in the sense that the gradient updates within each block may converge very slowly. For example, consider the matrix factorization problem in which a given data matrix is approximated by the product of two block variables. For this problem, the alternating least squares algorithm (which exactly minimizes over each block) is faster than AGD, which only takes gradient steps.

In this section, we consider the classical proximal point algorithm Parikh et al. [2014], in which each block of variables is exactly minimized with respect to a certain quadratic surrogate. To be specific, we replace (4) in Algorithm 1 by

(7)

where a positive penalty parameter weights the quadratic term. The iteration can be explicitly written as

(8)

which has a form similar to the PA-GD update, but with a step size determined by the penalty parameter and with the gradient evaluated at the new iterate. The resulting algorithm, detailed in Algorithm 2, is referred to as the perturbed alternating proximal point (PA-PP) algorithm. It is worth noting that when a block subproblem is already convex, the penalty parameter only needs to be a small number to make the corresponding subproblem strongly convex. This property is useful in practice.
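As a minimal sketch of one unperturbed PA-PP sweep (assuming, for concreteness, that the quadratic penalty in (7) is weighted by gamma/2 for a penalty parameter gamma; the perturbation and termination logic is identical to Algorithm 1):

```python
import numpy as np
from scipy.optimize import minimize

def pa_pp_sweep(f, x, y, gamma):
    """One unperturbed alternating proximal-point sweep (illustrative sketch).

    Each block is exactly minimized over f plus a quadratic penalty that keeps
    the new iterate close to the previous one; the gamma/2 weighting is an
    assumed parameterization of the penalty in (7).
    """
    # x-block:  x <- argmin_u f(u, y) + (gamma/2) * ||u - x||^2
    x_new = minimize(lambda u: f(u, y) + 0.5 * gamma * np.sum((u - x) ** 2), x).x
    # y-block:  y <- argmin_v f(x_new, v) + (gamma/2) * ||v - y||^2
    y_new = minimize(lambda v: f(x_new, v) + 0.5 * gamma * np.sum((v - y) ** 2), y).x
    return x_new, y_new
```

Under this parameterization, the first-order optimality condition of each subproblem recovers a gradient step of size 1/gamma evaluated at the new iterate, which is the form referred to in (8).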

Next, we can also give the convergence rate of PA-PP.

Input: initial point, penalty parameter, perturbation radius, and the gradient-norm, objective-decrease, and iteration-count thresholds
for each iteration do
     if the gradient norm is below its threshold and no perturbation has been added within the last threshold number of iterations then
          record the current iterate and its objective value
          add a perturbation drawn uniformly from the ball of the given radius centered at the current iterate
     end if
     if the threshold number of iterations has elapsed since the last perturbation and the objective value has not decreased sufficiently then
          return the iterate recorded before the last perturbation
     end if
     for each block do
          exactly minimize the objective plus the quadratic penalty with respect to this block while keeping the other block fixed, as in (7)
     end for
end for
Algorithm 2 Perturbed Alternating Proximal Point (PA-PP)
Corollary 1.

Under Assumption 1, there exists an absolute constant such that, for any target accuracy ε and any choice of the algorithm constants as specified in Algorithm 2, with high probability the iterates generated by PA-PP reach an ε-SS2 satisfying the conditions in Definition 5

in the following number of iterations:

where f* denotes the global minimum value of the objective function and the remaining constants are as defined above.

Compared with Theorem 1, one term is removed, so the convergence rate of PA-PP is slightly faster than that of PA-GD.

5 Convergence Analysis

In this section, we present the main steps of the convergence analysis of PA-GD.

5.1 The Main Difficulty of the Proof

Gradient Descent:

GD searches for a descent direction of the objective function over the entire space of variables. According to the mean value theorem, the GD update can be expressed as

(9)

It can be observed that the GD update rule contains information about the Hessian matrix evaluated along the segment between consecutive iterates. More specifically, measuring the iterates relative to an ε-SS2 satisfying (3), we can rewrite (9) as

(10)

where the residual term captures the deviation of this Hessian from the Hessian at the stationary point.

Based on the ρ-Hessian Lipschitz property, this residual can be upper bounded in terms of the distance between the iterates and the stationary point. By exploiting the negative curvature of the Hessian at the saddle point, we can project the iterates onto the direction along which the eigenvalue of the update matrix is greater than 1. Consequently, the norm of the projection of the iterates along this direction grows exponentially as the algorithm proceeds around the saddle point, implying that the sequence generated by GD escapes from it. The details of this convergence rate analysis can be found in Jin et al. [2017a].
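As a quick numerical illustration of this mechanism (a toy example with an arbitrarily chosen Hessian, not part of the formal argument), iterating the GD map near a quadratic saddle amplifies the component along the negative-curvature eigenvector:

```python
import numpy as np

# Toy quadratic saddle f(z) = 0.5 * z^T H z with eigenvalues {3, -1}.
H = np.array([[1.0, 2.0], [2.0, 1.0]])
eta = 0.1
evals, evecs = np.linalg.eigh(H)
v_neg = evecs[:, 0]                       # eigenvector of the eigenvalue -1
rng = np.random.default_rng(1)
z = 1e-6 * rng.standard_normal(2)         # tiny perturbation off the saddle
for t in range(1, 61):
    z = z - eta * (H @ z)                 # GD step; the map is (I - eta * H)
    if t % 20 == 0:
        # the projection grows like (1 - eta * (-1))^t = 1.1^t
        print(t, abs(v_neg @ z))
```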

Alternating Gradient Descent:

However, at each step the AGD algorithm only updates part of the variables of the joint vector, which belong to a subspace of the feasible set. Similarly, using the mean value theorem, the AGD update around a stationary point can be expressed as follows:

(11)

where the two partial Hessian terms are evaluated at the previous point and at the most recently updated point, respectively.

From the above expression, it can be seen clearly that the update rule of AGD does not involve the full Hessian matrix at any single point, but only partial Hessian blocks. Furthermore, the right-hand side of (11) contains second-order information not only at the previous point but also at the most recently updated point. These features represent the main challenges in understanding the behavior of the sequence generated by the AGD algorithm.

5.2 The Main Idea of the Proof

Although the second-order information is divided between two points, we can still characterize the recursion of the iterates around strict saddle points. To this end, we split the Hessian at the stationary point into two parts,

(12)

whose sum recovers the full Hessian.

Then, recursion (11) can be written as

(13)

where the two matrices collect the corresponding parts of the split. However, it is still unclear from (13) how the iteration evolves around the strict saddle point.

To highlight the ideas, let us define the matrix

(14)

It can be observed that this matrix is lower triangular with all diagonal entries equal to 1; therefore it is invertible. Taking its inverse on both sides of (13), we obtain a linear recursion for the iterates around the saddle point.

The goal of analyzing this recursion then becomes finding the maximum eigenvalue of the resulting iteration matrix. With the help of matrix perturbation theory, we can quantify the difference between the eigenvalues of the matrix that contains the negative curvature and those of the matrix we are interested in analyzing. To be more precise, we have the following lemma.

Lemma 1.

Under Assumption 1, let the Hessian matrix be evaluated at an ε-SS2. Then we have

(15)

where the matrices involved are defined in (12) and (14).

Lemma 1 shows that there exists a subspace, spanned by an eigenvector whose corresponding eigenvalue is greater than 1, along which the iterates expand; this indicates that the sequence generated by AGD can still escape from the strict saddle point by leveraging such negative-curvature information.
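To make the structure in (13)-(14) concrete, the following toy computation (a two-dimensional quadratic with an arbitrarily chosen Hessian, not the general argument) writes one AGD sweep around a strict saddle as the inverse of a unit-diagonal lower-triangular factor applied to an upper-triangular factor, and checks that the largest eigenvalue of the resulting map exceeds one:

```python
import numpy as np

# Quadratic model f(x, y) = 0.5 * [x, y] H [x, y]^T around a strict saddle.
H = np.array([[1.0, 2.0], [2.0, 1.0]])    # lambda_min = -1 < 0
eta = 0.1
# AGD sweep: x+ = x - eta*(H11*x + H12*y),  y+ = y - eta*(H21*x+ + H22*y),
# i.e. L @ z_plus = U @ z with L unit-diagonal lower triangular.
L = np.array([[1.0,           0.0],
              [eta * H[1, 0], 1.0]])
U = np.array([[1.0 - eta * H[0, 0], -eta * H[0, 1]],
              [0.0,                  1.0 - eta * H[1, 1]]])
M = np.linalg.inv(L) @ U                  # z_plus = M @ z, cf. (13)-(14)
print(max(abs(np.linalg.eigvals(M))))     # about 1.11 > 1: the sweep expands
                                          # along the negative-curvature direction
```

Next, we give a sketch of the proof of Theorem 1.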

5.3 The Sketch of the Proof

The structure of the proof for quantifying the sufficient decrease of the objective function after the perturbation is borrowed from the proof of PGD in Jin et al. [2017a]; however, PA-GD updates the variables block by block, so new proofs are needed to show that PA-GD can still escape from saddle points with the perturbation technique.

First, if the size of the gradient is large enough, Algorithm 1 just implements the ordinary AGD. We give the descent lemma of AGD as follows.

Lemma 2.

Under Assumption 1, for the AGD algorithm with a sufficiently small step size, we have the following descent estimate.
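A standard estimate of this form, for a step size no larger than the reciprocal of the maximum block-wise Lipschitz constant L_max, reads

$f(\mathbf{x}^{r+1}, \mathbf{y}^{r+1}) \le f(\mathbf{x}^{r}, \mathbf{y}^{r}) - \frac{\eta}{2}\Big( \|\nabla_{\mathbf{x}} f(\mathbf{x}^{r}, \mathbf{y}^{r})\|^{2} + \|\nabla_{\mathbf{y}} f(\mathbf{x}^{r+1}, \mathbf{y}^{r})\|^{2} \Big),$

which follows by applying the usual descent lemma to each of the two block updates in (4).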

Second, if the iterates are near a strict saddle point, we can show that, after a perturbation, the AGD updates achieve a sufficient decrease of the objective value with high probability. The precise statement is given as follows.

Lemma 3.

Under Assumption 1, there exists an absolute constant with the following property. Let the step size and the thresholds be chosen as described in Algorithm 1, and let the current iterate be a strict saddle point which satisfies

(16)

together with the accompanying conditions on its Hessian and on the algorithm parameters.

Let the perturbed point be obtained by adding to this saddle point a perturbation drawn uniformly from the perturbation ball, and let the subsequent iterates be generated by PA-GD. Then, with at least the prescribed probability, the objective value decreases sufficiently within the allotted number of iterations.

We remark that Lemma 2 is well-known and Lemma 3 is the core technique. In the following, we outline the main idea used in proving the latter. The formal statements of these steps are shown in the appendix; see Lemma 8–Lemma 10 therein.

We emphasize that the main contributions of this paper lie in the analysis of the first two steps, where the special update rule of PA-GD is analyzed so that the negative curvature around the saddle points can be utilized.

Step 1

(Lemma 8) Consider a generic sequence generated by PA-GD. As long as its initial point is close to the saddle point, the distance between the iterates and the saddle point can be upper bounded by using the ρ-Hessian Lipschitz continuity property.

Step 2

(Lemma 9) Leveraging the negative curvature around the strict saddle point, we know that there exists an escaping direction, spanned by the eigenvector whose corresponding eigenvalue is the largest (greater than 1). Consider two sequences generated by PA-GD, both initialized around the saddle point, whose initial points are separated from each other along this direction by a small distance proportional to the radius of the perturbation ball defined in Algorithm 1. We can show that if one of the sequences is still near the saddle point after a prescribed number of steps, then the other sequence must achieve a sufficient decrease of the objective value within that many steps, implying that the iterates can escape from the saddle point within the prescribed number of steps.

Step 3

(Lemma 10) Consider the points obtained after perturbing the saddle point. We can then quantify the probability that the AGD sequence achieves a sufficient decrease of the objective value within the prescribed number of iterations after the perturbation [Jin et al., 2017a, Lemmas 14 and 15].

5.4 Extension to PA-PP

By leveraging the convergence analysis of PA-GD and the relation between PA-GD and PA-PP shown in (8), we can also write the recursion of the PA-PP iterates as

(17)

where the corresponding matrices are defined as

(18)

and

Let

(19)

This matrix is upper triangular with all diagonal entries equal to 1, so it is invertible. Differently from the PA-GD case, we take the inverse of this matrix on both sides of (17) and obtain

Then, we can give the following result, which characterizes the recursion of the iterates generated by PA-PP.

Corollary 2.

Under Assumption 1, let the Hessian matrix be evaluated at an ε-SS2, and denote by λ⁺_min(·) the minimum positive eigenvalue of a matrix. Then we have

(20)

where the matrices involved are defined in (18) and (19).

We remark that Corollary 2 is useful because it can be leveraged to show that the norm of the iterates around saddle points can increase exponentially. Then, applying analysis steps similar to those used to prove the convergence rate of PA-GD, we obtain the result shown in Corollary 1.

6 Connection with Existing Works

Remark 2. In Theorem 1 we characterized the convergence rate to an ε-SS2; this bound can also be translated into one for an alternative parameterization of second-order stationarity. Compared with the recent work Jin et al. [2017a], the convergence rate of PA-GD/PA-PP is slower than that of perturbed GD. The main reason is that, unlike GD-type algorithms, PA-GD and PA-PP cannot fully utilize the Hessian information, because each update only involves one block of variables. A similar situation arises for SGD-type algorithms, which also cannot access the exact negative curvature around strict saddle points.

From Table 1, it can be seen that, for reaching an ε-SS2, the convergence rate of PA-GD/PA-PP is still faster than those of SGD Ge et al. [2015], SGLD Zhang et al. [2017], Neon+SGD Xu and Yang [2017], and Neon2+SGD Allen-Zhu and Li [2017], but slower than the rest. We emphasize that PA-GD and PA-PP represent the first BCD-type algorithms with a convergence rate guarantee for escaping from strict saddle points efficiently. At this point, it is unclear whether our rate is the best achievable; whether it can be improved is left to future work.

7 Numerical Results

Figure 2: Convergence comparison between AGD and PA-GD. (a) Objective function in 2D. (b) Objective value versus the number of iterations.

In this section, we present a simple example that shows the convergence behavior of PA-GD. Consider the nonconvex objective function

(21)

First, the following lemma shows that the function satisfies the assumptions required by our analysis.

Lemma 4.

For appropriate constants, the function defined in (21) is L-smooth and ρ-Hessian Lipschitz.

The shape of the objective function (21) in the two-dimensional (2D) case is shown in Figure 2(a). It can be clearly observed that there exists a strict saddle point as well as two other local optimal points. We randomly initialize both algorithms around the strict saddle point. The convergence comparison between AGD and PA-GD is shown in Figure 2(b); it can be observed that PA-GD converges faster than AGD to a local optimal point.
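Since the exact objective in (21) is not reproduced here, the following driver uses an illustrative stand-in with the same qualitative features (block-wise convex, jointly nonconvex, with a strict saddle at the origin) and runs the PA-GD sketch from Section 3:

```python
import numpy as np

# Illustrative stand-in for (21): strict saddle at the origin (Hessian
# eigenvalues 3 and -1 there) and two local minima along x = -y.
f = lambda x, y: 0.5 * (x @ x + y @ y) + 2.0 * (x @ y) + 0.25 * np.sum(x**4 + y**4)
grad_x = lambda x, y: x + 2.0 * y + x**3
grad_y = lambda x, y: y + 2.0 * x + y**3

x0, y0 = np.array([1e-4]), np.array([1e-4])      # initialized near the saddle
x, y = pa_gd(grad_x, grad_y, f, x0, y0, eta=0.1, r_pert=1e-2,
             g_thres=1e-3, f_thres=1e-5, t_thres=300)
print(f(x, y))   # clearly negative (f(0, 0) = 0): the saddle has been escaped
```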

8 Conclusion

In this paper, perturbed variants of the AGD and alternating proximal point (APP) algorithms are proposed, with the objective of finding second-order stationary solutions of nonconvex smooth problems. Leveraging the recently developed idea of random perturbation for first-order methods, the proposed algorithms add suitable perturbations to the AGD or APP iterates. The main contribution of this work is a new analysis that takes into consideration the block structure of the updates of the perturbed AGD and APP algorithms. By exploiting the negative curvature, it is established that, with high probability, the algorithms converge to an ε-SS2 at a global sublinear rate.

9 Acknowledgment

The authors would like to thank Chi Jin for discussion on the perturbed gradient descent algorithm.

Appendix

Appendix A Preliminary

We provide the proofs of some preliminary lemmas (Lemma 5–Lemma 7) used in the proofs in Section B.

First, Lemma 5 and Lemma 6 quantify the difference of the second-order information of the objective function between two points.

Lemma 5.

If the function f is ρ-Hessian Lipschitz, then we have

(22)
Lemma 6.

Under Assumption 1, we have block-wise Lipschitz continuity as follows:

(23)

and

(24)

Next, we show how the size of the partial gradients after one round of AGD updates relates to the size of the full gradient.

Lemma 7.

If the function f is L-smooth, then we have

(25)

where the sequence is generated by the AGD algorithm.

A.1 Proof of Lemma 5

Proof.

If the function f is ρ-Hessian Lipschitz, then we have

where the first step holds because of the Hessian Lipschitz property and the second step uses the triangle inequality. ∎

A.2 Proof of Lemma 6

The proof involves two parts:

Upper Triangular Matrix:

Consider three different vectors. We then have

where we used

(26)

and the preceding definitions.

Lower Triangular Matrix:

where the inequality follows from the bound noted above.

A.3 Proof of Lemma 7

Proof.

Recall the definition