
A Second Order Method for Nonconvex Optimization (Submitted to the editors on September 30, 2017. Funding: Work supported by the ARL DCIST CRA W911NF-17-2-0181.)

Santiago Paternain, Aryan Mokhtari, and Alejandro Ribeiro. Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA (spater@seas.upenn.edu, aryanm@seas.upenn.edu, aribeiro@seas.upenn.edu).
Abstract

Machine learning problems such as neural network training, tensor decomposition, and matrix factorization require local minimization of a nonconvex function. This local minimization is challenged by the presence of saddle points, of which there can be many and from which descent methods may take an inordinately large number of iterations to escape. This paper presents a second-order method that modifies the update of Newton's method by replacing the negative eigenvalues of the Hessian by their absolute values and uses a truncated version of the resulting matrix to account for the objective's curvature. The method is shown to escape saddles in at most 1 + log_{3/2}(δ/(2ε)) iterations, where ε is the target optimality and δ characterizes a point sufficiently far away from the saddle. The base 3/2 of this exponential escape is independent of problem constants. Adding classical properties of Newton's method, the paper proves convergence to a local minimum with probability 1 − p in a number of iterations that is logarithmic in 1/ε and in 1/p.

Key words. smooth nonconvex unconstrained optimization, line-search methods, second-order methods, Newton-type methods.

AMS subject classifications. 49M05, 49M15, 49M37, 90C06, 90C30.

1 Introduction

Although it is generally accepted that the distinction between functions that are easy and difficult to minimize is their convexity, a more accurate statement is that the distinction lies in the ability to use local descent methods. A convex function is easy to minimize because a minimum can be found by following local descent directions, but this is not possible for nonconvex functions. This is unfortunate because many interesting problems in machine learning reduce to the minimization of nonconvex functions [23]. Despite this general complexity, some recent results have shown that for a large class of nonconvex problems, such as dictionary learning [28], tensor decomposition [12], matrix completion [13], and training of some specific forms of neural networks [16], all local minimizers are global minima. This reduces the problem of finding the global optimum to the problem of finding a local minimum, which can be accomplished with local descent methods.

Conceptually, finding a local minimum of a nonconvex function is not more difficult than finding the minimum of a convex function. It is true that the former can have saddle points that are attractors of gradient fields for some initial conditions [19, Section 1.2.3]. However, since these initial conditions lie in a low dimensional manifold, gradient descent can be shown to converge almost surely to a local minimum if the initial condition is chosen at random [17, 22], or if noise is added to gradient descent steps [25]. These fundamental facts notwithstanding, practical implementations show that finding a local minimum of a nonconvex function is much more challenging than finding the minimum of a convex function. This happens because the performance of first order methods is degraded by ill conditioning, which in the case of nonconvex functions implies that it may take a very large number of iterations to escape from a saddle point [11, 7]. Indeed, it can be argued that it is saddle points, not local minima, that pose the fundamental impediment to rapid high dimensional nonconvex optimization [11, 2, 27, 26].

In this paper we propose the nonconvex Newton (NCN) method to accelerate the escape from saddles. NCN uses a descent direction analogous to the Newton step except that we use the Positive definite Truncated (PT)-inverse of the Hessian in lieu of the regular inverse of the Hessian (Definition 2.1). The PT-inverse has the same eigenvector basis as the regular inverse but its eigenvalues differ in that: (i) All negative eigenvalues are replaced by their absolute values. (ii) Small eigenvalues are replaced by a constant m. The idea of using the absolute values of the eigenvalues of the Hessian in nonconvex optimization was first proposed in [21, Chapters 4 and 7] and then in [18, 24]. These properties ensure that the value of the function is reduced at each iteration with an appropriate selection of the step size. Our main contribution is to show that NCN can escape any saddle point with eigenvalues bounded away from zero at an exponential rate, which can be further shown to have base 3/2 independent of the function's properties in a neighborhood of the saddle. Specifically, we show the following result:

• Consider an arbitrary ε > 0 and the region around a saddle at which the norm of the objective gradient is smaller than δ. There exists a subset of this region so that NCN iterations result in the norm of the gradient growing from ε to δ at an exponential rate with base 3/2. The number of NCN iterations required for the gradient to progress from ε to δ is therefore not larger than 1 + log_{3/2}(δ/(2ε)); see Theorem 2.2.

We emphasize that the base of escape 3/2 is independent of the function's properties aside from the requirement that saddles be non-degenerate. The constant δ depends on Lipschitz constants and Hessian eigenvalue bounds.

As stated in (i), the base 3/2 for exponential escape does not hold for all points close to the saddle but only in a specific subset of the region in which the gradient norm is smaller than δ. It is impossible to show that NCN iterates stay within this subset as they approach the saddle, but we show that it is possible to add noise to NCN iterates to quickly enter the subset with overwhelming probability. Specifically, we show that:

• By adding Gaussian noise with suitably chosen standard deviation when the norm of the gradient of the function is smaller than δ, the region in which the base of the exponential escape of NCN is 3/2 is visited by the iterates with probability 1 − p after a number of perturbations that is logarithmic in 1/p. Once this region is visited, result (i) holds and we escape the saddle in not more than 1 + log_{3/2}(δ/(2ε)) iterations; see Proposition 3.6.

Combined with other standard properties of classical Newton's method, results (i) and (ii) imply convergence to a local minimum with probability 1 − p in a number of iterations that is logarithmic with respect to the target accuracy ε and logarithmic with respect to the desired probability parameter p (Theorem 2.3). These convergence rate results are analogous to the results for gradient descent with noise [12, 15]. The fundamental difference is that while gradient descent escapes saddles at an exponential rate with a base that depends on the problem's condition number, NCN escapes saddles at an exponential rate with base 3/2 for all non-degenerate saddles (Section 2.2). Section 4 considers the problem of matrix factorization to support the theoretical conclusions.

1.1 Related work

A second approach to ensure that the stationary point attained by a local descent method is a local minimum utilizes second order information. These methods include cubic regularization [14, 20, 5, 6, 1] and trust region algorithms [8, 11, 10], as well as approaches where descent is computed only along the directions corresponding to the negative eigenvalues of the Hessian [9]. When using a cubic regularization of a second order approximation of the objective, the number of iterations needed to converge to an approximate local minimum can be explicitly bounded [20]. Solving the cubic regularization subproblem is in itself computationally prohibitive. This is addressed with trust region methods that reduce the computational complexity and still converge to a local minimum [10]. A related attempt utilizes low-complexity Hessian-based accelerations [1, 4]. Although these convergence rates may seem worse than the rate achieved by NCN, this is simply a difference in assumptions, because we assume here that saddles are non-degenerate. This assumption is absent from [14, 20, 5, 6, 1, 8, 11, 10, 9].

2 Nonconvex Newton Method (NCN)

Given a multivariate nonconvex function f : Rⁿ → R, we would like to solve the following problem

 x∗ := argmin_{x ∈ Rⁿ} f(x). (1)

Finding x∗ is NP-hard in general, except in some particular cases, e.g., when all local minima are known to be global. We then settle for the simpler problem of finding a local minimum x†, which we define as any point where the gradient is null and the Hessian is positive definite:

 ∥∇f(x†)∥ = 0,  ∇²f(x†) ≻ 0. (2)

The fundamental difference between (strongly) convex and nonconvex optimization is that for the former any local minimum is global, because there is only one point at which ∇f(x) = 0 and that point satisfies ∇²f(x) ≻ 0. Nonconvex functions may have many minima and many other critical points at which ∇f(x) = 0 but the Hessian is not positive definite. Of particular significance are saddle points x‡, defined as those at which the Hessian is indefinite:

 ∥∇f(x‡)∥ = 0,  ∇²f(x‡) ⊁ 0,  ∇²f(x‡) ⊀ 0. (3)

Local minima can be found with local descent methods. The most widely used of these is gradient descent, which can be proven to approach some local minimum x† with probability one relative to a random initialization under standard regularity conditions [17, 22]. Convergence guarantees notwithstanding, gradient descent methods can perform poorly around saddle points. Indeed, while escaping saddles is guaranteed in theory, the number of iterations required to do so is large enough that gradient descent can converge to saddles in practical implementations [24].

Newton's method ameliorates the slow convergence of gradient descent by premultiplying gradients with the Hessian inverse. Since the Hessian is positive definite for strongly convex functions, Newton's method provides a descent direction and converges to the minimizer at a quadratic rate. The improvement over gradient descent comes from the fact that premultiplying the descent direction by the inverse of the Hessian performs a local change of coordinates by which the level sets of the function become circular. The algorithm proposed here relies on performing an analogous transformation that turns saddles whose unstable manifold is "slow" compared to the stable manifold (that is, the negative eigenvalues of the Hessian are smaller in absolute value than its positive eigenvalues) into saddles in which every eigenvalue has the same absolute value. For nonconvex functions the Hessian is not necessarily positive definite and convergence to a minimum is not guaranteed by Newton's method. In fact, all critical points are stable relative to the Newton dynamics, and the method can converge to a local minimum, a saddle, or a local maximum. This shortcoming can be overcome by adopting a modified inverse that uses the absolute values of the Hessian eigenvalues [21].

Definition 2.1 (PT-inverse).

Let H ∈ Rⁿˣⁿ be a symmetric matrix, Q a basis of orthonormal eigenvectors of H, and Λ a diagonal matrix of the corresponding eigenvalues. We say that |Λ|_m is the Positive definite Truncated (PT)-eigenvalue matrix of H with parameter m > 0 if

 (|Λ|_m)_ii = |Λ_ii| if |Λ_ii| ≥ m, and (|Λ|_m)_ii = m otherwise. (4)

The PT-inverse of H with parameter m is the matrix H_m⁻¹ := Q |Λ|_m⁻¹ Q⊤.

Given the decomposition H = QΛQ⊤, the inverse of H, when it exists, can be written as H⁻¹ = QΛ⁻¹Q⊤. The PT-inverse flips the signs of the negative eigenvalues and truncates small eigenvalues, replacing by m any eigenvalue whose absolute value is smaller than m. Both of these properties are necessary to obtain a convergent Newton method for nonconvex functions. We use the PT-inverse of the Hessian to define the NCN method. To do so, consider iterates x_k, a step size η_k, and use the shorthand H_m(x_k)⁻¹ := |∇²f(x_k)|_m⁻¹ to represent the PT-inverse of the Hessian evaluated at the iterate x_k. The NCN method is defined by the recursion
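As a concrete sketch (our own illustration, not code from the paper), the PT-inverse of Definition 2.1 can be computed from an eigendecomposition with NumPy; the function name and the test matrix below are ours:

```python
import numpy as np

def pt_inverse(H, m):
    """PT-inverse of a symmetric matrix H with truncation parameter m > 0.

    Negative eigenvalues are replaced by their absolute values, and
    eigenvalues with magnitude below m are replaced by m, as in (4),
    before inverting.
    """
    vals, Q = np.linalg.eigh(H)            # H = Q diag(vals) Q^T
    vals_m = np.maximum(np.abs(vals), m)   # |Lambda|_m from Definition 2.1
    return Q @ np.diag(1.0 / vals_m) @ Q.T  # Q |Lambda|_m^{-1} Q^T

# On a saddle-shaped Hessian the PT-inverse is positive definite:
H = np.diag([2.0, -0.5, 1e-8])
Hm_inv = pt_inverse(H, m=0.1)  # eigenvalues become 1/2, 1/0.5, 1/0.1
```

Note that the negative eigenvalue −0.5 contributes 1/0.5 = 2 to the PT-inverse, while the near-zero eigenvalue is clipped to m = 0.1 and contributes 10.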

 x_{k+1} = x_k − η_k H_m(x_k)⁻¹ ∇f(x_k) = x_k − η_k |∇²f(x_k)|_m⁻¹ ∇f(x_k). (5)

The step size η_k is chosen with a backtracking line search, as is customary in regular Newton's method; see, e.g., [3, Section 9.5.2]. This yields a step routine that is summarized in Algorithm 1. In Step 3 we update the iterate using the PT-inverse of the Hessian computed in Step 2 and the initial stepsize η = 1. The updated variable is checked against the decrement condition with parameter α in Step 4. If the condition is not met, we decrease the stepsize by backtracking with the constant β as in Step 5. We update the iterate with the new stepsize as in Step 6 and repeat the process until the decrement condition is satisfied.
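A minimal sketch of one NCN iteration with backtracking may help fix ideas. The decrement condition below is the standard Armijo rule, which we assume here as a stand-in for the condition in Algorithm 1; all names and parameter values are ours:

```python
import numpy as np

def ncn_step(f, grad, hess, x, m=1e-3, alpha=0.25, beta=0.5):
    """One nonconvex Newton step x <- x - eta * Hm(x)^{-1} grad f(x).

    Starts with eta = 1 and backtracks by beta until an Armijo-type
    decrement condition with parameter alpha holds (assumed form of the
    decrement condition of the backtracking routine).
    """
    g = grad(x)
    vals, Q = np.linalg.eigh(hess(x))
    d = Q @ ((Q.T @ g) / np.maximum(np.abs(vals), m))  # Hm^{-1} grad f
    eta = 1.0
    while f(x - eta * d) > f(x) - alpha * eta * g.dot(d):
        eta *= beta  # backtrack the stepsize
    return x - eta * d

# On a strongly convex quadratic the PT-inverse equals the usual inverse,
# the unit step is accepted, and one step reaches the minimizer.
A = np.diag([2.0, 1.0])
x1 = ncn_step(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
              lambda x: A, np.array([1.0, 1.0]))
```

On nonconvex objectives the same routine produces a descent direction at saddles, which is the point of the PT-inverse.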

Since the PT-inverse is defined to guarantee that −H_m(x_k)⁻¹∇f(x_k) is a proper descent direction, it is unsurprising that NCN converges to a local minimum. The expectation is, however, that it will do so at a faster rate because of the Newton-like correction implied by (5). Intuitively, the Hessian inverse in convex functions implements a change of coordinates that renders level sets approximately spherical around the current iterate x_k. The Hessian PT-inverse in nonconvex functions implements an analogous change of coordinates that renders level sets in the neighborhood of a saddle point close to a symmetric hyperboloid. This regularization of level sets is expected to improve convergence, something that has been observed empirically [24].

2.1 Convergence of NCN to local minima

Convergence results are derived with customary assumptions on Lipschitz continuity of the gradient and Hessian, boundedness of the norm of the local minima, and non-degeneracy of critical points:

Assumption 1.

The function f is twice continuously differentiable. The gradient and Hessian of f are Lipschitz continuous, i.e., there exist constants M, L > 0 such that for any x, y ∈ Rⁿ

 ∥∇f(x) − ∇f(y)∥ ≤ M∥x − y∥,  and  ∥∇²f(x) − ∇²f(y)∥ ≤ L∥x − y∥. (6)

Assumption 2.

There exists a positive constant that uniformly bounds the norm of all local minima x† satisfying (2). In particular, this is true of the global minimum x∗ in (1).

Assumption 3.

Local minima and saddles are non-degenerate, i.e., there exists a constant m > 0 such that |λ_i(∇²f(x))| ≥ m for all local minima defined in (2) and for all saddle points defined in (3). The notation λ_i(∇²f(x)) refers to the i-th eigenvalue of the Hessian of f at the point x.

The main feature of the update in (5) is that it exploits curvature information to accelerate the rate of escape from saddle points relative to gradient descent. In particular, the iterates of NCN escape from a local neighborhood of saddle points exponentially fast at a rate that is independent of the problem's condition number. To state this result formally, let x‡ be a saddle of interest and denote by Q− and Q+ bases of the orthogonal subspaces associated with the negative and positive eigenvalues of ∇²f(x‡). For a point x we define the gradient projections on these subspaces as

 ∇f−(x) := Q−⊤∇f(x)  and  ∇f+(x) := Q+⊤∇f(x). (7)

These projections have different behaviors in the neighborhood of a saddle point. The projection on the positive subspace enjoys an approximately quadratic convergent phase as in Newton's method (Theorem 3.2). This is as would be expected because the positive portion of the Hessian is not affected by the PT-inverse. The negative portion can be shown to diverge exponentially from the saddle point at a rate independent of the problem conditioning. These results provide a bound on the number of steps required to escape the neighborhood of the saddle point, which we state next.

Theorem 2.2.

Let f be a function satisfying Assumptions 1 and 3, let ε be the desired accuracy of the solution provided by Algorithm 2, and let α be one of its inputs. Define δ := min{m²(1 − 2α)/L, m²/(5L)}. If

 ∥∇f−(x₀)∥ ≥ max{ (5L/(2m²)) ∥∇f(x₀)∥², ε } (8)

and ∥∇f(x₀)∥ < δ, we have that ∥∇f(x_{K1})∥ ≥ δ, with

 K1 ≤ 1 + log_{3/2}( δ/(2ε) ). (9)

The result in Theorem 2.2 establishes an upper bound on the number of iterations needed to escape the saddle point, which is of order log(1/ε) as long as the iterate x₀ satisfies (8) and ∥∇f(x₀)∥ < δ. The fundamental observation, however, is that the base of the rate at which the iterates escape the neighborhood of the saddle point is a constant independent of the constants of the specific problem. To establish convergence to a local minimum we prove four additional results:

(i) In Proposition 3.7 we state that convergence of the algorithm to a neighborhood of the critical points in which the gradient norm is smaller than δ/2 is achieved in a constant number of iterations bounded by

 K2 = 4M²(f(x₀) − f(x∗)) / (αβmδ²). (10)

(ii) We show in Proposition 3.8 that the number of times that the iterates re-visit the same neighborhood of a saddle point is upper bounded by

 T < (2/(αβ)) (M³/m³) + αβ, (11)

and that (iii) once in such a neighborhood of a local minimum, the algorithm achieves accuracy ε in at most

 K3 = log₂(log₂( 2m²/(5Lε) )) (12)

iterations (Corollary 3.3). (iv) For the case where the iterate is within the neighborhood of a saddle point but the conditions required by Theorem 2.2 are not satisfied, we show that by adding noise to the iterate we can ensure that said conditions are met with probability 1 − p after a number of iterations of order log(S/p), where S is the number of saddles of the problem. Since we may visit all of the saddles in the descent process, to converge to the minimum with probability 1 − p we need to escape each one of them with probability 1 − p/S. In particular, we show that if we add a bounded version of Gaussian noise to each component of the decision variable, then with probability q the perturbed variable will be in the region where the conditions required by Theorem 2.2 are satisfied. We further show that the probability q is lower bounded by

 q > 2(1 − Φ(1)) γ(n/2, n/2) / Γ(n/2), (13)

where Φ(1) is the standard Gaussian cumulative distribution evaluated at 1, Γ(·) is the gamma function, γ(·,·) is the lower incomplete gamma function, and n is the dimension of the iterate x. In practice, since we cannot check whether the conditions required in Theorem 2.2 are satisfied by the perturbed iterate, we run the algorithm for K1 iterations (cf. (9)), i.e., the maximum number of iterations required to escape the saddle assuming that the perturbed iterate is in the preferable region. If the perturbed variable is not in the region of interest, the iterates may not escape the saddle and another round of perturbation is needed. Using this scheme we show that the iterates will be within the desired region of the saddle with probability 1 − p/S after at most

 K4 ≤ (1 + log(S/p)/log(1/(1 − q))) [ log₂(log₂( 5L/(2m²ε) )) + log_{3/2}(2) + 1 ]. (14)

The fact that we may need to visit each saddle T times does not degrade the previous probability, as we show in Proposition 3.8, since in only one of these visits do we reach the neighborhood where noise is added. By combining the previous bounds we establish the total complexity for NCN to converge to an ε-neighborhood of a local minimum of f with probability 1 − p. We formalize this result next.

Theorem 2.3.

Let f be a function with S saddle points, satisfying Assumptions 1–3. Let ε > 0 be the accuracy of the solution and α, β, m the remaining inputs of Algorithm 2. The latter, with probability 1 − p and for any ε satisfying

 ε < √( αβ (m³/M³) (δ/2)² / (1 + (2√n M/m + 1)²) ) (15)

outputs an iterate satisfying ∥∇f(x)∥ < ε in the neighborhood of a local minimum, in at most

 K = S T K1 + (S T + 1) K2 + K3 + S K4 (16)

iterations, where K1, K2, K3, and K4 are the bounds defined in (9)–(14).

The result in Theorem 2.3 states that the proposed NCN method outputs a solution with a desired accuracy ε and with a desired probability 1 − p in a number of steps that is bounded by K. The final bound follows from the fact that it may be required to visit every saddle point T times before reaching a local minimum. Hence we may have ST + 1 approaches to neighborhoods of critical points, each of them taking at most K2 iterations. This corresponds to the second term in (16). The first term corresponds to the need to escape all S saddles, T times each, when we are in the good region, thus taking a total of STK1 steps. If we are not in the desired region, it takes K4 steps to reach said region, and we may have to do this at all S saddles, hence the last term in (16). Finally, the term K3 corresponds to the quadratic convergence to the local minimum. The dominant terms in (16) are STK1, which depends on the accuracy as log(1/ε), and SK4, which depends on the probability as log(1/p).

Before proving the result of the previous theorem we describe the details of Algorithm 2. Its main core is the NCN step described in (5) (Step 3), which is performed as long as the iterates are not in a neighborhood of a local minimum with gradient norm smaller than ε. Steps 4–12 are introduced to add Gaussian noise so as to satisfy the hypothesis of Theorem 2.2. If the iterate is in a neighborhood of a saddle point with small gradient norm (Step 4), noise from a Gaussian distribution is added (Step 5). The draw is repeated as long as the perturbed iterate falls outside this neighborhood (Steps 6–8). This is done to ensure that the iterates remain close to the saddle point. Once this condition is satisfied we perform the NCN step twice if the iterate is still in the neighborhood (Steps 10–12). In that case, two steps of NCN are enough to escape the neighborhood of the saddle point and therefore to satisfy the hypothesis of Theorem 2.2 (cf. Proposition 3.6).

2.2 An Illustrative Example

To understand escape from a saddle point it is important to distinguish challenges associated with the saddle's condition number from challenges associated with starting from an initial point that is close to the stable manifold. To illustrate these two different challenges we consider a family of nonconvex functions parametrized by a coefficient λ ∈ (0, 1] and defined as

 f_λ(x) = (1/2)x₁² − (λ/2)x₂². (17)

As λ decreases from 1 to 0 the saddle becomes flatter, its condition number growing from 1 to infinity. Gradient descent iterates using unit stepsize for the function f_λ result in zeroing of the first coordinate in just one step because x₁ − ∂f_λ(x)/∂x₁ = x₁ − x₁ = 0. The second coordinate evolves according to the recursion

 x₂⁺ = x₂ − η∇_{x₂}f_λ(x) = (1 + λ)x₂. (18)

Likewise, NCN with unit step size results in zeroing of the first coordinate because [H_m(x)⁻¹∇f_λ(x)]₁ = x₁. The second coordinate, however, evolves as

 x₂⁺ = x₂ − [H_m(x)⁻¹∇f_λ(x)]₂ = 2x₂. (19)

Both expressions imply exponential escape from the saddle point. In the case of gradient descent the base of escape is 1 + λ, but in the case of NCN the base of escape is 2 independently of the value of λ; this is better than the guaranteed base of 3/2 that we establish in Theorem 2.2.
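These two recursions can be checked numerically. The following sketch (our own, with the illustrative choice λ = 0.01) iterates both updates from the same starting point near the stable manifold:

```python
import numpy as np

lam = 0.01                     # flat saddle: f(x) = x1^2/2 - lam * x2^2/2
x_gd = np.array([1.0, 1e-3])   # same starting point for both methods
x_ncn = x_gd.copy()

def grad(x):
    """Gradient of f_lambda."""
    return np.array([x[0], -lam * x[1]])

for _ in range(10):
    # unit-step gradient descent: x2 multiplies by (1 + lam) per step
    x_gd = x_gd - grad(x_gd)
    # unit-step NCN: PT-inverse of diag(1, -lam) is diag(1, 1/lam),
    # so x2 doubles per step regardless of lam
    x_ncn = x_ncn - np.array([1.0, 1.0 / lam]) * grad(x_ncn)
```

After 10 steps the unstable coordinate has grown by (1 + λ)¹⁰ ≈ 1.10 under gradient descent but by 2¹⁰ = 1024 under NCN, while both methods zero the stable coordinate in one step.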

3 Convergence Analysis

To study the convergence of the proposed NCN method we divide the results into two parts. In the first part, we study the performance of NCN in a neighborhood of critical points. We first define this region and then characterize the number of required iterations to escape it in the case that the critical point is a saddle or the number of required iterations for convergence in case that the critical point is a minimum. In the second part, we study the behavior of NCN when the iterates are not close to a critical point and derive an upper bound for the number of iterations required to reach one.

To analyze the local behavior of NCN we characterize the region in which the step size chosen by backtracking line search is η = 1, as in standard Newton's method for convex optimization. We formally introduce this region in the following lemma.

Lemma 3.1.

Let f be a function satisfying Assumptions 1 and 3, and let α be the input parameter of Algorithm 2. Then, if ∥∇f(x_k)∥ ≤ m²(1 − 2α)/L, the backtracking algorithm admits the step size η_k = 1.

Proof.

See Appendix LABEL:sec_step_size_one.

The result in Lemma 3.1 characterizes the neighborhood in which the step size of NCN chosen by backtracking is η = 1. In the following theorem, we study the behavior of NCN in this neighborhood. Before introducing the result, recall the definitions in (7) of the gradient projections onto the subspaces of the eigenvectors associated with the negative and positive eigenvalues, denoted ∇f− and ∇f+, respectively. We show that the norm ∥∇f−(x_k)∥ is almost doubled at each iteration in this local neighborhood, while the norm ∥∇f+(x_k)∥ converges to zero quadratically.

Theorem 3.2.

Let Assumptions 1 and 3 hold and let x_k be in the neighborhood of Lemma 3.1 in which the backtracking step size is η_k = 1. Then ∇f+ and ∇f− defined in (7) satisfy

 ∥∇f+(x_{k+1})∥ ≤ (5L/(2m²)) ∥∇f(x_k)∥², (20)

and

 ∥∇f−(x_{k+1})∥ ≥ 2∥∇f−(x_k)∥ − (5L/(2m²)) ∥∇f(x_k)∥². (21)

Proof.

It follows from Lemma 3.1 that the backtracking algorithm admits the step η_k = 1. Hence, writing Δx := x_{k+1} − x_k = −H_m(x_k)⁻¹∇f(x_k), we can write ∇f(x_{k+1}) as

 ∇f(xk+1)=∇f(xk)+∫10∇2f(xk+θΔx)Δxdθ. (22)

We next show that in the region it holds that . Let us consider the region . In this region, by virtue of Lemma LABEL:lemma_bound_eigenvalues we have that and that in the boundary . Since at the critical point , by continuity of the norm we have that the region is contained in the region in which . Thus, we have that . Using the fact that the previous expression can be written as

 ∇f(x_{k+1}) = 2∇f(x_k) + ∫₀¹ (∇²f(x_k + θΔx) + H_m(x_k)) Δx dθ
 = 2∇f(x_k) + ∫₀¹ (∇²f(x_k + θΔx) − ∇²f(x_k)) Δx dθ + ∫₀¹ (∇²f(x_k) − ∇²f(x_c)) Δx dθ + ∫₀¹ (∇²f(x_c) + H_m(x_c)) Δx dθ + ∫₀¹ (H_m(x_k) − H_m(x_c)) Δx dθ. (23)

The latter equality follows by adding and subtracting ∇²f(x_k), ∇²f(x_c), and H_m(x_c). Multiply both sides of (23) by Q−⊤, the matrix of the eigenvectors corresponding to the negative eigenvalues of the Hessian at the saddle x_c. Since Q− is a matrix whose columns are orthonormal eigenvectors, its norm is bounded by one. Combine this fact with the Lipschitz continuity of the Hessian (cf. Assumption 1) to write

 ∥Q−⊤(∇²f(x_k + θΔx) − ∇²f(x_k))Δx∥ ≤ Lθ∥Δx∥². (24)

Likewise, we can upper bound the second and fourth integrands in (LABEL:eqn_gradient_expansion) by

 ∥Q−⊤(∇²f(x_k) − ∇²f(x_c))Δx∥ ≤ L∥x_k − x_c∥∥Δx∥, (25)
 ∥Q−⊤(H_m(x_k) − H_m(x_c))Δx∥ ≤ L∥x_k − x_c∥∥Δx∥. (26)

We next show that the third integrand in (23) becomes zero when multiplied by Q−⊤. Let us write the product as

 Q−⊤(∇²f(x_c) + H_m(x_c)) = Q−⊤Q(Λ(x_c) + |Λ(x_c)|_m)Q⊤. (27)

Let n− be the number of negative eigenvalues. Then Q−⊤Q = [I_{n−} 0]. Moreover, Λ(x_c) + |Λ(x_c)|_m is diagonal with the first n− elements being zero and the remaining being 2Λ_ii. This shows that Q−⊤(∇²f(x_c) + H_m(x_c)) = 0. With this result and the bounds (24), (25) and (26) we can lower bound (23) by

 ∥∇f−(x_{k+1})∥ ≥ 2∥∇f−(x_k)∥ − L∥Δx∥² ∫₀¹ θ dθ − 2L∥x_k − x_c∥∥Δx∥. (28)

Finally, using the facts that ∥Δx∥ ≤ ∥∇f(x_k)∥/m (by the eigenvalue bounds of the PT-inverse) and that ∥x_k − x_c∥ ≤ ∥∇f(x_k)∥/m, the previous bound reduces to

 ∥∇f−(x_{k+1})∥ ≥ 2∥∇f−(x_k)∥ − (5L/(2m²)) ∥∇f(x_k)∥². (29)

The proof for the projection over the positive subspace is analogous.

The first result in Theorem 3.2 shows that the norm ∥∇f+(x_k)∥ approaches zero at a quadratic rate when most of the energy of the gradient belongs to the positive subspace. In particular, in a local neighborhood of a local minimizer all the eigenvalues are positive, so ∇f(x) = ∇f+(x) and the sequence of iterates converges quadratically to the local minimum. Indeed, in a neighborhood of a local minimum the algorithm proposed here is equivalent to Newton's method. We formalize this result in the following corollary.

Corollary 3.3.

Let f be a function satisfying Assumptions 1 and 3 and let x₀ be in the neighborhood of a local minimum such that ∥∇f(x₀)∥ < δ, where

 δ := min{ m²(1 − 2α)/L , m²/(5L) }. (30)

Then, it holds that ∥∇f(x_{K3})∥ < ε, where K3 is bounded by

 K3 ≤ log₂(log₂( 2m²/(5Lε) )). (31)

Proof.

See Appendix LABEL:app_local_minima.

The second result in Theorem 3.2 shows that the norm ∥∇f−(x_k)∥ is multiplied by a factor of at least 2 − ζ, where ζ ∈ (0, 1) is a free parameter, after each iteration of NCN whenever the squared norm ∥∇f(x_k)∥² is negligible relative to ∥∇f−(x_k)∥. We formally state this result in the following proposition.

Proposition 3.4.

Let f be a function satisfying Assumptions 1 and 3. Further, recall the definition of δ in (30) and let ε and ζ be inputs of Algorithm 2. Then, if the conditions ∥∇f−(x₀)∥ ≥ max{(5L/(2ζm²))∥∇f(x₀)∥², ε} and ∥∇f(x₀)∥ < ζδ hold, the sequence generated by NCN needs K1 iterations to escape the saddle and satisfy ∥∇f(x_{K1})∥ ≥ ζδ, where K1 is upper bounded by

 K1 ≤ 1 + log(ζδ/ε)/log(2 − ζ). (32)

Proof.

See Appendix LABEL:app_steps_escape.

The result in Proposition 3.4 states that when the norm of the projection of the gradient onto the subspace of eigenvectors corresponding to negative eigenvalues is larger than the required accuracy ε and dominates the squared norm of the gradient, the sequence of iterates generated by NCN escapes the saddle exponentially with base 2 − ζ. Since ζ is a free parameter, we set ζ = 1/2 as in Theorem 2.2. When the first condition of Proposition 3.4 is not met, the sequence generated by NCN reaches a smaller neighborhood of the saddle in a bounded number of iterations, as shown in the following proposition.

Proposition 3.5.

Let f be a function satisfying Assumptions 1 and 3. Further, recall the definition of δ in (30) and let ε be the accuracy of Algorithm 2. Then, if the first condition of Proposition 3.4 is violated and ∥∇f(x_k)∥ < ζδ holds for some k, the sequence of iterates either satisfies ∥∇f(x_{k + k̃})∥ ≤ ε with

 k̃ = log₂(log₂( 5Lζ/(m²ε) )), (33)

or for some k we have that the conditions of Proposition 3.4 hold.

Proof.

If the iterate is in the neighborhood of a saddle point, then Algorithm 2 adds noise in each component of x_k. To analyze the perturbed iterate we define the following set.

 (34)

We show in an auxiliary lemma that the probability that the perturbed iterate lands in this set is lower bounded by the constant q given in (13). In that case NCN escapes the neighborhood of the saddle at an exponential rate in K1 iterations, based on the analysis of Proposition 3.4. If the latter is not the case, we show that the number of iterations between re-sample instances is bounded. Because we want to escape each saddle point with probability 1 − p/S, the number of draws needed is of the order of log(S/p)/log(1/(1 − q)). The following proposition formalizes the previous discussion.

Proposition 3.6.

Let f be a function satisfying Assumptions 1, 2 and 3. Let ε, the desired accuracy of the output of Algorithm 2, be such that

 ε < √( αβ (m³/M³) (ζδ)² / (1 + (2√n M/m + 1)²) ). (35)

Further, consider the constants α, β, m, and ζ, and let q be a constant such that

 q > 2(1 − Φ(1)) γ(n/2, n/2) / Γ(n/2), (36)

where Φ(1) is the standard Gaussian cumulative distribution evaluated at 1, Γ(·) is the gamma function, and γ(·,·) is the lower incomplete gamma function. For any iterate in the neighborhood of a saddle point where noise is added, with probability 1 − p/S the iterates escape the neighborhood of the saddle after at most K4 iterations, where K4 is given by

 K4 = (1 + log(S/p)/log(1/(1 − q))) [ log₂(log₂( 5Lζ/(m²ε) )) + log(1/γ)/log(2 − ζ) + 1 ]. (37)

Proof.

See Appendix LABEL:app_probabilistic.

Combining the results from Propositions 3.4 and 3.6, we have that with probability 1 − p/S the number of steps required to escape the neighborhood of a saddle is at most K4. The previous result completes the analysis of the neighborhoods of the critical points. To complete the convergence analysis we show in the following proposition that the number of iterations required to reach a neighborhood of a critical point with gradient norm smaller than ζδ is constant.

Proposition 3.7.

Let f be a function satisfying Assumptions 1, 2 and 3, and consider α, β and m as the inputs of Algorithm 2. Further, recall the definition of δ in (30). Then Algorithm 2 reaches a neighborhood such that ∥∇f(x)∥ < ζδ in at most K2 iterations, with

 K2 ≤ M²(f(x₀) − f(x∗)) / (αβ m (ζδ)²). (38)

Proof.

See Appendix LABEL:app_damped_phase.

The previous result establishes a bound on the number of iterations that NCN takes to reach a neighborhood of a critical point satisfying ∥∇f(x)∥ < ζδ. However, to complete the proof of the final complexity bound, we need to ensure that the algorithm does not visit the neighborhood of the same saddle point indefinitely. In particular, the next result establishes that NCN visits such neighborhoods at most a constant number of times. Moreover, in only one of these visits is there a need to add noise.

Proposition 3.8.

Let f satisfy Assumptions 1–3, let x_c be a critical point of f, let ζ be the constant defined in Theorem 2.2, and define

 N = { x ∈ Rⁿ : ∥x − x_c∥ ≤ m(1 − 2α)/L, ∥∇f(x)∥ < ζδ }. (39)

Let α, β and the desired accuracy ε satisfy

 ε < √( αβ (m³/M³) (ζδ)² / (1 + (2√n M/m + 1)²) ). (40)

Let T be the number of times that the sequence generated by NCN visits N. Then, the sequence generated by NCN is such that

 T < (2/(αβ)) (M³/m³) + αβ. (41)

In addition, the neighborhood in which noise is added is visited at most once.

Proof.

The proof of the final complexity stated in Theorem 2.3 follows from Propositions 3.4, 3.6, 3.7 and 3.8, Corollary 3.3, and the discussion after the theorem in Section 2.1.

4 Numerical Experiments

In this section we apply Algorithm 2 to the matrix factorization problem, where the goal is to find a rank-r approximation of a given matrix A ∈ Rⁿˣˡ. The problem can be written as

 min_{U ∈ Rⁿˣʳ, V ∈ Rˡˣʳ} ∥A − UV⊤∥²_F. (42)

Write the product UV⊤ as Σᵢ uᵢvᵢ⊤, where uᵢ and vᵢ are the i-th columns of the matrices U and V respectively, and let x collect the columns of U and V; then we solve

 x∗ := argmin_{x ∈ R^{r×(n+l)}} f(x). (43)
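The objective and its gradient for this experiment can be sketched as follows; the dimensions, step size, and random seed are illustrative choices of ours, with plain gradient descent standing in for the baseline method:

```python
import numpy as np

def loss(A, U, V):
    """f(U, V) = 0.5 * ||A - U V^T||_F^2, the factorization objective."""
    R = A - U @ V.T
    return 0.5 * np.sum(R ** 2)

def grads(A, U, V):
    """Gradients of the loss with respect to U and V."""
    R = A - U @ V.T
    return -R @ V, -R.T @ U

rng = np.random.default_rng(0)
A = np.outer(rng.normal(size=8), rng.normal(size=8))  # rank-one target matrix
U = 0.1 * rng.normal(size=(8, 1))                     # small random initialization
V = 0.1 * rng.normal(size=(8, 1))
loss0 = loss(A, U, V)
for _ in range(3000):                                 # gradient descent baseline
    gU, gV = grads(A, U, V)
    U, V = U - 0.01 * gU, V - 0.01 * gV
```

The same objective and gradient routines can be fed to a second-order iteration by assembling the Hessian of f with respect to the stacked variable x and applying the PT-inverse update (5).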

The matrix A is a rank-one randomly generated matrix. We compare the performance of gradient descent and NCN (Algorithm 2) on problem (43). The step size in gradient descent is selected via backtracking line search with the parameters α and β being the same as those of NCN; the accuracy ε and the remaining parameters are also fixed across methods. The initial iterate is the same in all simulations and is selected at random from a multivariate normal distribution with mean zero. We start by considering an example with n = 10 and r = 1, since in small problems the different behaviors of NCN for different values of m are better illustrated. At the end of this section we consider a larger example to show that the advantages of NCN over gradient descent are preserved in larger problems.

In Figure LABEL:fig_n_10_r_1m1e3 we plot the suboptimality and the norm of the gradient in logarithmic scale for gradient descent and NCN with m = 10⁻³. The suboptimality achieved by NCN is orders of magnitude smaller than that of gradient descent. In particular, observe that gradient descent stops progressing (constant suboptimality) once it is in the neighborhood of a saddle (small gradient norm). On the contrary, NCN succeeds in escaping this neighborhood efficiently. Indeed, we observe an increase in the norm of the gradient, which illustrates the escape from the saddle; moreover, in this region the decrease in the function value is smaller. In Figure LABEL:fig_n_10_r_1m1e12 we present the results for NCN with m = 10⁻¹². The conclusions when comparing NCN with gradient descent are the same as in the case m = 10⁻³.

Note however that the performance of NCN is affected by the truncation level of the PT-inverse: the smaller m is, the faster the algorithm converges to the local minimum. Indeed, for