Local Saddle Point Optimization: A Curvature Exploitation Approach
Abstract
Gradient-based optimization methods are the most popular choice for finding local optima of classical minimization and saddle point problems. Here, we highlight a systemic issue of gradient dynamics that arises for saddle point problems, namely the presence of undesired stable stationary points that are not local optima. We propose a novel optimization approach that exploits curvature information in order to escape from these undesired stationary points. We prove that different optimization methods, including the gradient method and Adagrad, equipped with curvature exploitation can escape non-optimal stationary points. We also provide empirical results on common saddle point problems which confirm the advantage of using curvature exploitation.
Leonard Adolphs Department of Computer Science ETH Zürich ladolphs@ethz.ch Hadi Daneshmand Department of Computer Science ETH Zürich hadi.daneshmand@inf.ethz.ch Aurelien Lucchi Department of Computer Science ETH Zürich aurelien.lucchi@inf.ethz.ch Thomas Hofmann Department of Computer Science ETH Zürich thomas.hofmann@inf.ethz.ch
1 Introduction
We consider the problem of finding a structured saddle point of a smooth objective, namely solving an optimization problem of the form
(1) min_{x ∈ ℝⁿ} max_{y ∈ ℝᵐ} f(x, y).
Here, we assume that f is smooth in x and y but not necessarily convex in x or concave in y. This particular problem arises in many applications, such as generative adversarial networks (GAN) goodfellow14gan (), robust optimization ben2009robust (), and game theory singh2000nash (); LeytonBrown:2008:EGT:1481632 (). Solving the saddle point problem in Eq. (1) is equivalent to finding a point (x^*, y^*) such that
(2) f(x^*, y) ≤ f(x^*, y^*) ≤ f(x, y^*)
holds for all x ∈ ℝⁿ and y ∈ ℝᵐ. For a non-convex-concave function f, finding such a saddle point is computationally infeasible. Instead of finding a global saddle point for Eq. (1), we aim for a more modest goal: finding a locally optimal saddle point, i.e. a point (x^*, y^*) for which the condition in Eq. (2) holds true in a local neighbourhood around (x^*, y^*).
There is a rich literature on saddle point optimization for the particular class of convex-concave functions, i.e. when f is convex in x and concave in y. Although this type of objective function is commonly encountered in applications such as constrained convex minimization, many saddle point problems of interest do not satisfy the convex-concave assumption. One popular example that recently emerged in machine learning is training generative adversarial networks (GANs). In this application, a representation of the data distribution is learned by playing a zero-sum game between two competing neural networks. This yields a saddle point optimization problem which, due to the complex functional representation of the two neural networks, does not fulfill the convexity-concavity condition.
Firstorder methods are commonly used to solve problem (1) as they have a cheap periteration cost and are therefore easily scalable. One particular method of choice is simultaneous gradient descent/ascent, which performs the following iterative updates,
(3) x_{t+1} = x_t − η ∇_x f(x_t, y_t),  y_{t+1} = y_t + η ∇_y f(x_t, y_t),
where η > 0 is a chosen step size which can, e.g., decrease with time t or be a bounded constant. The convergence analysis of the above iterate sequence is typically tied to a strong/strict convexity-concavity property of the objective function defining the dynamics. Under such conditions, the gradient method is guaranteed to converge to a desired saddle point arrow1958studies (). These conditions can also be relaxed to some extent, which will be further discussed in Section 2.
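As a concrete illustration of the update in Eq. (3), the following sketch runs simultaneous gradient descent/ascent on the convex-concave toy objective f(x, y) = x² − y² (an illustrative choice, not an example from the paper), whose unique saddle point is the origin:

```python
# Simultaneous gradient descent/ascent on the convex-concave toy
# objective f(x, y) = x^2 - y^2, whose unique saddle point is (0, 0).
# The function and step size are illustrative choices.

def grad_x(x, y):
    return 2.0 * x           # df/dx

def grad_y(x, y):
    return -2.0 * y          # df/dy

eta = 0.1                    # constant step size
x, y = 1.0, -1.0             # arbitrary initialization
for _ in range(500):
    # descent step in x, ascent step in y, performed simultaneously
    x, y = x - eta * grad_x(x, y), y + eta * grad_y(x, y)

print(x, y)                  # both approach 0
```

For this strongly convex-concave objective, each coordinate contracts by a factor (1 − 2η) per step, so the iterates converge geometrically to the saddle point.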
It is known that the gradient method is locally asymptotically stable mescheder2017numerics (); but stability alone is not sufficient to guarantee convergence to a locally optimal saddle point. Through an example, we will later illustrate that the gradient method is indeed stable at some undesired stationary points at which the structural min-max property (i.e. the function being a local minimum in x and a local maximum in y) is not met. This is in clear contrast to minimization problems, where all stable stationary points of the gradient dynamics are local minima. The stability of these undesired stationary points is therefore an additional difficulty that one has to consider when escaping from such saddles. While a standard trick for escaping saddles in minimization problems consists of adding a small perturbation to the gradient, we will demonstrate that this does not guarantee avoiding undesired stationary points.
Throughout the paper, we will refer to a desired local saddle point (x^*, y^*) as a local minimum in x and a local maximum in y. This characterization implies that the Hessian at (x^*, y^*) has neither a negative curvature direction in x (which would correspond to an eigenvector of ∇²_x f with a negative associated eigenvalue) nor a positive curvature direction in y (which would correspond to an eigenvector of ∇²_y f with a positive associated eigenvalue). In that regard, curvature information can be used to certify whether the desired min-max structure is met.
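In this spirit, the min-max certificate can be checked numerically from the two diagonal Hessian blocks; a minimal sketch on a hypothetical quadratic test function (the matrices A and B below are assumptions for the demo):

```python
import numpy as np

# Numerically certify the min-max structure at a candidate point by
# inspecting the extreme eigenvalues of the two diagonal Hessian
# blocks.  The quadratic f(x, y) = x'Ax - y'By (A, B below are
# hypothetical positive-definite matrices) has constant Hessian blocks
# 2A and -2B, so the origin is a locally optimal saddle point.

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.5, 0.2], [0.2, 0.8]])

H_xx = 2 * A             # Hessian of f w.r.t. x
H_yy = -2 * B            # Hessian of f w.r.t. y

lam_min_x = np.linalg.eigvalsh(H_xx)[0]    # any negative curvature in x?
lam_max_y = np.linalg.eigvalsh(H_yy)[-1]   # any positive curvature in y?

# The min-max structure holds iff there is no negative curvature in x
# and no positive curvature in y.
is_local_saddle = (lam_min_x > 0) and (lam_max_y < 0)
print(is_local_saddle)   # True
```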
In this work, we propose the first saddle point optimization method that exploits curvature to guide the gradient trajectory towards the desired saddle points that respect the min-max structure. Since our approach only makes use of the eigenvectors corresponding to the maximum and minimum eigenvalues (rather than the whole eigenspace), we will refer to it as extreme curvature exploitation. We will prove that this type of curvature exploitation avoids convergence to undesired saddles. Our empirical results also confirm the advantage of curvature exploitation in saddle point optimization.
2 Related work
Asymptotic Convergence
In the context of optimizing a Lagrangian, the pioneering works of kose1956solutions (); arrow1958studies () popularized the use of primal-dual dynamics to arrive at the saddle points of the objective. The work of arrow1958studies () analyzed the stability of this method in continuous time, proving global stability results under strict convex-concave assumptions. This result was extended in uzawa1958iterative () to a discrete-time version of the subgradient method with a constant step size rule, proving that the iterates converge to a neighborhood of a saddle point. Results for a decreasing step size were provided in golshtein1974generalized (); maistroskii1977gradient (), while nemirovskii1978cezare () analyzed an adaptive step size rule with averaged parameters. The work of cherukuri2017saddle () has shown that the conditions on the objective can be relaxed, proving that asymptotic stability to the set of saddle points is guaranteed if either the convexity or concavity property is strict, and convergence is pointwise. They also proved that the strictness assumption can be dropped under other linearity assumptions or for strongly jointly quasiconvex-quasiconcave saddle functions.
However, for problems where the function considered is not strictly convex-concave, convergence to a saddle point is not guaranteed, with the gradient dynamics instead leading to oscillatory solutions holding2014convergence (). These oscillations can be addressed by averaging the iterates nemirovskii1978cezare () or by using the extragradient method (a perturbed version of the gradient method) korpelevich1976extragradient (); gidel2018variational ().
There are also instances of saddle point problems that do not satisfy the various conditions required for convergence. A notable example are generative adversarial networks (GANs) for which the work of nagarajan2017gradient () proved local asymptotic stability under certain suitable conditions on the representational power of the two players (called discriminator and generator). Despite these recent advances, the convergence properties of GANs are still not well understood.
Non-asymptotic Convergence
An explicit convergence rate for the subgradient method with a constant step size was proved in nedic2009subgradient () for reaching an approximate saddle point, as opposed to asymptotically exact solutions. Assuming the function is convex-concave, they proved a sublinear rate of convergence. Rates of convergence have also been derived for the extragradient method korpelevich1976extragradient () as well as for mirror descent nemirovski2004prox ().
In the context of GANs, nowozin2016f () showed that a single-step gradient method converges to a saddle point in a neighborhood in which the function is strongly convex-concave. The work of liang2018interaction () studied the theory of non-asymptotic convergence to a local Nash equilibrium. They prove that, assuming local strong convexity-concavity, simultaneous gradient descent achieves an exponential rate of convergence near a stable local Nash equilibrium. They also extended this result to other discrete-time saddle point dynamics such as optimistic mirror descent or predictive methods.
Negative Curvature Exploitation
The presence of negative curvature in the objective function indicates the existence of a potential descent direction, which is commonly exploited in order to escape saddle points and reach a local minimizer. Among these approaches are trust-region methods that guarantee convergence to a second-order stationary point conn2000trust (); nesterov2006cubic (); cartis2011adaptive (). While a naïve implementation of these methods would require the computation and inversion of the Hessian of the objective, this can be avoided by replacing Hessian computations with Hessian-vector products that can be computed efficiently Pearlmutter94fastexact (). This is applied e.g. using matrix-free Lanczos iterations curtis2017exploiting () or online variants such as Oja’s algorithm allen2017natasha (). Subsampling the Hessian can furthermore reduce the computational cost by using various sampling schemes kohler2017sub (); xu2017newton (). Finally, allen2017neon2 (); xu2017first () showed that first-order information can act as a noisy power method allowing one to find a negative curvature direction.
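The Hessian-vector products mentioned above can be sketched without automatic differentiation by a finite difference of gradients, Hv ≈ (∇f(z + εv) − ∇f(z))/ε; the quadratic below is an illustrative stand-in:

```python
import numpy as np

# Matrix-free Hessian-vector products: H v can be approximated by a
# finite difference of gradients, (grad(z + eps*v) - grad(z)) / eps,
# so the Hessian never has to be formed explicitly.  The quadratic
# f(z) = 0.5 z'Hz below (H_true is an arbitrary symmetric example)
# makes the product easy to verify.

H_true = np.array([[3.0, 1.0, 0.0],
                   [1.0, -2.0, 0.5],
                   [0.0, 0.5, 1.0]])

def grad(z):
    return H_true @ z                     # exact gradient of 0.5 z'Hz

def hessian_vector_product(z, v, eps=1e-6):
    # only two gradient evaluations, never the full Hessian
    return (grad(z + eps * v) - grad(z)) / eps

z = np.array([0.3, -0.7, 1.1])
v = np.array([1.0, 2.0, -1.0])
print(hessian_vector_product(z, v))       # close to H_true @ v
```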
In contrast to these classical results that "blindly" try to escape any type of saddle point, our aim is to exploit curvature information to reach a specific type of stationary point, namely one that satisfies the min-max condition required at the optimum of the objective function.
3 Preliminaries
Assumptions
For the sake of further analysis, we require the function f to be sufficiently smooth and its second-order derivatives with respect to the parameters x and y to be non-degenerate.
Assumption 1 (Smoothness).
We assume that f is a C² function whose gradient and Hessian are Lipschitz with respect to the parameters x and y, i.e. we assume that the following inequalities hold:
(4) ‖∇f(x, y) − ∇f(x′, y)‖ ≤ L_x ‖x − x′‖
(5) ‖∇f(x, y) − ∇f(x, y′)‖ ≤ L_y ‖y − y′‖
(6) ‖∇²f(x, y) − ∇²f(x′, y)‖ ≤ ρ_x ‖x − x′‖
(7) ‖∇²f(x, y) − ∇²f(x, y′)‖ ≤ ρ_y ‖y − y′‖
Assumption 2 (Nondegenerate Hessian).
We assume that the matrices ∇²_x f(x, y) and ∇²_y f(x, y) are non-degenerate for all (x, y).
Locally optimal saddles
Let us define a neighbourhood around the point z^* = (x^*, y^*) as
(8) K_γ(z^*) = { (x, y) : ‖x − x^*‖ ≤ γ, ‖y − y^*‖ ≤ γ }
with a sufficiently small γ > 0.
Definition 3 (Locally optimal saddle point).
A point z^* = (x^*, y^*) is a locally optimal saddle point of the problem in Eq. (1) if
(9) f(x^*, y) ≤ f(x^*, y^*) ≤ f(x, y^*)
holds for all (x, y) ∈ K_γ(z^*).
With the use of the non-degeneracy assumption on the Hessian matrices, we are able to establish sufficient conditions for (x^*, y^*) to be a locally optimal saddle point.
Lemma 4.
Suppose that f satisfies Assumption 2; then z^* = (x^*, y^*) is a locally optimal saddle point on K_γ(z^*) if and only if the gradient with respect to (x, y) is zero, i.e.
(10) ∇f(x^*, y^*) = 0,
and the second derivative at (x^*, y^*) is positive definite in x and negative definite in y (in the game theory literature, such a point is commonly referred to as a local Nash equilibrium, see e.g. liang2018interaction ()), i.e. there exist μ_x, μ_y > 0 such that
(11) ∇²_x f(x^*, y^*) ⪰ μ_x I,  ∇²_y f(x^*, y^*) ⪯ −μ_y I.
4 Undesired Stability
Asymptotic scenarios
There are three different asymptotic scenarios for the gradient iterations in Eq. (3): (i) divergence (i.e. lim_{t→∞} ‖z_t‖ = ∞), (ii) being trapped in a loop (i.e. the iterates keep cycling without converging), and (iii) convergence to a stationary point z^* of the gradient updates (i.e. lim_{t→∞} z_t = z^*). To the best of our knowledge, there is no global convergence guarantee for general saddle point optimization. Typical convergence guarantees require convexity-concavity or somewhat relaxed conditions such as quasiconvexity-quasiconcavity of f cherukuri2017saddle (). Instead of global convergence, we therefore consider the more achievable goal of local convergence. Hence, we focus on the third outlined case, where we are sure that the method converges to some stationary point z^* for which ∇f(z^*) = 0 holds. The question of interest here is whether such a sequence always yields a locally optimal saddle point as defined in Def. 3.
Local stability
A stationary point of the gradient iterations can be either stable or unstable. The notion of stability characterizes the behavior of the gradient iterations in a local region around the stationary point. In the neighbourhood of a stable stationary point, successive iterations of the method are not able to escape the region. Conversely, a stationary point is called unstable if it is not stable. Let G(z) = z + η(−∇_x f(z), ∇_y f(z)) denote the update map of Eq. (3). The stationary point z^* (for which ∇f(z^*) = 0 holds) is a locally stable point of the gradient iterations if the Jacobian of its dynamics has only eigenvalues within the unit disk, i.e.
(12) |λ| ≤ 1 for every eigenvalue λ of ∇G(z^*).
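The stability condition can be verified numerically by forming the Jacobian of the update map and computing its spectral radius. The sketch below does this at the origin of the illustrative polynomial f(x, y) = 2x² + y² + 4xy + 4y³/3 − y⁴/4 (this specific f is an assumption for the demo):

```python
import numpy as np

# Stability check via the Jacobian of the update map
# z -> z + eta * (-grad_x f, grad_y f) at a stationary point.  The
# illustrative function f(x, y) = 2x^2 + y^2 + 4xy + 4y^3/3 - y^4/4
# has a stationary point at the origin that is a minimum in BOTH x
# and y, yet the check below shows the gradient dynamics are stable
# there.

eta = 0.05
# Jacobian of (-grad_x f, grad_y f) at (0, 0):
#   -grad_x f = -(4x + 4y)            ->  row (-4, -4)
#    grad_y f = 2y + 4x + 4y^2 - y^3  ->  row ( 4,  2)
A = np.array([[-4.0, -4.0],
              [ 4.0,  2.0]])
J = np.eye(2) + eta * A              # Jacobian of the update map

spectral_radius = np.max(np.abs(np.linalg.eigvals(J)))
print(spectral_radius)               # below 1: the origin is stable
```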
Random initialization
In the following, we will use the notion of stability to analyze the asymptotic behavior of the gradient method. We start with a lemma extending known results for general minimization problems, which prove that gradient descent with random initialization almost surely converges to a stable stationary point lee2016gradient ().
Undesired stable stationary point
If all stable stationary points of the gradient dynamics are locally optimal saddle points, then the result of Lemma 5 guarantees almost sure convergence to a solution of the saddle point problem in Eq. (1). It is already known that every locally optimal saddle point is a stable stationary point of the gradient dynamics mescheder2017numerics (); nagarajan2017gradient (). While for minimization problems, the set of stable stationary points is the same as the set of local minima, this might not be the case for the problem we consider here. Indeed, the gradient dynamics might introduce additional stable points that are not locally optimal saddle points. We illustrate this claim in the next example.
Example
Consider the following two-dimensional saddle point problem (to guarantee smoothness, one can restrict the domain of f to a bounded set):
(13) min_x max_y f(x, y),  f(x, y) = 2x² + y² + 4xy + (4/3)y³ − (1/4)y⁴,
with x, y ∈ ℝ. The critical points of the function, i.e. points for which ∇f(x, y) = 0, are
(14) z₁ = (0, 0),  z₂ = (2 − √2)(−1, 1),  z₃ = (2 + √2)(−1, 1).
Evaluating the Hessians at the three critical points gives rise to the following three matrices:
(15) ∇²f(z₁) = (4, 4; 4, 2),  ∇²f(z₂) = (4, 4; 4, 4√2),  ∇²f(z₃) = (4, 4; 4, −4√2).
We see that only z₃ is a locally optimal saddle point, namely ∇²_x f(z₃) > 0 and ∇²_y f(z₃) < 0, whereas the two other points are both a local minimum in the parameter y rather than a maximum. However, Figure 1(a) illustrates gradient steps converging to the undesired stationary point z₁ = (0, 0) because it is a locally stable point of the dynamics. Hence, even small perturbations of the gradients in each step cannot avoid convergence to this point (see Figure 1(b)).
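This behavior can be reproduced numerically. The sketch below runs the plain gradient dynamics on a function of the same form as the example above (the exact coefficients, f(x, y) = 2x² + y² + 4xy + 4y³/3 − y⁴/4, are this sketch's reconstruction and should be treated as illustrative):

```python
# Plain gradient descent/ascent on the illustrative function
# f(x, y) = 2x^2 + y^2 + 4xy + 4y^3/3 - y^4/4.  The origin is a local
# minimum in y, so it is NOT a locally optimal saddle point, yet the
# iterates converge to it.

def grads(x, y):
    gx = 4 * x + 4 * y                     # grad_x f
    gy = 2 * y + 4 * x + 4 * y**2 - y**3   # grad_y f
    return gx, gy

eta = 0.02
x, y = 0.1, 0.1                            # initialize near the origin
for _ in range(5000):
    gx, gy = grads(x, y)
    x, y = x - eta * gx, y + eta * gy      # simultaneous updates

print(x, y)                                # both approach 0
```

The linearized dynamics at the origin are a stable spiral, so the iterates are drawn into the undesired stationary point.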
5 Extreme Curvature Exploitation
The previous example has shown that the gradient iterations for the saddle point problem can introduce undesired stable points. In this section, we propose a strategy to escape from these stationary points. Our approach is based on exploiting curvature information, as in curtis2017exploiting ().
Extreme curvature direction
Let λ_x be the minimum eigenvalue of ∇²_x f(z) with associated unit eigenvector v_x, and λ_y be the maximum eigenvalue of ∇²_y f(z) with associated unit eigenvector v_y. Then, we define
(16) v_x^(−) = (λ_x / 2ρ_x) 𝟙{λ_x < 0} sgn(⟨v_x, ∇_x f(z)⟩) v_x,  v_y^(+) = (λ_y / 2ρ_y) 𝟙{λ_y > 0} sgn(⟨v_y, ∇_y f(z)⟩) v_y,
where sgn is the sign function and 𝟙 the indicator function. Using the above vectors, we define v_z = (v_x^(−), v_y^(+)) as the extreme curvature direction at z.
Algorithm
Using the extreme curvature direction, we modify the gradient steps as follows:
(17) x_{t+1} = x_t + η(v_x^(−) − ∇_x f(x_t, y_t)),  y_{t+1} = y_t + η(v_y^(+) + ∇_y f(x_t, y_t)).
This new update step is constructed by adding the extreme curvature direction to the gradient method of Eq. (3). From now on, we will refer to this modified update as the Cesp (curvature exploitation for the saddle point problem) method.
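A minimal sketch of such a curvature-exploiting update in two dimensions, where the Hessian blocks are scalars and the extreme curvature "eigenvectors" reduce to ±1. The test function is the illustrative polynomial from above, and the coefficient `scale` replaces the 1/(2ρ) factor as a tuning assumption, so this is a sketch of the idea rather than the paper's exact method:

```python
# Curvature-exploiting update in the spirit of Eq. (17) on the
# illustrative function f(x, y) = 2x^2 + y^2 + 4xy + 4y^3/3 - y^4/4.
# In 2-D the Hessian blocks are scalars; `scale` stands in for the
# 1/(2*rho) factor and is a tuning assumption.

def grads(x, y):
    return 4 * x + 4 * y, 2 * y + 4 * x + 4 * y**2 - y**3

def hess_blocks(x, y):
    return 4.0, 2 + 8 * y - 3 * y**2       # d2f/dx2, d2f/dy2

def sgn(a):
    return 1.0 if a > 0 else (-1.0 if a < 0 else 0.0)

eta, scale = 0.05, 0.5
x, y = 0.05, 0.05                          # start near the undesired origin
for _ in range(5000):
    gx, gy = grads(x, y)
    hxx, hyy = hess_blocks(x, y)
    vx = scale * hxx * sgn(gx) if hxx < 0 else 0.0   # only if negative curvature in x
    vy = scale * hyy * sgn(gy) if hyy > 0 else 0.0   # only if positive curvature in y
    x, y = x + eta * (vx - gx), y + eta * (vy + gy)

# Empirically (for this tuning), the iterates leave the origin and
# settle near the locally optimal saddle (-(2 + sqrt(2)), 2 + sqrt(2)).
print(x, y)
```

The positive curvature in y makes the origin non-stationary for this update, so the iterates are pushed out of its neighbourhood and can reach the locally optimal saddle point.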
Stability
Extreme curvature exploitation has already been used for escaping from unstable stationary points (i.e. saddle points) of gradient descent for minimization problems curtis2017exploiting (). In saddle point problems, curvature exploitation is advantageous not only for escaping unstable stationary points but also for escaping undesired stable stationary points of the gradient iterates. The upcoming two lemmas prove that the set of stable stationary points of the Cesp dynamics and the set of locally optimal saddle points are the same – therefore, the optimizer only converges to a solution of the local saddle point problem.
Lemma 6.
A point z^* is a stationary point of the iterates in Eq. (17) if and only if z^* is a locally optimal saddle point.
We can conclude from the result of Lemma 6 that every stationary point of the Cesp dynamics is a locally optimal saddle point. The next lemma establishes the stability of these points.
Lemma 7.
Every locally optimal saddle point z^* is a stable stationary point of the iterates in Eq. (17).
Escaping from undesired saddles
Extreme curvature exploitation allows us to escape from undesired saddles. In the next lemma, we show that the optimization trajectory of Cesp stays away from all undesired stationary points of the gradient dynamics.
Lemma 8.
Suppose that z̃ = (x̃, ỹ) is an undesired stationary point of the gradient dynamics, namely (without loss of generality, we state the case of negative curvature in x)
(20) ∇f(z̃) = 0,  λ_min(∇²_x f(z̃)) < 0.
Consider the iterates of Eq. (17) starting from z₀ in a neighbourhood K_γ(z̃) of z̃. After one step the iterates escape the neighbourhood, i.e.
(21) z₁ ∉ K_γ(z̃)
for a sufficiently small γ.
Guaranteed decrease/increase
The gradient update step of Eq. (3) has the property that an update with respect to x (resp. y) decreases (resp. increases) the function value. The next lemma proves that Cesp shares the same desirable property.
Lemma 9.
In each iteration of Eq. (17), f decreases in x, with
(22) f(x_{t+1}, y_t) ≤ f(x_t, y_t),
and increases in y, with
(23) f(x_t, y_{t+1}) ≥ f(x_t, y_t),
as long as the step size η is chosen sufficiently small relative to the smoothness constants of Assumption 1.
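The monotonicity claim is easy to check numerically on an illustrative function. At the point chosen below there is no negative curvature in x and no positive curvature in y, so both curvature terms vanish and the updates reduce to plain gradient steps:

```python
# One x-step should decrease f(., y) and one y-step should increase
# f(x, .) for a small step size.  Checked on the illustrative function
# f(x, y) = 2x^2 + y^2 + 4xy + 4y^3/3 - y^4/4; at (1.3, -0.7) the
# x-curvature is positive and the y-curvature negative, so both
# extreme-curvature terms are zero.

def f(x, y):
    return 2 * x**2 + y**2 + 4 * x * y + (4 / 3) * y**3 - 0.25 * y**4

def grads(x, y):
    return 4 * x + 4 * y, 2 * y + 4 * x + 4 * y**2 - y**3

eta = 0.01
x, y = 1.3, -0.7
gx, gy = grads(x, y)
x_new = x - eta * gx     # x-update (curvature term = 0 here)
y_new = y + eta * gy     # y-update (curvature term = 0 here)

print(f(x_new, y) <= f(x, y))   # True: decrease in x
print(f(x, y_new) >= f(x, y))   # True: increase in y
```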
Implementation with Hessian-vector products
Since storing and computing the Hessian in high dimensions is very costly, we need a way to extract the extreme curvature direction efficiently. The most common approach for obtaining the eigenvector corresponding to the minimum eigenvalue of ∇²_x f (and the eigenvalue itself) is to run power iterations on the shifted matrix M = I − β∇²_x f, i.e.
(25) v_{t+1} = M v_t / ‖M v_t‖,
where v_t is normalized after every iteration and β > 0 is chosen such that M ⪰ 0. Since this method only requires implicit Hessian computation through a Hessian-vector product, it can be implemented as efficiently as gradient evaluations Pearlmutter94fastexact (). The results of doi:10.1137/0613066 () provide a bound on the number of power iterations required to extract an approximate extreme curvature direction with high probability (cf. lee2016gradient ()).
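A sketch of such shifted power iterations, using only matrix-vector products with a synthetic "Hessian" of known spectrum (the matrix, the shift β, and the iteration count are illustrative assumptions):

```python
import numpy as np

# Estimate the smallest eigenvalue of a Hessian block by power
# iterations on the shifted matrix M = I - beta * H: the top
# eigenvector of M is the eigenvector of the smallest eigenvalue of H.
# H is synthetic with known spectrum; only H @ v products are used,
# as a stand-in for Hessian-vector products.

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
d = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
H = Q @ np.diag(d) @ Q.T             # known lambda_min = -3

beta = 1.0 / 4.0                     # 1 / (upper bound on ||H||), so M >= 0
v = rng.standard_normal(6)
for _ in range(200):
    v = v - beta * (H @ v)           # multiply by M = I - beta * H
    v = v / np.linalg.norm(v)        # normalize every iteration

lam_min_estimate = v @ (H @ v)       # Rayleigh quotient
print(lam_min_estimate)              # close to -3
```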
6 Curvature Exploitation for lineartransformed gradient steps
Lineartransformed gradient optimization
Applying a linear transformation to the gradient updates is commonly used to accelerate optimization for various types of problems. The resulting updates can be written in the general form
(26) (x_{t+1}; y_{t+1}) = (x_t; y_t) + η A (−∇_x f(x_t, y_t); ∇_y f(x_t, y_t)),
where A is a symmetric, block-diagonal matrix. Different optimization methods use different linear transformations A. Table 1 in Section B of the appendix illustrates the choice of A for different optimizers. Adagrad Duchi:EECS201024 (), one of the most popular non-convex optimization methods, belongs to this category.
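An Adagrad-style instance of this linear-transformed form can be sketched as follows; the diagonal preconditioner built from accumulated squared gradients is positive definite, and the objective f(x, y) = x² − y² is an illustrative choice:

```python
# Adagrad-style simultaneous updates as an instance of the
# linear-transformed form: the preconditioner is the positive-definite
# diagonal matrix diag(1 / sqrt(G + eps)) built from accumulated
# squared gradients.  f(x, y) = x^2 - y^2 is an illustrative
# convex-concave objective with saddle point (0, 0).

eta, eps = 0.1, 1e-8
x, y = 1.0, 1.0
Gx, Gy = 0.0, 0.0
for _ in range(3000):
    gx, gy = 2.0 * x, -2.0 * y             # grad_x f, grad_y f
    Gx += gx * gx                          # accumulate squared gradients
    Gy += gy * gy
    x = x - eta * gx / (Gx + eps) ** 0.5   # preconditioned descent in x
    y = y + eta * gy / (Gy + eps) ** 0.5   # preconditioned ascent in y

print(x, y)                                # both shrink toward 0
```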
Extreme curvature exploitation
We can adapt Cesp to the lineartransformed variant:
(27) (x_{t+1}; y_{t+1}) = (x_t; y_t) + η A (v_z + (−∇_x f(x_t, y_t); ∇_y f(x_t, y_t))),
where we choose the linear transformation matrix A to be positive definite. This variant of Cesp is also able to filter out the undesired stable stationary points of the gradient method for the saddle point problem. The following lemma proves that it has the same properties as the non-transformed version.
Lemma 10.
Suppose that the transformation matrix A is positive definite; then z^* is a stable stationary point of the iterates in Eq. (27) if and only if z^* is a locally optimal saddle point.
A direct implication of Lemma 10 is that we can also use curvature exploitation for Adagrad. Later, we will experimentally show the advantage of using curvature exploitation for this method.
7 Experiments
7.1 Escaping from undesired stationary points of the toy example
Previously, we saw that for the two-dimensional saddle point problem on the function of Eq. (13), gradient iterates may converge to an undesired stationary point that is not locally optimal. As shown in Figure 2, Cesp solves this issue. In this example, simultaneous gradient iterates converge to the undesired stationary point for many different initializations, whereas our method always converges to the locally optimal saddle point. A plot of the basins of attraction of the two different optimizers on this example is presented in Figure 4 in the appendix.
7.2 Generative Adversarial Networks
This experiment evaluates the performance of the Cesp method for training a Generative Adversarial Network (GAN), which reduces to solving the saddle point problem
(28) min_x max_y E_{u ∼ p_data}[log D(u)] + E_{s ∼ p_s}[log(1 − D(G(s)))],
where the discriminator D and the generator G are represented by neural networks parameterized with the variables y and x, respectively. We use the MNIST data set and a simple GAN architecture with 1 hidden layer and 100 units. More technical details are provided in Appendix C.2. Here, we investigate the advantage of curvature exploitation for the Adagrad method, which is a member of the class of linear-transformed gradient methods often used for saddle point problems. Moreover, we make use of power iterations as described in Section 5 to efficiently approximate the extreme curvature vector. Note that since we are using mini-batches in this experiment, we do not have access to the exact gradient information but rely on an approximation here as well.
We evaluate the efficacy of the negative curvature step in terms of the spectrum of the Hessian at a convergent solution z^*. We compare Cesp to the vanilla Adagrad optimizer. Since we are interested in a solution that gives rise to a locally optimal saddle point, we track (an approximation of) the smallest eigenvalue of ∇²_x f and the largest eigenvalue of ∇²_y f through the optimization. Using these estimates, we can evaluate whether a method has converged to a locally optimal saddle point.
The results are shown in Figure 3. The decrease in terms of the squared norm of the gradients indicates that both methods converge to a solution. Moreover, both fulfill the condition for a locally optimal saddle point for the parameter y, i.e. the maximum eigenvalue of ∇²_y f is negative. However, the graph of the minimum eigenvalue of ∇²_x f shows that Cesp converges faster, and with less frequent and severe spikes, to a solution where the minimum eigenvalue is zero. Hence, the negative curvature step seems to be able to drive the optimization procedure towards regions that yield points closer to a locally optimal saddle point.
More experimental results – in the application of training GANs and robust optimization – are provided in Appendix C.
8 Conclusion
We focused our study on reaching a solution to the local saddle point problem. First, we have shown that gradient methods have stable stationary points that are not locally optimal, which is a problem exclusively arising in saddle point optimization. Second, we proposed a novel approach that exploits extreme curvature information to avoid the undesired stationary points. We believe this work highlights the benefits of using curvature information for saddle point problems and might open the door to other novel algorithms with stronger global convergence guarantees.
References
 (1) Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
 (2) Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.
 (3) Kenneth Joseph Arrow, Leonid Hurwicz, Hirofumi Uzawa, and Hollis Burnley Chenery. Studies in linear and non-linear programming. 1958.
 (4) Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton University Press, 2009.
 (5) Michele Benzi, Gene H. Golub, and Jörg Liesen. Numerical solution of saddle point problems. Acta Numerica, 14:1–137, 2005.
 (6) Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
 (7) Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortés. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
 (8) Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust region methods. SIAM, 2000.
 (9) Frank E. Curtis and Daniel P. Robinson. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017.
 (10) Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27, pages 2933–2941. Curran Associates, Inc., 2014.
 (11) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, EECS Department, University of California, Berkeley, March 2010.
 (12) Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018.
 (13) E. G. Golshtein. Generalized gradient method for finding saddle points. Matekon, 10(3):36–52, 1974.
 (14) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 (15) Thomas Holding and Ioannis Lestas. On the convergence to saddle points of concave-convex functions, the gradient method and emergence of oscillations. In 53rd IEEE Conference on Decision and Control (CDC), pages 1143–1148. IEEE, 2014.
 (16) H. K. Khalil. Nonlinear Systems. Prentice Hall, 2002.
 (17) Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, 2017.
 (18) G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
 (19) T. Kose. Solutions of saddle value problems by differential equations. Econometrica, pages 59–70, 1956.
 (20) J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
 (21) Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.
 (22) Kevin Leyton-Brown and Yoav Shoham. Essentials of Game Theory: A Concise, Multidisciplinary Introduction. Morgan & Claypool Publishers, 1st edition, 2008.
 (23) Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
 (24) D. Maistroskii. Gradient methods for finding saddle points. Matekon, 14(1):3–22, 1977.
 (25) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. arXiv preprint arXiv:1705.10461, 2017.
 (26) Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
 (27) Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30, pages 2975–2984, 2017.
 (28) Angelia Nedić and Asuman Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.
 (29) Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
 (30) A. S. Nemirovskii and D. B. Yudin. Cesàro convergence of the gradient method for approximating saddle points of convex-concave functions. Doklady Akademii Nauk SSSR, 239:1056–1059, 1978.
 (31) Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 (32) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
 (33) Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147–160, 1994.
 (34) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
 (35) Satinder Singh, Michael Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548. Morgan Kaufmann Publishers Inc., 2000.
 (36) Hirofumi Uzawa. Iterative methods for concave programming. Studies in Linear and Nonlinear Programming, 6:154–165, 1958.
 (37) Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.
 (38) Yi Xu and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
Appendix
Appendix A Theoretical Analysis
A.1 Lemma 4
Lemma 4.
Suppose that f satisfies Assumption 2; then z^* = (x^*, y^*) is a locally optimal saddle point on K_γ(z^*) if and only if the gradient is zero, i.e.
(29) ∇f(x^*, y^*) = 0,
and the second derivative at (x^*, y^*) is positive definite in x and negative definite in y, i.e., there exist μ_x, μ_y > 0 such that
(30) ∇²_x f(x^*, y^*) ⪰ μ_x I,  ∇²_y f(x^*, y^*) ⪯ −μ_y I.
Proof.
From Definition 3 it follows that a locally optimal saddle point is a point (x^*, y^*) for which the following two conditions hold:
(31) f(x^*, y) ≤ f(x^*, y^*) ≤ f(x, y^*)  for all (x, y) ∈ K_γ(z^*).
Hence, x^* is a local minimizer of f(·, y^*) and y^* is a local maximizer of f(x^*, ·). We therefore, without loss of generality, prove the statement of the lemma only for the minimizer x^*, namely that
(i) ∇_x f(x^*, y^*) = 0, and
(ii) ∇²_x f(x^*, y^*) ≻ 0.
The proof for the maximizer y^* follows directly.
(i)
If we assume that ∇_x f(x^*, y^*) ≠ 0, then we can find a feasible update x^* + Δx ∈ K_γ(z^*) that decreases the function value. Using the smoothness assumption (Assumption 1), we arrive at the following inequality:
(32) f(x^* + Δx, y^*) ≤ f(x^*, y^*) + ∇_x f(x^*, y^*)ᵀ Δx + (L_x/2)‖Δx‖².
By choosing the gradient descent direction Δx = −η∇_x f(x^*, y^*) with a sufficiently small step size η > 0 such that x^* + Δx ∈ K_γ(z^*), it holds that
(33) f(x^* + Δx, y^*) ≤ f(x^*, y^*) − η(1 − ηL_x/2)‖∇_x f(x^*, y^*)‖² < f(x^*, y^*),
which contradicts that x^* is a local minimizer. Hence, ∇_x f(x^*, y^*) = 0 is a necessary condition for a local minimizer.

To prove the second statement, we again make use of the smoothness assumption, now with the second-order expansion around x^*. From (i) we know that ∇_x f(x^*, y^*) = 0 and, therefore, we obtain for x^* + Δx ∈ K_γ(z^*):
(34) f(x^* + Δx, y^*) ≤ f(x^*, y^*) + (1/2)Δxᵀ ∇²_x f(x^*, y^*) Δx + (ρ_x/6)‖Δx‖³.
If ∇²_x f(x^*, y^*) is not positive semi-definite, then there exists at least one eigenvector v with negative curvature, i.e. vᵀ∇²_x f(x^*, y^*)v < 0. This implies that, for Δx = εv with a sufficiently small ε > 0, following the curvature vector decreases the function value, i.e.
(35) f(x^* + εv, y^*) ≤ f(x^*, y^*) + (ε²/2)vᵀ∇²_x f(x^*, y^*)v + (ρ_x/6)ε³‖v‖³ < f(x^*, y^*).
This contradicts that x^* is a local minimizer, which proves the condition
(36) ∇²_x f(x^*, y^*) ⪰ 0;
together with the non-degeneracy of Assumption 2, this yields ∇²_x f(x^*, y^*) ≻ 0.
∎
A.2 Lemma 5
Lemma 11.
The gradient mapping for the saddle point problem,
(37) G_η(x, y) = (x − η∇_x f(x, y), y + η∇_y f(x, y)),
with step size η < 1/L, where L := L_x + L_y, is a diffeomorphism.
Proof.
The following proof closely follows the proof of Proposition 4.5 in [21].
A necessary condition for a diffeomorphism is bijectivity. Hence, we need to check that G_η is (i) injective and (ii) surjective for η < 1/L.

(i)
Consider two points z = (x, y) and z′ = (x′, y′) for which
(38) G_η(z) = G_η(z′)
holds. Writing F(z) = (−∇_x f(z), ∇_y f(z)), this means
(39) z + ηF(z) = z′ + ηF(z′),
(40) ‖z − z′‖ = η‖F(z) − F(z′)‖.
Note that, by the Lipschitz continuity of the gradients (Assumption 1),
(41) ‖F(z) − F(z′)‖ ≤ (L_x + L_y)‖z − z′‖ = L‖z − z′‖,
from which it follows that
(42) ‖z − z′‖ ≤ ηL‖z − z′‖,
(43) (1 − ηL)‖z − z′‖ ≤ 0,
(44) ⇒ ‖z − z′‖ = 0 for ηL < 1.
Therefore, G_η is injective.

(ii)
We will show that G_η is surjective by constructing an explicit inverse mapping for the two blocks individually. As suggested by [21], we make use of the proximal point algorithm on the function f for the parameters x and y individually.
For the parameter x, consider, for a given center x̄, the mapping
(45) x^* = argmin_x { (1/2)‖x − x̄‖² − ηf(x, y) }.
Moreover, note that the objective in Eq. (45) is strongly convex in x if η < 1/L_x:
(46) ∇²_x ((1/2)‖x − x̄‖² − ηf(x, y)) = I − η∇²_x f(x, y)
(47) ⪰ (1 − ηL_x) I ≻ 0.
Hence, the objective has a unique minimizer x^*, characterized by
(48) x^* − x̄ − η∇_x f(x^*, y) = 0
(49) ⇔ x̄ = x^* − η∇_x f(x^*, y),
which means that every x̄ is attained by the x-component of the gradient mapping if η < 1/L_x.
The same line of reasoning can be applied to the parameter y with the opposite sign of f, i.e., for a given center ȳ,
(50) y^* = argmin_y { (1/2)‖y − ȳ‖² + ηf(x, y) },
whose objective is strongly convex in y for η < 1/L_y and whose unique minimizer satisfies ȳ = y^* + η∇_y f(x, y^*). This lets us conclude that the mapping G_η is surjective for η < 1/L.
Observing that, under Assumption 2, G_η⁻¹ is continuously differentiable concludes the proof that G_η is a diffeomorphism. ∎
Lemma 5 (Random Initialization).
The gradient iterates of Eq. (3) with a step size as in Lemma 11 and a random initialization almost surely converge to a stable stationary point.
A.3 Lemma 6
Lemma 6.
The point z^* is a stationary point of the iterates in Eq. (17) if and only if z^* is a locally optimal saddle point.
Proof.
The point z^* is a stationary point of the iterates if and only if the full update vanishes at z^*. Let us consider, w.l.o.g., only the stationary point condition with respect to x, i.e.
(51) v_x^(−) − ∇_x f(z^*) = 0.
We prove that the above equation can hold only if λ_x ≥ 0, i.e. only if there is no negative curvature in x. This can be proven by a simple contradiction: suppose that λ_x < 0; then multiplying both sides of the above equation by ∇_x f(z^*)ᵀ yields
(52) (λ_x / 2ρ_x)|⟨v_x, ∇_x f(z^*)⟩| = ‖∇_x f(z^*)‖².
Since the left-hand side is negative and the right-hand side is positive, the above equation leads to a contradiction. Therefore λ_x ≥ 0, and consequently v_x^(−) = 0 and ∇_x f(z^*) = 0. The same argument for y gives λ_y ≤ 0 and ∇_y f(z^*) = 0, and therefore, according to Lemma 4 (together with Assumption 2), z^* is a locally optimal saddle point. ∎
A.4 Lemma 7
Lemma 7.
Every locally optimal saddle point z^* is a stable stationary point of the iterates in Eq. (17).
Proof.
The proof is based on a simple idea: in a neighbourhood K_γ(z^*) of a locally optimal saddle point z^*, f cannot have extreme curvature, i.e., v_z = 0 for all z ∈ K_γ(z^*). Hence, within K_γ(z^*) the update of Eq. (17) reduces to the gradient update in Eq. (3), which is stable at z^* according to [26, 25].
To prove our claim that no negative curvature in x exists in K_γ(z^*), we make use of the smoothness assumption. Suppose that z ∈ K_γ(z^*); then the smoothness Assumption 1 implies
(55) λ_min(∇²_x f(z)) ≥ λ_min(∇²_x f(z^*)) − ‖∇²_x f(z) − ∇²_x f(z^*)‖
(56) ≥ μ_x − ‖∇²_x f(z) − ∇²_x f(z^*)‖
(57) ≥ μ_x − ρ_x‖x − x^*‖ − ρ_y‖y − y^*‖
(58) ≥ μ_x − (ρ_x + ρ_y)γ
(59) > 0,
where the last inequality holds for γ < μ_x/(ρ_x + ρ_y).
Similarly, one can show that
(60) λ_max(∇²_y f(z)) < 0.
Therefore, the extreme curvature direction v_z is zero according to the definition in Eq. (16). ∎
A.5 Lemma 8
Lemma 8.
Suppose that z̃ = (x̃, ỹ) is an undesired stationary point of the gradient dynamics, namely (without loss of generality, we state the case of negative curvature in x)
(61) ∇f(z̃) = 0,  λ_min(∇²_x f(z̃)) < 0.
Consider the iterates of Eq. (17) starting from z₀ in a neighbourhood K_γ(z̃) of z̃. After one step the iterates escape the neighbourhood, i.e.
(62) z₁ ∉ K_γ(z̃)
for a sufficiently small γ.
Proof.
Preliminaries: Consider the compact notations
(63) H := ∇²_x f(z̃),  λ := λ_min(H) < 0,
(64) H₀ := ∇²_x f(z₀),  λ₀ := λ_min(H₀),
(65) F(z) := (−∇_x f(z), ∇_y f(z)).
Characterizing the extreme curvature: The choice of γ ensures that
(66) γ ≤ |λ| / (2(ρ_x + ρ_y))
holds. Since z₀ lies in a γ-neighbourhood of z̃, we can use the smoothness of f to relate the negative curvature at z̃ to the negative curvature at z₀:
(67) λ₀ ≤ λ + ‖H₀ − H‖
(68) ≤ λ + (ρ_x + ρ_y)γ.
Therefore,
(69) λ₀ ≤ λ/2 < 0.
Similarly, one can show that
(70) |λ₀| ≥ |λ|/2.
Combining these two bounds yields a lower bound on the norm of the extreme curvature step at z₀:
(71) ‖v_{z₀}‖ ≥ ‖v_x^(−)(z₀)‖ = |λ₀| / (2ρ_x)
(72) ≥ |λ| / (4ρ_x)
(73) > 0.
To simplify the above bound, we use the compact notation δ := |λ|/(4ρ_x):
(74) ‖v_{z₀}‖ ≥ δ.
Proof of escaping:
The squared norm of the update can be computed as
(75) ‖z₁ − z̃‖² = ‖z₀ + η(v_{z₀} + F(z₀)) − z̃‖²
(76) = ‖z₀ − z̃‖² + 2η⟨v_{z₀} + F(z₀), z₀ − z̃⟩ + η²‖v_{z₀} + F(z₀)‖²