Local Saddle Point Optimization: A Curvature Exploitation Approach
Gradient-based optimization methods are the most popular choice for finding local optima of classical minimization and saddle point problems. Here, we highlight a systemic issue of gradient dynamics that arises for saddle point problems, namely the presence of undesired stable stationary points that are not local optima. We propose a novel optimization approach that exploits curvature information in order to escape from these undesired stationary points. We prove that different optimization methods, including the gradient method and Adagrad, equipped with curvature exploitation can escape non-optimal stationary points. We also provide empirical results on common saddle point problems which confirm the advantage of using curvature exploitation.
Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Department of Computer Science, ETH Zürich.
We consider the problem of finding a structured saddle point of a smooth objective f, namely solving an optimization problem of the form min_x max_y f(x, y).
Here, we assume that f is smooth in x and y but not necessarily convex in x or concave in y. This particular problem arises in many applications, such as generative adversarial networks (GANs) [goodfellow14gan], robust optimization [ben2009robust], and game theory [singh2000nash, Leyton-Brown:2008:EGT:1481632]. Solving the saddle point problem in Eq. (1) is equivalent to finding a point (x*, y*) such that
holds for all x and y. For a general non-convex-concave function f, finding such a saddle point is computationally infeasible. Instead of finding a global saddle point for Eq. (1), we aim for a more modest goal: finding a locally optimal saddle point, i.e. a point for which the condition in Eq. (2) holds true in a local neighbourhood around (x*, y*).
There is a rich literature on saddle point optimization for the particular class of convex-concave functions, i.e. when f is convex in x and concave in y. Although this type of objective function is commonly encountered in applications such as constrained convex minimization, many saddle point problems of interest do not satisfy the convex-concave assumption. One popular example that recently emerged in machine learning is training generative adversarial networks (GANs). In this application, a representation of the data distribution is learned by playing a zero-sum game between two competing neural networks. This yields a saddle point optimization problem which, due to the complex functional representation of the two neural networks, does not fulfill the convexity-concavity condition.
First-order methods are commonly used to solve problem (1) as they have a cheap per-iteration cost and are therefore easily scalable. One particular method of choice is simultaneous gradient descent/ascent, which performs the following iterative updates,
where η_t is a chosen step size, which can, e.g., decrease with time or be a bounded constant (i.e. η_t = η). The convergence analysis of the above iterate sequence is typically tied to a strong/strict convexity-concavity property of the objective function defining the dynamics. Under such conditions, the gradient method is guaranteed to converge to a desired saddle point [arrow1958studies]. These conditions can also be relaxed to some extent, as will be further discussed in Section 2.
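These updates are straightforward to implement. Below is a minimal sketch, where the function names, the step size, and the convex-concave test objective f(x, y) = x² − y² + 2xy are our own illustrative choices:

```python
def simultaneous_gda(grad_x, grad_y, x0, y0, eta=0.05, steps=1000):
    """Simultaneous gradient descent/ascent: descend in x and ascend in y,
    evaluating both gradients at the same current iterate (Eq. (3))."""
    x, y = float(x0), float(y0)
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - eta * gx, y + eta * gy  # simultaneous, not alternating
    return x, y

# Convex-concave toy objective (our own choice): f(x, y) = x^2 - y^2 + 2xy,
# whose unique saddle point is the origin.
grad_x = lambda x, y: 2 * x + 2 * y
grad_y = lambda x, y: -2 * y + 2 * x
x, y = simultaneous_gda(grad_x, grad_y, 1.0, 1.0)
```

For this convex-concave objective the iterates spiral into the saddle point at the origin; the failure cases discussed below arise only outside the convex-concave regime.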
It is known that the gradient method is locally asymptotically stable [mescheder2017numerics]; but stability alone is not sufficient to guarantee convergence to a locally optimal saddle point. Through an example, we will later illustrate that the gradient method is indeed stable at some undesired stationary points, at which the structural min-max property (i.e., f being a local minimum in x and a local maximum in y) is not met. This is in clear contrast to minimization problems, where all stable stationary points of the gradient dynamics are local minima. The stability of these undesired stationary points is therefore an additional difficulty that one has to consider for escaping from such saddles. While a standard trick for escaping saddles in minimization problems consists of adding a small perturbation to the gradient, we will demonstrate that this does not guarantee avoiding undesired stationary points.
Throughout the paper, we will refer to a desired local saddle point as a point that is a local minimum in x and a local maximum in y. This characterization implies that the Hessian at such a point has no negative curvature direction in x (an eigenvector of ∇²_xx f with a negative associated eigenvalue) and no positive curvature direction in y (an eigenvector of ∇²_yy f with a positive associated eigenvalue). In that regard, curvature information can be used to certify whether the desired min-max structure is met.
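This certificate is cheap to state in code. A minimal sketch, where the helper name and the quadratic test problem are our own choices, checks the extreme eigenvalues of the two Hessian blocks:

```python
import numpy as np

def is_locally_optimal_saddle(hess_xx, hess_yy, tol=1e-8):
    """Second-order certificate: the Hessian block in x must be positive
    definite and the block in y negative definite."""
    return (np.linalg.eigvalsh(hess_xx).min() > tol and
            np.linalg.eigvalsh(hess_yy).max() < -tol)

# Quadratic example f(x, y) = x'Ax + x'By - y'Cy with A, C positive definite,
# so the Hessian blocks are 2A and -2C (matrices chosen for illustration).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
C = np.array([[3.0, 0.0], [0.0, 1.0]])
print(is_locally_optimal_saddle(2 * A, -2 * C))  # True
```

Here `eigvalsh` returns eigenvalues in ascending order, so the first and last entries are exactly the extreme curvatures needed for the check.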
In this work, we propose the first saddle point optimization method that exploits curvature information to guide the gradient trajectory towards the desired saddle points that respect the min-max structure. Since our approach only makes use of the eigenvectors corresponding to the maximum and minimum eigenvalues (rather than the whole eigenspectrum), we will refer to it as extreme curvature exploitation. We will prove that this type of curvature exploitation avoids convergence to undesired saddles. Our empirical results also confirm the advantage of curvature exploitation in saddle point optimization.
2 Related work
In the context of optimizing a Lagrangian, the pioneering works of [kose1956solutions, arrow1958studies] popularized the use of primal-dual dynamics to arrive at the saddle points of the objective. The work of [arrow1958studies] analyzed the stability of this method in continuous time, proving global stability results under strict convex-concave assumptions. This result was extended in [uzawa1958iterative] to a discrete-time version of the subgradient method with a constant step size rule, proving that the iterates converge to a neighborhood of a saddle point. Results for a decreasing step size were provided in [golshtein1974generalized, maistroskii1977gradient], while [nemirovskii1978cezare] analyzed an adaptive step size rule with averaged parameters. The work of [cherukuri2017saddle] showed that the conditions on the objective can be relaxed: asymptotic, pointwise convergence to the set of saddle points is guaranteed if either the convexity or the concavity property is strict. They also proved that the strictness assumption can be dropped under additional linearity assumptions or for strongly jointly quasiconvex-quasiconcave saddle functions.
However, for problems where the function considered is not strictly convex-concave, convergence to a saddle point is not guaranteed, and the gradient dynamics may instead lead to oscillatory solutions [holding2014convergence]. These oscillations can be addressed by averaging the iterates [nemirovskii1978cezare] or by using the extragradient method (a perturbed version of the gradient method) [korpelevich1976extragradient, gidel2018variational].
There are also instances of saddle point problems that do not satisfy the various conditions required for convergence. A notable example is generative adversarial networks (GANs), for which the work of [nagarajan2017gradient] proved local asymptotic stability under certain suitable conditions on the representational power of the two players (called discriminator and generator). Despite these recent advances, the convergence properties of GANs are still not well understood.
An explicit convergence rate for the subgradient method with a constant step size was proved in [nedic2009subgradient] for reaching an approximate saddle point, as opposed to asymptotically exact solutions. Assuming the function is convex-concave, they proved a sublinear rate of convergence. Rates of convergence have also been derived for the extragradient method [korpelevich1976extragradient] as well as for mirror descent [nemirovski2004prox].
In the context of GANs, [nowozin2016f] showed that the single-step gradient method converges to a saddle point within a neighborhood in which the function is strongly convex-concave. The work of [liang2018interaction] studied non-asymptotic convergence to a local Nash equilibrium: assuming local strong convexity-concavity, simultaneous gradient descent/ascent achieves an exponential rate of convergence near a stable local Nash equilibrium. They also extended this result to other discrete-time saddle point dynamics such as optimistic mirror descent and predictive methods.
Negative Curvature Exploitation
The presence of negative curvature in the objective function indicates the existence of a potential descent direction, which is commonly exploited in order to escape saddle points and reach a local minimizer. Among these approaches are trust-region methods that guarantee convergence to a second-order stationary point [conn2000trust, nesterov2006cubic, cartis2011adaptive]. While a naïve implementation of these methods would require computing and inverting the Hessian of the objective, this can be avoided by replacing Hessian computations with Hessian-vector products, which can be computed efficiently [Pearlmutter94fastexact]. This is applied, e.g., using matrix-free Lanczos iterations [curtis2017exploiting] or online variants such as Oja's algorithm [allen2017natasha]. Sub-sampling the Hessian can furthermore reduce the dependence on the dataset size by using various sampling schemes [kohler2017sub, xu2017newton]. Finally, [allen2017neon2, xu2017first] showed that first-order information can act as a noisy power method, allowing one to find a negative curvature direction.
In contrast to these classical results that "blindly" try to escape any type of saddle-point, our aim is to exploit curvature information to reach a specific type of stationary point that satisfies the min-max condition required at the optimum of the objective function.
For the sake of further analysis, we require the function f to be sufficiently smooth and its second-order derivatives with respect to the parameters x and y to be non-degenerate.
Assumption 1 (Smoothness).
We assume that f is a twice continuously differentiable function, and that its gradient and Hessian are Lipschitz with respect to the parameters x and y, i.e. we assume that the following inequalities hold:
Assumption 2 (Non-degenerate Hessian).
We assume that the matrices ∇²_xx f(z) and ∇²_yy f(z) are non-degenerate for all z = (x, y).
Locally optimal saddles
Let us define a γ-neighbourhood around the point z* = (x*, y*) as
with a sufficiently small γ > 0.
The point z* = (x*, y*) is a locally optimal saddle point of the problem in Eq. (1) if
holds for all points in this γ-neighbourhood.
Using the non-degeneracy assumption on the Hessian matrices, we can establish necessary and sufficient conditions for z* = (x*, y*) to be a locally optimal saddle point.
Suppose that f satisfies Assumption 2; then, z* = (x*, y*) is a locally optimal saddle point if and only if the gradient at z* is zero, i.e.
and the second derivative at z* is positive definite in x and negative definite in y (in the game theory literature, such a point is commonly referred to as a local Nash equilibrium, see e.g. [liang2018interaction]), i.e. there exist μ_x, μ_y > 0 such that
4 Undesired Stability
There are three different asymptotic scenarios for the gradient iterations of Eq. (3): (i) divergence (i.e. ∥z_t∥ → ∞), (ii) being trapped in a limit cycle, and (iii) convergence to a stationary point of the gradient updates (i.e. z_t → z*). To the best of our knowledge, there is no global convergence guarantee for general saddle point optimization. Typical convergence guarantees require convexity-concavity or somewhat relaxed conditions such as quasiconvexity-quasiconcavity of f [cherukuri2017saddle]. Instead of global convergence, we therefore consider the more achievable goal of local convergence. Hence, we focus on the third case, where the method converges to some stationary point z* at which ∇_x f(z*) = 0 and ∇_y f(z*) = 0 hold. The question of interest here is whether such a sequence always yields a locally optimal saddle point as defined in Def. 3.
A stationary point of the gradient iterations can be either stable or unstable. The notion of stability characterizes the behavior of the gradient iterations in a local region around the stationary point. In the neighbourhood of a stable stationary point, successive iterations of the method are not able to escape the region. Conversely, a stationary point is called unstable if it is not stable. A stationary point z* (at which the gradient updates vanish) is a locally stable point of the gradient iterations if the Jacobian of the update dynamics at z* has only eigenvalues with modulus smaller than one, i.e.
In the following, we will use the notion of stability to analyze the asymptotic behavior of the gradient method. We start with a lemma extending a known result for general minimization problems, namely that gradient descent with random initialization almost surely converges to a stable stationary point [lee2016gradient].
Undesired stable stationary point
If all stable stationary points of the gradient dynamics are locally optimal saddle points, then the result of Lemma 5 guarantees almost sure convergence to a solution of the saddle point problem in Eq. (1). It is already known that every locally optimal saddle point is a stable stationary point of the gradient dynamics [mescheder2017numerics, nagarajan2017gradient]. While for minimization problems, the set of stable stationary points is the same as the set of local minima, this might not be the case for the problem we consider here. Indeed, the gradient dynamics might introduce additional stable points that are not locally optimal saddle points. We illustrate this claim in the next example.
Consider the following two-dimensional saddle point problem (to guarantee smoothness, one can restrict the domain of f to a bounded set):
with x, y ∈ R. The critical points of the function, i.e. points for which ∇f(x, y) = 0, are
Evaluating the Hessians at the three critical points gives rise to the following three matrices:
We see that only one of the three critical points is a locally optimal saddle point, namely the one with ∇²_xx f ≻ 0 and ∇²_yy f ≺ 0, whereas the two other points are local minima in the parameter y rather than maxima. However, Figure 1(a) illustrates gradient steps converging to an undesired stationary point because it is a locally stable point of the dynamics. Hence, even small perturbations of the gradients in each step cannot avoid convergence to this point (see Figure 1(b)).
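This failure mode can be reproduced numerically. The sketch below uses our own illustrative objective with the same structure (not necessarily the paper's exact example function): the origin is a stationary point with positive curvature in y, i.e. a local minimum rather than a maximum in y, yet the simultaneous gradient iterates spiral into it:

```python
# Illustrative 2-d objective (our own choice):
#   f(x, y) = 2x^2 + y^2 + 4xy + (4/3)y^3 - (1/4)y^4
df_dx = lambda x, y: 4 * x + 4 * y
df_dy = lambda x, y: 2 * y + 4 * x + 4 * y**2 - y**3
d2f_dyy = lambda x, y: 2 + 8 * y - 3 * y**2  # curvature in y

x, y, eta = 0.02, 0.02, 0.05
for _ in range(4000):  # simultaneous gradient descent/ascent
    x, y = x - eta * df_dx(x, y), y + eta * df_dy(x, y)

# The iterates spiral into (0, 0), although f has positive curvature in y
# there, i.e. (0, 0) is a local *minimum* in y and hence an undesired point.
print(x, y, d2f_dyy(0.0, 0.0) > 0)
```

At the origin the Jacobian of the gradient dynamics has eigenvalues −1 ± i√7, so the linearized map is a contracting spiral: the point is stable for the gradient dynamics despite failing the min-max condition in y.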
5 Extreme Curvature Exploitation
The previous example has shown that the gradient iterations on the saddle point problem can introduce undesired stable points. In this section, we propose a strategy to escape from these stationary points. Our approach is based on exploiting curvature information, as in [curtis2017exploiting].
Extreme curvature direction
Let λ_x be the minimum eigenvalue of ∇²_xx f(z) with its associated unit eigenvector v_x, and λ_y be the maximum eigenvalue of ∇²_yy f(z) with its associated unit eigenvector v_y. Then, we define
where sgn(·) is the sign function. Using the above vectors, we define v(z) = (v_x⁻, v_y⁺) as the extreme curvature direction at z.
Using the extreme curvature direction, we modify the gradient steps as follows:
This new update step is constructed by adding the extreme curvature direction to the gradient method of Eq. (3). From now on, we will refer to this modified update as the Cesp (curvature exploitation for the saddle point problem) method.
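The update can be sketched as follows, assuming direct access to the Hessian blocks. The fixed coefficient ALPHA replaces the paper's eigenvalue-over-Lipschitz-constant scaling, and the helper names are ours. Near an undesired stationary point the plain gradient step vanishes, while the Cesp step length stays bounded away from zero:

```python
import numpy as np

ALPHA = 0.05  # stands in for the paper's lambda/(2*rho) scaling; using a
              # fixed constant here is our simplifying assumption

def extreme_curvature(hess_xx, hess_yy, gx, gy):
    """Extreme curvature direction: the most negative eigendirection in x and
    the most positive eigendirection in y, sign-aligned with the gradient so
    that the step decreases f in x and increases f in y."""
    wx, Vx = np.linalg.eigh(hess_xx)
    wy, Vy = np.linalg.eigh(hess_yy)
    lam_x, vx = wx[0], Vx[:, 0]    # smallest eigenpair of the x-block
    lam_y, vy = wy[-1], Vy[:, -1]  # largest eigenpair of the y-block
    dx = ALPHA * lam_x * np.sign(vx @ gx) * vx if lam_x < 0 else np.zeros_like(gx)
    dy = ALPHA * lam_y * np.sign(vy @ gy) * vy if lam_y > 0 else np.zeros_like(gy)
    return dx, dy

def cesp_update(x, y, gx, gy, hess_xx, hess_yy, eta=0.05):
    """One Cesp step: extreme curvature step plus gradient descent/ascent."""
    dx, dy = extreme_curvature(hess_xx, hess_yy, gx, gy)
    return x + dx - eta * gx, y + dy + eta * gy

# Near the undesired stationary point (0, 0) of the illustrative objective
# f(x, y) = 2x^2 + y^2 + 4xy + (4/3)y^3 - (1/4)y^4 (our own example):
x, y = np.array([1e-3]), np.array([1e-3])
gx, gy = 4 * x + 4 * y, 2 * y + 4 * x + 4 * y**2 - y**3
Hxx = np.array([[4.0]])                        # positive: no curvature push in x
Hyy = np.array([[2.0 + 8 * y[0] - 3 * y[0]**2]])  # positive: curvature push in y
x1, y1 = cesp_update(x, y, gx, gy, Hxx, Hyy)
plain_len = np.hypot(0.05 * gx[0], 0.05 * gy[0])  # gradient step: vanishes
cesp_len = np.hypot((x1 - x)[0], (y1 - y)[0])     # Cesp step: bounded below
```

The bounded-below step length is exactly the one-step escape behavior: the closer the iterate is to the undesired point, the smaller the gradient step, but the curvature step keeps a magnitude proportional to the offending eigenvalue.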
Extreme curvature exploitation has already been used for escaping from unstable stationary points (i.e. saddle points) of gradient descent for minimization problems [curtis2017exploiting]. In saddle point problems, curvature exploitation is advantageous not only for escaping unstable stationary points but also for escaping undesired stable stationary points of the gradient iterates. The upcoming two lemmas prove that the set of stable stationary points of the Cesp dynamics and the set of locally optimal saddle points are the same; therefore, the optimizer only converges to a solution of the local saddle point problem.
A point z* is a stationary point of the iterates in Eq. (17) if and only if it is a locally optimal saddle point.
We can conclude from the result of Lemma 6 that every stationary point of the Cesp dynamics is a locally optimal saddle point. The next lemma establishes the stability of these points.
Escaping from undesired saddles
Extreme curvature exploitation allows us to escape from undesired saddles. In the next lemma, we show that the optimization trajectory of Cesp stays away from all undesired stationary points of the gradient dynamics.
Suppose that z* is an undesired stationary point of the gradient dynamics, namely a stationary point (∇f(z*) = 0) at which λ_min(∇²_xx f(z*)) < 0 or λ_max(∇²_yy f(z*)) > 0.
Consider the iterates of Eq. (17) starting from z_0 in a δ-neighbourhood of z*. After one step the iterates escape the δ-neighbourhood of z*, i.e.
for a sufficiently small δ > 0.
The gradient update step of Eq. (3) has the property that an update with respect to x (y) decreases (increases) the function value. The next lemma proves that Cesp shares the same desirable property.
In each iteration of Eq. (17), f decreases in x with
and increases in y with
as long as the step size η is chosen as
Implementation with Hessian-vector products
Since storing and computing the Hessian in high dimensions is very costly, we need a way to extract the extreme curvature direction efficiently. The most common approach for obtaining the eigenvector corresponding to the eigenvalue of largest magnitude (and the eigenvalue itself) of a Hessian block H is to run power iterations on a shifted matrix of the form βI − H, as
where the iterate is normalized after every iteration and the shift β is chosen such that the shifted matrix is positive semi-definite. Since this method only requires implicit Hessian access through Hessian-vector products, it can be implemented as efficiently as gradient evaluations [Pearlmutter94fastexact]. The results of [doi:10.1137/0613066] bound the number of iterations required to approximate the extreme curvature direction to a given accuracy with high probability (cf. [lee2016gradient]).
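A sketch of this idea, assuming only Hessian-vector product access; the shifted-operator formulation (power iterations on lip·I − H) and all names are our own choices rather than the paper's exact implementation:

```python
import numpy as np

def min_curvature_via_power(hvp, dim, lip, iters=500, seed=0):
    """Estimate the smallest eigenvalue/eigenvector of a symmetric matrix H
    accessible only through Hessian-vector products, by running power
    iterations on the shifted operator M = lip*I - H (psd whenever
    lip >= ||H||); the top eigenvector of M is the bottom eigenvector of H."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = lip * v - hvp(v)  # M @ v using one Hessian-vector product
        v = w / np.linalg.norm(w)
    return v @ hvp(v), v      # Rayleigh quotient approximates lambda_min(H)

H = np.diag([3.0, 1.0, -2.0])  # toy "Hessian" used only for illustration
lam, v = min_curvature_via_power(lambda u: H @ u, dim=3, lip=3.0)
```

The same routine applied to −H (or with the opposite shift) recovers the most positive curvature direction needed for the y-block.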
6 Curvature Exploitation for linear-transformed gradient steps
Linear-transformed gradient optimization
Applying a linear transformation to the gradient updates is commonly used to accelerate optimization for various types of problems. The resulting updates can be written in the general form
where is a symmetric, block-diagonal -matrix. Different optimization methods use a different linear transformation . Table 1 in section B in the appendix illustrates the choice of for different optimizers. Adagrad Duchi:EECS-2010-24 (), one of the most popular non-convex optimization methods, belongs to this category.
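As a concrete instance, simultaneous Adagrad updates can be sketched as below; the diagonal preconditioner plays the role of the linear transformation, and the function names and the strongly convex-concave test objective are our own illustrative choices:

```python
import numpy as np

def adagrad_saddle(grad_x, grad_y, x0, y0, eta=0.5, eps=1e-8, steps=2000):
    """Simultaneous Adagrad updates: the linear transformation is the diagonal
    preconditioner diag(1 / sqrt(accumulated squared gradients))."""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    sx, sy = np.zeros_like(x), np.zeros_like(y)
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        sx += gx**2
        sy += gy**2
        x = x - eta * gx / (np.sqrt(sx) + eps)  # preconditioned descent in x
        y = y + eta * gy / (np.sqrt(sy) + eps)  # preconditioned ascent in y
    return x, y

# Strongly convex-concave toy (our own choice): f(x, y) = x^2 - y^2 + x*y
gx = lambda x, y: 2 * x + y
gy = lambda x, y: x - 2 * y
x, y = adagrad_saddle(gx, gy, [1.0], [1.0])  # converges towards the origin
```

Because the accumulated squared gradients are positive, the induced diagonal transformation is positive definite, which is exactly the property the linear-transformed Cesp variant below relies on.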
Extreme curvature exploitation
We can adapt Cesp to the linear-transformed update as follows:
where we choose the linear transformation matrix A to be positive definite. This variant of Cesp is also able to filter out the undesired stable stationary points of the gradient method for the saddle point problem. The following lemma proves that it has the same properties as the non-transformed version.
A direct implication of Lemma 10 is that we can also use curvature exploitation for Adagrad. Later, we will experimentally show the advantage of using curvature exploitation for this method.
7.1 Escaping from undesired stationary points of the toy example
Previously, we saw that for the two-dimensional saddle point problem on the function of Eq. (13), gradient iterates may converge to an undesired stationary point that is not locally optimal. As shown in Figure 2, Cesp solves this issue. In this example, simultaneous gradient iterates converge to the undesired stationary point for many different initialization parameters, whereas our method always converges to the locally optimal saddle point. A plot of the basins of attraction of the two optimizers on this example is presented in Figure 4 in the appendix.
7.2 Generative Adversarial Networks
This experiment evaluates the performance of the Cesp method for training a Generative Adversarial Network (GAN), which reduces to solving the saddle point problem
where the generator and discriminator functions are represented by neural networks parameterized with the variables x and y, respectively. We use the MNIST data set and a simple GAN architecture with one hidden layer of 100 units. More technical details are provided in Appendix C.2. Here, we investigate the advantage of curvature exploitation for Adagrad, which is a member of the class of linear-transformed gradient methods often used for saddle point problems. Moreover, we use power iterations, as described in Section 5, to efficiently approximate the extreme curvature vector. Note that since we are using mini-batches in this experiment, we do not have access to the exact gradient information and rely on a stochastic approximation here as well.
We evaluate the efficacy of the negative curvature step in terms of the spectrum of the Hessian blocks at a convergent solution z*. We compare Cesp to the vanilla Adagrad optimizer. Since we are interested in a solution that gives rise to a locally optimal saddle point, we track (an approximation of) the smallest eigenvalue of ∇²_xx f and the largest eigenvalue of ∇²_yy f through the optimization. Using these estimates, we can evaluate whether a method has converged to a locally optimal saddle point.
The results are shown in Figure 3. The decrease in the squared norm of the gradients indicates that both methods converge to a solution. Moreover, both fulfill the condition for a locally optimal saddle point in the parameter y, i.e. the maximum eigenvalue of ∇²_yy f is negative. However, the graph of the minimum eigenvalue of ∇²_xx f shows that Cesp converges faster, and with less frequent and severe spikes, to a solution where the minimum eigenvalue is zero. Hence, the negative curvature step seems able to drive the optimization procedure towards regions that yield points closer to a locally optimal saddle point.
More experimental results – in the application of training GANs and robust optimization – are provided in Appendix C.
We focused our study on reaching a solution to the local saddle point problem. First, we have shown that gradient methods have stable stationary points that are not locally optimal, which is a problem exclusively arising in saddle point optimization. Second, we proposed a novel approach that exploits extreme curvature information to avoid the undesired stationary points. We believe this work highlights the benefits of using curvature information for saddle point problems and might open the door to other novel algorithms with stronger global convergence guarantees.
- (1) Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
- (2) Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.
- (3) Kenneth Joseph Arrow, Leonid Hurwicz, Hirofumi Uzawa, and Hollis Burnley Chenery. Studies in linear and non-linear programming. 1958.
- (4) Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton University Press, 2009.
- (5) Michele Benzi, Gene H. Golub, and Jörg Liesen. Numerical solution of saddle point problems. Acta Numerica, 14:1–137, 2005.
- (6) Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
- (7) Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
- (8) Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust region methods. SIAM, 2000.
- (9) Frank E Curtis and Daniel P Robinson. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017.
- (10) Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2933–2941. Curran Associates, Inc., 2014.
- (11) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, EECS Department, University of California, Berkeley, Mar 2010.
- (12) Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. arXiv preprint arXiv:1802.10551, 2018.
- (13) EG Golshtein. Generalized gradient method for finding saddle points. Matekon, 10(3):36–52, 1974.
- (14) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
- (15) Thomas Holding and Ioannis Lestas. On the convergence to saddle points of concave-convex functions, the gradient method and emergence of oscillations. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 1143–1148. IEEE, 2014.
- (16) H.K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002.
- (17) Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, 2017.
- (18) GM Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
- (19) T Kose. Solutions of saddle value problems by differential equations. Econometrica, Journal of the Econometric Society, pages 59–70, 1956.
- (20) J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
- (21) Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.
- (22) Kevin Leyton-Brown and Yoav Shoham. Essentials of Game Theory: A Concise, Multidisciplinary Introduction. Morgan and Claypool Publishers, 1st edition, 2008.
- (23) Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
- (24) D Maistroskii. Gradient methods for finding saddle points. Matekon, 14(1):3–22, 1977.
- (25) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. arXiv preprint arXiv:1705.10461, 2017.
- (26) Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
- (27) Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2975–2984, 2017.
- (28) Angelia Nedić and Asuman Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
- (29) Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
- (30) AS Nemirovskii and DB Yudin. Cezare convergence of gradient method approximation of saddle points for convex-concave functions. Doklady Akademii Nauk SSSR, 239:1056–1059, 1978.
- (31) Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- (32) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
- (33) Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation, 6:147–160, 1994.
- (34) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
- (35) Satinder Singh, Michael Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 541–548. Morgan Kaufmann Publishers Inc., 2000.
- (36) Hirofumi Uzawa. Iterative methods for concave programming. Studies in linear and nonlinear programming, 6:154–165, 1958.
- (37) Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact hessian information. arXiv preprint arXiv:1708.07164, 2017.
- (38) Yi Xu and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
Appendix A Theoretical Analysis
A.1 Lemma 4
Suppose that f satisfies Assumption 2; then, z* = (x*, y*) is a locally optimal saddle point if and only if the gradient at z* is zero, i.e.
and the second derivative at z* is positive definite in x and negative definite in y, i.e., there exist μ_x, μ_y > 0 such that
From Definition 3 it follows that a locally optimal saddle point z* = (x*, y*) is a point for which the following two conditions hold:
Hence, x* is a local minimizer of f(·, y*) and y* is a local maximizer of f(x*, ·). We therefore, without loss of generality, prove the statement of the lemma only for the minimizer x*, namely that
The proof for the maximizer y* follows directly from this.
If we assume that ∇_x f(x*, y*) ≠ 0, then there exists a feasible direction along which the function value can be decreased, together with a sufficiently small step size realizing this decrease. Using the smoothness assumptions (Assumption 1), we arrive at the following inequality
Hence, it holds that:
By choosing the gradient descent direction with a sufficiently small step size, we can find an update such that
which contradicts that x* is a local minimizer. Hence, ∇_x f(x*, y*) = 0 is a necessary condition for a local minimizer.
To prove the second statement, we again make use of inequality (32) coming from the smoothness assumption, applied to an update along a fixed direction. From (i) we know that ∇_x f(x*, y*) = 0 and, therefore, we obtain:
If ∇²_xx f(x*, y*) is not positive semi-definite, then there exists at least one eigenvector v with negative curvature, i.e. v^T ∇²_xx f(x*, y*) v < 0. This implies that a sufficiently small step along the curvature vector v decreases the function value, i.e., f(x* + tv, y*) < f(x*, y*). This contradicts that x* is a local minimizer, which proves the second condition.
A.2 Lemma 5
The gradient mapping for the saddle point problem, i.e. the map defined by the update of Eq. (3) with a sufficiently small step size η, is a diffeomorphism.
The following proof closely follows the proof of Proposition 4.5 in [lee2016gradient].
A necessary condition for being a diffeomorphism is bijectivity. Hence, we need to check that the gradient mapping is (i) injective and (ii) surjective for a sufficiently small step size.
Consider two points z and z' for which
holds. Then, we have that
from which it follows that
For a sufficiently small step size this means z = z', and therefore the gradient mapping is injective.
We will show that the gradient mapping is surjective by constructing an explicit inverse function for both optimization problems individually. As suggested in [lee2016gradient], we make use of the proximal point algorithm on f for the parameters x and y individually.
For the parameter x, the proximal point mapping of f centered at a given point is defined by
Moreover, note that this proximal objective is strongly convex in x if the step size is sufficiently small:
Hence, the function has a unique minimizer, given by
which means that, for a sufficiently small step size, the gradient mapping in x has a unique pre-image.
The same line of reasoning can be applied to the parameter y with the negative proximal point mapping of f, i.e.
Similarly as before, we observe that this objective is strictly concave for a sufficiently small step size and that its unique optimizer yields the update step in y. This lets us conclude that the gradient mapping is surjective for a sufficiently small step size.
Observing that, under Assumption 2, the gradient mapping is continuously differentiable concludes the proof that it is a diffeomorphism. ∎
Lemma 5 (Random Initialization).
A.3 Lemma 6
The point z* is a stationary point of the iterates in Eq. (17) if and only if it is a locally optimal saddle point.
The point z* is a stationary point of the iterates if and only if the complete update direction vanishes at z*. Let us consider, w.l.o.g., only the stationarity condition with respect to x, i.e.
We prove that the above equation holds only if the curvature component v_x⁻ vanishes. This can be proven by a simple contradiction: suppose that v_x⁻ ≠ 0; then multiplying both sides of the above equation by v_x yields
Since the left-hand side is negative and the right-hand side is positive, the above equation leads to a contradiction. Therefore, v_x⁻ = 0 and ∇_x f(z*) = 0. The same argument applies to y, and therefore, according to Lemma 4, z* is a locally optimal saddle point. ∎
A.4 Lemma 7
The proof is based on a simple idea: in a neighbourhood of a locally optimal saddle point, f cannot have extreme curvature, i.e., the extreme curvature direction vanishes. Hence, within this neighbourhood the update of Eq. (17) reduces to the gradient update of Eq. (3), which is stable according to [26, 25].
To prove our claim that negative curvature does not exist in this neighbourhood, we make use of the smoothness assumption. Suppose that z lies in this neighbourhood; then the smoothness assumption (Assumption 1) implies
Similarly, one can show that
Therefore, the extreme curvature direction is zero according to the definition in Eq. (16). ∎
A.5 Lemma 8
Suppose that z* is an undesired stationary point of the gradient dynamics, namely a stationary point (∇f(z*) = 0) at which λ_min(∇²_xx f(z*)) < 0 or λ_max(∇²_yy f(z*)) > 0.
Consider the iterates of Eq. (17) starting from z_0 in a δ-neighbourhood of z*. After one step the iterates escape the δ-neighbourhood of z*, i.e.
for a sufficiently small δ > 0.
Preliminaries: Consider the compact notation
Characterizing extreme curvature: The choice of the sign in the extreme curvature direction ensures that
holds. Since z_0 lies in a δ-neighbourhood of z*, we can use the smoothness of f to relate the negative curvature at z_0 to the negative curvature at z*:
Similarly, one can show that
Combining these two bounds yields
To simplify the above bound, we use the compact notation:
Proof of escaping:
The squared norm of the update can be computed as