Local Saddle Point Optimization: A Curvature Exploitation Approach
Gradient-based optimization methods are the most popular choice for finding local optima for classical minimization and saddle point problems. Here, we highlight a systemic issue of gradient dynamics that arises for saddle point problems, namely the presence of undesired stable stationary points that are not local optima. We propose a novel optimization approach that exploits curvature information in order to escape from these undesired stationary points. We prove that different optimization methods, including the gradient method and Adagrad, equipped with curvature exploitation can escape non-optimal stationary points. We also provide empirical results on common saddle point problems which confirm the advantage of using curvature exploitation.
We consider the problem of finding a structured saddle point of a smooth objective, namely solving an optimization problem of the form
$$\min_{x} \max_{y} f(x, y). \tag{1}$$
Here, we assume that $f$ is smooth in $x$ and $y$ but not necessarily convex in $x$ or concave in $y$. This particular problem arises in many applications, such as generative adversarial networks (GAN), robust optimization, and game theory [35, 22]. Solving the saddle point problem in Eq. (1) is equivalent to finding a point $(x^*, y^*)$ such that
$$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*) \tag{2}$$
holds for all $x$ and $y$. For a non convex-concave function $f$, finding such a saddle point is computationally infeasible. Instead of finding a global saddle point for Eq. (1), we aim for a more modest goal: finding a locally optimal saddle point, i.e. a point $(x^*, y^*)$ for which the condition in Eq. (2) holds true in a local neighbourhood around $(x^*, y^*)$.
There is a rich literature on saddle point optimization for the particular class of convex-concave functions, i.e. when $f$ is convex in $x$ and concave in $y$. Although this type of objective function is commonly encountered in applications such as constrained convex minimization, many saddle point problems of interest do not satisfy the convex-concave assumption. One popular example that recently emerged in machine learning is training generative adversarial networks (GANs). In this application, a representation of the data distribution is learned by playing a zero-sum game between two competing neural networks. This yields a saddle point optimization problem which, due to the complex functional representation of the two neural networks, does not fulfill the convexity-concavity condition.
First-order methods are commonly used to solve problem (1) as they have a cheap per-iteration cost and are therefore easily scalable. One particular method of choice is simultaneous gradient descent/ascent, which performs the following iterative updates,
$$x_{t+1} = x_t - \eta_t \nabla_x f(x_t, y_t), \qquad y_{t+1} = y_t + \eta_t \nabla_y f(x_t, y_t), \tag{3}$$
where $\eta_t > 0$ is a chosen step size which can, e.g., decrease with time $t$ or be a bounded constant (i.e. $\eta_t = \eta$). The convergence analysis of the above iterate sequence is typically tied to a strong/strict convexity-concavity property of the objective function defining the dynamics. Under such conditions, the gradient method is guaranteed to converge to a desired saddle point. These conditions can also be relaxed to some extent, which will be further discussed in Section 2.
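As a rough illustration of the simultaneous descent/ascent update, the following minimal sketch runs it on a strongly convex-concave toy objective. The objective $f(x, y) = x^2 - y^2 + xy$, the constant step size, the starting point, and all function names are assumptions made for this sketch only.

```python
def grad_f(x, y):
    """Gradient of the illustrative objective f(x, y) = x**2 - y**2 + x*y."""
    return 2.0 * x + y, x - 2.0 * y


def simultaneous_gda(x, y, step_size=0.1, num_steps=200):
    """Simultaneous gradient descent in x and gradient ascent in y."""
    for _ in range(num_steps):
        # Both partial gradients are evaluated at the current iterate
        # before either variable is moved.
        gx, gy = grad_f(x, y)
        x, y = x - step_size * gx, y + step_size * gy
    return x, y


x_final, y_final = simultaneous_gda(1.0, -1.0)
print(x_final, y_final)  # both coordinates end up very close to the saddle at (0, 0)
```

For this particular objective the unique saddle point is the origin, so the iterates contract towards it; no such behaviour should be expected for a general non convex-concave objective.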
It is known that the gradient method is locally asymptotically stable; but stability alone is not sufficient to guarantee convergence to a locally optimal saddle point. Through an example, we will later illustrate that the gradient method is indeed stable at some undesired stationary points, at which the structural min-max property is not met.
Throughout the paper, we will refer to a desired local saddle point as a local minimum in $x$ and maximum in $y$. This characterization implies that the Hessian matrix at $(x^*, y^*)$ does not have a negative curvature direction in $x$ (which corresponds to an eigenvector of $\nabla_x^2 f$ with a negative associated eigenvalue) and a positive curvature direction in $y$ (which corresponds to an eigenvector of $\nabla_y^2 f$ with a positive associated eigenvalue). In that regard, curvature information can be used to certify whether the desired min-max structure is met.
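The curvature certificate described above can be sketched for the special case where both Hessian blocks are symmetric $2 \times 2$ matrices, so that their eigenvalues have a closed form. The helper names and the numerical Hessian blocks below are hypothetical and only serve as an illustration.

```python
import math


def sym2_eigenvalues(a, b, c):
    """Eigenvalues (ascending) of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


def has_min_max_structure(hess_xx, hess_yy):
    """True if hess_xx has no negative and hess_yy no positive eigenvalue."""
    lam_min_x, _ = sym2_eigenvalues(*hess_xx)
    _, lam_max_y = sym2_eigenvalues(*hess_yy)
    return lam_min_x >= 0.0 and lam_max_y <= 0.0


# Hypothetical Hessian blocks, each given as (a, b, c) for [[a, b], [b, c]].
print(has_min_max_structure((2.0, 1.0, 2.0), (-1.0, 0.5, -2.0)))  # True
print(has_min_max_structure((2.0, 1.0, 2.0), (1.0, 0.5, -2.0)))   # False
```

In higher dimensions only the two extreme eigenvalues are needed for this check, so a full eigendecomposition is not required.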
In this work, we propose the first saddle point optimization method that exploits curvature to guide the gradient trajectory towards the desired saddle points that respect the min-max structure. Since our approach only makes use of the eigenvectors corresponding to the maximum and minimum eigenvalue (rather than the whole eigenspace), we will refer to it as extreme curvature exploitation. We will prove that this type of curvature exploitation avoids convergence to undesired saddles. Our empirical results also confirm the advantage of curvature exploitation in saddle point optimization.
2 Related work
Asymptotical Convergence In the context of optimizing a Lagrangian, the pioneering works of [19, 3] popularized the use of the primal-dual dynamics to arrive at the saddle points of the objective. The work of  analyzed the stability of this method in continuous time proving global stability results under strict convex-concave assumptions. This result was extended in  for a discrete-time version of the subgradient method with a constant step size rule, proving that the iterates converge to a neighborhood of a saddle point. Results for a decreasing step size were provided in [13, 24] while  analyzed an adaptive step size rule with averaged parameters. The work of  has shown that the conditions of the objective can be relaxed, proving asymptotic stability to the set of saddle points is guaranteed if either the convexity or concavity properties are strict, and convergence is pointwise. They also proved that the strictness assumption can be dropped under other linearity assumptions or assuming strongly joint quasiconvex-quasiconcave saddle functions.
However, for problems where the function considered is not strictly convex-concave, convergence to a saddle point is not guaranteed, with the gradient dynamics leading instead to oscillatory solutions . These oscillations can be addressed by averaging the iterates  or using the extragradient method (a perturbed version of the gradient method) [18, 12].
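The contrast between the plain gradient method and the extragradient method can be seen on the bilinear objective $f(x, y) = xy$, which is convex-concave but not strictly so. The sketch below is a minimal illustration under that assumed objective; the step size, iteration count, and function names are choices made here.

```python
import math


def grad_f(x, y):
    """Gradient of the bilinear objective f(x, y) = x * y."""
    return y, x


def gda_step(x, y, eta):
    gx, gy = grad_f(x, y)
    return x - eta * gx, y + eta * gy


def extragradient_step(x, y, eta):
    # Look-ahead point obtained with a plain gradient step.
    x_half, y_half = gda_step(x, y, eta)
    # The actual update uses the gradient evaluated at the look-ahead point.
    gx, gy = grad_f(x_half, y_half)
    return x - eta * gx, y + eta * gy


def run(step, x, y, eta=0.1, num_steps=2000):
    """Apply `step` repeatedly and return the distance to the saddle (0, 0)."""
    for _ in range(num_steps):
        x, y = step(x, y, eta)
    return math.hypot(x, y)


norm_gda = run(gda_step, 1.0, 1.0)
norm_eg = run(extragradient_step, 1.0, 1.0)
print(norm_gda)  # grows: the plain iterates spiral away from the saddle
print(norm_eg)   # shrinks: the extragradient iterates spiral inwards
```

On this objective each plain step multiplies the distance to the origin by $\sqrt{1 + \eta^2} > 1$, whereas each extragradient step multiplies it by $\sqrt{(1 - \eta^2)^2 + \eta^2} < 1$ for $0 < \eta < 1$.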
There are also instances of saddle point problems that do not satisfy the various conditions required for convergence. A notable example is the training of generative adversarial networks (GANs), for which the work of  proved local asymptotic stability under certain suitable conditions on the representational power of the two players (called discriminator and generator). Despite these recent advances, the convergence properties of GANs are still not well understood.
Non-asymptotical Convergence An explicit convergence rate for the subgradient method with a constant stepsize was proved in  for reaching an approximate saddle point, as opposed to asymptotically exact solutions. Assuming the function is convex-concave, they proved a sub-linear rate of convergence. Rates of convergence have also been derived for the extragradient method  as well as for mirror descent .
In the context of GANs,  showed that a single-step gradient method converges to a saddle point in a neighborhood around the saddle point in which the function is strongly convex-concave. The work of  studied the theory of non-asymptotic convergence to a local Nash equilibrium. They prove that assuming local strong convexity-concavity, simultaneous gradient descent achieves an exponential rate of convergence near a stable local Nash equilibrium. They also extended this result to other discrete-time saddle point dynamics such as optimistic mirror descent or predictive methods.
Negative Curvature Exploitation The presence of negative curvature in the objective function indicates the existence of a potential descent direction, which is commonly exploited in order to escape saddle points and reach a local minimizer. Among these approaches are trust-region methods that guarantee convergence to a second-order stationary point [8, 31, 6]. While a naïve implementation of these methods would require the computation and inversion of the Hessian of the objective, this can be avoided by replacing the computation of the Hessian by Hessian-vector products that can be computed efficiently in . This is applied e.g. using matrix-free Lanczos iterations  or online variants such as Oja’s algorithm . Sub-sampling the Hessian can furthermore reduce the dependence on by using various sampling schemes [17, 37]. Finally, [2, 38] showed that first-order information can act as a noisy Power method allowing to find a negative curvature direction.
In contrast to these classical results that “blindly” try to escape any type of saddle-point, our aim is to exploit curvature information to reach a specific type of stationary point that satisfies the min-max condition required at the optimum of the objective function.
Assumptions For the sake of further analysis, we require the function $f$ to be sufficiently smooth, and its second order derivatives with respect to the parameters $x$ and $y$ to be non-degenerate.
Assumption 1 (Smoothness).
We assume that $f$ is a $C^2$ function, and that its gradient and Hessian are Lipschitz with respect to the parameters $x$ and $y$, i.e. we assume that the following inequalities hold:
Assumption 2 (Non-degenerate Hessian).
We assume that the matrices $\nabla_x^2 f(x, y)$ and $\nabla_y^2 f(x, y)$ are non-degenerate for all $(x, y)$.
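Locally optimal saddles Let us define a $\gamma$-neighbourhood around the point $(x^*, y^*)$ as
$$\mathcal{K}_\gamma^* = \{(x, y) \mid \|x - x^*\| \le \gamma, \; \|y - y^*\| \le \gamma\}$$
with a sufficiently small $\gamma > 0$.
Definition 3.
The point $(x^*, y^*)$ is a locally optimal saddle point of the problem in Eq. (1) if
$$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$$
holds for all $(x, y) \in \mathcal{K}_\gamma^*$.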
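With the use of the non-degeneracy assumption on the Hessian matrices, we are able to establish sufficient conditions on $(x^*, y^*)$ to be a locally optimal saddle point.
Lemma 4.
Suppose that $f$ satisfies Assumption 2; then, $z^* := (x^*, y^*)$ is a locally optimal saddle point on the neighbourhood $\mathcal{K}_\gamma^*$ defined above if and only if the gradient with respect to the parameters is zero, i.e.
$$\nabla f(x^*, y^*) = 0,$$
and the second derivative at $(x^*, y^*)$ is positive definite in $x$ and negative definite in $y$, i.e.
$$\nabla_x^2 f(x^*, y^*) \succ 0, \qquad \nabla_y^2 f(x^*, y^*) \prec 0.$$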
4 Undesired Stability
Asymptotic scenarios There are three different asymptotic scenarios for the gradient iterations in Eq. (3): (i) divergence (i.e. $\|(x_t, y_t)\| \to \infty$ as $t \to \infty$), (ii) being trapped in a loop (i.e. $\lim_{t \to \infty} \|\nabla f(x_t, y_t)\| > 0$), and (iii) convergence to a stationary point of the gradient updates (i.e. $\lim_{t \to \infty} \|\nabla f(x_t, y_t)\| = 0$). To the best of our knowledge, there is no global convergence guarantee for general saddle point optimization. Typical convergence guarantees require convexity-concavity or somewhat relaxed conditions such as quasiconvexity-quasiconcavity of $f$. Instead of global convergence, we therefore consider the more achievable goal of local convergence. Hence, we focus on the third outlined case where we are sure that the method converges to some stationary point $(x^*, y^*)$ for which $\nabla f(x^*, y^*) = 0$ holds. The question of interest here is whether such a sequence always yields a locally optimal saddle point as defined in Def. 3.
Local stability A stationary point of the gradient iterations can be either stable or unstable. The notion of stability characterizes the behavior of the gradient iterations in a local region around the stationary point. In the neighbourhood of a stable stationary point, successive iterations of the method are not able to escape the region. Conversely, a stationary point is called unstable if it is not stable. The stationary point $z^* = (x^*, y^*)$ (for which $\nabla f(z^*) = 0$ holds) is a locally stable point of the gradient iterations, if the Jacobian of its dynamics has only eigenvalues $\lambda_i$ within the unit sphere, i.e.
$$\left| \lambda_i \left( I + \eta \begin{pmatrix} -\nabla_x^2 f(z^*) & -\nabla_{x,y} f(z^*) \\ \nabla_{y,x} f(z^*) & \nabla_y^2 f(z^*) \end{pmatrix} \right) \right| \le 1 \quad \text{for all } i.$$
Random initialization In the following, we will use the notion of stability to analyze the asymptotic behavior of the gradient method. We start with a lemma extending known results for general minimization problems that prove that gradient descent with random initialization almost surely converges to a stable stationary point .
Undesired stable stationary point If all stable stationary points of the gradient dynamics are locally optimal saddle points, then the result of Lemma 5 guarantees almost sure convergence to a solution of the saddle point problem in Eq. (1). It is already known that every locally optimal saddle point is a stable stationary point of the gradient dynamics [25, 26]. While for minimization problems, the set of stable stationary points is the same as the set of local minima, this might not be the case for the problem we consider here. Indeed, the gradient dynamics might introduce additional stable points that are not locally optimal saddle points. We illustrate this claim in the next example.
Example Consider the following two-dimensional saddle point problem
with $(x, y) \in \mathbb{R}^2$. The critical points of the function, i.e. points for which $\nabla f(x, y) = 0$, are
Evaluating the Hessians at the three critical points gives rise to the following three matrices:
We see that only one of these critical points is a locally optimal saddle point, i.e. has positive curvature in $x$ and negative curvature in $y$, whereas the two other points are both a local minimum in the parameter $y$, rather than a maximum. However, Figure 3 illustrates gradient steps converging to one of the undesired stationary points because it is a locally stable point of the dynamics. Hence, even small perturbations of the gradients in each step cannot avoid convergence to this point.
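The behaviour described in this example can be reproduced numerically. The sketch below assumes the objective $f(x, y) = 2x^2 + y^2 + 4xy + \tfrac{4}{3} y^3 - \tfrac{1}{4} y^4$, chosen here because it has three critical points, $(0, 0)$ and $(-2 \mp \sqrt{2},\, 2 \pm \sqrt{2})$, of which only $(-2 - \sqrt{2},\, 2 + \sqrt{2})$ is a maximum in $y$. The step size, the starting point, and all function names are likewise illustrative assumptions.

```python
import cmath


def grad_f(x, y):
    """Gradient of the assumed objective
    f(x, y) = 2x^2 + y^2 + 4xy + (4/3) y^3 - (1/4) y^4."""
    return 4.0 * x + 4.0 * y, 2.0 * y + 4.0 * x + 4.0 * y**2 - y**3


def curvature_yy(y):
    """Second derivative of the assumed f with respect to y."""
    return 2.0 + 8.0 * y - 3.0 * y**2


def spectral_radius_at(y, eta):
    """Largest eigenvalue modulus of the Jacobian of the gradient update map."""
    a, b = 1.0 - 4.0 * eta, -4.0 * eta  # f_xx = f_xy = 4 everywhere
    c, d = 4.0 * eta, 1.0 + eta * curvature_yy(y)
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))


def run_gda(x, y, eta=0.1, num_steps=500):
    """Simultaneous gradient descent in x and ascent in y."""
    for _ in range(num_steps):
        gx, gy = grad_f(x, y)
        x, y = x - eta * gx, y + eta * gy
    return x, y


x_end, y_end = run_gda(0.1, 0.1)
print(x_end, y_end)                  # both very close to 0
print(curvature_yy(y_end))           # positive: a local minimum in y, not a maximum
print(spectral_radius_at(0.0, 0.1))  # below 1: the origin is stable for this step size
```

Under these assumptions the iterates spiral into the origin, although the origin violates the min-max structure because the curvature in $y$ is positive there.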
5 Extreme Curvature Exploitation
The previous example has shown that gradient iterations on the saddle point problem introduce undesired stable points. In this section, we propose a strategy to escape from these stationary points. Our approach is based on exploiting curvature information as in .
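Extreme curvature direction Let $\lambda_x$ be the minimum eigenvalue of $\nabla_x^2 f(z)$ with its associated eigenvector $v_x$, and $\lambda_y$ be the maximum eigenvalue of $\nabla_y^2 f(z)$ with its associated eigenvector $v_y$. Then, we define
$$v_z^{(-)} = \mathbb{1}_{\{\lambda_x < 0\}} \frac{\lambda_x}{2\rho_x} \operatorname{sgn}\!\left(v_x^\top \nabla_x f(z)\right) v_x, \qquad v_z^{(+)} = \mathbb{1}_{\{\lambda_y > 0\}} \frac{\lambda_y}{2\rho_y} \operatorname{sgn}\!\left(v_y^\top \nabla_y f(z)\right) v_y, \tag{16}$$
where $\operatorname{sgn}$ is the sign function and $\rho_x$, $\rho_y$ are the Lipschitz constants of the Hessian from Assumption 1. Using the above vectors, we define $v_z := (v_z^{(-)}, v_z^{(+)})$ as the extreme curvature direction at $z$.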
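Algorithm Using the extreme curvature direction, we modify the gradient steps as follows:
$$x_{t+1} = x_t + v_{z_t}^{(-)} - \eta \nabla_x f(x_t, y_t), \qquad y_{t+1} = y_t + v_{z_t}^{(+)} + \eta \nabla_y f(x_t, y_t), \tag{17}$$
where $z_t := (x_t, y_t)$ and $v_{z_t}^{(-)}$, $v_{z_t}^{(+)}$ denote the $x$- and $y$-components of the extreme curvature direction at $z_t$.
This new update step is constructed by adding the extreme curvature direction to the gradient method of Eq. (3). From now on, we will refer to this modified update as the Cesp (curvature exploitation for the saddle point problem) method.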
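To make the modified update concrete, the following sketch performs a single step for scalar $x$ and $y$, where each Hessian block has a single eigenvalue and the unit eigenvector is simply 1. Several ingredients are assumptions of this sketch rather than statements of the method: the curvature step is taken to be $\frac{\lambda}{2\rho}\operatorname{sgn}(v^\top \nabla f)\, v$ when the relevant eigenvalue has the undesired sign and zero otherwise, the convention $\operatorname{sgn}(0) = 1$ is used, the constant $\rho = 20$ is hypothetical, and the objective is the illustrative $f(x, y) = 2x^2 + y^2 + 4xy + \tfrac{4}{3} y^3 - \tfrac{1}{4} y^4$.

```python
def sign(value):
    # Assumed convention for this sketch: sign(0) = +1, so that the curvature
    # step does not vanish when the gradient is exactly zero.
    return 1.0 if value >= 0.0 else -1.0


def grad_f(x, y):
    """Gradient of the assumed objective (see the lead-in)."""
    return 4.0 * x + 4.0 * y, 2.0 * y + 4.0 * x + 4.0 * y**2 - y**3


def curvature(x, y):
    """Return (f_xx, f_yy); for scalar blocks these are the only eigenvalues."""
    return 4.0, 2.0 + 8.0 * y - 3.0 * y**2


def extreme_curvature_direction(x, y, rho):
    gx, gy = grad_f(x, y)
    lam_x, lam_y = curvature(x, y)
    # Negative curvature in x is exploited for descent, positive curvature
    # in y for ascent; otherwise the corresponding component is zero.
    v_minus = lam_x / (2.0 * rho) * sign(gx) if lam_x < 0.0 else 0.0
    v_plus = lam_y / (2.0 * rho) * sign(gy) if lam_y > 0.0 else 0.0
    return v_minus, v_plus


def gradient_step(x, y, eta):
    gx, gy = grad_f(x, y)
    return x - eta * gx, y + eta * gy


def cesp_step(x, y, eta, rho):
    v_minus, v_plus = extreme_curvature_direction(x, y, rho)
    gx, gy = grad_f(x, y)
    return x + v_minus - eta * gx, y + v_plus + eta * gy


print(gradient_step(0.0, 0.0, eta=0.1))        # (0.0, 0.0): the gradient step is stuck
print(cesp_step(0.0, 0.0, eta=0.1, rho=20.0))  # (0.0, 0.05): the curvature step moves y
```

At the origin the gradient vanishes while the curvature in $y$ is positive, so the plain gradient step stays put and the curvature term alone moves the iterate in the ascent direction.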
Stability Extreme curvature exploitation has already been used for escaping from unstable stationary points (i.e. saddle points) of gradient descent for minimization problems . In saddle point problems, curvature exploitation is advantageous not only for escaping unstable stationary points but also for escaping undesired stable stationary points of the gradient iterates. The upcoming two lemmas prove that the set of stable stationary points of the Cesp dynamics and the set of locally optimal saddle points are the same – therefore, the optimizer only converges to a solution of the local saddle point problem.
Lemma 6. A point $z := (x, y)$ is a stationary point of the iterates in Eq. (17) if and only if $z$ is a locally optimal saddle point.
We can conclude from the result of Lemma 6 that every stationary point of the Cesp dynamics is a locally optimal saddle point. The next Lemma establishes the stability of these points.
Escaping from undesired saddles Extreme curvature exploitation allows us to escape from undesired saddles. In the next lemma, we show that the optimization trajectory of Cesp stays away from all undesired stationary points of the gradient dynamics.
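Lemma 8. Suppose that $z^* := (x^*, y^*)$ is an undesired stationary point of the gradient dynamics, namely
$$\nabla f(z^*) = 0, \qquad \|v_{z^*}\| > 0,$$
where $v_{z^*}$ is the extreme curvature direction at $z^*$.
Consider iterates of Eq. (17) starting from $z_0 = (x_0, y_0)$ in a $\gamma$-neighbourhood of $z^*$. After one step the iterates escape the $\gamma$-neighbourhood of $z^*$, i.e.
$$\|z_1 - z^*\| \ge \gamma$$
for a sufficiently small $\gamma$.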
Guaranteed decrease/increase The gradient update step of Eq. (3) has the property that an update with respect to $x$ ($y$) decreases (increases) the function value. The next lemma proves that Cesp shares the same desirable property.
In each iteration of Eq. (17), $f$ decreases in $x$ with
and increases in $y$ with
as long as the step size is chosen as
Implementation with Hessian-vector products Since storing and computing the Hessian in high dimensions is very costly, we need to find a way to efficiently extract the extreme curvature direction. The most common approach for obtaining the eigenvector corresponding to the largest absolute eigenvalue (and the eigenvalue itself) of $\nabla_x^2 f(z)$ is to run power iterations on $I - \beta \nabla_x^2 f(z)$ as
$$v_{t+1} = \left(I - \beta \nabla_x^2 f(z)\right) v_t,$$
where $v_{t+1}$ is normalized after every iteration and $\beta > 0$ is chosen such that $I - \beta \nabla_x^2 f(z) \succeq 0$. Since this method only requires implicit Hessian computation through a Hessian-vector product, it can be implemented as efficiently as gradient evaluations . The results of  provide a bound on the number of required iterations to extract the extreme curvature: for the case , iterations suffice to find a vector such that with probability (cf. ).
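A minimal sketch of matrix-free power iterations is given below. It assumes that the iteration is run on the shifted operator $I - \beta H$, so that the dominant eigenvector corresponds to the smallest eigenvalue of the Hessian $H$, and it approximates Hessian-vector products by central finite differences of the gradient, which is a stand-in chosen here for simplicity. The quadratic test function, the value of $\beta$, and the function names are illustrative assumptions.

```python
import math


def grad(x):
    """Gradient of the illustrative f(x) = 0.5 * x^T A x with A = [[1, 2], [2, 1]]."""
    return [x[0] + 2.0 * x[1], 2.0 * x[0] + x[1]]


def hvp(grad_fn, x, v, eps=1e-4):
    """Approximate the Hessian-vector product H(x) v without forming H."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus, g_minus = grad_fn(x_plus), grad_fn(x_minus)
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]


def min_curvature(grad_fn, x, beta, num_iters=50):
    """Power iteration on I - beta * H; returns (smallest eigenvalue, eigenvector)."""
    v = [1.0] + [0.0] * (len(x) - 1)  # fixed start vector keeps the sketch deterministic
    for _ in range(num_iters):
        hv = hvp(grad_fn, x, v)
        w = [vi - beta * hvi for vi, hvi in zip(v, hv)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    hv = hvp(grad_fn, x, v)
    rayleigh = sum(vi * hvi for vi, hvi in zip(v, hv))
    return rayleigh, v


lam, vec = min_curvature(grad, [0.3, -0.2], beta=0.3)
print(lam)  # close to -1, the smallest eigenvalue of A
print(vec)  # close to (1, -1) / sqrt(2)
```

Here $A$ has eigenvalues $3$ and $-1$, so $\beta = 0.3$ keeps $I - \beta A$ positive semi-definite. In practice the start vector would typically be random, and an automatic-differentiation framework could supply the Hessian-vector product.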
6 Curvature Exploitation for linear-transformed gradient steps
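Linear-transformed gradient optimization Applying a linear transformation to the gradient updates is commonly used to accelerate optimization for various types of problems. The resulting updates can be written in the general form
$$\begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix} = \begin{pmatrix} x_t \\ y_t \end{pmatrix} + \eta A_{x_t, y_t} \begin{pmatrix} -\nabla_x f(x_t, y_t) \\ \nabla_y f(x_t, y_t) \end{pmatrix},$$
where $A_{x_t, y_t}$ is a symmetric, block-diagonal $(k + d) \times (k + d)$-matrix, with $k$ and $d$ denoting the dimensions of $x$ and $y$. Different optimization methods use a different linear transformation $A_{x_t, y_t}$. Table 1 in section B in the appendix illustrates the choice of $A_{x_t, y_t}$ for different optimizers. Adagrad, one of the most popular non-convex optimization methods, belongs to this category.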
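As an example of such a transformation, the sketch below applies a diagonal Adagrad-style scaling to the simultaneous descent/ascent step for scalar $x$ and $y$. The objective $f(x, y) = x^2 - y^2 + xy$, the placement of the small constant `eps` outside the square root, and all names are assumptions of this sketch.

```python
import math


def grad_f(x, y):
    """Gradient of the illustrative objective f(x, y) = x**2 - y**2 + x*y."""
    return 2.0 * x + y, x - 2.0 * y


def adagrad_saddle_step(x, y, accum, eta=0.1, eps=1e-8):
    """One simultaneous descent/ascent step with a diagonal Adagrad scaling.

    `accum` holds the running sums of squared gradients for x and y.
    """
    gx, gy = grad_f(x, y)
    accum = (accum[0] + gx * gx, accum[1] + gy * gy)
    # Diagonal entries of the transformation matrix; both are positive.
    scale_x = 1.0 / (math.sqrt(accum[0]) + eps)
    scale_y = 1.0 / (math.sqrt(accum[1]) + eps)
    return x - eta * scale_x * gx, y + eta * scale_y * gy, accum


x1, y1, accum = adagrad_saddle_step(1.0, -1.0, (0.0, 0.0))
print(x1, y1)  # approximately 0.9 and -0.9
print(accum)   # (1.0, 9.0)
```

In the first step the scaling normalizes each gradient component to roughly unit magnitude, so both coordinates move by about `eta` regardless of the raw gradient size.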
Extreme curvature exploitation We can adapt Cesp to the linear-transformed variant:
where we choose the linear transformation matrix to be positive definite. This variant of Cesp is also able to filter out the undesired stable stationary points of the gradient method for the saddle point problem. The following lemma proves that it has the same properties as the non-transformed version.
A direct implication of Lemma 10 is that we can also use curvature exploitation for Adagrad. Later, we will experimentally show the advantage of using curvature exploitation for this method.
7.1 Escaping from undesired stationary points of the toy example
Previously, we saw that for the two dimensional saddle point problem on the function of Eq. (13), gradient iterates may converge to an undesired stationary point that is not locally optimal. As shown in Figure 6, Cesp solves this issue. In this example, simultaneous gradient iterates converge to the undesired stationary point for many different initialization parameters, whereas our method always converges to the locally optimal saddle point. A plot of the basin of attraction of the two different optimizers on this example is presented in Figure 10 in the appendix.
7.2 Generative Adversarial Networks
This experiment evaluates the performance of the Cesp method for training a Generative Adversarial Network (GAN), which reduces to solving the saddle point problem
where the discriminator $D$ and the generator $G$ are represented by neural networks, each with its own set of parameters. We use the MNIST data set and a simple GAN architecture with 1 hidden layer and 100 units. More technical details are provided in Appendix C.2. Here, we investigate the advantage of curvature exploitation for the Adagrad method, which is a member of the class of linear-transformed gradient methods often used for saddle point problems. Moreover, we make use of power iterations as described in Section 5 to efficiently approximate the extreme curvature vector. Note that since we are using mini-batches in this experiment, we do not have access to the exact gradient information and also rely on an approximation here.
We evaluate the efficacy of the negative curvature step in terms of the spectrum of $f$ at a convergent solution. We compare Cesp to the vanilla Adagrad optimizer. Since we are interested in a solution that gives rise to a locally optimal saddle point, we track (an approximation of) the smallest eigenvalue of the Hessian with respect to the minimization parameters and the largest eigenvalue of the Hessian with respect to the maximization parameters through the optimization. Using these estimates, we can evaluate if a method has converged to a locally optimal saddle point.
The results are shown in Figure 7. The decrease in terms of the squared norm of the gradients indicates that both methods converge to a solution. Moreover, both fulfill the condition for a locally optimal saddle point for the maximization parameters, i.e. the maximum eigenvalue of the corresponding Hessian block is negative. However, the graph of the minimum eigenvalue of the Hessian with respect to the minimization parameters shows that Cesp converges faster, and with less frequent and severe spikes, to a solution where the minimum eigenvalue is zero. Hence, the negative curvature step seems to be able to drive the optimization procedure to regions that yield points closer to a locally optimal saddle point.
More experimental results – in the application of training GANs and robust optimization – are provided in Appendix C.
We focused our study on reaching a solution to the local saddle point problem. First, we have shown that gradient methods have stable stationary points that are not locally optimal, which is a problem exclusively arising in saddle point optimization. Second, we proposed a novel approach that exploits extreme curvature information to avoid the undesired stationary points. We believe this work highlights the benefits of using curvature information for saddle point problems and might open the door to other novel algorithms with stronger global convergence guarantees.
Appendix A Theoretical Analysis
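A.1 Lemma 4
Suppose that $f$ satisfies Assumption 2; then, $z^* := (x^*, y^*)$ is a locally optimal saddle point on $\mathcal{K}_\gamma^*$ if and only if the gradient is zero, i.e.
$$\nabla f(x^*, y^*) = 0,$$
and the second derivative at $(x^*, y^*)$ is positive definite in $x$ and negative definite in $y$, i.e., there exist $\mu_x, \mu_y > 0$ such that
$$\nabla_x^2 f(x^*, y^*) \succeq \mu_x I, \qquad \nabla_y^2 f(x^*, y^*) \preceq -\mu_y I.$$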
From definition 3 follows that a locally optimal saddle point is a point for which the following two conditions hold:
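Hence, $x^*$ is a local minimizer of $f(\cdot, y^*)$ and $y^*$ is a local maximizer of $f(x^*, \cdot)$. We therefore, without loss of generality, prove the statement of the lemma only for the minimizer $x^*$, namely that
$$\text{(i) } \nabla_x f(x^*, y^*) = 0, \qquad \text{(ii) } \nabla_x^2 f(x^*, y^*) \succ 0.$$
The proof for the maximizer $y^*$ directly follows from this.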
If we assume that , then there exists a feasible direction such that , and we can find a step size for s.t. with . Using the smoothness assumptions (Assumption 1), we arrive at the following inequality
Hence, it holds that:
By choosing the gradient descent direction (with s.t. ), we can find a step size such that ,
which contradicts that is a local minimizer. Hence, is a necessary condition for a local minimizer.
To prove the second statement, we again make use of inequality (32) coming from the smoothness assumption and the update s.t. with . From (i) we know that and, therefore, we obtain:
If is not positive semi-definite, then there exists at least one eigenvector with negative curvature, i.e. . This implies that for following the curvature vector decreases the function value, i.e., . This contradicts that is a local minimizer which proves the sufficient condition
A.2 Lemma 5
The gradient mapping for the saddle point problem
with step size is a diffeomorphism.
The following proof closely follows the proof of Proposition 4.5 from .
A necessary condition for a diffeomorphism is bijectivity. Hence, we need to check that is (i) injective, and (ii) surjective for .
Consider two points for which
holds. Then, we have that
from which follows that
For this means , and therefore is injective.
We will show that is surjective by constructing an explicit inverse function for both optimization problems individually. As suggested by , we make use of the proximal point algorithm on the function for the parameters , individually.
For the parameter the proximal point mapping of centered at is given by
Moreover, note that is strongly convex in if :
Hence, the function has a unique minimizer, given by
which means that there is a unique mapping from to under the gradient mapping if .
The same line of reasoning can be applied to the parameter with the negative proximal point mapping of centered at , i.e.
Similarly as before, we can observe that is strictly concave for and that the unique minimizer of yields the update step of . This lets us conclude that the mapping is surjective for if
Observing that under assumption 2, is continuously differentiable concludes the proof that is a diffeomorphism. ∎
Lemma 5 (Random Initialization).
A.3 Lemma 6
The point $z := (x, y)$ is a stationary point of the iterates in Eq. (17) if and only if $z$ is a locally optimal saddle point.
The point is a stationary point of the iterates if and only if . Let’s consider w.l.o.g. only the stationary point condition with respect to , i.e.
We prove that the above equation holds only if . This can be proven by a simple contradiction; suppose that , then multiplying both sides of the above equation by yields
Since the left-hand side is negative and the right-hand side is positive, the above equation leads to a contradiction. Therefore, and . This means that and and therefore according to lemma 4, is a locally optimal saddle point. ∎
A.4 Lemma 7
The proof is based on a simple idea: in a neighbourhood of a locally optimal saddle point, $f$ cannot have extreme curvatures, i.e., $v_z = 0$. Hence, within this neighbourhood the update of Eq. (17) reduces to the gradient update in Eq. (3), which is stable according to [26, 25].
To prove our claim that negative curvature does not exist in this neighbourhood, we make use of the smoothness assumption. Suppose that $z$ lies in this neighbourhood; then the smoothness Assumption 1 implies
Similarly, one can show that
Therefore, the extreme curvature direction is zero according to the definition in Eq. (16). ∎
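A.5 Lemma 8
Suppose that $z^* := (x^*, y^*)$ is an undesired stationary point of the gradient dynamics, namely
$$\nabla f(z^*) = 0, \qquad \|v_{z^*}\| > 0,$$
where $v_{z^*}$ is the extreme curvature direction at $z^*$.
Consider the iterates of Eq. (17) starting from $z_0 = (x_0, y_0)$ in a $\gamma$-neighbourhood of $z^*$. After one step the iterates escape the $\gamma$-neighbourhood of $z^*$, i.e.
$$\|z_1 - z^*\| \ge \gamma$$
for a sufficiently small $\gamma$.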
Preliminaries: Consider compact notations
Characterizing extreme curvature: The choice of ensures that
holds. Since lies in a -neighbourhood of , we can use the smoothness of to relate the negative curvature at to negative curvature in :
Similarly, one can show that
Combining these two bounds yields
To simplify the above bound, we use the compact notation :
Proof of escaping: The squared norm of the update can be computed as