Negative Momentum for Improved Game Dynamics
Games generalize the optimization paradigm by introducing different objective functions for different optimizing agents, known as players. Generative Adversarial Networks (GANs) are arguably the most popular game formulation in recent machine learning literature. GANs achieve great results on generating realistic natural images; however, they are known to be difficult to train. Training them involves finding a Nash equilibrium, typically performed using gradient descent on the two players’ objectives. Game dynamics can induce rotations that slow down convergence to a Nash equilibrium, or prevent it altogether. We provide a theoretical analysis of the game dynamics. Our analysis, supported by experiments, shows that gradient descent with a negative momentum term can improve the convergence properties of some GANs.
Generative adversarial networks (GANs) have achieved impressive results in the task of generating natural images (Goodfellow et al., 2014), and pose both a promising approach for generative models and a very active area of research. Unfortunately, they are generally known to be difficult to train (Berthelot et al., 2017; Radford et al., 2015). Unlike simpler tasks, like classification problems, whose training is governed by well-understood optimization dynamics, GANs are often expressed as a smooth, two-player game where the existence of a pure Nash equilibrium may not even be guaranteed.
Goodfellow et al. (2014) have proved that given a powerful-enough discriminator and generator, there exists a unique Nash Equilibrium in which the optimal generator fully captures the data distribution and the discriminator is unable to discriminate between real and fake samples. However, such an equilibrium may not be achieved by a reasonable neural network and may require an arbitrarily small learning rate either due to bad conditioning or the presence of eigenvalues with large imaginary part in the Jacobian of the associated vector-field (Mescheder et al., 2017).
Intuitively, in a min-max game set-up such as the one in GANs, one player takes an action which is followed by a reciprocal action by the other player. Such a behavior could lead to limit cycles that never converge to a Nash Equilibrium. Researchers have proposed methods to improve these dynamics, which typically include some kind of regularization (Gulrajani et al., 2017) or normalization (Miyato et al., 2018). Our contribution is orthogonal to these choices.
Our main claim is that for a class of min-max games, the dynamics of gradient descent can be improved by using a negative momentum term. Adversarial games tend to exhibit rotational dynamics due to the existence of strong imaginary components in the Jacobian eigenvalues (Mescheder et al., 2017). The idea behind our approach is that the momentum operator has complex eigenvalues, which can be used to manipulate the natural rotational behavior of adversarial games. We summarize our main contributions as follows:
We show that when the eigenvalues of the Jacobian have a large imaginary part, a negative momentum term can improve the local convergence properties of simultaneous gradient descent.
We confirm the benefits of negative momentum for training GANs with the notoriously ill-behaved saturating loss on both toy settings and real datasets.
In Section 2, we review some of the existing works that study, mostly through analysis, GAN stability and convergence. Section 3 describes the fundamentals of the analytic setup that we use, inspired by Mescheder et al. (2017). In Section 4, we present a solution for the optimal step size, and discuss the constraints and intuition behind it. In Section 5, we present our theoretical results and guarantees on negative momentum. Section 6 contains experimental results on toy and real datasets.
2 Related Work
Our work is related to two main bodies of literature: that of game dynamics, with a focus on GAN training, and that of momentum-based optimization. This section summarizes the most relevant pieces of work from that literature.
GANs are generally known to be hard to train. Many works have attempted to make GAN training easier using different approaches. Metz et al. (2017) propose to apply an unrolled optimization to the discriminator and then define an objective for the generator. Similarly, Yadav et al. (2018) propose to stabilize training and also avoid mode collapse by extrapolating the discriminator during optimization, while Daskalakis et al. (2018) extrapolate the next value of the gradient using previous history. Gidel et al. (2018) introduce a variant of the extragradient algorithm extrapolating both the generator and the discriminator. Along a different direction, Roth et al. (2017) introduce a gradient norm penalty in terms of f-divergences which is similar to the gradient penalty proposed by Gulrajani et al. (2017).
Considering that gradient descent does not necessarily converge to the local optima when optimizing multiple objectives, Balduzzi et al. (2018) developed new methods to understand the dynamics of training in general games. They decompose second-order dynamics into two components using Helmholtz decomposition. This decomposition motivated them to propose a new algorithm named SGA (Symplectic Gradient Adjustment) which helps in discovering stable fixed points in general games. Recently, Liang and Stokes (2018) provided a unifying theory for smooth two-player games for non-asymptotic local convergence. They shed light on three stabilizing methods: Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM).
From another perspective, others have analyzed the training dynamics of GANs and their convergence properties. Heusel et al. (2017) study the convergence properties of GANs with SGD, propose a two time-scale update rule (TTUR) for training GANs, and prove that TTUR converges under mild assumptions to a local stationary Nash equilibrium. Mescheder et al. (2017) provide a discussion on how the eigenvalues of the Jacobian govern the local convergence properties of GANs. They argue that the presence of eigenvalues with zero real part and large imaginary part results in oscillatory behavior. Nagarajan and Kolter (2017) also analyze the local stability of GANs as an approximated continuous dynamical system. They show that during training of a GAN, the eigenvalues of the Jacobian of the corresponding vector field are pushed away from 1 along the real axis. In the next section, we extend this discussion, arguing that the eigenvalues of the Jacobian cannot be considered individually. We also show how a negative momentum can modify the eigenvalues of the Jacobian.
From an optimization point of view, numerous studies have sought to understand momentum and its variants (Polyak, 1964; Qian, 1999; Nesterov, 2013; Sutskever et al., 2013). Among recent works, Lessard et al. (2016) use robust control theory to study optimization procedures from a dynamical-systems perspective and adopt Integral Quadratic Constraints (IQC) to obtain convergence guarantees for optimization algorithms. They cover a variety of algorithms, including momentum methods. Although they analyze the global convergence properties of methods as opposed to local analysis, they establish worst-case bounds for smooth and strongly-convex functions. Interestingly, they provide a simple strongly-convex one-dimensional example in which standard momentum fails to converge. Since momentum-based methods, unlike vanilla gradient descent, do not necessarily decrease the function at every step, their analysis can be challenging in general. Wilson et al. (2016) propose Lyapunov functions to overcome this challenge in their analysis. Again from a dynamical-systems perspective, the authors use Ordinary Differential Equations (ODEs) to interpret the trajectories found by momentum-based optimization methods. Another important aspect of their work is that they analyze a continuous-time system in which the learning rate is small and subsequently discretize the continuous trajectory. While the global convergence properties of momentum-based methods in the convex setup remain obscure, Ghadimi et al. (2015) analyze momentum and provide global convergence properties for functions with Lipschitz-continuous gradient.
In the next section, we leverage the rich literature on GAN training and momentum-based optimization and focus on momentum analysis in the context of two-player games such as GANs.
In this paper, scalars are lower-case letters (e.g., $\lambda$), vectors are lower-case bold letters (e.g., $\boldsymbol{\theta}$), matrices are upper-case bold letters (e.g., $\mathbf{A}$) and operators are upper-case letters (e.g., $F$). The spectrum of a square matrix $\mathbf{A}$ is denoted by $\mathrm{Sp}(\mathbf{A})$. We use $\Re$ and $\Im$ to respectively denote the real and imaginary part of a complex number.
Game theory formulation of GANs
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) consist of a discriminator and a generator . While the discriminator’s objective is to discriminate real from fake (generated) samples, the generator’s objective is to fool the discriminator. Ideally, iteratively optimizing both the discriminator and the generator will make the generator fully capture the real data distribution.
From a game theory point of view, GAN training is a differentiable two-player game: the discriminator aims at minimizing its cost function and the generator aims at minimizing its own cost function . Using the same formulation as the one in Mescheder et al. (2017) and Gidel et al. (2018), the GAN objective has the following form,
Given such a game setup, GAN training consists of finding a local Nash Equilibrium, which is a state in which neither the discriminator nor the generator can improve their cost by a small change in their parameters. In order to analyze the dynamics of gradient-based method in vicinity of a Nash Equilibrium, we look at the gradient vector field and its associated Jacobian,
Games in which are called zero-sum games and (1) can be reformulated as a min-max problem. This is the case for the original min-max GAN formulation, but not the case for the non-saturating loss (Goodfellow et al., 2014). Deep learning practitioners often use the latter formulation as it is easier to train.
For a zero-sum game, we note . When the matrices and are zero, the Jacobian is antisymmetric and has pure imaginary eigenvalues. We will call the games with pure imaginary eigenvalues purely adversarial games. This is the case in a simple bilinear game . This game can be formulated in a GAN where the true distribution is a Dirac on 0, the generator is a Dirac on and the Discriminator is linear. This setup was extensively studied in 2D by Mescheder (2018).
Conversely, when is zero and the matrices and are symmetric positive definite, the Jacobian is symmetric and has real positive eigenvalues. We will call the games with real positive eigenvalues purely cooperative games. This is the case, for example, when the objective function is separable such as where and are two convex functions. Thus, the optimization can be reformulated as two separate minimizations of and with respect to their respective parameters.
In this work we are interested in games which are in between the purely adversarial games and the purely cooperative ones, i.e., games which have eigenvalues with positive real part (cooperative component) and a non-zero imaginary part (adversarial component). A simple class of such games is
Note how the cooperative term introduced here is similar to the gradient penalty recently used to stabilize GANs training (Gulrajani et al., 2017).
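The three regimes discussed above (purely adversarial, purely cooperative, and in between) can be checked numerically. The sketch below is illustrative only. It assumes a game of the form L(θ, φ) = θᵀAφ + (c/2)(‖θ‖² − ‖φ‖²) in which θ minimizes and φ maximizes; the function name `game_jacobian`, the parameter `c` and the random matrix `A` are hypothetical choices made for this example and are not necessarily the parametrization used in (3).

```python
import numpy as np

def game_jacobian(A, c):
    """Jacobian of v(theta, phi) = (grad_theta L, -grad_phi L) for the
    assumed game L = theta^T A phi + (c/2) * (|theta|^2 - |phi|^2)."""
    n, m = A.shape
    return np.block([
        [c * np.eye(n), A],
        [-A.T, c * np.eye(m)],
    ])

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

# Purely adversarial: no cooperative term, antisymmetric Jacobian.
adversarial = np.linalg.eigvals(game_jacobian(A, c=0.0))
# Purely cooperative: no interaction term, symmetric positive definite Jacobian.
cooperative = np.linalg.eigvals(game_jacobian(np.zeros((3, 3)), c=1.0))
# In between: positive real part and non-zero imaginary part.
mixed = np.linalg.eigvals(game_jacobian(A, c=0.5))

print("adversarial:", np.round(adversarial, 3))
print("cooperative:", np.round(cooperative, 3))
print("mixed:      ", np.round(mixed, 3))
```

Under these assumptions, the spectrum is purely imaginary for c = 0, real and positive for A = 0, and otherwise each eigenvalue has the form c ± iσ, where σ is a singular value of A.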
The goal of this paper is to show that negative momentum can actually damp the adversarial component of a game.
Fixed point dynamics
To analyze the effect of the presence of complex eigenvalues, consider the dynamics of Simultaneous Gradient Method. Simultaneous Gradient Method is defined as the repeated application of the following operator on ,
where is the learning rate. The sequence computed by the gradient method is then
Then, if the gradient method converges, and its limit point is a fixed point of such that is positive-definite, then is a local Nash equilibrium.
Theorem 1 (Proposition 4.4.1 Bertsekas (1999)).
Let be the largest eigenvalue (in magnitude) of . If then, for in a neighborhood of , the sequence converges to the local Nash equilibrium at a rate of .
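As a small numerical illustration of this statement, the sketch below applies the simultaneous gradient update x ← x − η v(x) to a linear vector field v(x) = Jx and compares the observed contraction with the spectral radius of I − ηJ. The 2×2 Jacobian, the values of `a`, `b` and `eta`, and all function names are assumptions chosen for the example.

```python
import numpy as np

a, b = 0.5, 1.0                    # assumed cooperative / adversarial strengths
J = np.array([[a, b], [-b, a]])    # Jacobian of v, with eigenvalues a +/- i*b

def v(x):
    """Linear game vector field with fixed point at the origin."""
    return J @ x

def spectral_radius(M):
    """Largest eigenvalue magnitude of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(M)))

eta = 0.4
rho = spectral_radius(np.eye(2) - eta * J)

x0 = np.array([1.0, 1.0])
x = x0.copy()
for _ in range(100):
    x = x - eta * v(x)             # simultaneous gradient step

observed = np.linalg.norm(x) / np.linalg.norm(x0)
predicted = rho ** 100
print(f"rho = {rho:.4f}, observed = {observed:.3e}, predicted = {predicted:.3e}")
```

For this particular Jacobian the matrix I − ηJ is a scaled rotation, so the distance to the equilibrium shrinks by exactly the spectral radius at every step; for a general Jacobian the rate is only attained asymptotically.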
4 Tuning the step-size
According to Theorem 1, we want to have eigenvalues inside a convergence circle of radius to achieve linear convergence. This requires to be positive-definite, i.e., have eigenvalues with positive real component. Then there exists an optimal step-size which yields a non-trivial convergence rate . The following theorem gives analytic bounds on the optimal step size , and upper-bounds the best convergence rate we can expect.
Theorem 2.
The optimal step-size , which minimizes the spectral norm of , is the solution of a (convex) piecewise-quadratic problem, and satisfies
where is the spectrum of .
In Theorem 2 we can see that if the Jacobian of has almost pure imaginary eigenvalues then the convergence rate of the gradient method can be arbitrarily close to 1. When is positive-definite, the best is attained because of only one or several limiting eigenvalues. We illustrate and interpret these two regimes in Figure 1.
One way to thwart this slow linear convergence rate is to manipulate the imaginary part of the eigenvalues of our fixed point operator. Zhang et al. (2017) provide an analysis of the momentum method for the convex case, showing that momentum can actually help to better condition the model. An even more interesting point from their work is that the best condition number is achieved when the added momentum turns the positive real eigenvalues of the initial Jacobian into complex conjugate eigenvalues. Consequently, momentum looks promising as a way to act on the imaginary part of the eigenvalues of the Jacobian of our game vector field.
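To make the dependence on the imaginary part concrete, the following worked computation treats the simplest case, under the assumption that a single eigenvalue $\lambda$ of the Jacobian with $\Re(\lambda) > 0$ limits the rate. The symbols $\eta$ (step size) and $\lambda$ (eigenvalue) are notation assumed here for illustration.

```latex
\begin{align*}
|1 - \eta\lambda|^2 &= 1 - 2\eta\,\Re(\lambda) + \eta^2 |\lambda|^2, \\
\eta^\star &= \frac{\Re(\lambda)}{|\lambda|^2} = \Re\!\left(\frac{1}{\lambda}\right), \\
|1 - \eta^\star\lambda|^2 &= 1 - \frac{\Re(\lambda)^2}{|\lambda|^2}
                           = \frac{\Im(\lambda)^2}{|\lambda|^2}.
\end{align*}
```

In this single-eigenvalue setting the best achievable rate depends only on the ratio between the imaginary part and the modulus of $\lambda$, so it approaches 1 as the eigenvalue approaches the imaginary axis.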
5 Negative Momentum
As shown in the previous section, the presence of eigenvalues with large imaginary part can restrict us to small step sizes, which leads to slow convergence rates. In practice, the presence of a large imaginary part in the eigenvalues seems to be characterized by an oscillation phenomenon where the iterates revolve around the equilibrium . In order to alleviate this convergence issue, we introduce a form of friction into the updates by adding a negative momentum term to the update rule, which modifies the parameter update operator of (4).
in which is the momentum rate. Therefore, the Jacobian has the following form,
Note that for , we recover the Gradient Method .
We claim that in some situations, if is adjusted properly, negative momentum can prevent the iterates of Gradient Method from circling around the Nash Equilibrium and instead guide them towards it by pushing the eigenvalues of the Jacobian towards the origin. In the following theorem, we provide an explicit equation for the eigenvalues of the Jacobian of .
Theorem 3.
The eigenvalues of are
where and is the complex square root of with positive real part
When is small enough, is a complex number close to . Consequently, is close to , which can be seen as shifting the original eigenvalue, and , the eigenvalue introduced by the state augmentation, is close to 0. (55) formalizes this intuition providing the first order approximation of both eigenvalues.
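The effect of the state augmentation can be explored numerically. The sketch below assumes the standard heavy-ball form of the momentum update, x_{t+1} = x_t − η J x_t + β (x_t − x_{t−1}), for a linear vector field with Jacobian J. Under that assumption, each eigenvalue λ of J yields two eigenvalues of the augmented operator, the roots of μ² − (1 − ηλ + β) μ + β = 0. The names `momentum_jacobian` and `closed_form_roots`, and the values of `a`, `b` and `beta`, are hypothetical choices for this example.

```python
import numpy as np

def momentum_jacobian(J, eta, beta):
    """Jacobian of the augmented map (x_t, x_{t-1}) -> (x_{t+1}, x_t) for
    the assumed update x_{t+1} = x_t - eta*J x_t + beta*(x_t - x_{t-1})."""
    d = J.shape[0]
    I = np.eye(d)
    return np.block([
        [(1 + beta) * I - eta * J, -beta * I],
        [I, np.zeros((d, d))],
    ])

def closed_form_roots(lam, eta, beta):
    """Roots of mu^2 - (1 - eta*lam + beta)*mu + beta = 0."""
    s = 1 - eta * lam + beta
    disc = np.sqrt(complex(s * s - 4 * beta))
    return np.array([(s + disc) / 2, (s - disc) / 2])

def spectral_radius(M):
    """Largest eigenvalue magnitude of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(M)))

a, b = 0.1, 1.0                      # mostly adversarial: eigenvalues 0.1 +/- 1i
J = np.array([[a, b], [-b, a]])
lam = complex(a, b)
eta = (1 / lam).real                 # best step size for this eigenvalue without momentum

rates = {beta: spectral_radius(momentum_jacobian(J, eta, beta))
         for beta in (-0.05, 0.0, 0.05)}
for beta, rate in rates.items():
    print(f"beta = {beta:+.2f}  spectral radius = {rate:.5f}")
```

In this example a small negative momentum lowers the spectral radius relative to plain gradient steps at the same step size, while a small positive momentum raises it.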
In Figure 2, we illustrate the effects of negative momentum on the games described in (3), from more cooperative () to more adversarial (). In all cases, negative momentum shifts the original eigenvalues (trajectories in light red) by pushing them to the left towards the origin (trajectories in light blue). However, negative momentum also introduces new eigenvalues due to the state augmentation. Those new eigenvalues have no impact on the convergence rate for more adversarial games ( close to 0), but can be detrimental to the convergence rate in more cooperative settings. Refer to the caption for more details.
Since our goal is to minimize the largest magnitude of the eigenvalues of which are computed in Thm. 3, we want to understand the effect of on these eigenvalues with potential large magnitude. Let , we define the (squared) magnitude that we want to optimize as,
We study the local behavior of for small values of . The following theorem shows that a well suited decreases , which corresponds to faster convergence.
Theorem 4.
For any non-real , we have,
which is positive if and only if . In particular, the optimal step-size belongs to this interval since we have .
As we have seen previously in Figure 1 and Theorem 2, there are only a few eigenvalues which slow down the convergence. Theorem 4 is a local result showing that a small enough negative momentum can improve the magnitude of the limiting eigenvalues in the following cases: when there is only one limiting eigenvalue (since in that case the optimal step size is ) or when there are several limiting eigenvalues and the intersection is not empty. We point out that we do not provide any guarantees on whether this intersection is empty or not and leave it for future work.
Since our result is local, it does not provide any guarantees on large negative values of . Nevertheless, we numerically optimized (11) with respect to and and found that for any non-real fixed eigenvalue , the optimum momentum is negative and the associated optimum step size is larger than . Another interesting aspect of negative momentum is that it allows larger admissible step sizes (see Figure 2 and 3).
Purely adversarial game
The analysis above only provides convergence rates for games without any pure imaginary eigenvalues. It therefore excludes the purely adversarial bilinear example ( in (3)). Combining alternating gradient steps and negative momentum can alleviate this issue. We discuss this idea with a study of the bilinear case in Appendix A.
6 Results and Discussion
Min-Max Bilinear Game
[Figure 3] In our first set of experiments, we showcase the effect of negative momentum in a bi-dimensional min-max optimization setup (3) where and (Mescheder, 2018). This setup allows us to bring more intuition to the effect of negative momentum in a GAN setup.
We compare the effect of positive and negative momentum for both the alternating and the simultaneous gradient method. This game is purely adversarial: simultaneous gradient steps diverge. With alternating steps, one can obtain convergence by using negative momentum.
Mixture of Gaussians
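This comparison can be reproduced in a few lines for a scalar version of the game. The sketch below assumes the objective min over θ, max over φ of θ·φ, a heavy-ball momentum term with the same rate `beta` for both players, and hypothetical values for the step size, the momentum and the number of steps; the function `run` is introduced only for this illustration.

```python
import numpy as np

def run(eta, beta, alternating, steps=1000):
    """Heavy-ball updates on min_theta max_phi theta*phi.
    Returns the final distance to the equilibrium (0, 0)."""
    theta, phi = 1.0, 1.0
    theta_prev, phi_prev = theta, phi          # zero initial velocity
    for _ in range(steps):
        # theta descends on theta*phi (gradient is phi).
        theta_next = theta - eta * phi + beta * (theta - theta_prev)
        # phi ascends; with alternating steps it sees the updated theta.
        theta_seen = theta_next if alternating else theta
        phi_next = phi + eta * theta_seen + beta * (phi - phi_prev)
        theta_prev, phi_prev = theta, phi
        theta, phi = theta_next, phi_next
    return float(np.hypot(theta, phi))

print("simultaneous, no momentum:      ", run(0.5, 0.0, alternating=False))
print("alternating, no momentum:       ", run(0.5, 0.0, alternating=True))
print("alternating, negative momentum: ", run(0.5, -0.1, alternating=True))
print("alternating, positive momentum: ", run(0.5, 0.1, alternating=True))
```

With these assumed settings, simultaneous steps blow up, alternating steps without momentum keep circling at a bounded distance, alternating steps with negative momentum approach the equilibrium, and alternating steps with positive momentum diverge.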
[Figure 4] In this set of experiments we evaluate the effect of using negative momentum for a GAN with saturating loss. The data in this experiment comes from eight Gaussian distributions which are distributed uniformly around the unit circle. The goal is to force the generator to generate 2-D samples that are coming from all of the 8 distributions. Although this looks like a simple task, many GANs fail to generate diverse samples in this setup. This experiment shows whether the algorithm prevents mode collapse or not.
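A minimal sketch of such a dataset is given below, together with a simple way to count how many modes a set of samples covers. The standard deviation `std`, the random seeds, the nearest-center assignment rule and all function names are assumptions made for this example; the text above only specifies eight Gaussians placed uniformly around the unit circle.

```python
import numpy as np

def eight_centers():
    """Centers of the eight modes, evenly spaced on the unit circle."""
    angles = 2 * np.pi * np.arange(8) / 8
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def sample_eight_gaussians(n, std=0.05, seed=0):
    """Draw n points from the mixture; std is an assumed value."""
    rng = np.random.default_rng(seed)
    centers = eight_centers()
    labels = rng.integers(0, 8, size=n)
    points = centers[labels] + std * rng.standard_normal((n, 2))
    return points, labels

def modes_covered(points):
    """Number of distinct modes that are the nearest center of some point."""
    centers = eight_centers()
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return len(np.unique(dists.argmin(axis=1)))

points, labels = sample_eight_gaussians(1000)
print(points.shape, modes_covered(points))
```

Applying `modes_covered` to generator samples gives a rough indicator of mode collapse: a value below 8 means that at least one mode is never the closest one to any generated point.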
We use a fully connected network with 4 hidden ReLU layers where each layer has 256 hidden units. The latent code of the generator is an 8-dimensional multivariate Gaussian. The model is trained for 100,000 iterations with a learning rate of 0.01 for stochastic gradient descent, with momentum values of zero, -0.5 and 0.5. We observe that negative momentum considerably improves the results compared to positive or zero momentum.
Fashion MNIST and CIFAR 10
[Figures 5 and 6] In our third set of experiments, we use negative momentum in a GAN setup on CIFAR-10 (Krizhevsky and Hinton, 2009) and Fashion-MNIST (Xiao et al., 2017) with saturating loss. We use residual networks for both the generator and the discriminator with no batch-normalization. Following the same architecture as Gulrajani et al. (2017), each residual block is made of two convolutional layers with ReLU activation function. Up-sampling and down-sampling layers are respectively used in the generator and discriminator. We experiment with different values of momentum on the discriminator and a constant value of 0.5 for the momentum of the generator. We observe that using a negative value can generally result in samples with higher quality and inception score. Intuitively, using negative momentum only on the discriminator slows down the learning process of the discriminator and allows for better flow of the gradient to the generator. Figure 6 also shows the results of using negative momentum on the Fashion-MNIST dataset.
In this paper, we study the effect of using negative values of momentum in a GAN setup. We show that for a class of adversarial games, using negative momentum can improve the convergence rate of gradient-based methods by shifting the eigenvalues of the Jacobian appropriately into a smaller convergence disk. We found that in simple yet intuitive examples, using negative momentum makes convergence to the Nash Equilibrium easier. We noted that our intuitions on negative momentum can generalize to saturating GANs on the mixture of Gaussians task along with other datasets such as CIFAR-10 and Fashion-MNIST. Our experiments strongly support the use of negative momentum with saturating loss. Altogether, fully stabilizing learning in GANs requires a deep understanding of the underlying highly non-linear dynamics. We believe our work is a step towards a better understanding of these dynamics. We encourage deep learning researchers and practitioners to include negative values of momentum in their hyper-parameter search.
We believe that our results explain a decreasing trend in momentum values reported in GAN literature in the past few years. Some of the most successful papers use zero momentum (Arjovsky et al., 2017; Gulrajani et al., 2017) for architectures that would otherwise call for high momentum values in a non-adversarial setting.
This research was partially supported by the Canada Excellence Research Chair in “Data Science for Real-time Decision-making”, by the NSERC Discovery Grant RGPIN-2017-06936, a Google Focused Research Award and an IVADO grant. Authors would like to thank NVIDIA Corporation for providing the NVIDIA DGX-1 used for this research. Authors are also grateful to Frédéric Bastien, Florian Bordes, Adam Beberg, Cam Moore and Nithya Natesan for their support.
Appendix A Purely adversarial games
In this section we will try to close the gap between simultaneous and alternating gradient steps. In the main text we studied the effect of negative momentum on simultaneous updates for the sake of simplicity. Yet practitioners tend to use alternating steps. We study the effect of alternating steps on the eigenvalues of the bilinear min-max game.
This is the purely adversarial case of our class of games (3). This game has pure imaginary eigenvalues. Theorem 3 tells us that negative momentum alone cannot help: the eigenvalues of the modified operator remain on the imaginary axis. However, we show that momentum combined with an alternating method provides a convergent algorithm. First let us define the operator of interest,
This operator uses different momentum values for each parameter. With the alternating method, each player influences the other much more, which induces more asymmetry in the game. In order to stabilize such a game, it seems promising to use different momentum values for each player. This intuition is backed up by the analysis of the eigenvalues of the Jacobian of ,
Theorem 5.
The eigenvalues of the Jacobian of are the roots of the following quartic (degree-4) polynomials.
Particularly for we get
Giving the following set of eigenvalues,
In (15), if the two momentum values and cannot be transposed and consequently the optimal values for and may be different. We leave an extensive study of this polynomial for future work.
Experimentally, we found it hard to get good results with equal momentum values for the discriminator and the generator, indicating that asymmetric momentum could help more than the symmetric one.
Proof of Theorem 5.
The Jacobian of is,
Leading to the compressed form
Then the characteristic polynomial of the Jacobian is equal to
Then we can use Lemma 1 to compute the characteristic polynomial of the Jacobian and get
where for the third equality we multiplied the first block row by and added it to the second block row. Then, using Lemma 1, we get,
where are the eigenvalues of (note that they are positive since is symmetric positive definite). Particularly, when we get,
Appendix B Proof of the Theorems
B.1 Proof of Theorem 1
Let be the largest eigenvalue (in magnitude) of . If then, for in a neighborhood of , the sequence converges linearly to the local Nash equilibrium at a rate of .
In this theorem, the notation stands for
Let us recall that we define the spectral radius of a matrix as,
For brevity let us write for and . Let .
By Proposition A.15 [Bertsekas, 1999] there exists a norm such that its induced matrix norm has the following property:
Then by definition of the sequence and since is a fixed point of , we have that,
Since is assumed to be continuously differentiable by the mean value theorem we have that
for some . Then,
where is the induced matrix norm of .
Since the induced norm of a square matrix is continuous in its elements and since we assumed that was continuous, there exists such that,
Finally, we get that if , then,
Our goal is to find the eigenvalues of this Jacobian. is a square matrix that can be triangularized in . First of all, let us recall the following lemma,
Lemma 1.
Let be four matrices such that and commute. Then
where is the determinant of .
See [Zhang, 2006, Section 0.3]. ∎
B.2 Proof of Theorem 2
The eigenvalues of are , for . Our goal is to solve
where is the spectrum of . We can expand the magnitude to get,
Then the function is convex and piecewise quadratic. This function goes to as gets larger and therefore has a minimum over . We can then notice that the functions reach their minimum for . Consequently, if we order the eigenvalues such that,
we have that
Moreover, it is easy to notice that,
Then expanding , we get that,
B.3 Proof of Theorem 3
Recall that the Jacobian of is
Then if we compute the characteristic polynomial of we have,
where and is an upper-triangular matrix. Finally, by Lemma 1 we have that,
where . Let one of the we have,
The roots of this polynomial are
where and . This can be rewritten as,
where and is the complex square root of with real positive part (if is a real negative number, we set ). Moreover we have the following Taylor approximation,
B.4 Proof of Theorem 4
Recall the definition of
First let us notice that . Denoting , computing the derivative of gives us
which leads to,
The sign of is determined by the sign of , which is a quadratic function. This function is positive on the interval .
- Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM)
- If is a negative real number we set
- M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
- D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
- D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
- D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
- C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. 2018.
- E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, pages 310–315. IEEE, 2015.
- G. Gidel, H. Berard, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial nets. arXiv, 2018.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
- M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. 2017.
- A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
- T. Liang and J. Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
- L. Mescheder. On the convergence properties of GAN training. arXiv preprint arXiv:1801.04406, 2018.
- L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. 2017.
- L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
- T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. 2018.
- V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. 2017.
- Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
- A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, 2017.
- I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
- A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.
- H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. 2018.
- F. Zhang. The Schur complement and its applications. Springer Science & Business Media, 2006.
- J. Zhang, I. Mitliagkas, and C. Ré. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.