Differentiable Game Mechanics
Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood – and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics in n-player differentiable games.
The key result is to decompose the game Jacobian into two components. The first, symmetric component, is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component, relates to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs – while at the same time being applicable to, and having guarantees in, much more general cases.
Journal of Machine Learning Research 20 (2019). Submitted 1/19; Published 2/19; Paper 19-008. Alistair Letcher, David Balduzzi, Sébastien Racanière, James Martens, Jakob Foerster, Karl Tuyls, Thore Graepel.
Keywords: Game Theory, Generative Adversarial Networks, Deep Learning, Classical Mechanics, Hamiltonian Mechanics, Gradient Descent, Dynamical Systems
A significant fraction of recent progress in machine learning has been based on applying gradient descent to optimize the parameters of neural networks with respect to an objective function. The objective functions are carefully designed to encode particular tasks such as supervised learning. A basic result is that gradient descent converges to a local minimum of the objective function under a broad range of conditions (Lee et al., 2017). However, there is a growing set of algorithms that do not optimize a single objective function, including: generative adversarial networks (Goodfellow et al., 2014; Zhu et al., 2017), proximal gradient TD learning (Liu et al., 2016), multi-level optimization (Pfau and Vinyals, 2016), synthetic gradients (Jaderberg et al., 2017), hierarchical reinforcement learning (Wayne and Abbott, 2014; Vezhnevets et al., 2017), intrinsic curiosity (Pathak et al., 2017; Burda et al., 2019), and imaginative agents (Racanière et al., 2017). In effect, the models are trained via games played by cooperating and competing modules.
The time-average of iterates of gradient descent, and other more general no-regret algorithms, are guaranteed to converge to coarse correlated equilibria in games (Stoltz and Lugosi, 2007). However, the dynamics do not converge to Nash equilibria – and do not even stabilize in general (Mertikopoulos et al., 2018; Papadimitriou and Piliouras, 2018). Concretely, cyclic behaviors emerge even in simple cases, see example 1.
This paper presents an analysis of the second-order structure of game dynamics that allows us to identify two classes of games, potential and Hamiltonian, that are easy to solve separately. We then derive symplectic gradient adjustment (SGA), a method for finding stable fixed points in games; source code is available at https://github.com/deepmind/symplectic-gradient-adjustment. SGA's performance is evaluated in basic experiments.
1.1 Background and problem description
Tractable algorithms that converge to Nash equilibria have been found for restricted classes of games: potential games, two-player zero-sum games, and a few others (Hart and Mas-Colell, 2013). Finding Nash equilibria can be reformulated as a nonlinear complementarity problem, but such problems are 'hopelessly impractical to solve' in general (Shoham and Leyton-Brown, 2008) because the problem is PPAD-hard (Daskalakis et al., 2009).
Players are primarily neural nets in our setting. For computational reasons we restrict to gradient-based methods, even though game-theorists have considered a much broader range of techniques. Losses are not necessarily convex in any of their parameters, so Nash equilibria are not guaranteed to exist. Even leaving existence aside, finding Nash equilibria in nonconvex games is analogous to, but much harder than, finding global minima in neural nets – which is not realistic with gradient-based methods.
There are at least three problems with gradient-based methods in games. Firstly, the potential existence of cycles (recurrent dynamics) implies there are no convergence guarantees, see example 1 below and Mertikopoulos et al. (2018). Secondly, even when gradient descent converges, the rate of convergence may be too slow in practice because ‘rotational forces’ necessitate extremely small learning rates, see figure 4. Finally, since there is no single objective, there is no way to measure progress. Concretely, the losses obtained by the generator and the discriminator in GANs are not useful guides to the quality of the images generated. Application-specific proxies have been proposed, for example the inception score for GANs (Salimans et al., 2016), but these are of little help during training. The inception score is domain specific and is no substitute for looking at samples. This paper tackles the first two problems.
1.2 Outline and summary of main contributions
The infinitesimal structure of games.
We start with the basic case of a zero-sum bimatrix game: example 1. It turns out that the dynamics under simultaneous gradient descent can be reformulated in terms of Hamilton’s equations. The cyclic behavior arises because the dynamics live on the level sets of the Hamiltonian. More directly useful, gradient descent on the Hamiltonian converges to a Nash equilibrium.
Lemma 1 shows that the Jacobian of any game decomposes into symmetric and antisymmetric components. There are thus two 'pure' cases corresponding to when the Jacobian is symmetric and anti-symmetric. The first case, known as potential games (Monderer and Shapley, 1996), has been intensively studied in the game-theory literature because potential games are exactly the games where gradient descent does converge.
The second case, Hamiltonian games, was not studied previously, probably because Hamiltonian games coincide with zero-sum games in the bimatrix case (or constant-sum, depending on the constraints). (Lu (1992) defined an unrelated notion of Hamiltonian game.) Zero-sum and Hamiltonian games differ when the losses are not bilinear or when there are more than two players. Hamiltonian games are important because (i) they are easy to solve and (ii) general games combine potential-like and Hamiltonian-like dynamics. Unfortunately, the concept of a zero-sum game is too loose to be useful when there are many players: any n-player game can be reformulated as a zero-sum (n+1)-player game where ℓ_{n+1} = −∑_{i=1}^n ℓ_i. In this respect, zero-sum games are as complicated as general-sum games. In contrast, Hamiltonian games are much simpler than general-sum games. Theorem 4 shows that Hamiltonian games obey a conservation law – which also provides the key to solving them, by gradient descent on the conserved quantity.
The general case, neither potential nor Hamiltonian, is more difficult and is therefore the focus of the remainder of the paper. Section 3 proposes symplectic gradient adjustment (SGA), a gradient-based method for finding stable fixed points in general games. Appendix A contains TensorFlow code to compute the adjustment. The algorithm computes two Jacobian-vector products, at a cost of two iterations of backprop. SGA satisfies a few natural desiderata explained in section 3.1: (D1) it is compatible with the original dynamics; and it is guaranteed to find stable equilibria in (D2) potential and (D3) Hamiltonian games.
For general games, correctly picking the sign of the adjustment (whether to add or subtract) is critical since it determines the behavior near stable and unstable equilibria. Section 2.4 defines stable equilibria and contrasts them with local Nash equilibria. Theorem 10 proves that SGA converges locally to stable fixed points for sufficiently small parameters (which we quantify via the notion of an additive condition number). While this guarantee is strong, the required parameters may be impractically small and may slow down convergence significantly. Accordingly, lemma 11 shows how to set the sign of the adjustment so as to be attracted towards stable equilibria and repelled from unstable ones. Correctly aligning SGA allows higher learning rates and faster, more robust convergence, see theorem 15. Finally, theorem 17 tackles the remaining class of saddle fixed points by proving that SGA locally avoids strict saddles for appropriate parameters.
We investigate the empirical performance of SGA in four basic experiments. The first experiment shows how increasing alignment allows higher learning rates and faster convergence, figure 4. The second set of experiments compares SGA with optimistic mirror descent on two-player and four-player games. We find that SGA converges over a much wider range of learning rates.
The last two sets of experiments investigate mode collapse, mode hopping and the related, less well-known problem of boundary distortion identified in Santurkar et al. (2018). Mode collapse and mode hopping are investigated in a setup involving a two-dimensional mixture of 16 Gaussians that is somewhat more challenging than the original problem introduced in Metz et al. (2017). Whereas simultaneous gradient descent completely fails, our symplectic adjustment leads to rapid convergence – slightly improved by correctly choosing the sign of the adjustment.
Finally, boundary distortion is studied using a 75-dimensional spherical Gaussian. Mode collapse is not an issue since the data distribution is unimodal. However, as shown in figure 10, a vanilla GAN with RMSProp learns only one of the eigenvalues in the spectrum of the covariance matrix, whereas SGA approximately learns all of them.
The appendix provides some background information on differential and symplectic geometry, which motivated the developments in the paper. The appendix also explores what happens when the analogy with classical mechanics is pushed further than perhaps seems reasonable. We experiment with assigning units (in the sense of masses and velocities) to quantities in games, and find that type-consistency yields unexpected benefits.
1.3 Related work
Nash (1950) was only concerned with existence of equilibria. Convergence in two-player games was studied in Singh et al. (2000). WoLF (Win or Learn Fast) converges to Nash equilibria in two-player two-action games (Bowling and Veloso, 2002). Extensions include weighted policy learning (Abdallah and Lesser, 2008) and GIGA-WoLF (Bowling, 2004). Infinitesimal Gradient Ascent (IGA) is a gradient-based approach that is shown to converge to pure Nash equilibria in two-player two-action games; cyclic behaviour may occur in the case of mixed equilibria. Zinkevich (2003) generalised the algorithm to n-action games, yielding GIGA. Optimistic mirror descent approximately converges in two-player bilinear zero-sum games (Daskalakis et al., 2018), a special case of Hamiltonian games. In more general settings it converges to coarse correlated equilibria.
Convergence has also been studied in various n-player settings, see Rosen (1965); Scutari et al. (2010); Facchinei and Kanzow (2010); Mertikopoulos and Zhou (2016). However, the recent success of GANs, where the players are neural networks, has focused attention on a much larger class of nonconvex games where comparatively little is known, especially in the n-player case. Heusel et al. (2017) propose a two-time-scale method to find Nash equilibria. However, it likely scales badly with the number of players. Nagarajan and Kolter (2017) prove convergence for some algorithms, but under very strong assumptions (Mescheder et al., 2018). Consensus optimization (Mescheder et al., 2017) is closely related to our proposed algorithm, and is extensively discussed in section 3. A variety of game-theoretically or minimax motivated modifications to vanilla gradient descent have been investigated in the context of GANs, see Mertikopoulos et al. (2019); Gidel et al. (2018).
Learning with opponent-learning awareness (LOLA) infinitesimally modifies the objectives of players to take into account their opponents’ goals (Foerster et al., 2018). However, Letcher et al. (2019) recently showed that LOLA modifies fixed points and thus fails to find stable equilibria in general games.
Symplectic gradient adjustment was independently discovered by Gemp and Mahadevan (2018), who refer to it as “crossing-the-curl”. Their analysis draws on powerful techniques from variational inequalities and monotone optimization that are complementary to those developed here – see for example Gemp and Mahadevan (2016, 2017); Gidel et al. (2019). Using techniques from monotone optimization, Gemp and Mahadevan (2018) obtained more detailed and stronger results than ours, in the more particular case of Wasserstein LQ-GANs, where the generator is linear and the discriminator is quadratic (Feizi et al., 2017; Nagarajan and Kolter, 2017).
Network zero-sum games are shown to be Hamiltonian systems in Bailey and Piliouras (2019). The implications of the existence of invariant functions for games are just beginning to be understood and explored.
Dot products are written as ⟨u, v⟩ or uᵀv. The angle between two vectors u and v is θ(u, v). Positive definiteness of a matrix S is denoted S ≻ 0.
2 The Infinitesimal Structure of Games
In contrast to the classical formulation of games, we do not constrain the parameter sets to the probability simplex or require losses to be convex in the corresponding players’ parameters. Our motivation is that we are primarily interested in use cases where players are interacting neural nets such as GANs (Goodfellow et al., 2014), a situation in which results from classical game theory do not straightforwardly apply.
Definition 1 (differentiable game)
A differentiable game consists of a set of n players and corresponding twice continuously differentiable losses ℓ_i : ℝ^d → ℝ for i = 1, …, n. Parameters are w = (w_1, …, w_n) ∈ ℝ^d where w_i ∈ ℝ^{d_i} and ∑_{i=1}^n d_i = d. Player i controls w_i, and aims to minimize its loss ℓ_i.
It is sometimes convenient to write w = (w_i, w_{−i}), where w_{−i} concatenates the parameters of all the players other than player i, which is placed out of order by abuse of notation.
The simultaneous gradient is the gradient of the losses with respect to the parameters of the respective players: ξ(w) := (∇_{w_1}ℓ_1, …, ∇_{w_n}ℓ_n) ∈ ℝ^d.
By the dynamics of the game, we mean following the negative of the vector field, −ξ, with infinitesimal steps. There is no reason to expect ξ to be the gradient of a single function in general, and therefore no reason to expect the dynamics to converge to a fixed point.
2.1 Hamiltonian Mechanics
Hamiltonian mechanics is a formalism for describing the dynamics in classical physical systems, see Arnold (1989); Guillemin and Sternberg (1990). The system is described via canonical coordinates (q, p). For example, q often refers to position and p to momentum of a particle or particles.
The Hamiltonian of the system is a function H(q, p) that specifies the total energy as a function of the generalized coordinates. For example, in a closed system the Hamiltonian is given by the sum of the potential and kinetic energies of the particles. The time evolution of the system is given by Hamilton's equations: dq/dt = ∂H/∂p and dp/dt = −∂H/∂q.
An important consequence of the Hamiltonian formalism is that the dynamics of the physical system – that is, the trajectories followed by the particles in phase space – live on the level sets of the Hamiltonian. In other words, the total energy is conserved.
2.2 Hamiltonian Mechanics in Games
The next example illustrates the essential problem with gradients in games and the key insight motivating our approach.
Example 1 (Conservation of energy in a zero-sum unconstrained bimatrix game)
Zero-sum games, where ∑_i ℓ_i ≡ 0, are well-studied. The zero-sum game with losses ℓ_1(x, y) = xᵀAy = −ℓ_2(x, y) has a Nash equilibrium at (x, y) = (0, 0). The simultaneous gradient ξ(x, y) = (Ay, −Aᵀx) rotates around the Nash, see figure 1.

The matrix A admits singular value decomposition (SVD) A = UᵀDV. Changing to coordinates u = D^{1/2}Ux and v = D^{1/2}Vy gives ℓ_1(u, v) = uᵀv = −ℓ_2(u, v). Introduce the Hamiltonian H(u, v) = ½(‖u‖² + ‖v‖²).

Remarkably, the dynamics can be reformulated via Hamilton's equations in the coordinates given by the SVD of A: ξ(u, v) = (∂H/∂v, −∂H/∂u).

The vector field ξ cycles around the equilibrium because following ξ conserves the Hamiltonian's level sets (i.e. ⟨ξ, ∇H⟩ = 0). However, gradient descent on the Hamiltonian converges to the Nash equilibrium. The remainder of the paper explores the implications and limitations of this insight.
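The contrast between the two dynamics can be checked numerically. Below is a minimal numpy sketch of the scalar instance of Example 1 (take the matrix to be 1-by-1), so that ℓ_1 = xy = −ℓ_2, the simultaneous gradient is ξ = (y, −x), and the Hamiltonian is H = ½(x² + y²); step sizes and iteration counts are arbitrary illustrative choices.

```python
import numpy as np

# Scalar instance of Example 1: l1(x, y) = x*y = -l2(x, y),
# with Nash equilibrium at the origin.
def xi(w):
    x, y = w
    return np.array([y, -x])  # simultaneous gradient (dl1/dx, dl2/dy)

def grad_H(w):
    return w  # H(w) = 0.5*(x**2 + y**2), so grad H = (x, y)

w_sim = np.array([1.0, 1.0])  # simultaneous gradient descent
w_ham = np.array([1.0, 1.0])  # gradient descent on the Hamiltonian
for _ in range(1000):
    w_sim = w_sim - 0.01 * xi(w_sim)      # cycles around the Nash
    w_ham = w_ham - 0.01 * grad_H(w_ham)  # converges to the Nash

print(np.linalg.norm(w_sim))  # roughly the initial radius (slightly larger)
print(np.linalg.norm(w_ham))  # close to 0
```

With explicit Euler steps the ξ-dynamics in fact spiral slowly outward (each step moves tangentially to the level set), which is why the first norm slightly exceeds its initial value.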
Papadimitriou and Piliouras (2016) recently analyzed the dynamics of Matching Pennies (essentially, the above example) and showed that the cyclic behavior covers the entire parameter space. The Hamiltonian reformulation directly explains the cyclic behavior via a conservation law.
2.3 The Generalized Helmholtz Decomposition
The Jacobian of a game with dynamics ξ(w) is the (d × d)-matrix of second derivatives J(w) := ∇ξ(w), with entries J_{αβ}(w) = ∂ξ_α(w)/∂w_β, where ξ_α(w) is the αth entry of the d-dimensional vector ξ(w). Concretely, the Jacobian can be written as a block matrix with blocks J_{ij} = ∇_{w_j}(∇_{w_i}ℓ_i), where J_{ij} is the (d_i × d_j)-block of second-order derivatives. The Jacobian of a game is a square matrix, but not necessarily symmetric. Note: Greek indices α, β run over parameter dimensions whereas Roman indices i, j run over players.
Lemma 1 (generalized Helmholtz decomposition)
The Jacobian of any vector field ξ decomposes uniquely into two components, J(w) = S(w) + A(w), where S is symmetric and A is antisymmetric.
Any matrix M decomposes uniquely as M = S + A where S = ½(M + Mᵀ) is symmetric and A = ½(M − Mᵀ) is antisymmetric. The decomposition is preserved by orthogonal change-of-coordinates: given an orthogonal matrix P, we have PᵀMP = PᵀSP + PᵀAP, since the terms remain symmetric and antisymmetric. Applying the decomposition to the Jacobian yields the result.
The connection to the classical Helmholtz decomposition in calculus is sketched in appendix B. Two natural classes of games arise from the decomposition:
A game is a potential game if the Jacobian is symmetric, i.e. if A(w) ≡ 0. It is a Hamiltonian game if the Jacobian is antisymmetric, i.e. if S(w) ≡ 0.
Potential games are well-studied and easy to solve. Hamiltonian games are a new class of games that are also easy to solve. The general case is more difficult, see section 3.
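The decomposition is straightforward to compute numerically. The sketch below builds the Jacobian of ξ by central finite differences for a made-up two-player game (ℓ_1 = x² + xy, ℓ_2 = −xy; these losses are illustrative assumptions, not an example from the text) and splits it into symmetric and antisymmetric parts.

```python
import numpy as np

# Made-up two-player game: l1(x, y) = x**2 + x*y and l2(x, y) = -x*y.
def xi(w):
    x, y = w
    return np.array([2*x + y, -x])  # (dl1/dx, dl2/dy)

def jacobian(f, w, h=1e-6):
    # Central finite differences, column by column.
    d = len(w)
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        J[:, j] = (f(w + e) - f(w - e)) / (2 * h)
    return J

J = jacobian(xi, np.array([0.3, -0.7]))
S = 0.5 * (J + J.T)  # symmetric ("potential") component
A = 0.5 * (J - J.T)  # antisymmetric ("Hamiltonian") component
print(S)  # [[2, 0], [0, 0]] up to finite-difference error
print(A)  # [[0, 1], [-1, 0]] up to finite-difference error
```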
2.4 Stable Fixed Points (SFPs) vs Local Nash Equilibria (LNEs)
There are (at least) two possible solution concepts in general differentiable games: stable fixed points and local Nash equilibria.
A point w* is a local Nash equilibrium if, for all i, there exists a neighborhood U_i of w_i* such that ℓ_i(w_i, w_{−i}*) ≥ ℓ_i(w_i*, w_{−i}*) for w_i ∈ U_i.
We introduce local Nash equilibria because finding global Nash equilibria is unrealistic in games involving neural nets. Gradient-based methods can reliably find local – but not global – optima of nonconvex objective functions (Lee et al., 2016, 2017). Similarly, gradient-based methods cannot be expected to find global Nash equilibria in nonconvex games.
A fixed point w*, with ξ(w*) = 0, is stable if J(w*) ⪰ 0 and J(w*) is invertible, unstable if J(w*) ≺ 0, and a strict saddle if J(w*) has an eigenvalue with negative real part. Strict saddles are a subset of unstable fixed points.
The definition is adapted from Letcher et al. (2019), where conditions on the Jacobian hold at the fixed point; in contrast, Balduzzi et al. (2018a) imposed conditions on the Jacobian in a neighborhood of the fixed point. We motivate this concept as follows.
Positive semidefiniteness, J(w*) ⪰ 0, is a minimal condition for any reasonable notion of stable fixed point. In the case of a single loss ℓ, the Jacobian of ξ = ∇ℓ is the Hessian of ℓ, i.e. J = ∇²ℓ. Local convergence of gradient descent on single functions cannot be guaranteed if the Hessian has a negative eigenvalue, since such points are strict saddles. These are almost always avoided by Lee et al. (2017), so this semidefinite condition must hold.
Another viewpoint is that invertibility and positive semidefiniteness of the Hessian together imply positive definiteness, and the notion of stable fixed point specializes, in a one-player game, to local minima that are detected by the second partial derivative test. These minima are precisely those to which gradient-like methods provably converge. Stable fixed points are defined by analogy, though note that invertibility and semidefiniteness do not imply positive definiteness in n-player games since J may not be symmetric.
Finally, it is important to impose only positive semi-definiteness to keep the class as large as possible. Imposing strict positivity would imply that the origin is not an SFP in the cyclic game from Example 1, while clearly deserving of being so.
The conditions J ⪰ 0 and J ≺ 0 are equivalent to the corresponding conditions on the symmetric component, S ⪰ 0 and S ≺ 0 respectively, since uᵀJu = uᵀSu + uᵀAu = uᵀSu for all u, by antisymmetry of A. This equivalence will be used throughout.
Stable fixed points and local Nash equilibria are both appealing solution concepts, one from the viewpoint of optimisation by analogy with single objectives, and the other from game theory. Unfortunately, neither is a subset of the other:
Example 2 (stable but not local Nash)

Let ℓ_1(x, y) = x³ + xy and ℓ_2(x, y) = −xy. Then ξ(x, y) = (3x² + y, −x) and J(x, y) = [[6x, 1], [−1, 0]].

There is a stable fixed point with invertible Jacobian at (x, y) = (0, 0), since S(0, 0) = 0 ⪰ 0 and J(0, 0) is invertible. However, any neighbourhood of x = 0 contains some small x < 0 for which ℓ_1(x, 0) = x³ < 0 = ℓ_1(0, 0), so the origin is not a local Nash equilibrium.
Example 3 (local Nash but not stable)

Let ℓ_1(x, y) = ℓ_2(x, y) = xy. Then ξ(x, y) = (y, x) and J = [[0, 1], [1, 0]].

There is a fixed point at (x, y) = (0, 0) which is a local (in fact, global) Nash equilibrium since ℓ_1(x, 0) = 0 ≥ ℓ_1(0, 0) and ℓ_2(0, y) = 0 ≥ ℓ_2(0, 0) for all x and y. However J has eigenvalues 1 and −1, so (0, 0) is not a stable fixed point.
In Example 3, the Nash equilibrium is a saddle point of the common loss ℓ(x, y) = xy. Any algorithm that converges to Nash equilibria will thus converge to an undesirable saddle point. This rules out local Nash equilibrium as a solution concept for our purposes. Conversely, Example 2 emphasises the better notion of stability, whereby player 1 may have a local incentive to deviate from the origin immediately, but would later be punished for doing so since the game is locally dominated by the ±xy terms, whose only 'resolution' or 'stable minimum' is the origin (see Example 1).
2.5 Potential Games
Potential games were introduced by Monderer and Shapley (1996). It turns out that our definition of potential game above coincides with a special case of the potential games of Monderer and Shapley (1996), which they refer to as exact potential games.
Definition 5 (classical definition of potential game)
A game is a potential game if there is a single potential function φ : ℝ^d → ℝ and positive numbers α_i > 0 such that

φ(w_i′, w_{−i}) − φ(w_i″, w_{−i}) = α_i · (ℓ_i(w_i′, w_{−i}) − ℓ_i(w_i″, w_{−i}))

for all i and all w_i′, w_i″, w_{−i}, see Monderer and Shapley (1996).
A game is a potential game iff α_i · ∇_{w_i}ℓ_i = ∇_{w_i}φ for all i, which is equivalent to

α_i · ∇_{w_j}(∇_{w_i}ℓ_i) = α_j · (∇_{w_i}(∇_{w_j}ℓ_j))ᵀ for all i and j. (11)

See Monderer and Shapley (1996).
If α_i = 1 for all i, then equation (11) is equivalent to requiring that the Jacobian of the game is symmetric.
In an exact potential game, the Jacobian coincides with the Hessian of the potential function , which is necessarily symmetric.
Monderer and Shapley (1996) refer to the special case where α_i = 1 for all i as an exact potential game. We use the shorthand 'potential game' to refer to exact potential games in what follows.
Potential games have been extensively studied since they are one of the few classes of games for which Nash equilibria can be computed (Rosenthal, 1973). For our purposes, they are games where simultaneous gradient descent on the losses corresponds to gradient descent on a single function φ. It follows that simultaneous gradient descent converges to a fixed point that is a local minimum of φ or a saddle.
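In an exact potential game, simultaneous gradient descent and gradient descent on the potential produce literally the same iterates. A small sketch under assumed losses (both players share the hypothetical potential φ(x, y) = (x − 1)² + (y + 2)²):

```python
import numpy as np

# Both players minimize the shared potential phi(x, y) = (x - 1)**2 + (y + 2)**2,
# so xi = (d phi/dx, d phi/dy) = grad phi, and the Jacobian (the Hessian of phi)
# is symmetric.
def xi(w):
    x, y = w
    return np.array([2 * (x - 1), 2 * (y + 2)])

w = np.array([5.0, 5.0])
for _ in range(500):
    w = w - 0.1 * xi(w)  # simultaneous gradient descent = descent on phi

print(w)  # converges to the minimizer (1, -2) of phi
```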
2.6 Hamiltonian Games
Hamiltonian games, where the Jacobian is antisymmetric, are a new class of games. They are related to the harmonic games introduced in Candogan et al. (2011), see section B.4. An example from Balduzzi et al. (2018b) may help develop intuition for antisymmetric matrices:
Example 4 (antisymmetric structure of tournaments)
Suppose n competitors play one-on-one and that the probability of player i beating player j is p_{ij}. Then, assuming there are no draws, the probabilities satisfy p_{ij} + p_{ji} = 1 and p_{ii} = ½. The matrix of logits, A_{ij} = log(p_{ij}/p_{ji}), is then antisymmetric. Intuitively, antisymmetry reflects a hyperadversarial setting where all pairwise interactions between players are zero-sum.
Hamiltonian games are closely related to zero-sum games.
Example 5 (an unconstrained bimatrix game is zero-sum iff it is Hamiltonian)
Consider a bimatrix game with ℓ_1(x, y) = xᵀPy and ℓ_2(x, y) = xᵀQy, but where the parameters are not constrained to the probability simplex. Then ξ = (Py, Qᵀx) and the Jacobian components have block structure

S = ½ · [[0, P + Q], [(P + Q)ᵀ, 0]] and A = ½ · [[0, P − Q], [−(P − Q)ᵀ, 0]].

The game is Hamiltonian iff S = 0 iff P + Q = 0 iff ℓ_1 + ℓ_2 = 0.
However, in general there are Hamiltonian games that are not zero-sum and vice versa.
Example 6 (Hamiltonian game that is not zero-sum)
Fix constants a and b and suppose players 1 and 2 minimize losses ℓ_1(x, y) = x(y − b) and ℓ_2(x, y) = −(x − a)y with respect to x and y respectively. Then ξ = (y − b, −(x − a)) and J = [[0, 1], [−1, 0]] is antisymmetric, so the game is Hamiltonian; yet ℓ_1 + ℓ_2 = ay − bx is not identically zero.
Example 7 (zero-sum game that is not Hamiltonian)
Players 1 and 2 minimize ℓ_1(x, y) = x² − y² and ℓ_2(x, y) = y² − x², so the game is zero-sum. Here ξ = (2x, 2y) and J = 2I is symmetric, not antisymmetric.

The game actually has potential function φ(x, y) = x² + y².
Hamiltonian games are quite different from potential games. In a Hamiltonian game there is a Hamiltonian function H that specifies a conserved quantity. In potential games the dynamics equal ∇φ; in Hamiltonian games the dynamics ξ are orthogonal to ∇H. The orthogonality implies the conservation law that underlies the cyclic behavior in example 1.
Theorem 4 (conservation law for Hamiltonian games)
Let H(w) := ½‖ξ(w)‖². If the game is Hamiltonian then

(i) ∇H = Aᵀξ;

(ii) ξ preserves the level sets of H since ⟨ξ, ∇H⟩ = 0; and

(iii) if the Jacobian is invertible and H(w) → ∞ as ‖w‖ → ∞, then gradient descent on H converges to a stable fixed point.
Direct computation shows ∇H = Jᵀξ for any game. The first statement follows since Jᵀ = Aᵀ in Hamiltonian games.

For the second statement, the directional derivative is ⟨ξ, ∇H⟩ = ξᵀAᵀξ, where ξᵀAᵀξ = (ξᵀAᵀξ)ᵀ = −ξᵀAᵀξ since Aᵀ = −A and a scalar equals its own transpose. It follows that ⟨ξ, ∇H⟩ = 0.

For the third statement, gradient descent on H will converge to a point where ∇H = Aᵀξ = 0. If the Jacobian is invertible then clearly ξ = 0. The fixed point is stable since S ≡ 0 ⪰ 0 in a Hamiltonian game, recall remark 1.
In fact, H is a Hamiltonian function for the game dynamics, see appendix B for a concise explanation. We use the notation H(w) = ½‖ξ(w)‖² throughout the paper. However, H can only be interpreted as a Hamiltonian function for ξ when the game is Hamiltonian.
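Both parts of the conservation law can be sanity-checked numerically. The sketch below uses a Hamiltonian game of the assumed form of Example 6 with illustrative constants a = 1, b = 2: the directional derivative ⟨ξ, ∇H⟩ vanishes, and gradient descent on H finds the fixed point.

```python
import numpy as np

# Assumed Hamiltonian game: l1 = x*(y - b), l2 = -(x - a)*y, with a = 1, b = 2.
a, b = 1.0, 2.0

def xi(w):
    x, y = w
    return np.array([y - b, -(x - a)])  # simultaneous gradient

def H(w):
    return 0.5 * np.dot(xi(w), xi(w))

def grad_H(w, h=1e-6):
    # Central finite differences; in a Hamiltonian game grad H = A^T xi.
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (H(w + e) - H(w - e)) / (2 * h)
    return g

w0 = np.array([0.5, -0.3])
conservation = np.dot(xi(w0), grad_H(w0))  # ~0: xi preserves level sets of H

w = w0
for _ in range(2000):
    w = w - 0.05 * grad_H(w)  # gradient descent on H
# w approaches the fixed point (a, b), where xi vanishes
```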
There is a precise mapping from Hamiltonian games to symplectic geometry, see appendix B. Symplectic geometry is the modern formulation of classical mechanics (Arnold, 1989; Guillemin and Sternberg, 1990). Recall that periodic behaviors (e.g. orbits) often arise in classical mechanics. The orbits lie on the level sets of the Hamiltonian, which expresses the total energy of the system.
3 Algorithms

We have seen that fixed points of potential and Hamiltonian games can be found by descent on φ and H respectively. This section tackles finding stable fixed points in general games.
3.1 Finding Stable Fixed Points
There are two classes of games where we know how to find stable fixed points: potential games, where gradient descent on φ converges to a local minimum, and Hamiltonian games, where gradient descent on H, whose gradient ∇H is orthogonal to ξ, finds stable fixed points.
In the general case, the following desiderata provide a set of reasonable properties for an adjustment ξ_λ of the game dynamics. Recall that θ(u, v) is the angle between the vectors u and v.
To find stable fixed points, an adjustment ξ_λ to the game dynamics should satisfy, for some λ > 0:

(D1) compatible with game dynamics: ⟨ξ_λ, ξ⟩ > 0 (two nonzero vectors are compatible if they have positive inner product);

(D2) compatible with potential dynamics: if the game is a potential game then ⟨ξ_λ, ∇φ⟩ > 0;

(D3) compatible with Hamiltonian dynamics: if the game is Hamiltonian then ⟨ξ_λ, ∇H⟩ > 0;

(D4) attracted to stable equilibria: in neighborhoods where S ⪰ 0, require θ(ξ_λ, ∇H) ≤ θ(ξ, ∇H);

(D5) repelled by unstable equilibria: in neighborhoods where S ≺ 0, require θ(ξ_λ, ∇H) ≥ θ(ξ, ∇H).
Desideratum (D1) does not guarantee that players act in their own self-interest – this requires a stronger positivity condition on dot-products with subvectors of ξ, see Balduzzi (2017). Desiderata (D2) and (D3) imply that the adjustment behaves correctly in potential and Hamiltonian games respectively.
To understand desiderata (D4) and (D5), observe that gradient descent on H will find local minima of H that are fixed points of the dynamics. However, we specifically wish to converge to stable fixed points. Desiderata (D4) and (D5) require that the adjustment improves the rate of convergence to stable fixed points (by finding a steeper angle of descent), and avoids unstable fixed points.
More concretely, desideratum (D4) can be interpreted as follows. If ξ points at a stable equilibrium then we require that ξ_λ points more towards the equilibrium (i.e. has a smaller angle with ∇H). Conversely, desideratum (D5) requires that if ξ points away from an unstable equilibrium then the adjustment should point further away.
The unadjusted dynamics ξ satisfies all the desiderata except (D3), since ⟨ξ, ∇H⟩ = 0 in Hamiltonian games.
3.2 Consensus Optimization
Since gradient descent on the function H(w) = ½‖ξ(w)‖² finds stable fixed points in Hamiltonian games, it is natural to ask how it performs in general games. If the Jacobian is invertible, then ∇H = Jᵀξ = 0 iff ξ = 0. Thus, gradient descent on H converges to fixed points of ξ.
However, there is no guarantee that descent on H will find a stable fixed point. Mescheder et al. (2017) propose consensus optimization, a gradient adjustment of the form ξ + λ·∇H = ξ + λ·Jᵀξ.
Unfortunately, consensus optimization can converge to unstable fixed points even in simple cases where the ‘game’ is to minimize a single function:
Example 8 (consensus optimization can converge to a global maximum)
Consider a potential game with losses ℓ_1(x, y) = ℓ_2(x, y) = −(κ/2)·(x² + y²) with κ > 0. Then ξ(x, y) = −κ·(x, y) and J = −κ·I ≺ 0.

Note that ∇H = Jᵀξ = κ²·(x, y) and ξ + λ·∇H = κ(λκ − 1)·(x, y).

Descent on ξ + λ·∇H converges to the global maximum (x, y) = (0, 0) unless λ < 1/κ.
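The failure mode is easy to reproduce. A sketch with assumed values κ = 1, λ = 2 and a fixed step size (all illustrative choices):

```python
import numpy as np

# Both players minimize l(x, y) = -0.5*(x**2 + y**2), whose potential has a
# global maximum at the origin. With lam >= 1/kappa, consensus optimization
# converges to that maximum.
kappa, lam, alpha = 1.0, 2.0, 0.1

def xi(w):
    return -kappa * w  # simultaneous gradient

def grad_H(w):
    return kappa**2 * w  # gradient of H = 0.5*||xi||**2

w_con = np.array([0.5, -0.5])  # consensus optimization
w_gd = np.array([0.5, -0.5])   # plain simultaneous gradient descent
for _ in range(200):
    w_con = w_con - alpha * (xi(w_con) + lam * grad_H(w_con))
    w_gd = w_gd - alpha * xi(w_gd)

print(np.linalg.norm(w_con))  # shrinks: converges to the global maximum at 0
print(np.linalg.norm(w_gd))   # grows: plain descent correctly moves away
```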
Although consensus optimization works well in two-player zero-sum games, it cannot be considered a candidate algorithm for finding stable fixed points in general games since it fails in the basic case of potential games. Consensus optimization only satisfies desiderata (D3) and (D4).
3.3 Symplectic Gradient Adjustment
The problem with consensus optimization is that it can perform worse than gradient descent on potential games. Intuitively, it makes bad use of the symmetric component of the Jacobian. Motivated by the analysis in section 2, we propose symplectic gradient adjustment, which takes care to only use the antisymmetric component of the Jacobian when adjusting the dynamics.
The symplectic gradient adjustment (SGA) is

ξ_λ := ξ + λ·Aᵀξ.

It satisfies desiderata (D1)–(D3) for λ > 0, with ⟨ξ_λ, ξ⟩ = ‖ξ‖² and, in a Hamiltonian game, ⟨ξ_λ, ∇H⟩ = λ‖Aᵀξ‖².
First claim: ⟨ξ_λ, ξ⟩ = ‖ξ‖² + λ·ξᵀAᵀξ = ‖ξ‖² > 0 by anti-symmetry of A. Second claim: A ≡ 0 in a potential game, so ξ_λ = ξ = ∇φ and ⟨ξ_λ, ∇φ⟩ = ‖∇φ‖² > 0. Third claim: ∇H = Aᵀξ in a Hamiltonian game, so ⟨ξ_λ, ∇H⟩ = ⟨ξ + λAᵀξ, Aᵀξ⟩ = λ‖Aᵀξ‖² > 0 since Aᵀξ = ∇H ≠ 0 by assumption. Note that desiderata (D2) and (D3) hold even when λ < 0. This will prove useful, since example 9 shows that it may be necessary to pick λ negative near unstable fixed points. Section 3.5 shows how to also satisfy desiderata (D4) and (D5).
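A minimal numpy sketch of the adjustment on the bilinear game of Example 1 (ℓ_1 = xy = −ℓ_2); the Jacobian is written out explicitly since it is constant for this game, and λ = 1, α = 0.05 are arbitrary illustrative choices:

```python
import numpy as np

# SGA on the zero-sum game l1(x, y) = x*y = -l2(x, y).
def xi(w):
    x, y = w
    return np.array([y, -x])

J = np.array([[0.0, 1.0], [-1.0, 0.0]])  # constant Jacobian of xi
A = 0.5 * (J - J.T)                      # antisymmetric component (= J here)

def sga_step(w, alpha=0.05, lam=1.0):
    # xi_lambda = xi + lam * A^T xi
    return w - alpha * (xi(w) + lam * A.T @ xi(w))

w = np.array([1.0, 1.0])
for _ in range(500):
    w = sga_step(w)
print(np.linalg.norm(w))  # decays toward the Nash equilibrium at the origin
```

Plain simultaneous gradient descent cycles on this game (see Example 1); the extra λ·Aᵀξ term adds a component pointing down the level sets of H, which is what produces the contraction.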
We begin by analysing convergence of SGA near stable equilibria. The following lemma highlights that the interaction between the symmetric and antisymmetric components is important for convergence. Recall that two matrices A and B commute iff their commutator [A, B] := AB − BA vanishes. Intuitively, two matrices commute if they have the same preferred coordinate system.
If S is symmetric positive semidefinite and S commutes with A, then ξ_λ points towards stable fixed points for non-negative λ: ⟨ξ_λ, ∇H⟩ ≥ 0.
First observe that ξᵀSAᵀξ = (ξᵀSAᵀξ)ᵀ = ξᵀASξ, where the first equality holds since the expression is a scalar, and the second holds since Sᵀ = S. It follows that ξᵀSAᵀξ = 0 if SA = AS, since then ξᵀASξ = ξᵀSAξ = −ξᵀSAᵀξ by the anti-symmetry of A. Finally rewrite the inequality as

⟨ξ_λ, ∇H⟩ = ξᵀ(I + λA)(S − A)ξ = ξᵀSξ + λ‖Aξ‖² ≥ 0,

since ξᵀSξ ≥ 0 by positivity of S, and λ‖Aξ‖² ≥ 0 for λ ≥ 0.
The lemma suggests that, in general, the failure of S and A to commute should be important for understanding the dynamics of ξ_λ. We therefore introduce the additive condition number ϵ to upper-bound the worst-case noncommutativity of S, which allows us to quantify the relationship between λ and ⟨ξ_λ, ∇H⟩. If ϵ = 0, then S = σ·I commutes with all matrices. The larger the additive condition number ϵ, the larger the potential failure of S to commute with other matrices.
Let S be a symmetric matrix with eigenvalues σ_max ≥ ⋯ ≥ σ_min. The additive condition number of S is ϵ := σ_max − σ_min. (The usual condition number of a positive definite matrix is σ_max/σ_min.) If S is positive semidefinite with additive condition number ϵ then λ ∈ (0, 4/ϵ) implies

⟨ξ_λ, ∇H⟩ ≥ 0.

If S is negative semidefinite, then λ ∈ (0, 4/ϵ) implies

⟨ξ_{−λ}, ∇H⟩ ≤ 0.

The inequalities are strict if J is invertible.
We prove the case S ⪰ 0; the case S ⪯ 0 is similar. Rewrite the inequality as

ξᵀSξ + λ(ξᵀASξ + ‖Aξ‖²) ≥ 0.

Let T := S − σ_min·I, where I is the identity matrix. Then ξᵀATξ = ξᵀASξ and ξᵀTξ ≤ ξᵀSξ, since σ_min ≥ 0 by construction and ξᵀA(σ_min I)ξ = σ_min·ξᵀAξ = 0 by the anti-symmetry of A. It therefore suffices to show that the inequality holds when σ_min = 0, in which case the eigenvalues of T lie in [0, ϵ].

Since T is positive semidefinite, there exists an upper-triangular square-root matrix Q such that T = QᵀQ and so ξᵀTξ = ‖Qξ‖². Further, ‖Q‖₂ = √ϵ,

since ‖Q‖₂² = ‖QᵀQ‖₂ = ‖T‖₂ = ϵ. Putting the observations together obtains the inequality to show:

ξᵀTξ + λ(ξᵀATξ + ‖Aξ‖²) ≥ 0.

Set u = Qξ and v = Aξ, so that ξᵀATξ = −vᵀQᵀQξ = −⟨Qv, u⟩. We can continue the above computation:

ξᵀTξ + λξᵀATξ + λ‖Aξ‖² = ‖u‖² − λ⟨Qv, u⟩ + λ‖v‖² ≥ ‖u‖² − λ√ϵ·‖u‖·‖v‖ + λ‖v‖² = (‖u‖ − (λ√ϵ/2)‖v‖)² + λ(1 − λϵ/4)‖v‖².

Finally, 1 − λϵ/4 > 0 for any λ in the range (0, 4/ϵ), which is to say, for any λϵ ∈ (0, 4). The kernel of T and the kernel of Q coincide. If ξ is in the kernel of Q, resp. A, it cannot be in the kernel of A, resp. Q, and the term λ(1 − λϵ/4)‖Aξ‖², resp. ‖Qξ‖², is positive. Otherwise, the term λ(1 − λϵ/4)‖Aξ‖² is positive.
The theorem above guarantees that SGA always points in the direction of stable fixed points for $\lambda$ sufficiently small. This does not technically guarantee convergence; we use Ostrowski's theorem to strengthen this formally. Applying Ostrowski's theorem requires taking a more abstract perspective by encoding the adjusted dynamics into a differentiable map of the form $F(w) = w - \alpha\, \xi_\lambda(w)$.
Theorem 8 (Ostrowski)
Let $F : \Omega \to \mathbb{R}^d$ be a continuously differentiable map on an open subset $\Omega \subseteq \mathbb{R}^d$, and assume $\bar{w} \in \Omega$ is a fixed point of $F$. If all eigenvalues of $\nabla F(\bar{w})$ are strictly in the unit circle of $\mathbb{C}$, then there is an open neighbourhood $U$ of $\bar{w}$ such that for all $w_0 \in U$, the sequence $F^k(w_0)$ of iterates of $F$ converges to $\bar{w}$. Moreover, the rate of convergence is at least linear in $k$.
This is a standard result on fixed-point iterations, adapted from Ortega and Rheinboldt (2000, 10.1.3).
A matrix $M$ is called positive stable if all its eigenvalues have positive real part. Assume $\bar{w}$ is a fixed point of a differentiable game such that $(I + \lambda A^\top) J(\bar{w})$ is positive stable for $\lambda$ in some set $\Lambda$. Then SGA converges locally to $\bar{w}$ for $\lambda \in \Lambda$ and $\alpha > 0$ sufficiently small.
Let $X = (I + \lambda A^\top) J$. By definition of fixed points, $\xi(\bar{w}) = 0$ and so
$$\nabla \xi_\lambda(\bar{w}) = \nabla \big[ (I + \lambda A^\top)\xi \big](\bar{w}) = (I + \lambda A^\top(\bar{w}))\, J(\bar{w}) = X(\bar{w}),$$
since the terms involving $\nabla A^\top$ are multiplied by $\xi(\bar{w}) = 0$ and vanish.
$X$ is positive stable by assumption, namely has eigenvalues $a_k + i b_k$ with $a_k > 0$. Writing $F(w) = w - \alpha\, \xi_\lambda(w)$ for the iterative procedure given by SGA, it follows that
$$\nabla F(\bar{w}) = I - \alpha \nabla \xi_\lambda(\bar{w}) = I - \alpha X(\bar{w})$$
has eigenvalues $1 - \alpha a_k - i \alpha b_k$, which are in the unit circle for small $\alpha$. More precisely,
$$|1 - \alpha a_k - i \alpha b_k|^2 < 1 \iff \alpha^2 (a_k^2 + b_k^2) - 2 \alpha a_k < 0 \iff 0 < \alpha < \frac{2 a_k}{a_k^2 + b_k^2},$$
which is always possible for $a_k > 0$. Hence $\nabla F(\bar{w})$ has eigenvalues in the unit circle for $0 < \alpha < \min_k \frac{2 a_k}{a_k^2 + b_k^2}$, and we are done by Ostrowski's Theorem since $\bar{w}$ is a fixed point of $F$.
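The step-size threshold $\frac{2 a_k}{a_k^2 + b_k^2}$ in the proof is easy to visualise numerically. A small sketch with a hand-picked positive stable matrix (illustrative, not from the paper):

```python
import numpy as np

# A positive stable matrix X: eigenvalues 1 +/- 3i have positive real part.
X = np.array([[1.0, 3.0], [-3.0, 1.0]])
ev = np.linalg.eigvals(X)
assert (ev.real > 0).all()

# Threshold 2a/(a^2 + b^2) from the proof, minimised over eigenvalues.
alpha_max = min(2 * e.real / abs(e) ** 2 for e in ev)

# Below the threshold, all eigenvalues of I - alpha*X lie inside the unit circle.
for alpha in [0.5 * alpha_max, 0.9 * alpha_max]:
    F_jac = np.eye(2) - alpha * X
    assert (abs(np.linalg.eigvals(F_jac)) < 1).all()

# Exactly at the threshold, the spectral radius reaches 1.
F_jac = np.eye(2) - alpha_max * X
assert np.isclose(abs(np.linalg.eigvals(F_jac)).max(), 1.0)
print("alpha_max =", alpha_max)
```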
Let $\bar{w}$ be a stable fixed point and $\kappa$ the additive condition number of $S(\bar{w})$. Then SGA converges locally to $\bar{w}$ for all $\lambda \in \left(0, \frac{4}{\kappa}\right)$ and $\alpha > 0$ sufficiently small.
By Theorem 5 and the assumption that $\bar{w}$ is a stable fixed point with invertible Jacobian, we know that
$$\langle \xi_\lambda, \nabla\mathcal{H} \rangle = \xi^\top (I + \lambda A) J^\top \xi > 0$$
for $\lambda \in \left(0, \frac{4}{\kappa}\right)$ and $\xi \neq 0$. The proof does not rely on any particular property of $\xi$, and can trivially be extended to the claim that
$$v^\top (I + \lambda A) J^\top v > 0$$
for all non-zero vectors $v$. In particular this can be rewritten as
$$v^\top \left[ \frac{(I + \lambda A) J^\top + J (I + \lambda A^\top)}{2} \right] v > 0,$$
which implies positive definiteness of $(I + \lambda A) J^\top$, in the sense that $v^\top M v > 0$ for all $v \neq 0$. A positive definite matrix is positive stable, and any matrices $MN$ and $NM$ have identical spectrum (as does a matrix and its transpose). This implies also that $(I + \lambda A^\top) J$ is positive stable, and we are done by the corollary above.
We conclude that SGA converges locally to an SFP if $\lambda$ is small enough, where `small enough' depends on the additive condition number.
This section explains desiderata D2-D3 and shows how to pick $\mathrm{sign}(\lambda)$ to speed up convergence towards stable and away from unstable fixed points. In the example below, almost any choice of positive $\lambda$ results in convergence to an unstable equilibrium. The problem arises from the combination of a weak repellor with a strong rotational force.
Example 9 (failure case for $\xi$)
Suppose $\epsilon > 0$ is small and
$$\ell_1(x, y) = -\frac{\epsilon}{2} x^2 + xy \qquad \text{and} \qquad \ell_2(x, y) = -\frac{\epsilon}{2} y^2 - xy,$$
with an unstable equilibrium at $(0, 0)$. The dynamics are
$$\xi = \begin{pmatrix} -\epsilon x + y \\ -\epsilon y - x \end{pmatrix} \qquad \text{with} \qquad A^\top \xi = \begin{pmatrix} x + \epsilon y \\ -\epsilon x + y \end{pmatrix}.$$
Finally observe that
$$\left\langle \xi_\lambda, \begin{pmatrix} x \\ y \end{pmatrix} \right\rangle = (\lambda - \epsilon)(x^2 + y^2),$$
so the update $w \leftarrow w - \alpha \xi_\lambda$ decreases $\|w\|$, which converges to the unstable equilibrium if $\lambda > \epsilon$.
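The example can be simulated directly. A minimal sketch (the step size, horizon, and initial point are our own illustrative choices, with $\epsilon = 0.1$):

```python
import numpy as np

eps = 0.1                                # weak repellor strength
A = np.array([[0.0, 1.0], [-1.0, 0.0]])  # strong rotational (antisymmetric) part

def xi(w):
    x, y = w
    return np.array([-eps * x + y, -eps * y - x])

def run(lam, steps=2000, alpha=0.01):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        g = xi(w)
        w = w - alpha * (g + lam * (A.T @ g))  # adjusted dynamics xi_lambda
    return np.linalg.norm(w)

assert run(lam=0.0) > np.sqrt(2)  # vanilla dynamics escape the weak repellor
assert run(lam=1.0) < 1e-3        # lam > eps: drawn into the unstable equilibrium
print("escaped to", run(0.0), "vs collapsed to", run(1.0))
```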
We now show how to pick the sign of $\lambda$ to avoid unstable equilibria. First, observe that $\langle \xi, \nabla\mathcal{H} \rangle = \xi^\top J^\top \xi = \xi^\top S \xi$. It follows that for $\xi \neq 0$:
$$\text{if } S \succ 0 \text{ then } \langle \xi, \nabla\mathcal{H} \rangle > 0 \qquad \text{and} \qquad \text{if } S \prec 0 \text{ then } \langle \xi, \nabla\mathcal{H} \rangle < 0.$$
A criterion to probe the positive/negative definiteness of $S$ is thus to check the sign of $\langle \xi, \nabla\mathcal{H} \rangle$. The dot product can take any value if $S$ is neither positive nor negative (semi-)definite. The behavior near saddle points will be explored in Section 3.7.
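The criterion can be exercised numerically: with a definite $S$ of either sign and a random antisymmetric $A$, the sign of $\langle \xi, \nabla\mathcal{H} \rangle = \xi^\top S \xi$ tracks the definiteness of $S$. A sketch with illustrative random matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
M = rng.normal(size=(n, n))
A = (M - M.T) / 2                            # antisymmetric component

for sign in (+1, -1):
    B = rng.normal(size=(n, n))
    S = sign * (B @ B.T + 1e-3 * np.eye(n))  # definite, with the chosen sign
    J = S + A
    xi = rng.normal(size=n)
    grad_H = J.T @ xi                        # grad H = J^T xi
    assert np.sign(xi @ grad_H) == sign      # <xi, grad H> = xi^T S xi
print("sign criterion verified")
```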
Recall that desideratum D2 requires that, if $\xi$ points at a stable equilibrium, then $\xi_\lambda$ points more towards the equilibrium (i.e. has smaller angle). Conversely, desideratum D3 requires that, if $\xi$ points away from an unstable equilibrium, then the adjustment should point further away. More formally,
Let $u$ and $v$ be two vectors and set $\xi_\lambda := u + \lambda v$. The infinitesimal alignment of $\xi_\lambda$ with a third vector $w$ is
$$\mathrm{align}(\xi_\lambda, w) := \frac{d}{d\lambda} \Big\{ \cos^2 \theta_\lambda \Big\}_{\lambda = 0} \qquad \text{where} \qquad \cos \theta_\lambda = \frac{\langle \xi_\lambda, w \rangle}{\|\xi_\lambda\| \cdot \|w\|}.$$
The following lemma allows us to rewrite the infinitesimal alignment in terms of known (computable) quantities, from which we can deduce the correct choice of $\mathrm{sign}(\lambda)$.
When $\xi_\lambda$ is the symplectic gradient adjustment,
$$\mathrm{align}(\xi_\lambda, \nabla\mathcal{H}) = \frac{2\, \langle \xi, \nabla\mathcal{H} \rangle \cdot \langle A^\top \xi, \nabla\mathcal{H} \rangle}{\|\xi\|^2 \cdot \|\nabla\mathcal{H}\|^2},$$
where the denominator has no linear term in $\lambda$ because $\langle \xi, A^\top \xi \rangle = 0$. It follows that the sign of the infinitesimal alignment is
$$\mathrm{sign}\big(\mathrm{align}(\xi_\lambda, \nabla\mathcal{H})\big) = \mathrm{sign}\big(\langle \xi, \nabla\mathcal{H} \rangle \cdot \langle A^\top \xi, \nabla\mathcal{H} \rangle\big).$$
Intuitively, computing the sign of $\langle \xi, \nabla\mathcal{H} \rangle$ provides a check for stable and unstable fixed points. Computing the sign of $\langle A^\top \xi, \nabla\mathcal{H} \rangle$ checks whether the adjustment term points towards or away from the nearby fixed point. Putting the two checks together yields a prescription for the sign of $\lambda$, as follows.
Desiderata D2-D3 are satisfied for $\lambda$ such that $\lambda \cdot \langle \xi, \nabla\mathcal{H} \rangle \cdot \langle A^\top \xi, \nabla\mathcal{H} \rangle \geq 0$.
If we are in a neighborhood of a stable fixed point then $\langle \xi, \nabla\mathcal{H} \rangle \geq 0$. It follows by Lemma 11 that choosing $\mathrm{sign}(\lambda) = \mathrm{sign}(\langle A^\top \xi, \nabla\mathcal{H} \rangle)$ gives $\mathrm{align}(\xi_\lambda, \nabla\mathcal{H}) \geq 0$, and so leads to the angle between $\xi_\lambda$ and $\nabla\mathcal{H}$ being smaller than the angle between $\xi$ and $\nabla\mathcal{H}$, satisfying desideratum D2. The proof for the unstable case is similar.
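Lemma 11's closed form can be compared against a finite-difference derivative of $\cos^2 \theta_\lambda$ at $\lambda = 0$. The sketch below fixes $\xi$ and $\nabla\mathcal{H}$ at a point (illustrative random values; in a real game both vary with $w$, but the alignment identity is pointwise):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.normal(size=(n, n))
A = (M - M.T) / 2              # antisymmetric component
B = rng.normal(size=(n, n))
S = (B + B.T) / 2              # symmetric component
J = S + A

xi = rng.normal(size=n)
grad_H = J.T @ xi

def cos2(lam):
    v = xi + lam * (A.T @ xi)  # xi_lambda
    c = (v @ grad_H) / (np.linalg.norm(v) * np.linalg.norm(grad_H))
    return c ** 2

# Central finite difference of cos^2(theta_lambda) at lambda = 0
h = 1e-6
numeric = (cos2(h) - cos2(-h)) / (2 * h)
# Closed form from the lemma
formula = (2 * (xi @ grad_H) * ((A.T @ xi) @ grad_H)
           / (np.linalg.norm(xi) ** 2 * np.linalg.norm(grad_H) ** 2))
assert np.isclose(numeric, formula, atol=1e-4)
print("alignment formula verified:", formula)
```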
Alignment and convergence rates.
Gradient descent is also known as the method of steepest descent. In general games, however, $\xi$ does not follow the steepest path to fixed points due to the `rotational force', which forces lower learning rates and slows down convergence.
The following lemma provides some intuition about alignment. The idea is that the smaller the cosine $\langle u, v \rangle$ between the `correct direction' $u$ and the `update direction' $v$, the smaller the learning rate $\epsilon$ needs to be for the update to stay in a unit ball; see Figure 3.
Lemma 13 (alignment lemma)
If $u$ and $v$ are unit vectors with $\langle u, v \rangle > 0$ then $\|u - \epsilon v\| \leq 1$ for $\epsilon \in [0, 2 \langle u, v \rangle]$. In other words, ensuring that $u - \epsilon v$ is closer to the origin than $u$ requires smaller learning rates $\epsilon$ as the angle between $u$ and $v$ gets larger.
Check $\|u - \epsilon v\|^2 = 1 - 2\epsilon \langle u, v \rangle + \epsilon^2 \leq 1$ iff $\epsilon (\epsilon - 2 \langle u, v \rangle) \leq 0$ iff $0 \leq \epsilon \leq 2 \langle u, v \rangle$. The result follows.
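The alignment lemma is a one-line computation, and easy to confirm numerically. A small sketch (the angle $\theta = 1.2$ is an arbitrary illustrative choice):

```python
import numpy as np

theta = 1.2
u = np.array([1.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta)])  # unit vectors at angle theta
bound = 2 * (u @ v)                           # 2*cos(theta) > 0

# Inside the admissible range, the update stays in the unit ball ...
for eps in np.linspace(0.0, bound, 50):
    assert np.linalg.norm(u - eps * v) <= 1 + 1e-12
# ... and just past it, the update leaves the unit ball.
assert np.linalg.norm(u - (bound + 0.1) * v) > 1
print("alignment lemma verified, bound =", bound)
```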
The next lemma is a standard technical result from the convex optimization literature.
Let $f$ be a convex Lipschitz smooth function satisfying $\|\nabla f(x) - \nabla f(y)\| \leq L \cdot \|x - y\|$ for all $x, y$. Then
$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2$$
for all $x, y$.
See Nesterov (2004).
Finally, we show that increasing alignment helps speed convergence:
Suppose $f$ is convex and Lipschitz smooth with $\|\nabla f(x) - \nabla f(y)\| \leq L \cdot \|x - y\|$. Let $w_{t+1} = w_t - \epsilon \cdot v$ where $\|v\| = \|\nabla f(w_t)\|$. Then the optimal step size is $\epsilon^* = \frac{\cos \theta}{L}$ where $\theta$ is the angle between $v$ and $\nabla f(w_t)$, with
$$f(w_{t+1}) \leq f(w_t) - \frac{\cos^2 \theta}{2L} \|\nabla f(w_t)\|^2.$$
By Lemma 14,
$$f(w_{t+1}) \leq f(w_t) - \epsilon \langle \nabla f(w_t), v \rangle + \frac{L \epsilon^2}{2} \|v\|^2.$$