On Gradient-Based Learning in Continuous Games

This work was published in the SIAM Journal on Mathematics of Data Science on February 18, 2020 (https://doi.org/10.1137/18M1231298). It was funded by National Science Foundation Award CNS:1656873 and Defense Advanced Research Projects Agency Award FA8750-18-C-0101.


Abstract

We introduce a general framework for competitive gradient-based learning that encompasses a wide breadth of multi-agent learning algorithms, and analyze the limiting behavior of competitive gradient-based learning algorithms using dynamical systems theory. For both general-sum and potential games, we characterize a non-negligible subset of the local Nash equilibria that will be avoided if each agent employs a gradient-based learning algorithm. We also shed light on the issue of convergence to non-Nash strategies in general-sum and zero-sum games; such strategies may have no relevance to the underlying game and arise solely due to the choice of algorithm. The existence and frequency of such strategies may explain some of the difficulties encountered when using gradient descent in zero-sum games as, e.g., in the training of generative adversarial networks. To reinforce the theoretical contributions, we provide empirical results showing that a non-negligible fraction of randomly sampled linear quadratic dynamic games (a benchmark for multi-agent reinforcement learning) admit global Nash equilibria that are almost surely avoided by policy gradient.

\ShortHeadings{On Gradient-Based Learning in Continuous Games}{Mazumdar, Ratliff, and Sastry}
\firstpageno{1}

Keywords: continuous games, gradient-based algorithms, multi-agent learning

1 Introduction

With machine learning algorithms increasingly being deployed in real-world settings, it is crucial that we understand how these algorithms interact and the dynamics that can arise from their interactions. In recent years, there has been a resurgence in research efforts on multi-agent learning and learning in games. The recent interest in adversarial learning techniques also shows how game-theoretic tools can be used to robustify and improve the performance of machine learning algorithms. Despite this activity, however, machine learning algorithms are still treated as black-box approaches and naïvely deployed in settings where other algorithms are actively changing the environment. In general, outside of highly structured settings, there are no guarantees on the performance or limiting behaviors of learning algorithms in such settings.

Indeed, previous work on understanding the collective behavior of coupled learning algorithms, either in competitive or cooperative settings, has mainly looked at games where the global structure is well understood, like bilinear games Singh et al. (2000); Hommes and Ochea (2012); Mertikopoulos et al. (2018); Leslie and Collins (2005), convex games Mertikopoulos and Zhou (2019); Rosen (1965), or potential games Monderer and Shapley (1996), among many others. Such games are more conducive to the statement of global convergence guarantees since the assumed global structure can be exploited.

In games with fewer assumptions on the players' costs, however, there is still a lack of understanding of the dynamics and limiting behaviors of learning algorithms. Such settings are becoming increasingly prevalent as deep learning is adopted in game-theoretic settings Goodfellow et al. (2014); Foerster et al. (2018); Abdallah and Lesser (2008); Zhang and Lesser (2010).

Gradient-based learning algorithms are extremely popular in a variety of these multi-agent settings due to their versatility, ease of implementation, and dependence on local information. There are numerous recent papers in multi-agent reinforcement learning that employ gradient-based methods (see, e.g., Abdallah and Lesser (2008); Foerster et al. (2018); Zhang and Lesser (2010)), yet even within this well-studied class of learning algorithms, a thorough understanding of their convergence and limiting behaviors in general continuous games is still lacking.

Generally speaking, in both the game theory and the machine learning communities, two of the central questions when analyzing the dynamics of learning in games are the following:

Q1.

Are all attractors of the learning algorithms employed by agents equilibria relevant to the underlying game?

Q2.

Are all equilibria relevant to the game also attractors of the learning algorithms agents employ?

In this paper, we provide some answers to the above questions for the class of gradient-based learning algorithms by analyzing their limiting behavior in general continuous games. In particular, we leverage the continuous time limit of the more naturally discrete multi-agent learning algorithms. This allows us to draw on the extensive theory of dynamical systems and stochastic approximation to make statements about the limiting behaviors of these algorithms in both deterministic and stochastic settings. The latter is particularly relevant since it is common for stochastic gradient methods to be used in multi-agent machine learning contexts.

Analyzing gradient-based algorithms through the lens of dynamical systems theory has recently yielded new insights into their behavior in the classical optimization setting Wilson et al. (2016); Scieur et al. (); Lee et al. (2016). We show that a similar type of analysis can also help us understand the limiting behaviors of gradient-based algorithms in games. We remark, however, that there is a fundamental difference between the dynamics that are analyzed in much of the single-agent, gradient-based learning and optimization literature and the ones we analyze in the competitive multi-agent case: the combined dynamics of gradient-based learning schemes in games do not necessarily correspond to a gradient flow. This may seem a subtle point, but it turns out to be extremely important.

Gradient flows admit desirable convergence guarantees—e.g., almost sure convergence to local minimizers—due to the fact that they preclude flows with the worst geometries Pemantle (2007). In particular, they do not exhibit non-equilibrium limiting behavior such as periodic orbits. Gradient-based learning in games, on the other hand, does not preclude such behavior. Moreover, as we show, asymmetry in the dynamics of gradient-play in games can lead to surprising behaviors: limiting behaviors with no relevance to the game can be attracting under the flow of the game dynamics, while relevant ones, such as a subset of the Nash equilibria, can be almost surely avoided.

1.1 Related Work

The study of continuous games is quite extensive (see e.g. Basar and Olsder (1998); Osborne (1994)), though in large part the focus has been on games admitting a fair amount of structure. The behavior of learning algorithms in games is also well-studied (see e.g. Fudenberg and Levine (1998)). In this section, we comment on the most relevant prior work and defer a more comprehensive discussion of our results in the context of prior work to Section 6.

As we noted, previous work on learning in games in both the game theory literature, and more recently from the machine learning community, has largely focused on addressing (Q1) whether all attractors of the learning dynamics are game-relevant equilibria, and (Q2) whether all game-relevant equilibria are also attractors of the learning dynamics. The primary type of game-relevant equilibrium considered in the investigation of these two questions is a Nash equilibrium.

The majority of the existing work has focused on Q1. In fact, a large body of prior work focuses on games with structures that preclude the existence of non-Nash equilibria. Consequently, answering Q1 reduces to analyzing the convergence of various learning algorithms (including gradient-play) to the unique Nash equilibrium or the set of Nash equilibria. This is often shown by exploiting the game structure. Examples of classes of games where gradient-play has been well-studied are potential games Monderer and Shapley (1996), concave or monotone games Rosen (1965); Bravo et al. (2018); Mertikopoulos and Zhou (2019), and gradient-play over the space of stochastic policies in two-player finite-action bilinear games Singh et al. (2000). In the latter setting, other gradient-like algorithms such as multiplicative weights have also been studied fairly extensively Hommes and Ochea (2012), and have been shown to converge to cycling behaviors.

Some works have also attempted to address Q1 in the context of gradient-play in two-player zero-sum games. Concurrently with this paper, it was shown that, for a general class of “sufficiently smooth” two-player, zero-sum games, there exist stationary points of gradient-play that are non-Nash Daskalakis and Panageas (2018). In such games, it has also been shown that gradient-play can converge to cycles (see, e.g., Mertikopoulos et al. (2018); Wesson and Rand (2016); Hommes and Ochea (2012)).

There is also related work in more general games on the analysis of when Nash equilibria are attracting for gradient-based approaches (i.e. Q2). Sufficient conditions for this to occur are the conditions for stable differential Nash equilibria introduced in Ratliff et al. (2013, 2014, 2016) and the condition for variational stability later analyzed in Mertikopoulos and Zhou (2019). We remark that these conditions are equivalent for the classes of games we consider. Neither of these works give conditions under which Nash equilibria are avoided by gradient-play or comment on other attracting behaviors.

Expanding on this rich body of literature (only the most relevant of which is covered in our short review), in this paper we provide answers to Q1 without imposing structure on the game outside regularity conditions on the cost functions by exploiting the observation that gradient-based learning dynamics are not gradient flows. We also provide answers to Q2 by demonstrating that a non-trivial set of games admit Nash equilibria that are almost surely avoided by gradient-play. We give explicit conditions for when this occurs. Using similar analysis tools, we also provide new insights into the behavior of gradient-based learning in structured classes of games such as zero-sum and potential games.

1.2 Contributions and Organization

We present a general framework for modeling competitive gradient-based learning that applies to a broad swath of learning algorithms. In Section 3, we draw connections between the limiting behavior of this class of algorithms and game-theoretic and dynamical systems notions of equilibria. In particular, we construct general-sum and zero-sum games that admit non-Nash attracting equilibria of the gradient dynamics. Such points are attracting under the learning dynamics, yet at least one player—and potentially all of them—has a direction in which they could unilaterally deviate to decrease their cost. Thus, these non-Nash equilibria are of questionable game-theoretic relevance and can be seen as artifacts of the players' algorithms.

In Section 4, we show that policy gradient multi-agent reinforcement learning (MARL), generative adversarial networks (GANs), and gradient-based multi-agent multi-armed bandits, among several other common multi-agent learning settings, conform to this framework. The framework is amenable to analysis with tools from dynamical systems theory.

Also in Section 4, we show that a subset of the local Nash equilibria in general-sum games and potential games is avoided almost surely when each player employs a gradient-based algorithm. We show that this holds in two broad settings: the full information setting when each player has oracle access to their gradient but randomly initializes their first action, and a partial information setting where each player has access to an unbiased estimate of their gradient.

Thus, we provide a negative answer to both Q1 and Q2 for n-player general-sum games, and highlight the nuances present in zero-sum and potential games. We also show that the dynamics formed from the individual gradients of the agents' costs are not gradient flows. This in turn implies that competitive gradient-based learning in general-sum games may converge to periodic orbits and other non-trivial limiting behaviors that arise in, e.g., chaotic systems.

To support the theoretical results, we present empirical results in Section 5 that show that policy gradient algorithms avoid global Nash equilibria in a large number of linear quadratic (LQ) dynamic games, a benchmark for MARL.

We conclude in Section 6 with a discussion of the implications of our results and some links with prior work as well as some comments on future directions.

2 Preliminaries

Consider $n$ agents indexed by $i \in \mathcal{I} = \{1, \ldots, n\}$. Each agent $i$ has their own decision variable $x_i \in X_i$, where $X_i$ is their finite-dimensional strategy space of dimension $d_i$. Define $X = X_1 \times \cdots \times X_n$ to be the finite-dimensional joint strategy space with dimension $d = \sum_i d_i$. Each agent $i$ is endowed with a sufficiently smooth cost function $f_i: X \to \mathbb{R}$, and we use the notation $f_i(x_i, x_{-i})$ to make the dependence on the action $x_i$ of agent $i$, and the actions $x_{-i}$ of all agents excluding agent $i$, explicit. The agents seek to minimize their own cost, but only have control over their own decision variable $x_i$. In this setup, agents' costs are not necessarily aligned with one another, meaning they are competing.

Given the game $(f_1, \ldots, f_n)$, agents are assumed to update their strategies simultaneously according to a gradient-based learning algorithm of the form

\[ x_{i,k+1} = x_{i,k} - \gamma_{i,k}\, g_i(x_k), \qquad (1) \]

where $\gamma_{i,k}$ is agent $i$'s step-size at iteration $k$ and $g_i$ is a gradient-based update constructed from agent $i$'s cost.

We analyze the following two settings:

  1. Agents have oracle access to the gradient of their cost with respect to their own choice variable—i.e., $g_i(x_k) = D_i f_i(x_k)$, where $D_i f_i$ denotes the derivative of $f_i$ with respect to $x_i$.

  2. Agents have an unbiased estimator of their gradient—i.e., $g_i(x_k) = D_i f_i(x_k) + w_{i,k+1}$, where $\{w_{i,k}\}$ is a zero-mean, finite-variance stochastic process.

We refer to the former setting as deterministic gradient-based learning and the latter setting as stochastic gradient-based learning. Assuming that all agents are employing such algorithms, we aim to analyze the limiting behavior of the agents’ strategies. To do so, we leverage the following game-theoretic notion of a Nash equilibrium.

{definition}

A strategy $x = (x_1, \ldots, x_n) \in X$ is a local Nash equilibrium for the game $(f_1, \ldots, f_n)$ if, for each $i$, there exists an open set $W_i \subset X_i$ such that $x_i \in W_i$ and $f_i(x_i, x_{-i}) \le f_i(x_i', x_{-i})$ for all $x_i' \in W_i$. If the above inequalities are strict, then we say $x$ is a strict local Nash equilibrium.

The focus on local Nash equilibria is due to our lack of assumptions on the agents' cost functions. If $W_i = X_i$ for each $i$, then a local Nash equilibrium is a global Nash equilibrium. This holds in, e.g., the bimatrix games and the linear quadratic games we analyze in Section 5. Depending on the agents' costs, a game may admit anywhere from one to a continuum of local or global Nash equilibria, or none at all.
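As a concrete illustration of update (1) and the two information settings above, the following minimal sketch simulates simultaneous gradient-play for a hypothetical two-player game with scalar decision variables; the cost functions, step sizes, and noise level are placeholders chosen for illustration and are not taken from the paper.

```python
import numpy as np

# Placeholder costs for a two-player game with scalar decision variables:
#   f1(x1, x2) = x1**2 + x1*x2,   f2(x1, x2) = x2**2 - x1*x2.
def grad_f1(x1, x2):
    return 2.0 * x1 + x2      # D_1 f_1: player 1's gradient in x1

def grad_f2(x1, x2):
    return 2.0 * x2 - x1      # D_2 f_2: player 2's gradient in x2

def gradient_play(x1, x2, steps=2000, lr1=0.05, lr2=0.05, noise=0.0, seed=0):
    """Simultaneous gradient-based learning as in update (1).

    With noise == 0 this is the deterministic (oracle) setting; with noise > 0
    each player only sees an unbiased, finite-variance estimate of their own
    gradient (the stochastic setting).
    """
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        g1 = grad_f1(x1, x2) + noise * rng.standard_normal()
        g2 = grad_f2(x1, x2) + noise * rng.standard_normal()
        # Both players update simultaneously, each descending their own cost.
        x1, x2 = x1 - lr1 * g1, x2 - lr2 * g2
    return x1, x2

print(gradient_play(1.0, -1.0))             # oracle gradients
print(gradient_play(1.0, -1.0, noise=0.1))  # noisy gradient estimates
```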

3 Linking Games and Dynamical Systems

In this section, we draw links between the limiting behavior of dynamical systems and game-theoretic notions of equilibria in three broad classes of continuous games. For brevity, the proofs of the propositions in this section are supplied in Appendix A. A high-level summary of the links we draw is shown in Figure 1.

Define $\omega(x) = (D_1 f_1(x), \ldots, D_n f_n(x))$ to be the vector of the players' derivatives of their own cost functions with respect to their own choice variables. When each player employs a gradient-based learning algorithm, the joint strategy of the players $x = (x_1, \ldots, x_n)$ follows (in the limit as the agents' step-sizes go to zero) the differential equation
\[ \dot{x} = -\omega(x). \]
A point $x^*$ is said to be an equilibrium, critical point, or stationary point of the dynamics if $\omega(x^*) = 0$. Stationary points of $\omega$ are joint strategies from which, under gradient-play, the agents do not move. We note that $\omega(x) = 0$ is a necessary condition for a point $x$ to be a local Nash equilibrium Ratliff et al. (2016). Hence, all local Nash equilibria are critical points of the joint dynamics $\dot{x} = -\omega(x)$.

Central to dynamical systems theory is the study of limiting behavior and its stability properties. A classical result in dynamical systems theory allows us to characterize the stability properties of an equilibrium $x^*$ by analyzing the Jacobian of the dynamics at $x^*$. The Jacobian of $\omega$ is defined by
\[ J(x) = D\omega(x) = \begin{bmatrix} D_1^2 f_1(x) & \cdots & D_{1n} f_1(x) \\ \vdots & \ddots & \vdots \\ D_{n1} f_n(x) & \cdots & D_n^2 f_n(x) \end{bmatrix}. \]
Since $J(x)$ is a matrix of second derivatives, it is sometimes referred to as the ‘game Hessian’. Similar to the Hessian matrix of a gradient flow, $J(x)$ allows us to further characterize the critical points of $\omega$ by their properties under the flow of $\dot{x} = -\omega(x)$. Let $\lambda_j$, $j = 1, \ldots, d$, denote the eigenvalues of $J(x^*)$, ordered so that $\mathrm{Re}(\lambda_1) \le \cdots \le \mathrm{Re}(\lambda_d)$—that is, $\lambda_1$ is the eigenvalue with the smallest real part. Of particular interest are asymptotically stable equilibria.

{definition}

A point $x^*$ is a locally asymptotically stable equilibrium of the continuous-time dynamics $\dot{x} = -\omega(x)$ if $\omega(x^*) = 0$ and $\mathrm{Re}(\lambda_j) > 0$ for all eigenvalues $\lambda_j$ of $J(x^*)$.

Locally asymptotically stable equilibria have two properties of interest. First, they are isolated, meaning that there exists a neighborhood around them in which no other equilibria exist. Second, they are exponentially attracting under the flow of $\dot{x} = -\omega(x)$, meaning that if agents initialize in a neighborhood of a locally asymptotically stable equilibrium $x^*$ and follow the dynamics described by $\dot{x} = -\omega(x)$, they will converge to $x^*$ exponentially fast Sastry (1999). This, in turn, implies that a discretized version of the dynamics, namely

\[ x_{k+1} = x_k - \gamma\, \omega(x_k), \qquad (2) \]

converges locally at a linear rate for an appropriately selected step-size $\gamma$. Such results motivate the study of the continuous-time dynamical system in order to understand the convergence properties of gradient-based learning algorithms of the form (1).

Another important class of critical points of a dynamical system are saddle points. {definition} A point $x^*$ is a saddle point of the dynamics $\dot{x} = -\omega(x)$ if $\omega(x^*) = 0$ and $J(x^*)$ has eigenvalues with both positive and negative real parts. A saddle point $x^*$ such that $\mathrm{Re}(\lambda_j) < 0$ for $j = 1, \ldots, k$ and $\mathrm{Re}(\lambda_j) > 0$ for $j = k+1, \ldots, d$, with $1 \le k < d$, is a strict saddle point of the continuous-time dynamics $\dot{x} = -\omega(x)$.

Strict saddle points are especially relevant to our analysis since their neighborhoods are characterized by stable and unstable manifolds Sastry (1999). When the agents evolve according to the dynamics solely on the stable manifold, they converge exponentially fast to the critical point. However, when they evolve solely on the unstable manifold, they diverge from the equilibrium exponentially fast. Agents whose strategies lie on the union of the two manifolds asymptotically avoid the equilibrium. We make use of this general fact in Section 4.1.
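To make this classification concrete, the sketch below numerically approximates the game Jacobian at a critical point by finite differences and inspects its eigenvalues; the two-player costs are the same illustrative placeholders used above, and the helper names are assumptions made for this example only.

```python
import numpy as np

# Same placeholder two-player costs as before; omega stacks each player's
# individual gradient: omega(x) = (D_1 f_1(x), D_2 f_2(x)).
def omega(x):
    x1, x2 = x
    return np.array([2.0 * x1 + x2,     # D_1 f_1
                     2.0 * x2 - x1])    # D_2 f_2

def game_jacobian(omega_fn, x, eps=1e-6):
    """Finite-difference approximation of the game Hessian J(x) = D omega(x)."""
    d = len(x)
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (omega_fn(x + e) - omega_fn(x - e)) / (2.0 * eps)
    return J

x_star = np.zeros(2)                  # critical point: omega(x_star) = 0
J = game_jacobian(omega, x_star)
print(J)
print(np.linalg.eigvals(J))           # ~ 2 + 1j, 2 - 1j
# All eigenvalues of J have positive real part, so x_star is a locally
# asymptotically stable equilibrium of xdot = -omega(x).  If J also had an
# eigenvalue with negative real part, x_star would instead be a saddle point
# (and a strict saddle point if no eigenvalue had zero real part).
```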

To better understand the links between the critical points of the gradient dynamics and the Nash equilibria of the game, we make use of an equivalent characterization of strict local Nash equilibria that leverages first- and second-order conditions on the players' cost functions. This makes them simpler objects to link to the various dynamical systems notions of equilibria than local Nash equilibria.

{definition}

[Ratliff et al. (2013, 2016)] A point $x^* \in X$ is a differential Nash equilibrium for the game defined by $(f_1, \ldots, f_n)$ if $D_i f_i(x^*) = 0$ and $D_i^2 f_i(x^*) \succ 0$ for each $i \in \mathcal{I}$.

In Ratliff et al. (2014), it was shown that local Nash equilibria are generically differential Nash equilibria at which $\det J(x^*) \ne 0$ (i.e., $x^*$ is non-degenerate). Thus, in the space of games where the agents' costs are at least twice differentiable, the set of games that admit local Nash equilibria that are not non-degenerate differential Nash equilibria is of measure zero Ratliff et al. (2014). In Ratliff et al. (2014) it was also shown that non-degenerate differential Nash equilibria are structurally stable, meaning that small perturbations to the agents' cost functions do not change the fundamental nature of the equilibrium. This also implies that gradient-play with slightly biased estimators of the gradient will not have vastly different behaviors in neighborhoods of such equilibria.

Given these different equilibrium notions of the learning dynamics and the underlying game, let us define the following sets which will be useful in stating the results in the following sections. For a game , denote the sets of strict saddle points and locally asymptotically stable equilibria of the gradient dynamics, , as and , respectively, where we recall that . Similarly, denote the set of local Nash equilibria, differential Nash equilibria, and non-degenerate differential Nash equilibria of as , , and , respectively. As previously mentioned, in almost all continuous games. The key takeaways of this section are summarized in Figure 1.

Figure 1: Links between the equilibria of generic continuous games and their properties under the gradient dynamics .

3.1 General-sum games

We first analyze the properties of local Nash equilibria under the joint gradient dynamics in n-player general-sum games.

{proposition}

A non-degenerate differential Nash equilibrium is either a locally asymptotically stable equilibrium or a strict saddle point of $\dot{x} = -\omega(x)$. Locally asymptotically stable differential Nash equilibria satisfy the notion of variational stability introduced in Mertikopoulos and Zhou (2019). In fact, a simple analysis shows that the definitions of variationally stable equilibria and locally asymptotically stable differential Nash equilibria Ratliff et al. (2013) are equivalent in the games we consider—i.e., games where each player's cost is at least twice continuously differentiable. We remark that, from the definition of asymptotic stability, the gradient dynamics have a local exponential convergence rate in the neighborhood of such equilibria.

An important point to make is that not every locally asymptotically stable equilibrium of is a non-degenerate differential Nash equilibrium. Indeed, the following proposition provides an entire class of games whose corresponding gradient dynamics admit locally asymptotically stable equilibria that are not local Nash equilibria. {proposition} In the class of general-sum continuous games, there exists a continuum of games containing games such that , and moreover, . {proof} Consider a two player game on where

for constants . The Jacobian of is given by

(3)

If and , then the unique stationary point is neither a differential Nash nor a local Nash equilibrium since the necessary conditions are violated (i.e., ). However, if and , the eigenvalues of have positive real parts and is asymptotically stable. Further, this clearly holds for a continuum of games. Thus, the set of locally asymptotically stable equilibria that are not Nash equilibria may be arbitrarily large.

The preceding proposition shows that there exist attracting critical points of the gradient dynamics in general-sum continuous games that are not Nash equilibria and may not even be relevant to the game. Thus, it provides a negative answer to Q1 (whether all attractors of the learning dynamics are equilibria relevant to the underlying game).

{remark}

We note that, by definition, the non-Nash locally asymptotically stable equilibria (or non-Nash equilibria, for short) do not satisfy the second-order conditions for Nash equilibria. Thus, at these joint strategies, at least one player (and possibly all of them) has a direction in which they could unilaterally deviate to decrease their cost, and they would do so if they were not using gradient descent. As such, we view convergence to these points as undesirable.
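The following sketch makes the phenomenon concrete with a hypothetical two-player general-sum game constructed purely for illustration (it is not the game used in the proof above): gradient-play converges to the origin even though the origin violates the second-order Nash condition for player 1.

```python
import numpy as np

# Hypothetical general-sum game (constructed for illustration; not the game in
# the proof above):
#   f1(x, y) = -x**2 / 2 + 2*x*y    (note D_1^2 f1 = -1 < 0: (0, 0) is NOT Nash)
#   f2(x, y) = 2*y**2 - 3*x*y
def omega(z):
    x, y = z
    return np.array([-x + 2.0 * y,        # D_1 f1
                     4.0 * y - 3.0 * x])  # D_2 f2

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=2)
for _ in range(2000):
    z = z - 0.05 * omega(z)               # simultaneous gradient-play
print(z)                                  # converges to the origin

J0 = np.array([[-1.0, 2.0], [-3.0, 4.0]])  # game Jacobian at the origin
print(np.linalg.eigvals(J0))               # eigenvalues 1 and 2: both positive
# The origin attracts gradient-play, yet player 1's cost f1(x, 0) = -x**2 / 2
# strictly decreases as |x| grows, so player 1 would unilaterally deviate:
# this attractor is a non-Nash locally asymptotically stable equilibrium.
```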

3.2 Zero-sum games

Let us now restrict our attention to two-player zero-sum games, which often arise when training GANs, in adversarial learning, and in MARL Goodfellow et al. (2014); Omidshafiei et al. (2017); Chivukula and Liu (2017). In such games, one player can be seen as minimizing a cost $f$ with respect to their decision variable and the other as minimizing $-f$ with respect to theirs. The following proposition shows that all differential Nash equilibria in two-player zero-sum games are locally asymptotically stable equilibria under the flow of $\dot{x} = -\omega(x)$.

{proposition}

For an arbitrary two-player zero-sum game, on , if is a differential Nash equilibrium, then is both a non-degenerate differential Nash equilibrium and a locally asymptotically stable equilibrium of —that is, .

This result guarantees that the differential Nash equilibria of zero-sum games are isolated and exponentially attracting under the flow of $\dot{x} = -\omega(x)$. This in turn guarantees that simultaneous gradient-play has a local linear rate of convergence to all local Nash equilibria in all zero-sum continuous games. Thus, the answer to Q2 in the context of zero-sum games is “yes”, since all Nash equilibria are attracting for the gradient dynamics.

The converse of the preceding proposition, however, is not true. Not every locally asymptotically stable equilibrium in a two-player zero-sum game is a non-degenerate differential Nash equilibrium. Indeed, there may be many locally asymptotically stable equilibria in a zero-sum game that are not local Nash equilibria. The following proposition highlights this fact. {proposition} In the class of zero-sum continuous games, there exists a continuum of games such that for each game , . {proof} Consider the two-player zero-sum game on where

and . The Jacobian of is given by

If and , then has eigenvalues with strictly positive real part, but the unique stationary point is not a differential Nash equilibrium—since —and, in fact, is not even a Nash equilibrium. Indeed,

Thus, there exists a continuum of zero-sum games with a large set of locally asymptotically stable equilibria of the corresponding dynamics that are not differential Nash.

The preceding proposition again shows that there exist non-Nash equilibria of the gradient dynamics in zero-sum continuous games. Thus, this proposition also provides a negative answer to Q1 in the context of zero-sum games.
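As in the general-sum case, a small numerical sketch with a hypothetical zero-sum game (again constructed for illustration, not the game from the proof above) exhibits the same behavior: gradient-play converges to a point that is not a Nash equilibrium of the game.

```python
import numpy as np

# Hypothetical zero-sum game (constructed for illustration; not the game in the
# proof above): player 1 minimizes f(x, y) = -x**2/2 - 2*y**2 + 3*x*y over x,
# player 2 minimizes -f over y.
def omega(z):
    x, y = z
    return np.array([-x + 3.0 * y,        # D_x f
                     4.0 * y - 3.0 * x])  # D_y (-f)

rng = np.random.default_rng(2)
z = rng.uniform(-1.0, 1.0, size=2)
for _ in range(500):
    z = z - 0.05 * omega(z)
print(z)                                   # converges to the origin

J0 = np.array([[-1.0, 3.0], [-3.0, 4.0]])  # game Jacobian at the origin
print(np.linalg.eigvals(J0))               # 1.5 +/- 1.66j: positive real parts
# The origin is attracting for the gradient dynamics, but D_x^2 f(0, 0) = -1 < 0,
# so player 1 could lower their cost by deviating from x = 0: the attractor is
# not a (differential) Nash equilibrium of the zero-sum game.
```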

3.3 Potential Games

One last set of games with interesting connections between the Nash equilibria and the critical points of the gradient dynamics is the class known as potential games. This particularly nice class of games is one for which $\omega$ corresponds to a gradient flow under a coordinate transformation—that is, there exists a function $\phi: X \to \mathbb{R}$ (commonly referred to as the potential function) such that $D_i f_i(x) = D_i \phi(x)$ for each $i$ and all $x \in X$. We remark that, due to this equivalence, this class of games is sometimes referred to as exact potential games. Note that a necessary and sufficient condition for a game to be an exact potential game is that $J(x)$ is symmetric for all $x$ Monderer and Shapley (1996)—that is, $D_{ij} f_i(x) = (D_{ji} f_j(x))^\top$ for all $i, j$. This gives potential games the desirable property that the only locally asymptotically stable equilibria of the gradient dynamics are local Nash equilibria.

{proposition}

For an arbitrary potential game, on , if is a locally asymptotically stable equilibrium of (i.e., ), then is a non-degenerate differential Nash equilibrium (i.e., ).

The full proof of Proposition 3.3 is supplied in Appendix A. The preceding proposition rules out non-Nash locally asymptotically stable equilibria of the gradient dynamics in potential games, and implies that every local minimum of the potential function must be a local Nash equilibrium of the game. Thus, in potential games, unlike in general-sum and zero-sum games, the answer to Q1 is positive. However, the following proposition shows that the existence of a potential function is not enough to rule out local Nash equilibria that are saddle points of the dynamics. {proposition} In the class of continuous games, there exists a continuum of potential games containing games that admit Nash equilibria that are saddle points of the dynamics —i.e., such that for some , . {proof} Consider the game on described by

where . The Jacobian of is given by

If , then is a local Nash equilibrium. However, if , has one positive and one negative eigenvalue and is a saddle point of the gradient dynamics. Thus, there exists a continuum of potential games where a large set of differential Nash equilibria are strict saddle points of .

Proposition 3.3 demonstrates a surprising fact about potential games. Even though all minimizers of the potential function must be local Nash equilibria, not all local Nash equilibria are minimizers of the potential function.
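A concrete, hypothetical instance of this phenomenon (constructed for illustration rather than taken from the proof above) is the identical-interest game below: both players minimize the same potential, the game Jacobian is symmetric as required of a potential game, the origin satisfies the differential Nash conditions, and yet gradient-play escapes it.

```python
import numpy as np

# Hypothetical identical-interest (exact potential) game: both players minimize
# phi(x, y) = x**2 + y**2 - 3*x*y.  The game Jacobian equals the (symmetric)
# Hessian of phi, with eigenvalues 5 and -1.  The origin satisfies the
# differential Nash conditions (D_i f_i = 0, D_i^2 f_i = 2 > 0 for both
# players) but is a strict saddle of the potential and of the gradient dynamics.
def omega(z):
    x, y = z
    return np.array([2.0 * x - 3.0 * y,   # D_1 phi
                     2.0 * y - 3.0 * x])  # D_2 phi

phi = lambda z: z[0] ** 2 + z[1] ** 2 - 3.0 * z[0] * z[1]

rng = np.random.default_rng(3)
z = 1e-3 * rng.standard_normal(2)     # random initialization near the Nash point
for _ in range(200):
    z = z - 0.05 * omega(z)           # simultaneous gradient-play
print(np.linalg.norm(z), phi(z))
# The iterates move away from the local Nash equilibrium at the origin (the
# potential keeps decreasing along the escape direction x = y), illustrating a
# Nash equilibrium that gradient-play avoids almost surely.
```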

3.4 Main Takeaways

The main takeaways of this section are summarized in Figure 1. We note that for zero-sum games, Proposition 3.2 shows that every differential Nash equilibrium is a locally asymptotically stable equilibrium of the gradient dynamics. Since the inclusion is strict (there exist locally asymptotically stable equilibria that are not Nash), the answer to Q1 in such games is “no”. For general-sum games, Proposition 3.1 allows us to conclude that there do exist attracting, non-Nash equilibria. Thus, the answer to Q1 is also “no”. In potential games, since the only locally asymptotically stable equilibria are differential Nash equilibria, the answer is “yes”.

In the following sections, we provide answers to Q2 by showing that all local Nash equilibria that are strict saddle points of the gradient dynamics are avoided almost surely by gradient-based algorithms in both the deterministic and stochastic settings. In particular, since such equilibria exist in potential and general-sum games, one cannot give a positive answer to Q2 in either of these classes of games.

4 Convergence of Gradient-Based Learning

In this section, we provide convergence and non-convergence results for gradient-based algorithms. We also include a high-level overview of well-known algorithms that fit into the class of learning algorithms we consider; more detail can be found in Appendix C.

4.1 Deterministic Setting

We first address convergence to equilibria in the deterministic setting in which agents have oracle access to their gradients at each time step. This includes the case where agents know their own cost functions and observe their own actions as well as their competitors’ actions—and hence, can compute the gradient of their cost with respect to their own choice variable.

Since we have assumed that each agent has their own learning rate (i.e., step-sizes $\gamma_{i,k}$), the joint dynamics of all the players are given by

\[ x_{k+1} = x_k - \gamma_k \odot \omega(x_k), \qquad (4) \]

where $\gamma_k = (\gamma_{1,k}, \ldots, \gamma_{n,k})$ with each $\gamma_{i,k} > 0$. By a slight abuse of notation, $\gamma_k \odot \omega(x_k)$ is defined to be element-wise multiplication of $\gamma_k$ and $\omega(x_k)$, where $\gamma_{1,k}$ is multiplied by the first $d_1$ components of $\omega(x_k)$, $\gamma_{2,k}$ is multiplied by the next $d_2$ components, and so on.

We remark that this update rule immediately distinguishes gradient-based learning in games from gradient descent. By definition, the dynamics of gradient descent in single-agent settings always correspond to gradient flows—i.e., $x$ evolves according to an ordinary differential equation of the form $\dot{x} = -\nabla \phi(x)$ for some function $\phi$. Outside of the class of exact potential games we defined in Section 3, the dynamics of players' actions in games are not afforded this luxury—indeed, $J(x)$ is not in general symmetric (symmetry being a necessary condition for a gradient flow). This makes the potential limiting behaviors of the dynamics highly non-trivial to characterize in general-sum games.

The structure present in a gradient flow implies strong properties on its limiting behaviors. In particular, it precludes the existence of limit cycles or periodic orbits (limiting behaviors of dynamical systems in which the state cycles infinitely through a set of states with a finite period) and chaos (an attribute of nonlinear dynamical systems whose behavior can vary drastically under slight changes in initial conditions) Sastry (1999). We note that both of these behaviors can occur in the dynamics of gradient-based learning algorithms in games.
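A standard illustration of this point (not an example from the paper) is the scalar bilinear zero-sum game sketched below, whose game Jacobian is antisymmetric: the continuous-time dynamics rotate around the stationary point rather than descend, and discretized simultaneous gradient-play never reduces its distance to it.

```python
import numpy as np

# Standard illustrative bilinear zero-sum game:
#   f1(x, y) = x*y,  f2(x, y) = -x*y.
# Here omega(x, y) = (y, -x) and the game Jacobian [[0, 1], [-1, 0]] is
# antisymmetric, so xdot = -omega(x) is a pure rotation about (0, 0) rather
# than a gradient flow: trajectories orbit the stationary point forever.
def omega(z):
    x, y = z
    return np.array([y, -x])

z = np.array([1.0, 0.0])
dist = [np.linalg.norm(z)]
for _ in range(2000):
    z = z - 0.01 * omega(z)        # simultaneous gradient-play, small step
    dist.append(np.linalg.norm(z))

print(dist[0], dist[-1])
# The distance to the stationary point never decreases (it grows slowly for any
# constant step size), i.e., the players cycle instead of converging.
```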

Despite the wide breadth of behaviors that the gradient dynamics can exhibit in competitive settings, we are still able to make statements about convergence (and non-convergence) to certain types of equilibria. To do so, we first make the following standard assumptions on the smoothness of the cost functions and the magnitude of the agents' learning rates.

Assumption 1

For each $i \in \mathcal{I}$, $f_i \in C^2(X, \mathbb{R})$, $\omega$ is $L$-Lipschitz with respect to the induced 2-norm, and $\gamma_i < 1/L$.

Given these assumptions, the following result rules out converging to strict saddle points.

{theorem}

Let the game $(f_1, \ldots, f_n)$ and the agents' step-sizes satisfy Assumption 1, and suppose that $X$ is open and convex. Then the set of initial conditions from which competitive gradient-based learning converges to strict saddle points is of measure zero.

We remark that the above theorem holds, in particular, for $X = \mathbb{R}^d$, since the openness and convexity requirement holds trivially in this case. It is also important to note that, as we point out in Section 3, local Nash equilibria can be strict saddle points. Thus, all local Nash equilibria that are strict saddle points of $\dot{x} = -\omega(x)$ are avoided almost surely by gradient-play, even with oracle gradient access and random initializations. This holds even when players randomly initialize uniformly in an arbitrarily small ball around such Nash equilibria. In Section 5, we show that many linear quadratic dynamic games have a strict saddle point as their global Nash equilibrium. For brevity, we provide the proof of Theorem 4.1 in Appendix A, and give a proof sketch below.

{proof}

[Proof sketch of Theorem 4.1] The core of the proof is the celebrated stable manifold theorem from dynamical systems theory, presented in Theorem A.2. We construct the set of initial positions from which gradient-play will converge to strict saddle points and then use the stable manifold theorem to show that the set must have measure zero in the players’ joint strategy space. Therefore, with a random initialization players will never evolve solely on the stable manifold of strict saddles and they will consequently diverge from such equilibria.

To be able to invoke the stable manifold theorem, we first show that the mapping $g: x \mapsto x - \gamma \odot \omega(x)$ is a diffeomorphism, which is non-trivial due to the fact that we have allowed each agent to have their own learning rate and $J$ is not symmetric. We then iteratively construct the set of initializations that converge to strict saddle points under the game dynamics. By the stable manifold theorem, and the fact that $g$ is a diffeomorphism, the stable manifold of a strict saddle point must be measure zero. Then, by induction, we show that the set of all initial points that converge to a strict saddle point must also be measure zero.

In potential games we can strengthen the above non-convergence result and give convergence guarantees. {corollary} Consider a potential game $(f_1, \ldots, f_n)$ on an open, convex strategy space $X$ where each $f_i \in C^2(X, \mathbb{R})$. Let $\mu$ be a prior measure with support $X$ which is absolutely continuous with respect to the Lebesgue measure, and assume the limit of the gradient-play iterates exists. Then, under Assumption 1, competitive gradient-based learning converges to non-degenerate differential Nash equilibria almost surely. Moreover, the non-degenerate differential Nash equilibrium to which it converges is generically a local Nash equilibrium.

Corollary 4.1 guarantees that in potential games, gradient-play will converge to a differential Nash equilibrium. Combining this with Theorem 4.1 guarantees that the differential Nash equilibrium it converges to is a local minimizer of the potential function. A simple implication of this result is that gradient-based learning in potential games cannot exhibit limit cycles or chaos.

Of note is the fact that the agents do not need to be performing gradient-based learning on the potential function $\phi$ to converge to Nash equilibria almost surely. That is, they do not need to know the function $\phi$; they simply need to follow the derivative of their own cost with respect to their own choice variable, and they are guaranteed to converge to a local Nash equilibrium that is a local minimizer of the potential function.

We note that convergence to Nash equilibria is a known characteristic of gradient-play in potential games. However, our analysis also highlights that gradient-play will avoid a subset of the Nash equilibria of the game. This is surprising given the particularly strong structural properties of such games. The proof of Corollary 4.1 is provided in Appendix A and follows from Proposition 3.3, Theorem 4.1, and the fact that $J$ is symmetric in potential games.

Implications and Interpretation of Convergence Analysis

Both Theorem 4.1 and Corollary 4.1 show that gradient-play in multi-agent settings avoids strict saddles almost surely, even in the deterministic setting. Combined with the analysis in Section 3, which shows that (local) Nash equilibria can be strict saddles of the dynamics for general-sum games, this implies that a subset of the Nash equilibria are almost surely avoided by individual gradient-play, a potentially undesirable outcome in view of Q2 (whether all Nash equilibria are attracting for the learning dynamics). In Section 5, we show that the global Nash equilibrium is a saddle point of the gradient dynamics in a large number of randomly sampled LQ dynamic games. This suggests that policy gradient algorithms may fail to converge to the global Nash equilibrium in such games, which is highly undesirable. This is in stark contrast to the single-agent setting, where policy gradient has been shown to converge to the unique solution of LQR problems Fazel et al. (2018).

In Section 3, we also showed that local Nash equilibria of potential games can be strict saddle points of the potential function. Non-convergence to such points in potential games is not necessarily a bad result, since it in turn implies convergence to a local minimizer of the potential function (as shown in Lee et al. (2016); Panageas and Piliouras (2016)), which is guaranteed to be a local Nash equilibrium of the game. However, these results do imply that one cannot answer “yes” to Q2 in potential games, since some of the Nash equilibria are not attracting under gradient-play.

In zero-sum games, where local Nash equilibria cannot be strict saddle points of the gradient dynamics, our result suggests that eventually gradient-based learning algorithms will escape saddle points of the dynamics.

The almost sure avoidance of all equilibria that are saddle points of the dynamics further implies that if the gradient-based learning dynamics (4) converge to a critical point $x^*$, then $x^*$ is locally asymptotically stable for $\dot{x} = -\omega(x)$. This may not be a desired property, however, since we showed in Section 3 that zero-sum and general-sum games both admit non-Nash locally asymptotically stable equilibria.

Since gradient-play in games generally does not result in a gradient flow, other types of limiting behaviors such as limit cycles can occur in gradient-based learning dynamics. Theorem 4.1 says nothing about convergence to other limiting behaviors. In the following sections we prove that the results described in this section extend to the stochastic gradient setting. We also formally define periodic orbits in the context of dynamical systems and state stronger results on avoidance of some more complex limiting behaviors like linearly unstable limit cycles.

4.2 Stochastic Setting

We now analyze the stochastic case in which agents are assumed to have an unbiased estimator for their gradient. The results in this section allow us to extend the results from the deterministic setting to a setting where each agent builds an estimate of the gradient of their loss at the current set of strategies from potentially noisy observations of the environment. Thus, we are able to analyze the limiting behavior of a class of commonly used machine learning algorithms for competitive, multi-agent settings. In particular, we show that agents will almost surely not converge to strict saddle points. In Appendix B.1, we show that the gradient dynamics will actually avoid more general limiting behaviors called linearly unstable cycles which we define formally.

To perform our analysis, we make use of tools and ideas from the literature on stochastic approximations (see e.g Borkar (2008)). We note that the convergence of stochastic gradient schemes in the single-agent setting has been extensively studied Robbin (1971); Pemantle (1990); Bottou (2010); Mertikopoulos and Staudigl (2018). We extend this analysis to the behavior of stochastic gradient algorithms in games.

We assume that each agent updates their strategy using the update rule

\[ x_{i,k+1} = x_{i,k} - \gamma_{i,k} \big( D_i f_i(x_k) + w_{i,k+1} \big), \qquad (5) \]

for some zero-mean, finite-variance stochastic process $\{w_{i,k}\}$. Before presenting the results for the stochastic case, let us comment on the different learning algorithms that fit into this framework.

Examples of Stochastic Gradient-Based Learning

Table 1: Example problem classes that fit into the competitive gradient-based learning framework: gradient-play in continuous games, GANs, multi-agent (MA) policy gradient, individual Q-learning, MA gradient bandits, and MA experts. Details on the derivation of these update rules as gradient-based learning schemes are provided in Appendix C.

The stochastic gradient-based learning setting we study is general enough to include a variety of commonly used multi-agent learning algorithms. The classes of algorithms we include are hardly an exhaustive list, and indeed many extensions and altogether different algorithms exist that can be considered members of this class. In Table 1, we list six example classes of learning problems: (i) gradient-play in non-cooperative continuous games, (ii) GANs, (iii) multi-agent policy gradient, (iv) individual Q-learning, (v) multi-agent gradient bandits, and (vi) multi-agent experts. We provide a detailed analysis of these different algorithms, including the derivation of their gradient-based update rules, along with some interesting numerical examples in Appendix C. In each of these cases, one can view an agent employing the given algorithm as building an unbiased estimate of their gradient from their observation of the environment.

For example, in multi-agent policy gradient (see, e.g., (Sutton and Barto, 2017, Chapter 13)), agents' costs are defined as functions of a parameter vector $\theta_i$ that parameterizes their policy $\pi_{\theta_i}$. The parameters $\theta_i$ are agent $i$'s choice variable. By following the gradient of their loss function, they aim to tune the parameters in order to converge to an optimal policy. Perhaps surprisingly, it is not necessary for agent $i$ to have access to the other agents' policies or parameters in order to construct an unbiased estimate of the gradient of their loss with respect to their own choice variable, as long as they observe the sequence of actions generated by all the other agents. These actions are implicitly determined by the other agents' policies. Hence, in this case, observing their own reward, action, and state along with the other agents' actions is enough for agent $i$ to construct an unbiased estimate of their gradient. We provide further details on multi-agent policy gradient in Appendix C.
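To make this construction concrete, the sketch below uses a hypothetical two-player, two-action matrix game with softmax policies and a REINFORCE-style score-function estimator; the costs, parameterization, step size, and horizon are illustrative assumptions rather than the paper's setup, but each player's gradient estimate is unbiased and uses only their own reward and action together with the opponent's sampled action.

```python
import numpy as np

# Hypothetical two-player, two-action matrix game with softmax policies.  Each
# player forms a REINFORCE-style score-function estimate of the gradient of
# their expected cost with respect to their OWN policy parameters.
C1 = np.array([[0.0, 0.0], [1.0, 1.0]])   # player 1's cost: action 0 is dominant
C2 = np.array([[0.0, 1.0], [0.0, 1.0]])   # player 2's cost: action 0 is dominant

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta1, theta2 = np.zeros(2), np.zeros(2)
for k in range(5000):
    pi1, pi2 = softmax(theta1), softmax(theta2)
    a1 = rng.choice(2, p=pi1)
    a2 = rng.choice(2, p=pi2)
    # Score functions: grad_theta log pi(a; theta) = e_a - pi(theta).
    g1 = C1[a1, a2] * (np.eye(2)[a1] - pi1)   # unbiased estimate of D_1 E[cost_1]
    g2 = C2[a1, a2] * (np.eye(2)[a2] - pi2)   # unbiased estimate of D_2 E[cost_2]
    theta1, theta2 = theta1 - 0.1 * g1, theta2 - 0.1 * g2

print(softmax(theta1), softmax(theta2))
# Each policy concentrates on its own zero-cost action, having followed only
# unbiased sample-based gradient estimates of the form used in update (5).
```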

Stochastic Gradient Results

Returning to the analysis of (5), we make the following standard assumptions on the noise processes Robbin (1971); Robbins and Siegmund (1985).

Assumption 2

The stochastic process $\{w_k\}$ satisfies the assumptions $\mathbb{E}[w_{k+1} \mid \mathcal{F}_k] = 0$ and $\mathbb{E}[\|w_{k+1}\|^2 \mid \mathcal{F}_k] \le c(1 + \|x_k\|^2)$ a.s. for some constant $c > 0$, where $\{\mathcal{F}_k\}$ is an increasing family of $\sigma$-fields—i.e., a filtration, or the history generated by the sequence of random variables—given by $\mathcal{F}_k = \sigma(x_0, w_1, \ldots, w_k)$.

We also make new assumptions on the players’ step-sizes. These are standard assumptions in the stochastic approximation literature and are needed to ensure that the noise processes are asymptotically controlled.

Assumption 3

For each $i \in \mathcal{I}$, $f_i \in C^2(X, \mathbb{R})$, $\omega$ is $L$–Lipschitz with $L < \infty$, the step-sizes satisfy $\gamma_{i,k} > 0$ for all $i$ and $k$ with $\sum_{k} \gamma_{i,k} = \infty$ and $\sum_{k} \gamma_{i,k}^2 < \infty$, and $\sup_k \|x_k\| < \infty$ a.s.

Let $\langle \cdot, \cdot \rangle$ denote the inner product on $\mathbb{R}^d$. The following theorem extends the results of Theorem 4.1 to the stochastic gradient dynamics in games. {theorem} Consider a game $(f_1, \ldots, f_n)$ on $X$. Suppose each agent adopts a stochastic gradient algorithm that satisfies Assumptions 2 and 3. Further, suppose that for each strict saddle point, there exists a constant $c > 0$ such that $\mathbb{E}[\langle w_{k+1}, v \rangle^+ \mid \mathcal{F}_k] \ge c$ for every unit vector $v$. Then, competitive stochastic gradient-based learning converges to strict saddle points of the game on a set of measure zero. The proof follows directly from showing that (5) satisfies the conditions of Theorem A.3, provided the assumptions of the theorem hold. The assumption on the noise rules out degenerate cases where the noise forces the stochastic dynamics onto the stable manifold of strict saddle points.

Theorem 4.2.2 implies that the dynamics of stochastic gradient-based learning defined in (5) have the same limiting properties as the deterministic dynamics vis-à-vis saddle points. Thus, the implications described in Section 4.1.1 extend to the stochastic gradient setting. In particular, stochastic gradient-based algorithms will avoid a non-negligible subset of the Nash equilibria in general-sum and potential games. Further, in zero-sum and general-sum games, if the players do converge to a critical point, that point may be a non-Nash equilibrium.
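For intuition, the following sketch runs the stochastic update (5) with step sizes of the form c/(k+1), which satisfy the summability conditions in Assumption 3, on the same illustrative two-player costs used earlier; all numerical choices are assumptions made for illustration.

```python
import numpy as np

# Stochastic update (5) with Robbins-Monro style step sizes gamma_k = c/(k+1).
# The costs are the same illustrative placeholders used earlier; the noise is
# zero mean with finite variance.
def omega(z):
    x1, x2 = z
    return np.array([2.0 * x1 + x2, 2.0 * x2 - x1])

rng = np.random.default_rng(1)
z = rng.uniform(-1.0, 1.0, size=2)
for k in range(20000):
    gamma_k = 0.5 / (k + 1)              # sum gamma_k = inf, sum gamma_k^2 < inf
    w_k = 0.5 * rng.standard_normal(2)   # unbiased gradient noise
    z = z - gamma_k * (omega(z) + w_k)
print(z)
# The iterates approach the locally asymptotically stable equilibrium at the
# origin despite only ever using noisy gradient estimates; strict saddle points
# would instead be avoided almost surely (Theorem 4.2.2).
```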

Further Convergence Results for Stochastic Gradient-Play in Games

As we demonstrated in Section 4.1, outside of potential games, the dynamics of gradient-based learning algorithms in games are not gradient flows. As such, the players’ actions can converge to more complex sets than simple equilibria. A particularly prominent class of limiting behaviors for dynamical systems are known as limit cycles (see e.g Sastry (1999)). Limit cycles (or periodic orbits) are sets of states such that each state is visited at periodic intervals ad infinitum under the dynamics. Thus, if the gradient-based algorithms converge to a limit cycle they will cycle infinitely through the same sequence of actions. Like equilibria, limit cycles can be stable or unstable under the dynamics , meaning that the dynamics can either converge to or diverge from them depending on their initializations.

We remark that the existence of oscillatory behaviors and limit cycles has been observed in the dynamics of gradient-based learning in various settings, such as the training of generative adversarial networks Daskalakis et al. (2017) and multiplicative weights in finite-action games Mertikopoulos et al. (2018). We simply emphasize that the existence of such limiting behaviors is due to the fact that the dynamics are no longer gradient flows. This fact also allows for other complex limiting behaviors, like chaos, to exist in the dynamics of gradient-based learning in games. We also show in Appendix B.1 that gradient-based learning avoids some limit cycles.

In Appendix B.1, we formalize the notion of a limit cycle and its stability in the stochastic setting. Using these concepts, we then provide an analogous theorem to Theorem 4.2.2 which states that competitive stochastic gradient-based learning converges to linearly unstable limit cycles—a parallel notion to strict saddle points but pertaining to more general limit sets—on a set of measure zero, provided that analogous assumptions to those in the statement of Theorem 4.2.2 hold. Providing such guarantees requires a bit more mathematical formalism, and as such we leave the details of these results to Appendix B.

In pursuit of a more general class of games with desirable convergence properties, in Appendix B.2 we also introduce a generalization of potential games, namely Morse-Smale games, for which the combined gradient dynamics correspond to a Morse-Smale vector field Hirsch (1976); Palis and Smale (1970). In such games players are guaranteed to converge to only (linearly stable) cycles or equilibria. In such games, however, players may still converge to non-Nash equilibria and avoid a subset of the Nash equilibria.

5 Saddle Point LNE in LQ Dynamic Games

In this section, we present empirical results showing that a non-negligible subset of two-player LQ games have local Nash equilibria that are strict saddle points of the gradient dynamics. LQ games serve as good benchmarks for analyzing the limiting behavior of gradient-play in a non-trivial setting since they are known to admit global Nash equilibria that can be found by solving a coupled set of Riccati equations Basar and Olsder (1998). LQ games can also be cast as multi-agent reinforcement learning problems where each agent has a policy that is a linear function of the state and a quadratic reward function. Gradient-play in LQ games can therefore be seen as a form of policy gradient.

The empirical results we now present imply that, even in the relatively straightforward case of linear dynamics, linear feedback policies, and quadratic costs, policy gradient multi-agent reinforcement learning would be unable to find the local Nash equilibrium in a non-negligible subset of problems.

LQ game setup For simplicity, we consider two-player LQ games. Consider a discrete-time dynamical system defined by

\[ z_{t+1} = A z_t + B_1 u_{1,t} + B_2 u_{2,t}, \qquad (6) \]

where $z_t$ is the state at time $t$, $u_{1,t}$ and $u_{2,t}$ are the control inputs of players 1 and 2, respectively, and $A$, $B_1$, and $B_2$ are the system matrices. We assume that player $i$ searches for a linear feedback policy of the form $u_{i,t} = -K_i z_t$ that minimizes their loss, which is given by
\[ f_i(K_1, K_2) = \sum_{t=0}^{\infty} \big( z_t^\top Q_i z_t + u_{i,t}^\top R_i u_{i,t} \big), \]
where $Q_i \succeq 0$ and $R_i \succ 0$ are the cost matrices on the state and input, respectively. We note that the two players are coupled through the dynamics since $z_t$ is constrained to obey the update equation (6). The vector of player derivatives is given by $\omega(K_1, K_2) = (D_{K_1} f_1(K_1, K_2),\, D_{K_2} f_2(K_1, K_2))$.

Note that there is a slight abuse of notation here, as we treat $K_i$ both as a matrix and as the vectorization of a matrix. The value matrices associated with a fixed pair of gains can be found by solving the corresponding Riccati equations for that pair $(K_1, K_2)$. As shown in Basar and Olsder (1998), global Nash equilibria of LQ games can be found by solving coupled Riccati equations. Under the following assumption, this can be done using a method analogous to the method of Lyapunov iterations outlined in Li and Gajic (1995) for continuous-time LQ games.

Assumption 4

Either $(A, B_1, Q_1^{1/2})$ or $(A, B_2, Q_2^{1/2})$ is stabilizable–detectable.

Further information on the uniqueness of Nash equilibria in LQ games and the method of Lyapunov iterations can be found in Basar and Olsder (1998) and Li and Gajic (1995) respectively.

Generating LQ games with strict saddle point Nash equilibria Without loss of generality, we assume $(A, B_1, Q_1^{1/2})$ is stabilizable–detectable. Given that we have a method of finding the global Nash equilibrium of the LQ game, we now present our experimental setup.

We fix four of the game matrices and parametrize the remaining ones by two scalar parameters. The shared dynamics matrix $A$ has entries sampled from a uniform distribution on a fixed interval. For each value of the parameters, we randomly sample 1000 different $A$ matrices. Then, for each LQ game defined by one of these parameter sets, we find the optimal feedback matrices using the method of Lyapunov iterations, numerically approximate the game Jacobian $J$ using auto-differentiation tools, and check its eigenvalues.

The exact values of the matrices are defined as follows: with each of the entries sampled from the uniform distribution on ,

The results for various combinations of the parameters and are shown in Figure 2. For all of the different parameter configurations considered, we found that in anywhere from of the randomly sampled LQ games, there was a global Nash equilibrium that was a strict saddle point of the gradient dynamics. Of particular interest is the fact that for all values of and we tested, at least of the LQ games had a global Nash equilibrium with the strict saddle property. In the worst case, around of the LQ games for the given values of and admitted such Nash equilibria.

Figure 2: Frequency (out of 1000) of randomly sampled LQ games with global Nash equilibria that are avoided by policy-gradient. The experiment was run 10 times and the average frequency is shown by the solid line. The shaded region demarcates the confidence interval of the experiment. (left) is varied in , . (right) is varied in , .
{remark}

These empirical observations imply that multi-agent policy gradient, even in the relatively straightforward setting of linear dynamics, linear policies, and quadratic costs, has no guarantees of convergence to the global Nash equilibria in a non-negligible number of games. Further investigation is warranted to validate this fact theoretically. This in turn supports the idea that for more complicated cost functions, policy classes, and dynamics, local Nash equilibria with the strict saddle property are likely to be very common.
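For readers who want to reproduce the flavor of this check, the following simplified sketch approximates the game Jacobian of a two-player LQ game by finite differences on a finite-horizon surrogate of the infinite-horizon costs and inspects its eigenvalues at a candidate pair of feedback gains. Everything in it is an illustrative assumption: the dimensions, the surrogate costs, the use of finite differences in place of auto-differentiation, and the placeholder gains standing in for the Nash feedback matrices (which in the experiments are computed by Lyapunov iterations, not shown here).

```python
import numpy as np

# Illustrative two-player LQ game with a 2-dimensional state and scalar inputs.
n, T = 2, 200
rng = np.random.default_rng(0)
A = rng.uniform(-0.5, 0.5, size=(n, n))
B1 = np.array([[1.0], [0.0]])
B2 = np.array([[0.0], [1.0]])
Q1, Q2 = np.eye(n), np.eye(n)
R1, R2 = np.eye(1), np.eye(1)

def cost(K1, K2, i):
    """Finite-horizon surrogate of player i's LQ cost, summed over canonical
    initial states, with linear feedback u_j = -K_j z."""
    Qi, Ri = (Q1, R1) if i == 1 else (Q2, R2)
    total = 0.0
    for z0 in np.eye(n):
        z = z0.copy()
        for _ in range(T):
            u1, u2 = -K1 @ z, -K2 @ z
            ui = u1 if i == 1 else u2
            total += z @ Qi @ z + ui @ Ri @ ui
            z = A @ z + B1 @ u1 + B2 @ u2
    return total

def omega(k, eps=1e-5):
    """Stacked individual gradients (D_{K1} f1, D_{K2} f2) via finite differences,
    for the flattened gain vector k = (K1, K2)."""
    g = np.zeros(2 * n)
    for j in range(2 * n):
        e = np.zeros(2 * n)
        e[j] = eps
        i = 1 if j < n else 2
        cp = cost((k + e)[:n].reshape(1, n), (k + e)[n:].reshape(1, n), i)
        cm = cost((k - e)[:n].reshape(1, n), (k - e)[n:].reshape(1, n), i)
        g[j] = (cp - cm) / (2.0 * eps)
    return g

K_star = np.zeros(2 * n)       # placeholder for the stacked Nash gains (K1*, K2*)
J = np.zeros((2 * n, 2 * n))   # finite-difference game Jacobian at K_star
for j in range(2 * n):
    e = np.zeros(2 * n)
    e[j] = 1e-4
    J[:, j] = (omega(K_star + e) - omega(K_star - e)) / 2e-4
print(np.linalg.eigvals(J))
# At the true Nash gains, an eigenvalue of J with negative real part (alongside
# positive ones) would indicate the strict saddle property, i.e., a global Nash
# equilibrium that policy gradient avoids almost surely.
```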

6 Discussion and Future Directions

In this paper we provided answers to the following two questions for classes of gradient-based learning algorithms:

Q1.

Are all attractors of the learning algorithms employed by agents equilibria relevant to the underlying game?

Q2.

Are all equilibria relevant to the game also attractors of the learning algorithms agents employ?

We answered these questions in general-sum, zero-sum, and potential games without imposing structure on the games beyond regularity conditions on the cost functions, by exploiting the observation that gradient-based learning dynamics are not gradient flows. Our analysis was shown in Appendix C to apply to a number of commonly used methods in multi-agent learning.

6.1 Links with Prior Work

As we noted, previous work on learning in games in both the game theory literature, and more recently from the machine learning community, has largely focused on Q1, though some recent work has analyzed Q2 in the setting of zero-sum games.

In the seminal work of Rosen (1965), n-player concave or monotone games are shown to either admit a unique Nash equilibrium or a continuum of Nash equilibria, all of which are attracting under gradient-play. The structure present in these games rules out the existence of non-Nash equilibria.

Two-player, finite-action bilinear games have also been extensively studied. In Singh et al. (2000), the authors investigate the convergence of the gradient dynamics in such games. Additionally, the dynamics of other (non-gradient-based) algorithms like multiplicative weights have been studied in Hommes and Ochea (2012), among many others. In such settings, the structure guarantees that there exists a unique global Nash equilibrium and no other critical points of the gradient dynamics. As such, non-Nash equilibria cannot exist.

In the study of learning dynamics in the class of zero-sum games, it has been shown that cycles can be attractors of the dynamics (see, e.g., Mertikopoulos et al. (2018); Wesson and Rand (2016); Hommes and Ochea (2012)). Concurrently with our results, Daskalakis and Panageas (2018) also showed the existence of non-Nash attracting equilibria in this setting.

In more general settings, there has been some analysis of the limiting behavior of gradient-play, though the focus has been, for the most part, on giving sufficient conditions under which Nash equilibria are attracting under gradient-play. For example, Ratliff et al. (2013, 2014, 2016) introduced the notion of a differential Nash equilibrium, which is characterized by first- and second-order conditions on the players' individual cost functions and which we made extensive use of. Following this body of work, Mertikopoulos and Zhou (2019) also investigated the local convergence of gradient-play in continuous games. They showed that if a Nash equilibrium satisfies a property known as variational stability, the equilibrium is attracting under gradient-play. In twice continuously differentiable games, this condition coincides exactly with the definition of stable differential Nash equilibria. Though these works analyze a general class of games, the focus of the analysis is solely on the local characterization and computation (via gradient-play) of local Nash equilibria. As such, the issues of non-convergence that we show in this paper were not discussed.

6.2 Open Questions

Our results suggest that gradient-play in multi-agent settings has fundamental problems. Depending on the players’ costs, in general games and even potential games, which have a particularly nice structure, a subset of the Nash equilibria will be almost surely avoided by gradient-based learning when the agents randomly initialize their first action. In zero-sum and general-sum games, even if the algorithms do converge, they may have converged to a point that has no game theoretic relevance, namely a non-Nash locally asymptotically stable equilibrium.

Lastly, these results show that limit cycles persist even under a stochastic update scheme. This explains the empirical observations of limit cycles in gradient dynamics presented in  Daskalakis et al. (2017); Leslie and Collins (2005); Hommes and Ochea (2012). It also implies that gradient-based learning in multi-agent reinforcement learning, multi-armed bandits, generative adversarial networks, and online optimization all admit limit cycles under certain loss functions. Our empirical results show that these problems are not merely of theoretical interest, but also have great relevance in practice.

Which classes of games have all Nash equilibria attracting under gradient-play, and which classes preclude the existence of non-Nash equilibria, are open and particularly interesting questions. Further, the question of whether gradient-based algorithms can be constructed for which only game-theoretically relevant equilibria are attracting is of particular importance as gradient-based learning is increasingly implemented in game-theoretic settings. Indeed, more generally, as learning algorithms are increasingly deployed in markets and other competitive environments, understanding and dealing with such theoretical issues will become increasingly important.

Appendix A Proofs of the Main Results

This appendix contains the full proofs of the results in the paper.

A.1 Proofs on Links Between Dynamical Systems and Games

We begin with a proof of Proposition 3.1 that all differential Nash equilibria are either strict saddle points or asymptotically stable equilibria of the gradient dynamics. This relies mainly on the definitions of strict saddle points, locally asymptotically stable equilibria, and non-degenerate differential Nash equilibria and simple linear algebra.

{proof}

[Proof of Proposition 3.1] Suppose that is a non-degenerate differential Nash equilibrium. We claim that . Since is a differential Nash equilibrium, for each ; these are the diagonal blocks of . Further implies that . Since , . Thus, it is not possible for all the eigenvalues to have negative real part. Since is non-degenerate, so that none of the eigenvalues can have zero real part. Hence, at least one eigenvalue has strictly positive real part.

To complete the proof, we show that the conditions for non-degenerate differential Nash equilibrium are not sufficient to guarantee that is locally asymptotically stable for the gradient dynamics—that is, not all eigenvalues of have strictly positive real part. We do this by constructing a class of games with the strict saddle point property. Consider a class of two player games on defined as follows:

In this game, the Jacobian of the gradient dynamics is given by

(7)

with . If is a non-degenerate differential Nash equilibrium, and which implies that . Choosing such that will guarantee that one of the eigenvalues of is negative and the other is positive, making a strict saddle point. This shows that non-degenerate differential Nash equilibria can be strict saddle points of the combined gradient dynamics.

Hence, for any game , a non-degenerate differential Nash equilibrium is either a locally asymptotically stable equilibrium or a strict saddle point, but it is not strictly unstable or marginally stable (i.e., having all eigenvalues on the imaginary axis).

The proof of Proposition 3.2, which claims that all differential Nash equilibria in zero-sum games are locally asymptotically stable, again just relies on basic linear algebra and the definition of a differential Nash equilibrium.

{proof}

[Proof of Proposition 3.2] Consider a two player game on with . For such a game,

Note that . Suppose that is a differential Nash equilibrium and let with and . Then, since and for , a differential Nash equilibrium. Since is arbitrary, this implies that is positive definite and hence, clearly non-degenerate. Thus, for two-player zero-sum games, all differential Nash equilibria are both non-degenerate differential Nash equilibria and locally asymptotically stable equilibria of

The proof that all locally asymptotically stable equilibria in potential games are differential Nash equilibria relies on the symmetry of in potential games.

{proof}

[Proof of Proposition 3.3] The proof follows from the definition of a potential game. Since is a potential game, it admits a potential function such that for all . This, in turn, implies that at a locally asymptotically stable equilibrium of , , where is the Hessian matrix of the function . Further must have strictly positive eigenvalues for to be a locally asymptotically stable equilibrium of . Since the Hessian matrix of a function must be symmetric, , must be positive definite, which through Sylvester’s criterion ensures that each of the diagonal blocks of is positive definite. Thus, we have that the existence of a potential function guarantees that the only locally asymptotically stable equilibria of , are differential Nash equilibria.

A.2 Proofs for Deterministic Setting

We now present the proof of Theorem 4.1 and its corollaries. The proof of relies on the celebrated stable manifold theorem (Shub, 1978, Theorem III.7), Smale (1967). Given a map , we use the notation to denote the –times composition of . {theorem}[Center and Stable Manifolds (Shub, 1978, Theorem III.7), Smale (1967)] Let be a fixed point for the local diffeomorphism where is an open neighborhood of in and . Let be the invariant splitting of into generalized eigenspaces of corresponding to eigenvalues of absolute value less than one, equal to one, and greater than one. To the invariant subspace there is an associated local –invariant embedded disc called the local stable center manifold of dimension and ball around such that , and if for all , then . Some parts of the proof follow similar arguments to the proofs of results in Lee et al. (2016); Panageas and Piliouras (2016) which apply to (single-agent) gradient-based optimization. Due to the different learning rates employed by the agents and the introduction of the differential game form , the proof differs.

{proof}

[Proof of Theorem 4.1] The proof is composed of two parts: (a) the map is a diffeomorphism, and (b) application of the stable manifold theorem to conclude that the set of initial conditions is measure zero.

(a) is diffeomorphism We claim the mapping is a diffeomorphism. If we can show that is invertible and a local diffeomorphism, then the claim follows. Consider and suppose so that . The assumption implies that satisfies the Lipschitz condition on . Hence, . Let where —that is, is an diagonal matrix with repeated on the diagonal times. Then, since .

Now, observe that . If is invertible, then the implicit function theorem (Lee, 2012, Theorem C.40) implies that is a local diffeomorphism. Hence, it suffices to show that does not have an eigenvalue of . Indeed, letting be the spectral radius of a matrix , we know in general that for any square matrix and induced operator norm so that Of course, the spectral radius is the maximum absolute value of the eigenvalues, so that the above implies that all eigenvalues of have absolute value less than .

Since is injective by the preceding argument, its inverse is well-defined and since is a local diffeomorphism on , it follows that is smooth on . Thus, is a diffeomorphism.

(b) Application of the stable manifold theorem Consider all critical points of the game—i.e. . For each , let be the open ball derived from Theorem A.2 and let . Since , Lindelöf's lemma Kelley (1955)—every open cover has a countable subcover—gives a countable subcover of . That is, for a countable set of critical points with , we have that .

Starting from some point , if gradient-based learning converges to a strict saddle point, then there exists a and index such that for all . Again, applying Theorem A.2 and using that —which we note is obviously true if —we get that .

Using the fact that