Convergence Analysis of Gradient-Based Learning with Non-Uniform Learning Rates in Non-Cooperative Multi-Agent Settings
Considering a class of gradient-based multi-agent learning algorithms in non-cooperative settings, we provide local convergence guarantees to a neighborhood of a stable local Nash equilibrium. In particular, we consider continuous games where agents learn in (i) deterministic settings with oracle access to their gradient and (ii) stochastic settings with an unbiased estimator of their gradient. Utilizing the minimum and maximum singular values of the game Jacobian, we provide finite-time convergence guarantees in the deterministic case. In the stochastic case, we provide concentration bounds guaranteeing that, with high probability, agents converge to a neighborhood of a stable local Nash equilibrium in finite time. Unlike other works in this vein, we also study the effects of non-uniform learning rates on the learning dynamics and convergence rates. We find that, much like preconditioning in optimization, non-uniform learning rates distort the vector field, which can in turn change the rate of convergence and the shape of the region of attraction. The analysis is supported by numerical examples that illustrate different aspects of the theory. We conclude with a discussion of the results and open questions.
The characterization and computation of equilibria such as Nash equilibria and their refinements constitute a significant focus in non-cooperative game theory. Several natural questions arise, including “how do players find such equilibria?” and “how should the learning process be interpreted?” With these questions in mind, a variety of fields have focused their attention on the problem of learning in games. This has, in turn, led to a plethora of learning algorithms including gradient play, fictitious play, best response, and multi-agent reinforcement learning, among others.
From an applications point of view, a more recent trend is the adoption of game theoretic models of algorithm interaction in machine learning applications. For instance, game theoretic tools are being used to improve the robustness and generalizability of machine learning algorithms; e.g., generative adversarial networks have become a popular topic of study demanding the use of game theoretic ideas to provide performance guarantees. In other work from the learning community, game theoretic concepts are being leveraged to analyze the interaction of learning agents; see, e.g., [15, 21, 3, 33, 23]. Even more recently, convergence analysis to Nash equilibria has been called into question; in its place is a proposal to consider game dynamics as the meaning of the game. This is an interesting perspective, as it is well known that in general learning dynamics do not reach a Nash equilibrium even asymptotically and, perhaps more interestingly, many learning dynamics exhibit rich limiting behaviors including periodic orbits and chaos; see, e.g., [6, 7, 17, 16].
Despite this activity, we still lack a complete understanding of the dynamics and limiting behaviors of coupled, competing learning algorithms. One may imagine that the myriad results on convergence of gradient descent in optimization readily extend to the game setting. They do not, since gradient-based learning schemes in games do not correspond to gradient flows, a class of flows that are guaranteed to converge to local minimizers almost surely. In particular, the gradient-based learning dynamics for competitive, multi-agent settings have a non-symmetric Jacobian; as a consequence, their dynamics may admit complex eigenvalues and non-equilibrium limiting behavior such as periodic orbits. This makes it difficult to extend convergence results from single-agent optimization to multi-agent settings, primarily because steps in the direction of the individual gradients of players’ costs do not guarantee that each agent’s cost decreases. In fact, in games, as our examples highlight, a player’s cost can increase when they follow the gradient of their own cost. Counterintuitively, agents can also converge to local maxima of their own costs despite descending their own gradient. These behaviors are due to the coupling between the agents.
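The rotational behavior described above can be seen in a minimal sketch. The zero-sum game used here, with costs f1(x, y) = x*y and f2(x, y) = -x*y, is an illustrative assumption rather than one of the paper's examples, and the step size 0.1 is arbitrary.

```python
import numpy as np

# Illustrative game (an assumption, not taken from the paper):
#   player 1 minimizes f1(x, y) = x * y over x
#   player 2 minimizes f2(x, y) = -x * y over y
def omega(z):
    """Stack each player's gradient with respect to its own variable."""
    x, y = z
    return np.array([y, -x])  # (d f1 / dx, d f2 / dy)

# Jacobian of omega: non-symmetric, with purely imaginary eigenvalues.
J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
eigs = np.linalg.eigvals(J)

# Simultaneous gradient play with a uniform step size.
gamma = 0.1
z = np.array([1.0, 0.0])
norms = [np.linalg.norm(z)]
for _ in range(100):
    z = z - gamma * omega(z)
    norms.append(np.linalg.norm(z))

print("eigenvalues of J:", eigs)
print("distance from equilibrium, first and last:", norms[0], norms[-1])
```

In this particular game each step multiplies the distance to the equilibrium at the origin by sqrt(1 + gamma^2), so the iterates spiral outward even though both players follow their own negative gradients.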
Some of the questions that remain unaddressed and to which we provide partial answers include the derivation of error bounds and convergence rates. These are important for ensuring performance guarantees on the collective behavior and can help provide guarantees on subsequent control or incentive policy synthesis. We also investigate the question of how naturally arising features of the learning process for autonomous agents, such as their learning rates, impact the learning path and limiting behavior. This further exposes interesting questions about the overall quality of the limiting behavior and the cost accumulated along the learning path—e.g., is it better to be a slow or fast learner both in terms of the cost of learning and the learned behavior?
We study convergence of a broad class of gradient-based multi-agent learning algorithms in non-cooperative settings by leveraging the framework of $n$-player continuous games along with tools from numerical optimization and dynamical systems theory. We consider a class of learning algorithms
$$x_i^+ = x_i - \gamma_i\, g_i(x_i, x_{-i})$$
where $x_i$ is the choice variable or action of player $i$, $\gamma_i$ is its learning rate, and $g_i$ is derived from the gradient of a function that abstractly represents the cost of player $i$. The key feature of non-cooperative settings is the coupling of an agent’s cost through all other agents’ choice variables $x_{-i}$.
We consider two settings: (i) agents have oracle access to $g_i$ and (ii) agents have an unbiased estimator for $g_i$. The class of gradient-based learning algorithms we study encompasses a wide variety of approaches to learning in games, including multi-agent policy gradient, gradient-based approaches to adversarial learning, and multi-agent gradient-based online optimization. For both the deterministic (oracle gradient access) and the stochastic (unbiased estimators) settings, we provide convergence results for uniform learning rates (i.e., where $\gamma_i = \gamma$ for each player $i$) and for non-uniform learning rates. The latter arise more naturally in the study of the limiting behavior of autonomous learning agents.
In the deterministic setting, we derive asymptotic and finite-time convergence rates for the coupled learning processes to a refinement of local Nash equilibria known as differential Nash equilibria (a class of equilibria that are generic amongst local Nash equilibria). In the stochastic setting, leveraging results from stochastic approximation and dynamical systems, we derive asymptotic convergence guarantees to stable local Nash equilibria as well as high-probability, finite-time guarantees for convergence to a neighborhood of a Nash equilibrium. The analytical results are supported by several illustrative numerical examples. We also discuss the effect of non-uniform learning rates on the learning path; that is, different learning rates warp the vector field dynamics. Coordinate-based learning rates are typically leveraged in gradient-based optimization schemes to speed up convergence or avoid poor quality local minima. In games, however, the interpretation is slightly different, since each coordinate of the dynamics corresponds to minimizing a different cost function along the respective coordinate axis. The resulting distortion of the vector field leads the joint action to a point that has a lower value for the slower player, relative to the flow of the dynamics under a uniform learning rate and the same initialization. In this sense, it seems that the answer to the question posed above is that it is most beneficial for an agent to have the slower learning rate.
The remainder of the paper is organized as follows. We start with mathematical and game-theoretic preliminaries in Section 2 which is followed by the main convergence results for the deterministic setting (Section 3) and the stochastic setting (Section 4). Within each of the latter two sections, we present convergence results for both the case where agents have uniform and non-uniform learning rates. In Section 5, we present several numerical examples which help to illustrate the theoretical results and also highlight some directions for future inquiry. Finally, we conclude with discussion and future work in Section 6.
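Consider a setting in which at iteration $k$, each agent $i \in \{1, \ldots, n\}$ updates their choice variable $x_i \in X_i$ by the process
$$x_{i,k+1} = x_{i,k} - \gamma_{i,k}\, g_i(x_{i,k}, x_{-i,k}) \qquad (1)$$
where $\gamma_{i,k}$ is agent $i$’s learning rate, $x_{-i,k} = (x_{j,k})_{j \neq i}$ denotes the choices of all agents excluding the $i$-th agent, and $x_k = (x_{i,k}, x_{-i,k}) \in X$. Within the above setting, the class of learning algorithms we consider is such that for each $i$, there exists a sufficiently smooth function $f_i : X \to \mathbb{R}$ such that $g_i$ is either $D_i f_i$, where $D_i$ denotes the derivative with respect to $x_i$, or an unbiased estimator of $D_i f_i$, i.e., $g_i = \widehat{D_i f_i}$ where $\mathbb{E}[\widehat{D_i f_i}] = D_i f_i$.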
The collection of costs $(f_1, \ldots, f_n)$ on $X = X_1 \times \cdots \times X_n$, where $f_i : X \to \mathbb{R}$ is agent $i$’s cost function and $X_i$ is their action space, defines a continuous game. In this continuous game abstraction, each player aims to select an action that minimizes their cost given the actions of all other agents, $x_{-i}$. That is, players myopically update their actions by following the gradient of their cost with respect to their own choice variable. For a symmetric matrix $A \in \mathbb{R}^{d \times d}$, let $\lambda_1(A), \ldots, \lambda_d(A)$ be its eigenvalues. For a matrix $A$, let $\mathrm{spec}(A)$ be the spectrum of $A$.
For each , for and is –Lipschitz.
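Let $D_i^2 f_i$ denote the second partial derivative of $f_i$ with respect to $x_i$ and $D_{ji} f_i$ denote the partial derivative of $D_i f_i$ with respect to $x_j$. The game Jacobian, i.e., the Jacobian of the vector of individual gradients $\omega(x) = (D_1 f_1(x), \ldots, D_n f_n(x))$, is given by
$$J(x) = \begin{bmatrix} D_1^2 f_1(x) & \cdots & D_{n1} f_1(x) \\ \vdots & \ddots & \vdots \\ D_{1n} f_n(x) & \cdots & D_n^2 f_n(x) \end{bmatrix}.$$
The entries of the above matrix depend on $x$; however, we drop this dependence where obvious. Note that each $D_i^2 f_i$ is symmetric under Assumption 1, yet $J$ is not. This is an important point and causes the subsequent analysis to deviate from the typical analysis of (stochastic) gradient descent.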
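The structure of the game Jacobian can be checked numerically. The following is a minimal sketch under stated assumptions: the two-player quadratic game, its coefficients `A`, `b`, `c`, `d`, and the helper names `omega` and `game_jacobian` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical two-player game on R^2 x R^1 (not from the paper):
#   f1(x1, x2) = 0.5 * x1^T A x1 + x2 * (b^T x1)
#   f2(x1, x2) = 0.5 * d * x2^2 - x2 * (c^T x1)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
c = np.array([0.5, 2.0])
d = 1.5

def omega(x):
    """Stack each player's gradient with respect to its own variable."""
    x1, x2 = x[:2], x[2]
    g1 = A @ x1 + x2 * b          # D_1 f_1
    g2 = d * x2 - c @ x1          # D_2 f_2
    return np.concatenate([g1, [g2]])

def game_jacobian(x, eps=1e-6):
    """Central-difference Jacobian of omega at x."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (omega(x + e) - omega(x - e)) / (2 * eps)
    return J

J = game_jacobian(np.zeros(3))
print("diagonal block of player 1 symmetric:",
      np.allclose(J[:2, :2], J[:2, :2].T, atol=1e-6))
print("full Jacobian symmetric:", np.allclose(J, J.T, atol=1e-6))
```

The diagonal blocks are Hessians of a single cost and are therefore symmetric, while the off-diagonal blocks come from different cost functions and need not be transposes of each other.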
The most common characterization of limiting behavior in games is that of a Nash equilibrium. The following definitions are useful for our analysis.
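A strategy $x \in X$ is a local Nash equilibrium for the game $(f_1, \ldots, f_n)$ if for each $i \in \{1, \ldots, n\}$ there exists an open set $W_i \subset X_i$ such that $x_i \in W_i$ and $f_i(x_i, x_{-i}) \le f_i(x_i', x_{-i})$ for all $x_i' \in W_i$. If the above inequalities are strict, $x$ is a strict local Nash equilibrium.
A point $x \in X$ is said to be a critical point for the game if $\omega(x) = 0$, where $\omega(x) = (D_1 f_1(x), \ldots, D_n f_n(x))$ is the vector of individual gradients.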
We denote the set of critical points as . Analogous to single-player optimization settings, for each player, viewing all other players’ actions as fixed, there are necessary and sufficient conditions which characterize local optimality.
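Proposition 1.
If $x$ is a local Nash equilibrium of the game $(f_1, \ldots, f_n)$, then $\omega(x) = 0$ and $D_i^2 f_i(x) \succeq 0$ for each $i$. On the other hand, if $\omega(x) = 0$ and $D_i^2 f_i(x) \succ 0$ for each $i$, then $x$ is a local Nash equilibrium.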
The sufficient conditions in the above result give rise to the following definition of a differential Nash equilibrium.
Definition 3.
A strategy $x \in X$ is a differential Nash equilibrium if $\omega(x) = 0$ and $D_i^2 f_i(x) \succ 0$ for each $i \in \{1, \ldots, n\}$.
Differential Nash equilibria need not be isolated. However, if $J(x)$ is non-degenerate, meaning that $\det J(x) \neq 0$, for a differential Nash equilibrium $x$, then $x$ is an isolated strict local Nash equilibrium. Non-degenerate differential Nash equilibria are generic amongst local Nash equilibria and they are structurally stable, which ensures they persist under small perturbations. This result also implies an asymptotic convergence result: if the spectrum of $J$ is strictly in the right-half plane (i.e., $\mathrm{Re}(\lambda) > 0$ for all $\lambda \in \mathrm{spec}(J(x))$), then a differential Nash equilibrium is (exponentially) attracting under the flow of $-\omega$ [28, Proposition 2]. We say such equilibria are stable.
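The two conditions above (positive definite diagonal blocks, and a Jacobian spectrum in the open right-half plane) are easy to test numerically at a given critical point. The sketch below is only an illustration: the function name `classify_critical_point` and the three example Jacobians are hypothetical, and the caller is assumed to have already verified that every player's own gradient vanishes at the point.

```python
import numpy as np

def classify_critical_point(J, dims):
    """Return (is_differential_nash, is_stable) for a critical point.

    J    : game Jacobian evaluated at a point where every player's
           own gradient vanishes (the caller must check that).
    dims : list with the dimension of each player's action space.
    """
    is_dne = True
    start = 0
    for d in dims:
        block = J[start:start + d, start:start + d]   # D_i^2 f_i
        sym = 0.5 * (block + block.T)                 # guard against round-off
        if np.linalg.eigvalsh(sym).min() <= 0:
            is_dne = False
        start += d
    is_stable = bool(np.all(np.linalg.eigvals(J).real > 0))
    return is_dne, is_stable

# Three hypothetical 2x2 Jacobians (two scalar players each).
J_stable_nash = np.array([[1.0, 2.0], [-2.0, 1.0]])      # eigenvalues 1 +/- 2i
J_unstable_nash = np.array([[1.0, 3.0], [3.0, 1.0]])     # eigenvalues 4 and -2
J_stable_non_nash = np.array([[-0.5, 2.0], [-2.0, 1.5]]) # real parts 0.5

for name, J in [("stable Nash", J_stable_nash),
                ("unstable Nash", J_unstable_nash),
                ("stable non-Nash", J_stable_non_nash)]:
    print(name, classify_critical_point(J, dims=[1, 1]))
```

The third matrix illustrates that the two properties are independent: the point attracts the gradient dynamics although player 1 sits at a local maximum of its own cost.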
3 Deterministic Setting
The multi-agent learning framework we analyze is such that each agent’s rule for updating their choice variable consists of the agent modifying their action $x_i$ in the direction of their individual gradient $D_i f_i$. Let us first consider the setting in which each agent has oracle access to $g_i$. The learning dynamics are given by
$$x_{k+1} = x_k - \Gamma\, \omega(x_k) \qquad (2)$$
where $\Gamma = \mathrm{blockdiag}(\gamma_1 I_{d_1}, \ldots, \gamma_n I_{d_n})$ with $I_{d_i}$ denoting the $d_i \times d_i$ identity matrix. Within this setting we consider both the case where the agents have a constant uniform learning rate, i.e., $\gamma_i = \gamma$ for each $i$, and the case where their learning rates are non-uniform but constant, i.e., $\gamma_i$ is not necessarily equal to $\gamma_j$ for any $i \neq j$.
Let be the symmetric part of . Define
where is a –radius ball around . For a stable differential Nash , let be a ball of radius around the equilibrium that is contained in the region of attraction for . (Many techniques exist for approximating the region of attraction; e.g., given a Lyapunov function, its largest invariant level set can be used as an approximation. Since , the converse Lyapunov theorem guarantees the existence of a local Lyapunov function.) Let with be the largest ball contained in the region of attraction of .
3.1 Uniform Learning Rates
With $\gamma_i = \gamma$ for each $i$, the learning rule (2) can be thought of as a discretized numerical scheme approximating the continuous time dynamics
$$\dot{x} = -\omega(x).$$
With a judicious choice of learning rate $\gamma$, (2) will converge (at an exponential rate) to a locally stable equilibrium of the dynamics.
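Consider an $n$-player continuous game $(f_1, \ldots, f_n)$ satisfying Assumption 1. Let $x^*$ be a stable differential Nash equilibrium. Suppose agents use the gradient-based learning rule $x_{k+1} = x_k - \gamma\, \omega(x_k)$ with learning rates $0 < \gamma < \tilde{\gamma}$, where $\tilde{\gamma}$ is the smallest positive $h$ such that $\max_j |1 - h \lambda_j(J(x^*))| = 1$. Then, for $x_0 \in B_r(x^*)$, $x_k \to x^*$ exponentially.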
The above result provides a range for the possible learning rates for which (2) converges to a stable differential Nash equilibrium $x^*$ of the game, assuming agents initialize in a ball contained in the region of attraction of $x^*$. Note that the usual assumption in gradient-based approaches to single-objective optimization problems (in which case $J$ is symmetric) is that $\gamma < 1/L$, where the objective being minimized has an $L$-Lipschitz gradient. This is sufficient to guarantee convergence since the spectral radius of a matrix never exceeds any of its operator norms, which in turn ensures that $|1 - \gamma \lambda_j| < 1$ for each eigenvalue $\lambda_j$ of $J$. If the game is a potential game, i.e., there exists a function $\phi$ such that $D_i f_i = D_i \phi$ for each $i$, which occurs if and only if $J$ is symmetric, then the convergence analysis coincides with that of gradient descent, so that any $\gamma < 1/L$, where $L$ is the Lipschitz constant of $\omega$, results in local asymptotic convergence.
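For a linear vector field, the admissible range of uniform learning rates can be computed in closed form from the eigenvalues of the Jacobian. The sketch below works this out under the assumption that omega(x) = J x for a hypothetical 2x2 matrix `J`; the helper names `spectral_radius`, `max_uniform_rate`, and `run` are illustrative and not from the paper.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue modulus of M."""
    return np.abs(np.linalg.eigvals(M)).max()

def max_uniform_rate(J):
    """Supremum of rates gamma for which rho(I - gamma * J) < 1.

    For an eigenvalue lam of J with positive real part,
    |1 - gamma * lam| < 1  iff  0 < gamma < 2 * Re(lam) / |lam|^2.
    """
    lam = np.linalg.eigvals(J)
    if np.any(lam.real <= 0):
        raise ValueError("equilibrium is not stable: no admissible rate")
    return (2 * lam.real / np.abs(lam) ** 2).min()

J = np.array([[1.0, 2.0], [-2.0, 1.0]])   # hypothetical, eigenvalues 1 +/- 2i
gamma_max = max_uniform_rate(J)            # 2 * 1 / 5 = 0.4

def run(gamma, steps=200):
    """Iterate x <- x - gamma * J x and return the final distance to 0."""
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - gamma * (J @ x)
    return np.linalg.norm(x)

print("upper limit on gamma:", gamma_max)
print("final error with gamma = 0.2:", run(0.2))
print("final error with gamma = 0.5:", run(0.5))
```

Rates below the limit contract toward the equilibrium, while a rate above it makes the same iteration diverge, even though the continuous time flow is stable in both cases.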
The convergence guarantee in Proposition 2 is asymptotic in nature. It is often useful, from both an analysis and synthesis perspective, to have non-asymptotic or finite-time convergence results. Such results can be used to provide guarantees on decision-making processes wrapped around the coupled learning processes of the otherwise autonomous agents. The next result provides a finite-time convergence guarantee for gradient-based learning where agents uniformly use a fixed step size.
Let $B_r(x^*)$ be defined as before with the added condition that it be the largest ball in the region of attraction on which the symmetric part of $J$, i.e., $S = \frac{1}{2}(J + J^\top)$, is positive definite.
Consider a game on satisfying Assumption 1. Let be a stable differential Nash equilibrium. Suppose and that . Then, given , the gradient-based learning dynamics with learning rate obtains an –differential Nash such that for all
Before we proceed to the proof, let us remark on the assumption that . First, is always true; indeed, suppressing the dependence on ,
where denotes the largest singular value of its argument. Thus, the condition that is generally true; for equality to hold, the symmetric part of would have repeated eigenvalues, which is not generic. Hence, we include this assumption in Theorem 1, but note that it is not restrictive and is fairly benign.
Proof of Theorem 1.
First, note that where . Now, given , by the mean value theorem,
Hence, it suffices to show that for the choice of , the eigenvalues of are in the unit circle. Indeed, since , we have that
If is less than one, then the dynamics are contracting. For notational convenience, we drop the explicit dependence on . Since on ,
where the last inequality holds for . Hence,
Since , we have that so that
This, in turn, implies that for all . ∎
Note that is selected to minimize . Hence, this is the fastest learning rate given the worst case eigenstructure of over the ball for the choice of operator norm . We note, however, that faster convergence is possible as indicated by Proposition 2 and observed in the examples in Section 5. Indeed, the spectral radius of a matrix never exceeds its maximum singular value, i.e., $\rho(A) \le \sigma_{\max}(A)$, so it is possible to contract at a faster rate. We remark that if $J$ were symmetric (i.e., in the case of a potential game or a single-agent optimization problem), then $\rho(I - \gamma J) = \|I - \gamma J\|_2$. In games, however, $J$ is not symmetric.
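The gap between a norm-based contraction rate and the spectral radius can be illustrated on a single non-normal matrix. The sketch below uses a standard bound of this type, namely that the squared 2-norm of I - g J is at most 1 - 2 g mu + g^2 sigma^2, where `mu` is the smallest eigenvalue of the symmetric part and `sigma` the largest singular value of J. The matrix `J` and the names `mu`, `sigma`, and `bound` are assumptions made for this example, and the bound is not claimed to match the constants of Theorem 1 exactly.

```python
import numpy as np

J = np.array([[2.0, 1.0], [-3.0, 2.0]])   # hypothetical non-normal Jacobian
S = 0.5 * (J + J.T)                        # symmetric part

mu = np.linalg.eigvalsh(S).min()           # smallest eigenvalue of S
sigma = np.linalg.norm(J, 2)               # largest singular value of J

# ||I - g J||_2^2 <= 1 - 2 g mu + g^2 sigma^2, minimized at g = mu / sigma^2.
gamma = mu / sigma ** 2
bound = np.sqrt(1 - mu ** 2 / sigma ** 2)

M = np.eye(2) - gamma * J
op_norm = np.linalg.norm(M, 2)                 # actual one-step contraction
rho = np.abs(np.linalg.eigvals(M)).max()       # asymptotic contraction rate
print(f"gamma={gamma:.4f} bound={bound:.4f} norm={op_norm:.4f} rho={rho:.4f}")
```

For this matrix the three quantities are strictly ordered (spectral radius, then operator norm, then the bound), which is one way to see why observed convergence can be faster than a singular-value-based guarantee suggests.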
3.2 Non-Uniform Learning Rates
Let us now consider the case when agents have their own individual learning rate , yet still have oracle access to their individual gradients. This is, of course, more natural in the study of autonomous learning agents as opposed to efforts for computing Nash equilibria for a given game.
Consider an –player game satisfying Assumption 1. Let be a stable differential Nash equilibrium. Suppose agents use the gradient-based learning rule with learning rates such that for all . Then, for , exponentially.
The proof is a direct application of Ostrowski’s theorem. We provide a simple proof via a Lyapunov argument for posterity.
Mazumdar and Ratliff  show that (2) will almost surely avoid strict saddle points of the dynamics, some of which are Nash equilibria in non-zero sum games. Note that the set of critical points contains more than just the local Nash equilibria. Hence, except on a set of measure zero, (2) will converge to a stable attractor of which includes stable limit cycles and stable local non-Nash critical points.
Letting , since for some , , the expansion
holds, where satisfies so that given , there exists an such that for all .
Suppose that for all so that there exists such that for all . For , let be the largest such that for all . Furthermore, let , where , be arbitrary. Then, given , gradient-based learning with learning rates obtains an –differential Nash equilibrium in finite time—i.e., for all where .
We note that the proposition can be more generally stated with the assumption that , in which case there exists some defined in terms of bounds on powers of . We provide the proof of this in Appendix A.1. We also note that these results hold even if is not a diagonal matrix as we have assumed as long as .
A perhaps more interpretable finite bound stated in terms of the game structure can also be obtained. Consider the case in which players adopt learning rates with . Given a stable differential Nash equilibrium , let be the largest ball of radius contained in the region of attraction on which is positive definite where so that , and define
Given a stable differential Nash equilibrium , let be the largest ball contained in the region of attraction on which is positive definite—i.e., .
Suppose that Assumption 1 holds and that is a stable differential Nash equilibrium. Let , , , and for each , with . Then, given , the gradient-based learning dynamics with learning rates obtain an –differential Nash such that for all
First, note that where . Now, given , by the mean value theorem,
Hence, it suffices to show that for the choice of , the eigenvalues of live in the unit circle. Then an inductive argument can be made with the inductive hypothesis that . Let . Then we need to show that has eigenvalues in the unit circle. Since , we have that
If is less than one, where the norm is the operator –norm, then the dynamics are contracting. For notational convenience, we drop the explicit dependence on . Then,
The first inequality holds since . Indeed, first observe that the singular values of are the same as those of since the latter is positive definite symmetric. Thus, by noting that and employing Cauchy-Schwarz, we get that and hence, the inequality.
Using the above to bound , we have . Since , so that . This, in turn, implies that for all .
Multiple learning rates lead to a scaling of the rows of the Jacobian, which can have a significant effect on the eigenstructure of the matrix, thereby making the relationship between and difficult to reason about. Nonetheless, there are numerous approaches to solving nonlinear systems of equations (or differential equations expressed as a system of nonlinear equations) that employ preconditioning (i.e., coordinate scaling). The purpose of using a preconditioning matrix is to rescale the problem and achieve better or faster convergence. Many of these results directly translate to convergence guarantees for learning in games when the learning rates are not uniform; however, when the goal is to understand convergence properties for autonomous agents learning an equilibrium, as opposed to computing an equilibrium, the preconditioner is not subject to design. Perhaps this reveals an interesting direction of future research in terms of synthesizing games or learning rules via incentivization or otherwise exogenous control policies for either coordinating agents or improving the learning process, e.g., using incentives to induce a particular equilibrium while also encouraging faster learning.
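The effect of row scaling on the eigenstructure can be seen directly on a small example. The sketch below compares a uniform and a non-uniform choice of rates for a hypothetical linear game with omega(x) = J x; the matrix `J`, the two rate vectors, and the helper names `rho_update` and `simulate` are assumptions made for illustration.

```python
import numpy as np

J = np.array([[1.0, 2.0], [-2.0, 1.0]])   # hypothetical game Jacobian

def rho_update(rates):
    """Spectral radius of I - Gamma J for Gamma = diag(rates)."""
    Gamma = np.diag(rates)
    return np.abs(np.linalg.eigvals(np.eye(2) - Gamma @ J)).max()

def simulate(rates, steps=300):
    """Iterate x <- x - Gamma J x and return the final distance to 0."""
    Gamma = np.diag(rates)
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - Gamma @ (J @ x)
    return np.linalg.norm(x)

uniform = np.array([0.3, 0.3])
non_uniform = np.array([0.3, 0.03])       # player 2 learns ten times slower

for name, rates in [("uniform", uniform), ("non-uniform", non_uniform)]:
    eig_scaled = np.linalg.eigvals(np.diag(rates) @ J)
    print(name, "spec(Gamma J) =", eig_scaled,
          "rho =", rho_update(rates), "final error =", simulate(rates))
```

Both choices converge here, but the eigenvalues of Gamma J and the resulting contraction rates differ, and nothing in this example should be read as a general statement about which choice is faster.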
4 Stochastic Setting
In this section, we consider gradient-based learning rules for each agent where the agent does not have oracle access to their individual gradients, but rather has an unbiased estimator in its place. In particular, for each player , consider the noisy gradient-based learning rule given by
where is the learning rate and is an independent identically distributed stochastic process. In order to prove a high-probability, finite sample convergence rate, we can leverage recent results for convergence of nonlinear stochastic approximation algorithms. The key is in formulating the learning rule for the agents and in leveraging the notion of a stable differential Nash equilibrium, which has properties analogous to those of a locally stable equilibrium for a nonlinear dynamical system. Making the link between the discrete time learning update and the limiting continuous time differential equation and its equilibria allows us to draw on rich existing convergence analysis tools.
In the first part of this section, we provide convergence rate results for the case where the agents use a uniform learning rate—i.e. . In the second part of this section, we extend these results to the case where agents use non-uniform learning rates—that is, each agent has its own learning rate —by incorporating some additional assumptions and leveraging two-timescale analysis techniques from dynamical systems theory.
We require some modified assumptions in this section on the learning process structure.
The gradient-based learning rule (3) satisfies the following:
Given the filtration , are conditionally independent. Moreover, for each , almost surely (a.s.), and a.s. for some constants .
For each , the stepsize sequence contains positive scalars such that
Each for some and each and are – and –Lipschitz, respectively.
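A stochastic learning rule of this kind can be sketched in a few lines. The example below assumes the standard Robbins-Monro step size conditions (step sizes that are not summable but are square summable), which is what the stepsize assumption above appears to require; the linear game `J`, the Gaussian noise model, the noise level, and the schedule 1/(k + 10) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
J = np.array([[1.0, 2.0], [-2.0, 1.0]])   # hypothetical game, omega(x) = J x
noise_std = 0.5                            # assumed noise level

def noisy_gradient(x):
    """Unbiased estimate of omega(x): true gradient plus zero-mean noise."""
    return J @ x + noise_std * rng.standard_normal(2)

x = np.array([1.0, 1.0])
errors = []
for k in range(5000):
    gamma_k = 1.0 / (k + 10)               # sum diverges, sum of squares converges
    x = x - gamma_k * noisy_gradient(x)
    errors.append(np.linalg.norm(x))

print("error after 10 steps:  ", errors[9])
print("error after 5000 steps:", errors[-1])
```

Because the step sizes decay, the influence of the noise is averaged out over time and the iterates settle into a shrinking neighborhood of the equilibrium at the origin.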
4.1 Uniform Learning Rates
Before concluding, we specialize to the case in which agents have the same learning rate sequence for each .
Suppose that is a stable differential Nash equilibrium of the game and that Assumption 2 holds (excluding A2b.iii). For each , let and
Fix any such that where is the region of attraction of . There exist constants and functions and so that whenever and , where is such that for all , the samples generated by the gradient-based learning rule satisfy
where the constants depend only on parameters and the dimension . Then stochastic gradient-based learning in games obtains an –stable differential Nash in finite time with high probability.
The above theorem implies that for all with high probability for some constant that depends only on , and .
Since is a stable differential Nash equilibrium, is positive definite and is positive definite for each . Thus is a locally asymptotically stable hyperbolic equilibrium point of . Hence, the assumptions of Theorem 1.1  are satisfied so that we can invoke the result which gives us the high probability bound for stochastic gradient-based learning in games. ∎
The above theorem has a direct corollary specializing to the case where the gradient-based learning rule with uniform stepsizes is initialized inside a ball of radius contained in the region of attraction, i.e., .
4.2 Non-Uniform Learning Rates
Consider now that agents have their own learning rates for each . In environments with several autonomous agents, as compared to the objective of computing Nash equilibria in a game, it is perhaps more reasonable to consider the scenario in which the agents have their own individual learning rate. For the sake of brevity, we show the convergence result in detail for the two agent case—that is, where . We note that the extension to agents is straightforward. The proof leverages recent results from the theory of stochastic approximation presented in  and we note that our objective here is to show that they apply to games and provide commentary on the interpretation of the results in this context.
The gradient-based learning rules are given by
so that with , in the limit , the above system can be thought of as approximating the singularly perturbed system
Indeed, since , i.e., at a faster rate than , updates to appear to be equilibrated for the current quasi-static as the dynamics in (5) suggest.
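The timescale separation can be illustrated with a small simulation. The sketch below makes several assumptions that are not taken from the paper: player 1 is treated as the fast learner, the game is a hypothetical quadratic game with scalar actions, the noise is Gaussian, and the step size schedules are chosen so that the ratio of slow to fast step sizes tends to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic game with scalar players:
#   D_1 f_1(x) = x1 + 2 * x2,   D_2 f_2(x) = -2 * x1 + x2
def grad1(x1, x2):
    return x1 + 2.0 * x2

def grad2(x1, x2):
    return -2.0 * x1 + x2

def best_response_1(x2):
    """Root of D_1 f_1(., x2): the fast player's quasi-static equilibrium."""
    return -2.0 * x2

noise_std = 0.1                            # assumed noise level
x1, x2 = 1.0, 1.0
for k in range(5000):
    fast = (k + 10) ** -0.6                # player 1 step size
    slow = 1.0 / (k + 10)                  # player 2 step size, slow / fast -> 0
    g1 = grad1(x1, x2) + noise_std * rng.standard_normal()
    g2 = grad2(x1, x2) + noise_std * rng.standard_normal()
    x1, x2 = x1 - fast * g1, x2 - slow * g2

tracking_error = abs(x1 - best_response_1(x2))
print("x1, x2:", x1, x2)
print("distance of x1 from its best response to x2:", tracking_error)
```

In this example the fast player stays close to its best response to the slowly moving action of the other player, which is the quasi-static picture that the singularly perturbed system describes.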
4.2.1 Asymptotic Convergence in the Non-Uniform Learning Rate Setting
For fixed , the system has a globally asymptotically stable equilibrium .
Define the continuous time accumulated after samples of to be and define for to be the trajectory of . Furthermore, define the event .
Since for some , it is locally Lipschitz and, on the event , it is bounded. It thus induces a continuous globally integrable vector field, and therefore satisfies the assumptions of Proposition 4.1 of . Moreover, under Assumption 2, the assumptions of Proposition 4.2 of  are satisfied. Hence, invoking said propositions, we get the desired result. ∎
This result essentially says that the slow player’s sample path asymptotically tracks the flow of
If we additionally assume that the slow component also has a global attractor, then the above theorem gives rise to a stronger convergence result.
Given as in Assumption 3, the system has a globally asymptotically stable equilibrium .
More generally, the process will converge almost surely to the internally chain transitive set of the limiting dynamics (5) and this set contains the stable Nash equilibria. If the only internally chain transitive sets for (5) are isolated equilibria (this occurs, e.g., if the game is a potential game), then converges almost surely to a stationary point of the dynamics, a subset of which are stable local Nash equilibria.
It is also worth commenting on what types of games will satisfy these assumptions. To satisfy Assumption 3, it is sufficient for the fastest player’s cost function to be convex in their choice variable.
Note that could still be a spurious stable non-Nash point, since the above implies that , which does not necessarily imply that .
Remark 2 (Relaxation to Local Asymptotic Stability.).
Under relaxed assumptions on global asymptotic stability, we can obtain high-probability results on convergence to locally asymptotically stable attractors. If it is assumed that is in the region of attraction for a locally asymptotically stable attractor, then the above results can be stated with only the assumption of local asymptotic stability. However, this is difficult to ensure in practice. To relax the result to a local guarantee regardless of the initialization requires conditioning on an unverifiable event; i.e., the high-probability bound in this case is conditioned on the event belongs to a compact set , which depends on the sample point, of where is the region of attraction of . Nonetheless, it is possible to leverage results from stochastic approximation [10, Chapter 2] to prove local versions of the results for non-uniform learning rates. Further investigation is required to provide concentration bounds for not only games but stochastic approximation in general.
4.2.2 High-Probability, Finite-Sample Guarantees with Non-Uniform Learning Rates
In the stochastic setting, the learning dynamics are stochastic approximation updates, and non-uniform learning rates lead to a multi-timescale setting. The results leverage recent theoretical guarantees for two-timescale analysis of stochastic approximation.
Let denote the linear interpolates between sample points and, as in the preceding sub-section, let denote the continuous time flow of with initial data where . Alekseev’s formula is a nonlinear variation of constants formula that provides solutions to perturbations of differential equations using a local linear approximation. We can apply it to the asymptotic pseudo-trajectories in each timescale. For these local approximations, linear systems theory lets us find growth rate bounds for the perturbations, which can, in turn, be used to bound the normed difference between the continuous time flow and the asymptotic pseudo-trajectories. More detail is provided in Appendix A.2.