# Gradient Descent-Ascent Provably Converges to Strict Local Minmax Equilibria with a Finite Timescale Separation

## Abstract

This paper concerns the local stability and convergence rate of gradient descent-ascent in two-player non-convex, non-concave zero-sum games. We study the role that a finite timescale separation parameter $\tau$ has on the learning dynamics, where the learning rate of player 1 is denoted by $\gamma_1$ and the learning rate of player 2 is defined to be $\gamma_2 = \tau\gamma_1$. Existing work analyzing the role of timescale separation in gradient descent-ascent has primarily focused on the edge cases of players sharing a learning rate ($\tau = 1$) and the maximizing player approximately converging between each update of the minimizing player ($\tau \to \infty$). For the parameter choice of $\tau = 1$, it is known that the learning dynamics are not guaranteed to converge to a game-theoretically meaningful equilibrium in general, as shown by Mazumdar et al. (2020) and Daskalakis and Panageas (2018). In contrast, Jin et al. (2020) showed that the stable critical points of gradient descent-ascent coincide with the set of strict local minmax equilibria as $\tau \to \infty$. In this work, we bridge the gap between past work by showing that there exists a finite timescale separation parameter $\tau^\ast$ such that a critical point $x^\ast$ is a stable critical point of gradient descent-ascent for all $\tau > \tau^\ast$ if and only if it is a strict local minmax equilibrium. Moreover, we provide an explicit construction for computing $\tau^\ast$ along with corresponding convergence rates and results under deterministic and stochastic gradient feedback. The convergence results we present are complemented by a non-convergence result: given a critical point $x^\ast$ that is not a strict local minmax equilibrium, there exists a finite timescale separation $\tau_0$ such that $x^\ast$ is unstable for all $\tau > \tau_0$. Finally, we extend the stability and convergence results regarding gradient descent-ascent to gradient penalty regularization methods for generative adversarial networks (Mescheder et al., 2018) and empirically demonstrate on the CIFAR-10 and CelebA datasets the significant impact timescale separation has on training performance.


## 1 Introduction

In this paper we study learning in zero-sum games of the form

$$\min_{x_1 \in X_1} \max_{x_2 \in X_2} f(x_1, x_2)$$

where the objective function $f$ of the game is assumed to be sufficiently smooth and potentially non-convex and non-concave in the strategy spaces $X_1$ and $X_2$ respectively, with each $X_i$ a precompact subset of $\mathbb{R}^{d_i}$. This general problem formulation has long been fundamental in game theory (Başar and Olsder, 1998) and recently it has become central to machine learning with applications in generative adversarial networks (Goodfellow et al., 2014), robust supervised learning (Madry et al., 2018; Sinha et al., 2018), reinforcement and multi-agent reinforcement learning (Rajeswaran et al., 2020; Zhang et al., 2019), imitation learning (Ho and Ermon, 2016), constrained optimization (Cherukuri et al., 2017), and hyperparameter optimization (MacKay et al., 2019; Lorraine et al., 2020) among several others. From a game-theoretic viewpoint, the work on learning in games can, in some sense, be viewed as explaining how equilibria arise through an iterative competition for optimality (Fudenberg et al., 1998). However, in machine learning, the primary purpose of learning dynamics is to compute equilibria efficiently for the sake of providing meaningful solutions to problems of interest.

As a result of this perspective, there has been significant interest in the study of gradient descent-ascent owing to the fact that the learning rule is computationally efficient and a natural analogue to gradient descent from function optimization. Formally, the learning dynamics are given by each player myopically updating a strategy with an individual gradient as follows:

$$x_1^+ = x_1 - \gamma_1 D_1 f(x_1, x_2), \qquad x_2^+ = x_2 + \gamma_2 D_2 f(x_1, x_2).$$
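As a concrete illustration of the update rule above, the following minimal sketch (our own toy example, not part of the formal development) runs gradient descent-ascent with independent learning rates on the quadratic zero-sum objective $f(x_1, x_2) = \tfrac{1}{2}x_1^2 + 2x_1x_2 - \tfrac{1}{2}x_2^2$, whose unique critical point $(0, 0)$ is a strict local minmax equilibrium:

```python
# Gradient descent-ascent on the toy zero-sum objective
# f(x1, x2) = 0.5*x1**2 + 2*x1*x2 - 0.5*x2**2 (illustrative example).

def grad(x1, x2):
    # D1 f and D2 f for the quadratic objective above.
    return x1 + 2 * x2, 2 * x1 - x2

def gda(x1, x2, lr1, tau, steps):
    """Player 1 descends with rate lr1; player 2 ascends with rate tau*lr1."""
    lr2 = tau * lr1
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        x1, x2 = x1 - lr1 * g1, x2 + lr2 * g2
    return x1, x2

# From (1, 1), the iterates contract toward the equilibrium (0, 0).
x1, x2 = gda(1.0, 1.0, lr1=0.05, tau=2.0, steps=1000)
```

Here the per-step update matrix has spectral radius below one for this choice of learning rates, so the iterates converge linearly to the equilibrium.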

The analysis of gradient descent-ascent is complicated by the intricate optimization landscape in non-convex, non-concave zero-sum games. To begin, there is the fundamental question of what type of solution concept is desired. Given the class of games under consideration, local solution concepts have been proposed and are often taken to be the goal of a learning algorithm. The primary notions of equilibrium that have been adopted are the local Nash and local minmax/Stackelberg concepts, with a focus on the set of strict local equilibria that can be characterized by gradient-based sufficient conditions. Following several past works, from here on we refer to strict local Nash equilibria and strict local minmax/Stackelberg equilibria as differential Nash equilibria and differential Stackelberg equilibria, respectively.

Regardless of the equilibrium notion under consideration, a number of past works highlight failures of standard gradient descent-ascent in non-convex, non-concave zero-sum games. Indeed, it has been shown that gradient descent-ascent with a shared learning rate ($\gamma_1 = \gamma_2$) is prone to reaching critical points that are neither differential Nash equilibria nor differential Stackelberg equilibria (Daskalakis and Panageas, 2018; Mazumdar et al., 2020; Jin et al., 2020). While an important negative result, it does not rule out the prospect that gradient descent-ascent may be able to guarantee equilibrium convergence as it fails to account for a key structural parameter of the learning dynamics, namely the ratio of learning rates between the players.

Motivated by the observation that the order of play between players is fundamental to the definition of the game, the role of timescale separation in gradient descent-ascent has been explored theoretically in recent years (Heusel et al., 2017; Chasnov et al., 2019; Jin et al., 2020). On the empirical side of past work, it has been widely demonstrated and prescribed that timescale separation in gradient descent-ascent between the generator and discriminator, either by heterogeneous learning rates or unrolled updates, is crucial to improving the solution quality when training generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017; Heusel et al., 2017). Denoting $\gamma_1$ as the learning rate of player 1, the learning rate of player 2 can be redefined as $\gamma_2 = \tau\gamma_1$ where $\tau$ is the ratio of learning rates or timescale separation parameter. The work of Jin et al. (2020) took a meaningful step toward understanding the effect of timescale separation in gradient descent-ascent by showing that as $\tau \to \infty$ the stable critical points of the learning dynamics coincide with the set of differential Stackelberg equilibria. In simple terms, the aforementioned result implies that all ‘bad critical points’ (that is, critical points lacking game-theoretic meaning) become unstable as the timescale separation approaches infinity and that all ‘good critical points’ (that is, game-theoretically meaningful equilibria) remain or become stable as the timescale separation approaches infinity. While a promising theoretical development on the local stability of the underlying dynamics, it does not lead to a practical, implementable learning rule or necessarily provide an explanation for the satisfying performance in applications of gradient descent-ascent with a finite timescale separation.
It remains an open question to fully understand gradient descent-ascent as a function of the timescale separation and to determine whether the desirable behavior with an infinite timescale separation is achievable for a range of finite learning rate ratios.

This paper continues the theoretical study of gradient descent-ascent with timescale separation in non-convex, non-concave zero-sum games. We focus our attention on answering the remaining open questions regarding the behavior of the learning dynamics with finite learning rate ratios and provide a number of conclusive results. Notably, we develop necessary and sufficient conditions for a critical point to be stable for a range of finite learning rate ratios. The results imply that differential Stackelberg equilibria are stable for a range of finite learning rate ratios and that non-equilibrium critical points are unstable for a range of finite learning rate ratios. Together, this means gradient descent-ascent only converges to differential Stackelberg equilibria for a range of finite learning rate ratios. To our knowledge, this is the first provable guarantee of its kind for an implementable first-order method. Moreover, the technical results in this work rely on tools that have not appeared in the machine learning and optimization communities analyzing games and expose a number of interesting directions for future research. Explicitly, the notion of a guard map, which is arguably an obscure tool even in modern control and dynamical systems theory, is ‘rediscovered’ in this work as a technique for analyzing the stability of game dynamics.

### 1.1 Contributions

To motivate our primary theoretical results, we present a self-contained description of what is known about the local stability of gradient descent-ascent around critical points in Section 3.1. The existing results primarily concern gradient descent-ascent without timescale separation and with a ratio of learning rates approaching infinity (see Figure 1 for a graphical depiction of known results in each regime). In contrast, this paper is focused on characterizing the stability and convergence of gradient descent-ascent across a range of finite learning rate ratios. To hint at what is achievable in this realm, we present simple examples for which gradient descent-ascent converges to non-equilibrium critical points and games with differential Stackelberg equilibria that are unstable with respect to gradient descent-ascent without timescale separation (see Examples 1 and 2, Section 3). While the existence of such examples is known (Daskalakis and Panageas, 2018; Mazumdar et al., 2020; Jin et al., 2020), we demonstrate in them that a finite timescale separation is sufficient to remedy the undesirable stability properties of gradient descent-ascent without timescale separation.

Toward characterizing this phenomenon in its full generality, we provide intermediate results which are known, but we prove using technical tools not yet broadly seen and exploited by this community. To begin, it is known that differential Nash equilibria are stable with respect to gradient descent-ascent (Mazumdar and Ratliff, 2019; Daskalakis and Panageas, 2018), and that they remain stable for any timescale separation parameter $\tau$ (Jin et al., 2020). We provide a proof for this result (Proposition 3) using the concept of quadratic numerical range (Tretter, 2008). Furthermore, Jin et al. (2020) recently showed that as the timescale separation $\tau \to \infty$, the stable critical points of gradient descent-ascent coincide with the set of differential Stackelberg equilibria. We reveal that this result has long existed in the literature on singularly perturbed systems (Kokotovic et al., 1986, Chapter 2 and the citations within) and provide a proof (see Proposition 4) using analysis methods from the aforementioned line of work that are novel to the literature on learning in games from the machine learning and optimization communities in recent years.

A relevant line of study on singularly perturbed systems is that of characterizing the range of perturbation parameters for which a system is stable (Kokotovic et al., 1986; Saydy et al., 1990; Saydy, 1996). Debatably introduced by Saydy et al. (1990), guardian or guard maps act as a certificate that the roots of a polynomial lie in a particular guarded domain for a range of parameter values. Historically, guard maps serve as a tool for studying the stability of parameterized families of dynamical systems. We bring this tool to learning in games and construct a map that guards a class of Hurwitz stable matrices parameterized by the timescale separation parameter in order to analyze the range of learning rate ratios for which a critical point is stable with respect to gradient descent-ascent. This technique leads to the following result.

###### Informal Statement of Theorem 3.

Consider a sufficiently regular critical point $x^\ast$ of gradient descent-ascent. There exists a $\tau^\ast$ such that $x^\ast$ is stable for all $\tau > \tau^\ast$ if and only if $x^\ast$ is a differential Stackelberg equilibrium.

Theorem 3 confirms that there does indeed exist a range of finite learning rate ratios such that a differential Stackelberg equilibrium is stable with respect to gradient descent-ascent. Moreover, such a range of learning rate ratios only exists if a critical point is a differential Stackelberg equilibrium. As we show in Corollary 2, the former implication of Theorem 3 nearly immediately implies there exists a $\tau^\ast$ such that gradient descent-ascent converges locally asymptotically to $x^\ast$ for all $\tau > \tau^\ast$ if and only if $x^\ast$ is a differential Stackelberg equilibrium, given a suitably chosen learning rate and deterministic gradient feedback. We give an explicit asymptotic rate of convergence in Theorem 5 and characterize the iteration complexity in Corollary 3. Moreover, we extend the convergence guarantees to stochastic gradient feedback in Theorem 6.

The latter implication of Theorem 3 says that there exists a finite learning rate ratio such that a non-equilibrium critical point of gradient descent-ascent is unstable. Building off of this, we complement the stability result of Theorem 3 with the following analogous instability result.

###### Informal Statement of Theorem 4.

Consider any stable critical point $x^\ast$ of gradient descent-ascent which is not a differential Stackelberg equilibrium. There exists a finite learning rate ratio $\tau_0$ such that $x^\ast$ is unstable for all $\tau > \tau_0$.

Theorem 4 establishes that there exists a range of finite learning rate ratios over which non-equilibrium critical points are unstable with respect to gradient descent-ascent. This implies that for a suitably chosen finite timescale separation, gradient descent-ascent avoids critical points lacking game-theoretic meaning. Together, Theorem 3 and Theorem 4 establish that gradient descent-ascent with timescale separation can guarantee equilibrium convergence, resolving a standing open question. Moreover, we provide explicit constructions for computing $\tau^\ast$ and $\tau_0$ given a critical point. In fact, our construction of $\tau^\ast$ in Theorem 3 is tight, and this is confirmed by our numerical experiments.

We finish the theoretical analysis of gradient descent-ascent in this paper by connecting to the literature on generative adversarial networks. We show, under common assumptions on generative adversarial networks (Nagarajan and Kolter, 2017; Mescheder et al., 2018), that introducing gradient penalty based regularization for the discriminator does not change the set of critical points of the dynamics. Further, there exists a finite learning rate ratio $\tau^\ast$ such that for any learning rate ratio $\tau > \tau^\ast$ and any non-negative, finite regularization parameter, the continuous time limiting regularized learning dynamics remain stable, and hence there is a range of learning rates for which the discrete time update converges locally asymptotically.

###### Informal Statement of Theorem 7.

Consider training a generative adversarial network with a gradient penalty (for any fixed, finite regularization parameter) via a zero-sum game with generator network $G$, discriminator network $D$, and loss $f$ such that relaxed realizable assumptions are satisfied for a critical point $x^\ast$. Then, $x^\ast$ is a differential Stackelberg equilibrium, and for any sufficiently large finite learning rate ratio $\tau$, gradient descent-ascent converges locally asymptotically. Moreover, an asymptotic rate of convergence is given in Corollary 5.

The theoretical results we provide are complemented by extensive experiments. In simulation, we explore a number of interesting behaviors of gradient descent-ascent with timescale separation analyzed theoretically including differential Stackelberg equilibria shifting from being unstable to stable and non-equilibrium critical points moving from being stable to unstable. Furthermore, we examine how the vector field and the spectrum of the game Jacobian evolve as a function of the timescale separation and explore the relationship with the rate of convergence. We experiment with gradient descent-ascent on the Dirac-GAN proposed by Mescheder et al. (2018) and illustrate the interplay between timescale separation, regularization, and rate of convergence. Building on this, we train generative adversarial networks on the CIFAR-10 and CelebA datasets with regularization and demonstrate that timescale separation can benefit performance and stability. In the experiments we observe that regularization and timescale separation are intimately connected and there is an inherent tradeoff between them. This indicates that insights made on simple generative adversarial network formulations may carry over to the complex problems where players are parameterized by neural networks.

Collectively, the primary contribution of this paper is the near-complete characterization of the behavior of gradient descent-ascent with finite timescale separation. Moreover, by introducing a novel set of analysis tools to this literature, our work opens a number of future research questions. As an aside, we believe these technical tools open up novel avenues for not only proving results about learning dynamics in games, but also for synthesizing algorithms.

### 1.2 Organization

The organization of this paper is as follows. Preliminaries on game theoretic notions of equilibria, gradient-based learning algorithms, and dynamical systems theory are reviewed in Section 2.

Convergence analysis proceeds in two phases. In Section 3, we study the stability properties of the continuous time limiting dynamical system given a timescale separation between the minimizing and maximizing players. Specifically, we show the first result on necessary and sufficient conditions for convergence of the continuous time limiting system corresponding to gradient descent-ascent with timescale separation to game theoretically meaningful equilibria (i.e., local minmax equilibria in zero-sum games). Following this, in Section 4, we provide convergence guarantees for the original discrete time dynamical system of interest (namely, gradient descent-ascent). Using the results in the preceding section, we show that gradient descent-ascent converges to a critical point if and only if it is a differential Stackelberg equilibrium (i.e., a sufficiently regular local minmax). In addition, we characterize the iteration complexity of gradient descent-ascent dynamics and provide finite-time bounds on local convergence to approximate local Stackelberg equilibria.

We apply the main results in the preceding sections to generative adversarial networks in Section 5, and in Section 6 we present several illustrative examples including generative adversarial networks where we show that tuning the learning rate ratio along with regularization and the exponential moving average hyperparameter significantly improves the Fréchet Inception Distance (FID) metric for generative adversarial networks.

Given its length, prior to concluding we review related work in Section 7, drawing connections to solution concepts, gradient descent-ascent learning dynamics, applications to adversarial learning where the success of heuristics provides strong motivation for the theoretical work in this paper, and historical connections to dynamical systems theory. Throughout the sections preceding Section 7, we draw connections to related works and results in an effort to place our results in the context of the literature. We conclude in Section 8 with a discussion on the significance of the results and open questions. The appendix includes the majority of the detailed proofs as well as additional experiments and commentary.

## 2 Preliminaries

In this section, we review game theoretic and dynamical systems preliminaries. Additionally, we formulate the class of learning rules analyzed in this paper.

### 2.1 Game Theoretic Preliminaries

A two-player zero-sum continuous game is defined by a collection of costs $(f_1, f_2)$ where $f_1 = f$ and $f_2 = -f$ for some $f \in C^q(X, \mathbb{R})$ with $q \geq 2$, and where $X = X_1 \times X_2$ with each $X_i$ a precompact subset of $\mathbb{R}^{d_i}$ for $i \in \{1, 2\}$. Let $d = d_1 + d_2$ be the dimension of the joint strategy space $X$. Player $i \in \{1, 2\}$ seeks to minimize their cost function $f_i(x_i, x_{-i})$ with respect to their choice variable $x_i$, where $x_{-i}$ is the vector of all other actions with $x = (x_i, x_{-i}) \in X$.

There are two natural equilibrium concepts for such games depending on the order of play—i.e., the Nash equilibrium concept in the case of simultaneous play and the Stackelberg equilibrium concept in the case of hierarchical play. Each notion of equilibria can be characterized as the intersection points of the reaction curves of the players (Başar and Olsder, 1998).

###### Definition 1 (Local Nash Equilibrium).

The joint strategy $x^\ast = (x_1^\ast, x_2^\ast)$ is a local Nash equilibrium on $U_1 \times U_2$, where $U_i \subset X_i$, if $f(x_1^\ast, x_2^\ast) \leq f(x_1, x_2^\ast)$ for all $x_1 \in U_1$ and $f(x_1^\ast, x_2^\ast) \geq f(x_1^\ast, x_2)$ for all $x_2 \in U_2$. Furthermore, if the inequalities are strict, we say $x^\ast$ is a strict local Nash equilibrium.

###### Definition 2 (Local Stackelberg Equilibrium).

Consider $U_i \subset X_i$ for $i \in \{1, 2\}$ where, without loss of generality, player 1 is the leader (minimizing player) and player 2 is the follower (maximizing player). The strategy $x_1^\ast \in U_1$ is a local Stackelberg solution for the leader if, for all $x_1 \in U_1$,

$$\sup_{x_2 \in \mathcal{R}_{U_2}(x_1^\ast)} f(x_1^\ast, x_2) \leq \sup_{x_2 \in \mathcal{R}_{U_2}(x_1)} f(x_1, x_2),$$

where $\mathcal{R}_{U_2}(x_1) = \{y \in U_2 : f(x_1, y) \geq f(x_1, x_2)\ \forall\, x_2 \in U_2\}$ is the reaction curve. Moreover, for any $x_2^\ast \in \mathcal{R}_{U_2}(x_1^\ast)$, the joint strategy profile $(x_1^\ast, x_2^\ast)$ is a local Stackelberg equilibrium on $U_1 \times U_2$.

Predicated on existence, equilibria can be characterized in terms of sufficient conditions on player costs. Indeed, in continuous games, first and second order conditions on player cost functions lead to a differential characterization (i.e., necessary and sufficient conditions) of local Nash equilibria reminiscent of optimality conditions in nonlinear programming (Ratliff et al., 2016).

We denote $D_i f$ as the derivative of $f$ with respect to $x_i$, $D_i^2 f$ as the partial derivative of $D_i f$ with respect to $x_i$, $D_{ij} f$ as the partial derivative of $D_i f$ with respect to $x_j$, and $Df$ as the total derivative.

###### Proposition 1 (Necessary and Sufficient Conditions for Local Nash (Ratliff et al., 2016, Thm. 1 & 2)).

If $x^\ast$ is a local Nash equilibrium of the zero-sum game $(f, -f)$, then $D_1 f(x^\ast) = 0$, $D_2 f(x^\ast) = 0$, and $D_1^2 f(x^\ast) \succeq 0$, $-D_2^2 f(x^\ast) \succeq 0$. On the other hand, if $D_1 f(x^\ast) = 0$, $D_2 f(x^\ast) = 0$, and $D_1^2 f(x^\ast) \succ 0$ and $-D_2^2 f(x^\ast) \succ 0$, then $x^\ast$ is a local Nash equilibrium.

The following definition, characterized by sufficient conditions for a local Nash equilibrium as defined in Definition 1, was first introduced in (Ratliff et al., 2013).

###### Definition 3 (Differential Nash Equilibrium (Ratliff et al., 2013)).

The joint strategy $x^\ast \in X$ is a differential Nash equilibrium if $D_1 f(x^\ast) = 0$ and $D_2 f(x^\ast) = 0$, $D_1^2 f(x^\ast) \succ 0$, and $-D_2^2 f(x^\ast) \succ 0$.

Analogous sufficient conditions can be stated to characterize a local Stackelberg equilibrium as defined in Definition 2. Suppose that $f \in C^q(X, \mathbb{R})$ for some $q \geq 2$. Given a point $x = (x_1, x_2)$ at which $D_2 f(x) = 0$ and $D_2^2 f(x)$ is invertible, the implicit function theorem (Abraham et al., 2012, Thm. 2.5.7) implies there is a neighborhood $U \subset X_1$ of $x_1$ and a unique map $r : U \to X_2$ such that for all $x_1' \in U$, $D_2 f(x_1', r(x_1')) = 0$. The map $r$ is referred to as the implicit map. Further, $Dr(x_1) = -(D_2^2 f(x))^{-1} D_{21} f(x)$. Note that invertibility of $D_2^2 f$ is a generic condition (cf. (Fiez et al., 2020, Lem. C.1)). Let $Df(x_1, r(x_1))$ be the total derivative of $f(x_1, r(x_1))$ with respect to $x_1$ and, analogously, let $D^2 f(x_1, r(x_1))$ be the second-order total derivative.
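To make the implicit map concrete, consider the scalar quadratic example $f(x_1, x_2) = -\tfrac{1}{2}x_1^2 + 2x_1x_2 - \tfrac{1}{2}x_2^2$ (our own illustrative choice, not from the formal development): the condition $D_2 f = 2x_1 - x_2 = 0$ defines $r(x_1) = 2x_1$ in closed form, and the sketch below checks the derivative formula for $Dr$ against a finite difference.

```python
# Implicit map for f(x1, x2) = -0.5*x1**2 + 2*x1*x2 - 0.5*x2**2:
# D2 f(x1, x2) = 2*x1 - x2 = 0 gives r(x1) = 2*x1 explicitly.

def r(x1):
    return 2.0 * x1          # follower's reaction curve

def dr_formula(x1):
    # Dr(x1) = -(D2^2 f)^{-1} D21 f; constant here since f is quadratic.
    d22, d21 = -1.0, 2.0     # D2^2 f and D21 f at any point
    return -d21 / d22        # = 2, matching the slope of r

# Finite-difference check of Dr against the closed form.
h = 1e-6
fd = (r(1.0 + h) - r(1.0)) / h
```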

###### Definition 4 (Differential Stackelberg Equilibrium (Fiez et al., 2020)).

The joint strategy $x^\ast = (x_1^\ast, x_2^\ast)$ is a differential Stackelberg equilibrium if $Df(x^\ast) = 0$, $D_2 f(x^\ast) = 0$, $D^2 f(x^\ast) \succ 0$, and $-D_2^2 f(x^\ast) \succ 0$.

Observe that in a general-sum setting the first order conditions for player 1 are equivalent to the total derivative of $f_1$ being zero at the candidate critical point, where $x_2$ is implicitly defined as a function of $x_1$ via the implicit function theorem applied to $D_2 f_2$. Since in this paper and in Definition 4 the class of games is zero-sum, $D_1 f(x^\ast) = 0$ and $D_2 f(x^\ast) = 0$ (along with invertibility of $D_2^2 f(x^\ast)$, which is implied by the second order conditions) are sufficient to imply that the total derivative is zero.

The Jacobian of the first order necessary and sufficient conditions—i.e., conditions that define potential candidate differential Nash and/or Stackelberg equilibria—is a useful mathematical object for understanding convergence properties of gradient based learning rules as we will see in subsequent sections. Consider the vector of individual gradients $g(x) = (D_1 f_1(x), D_2 f_2(x))$, which defines first order conditions for a differential Nash equilibrium. Let $J(x)$ denote the Jacobian of $g$, which is defined by

$$J(x) = \begin{bmatrix} D_1^2 f_1(x) & D_{12} f_1(x) \\ D_{21} f_2(x) & D_2^2 f_2(x) \end{bmatrix}. \tag{1}$$

We recall from Fiez et al. (2020) an alternative (to Definition 4), but equivalent, set of sufficient conditions for a differential Stackelberg equilibrium in terms of $J(x)$. Let $S_1(J(x)) = D_1^2 f(x) - D_{12} f(x)(D_2^2 f(x))^{-1} D_{21} f(x)$ denote the Schur complement of $J(x)$ with respect to the block $D_2^2 f_2(x)$ in $J(x)$.

###### Proposition 2 (Proposition 1  Fiez et al. (2020)).

Consider a zero-sum game defined by $f \in C^q(X, \mathbb{R})$ with $q \geq 2$ and player 1 (without loss of generality) taken to be the leader (minimizing player). Let $x^\ast$ satisfy $D_1 f(x^\ast) = 0$ and $D_2 f(x^\ast) = 0$. Then $-D_2^2 f(x^\ast) \succ 0$ and $S_1(J(x^\ast)) \succ 0$ if and only if $x^\ast$ is a differential Stackelberg equilibrium.
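Proposition 2 can be checked by hand on a scalar quadratic game (our example, not the paper's): for $f(x_1, x_2) = -\tfrac{1}{2}x_1^2 + 2x_1x_2 - \tfrac{1}{2}x_2^2$ the blocks of $J$ in (1) at the critical point $(0, 0)$ are scalars, and the point is a differential Stackelberg equilibrium but not a differential Nash equilibrium.

```python
# Scalar blocks of the game Jacobian at (0, 0) for
# f(x1, x2) = -0.5*x1**2 + 2*x1*x2 - 0.5*x2**2:
d11, d12, d21, d22 = -1.0, 2.0, 2.0, -1.0   # D1^2 f, D12 f, D21 f, D2^2 f

# Differential Stackelberg conditions (Proposition 2): -D2^2 f > 0 and the
# Schur complement S1 = D1^2 f - D12 f (D2^2 f)^{-1} D21 f > 0.
schur = d11 - d12 * d21 / d22               # = -1 + 4 = 3
is_stackelberg = (-d22 > 0) and (schur > 0)

# The point fails the differential Nash conditions since D1^2 f = -1 < 0.
is_nash = (d11 > 0) and (-d22 > 0)
```

The negative curvature in $x_1$ rules out a local Nash equilibrium, yet the Schur complement condition certifies a strict local minmax point, which is exactly the gap the timescale separation results target.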

###### Remark 1 (A comment on the genericity of differential Stackelberg/Nash equilibria.).

Due to Fiez et al. (2020, Theorem 1), differential Stackelberg equilibria are generic amongst local Stackelberg equilibria and, similarly, due to Mazumdar and Ratliff (2019, Theorem 2), differential Nash equilibria are generic amongst local Nash equilibria. This means that the property of being a differential Stackelberg (respectively, differential Nash) equilibrium in a zero-sum game is generic in the class of zero-sum games defined by sufficiently smooth functions—that is, for almost all (in some formal sense) zero-sum games, all the local Stackelberg/Nash equilibria are differential Stackelberg/Nash equilibria.

### 2.2 Gradient-Based Learning

As noted above, in this paper we focus on settings in which agents or players are seeking equilibria of the game via a learning algorithm. We study arguably the most natural learning rule in zero-sum continuous games: gradient descent-ascent (GDA). This gradient-based learning rule is a simultaneous gradient play algorithm in that agents update their actions at each iteration simultaneously.

Gradient descent-ascent is defined as follows. At iteration $k$, each agent $i$ updates their choice variable $x_i$ by the process

$$x_{i,k+1} = x_{i,k} - \gamma_i g_i(x_{i,k}, x_{-i,k}) \tag{2}$$

where $\gamma_i$ is agent $i$'s learning rate, and $g_i$ is agent $i$'s gradient-based update mechanism. For simultaneous gradient play,

$$g(x) = (D_1 f_1(x), D_2 f_2(x)) \tag{3}$$

is the vector of individual gradients. In a zero-sum setting, GDA is defined using $g(x) = (D_1 f(x), -D_2 f(x))$, where the first player is the minimizing player and the second player is the maximizing player.

One of the key contributions of this paper is that we provide convergence rates for settings in which there is a timescale separation between the learning processes of the minimizing and maximizing players—i.e., settings in which the agents' learning rates are not homogeneous. Define $\Lambda_\tau = \mathrm{blockdiag}(I_{d_1}, \tau I_{d_2})$ where $I_d$ denotes the identity matrix of dimension $d$. Let $\tau = \gamma_2/\gamma_1$ be the learning rate ratio so that $\gamma_2 = \tau\gamma_1$. The $\tau$-GDA dynamics are given by

$$x_{k+1} = x_k - \gamma_1 \Lambda_\tau g(x_k). \tag{4}$$
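The effect of $\tau$ on local stability can be seen on a small toy game of our own choosing (not an example from the paper): for $f(x_1, x_2) = -\tfrac{1}{2}x_1^2 + 2x_1x_2 - \tfrac{1}{2}x_2^2$, the critical point $(0, 0)$ is a differential Stackelberg equilibrium but not a Nash equilibrium, and the continuous-time limit $\dot{x} = -\Lambda_\tau J x$ with $J = \begin{bmatrix} -1 & 2 \\ -2 & 1 \end{bmatrix}$ flips from unstable to stable as $\tau$ crosses $1$.

```python
# Stability of xdot = -Lambda_tau J x for J = [[-1, 2], [-2, 1]], with
# Lambda_tau = diag(1, tau), so Lambda_tau J = [[-1, 2], [-2*tau, tau]].

def is_stable(tau):
    """For a 2x2 matrix, all eigenvalues of Lambda_tau J lie in the open
    right half-plane (so -Lambda_tau J is Hurwitz) iff trace > 0 and det > 0."""
    trace = -1.0 + tau
    det = (-1.0) * tau - 2.0 * (-2.0 * tau)   # = 3*tau > 0 for tau > 0
    return trace > 0 and det > 0

# The critical ratio here is tau* = 1: the differential Stackelberg
# equilibrium is unstable for tau < 1 and stable for every finite tau > 1.
```

This is the phenomenon the main theorems formalize: a finite threshold on the learning rate ratio beyond which exactly the game-theoretically meaningful critical points are stable.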

Tools for Convergence Analysis. We analyze the iteration complexity or local asymptotic rate of convergence of learning rules of the form (2) in the neighborhood of an equilibrium. Given two real-valued functions $f$ and $g$, we write $f = O(g)$ if there exists a positive constant $c$ such that $|f| \leq c|g|$. For example, consider iterates $\{x_k\}$ generated by (2) with initial condition $x_0$ and critical point $x^\ast$. Suppose that we show $\|x_k - x^\ast\| \leq c\rho^k \|x_0 - x^\ast\|$ for some $\rho \in (0, 1)$. Then, we write $\|x_k - x^\ast\| = O(\rho^k)$, where $\rho$ is the asymptotic rate of convergence.

### 2.3 Dynamical Systems Primer

In this paper, we study learning rules employed by agents seeking game-theoretically meaningful equilibria in continuous games. Dynamical systems tools for both continuous and discrete time play a crucial role in this analysis.

Stability. Before we proceed, we recall and remark on some facts from dynamical systems theory concerning stability of equilibria in the continuous-time dynamics

$$\dot{x} = -g(x) \tag{5}$$

relevant to convergence analysis for the discrete-time learning dynamics in (2). Observe that equilibria are shared between (2) and (5). Our focus is on the subset of equilibria that satisfy Definition 4, and the subset thereof defined in Definition 3. Recall the following equivalent characterizations of stability for an equilibrium $x^\ast$ of (5) in terms of the Jacobian matrix $J(x^\ast)$.

###### Theorem 1 ((Khalil, 2002, Thm. 4.15)).

Consider a critical point $x^\ast$ of (5), i.e., $g(x^\ast) = 0$. The following are equivalent: (a) $x^\ast$ is a locally exponentially stable equilibrium of (5); (b) $\mathrm{spec}(-J(x^\ast)) \subset \mathbb{C}_-^\circ$, the open left-half complex plane; (c) there exists a symmetric positive-definite matrix $P$ such that $J(x^\ast)^\top P + P J(x^\ast) \succ 0$.

It was shown in (Ratliff et al., 2016, Prop. 2) that if the spectrum of $-J$ at a differential Nash equilibrium $x^\ast$ is in the open left-half complex plane—i.e., $\mathrm{spec}(-J(x^\ast)) \subset \mathbb{C}_-^\circ$—then $x^\ast$ is a locally exponentially stable equilibrium of (5). Indeed, if all agents learn at the same rate so that $\gamma_1 = \gamma_2 = \gamma$ in (2), then a straightforward application of the Spectral Mapping Theorem (Callier and Desoer, 2012, Thm. 4.7) ensures that an exponentially stable equilibrium for (5) is locally exponentially stable in (2) so long as $\gamma$ is chosen sufficiently small (Chasnov et al., 2019). However, this observation does not directly tell us how to select $\gamma$ or the resulting iteration complexity in an asymptotic or finite-time sense; furthermore, this line of reasoning does not apply when agents learn at different rates ($\gamma_1 \neq \gamma_2$ in (2)).
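The learning-rate argument above can be sketched on a $2 \times 2$ toy Jacobian of our own choosing: the spectral mapping theorem gives $\mathrm{spec}(I - \gamma J) = \{1 - \gamma\lambda : \lambda \in \mathrm{spec}(J)\}$, so a sufficiently small shared rate $\gamma$ makes the discrete map a contraction, while a large one does not.

```python
# Spectral radius of the GDA map I - gamma*J for J = [[1, 2], [-2, 1]],
# whose eigenvalues are 1 +/- 2i (positive real part, so -J is Hurwitz).
import cmath

def eig2(a, b, c, d):
    # Eigenvalues of a 2x2 matrix [[a, b], [c, d]] via the quadratic formula.
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def spectral_radius_of_map(gamma):
    # spec(I - gamma*J) = {1 - gamma*lam : lam in spec(J)}.
    l1, l2 = eig2(1.0, 2.0, -2.0, 1.0)
    return max(abs(1 - gamma * l1), abs(1 - gamma * l2))

# gamma = 0.1 yields a contraction; gamma = 1.0 is too large and expands.
```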

Limiting dynamical systems. The continuous time dynamical system takes the form $\dot{x} = -\Lambda_\tau g(x)$ due to the timescale separation $\tau$. Such a system is known as a singularly perturbed system or a multi-timescale system in the dynamical systems theory literature (Kokotovic et al., 1986), particularly where the perturbation parameter $\epsilon = 1/\tau$ is small. Singularly perturbed systems are classically expressed as

$$\dot{x} = -D_1 f_1(x, z), \qquad \epsilon\dot{z} = -D_2 f_2(x, z) \tag{6}$$

where $\epsilon$ is most often a physically meaningful quantity inherent to some dynamical system that describes the evolution of some physical phenomena; e.g., in circuits it may be a constant related to device material properties, and in communication networks, it is often the speed at which data flows through a physical medium such as cable.

In the classical asymptotic analysis of a system of the form (6), the goal is to obtain an approximate solution, say $\hat{x}(t, \epsilon)$, such that the approximation error is small in some norm for small $\epsilon$ and, further, the approximate solution is expressed in terms of a reduced order system. Such results have significance in terms of revealing underlying structural properties of the original system and its corresponding state for small $\epsilon$. One of the contributions of this work is that we take a similar analysis approach in order to reveal underlying structural properties of the optimization landscape of zero-sum games/minimax optimization problems. Indeed, asymptotic methods can reveal multiple timescale structures that are inherent in many machine learning problems, as we observe in this paper for zero-sum games. One key point of separation in applying dynamical systems theory to the study of algorithms versus physical system dynamics—in particular, learning in games—is that this parameter is no longer necessarily a physical quantity but is most often a hyperparameter subject to design. In this paper, treating the inverse of the learning rate ratio as a timescale separation parameter, we combine the asymptotic analysis tools from singular perturbation theory with tools from algebra to obtain convergence guarantees.
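The quasi-steady-state behavior behind this analysis can be seen in a small simulation of a generic linear two-timescale system (our own choosing, not from the paper): after a short boundary layer, the fast variable $z$ tracks its equilibrium manifold and the slow variable evolves on the reduced-order system.

```python
# Euler simulation of a linear singularly perturbed system (illustrative):
#   xdot = -(x - z),   eps * zdot = -(z - 0.5 * x)
# For small eps, z rapidly relaxes to the slow manifold z = 0.5*x, after
# which x approximately follows the reduced system xdot = -0.5*x.
eps, h, T = 0.01, 0.001, 5.0
x, z = 1.0, 0.0
for _ in range(int(T / h)):
    x, z = x + h * (-(x - z)), z + (h / eps) * (-(z - 0.5 * x))

# Past the boundary layer, z sits within O(eps) of the slow manifold.
manifold_gap = abs(z - 0.5 * x)
```

The same picture underlies $\tau$-GDA for large $\tau$: the maximizing player acts as the fast variable, approximately tracking a best response while the minimizing player moves slowly.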

Leveraging Linearization to Infer Qualitative Properties. The Hartman-Grobman theorem asserts that it is possible to continuously deform all trajectories of a nonlinear system onto trajectories of the linearization at a fixed point of the nonlinear system. Informally, the theorem states that if the linearization of the nonlinear dynamical system around a fixed point $\bar{x}$—i.e., $\dot{z} = -J(\bar{x})z$—has no zero or purely imaginary eigenvalues, then there exists a neighborhood $U$ of $\bar{x}$ and a homeomorphism $h$—i.e., a continuous bijection with continuous inverse—taking trajectories of the nonlinear system $\dot{x} = -g(x)$ and mapping them onto those of the linearization. In particular, $z = h(x)$.

Given a dynamical system $\dot{x} = -g(x)$, the state or solution of the system at time $t$ starting from $x_0$ at time $0$ is called the flow and is denoted $\phi_t(x_0)$.

###### Theorem 2 (Hartman-Grobman (Sastry, 1999, Thm. 7.3); (Teschl, 2000, Thm. 9.9)).

Consider the n-dimensional dynamical system ẋ = f(x) with equilibrium point x̄. If the linearization A = Df(x̄) has no zero or purely imaginary eigenvalues, there is a homeomorphism h defined on a neighborhood U of x̄ taking orbits of the flow φt of the nonlinear system to those of the linear flow e^{At} of ż = Az—that is, the flows are topologically conjugate. The homeomorphism preserves the sense of the orbits and can be chosen to preserve parameterization by time.

The above theorem says that the qualitative properties of the nonlinear system in the vicinity (which is determined by the neighborhood U) of an isolated equilibrium are determined by its linearization if the linearization has no eigenvalues on the imaginary axis in the complex plane. We also remark that the Hartman-Grobman theorem can be applied to discrete time maps (cf. Sastry (1999, Thm. 2.18)) with the same qualitative outcome.

Internally Chain Transitivity. In proving results for stochastic gradient descent-ascent, we leverage what is known as the ordinary differential equation method, in which the flow of the limiting continuous time system starting at sample points from the stochastic updates of the players' actions is compared to asymptotic pseudo-trajectories—i.e., linear interpolations between sample points. To understand stability in the stochastic case, we need the notion of internally chain transitive sets. For more detail, the reader is referred to (Alongi and Nelson, 2007, Chap. 2–3).

A closed set A is an invariant set for a differential equation ẋ = f(x) if any trajectory x(t) with x(0) ∈ A satisfies x(t) ∈ A for all t ≥ 0. Let Φ be a flow on a metric space (M, d). Given ε > 0, T > 0, and points x, y ∈ M, an (ε, T)-chain from x to y with respect to Φ and d is a pair of finite sequences x = x₁, …, xₙ₋₁, xₙ = y in M and t₁, …, tₙ₋₁ ≥ T in ℝ, denoted together by (x₁, …, xₙ; t₁, …, tₙ₋₁), such that d(Φ(xᵢ, tᵢ), xᵢ₊₁) < ε for i = 1, …, n − 1. A set Λ is (internally) chain transitive with respect to Φ if Λ is a non-empty closed invariant set with respect to Φ such that for each x, y ∈ Λ, ε > 0, and T > 0 there exists an (ε, T)-chain from x to y. A compact invariant set is invariantly connected if it cannot be decomposed into two disjoint closed nonempty invariant sets. It is easy to see that every internally chain transitive set is invariantly connected.

## 3 Stability of Continuous Time GDA with Timescale Separation

To characterize the convergence of τ-GDA, we begin by studying its continuous time limiting system

 \dot{x} = -\Lambda_\tau g(x), (7)

where we recall that Λτ = blkdiag(Iₙ₁, τIₙ₂) and g(x) = (D₁f(x), −D₂f(x)). Throughout this section, the class of zero-sum games we consider is sufficiently smooth, meaning that f ∈ C^q(ℝⁿ, ℝ) for some q ≥ 2. The Jacobian of the system from (7) in zero-sum games of the form (f, −f) is given as

 J_\tau(x) = \begin{bmatrix} D_1^2 f(x) & D_{12} f(x) \\ -\tau D_{12}^\top f(x) & -\tau D_2^2 f(x) \end{bmatrix}. (8)

By analyzing the stability of the continuous time system as a function of the timescale separation τ using the Jacobian from (8) in this section, we can then draw conclusions about the stability and convergence of the discrete time system τ-GDA in Section 4.
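For quadratic zero-sum games the stability test implicit in this discussion—assemble Jτ(x) from the blocks of the Hessian and check whether its spectrum lies in the open right half plane—can be sketched numerically. The helper names, the test cost matrix, and the tolerance below are our own illustrative choices, not constructions from the paper.

```python
# Sketch: build J_tau from (8) for a quadratic zero-sum game
# f(x1, x2) = 1/2 [x1; x2]^T M [x1; x2] and test spec(J_tau) ⊂ C°₊.
import numpy as np

def jacobian_tau(M, n1, tau):
    """J_tau from (8): blocks D1^2 f, D12 f, -tau*D12^T f, -tau*D2^2 f."""
    A = M[:n1, :n1]   # D1^2 f(x)
    B = M[:n1, n1:]   # D12 f(x)
    C = M[n1:, n1:]   # D2^2 f(x)
    return np.block([[A, B], [-tau * B.T, -tau * C]])

def is_stable(J, tol=1e-10):
    """Stability of the continuous time tau-GDA flow: all eigenvalues in the open RHP."""
    return np.min(np.linalg.eigvals(J).real) > tol

# Illustrative cost with D1^2 f = 1 > 0 and D2^2 f = -1 < 0, so the origin
# is a differential Nash equilibrium (cf. Proposition 3 below).
M = np.array([[1.0, 2.0],
              [2.0, -1.0]])
print([is_stable(jacobian_tau(M, 1, tau)) for tau in (0.5, 1.0, 10.0, 100.0)])
# → [True, True, True, True]
```

Consistent with Proposition 3, every tested timescale separation reports stability for this differential Nash equilibrium.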

The organization of this section is as follows. To begin, we present a collection of preliminary observations in Section 3.1 regarding the stability of continuous time gradient descent-ascent with timescale separation to motivate the results in the subsequent subsections by establishing known results and introducing alternative analysis methods that the technical results in this paper build on. Then, in Sections 3.2 and 3.3 respectively, we present necessary and sufficient conditions for stability of the continuous time system around critical points in terms of the learning rate ratio along with sufficient conditions to guarantee the instability of the continuous time system around non-equilibrium critical points in terms of the timescale separation.

### 3.1 Preliminary Observations

In Figure 1 we present a graphical representation of known results on the stability of gradient descent-ascent with timescale separation in continuous time, where we remark that such results nearly directly imply equivalent conclusions regarding the discrete time system τ-GDA with a suitable choice of learning rate. The primary focus of past work has been on the edge cases of τ = 1 and τ → ∞. For τ = 1, the set of differential Nash equilibria is stable, but a differential Stackelberg equilibrium may be stable or unstable, and non-equilibrium critical points can be stable. As τ → ∞, the set of differential Nash equilibria remains stable, each differential Stackelberg equilibrium is guaranteed to become stable, and each non-equilibrium critical point must be unstable. We fill the gap between the known results by providing results as a function of finite τ. With an eye toward this goal, we now provide examples and preliminary results that illustrate the type of guarantees that may be achievable for a range of finite learning rate ratios.

To start off, we consider the set of differential Nash equilibria. It is nearly immediate from the structure of the Jacobian that each differential Nash equilibrium is stable for τ = 1 (Mazumdar et al., 2020; Daskalakis and Panageas, 2018). Moreover, Jin et al. (2020) showed that regardless of the value of τ > 0, the set of differential Nash equilibria remains stable. In other words, the desirable stability characteristics of differential Nash equilibria are retained for any choice of timescale separation. We state this result as a proposition for later reference and since our proof technique relies on the concept of the quadratic numerical range (Tretter, 2008), which has not appeared previously in this context. The proof of Proposition 3 is provided in Appendix B.

###### Proposition 3.

Consider a zero-sum game defined by f ∈ C^q(ℝⁿ, ℝ) for some q ≥ 2. Suppose that x∗ is a differential Nash equilibrium. Then, spec(Jτ(x∗)) ⊂ C°₊ for all τ > 0.

Fiez et al. (2020) show that the set of differential Nash equilibria is a subset of the set of differential Stackelberg equilibria. In other words, any differential Nash equilibrium is a differential Stackelberg equilibrium, but a differential Stackelberg equilibrium need not be a differential Nash equilibrium. Moreover, Jin et al. (2020) show that the result of Proposition 3 fails to extend from differential Nash equilibria to the broader class of differential Stackelberg equilibria. Indeed, not all differential Stackelberg equilibria are stable with respect to the continuous time limiting dynamics of gradient descent-ascent without timescale separation. However, as the following example demonstrates, differential Stackelberg equilibria that are unstable without timescale separation can become stable for a range of finite timescale learning rate ratios.

###### Example 1.

Within the class of zero-sum games, there exist differential Stackelberg equilibria that are unstable with respect to ẋ = −Λ₁g(x) and stable with respect to ẋ = −Λτg(x) for all τ ∈ (τ∗, ∞) where τ∗ is finite. Indeed, consider the quadratic zero-sum game defined by the cost

 f(x_1, x_2) = \frac{1}{2}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} -v & 0 & -v & 0 \\ 0 & \frac{1}{2}v & 0 & \frac{1}{2}v \\ -v & 0 & -\frac{1}{2}v & 0 \\ 0 & \frac{1}{2}v & 0 & -v \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

where v > 0 and x₁, x₂ ∈ ℝ². The unique critical point of the game, given by x∗ = 0, is a differential Stackelberg equilibrium since g(x∗) = 0, −D₂²f(x∗) ≻ 0, and the Schur complement S₁(Jτ(x∗)) = D₁²f(x∗) − D₁₂f(x∗)(D₂²f(x∗))⁻¹D₂₁f(x∗) ≻ 0. The spectrum of the Jacobian Jτ(x∗) is given by

 \mathrm{spec}(J_\tau(x^\ast)) = \left\{ \frac{v\big(2\tau + 1 \pm \sqrt{4\tau^2 - 8\tau + 1}\big)}{4},\ \frac{v\big(\tau - 2 \pm \sqrt{\tau^2 - 12\tau + 4}\big)}{4} \right\}.

Observe that for τ = 1, spec(J₁(x∗)) ⊄ C°₊ for any v > 0 so that the differential Stackelberg equilibrium is never stable for the choice of τ = 1. However, for any v > 0, spec(Jτ(x∗)) ⊂ C°₊ for all τ ∈ (2, ∞), meaning that the differential Stackelberg equilibrium is indeed stable with respect to the dynamics ẋ = −Λτg(x) for a range of finite learning rate ratios.
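The closed-form spectrum above can be checked numerically. The following sketch (the helper and the choice v = 1 are ours, purely for illustration) confirms an eigenvalue with negative real part at τ = 1 and a spectrum in the open right half plane once τ exceeds the finite threshold.

```python
# Sketch verifying Example 1 numerically with v = 1.
import numpy as np

v = 1.0
M = np.array([[-v,   0.0, -v,   0.0],
              [0.0,  v/2, 0.0,  v/2],
              [-v,   0.0, -v/2, 0.0],
              [0.0,  v/2, 0.0,  -v]])

def spec_J_tau(tau, n1=2):
    A, B, C = M[:n1, :n1], M[:n1, n1:], M[n1:, n1:]
    return np.linalg.eigvals(np.block([[A, B], [-tau * B.T, -tau * C]]))

print(min(spec_J_tau(1.0).real))  # negative: unstable without timescale separation
print(min(spec_J_tau(3.0).real))  # positive: stable for this finite ratio
```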

We explore Example 1 further via simulations in Section 6.1. The key takeaway from Example 1 is that it is clearly not always necessary for the timescale separation to approach infinity in order to guarantee the stability of a differential Stackelberg equilibrium and instead there exists a sufficient finite learning rate ratio. Put simply, the undesirable property of differential Stackelberg equilibria not being stable with respect to gradient descent-ascent without timescale separation can potentially be remedied with only a finite timescale separation.

It is well-documented that some stable critical points of the continuous time gradient descent-ascent limiting dynamics without timescale separation can lack game-theoretic meaning, as they may be neither differential Nash equilibria nor differential Stackelberg equilibria (Mazumdar et al., 2020; Daskalakis and Panageas, 2018; Jin et al., 2020). The following example demonstrates that such undesirable critical points that are stable without timescale separation can become unstable for a range of finite learning rate ratios.

###### Example 2.

Within the class of zero-sum games, there exist non-equilibrium critical points that are stable with respect to ẋ = −Λ₁g(x) and unstable with respect to ẋ = −Λτg(x) for all τ ∈ (τ∗, ∞) where τ∗ is finite. Indeed, consider a zero-sum game defined by the cost

 f(x_1, x_2) = \frac{1}{2}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} \frac{1}{2}v & 0 & \frac{1}{2}v & 0 \\ 0 & -\frac{1}{4}v & 0 & \frac{1}{2}v \\ \frac{1}{2}v & 0 & \frac{1}{4}v & 0 \\ 0 & \frac{1}{2}v & 0 & -\frac{1}{2}v \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} (9)

where v > 0 and x₁, x₂ ∈ ℝ². The unique critical point of the game, given by x∗ = 0, is neither a differential Nash equilibrium nor a differential Stackelberg equilibrium since D₁²f(x∗) ⊁ 0 and −D₂²f(x∗) ⊁ 0. The spectrum of the Jacobian Jτ(x∗) is given by

 \mathrm{spec}(J_\tau(x^\ast)) = \left\{ \frac{v\big(2\tau - 1 \pm \sqrt{4\tau^2 - 12\tau + 1}\big)}{8},\ \frac{v\big(2 - \tau \pm \sqrt{\tau^2 - 12\tau + 4}\big)}{8} \right\}.

Given τ = 1, spec(J₁(x∗)) ⊂ C°₊ for any v > 0 so that the non-equilibrium critical point is in fact stable for the choice of timescale separation τ = 1. However, for any v > 0, spec(Jτ(x∗)) ⊄ C°₊ for all τ ∈ (2, ∞), meaning that the non-equilibrium critical point is unstable with respect to the dynamics ẋ = −Λτg(x) for a range of finite learning rate ratios.
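As with Example 1, the claim can be checked numerically from the closed-form spectrum; the sketch below (our helper, with the illustrative choice v = 1) shows the spurious critical point attracting the dynamics at τ = 1 and losing stability at a larger finite ratio.

```python
# Sketch verifying Example 2 numerically with v = 1.
import numpy as np

v = 1.0
M = np.array([[ v/2,  0.0,  v/2,  0.0],
              [ 0.0, -v/4,  0.0,  v/2],
              [ v/2,  0.0,  v/4,  0.0],
              [ 0.0,  v/2,  0.0, -v/2]])

def spec_J_tau(tau, n1=2):
    A, B, C = M[:n1, :n1], M[:n1, n1:], M[n1:, n1:]
    return np.linalg.eigvals(np.block([[A, B], [-tau * B.T, -tau * C]]))

print(min(spec_J_tau(1.0).real))  # positive: the non-equilibrium point is stable at tau = 1
print(min(spec_J_tau(3.0).real))  # negative: unstable for this finite ratio
```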

The game construction from (9) is quadratic and as a result has a unique critical point. Games can be constructed in which critical points lacking game-theoretic meaning that are stable without timescale separation become unstable for all τ sufficiently large even in the presence of multiple equilibria. Indeed, consider a zero-sum game defined by the cost

 f(x_1, x_2) = \frac{5}{4}\Big(x_{11}^2 + 2x_{11}x_{21} + \frac{1}{2}x_{21}^2 - \frac{1}{2}x_{12}^2 + 2x_{12}x_{22} - x_{22}^2\Big)(x_{11} - 1)^2 + x_{11}^2\Big(\sum_{i=1}^{2}(x_{1i} - 1)^2 - (x_{2i} - 1)^2\Big). (10)

This game has three critical points. Two of them are differential Nash equilibria and are consequently stable for any choice of τ > 0. The remaining critical point, which we denote by x∗, is neither a differential Nash equilibrium nor a differential Stackelberg equilibrium. Moreover, the Jacobian Jτ at this critical point for the game defined by (10) is identical to that for the game defined by (9). As a result, we know that x∗ is stable without timescale separation, but spec(Jτ(x∗)) ⊄ C°₊ for all τ ∈ (2, ∞) so that the non-equilibrium critical point is again unstable with respect to the dynamics for a range of finite learning rate ratios.

We investigate the game defined in (10) from Example 2 with simulations in Section 6.2. In an analogous manner to Example 1, Example 2 demonstrates that it is not always necessary for the timescale separation to approach infinity in order to guarantee non-equilibrium critical points become unstable as there can exist a sufficient finite learning rate ratio. This is to say that the unwanted property of non-equilibrium critical points being stable without timescale separation can also potentially be remedied with only a finite timescale separation.

The examples of this section have provided evidence that there exists a range of finite learning rate ratios for which differential Stackelberg equilibria are stable and a range of learning rate ratios for which non-equilibrium critical points are unstable. Yet, no result has appeared in the literature on gradient descent-ascent with timescale separation confirming this behavior in general. We focus on doing precisely that in the subsection that follows. Before doing so, we remark on the closest existing result. As mentioned previously, Jin et al. (2020) show that as τ → ∞, the set of stable critical points with respect to the dynamics ẋ = −Λτg(x) coincides with the set of differential Stackelberg equilibria. However, an equivalent result in the context of general singularly perturbed systems has been known in the literature (cf. Kokotovic et al. 1986, Chap. 2). We give a proof based on this type of analysis because it reveals a new set of analysis tools for the study of game-theoretic formulations of machine learning and optimization problems; a proof sketch is given below while the full proof is given in Appendix F.

###### Proposition 4.

Consider a zero-sum game defined by f ∈ C^q(ℝⁿ, ℝ) for some q ≥ 2. Suppose that x∗ is such that g(x∗) = 0 and det(D₂²f(x∗)) ≠ 0. Then, as τ → ∞, spec(Jτ(x∗)) ⊂ C°₊ if and only if x∗ is a differential Stackelberg equilibrium.

###### Proof Sketch.

The basic idea in showing this result is that there is a (local) transformation of coordinates from the linearized dynamics of ẋ = −Λτg(x), which we write as

 \dot{x} = \begin{bmatrix} A_{11} & A_{12} \\ -\tau A_{12}^\top & \tau A_{22} \end{bmatrix} x,

in a neighborhood of a critical point to an upper triangular system that depends parametrically on τ⁻¹ and hence, the asymptotic behavior is readily obtainable from the block diagonal components of the system in the new coordinates. Indeed, consider the change of variables z = x₂ + L(τ⁻¹)x₁ for the second player so that

 \begin{bmatrix} \dot{x}_1 \\ \tau^{-1}\dot{z} \end{bmatrix} = \begin{bmatrix} A_{11} - A_{12}L(\tau^{-1}) & A_{12} \\ R(L, \tau) & A_{22} + \tau^{-1}L(\tau^{-1})A_{12} \end{bmatrix} \begin{bmatrix} x_1 \\ z \end{bmatrix} (11)

where

 R(L, \tau) = -A_{12}^\top - A_{22}L(\tau^{-1}) + \tau^{-1}L(\tau^{-1})A_{11} - \tau^{-1}L(\tau^{-1})A_{12}L(\tau^{-1}) = 0.

A transformation of coordinates such that R(L, τ) = 0 always exists (cf. Lemma 7, Appendix F). Hence, the characteristic equation of (11) can be expressed as

 \chi(s, \tau) = \tau^{n_2}\chi_s(s, \tau)\chi_f(p, \tau) = 0

where χ_s(s, τ) = det(sI − A₁₁ + A₁₂L(τ⁻¹)) and χ_f(p, τ) = det(pI − A₂₂ − τ⁻¹L(τ⁻¹)A₁₂) with p = τ⁻¹s. As τ → ∞, L(τ⁻¹) → −A₂₂⁻¹A₁₂⊤. Consequently, n₁ of the eigenvalues of Jτ(x∗), denoted by λᵢ for i = 1, …, n₁, are the roots of the slow characteristic equation χ_s(s, τ) = 0 and the rest of the eigenvalues are denoted by λ_{j+n₁} for j = 1, …, n₂, where τ⁻¹λ_{j+n₁} are the roots of the fast characteristic equation χ_f(p, τ) = 0. In the limit, the roots of χ_s are precisely those of the (first) Schur complement S₁(Jτ(x∗)) while the roots of χ_f are precisely those of −D₂²f(x∗). ∎

This simple transformation of coordinates to the upper triangular dynamical system shown in (11) leads immediately to the asymptotic result in Proposition 4. It also shows that if the eigenvalues of S₁(Jτ(x∗)) are distinct5 and similarly, so are those of −D₂²f(x∗) (although S₁(Jτ(x∗)) and −D₂²f(x∗) are allowed to have eigenvalues in common), then the asymptotic results from Proposition 4 imply the following approximations for the elements of spec(Jτ(x∗)):

 \lambda_i = \lambda_i\big(S_1(J_\tau(x^\ast))\big) + O(\tau^{-1}), \quad i = 1, \dots, n_1, \qquad \lambda_{j+n_1} = \tau\big(\lambda_j(-D_2^2 f(x^\ast)) + O(\tau^{-1})\big), \quad j = 1, \dots, n_2.

This follows simply by observing that when the eigenvalues are distinct, the derivatives of the roots with respect to τ⁻¹ are well-defined by the implicit function theorem and the total derivatives of χ_s and χ_f, respectively.
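The eigenvalue splitting above can be observed numerically. The sketch below (our construction, reusing the quadratic game from Example 1 with the illustrative choice v = 1) compares the spectrum of Jτ at a large τ against the eigenvalues of the Schur complement S₁ and of τ·(−D₂²f).

```python
# Sketch: for large tau, spec(J_tau) splits into n1 "slow" eigenvalues near
# eig(S1) and n2 "fast" eigenvalues near tau * eig(-D2^2 f).
import numpy as np

v, tau, n1 = 1.0, 1e4, 2
M = np.array([[-v,   0.0, -v,   0.0],
              [0.0,  v/2, 0.0,  v/2],
              [-v,   0.0, -v/2, 0.0],
              [0.0,  v/2, 0.0,  -v]])
A, B, C = M[:n1, :n1], M[:n1, n1:], M[n1:, n1:]

J_tau = np.block([[A, B], [-tau * B.T, -tau * C]])
eigs = np.sort(np.linalg.eigvals(J_tau).real)

slow = np.sort(np.linalg.eigvals(A - B @ np.linalg.inv(C) @ B.T).real)  # eig(S1)
fast = tau * np.sort(np.linalg.eigvals(-C).real)                        # tau * eig(-D2^2 f)

print(eigs[:n1], slow)  # slow eigenvalues approach eig(S1) as tau grows
print(eigs[n1:], fast)  # fast eigenvalues grow linearly along tau * eig(-D2^2 f)
```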

### 3.2 Necessary and Sufficient Conditions for Stability

The proof of Proposition 4 provides some intuition for the next result, which is one of our main contributions. Indeed, as shown in Kokotovic et al. (1986, Chap. 2), as τ → ∞ the first n₁ eigenvalues of Jτ(x∗) tend to fixed positions in the complex plane defined by the eigenvalues of S₁(Jτ(x∗)), while the remaining n₂ eigenvalues tend to infinity, with the linear rate τ, along asymptotes defined by the eigenvalues of −D₂²f(x∗). The asymptotic splitting of the spectrum provides some intuition for the following result.

###### Theorem 3.

Consider a zero-sum game defined by f ∈ C^q(ℝⁿ, ℝ) for some q ≥ 2. Suppose that x∗ is such that g(x∗) = 0 and S₁(Jτ(x∗)) and D₂²f(x∗) are non-singular. There exists a finite τ∗ > 0 such that spec(Jτ(x∗)) ⊂ C°₊ for all τ ∈ (τ∗, ∞) if and only if x∗ is a differential Stackelberg equilibrium.

Before getting into the proof sketch, we provide some intuition for the construction of τ∗ and along the way revive an old analysis tool from dynamical systems theory which turns out to be quite powerful in analyzing the stability properties of parameterized systems.

Construction of τ∗. There is still the question of how to construct such a τ∗ and do so in a way that is as tight as possible. Recall Theorem 1, which states that a matrix A is exponentially stable if and only if there exists a symmetric positive definite P such that A⊤P + PA ≺ 0. The operator P ↦ A⊤P + PA is known as the Lyapunov operator. Given a positive definite Q, Jτ(x∗) is stable if and only if there exists a unique solution P to

 \big((J_\tau^\top(x^\ast) \otimes I) + (I \otimes J_\tau^\top(x^\ast))\big)\mathrm{vec}(P) = \big(J_\tau^\top(x^\ast) \oplus J_\tau^\top(x^\ast)\big)\mathrm{vec}(P) = \mathrm{vec}(Q) (12)

where ⊗ and ⊕ denote the Kronecker product and Kronecker sum, respectively.6 The existence of a unique solution occurs if and only if Jτ⊤(x∗) and −Jτ⊤(x∗) have no eigenvalues in common—equivalently, no two eigenvalues of Jτ(x∗) sum to zero. Hence, using the fact that eigenvalues vary continuously, if we imagine varying τ and examining the eigenvalues of the map Jτ⊤(x∗) ⊕ Jτ⊤(x∗), this will tell us the range of τ for which spec(Jτ(x∗)) remains in C°₊.
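This singularity test can be sketched numerically. Below (our construction, applied to the Example 1 game with the illustrative choice v = 1) we form the Kronecker sum Jτ⊤ ⊕ Jτ⊤ and track its smallest eigenvalue magnitude in place of the determinant; it vanishes exactly when two eigenvalues of Jτ sum to zero, i.e., when a conjugate pair crosses the imaginary axis, which for this game happens at τ∗ = 2.

```python
# Sketch: scan tau and detect singularity of the Kronecker sum J_tau^T ⊕ J_tau^T,
# which flags the boundary of the stability region.
import numpy as np

v, n1 = 1.0, 2
M = np.array([[-v,   0.0, -v,   0.0],
              [0.0,  v/2, 0.0,  v/2],
              [-v,   0.0, -v/2, 0.0],
              [0.0,  v/2, 0.0,  -v]])

def guard(tau):
    A, B, C = M[:n1, :n1], M[:n1, n1:], M[n1:, n1:]
    J = np.block([[A, B], [-tau * B.T, -tau * C]])
    K = np.kron(J.T, np.eye(4)) + np.kron(np.eye(4), J.T)  # Kronecker sum
    return np.min(np.abs(np.linalg.eigvals(K)))            # ~0 on the boundary

print(guard(1.5), guard(2.0), guard(3.0))  # vanishes only at the crossing tau = 2
```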

This method of varying parameters and determining when the roots of a polynomial (or, correspondingly, the eigenvalues of a map) cross the boundary of a domain uses what is known as a guardian or guard map (cf. Saydy et al. (1990)). In particular, the guard map provides a certificate that the roots of a polynomial lie in a particular guarded domain for a range of parameter values. Formally, let X be the set of all real n × n matrices or the set of all polynomials of degree n with real coefficients. Consider an open subset S of X with closure S̄ and boundary ∂S. The map ν : X → ℂ is said to be a guardian map for S if for all x ∈ S̄,

 ν(x)=0 ⟺ x∈∂S.

Consider an open subset Ω of the complex plane that is symmetric with respect to the real axis (e.g., the open left-half complex plane C°₋). Then, elements of X whose spectrum (or roots) lie in Ω are said to be stable relative to Ω. Given a pathwise connected subset U of the parameter space, a domain S, and a guard map ν for S, it is known that the family {x(u) : u ∈ U} is stable relative to Ω if and only if it is nominally stable—i.e., x(u⁰) ∈ S for some u⁰ ∈ U—and