Learning in Games with Noisy Payoff Observations

# On the robustness of learning in games with stochastically perturbed payoff observations

Mario Bravo Universidad de Santiago de Chile, Departamento de Matemática y Ciencia de la Computación, Av.Libertador Bernardo O’Higgins 3363, Santiago, Chile  and  Panayotis Mertikopoulos CNRS (French National Center for Scientific Research), LIG, F-38000 Grenoble, France
and Univ. Grenoble Alpes, LIG, F-38000 Grenoble, France
http://mescal.imag.fr/membres/panayotis.mertikopoulos
###### Abstract.

Motivated by the scarcity of accurate payoff feedback in practical applications of game theory, we examine a class of learning dynamics where players adjust their choices based on past payoff observations that are subject to noise and random disturbances. First, in the single-player case (corresponding to an agent trying to adapt to an arbitrarily changing environment), we show that the stochastic dynamics under study lead to no regret almost surely, irrespective of the noise level in the player’s observations. In the multi-player case, we find that dominated strategies become extinct and we show that strict Nash equilibria are stochastically stable and attracting; conversely, if a state is stable or attracting with positive probability, then it is a Nash equilibrium. Finally, we provide an averaging principle for 2-player games, and we show that in zero-sum games with an interior equilibrium, time averages converge to Nash equilibrium for any noise level.

###### Key words and phrases:
Dominated strategies; learning; Nash equilibrium; regret minimization; regularization; robustness; stochastic game dynamics; stochastic stability
###### 2010 Mathematics Subject Classification:
Primary 60H10, 37N40, 91A26; secondary 60H30, 60J70, 91A22
The authors are greatly indebted to Roberto Cominetti for arranging the visit of the second author to the University of Chile and for his many constructive comments. The authors would also like to express their gratitude to Mathias Staudigl for his many insightful comments and suggestions, to Bill Sandholm, Josef Hofbauer, and Yannick Viossat for helpful discussions, and to two anonymous referees for their detailed remarks and recommendations.
Part of this work was carried out during the authors’ visit to the Hausdorff Research Institute for Mathematics at the University of Bonn in the framework of the Trimester Program “Stochastic Dynamics in Economics and Finance” and during the second author’s visit to the University of Chile. MB was partially supported by Fondecyt grant No. 11151003, and the Núcleo Milenio Información y Coordinación en Redes ICM/FIC RC130003. PM was partially supported by the French National Research Agency (grant nos. NETLEARN–13–INFR–004 and GAGA–13–JS01–0004–01) and the French National Center for Scientific Research (grant no. REAL.NET–PEPS–JCJC–INS2I–2014.)

## 1. Introduction

A central question in game-theoretic learning is whether the outcome of a learning process (viewed here as a plausible model for the behavior of optimizing agents) is also justifiable from the point of view of rationality – e.g. whether the process leads to a Nash equilibrium or a state where no dominated strategies are present. In that regard, one of the most widely studied learning procedures is the exponential weight (EW) algorithm that was originally introduced by Vovk (1990) and Littlestone and Warmuth (1994) in the context of multi-armed bandit problems. In a game-theoretic setting, the algorithm simply prescribes that players score their actions based on their cumulative payoffs and then assign choice probabilities proportionally to the exponential of each action’s score. As such, the EW algorithm in continuous time (Sorin, 2009) is described by the dynamics

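$$
\begin{aligned}
\dot y_{k\alpha} &= v_{k\alpha},\\
x_{k\alpha} &= \frac{\exp(y_{k\alpha})}{\sum_{\beta}\exp(y_{k\beta})},
\end{aligned}
\tag{EW}
$$

where, somewhat informally, $v_{k\alpha}$ denotes the payoff to the $\alpha$-th action of player $k$, $y_{k\alpha}$ is the action’s performance score (cumulative payoff) over time and $x_{k\alpha}$ is the corresponding mixed strategy weight.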

Given its long history and its link with the replicator dynamics of evolutionary game theory (discussed below), the rationality properties of (EW) are also relatively well understood. To name the most important ones: a) dominated strategies become extinct in the long run; b) limits of interior trajectories and stable rest points are Nash equilibria; c) strict equilibria are stable and attracting; and d) empirical distributions of play converge to equilibrium in 2-player zero-sum games with an interior equilibrium (Hofbauer and Sigmund, 1998; Sandholm, 2010). More recently, Sorin (2009) also showed that (EW) is universally consistent, i.e. players have no regret for following (EW) instead of any other fixed strategy in a dynamically changing environment.

On the other hand, a crucial limitation in the above considerations is that players are assumed to have perfect observations of their actions’ rewards. In practical applications of game theory (e.g. in economics, finance and traffic networks), “perfect feedback” requirements are often too stringent, especially in games with massively many actions and/or players. Accordingly, an important question that arises is whether (EW) retains its rationality properties in the presence of noise and stochastically perturbed payoff observations. Somewhat surprisingly, this is indeed the case (at least for some of them): even if the payoffs in (EW) are perturbed by a Brownian noise term of arbitrarily high variance, dominated strategies become extinct and strict equilibria remain stable and attracting with high probability (Mertikopoulos and Moustakas, 2009, 2010). Thus, even when the players’ true payoffs are masked by noise and uncertainty, the reinforcement learning principle behind (EW) allows players to weed out the noise and leads to similar outcomes as in the noiseless, deterministic regime.

Motivated by the robustness of exponential learning in noisy environments, we examine here a broad class of game-theoretic learning procedures where players adjust their strategies by playing an approximate best response to the vector of their actions’ cumulative payoffs – possibly subject to random disturbances and noise. In the case of perfect payoff observations, this scheme boils down to the reinforcement learning dynamics considered by Mertikopoulos and Sandholm (2016) who showed that the properties discussed above still hold in the absence of noise. With this in mind, our main contribution is to show that feedback noise does not really matter: even if the players’ payoff observations are subject to arbitrarily high (and possibly state-dependent or correlated) noise, dominated strategies become extinct (roughly at the same rate as in deterministic environments); strict Nash equilibria remain (stochastically) stable and attracting; and, in the converse direction, if a state is (stochastically) stable or attracting with positive probability, then it is also a Nash equilibrium. Finally, if players use a decreasing learning parameter to adjust the weight of their scoring process over time, the stochastic learning dynamics under study lead to no regret (a.s.) and their time average converges to equilibrium in 2-player zero-sum games with an interior equilibrium.

Our analysis also highlights an important difference between “static” solution concepts (such as Nash equilibria and strategic dominance) and more “dynamic” notions (such as regret minimization). Whereas the speed of convergence to static target states is accelerated by the use of a large, constant learning parameter (noise and disturbances notwithstanding), the rate of regret minimization is optimized by using a learning parameter that decays proportionally to (and which guarantees an bound for the players’ cumulative regret under uncertainty). This disparity only appears in the noisy regime and is due to the fact that players need to be more conservative when facing a fluid environment that varies with time in an (a priori) unpredictable fashion. Otherwise, if players have access to noiseless payoff observations, they can be significantly more greedy and achieve lower regret faster by using a constant learning parameter.

### 1.1. Related work

The long-term rationality properties of exponential learning in a game-theoretic setting were first studied in conjunction with those of the replicator dynamics, one of the most widely studied dynamical systems for population evolution under natural selection. Indeed, a simple differentiation of (EW) reveals that the evolution of the players’ mixed strategies under (EW) follows the differential equation:

$$
\dot x_{k\alpha} = x_{k\alpha}\Big[v_{k\alpha} - \sum_{\beta} x_{k\beta}v_{k\beta}\Big], \tag{RD}
$$

which is simply the (multi-population) replicator dynamics of Taylor and Jonker (1978). In this context, Akin (1980), Nachbar (1990) and Samuelson and Zhang (1992) showed that dominated strategies become extinct, while it is well known that a) stable states of (RD) are Nash equilibria; b) strict Nash equilibria are asymptotically stable; and c) time averages of replicator orbits converge to equilibrium in 2-player games provided that no strategy share becomes arbitrarily small (Hofbauer and Sigmund, 1998, 2003).

This two-way relationship between exponential learning and the replicator dynamics was noted early on by Rustichini (1999) in his study of reinforcement learning models for games – i.e. learning how to react to a given situation so as to maximize a numerical reward (Sutton and Barto, 1998). From a game-theoretic point of view, the learning models of Börgers and Sarin (1997) and Erev and Roth (1998) are also closely related to the replicator dynamics – and, hence, to (EW) – while Fudenberg and Levine (1998), Hofbauer and Sandholm (2002), and Hopkins (2002) studied a smooth variant of the well-known fictitious play algorithm where the players play a perturbed best response to their opponents’ empirical frequency of play (for a closely related model, see also Cominetti et al., 2010). Up to mild technical differences, the correlated version of these models can be seen as a noiseless version of our reinforcement learning model, viz. the dynamics of Mertikopoulos and Sandholm (2016) with the parameter choice ; we explore this relation in more detail in Sections 3 and 6.

Of course, a crucial aspect of these considerations is whether players have accurate observations of their actions’ rewards or only a noisy estimate thereof: if the former is not the case, noise and fluctuations could potentially lead to suboptimal outcomes with high probability. In biology and evolutionary game theory (where payoffs measure the reproductive fitness of a biological species or the average payoff of populations of nonatomic players respectively), Fudenberg and Harris (1992) accounted for such fluctuations by introducing a stochastic variant of the replicator dynamics where evolution is perturbed by “aggregate shocks” that reflect the impact of weather-like effects and other perturbations. (Khasminskii and Potsepun (2006) also consider a Stratonovich-based model, while Vlasic (2012) examines the case of random jumps incurred by catastrophic, earthquake-like events.) In this framework, Cabrales (2000), Imhof (2005) and Hofbauer and Imhof (2009) showed that dominated strategies are still eliminated if the variability of the shocks across different genotypes (strategies) is not too high, while Imhof (2005) and Hofbauer and Imhof (2009) showed that strict Nash equilibria of a modified game are stochastically asymptotically stable under the replicator dynamics with aggregate shocks. Whether the equilibria of the original game are themselves asymptotically stable depends on the intensity of the noise on different strategies and also on the exact way that the noise enters the process (Mertikopoulos and Viossat, 2016).

On the other hand, Mertikopoulos and Moustakas (2009, 2010) showed that the dynamics obtained by Fudenberg and Harris (1992) do not coincide with the stochastic replicator dynamics induced by (EW) in the presence of random disturbances and measurement noise – in contrast to the noiseless case where (EW) and (RD) do coincide. As we mentioned above, this learning variant of the stochastic replicator dynamics actually retains the rationality properties of the deterministic system (EW)/(RD) without any caveats on the noise: dominated strategies become extinct and strict Nash equilibria remain stochastically asymptotically stable without any conditions on the perturbations’ magnitude. This shows that the origin of the noise is crucial in the determination of the dynamics’ long-term properties: whereas the “aggregate shocks” of evolutionary environments lead to rational behavior in a modified game, no such modifications are required when learning under uncertainty.

### 1.2. Paper outline

In Section 2, we present our model for learning in the presence of noise and we derive the system of coupled stochastic differential equations that governs the evolution of the players’ mixed strategies. In Section 3, we show that if the players use a smoothly decreasing learning parameter, then the learning dynamics under study lead to no regret (a.s.), whatever the noise level. In Section 4, we investigate dominated strategies: we show that dominated strategies become extinct (a.s.) and we derive an explicit bound for their extinction rate. Section 5 focuses on the dynamics’ long-term stability and convergence properties: we show that a) stochastically (Lyapunov) stable states and states that attract trajectories of play with positive probability are Nash equilibria; and b) strict Nash equilibria are stochastically asymptotically stable, irrespective of the fluctuations’ magnitude. In Section 6, we provide an averaging principle for 2-player games in the spirit of Hofbauer et al. (2009); thanks to this principle, we then show that empirical distributions of play converge to Nash equilibrium in zero-sum games (again, no matter the noise level). Finally, in Section 7, we discuss the relaxation of some of the assumptions on the noise process – and, in particular, the independence of the observation noise across players and strategies.

To streamline our presentation, we included several numerical illustrations in the main text (Figs. 1 and 2) and we relegated the more convoluted proofs to a series of appendices at the end.

## 2. The model

After a few preliminaries to set notation and terminology, this section focuses on the basic properties of a broad class of reinforcement learning dynamics under noise and uncertainty. In the noiseless, deterministic regime (Section 2.2), our model essentially boils down to the class of dynamics recently studied by Mertikopoulos and Sandholm (2016). The full stochastic framework (Section 2.3) is then obtained by positing that the agents’ observations are perturbed at each moment in time by a zero-mean stochastic process (an Itô diffusion).

### 2.1. Preliminaries

#### Notation

If is a finite set, the real vector space generated by will be denoted by and we will write for its canonical basis; for concision, we will also use to refer interchangeably to or , writing e.g. instead of . The set of probability measures on will be identified with the -dimensional simplex and the relative interior of will be written .

If is a finite family of sets, we will use the shorthand for the tuple and we will write instead of . Finally, any statement of the form for a stochastic process should be interpreted in the a.s. sense, i.e. “for every , there exists a.s. some (random) such that for all ” – and likewise for the statement “”.

#### Definitions from game theory

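A finite game in normal form is a tuple consisting of a) a finite set of players $\mathcal{N} = \{1,\dots,N\}$; b) a finite set of actions (or pure strategies) $\mathcal{A}_k$ per player $k\in\mathcal{N}$; and c) the players’ payoff functions $u_k\colon\mathcal{A}\to\mathbb{R}$, where $\mathcal{A}\equiv\prod_k\mathcal{A}_k$ denotes the set of all joint action profiles $(\alpha_1,\dots,\alpha_N)$. The set of mixed strategies of player $k$ is denoted by $\mathcal{X}_k\equiv\Delta(\mathcal{A}_k)$ and the space $\mathcal{X}\equiv\prod_k\mathcal{X}_k$ of mixed strategy profiles $x=(x_1,\dots,x_N)$ will be called the game’s strategy space. In this mixed context, the expected payoff of player $k$ in the strategy profile $x=(x_1,\dots,x_N)$ is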

$$
u_k(x) = \sum_{\alpha_1}^{1}\cdots\sum_{\alpha_N}^{N} u_k(\alpha_1,\dots,\alpha_N)\, x_{1,\alpha_1}\cdots x_{N,\alpha_N}, \tag{2.1}
$$

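where, in a slight abuse of notation, $u_k(\alpha_1,\dots,\alpha_N)$ denotes the payoff of player $k$ in the profile $(\alpha_1,\dots,\alpha_N)$. Accordingly, the payoff corresponding to the action $\alpha$ of player $k$ in the mixed strategy profile $x$ is

$$
v_{k\alpha}(x) \equiv u_k(\alpha; x_{-k}) = \sum_{\alpha_1}^{1}\cdots\sum_{\alpha_N}^{N} u_k(\alpha_1,\dots,\alpha_N)\, x_{1,\alpha_1}\cdots\delta_{\alpha_k,\alpha}\cdots x_{N,\alpha_N}, \tag{2.2}
$$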

leading to the more concise expression

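$$
u_k(x) = \sum_{\alpha}^{k} x_{k\alpha}v_{k\alpha}(x) = \langle v_k(x), x_k\rangle \tag{2.3}
$$

where $v_k(x) = (v_{k\alpha}(x))_{\alpha}$ denotes the payoff vector of player $k$ at $x$.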
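As a minimal numerical sketch of (2.1)–(2.3), the snippet below computes the payoff vectors of a two-player game stored as payoff matrices and lets one check the identity $u_k(x)=\langle v_k(x),x_k\rangle$. The function name `payoff_vectors`, the matrix layout and the Matching Pennies payoffs are illustrative assumptions, not part of the original text.

```python
import numpy as np

def payoff_vectors(A, B, x1, x2):
    """Payoff vectors v_1(x), v_2(x) of a two-player game.

    Assumed layout (hypothetical): A[i, j] and B[i, j] are the payoffs of
    players 1 and 2 when player 1 plays action i and player 2 plays action j.
    """
    v1 = A @ x2      # v_{1,i}(x) = sum_j A[i, j] * x2[j]
    v2 = B.T @ x1    # v_{2,j}(x) = sum_i B[i, j] * x1[i]
    return v1, v2

# Matching Pennies, used here only as an illustrative zero-sum game
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = -A
x1 = np.array([0.7, 0.3])
x2 = np.array([0.4, 0.6])

v1, v2 = payoff_vectors(A, B, x1, x2)
u1 = x1 @ A @ x2     # expected payoff of player 1, as in (2.1)
u2 = x1 @ B @ x2     # expected payoff of player 2, as in (2.1)
```

In this example both `u1` and `v1 @ x1` are computed from the same bilinear form, which is the content of (2.3).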

### 2.2. The deterministic model

Our learning model will be based on the following simple idea: the game’s players “score” their actions by keeping track of their cumulative payoffs and then, at each moment , they play an “approximate” best response to this vector of performance scores (thus reinforcing the probability of playing an action with a higher overall payoff). More precisely, following Mertikopoulos and Sandholm (2016), this process can be described by the dynamics

$$
\begin{aligned}
y_k(t) &= \int_0^t v_k(x(s))\,ds,\\
x_k(t) &= Q_k(\eta_k(t)\,y_k(t)),
\end{aligned}
\tag{RL}
$$

where:

1. The score vector $y_k(t)$ of player $k$ ranks each strategy based on its cumulative payoff up to time $t$.

2. $Q_k$ is a choice map which reinforces the strategies with the highest scores (see below for a rigorous definition).

3. $\eta_k(t)$ is a learning parameter which can be tuned freely by each player.

Clearly, a natural choice for the “scores-to-strategies” map would be to take the correspondence $y_k\mapsto\arg\max_{x_k\in\mathcal{X}_k}\langle y_k,x_k\rangle$, i.e. to greedily assign all weight to the strategy (or strategies) with the highest score. However, since the $\arg\max$ operator is multi-valued (so each player would also need to employ some tie-breaking rule to resolve ambiguities), we will focus on single-valued choice maps of the general form

$$
Q_k(y_k) = \arg\max_{x_k\in\mathcal{X}_k}\{\langle y_k,x_k\rangle - h_k(x_k)\}, \tag{2.4}
$$

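where the penalty function $h_k\colon\mathcal{X}_k\to\mathbb{R}$ satisfies the following properties:

1. $h_k$ is continuous on $\mathcal{X}_k$.

2. $h_k$ is smooth on the relative interior of every face of $\mathcal{X}_k$.

3. $h_k$ is strongly convex on $\mathcal{X}_k$, i.e. there exists some $K>0$ such that

$$
h_k(tx_k+(1-t)x_k') \leq t\,h_k(x_k) + (1-t)\,h_k(x_k') - \tfrac{1}{2}Kt(1-t)\,\|x_k'-x_k\|^2, \tag{2.5}
$$

for all $x_k,x_k'\in\mathcal{X}_k$ and for all $t\in[0,1]$.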

This “softening” of the $\arg\max$ operator has a long history in game theory and optimization, and the induced map $Q_k$ is intimately related to the notion of softmax or perturbed/regularized best response maps; for an in-depth discussion, we refer the reader to Hofbauer and Sandholm (2002) and Mertikopoulos and Sandholm (2016). For our immediate purposes, the key observation is that (2.4) is a strictly concave problem, so it admits a unique solution for every input vector $y_k$. Therefore, when the penalty term $h_k(x_k)$ is small relative to $\langle y_k,x_k\rangle$, $Q_k$ can be seen as a single-valued approximation of the standard best response correspondence $y_k\mapsto\arg\max_{x_k\in\mathcal{X}_k}\langle y_k,x_k\rangle$.

Regarding the learning parameter $\eta_k$, its role in (RL) is to temper the growth of the cumulative payoff vector $y_k$ so as to allow the player to better explore his strategies (instead of prematurely reinforcing one or another). In other words, $\eta_k$ can be interpreted as an extrinsic weight that the player assigns to cumulative observations of past payoffs. (The role of the learning parameter can also be linked to the vanishing step-size rules that are used in the theory of stochastic approximation – see e.g. Benaïm (1999), Lamberton et al. (2004), Oyarzun and Ruf (2014) and references therein. The difference between the two is that, in the theory of stochastic approximation, a variable step-size means that new information enters the algorithm with decreasing weight; on the other hand, in our context, all information is weighted evenly, but the aggregate information is weighted by $\eta_k$ to avoid extreme behaviors.) Accordingly, given that $y_k(t)$ grows (at most) as $\mathcal{O}(t)$, we will assume throughout that:

###### Assumption 1.

is -smooth, nonincreasing and .

The decay rate of will play a crucial role when the players’ payoff observations are subject to stochastic perturbations, an issue that we will explore in detail in later sections. For now, we turn to two representative examples of (RL):

###### Example 2.1.

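The prototype penalty function on the simplex is the Gibbs (negative) entropy $h(x)=\sum_{\alpha}x_{\alpha}\log x_{\alpha}$ which, after a standard calculation, yields the so-called logit map:

$$
G_{\alpha}(y) = \frac{\exp(y_{\alpha})}{\sum_{\beta}\exp(y_{\beta})}. \tag{2.6}
$$

An easy differentiation then yields

$$
\dot x_{k\alpha} = \frac{e^{y_{k\alpha}}\dot y_{k\alpha}}{\sum_{\beta}^{k}e^{y_{k\beta}}} - \frac{e^{y_{k\alpha}}\sum_{\beta}^{k}e^{y_{k\beta}}\dot y_{k\beta}}{\big(\sum_{\beta}^{k}e^{y_{k\beta}}\big)^{2}} = x_{k\alpha}\Big[v_{k\alpha}(x) - \sum_{\beta}^{k}x_{k\beta}v_{k\beta}(x)\Big], \tag{RD}
$$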

which is simply the (multi-population) replicator equation of Taylor and Jonker (1978) for population evolution under natural selection. For a more thorough treatment of the links between logit choice and the replicator dynamics, see Hofbauer et al. (2009), Mertikopoulos and Moustakas (2010), and references therein.
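To make the link between the entropic penalty and the logit map (2.6) concrete, the following sketch evaluates the objective of (2.4) with $h(x)=\sum_\alpha x_\alpha\log x_\alpha$ at the logit point and at randomly sampled points of the simplex. The helper names, the score vector and the sampling scheme are illustrative assumptions; the max-shift inside `logit_choice` is a standard numerical safeguard rather than something taken from the text.

```python
import numpy as np

def logit_choice(y):
    """Logit map G(y) of (2.6), shifted by max(y) for numerical stability."""
    z = np.exp(y - np.max(y))
    return z / z.sum()

def entropic_objective(y, x):
    """<y, x> - h(x) with h(x) = sum_a x_a log x_a, for x in the interior of the simplex."""
    return y @ x - np.sum(x * np.log(x))

rng = np.random.default_rng(0)
y = np.array([1.0, 0.5, -2.0])          # hypothetical score vector
x_star = logit_choice(y)

# Random interior points of the simplex, used only as a sanity check
candidates = rng.dirichlet(np.ones(3), size=1000)
best_random = max(entropic_objective(y, x) for x in candidates)
best_logit = entropic_objective(y, x_star)
```

Under these assumptions, `best_logit` should be at least as large as `best_random`, and it coincides with $\log\sum_\beta\exp(y_\beta)$, the maximal value of the entropic problem.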

###### Example 2.2.

As another example, consider the penalty function $h(x)=\frac{1}{2}\|x\|^{2}$. This quadratic penalty function leads to the projected choice map

$$
\Pi(y) = \arg\max_{x\in\Delta}\{\langle y,x\rangle - \tfrac{1}{2}\|x\|^{2}\} = \arg\min_{x\in\Delta}\|y-x\|^{2}, \tag{2.7}
$$

and, as was shown by Mertikopoulos and Sandholm (2016), the induced trajectories of (RL) satisfy the so-called projection dynamics

$$
\dot x_{k\alpha} =
\begin{cases}
v_{k\alpha}(x) - |\operatorname{supp}(x_k)|^{-1}\sum_{\beta\in\operatorname{supp}(x_k)}v_{k\beta}(x) & \text{if } x_{k\alpha}>0,\\
0 & \text{otherwise},
\end{cases}
\tag{PD}
$$

on an open dense set of times (in particular, except when the support of $x_k(t)$ changes). The dynamics (PD) were introduced in game theory by Friedman (1991) as a geometric model of the evolution of play in population games; for a closely related model (but with different long-term properties), see Nagurney and Zhang (1997) and Lahkar and Sandholm (2008).
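The projected choice map of (2.7) can be evaluated exactly with a standard sort-based routine for Euclidean projection onto the simplex. The sketch below is one such routine; it is a well-known generic algorithm and not a construction from the text, and the function name and test vector are illustrative assumptions.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the probability simplex (sort-based method)."""
    u = np.sort(y)[::-1]                  # scores in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(y) + 1)
    # Largest k with u_k + (1 - sum_{j<=k} u_j) / k > 0 gives the support size
    k = ks[u + (1.0 - css) / ks > 0][-1]
    tau = (css[k - 1] - 1.0) / k          # common shift applied on the support
    return np.maximum(y - tau, 0.0)

# Hypothetical score vector: the lowest-scoring action is dropped from the support
x = project_simplex(np.array([0.9, 0.6, -1.0]))   # gives [0.65, 0.35, 0.0]
```

Unlike the logit map, the projection assigns zero weight to actions with sufficiently low scores, which is why the support of $x_k$ can change along trajectories of (PD).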

#### Related models

The reinforcement mechanism of (RL) is seen quite clearly in the work of Vovk (1990) and Littlestone and Warmuth (1994) on multi-armed bandits. Therein, the agent is faced with a repeated decision process (e.g. choosing a slot machine in the eponymous problem) and, at each stage, he selects an action with probability exponentially proportional to an estimate of said action’s cumulative payoff up to that time. In this way, the mean dynamics of the agent’s learning process boil down to the exponential learning scheme (EW) of Example 2.1 – itself a special case of (RL).

Leslie and Collins (2005) and Coucheney et al. (2015) also considered a reinforcement learning process where players discount past observations by a constant multiplicative factor and then play a perturbed best response to the resulting cumulative payoff vector – or estimate thereof. The mean dynamics that describe this process in continuous time are then given by (RL) with constant and an additional adjustment term that accounts for the exponential discounting of past observations. For a more detailed survey of the surrounding literature, we refer the reader to Fudenberg and Levine (1998) and Mertikopoulos and Sandholm (2016).

### 2.3. Learning in the presence of noise

A key assumption underlying the reinforcement learning scheme (RL) is that the players’ payoff observations are impervious to any sort of exogenous random noise. However, this assumption is rarely met in practical applications of game-theoretic learning: for instance, in telecommunication networks and traffic engineering, signal strength and latency measurements are constantly subject to stochastic fluctuations which introduce noise to the input of any learning algorithm (Kelly et al., 1998; Kang et al., 2009). Thus, to account for the lack of accurate payoff information in settings where uncertainty is an issue, we will consider the stochastically perturbed reinforcement learning process:

$$
\begin{aligned}
dY_{k\alpha} &= v_{k\alpha}(X)\,dt + \sigma_{k\alpha}(X)\,dW_{k\alpha},\\
X_k &= Q_k(\eta_kY_k),
\end{aligned}
\tag{SRL}
$$

where $W=(W_{k\alpha})$ is a family of standard Wiener processes (assumed independent across players and strategies) and the diffusion coefficients $\sigma_{k\alpha}\colon\mathcal{X}\to\mathbb{R}$ (assumed Lipschitz) measure the strength of the noise process.

Intuitively, the random component of (SRL) means that the players’ observed payoffs are only accurate up to an error term of the order of $\sigma_{k\alpha}$, possibly depending on the players’ mixed strategies but otherwise uncorrelated over players and strategies. (An alternative source of stochasticity could be any inherent randomness in the players’ payoffs. A detailed analysis of this case would require a careful reformulation of the underlying game which, for simplicity, we do not attempt here; for a related treatment in an evolutionary context, see Mertikopoulos and Viossat (2016).) As such, (SRL) should be viewed as a special instance of the more general case where observation errors exhibit correlations between different actions: for instance, if each player’s action corresponds to a choice of route in a congestion game, any two routes that overlap will exhibit such correlations. The impact of including such correlations in our model is discussed at length in Section 7; however, for simplicity, our analysis will be stated in the uncorrelated case.

Regarding the existence and uniqueness of solutions to (SRL), Proposition B.1 in Appendix B shows that the players’ choice maps are Lipschitz continuous, so (SRL) admits a unique (strong) solution for every initial condition $Y(0)$. Standard arguments can then be used to show that these solutions exist for all time (a.s.), so the players’ mixed strategy profile $X(t)$ is also fully determined for all $t\geq0$. However, since this is a somewhat indirect description of the evolution of $X(t)$, our first task will be to derive the governing dynamics of $X(t)$ in the form of a stochastic differential equation stated directly on $\mathcal{X}$.
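For intuition, (SRL) can be simulated with a simple Euler–Maruyama discretization. The sketch below does so for a two-player game with logit choice; the constant noise level `sigma`, the constant learning parameter `eta`, the zero initial scores, the step size and the Prisoner's Dilemma payoffs are all illustrative assumptions rather than choices made in the text.

```python
import numpy as np

def logit_choice(y):
    """Logit map, shifted by max(y) for numerical stability."""
    z = np.exp(y - np.max(y))
    return z / z.sum()

def simulate_srl(A, B, sigma=1.0, eta=1.0, T=100.0, dt=0.01, seed=0):
    """Euler-Maruyama sketch of (SRL) for a two-player game with logit choice.

    Assumptions (not from the source): constant sigma and eta, zero initial
    scores, payoffs A[i, j] (player 1) and B[i, j] (player 2).
    """
    rng = np.random.default_rng(seed)
    y1 = np.zeros(A.shape[0])
    y2 = np.zeros(A.shape[1])
    for _ in range(int(T / dt)):
        x1, x2 = logit_choice(eta * y1), logit_choice(eta * y2)
        v1, v2 = A @ x2, B.T @ x1
        # dY = v dt + sigma dW, with independent Wiener increments ~ N(0, dt)
        y1 = y1 + v1 * dt + sigma * np.sqrt(dt) * rng.standard_normal(y1.shape)
        y2 = y2 + v2 * dt + sigma * np.sqrt(dt) * rng.standard_normal(y2.shape)
    return logit_choice(eta * y1), logit_choice(eta * y2)

# Prisoner's Dilemma (0 = cooperate, 1 = defect): cooperation is strictly dominated
A = np.array([[3.0, 0.0], [5.0, 1.0]])
B = A.T
x1, x2 = simulate_srl(A, B)
```

In this example the payoff gap between defecting and cooperating is at least 1, so the score difference drifts upward linearly while the noise only grows like the square root of time; the simulated weight on the dominated action is therefore expected to be negligible at the end of the run.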

For simplicity, we only present here the special case where each player’s penalty function is of the separable form:

$$
h_k(x_k) = \sum_{\alpha}^{k}\theta_k(x_{k\alpha}) \tag{2.8}
$$

for some strongly convex kernel function $\theta_k$. We then get:

###### Proposition 2.1.

Let be an orbit of (SRL) in and let be a random open interval such that remains constant over . Then, for all , satisfies the stochastic differential equation

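$$
\begin{aligned}
dX_{k\alpha} &= \frac{\eta_k}{\theta''_{k\alpha}}\Big[v_{k\alpha} - \Theta''_k\sum\nolimits_{\beta}v_{k\beta}/\theta''_{k\beta}\Big]dt && \text{(2.9a)}\\
&\quad + \frac{\eta_k}{\theta''_{k\alpha}}\Big[\sigma_{k\alpha}\,dW_{k\alpha} - \Theta''_k\sum\nolimits_{\beta}\sigma_{k\beta}/\theta''_{k\beta}\,dW_{k\beta}\Big] && \text{(2.9b)}\\
&\quad + \frac{\dot\eta_k}{\eta_k}\frac{1}{\theta''_{k\alpha}}\Big[\theta'_{k\alpha} - \Theta''_k\sum\nolimits_{\beta}\theta'_{k\beta}/\theta''_{k\beta}\Big]dt && \text{(2.9c)}\\
&\quad - \frac{1}{2}\frac{1}{\theta''_{k\alpha}}\Big[\theta'''_{k\alpha}U^{2}_{k\alpha} - \Theta''_k\sum\nolimits_{\beta}\theta'''_{k\beta}/\theta''_{k\beta}\,U^{2}_{k\beta}\Big]dt, && \text{(2.9d)}
\end{aligned}
$$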

where all summations are taken over and:

1. , , ,

2. ,

3. .

In particular, if for all , is an ordinary (strong) solution of (2.9); otherwise, satisfies (2.9) on a random open dense subset of .

The proof of Proposition 2.1 is a simple but interesting application of Itô’s formula; to streamline our presentation, we relegate it to Appendix A and we focus here on some remarks and examples (see also Fig. 1 for some sample trajectories of the process):

###### Remark 2.1.

Even though the dynamics (2.9) appear quite convoluted, each of the constituent terms (2.9a)–(2.9d) admits a relatively simple interpretation:

1. The term (2.9a) drives the process in the case , . These are the deterministic dynamics studied by Mertikopoulos and Sandholm (2016) and correspond to learning in settings with no uncertainty.

2. The diffusion term (2.9b) reflects the direct impact of the noise on (SRL).

3. The term (2.9c) is due to the variability of the players’ learning rate, so its impact on (2.9) vanishes if $\dot\eta_k/\eta_k\to0$ sufficiently fast. On the other hand, this term obviously persists even if there is no noise in the players’ learning process.

4. Finally, the term (2.9d) is the Itô correction induced on $X$ through (SRL) due to the non-anticipative nature of the Itô integral. (If (SRL) had been formulated as a Stratonovich equation, (2.9d) would vanish; however, the future-anticipating nature of the Stratonovich integral (van Kampen, 1981) is not well-suited for our purposes.) Importantly, this term depends on $\eta_k$ but not $\dot\eta_k$, so it does not vanish for constant $\eta_k$.

###### Example 2.3.

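As we saw in Example 2.1, the replicator dynamics (RD) correspond to the entropic kernel $\theta(x)=x\log x$. In this case, (2.9) leads to the following stochastic variant of the replicator dynamics:

$$
\begin{aligned}
dX_{k\alpha} &= \eta_kX_{k\alpha}\Big[v_{k\alpha} - \sum\nolimits_{\beta}^{k}X_{k\beta}v_{k\beta}\Big]dt\\
&\quad + \eta_kX_{k\alpha}\Big[\sigma_{k\alpha}\,dW_{k\alpha} - \sum\nolimits_{\beta}^{k}\sigma_{k\beta}X_{k\beta}\,dW_{k\beta}\Big]\\
&\quad + \frac{\dot\eta_k}{\eta_k}X_{k\alpha}\Big[\log X_{k\alpha} - \sum\nolimits_{\beta}^{k}X_{k\beta}\log X_{k\beta}\Big]dt\\
&\quad + \frac{\eta_k^{2}}{2}X_{k\alpha}\Big[\sigma_{k\alpha}^{2}(1-2X_{k\alpha}) - \sum\nolimits_{\beta}^{k}\sigma_{k\beta}^{2}X_{k\beta}(1-2X_{k\beta})\Big]dt.
\end{aligned}
\tag{SRD}
$$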

For constant $\eta_k$, (SRD) is simply the stochastic replicator dynamics of exponential learning studied by Mertikopoulos and Moustakas (2010). As such, (SRD) should be contrasted to the evolutionary replicator dynamics with aggregate shocks of Fudenberg and Harris (1992):

$$
\begin{aligned}
dX_{k\alpha} &= X_{k\alpha}\Big[v_{k\alpha} - \sum\nolimits_{\beta}^{k}X_{k\beta}v_{k\beta}\Big]dt\\
&\quad + X_{k\alpha}\Big[\sigma_{k\alpha}\,dW_{k\alpha} - \sum\nolimits_{\beta}^{k}\sigma_{k\beta}X_{k\beta}\,dW_{k\beta}\Big]\\
&\quad - X_{k\alpha}\Big[\sigma_{k\alpha}^{2}X_{k\alpha} - \sum\nolimits_{\beta}^{k}\sigma_{k\beta}^{2}X_{k\beta}^{2}\Big]dt,
\end{aligned}
\tag{ASRD}
$$

where $X_{k\alpha}$ denotes the population share of the $\alpha$-th genotype of species $k$ in a multi-species environment, $v_{k\alpha}$ represents its reproductive fitness, and the noise coefficients $\sigma_{k\alpha}$ measure the impact of random weather-like effects on population evolution. (For a comprehensive account of the literature surrounding (ASRD), see Hofbauer and Imhof (2009), Mertikopoulos and Viossat (2016), and references therein. In addition to the works mentioned above, Khasminskii and Potsepun (2006) study a Stratonovich-based formulation of (ASRD) while Vlasic (2012) also considers random jumps induced by discontinuous, earthquake-like events.)

Besides the absence of the learning rate $\eta_k$, the fundamental difference between (SRD) and (ASRD) is in their Itô correction: as we shall see in what follows, this term leads to a drastically different long-term behavior and highlights an important contrast between learning and evolution in the presence of noise.

###### Example 2.4.

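In the case of the projected reinforcement learning scheme (PD), substituting $\theta(x)=\frac{1}{2}x^{2}$ in (2.9) yields the stochastic projection dynamics:

$$
\begin{aligned}
dX_{k\alpha} &= \eta_k\Big[v_{k\alpha} - |\operatorname{supp}(X_k)|^{-1}\sum\nolimits_{\beta\in\operatorname{supp}(X_k)}v_{k\beta}\Big]dt\\
&\quad + \eta_k\Big[\sigma_{k\alpha}\,dW_{k\alpha} - |\operatorname{supp}(X_k)|^{-1}\sum\nolimits_{\beta\in\operatorname{supp}(X_k)}\sigma_{k\beta}\,dW_{k\beta}\Big]\\
&\quad + \frac{\dot\eta_k}{\eta_k}\Big[X_{k\alpha} - |\operatorname{supp}(X_k)|^{-1}\Big]dt.
\end{aligned}
\tag{SPD}
$$

There are two important qualitative differences between (SRD) and (SPD): first, (SRD) holds for all $t\geq0$ whereas (SPD) describes the solution orbits of (SRL) only on intervals over which the support of $X(t)$ remains constant. Second, the projection mapping of (2.7) is piecewise linear, so there is no Itô correction in (SPD); accordingly, the distinction between Itô and Stratonovich perturbations becomes void in the context of (SPD).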

## 3. Regret minimization

We begin our rationality analysis in the case where there is a single player whose payoffs are determined by the state of his environment – which, in turn, may evolve arbitrarily over time (including adversarially if the player is playing against an opponent).

More precisely, consider a decision process where, at each $t\geq0$, the player chooses an action from a finite set $\mathcal{A}$ according to some mixed strategy $x(t)\in\mathcal{X}\equiv\Delta(\mathcal{A})$ and his expected payoff is determined by the (a priori unknown) payoff vector $v(t)$ of stage $t$. In this context, the performance of a dynamic strategy $x(t)$ can be measured by comparing the player’s (expected) cumulative payoff to the payoff that he could have obtained if the state of nature were known in advance and the player had best-responded to it; specifically, the player’s cumulative regret at time $t$ is defined as

$$
\operatorname{Reg}(t) = \max_{\alpha\in\mathcal{A}}\int_0^t v_{\alpha}(s)\,ds - \int_0^t\langle v(s),x(s)\rangle\,ds, \tag{3.1}
$$

and we say that a strategy $x(t)$ is consistent if it leads to no (average) regret, i.e.

$$
\limsup_{t\to\infty}\frac{1}{t}\operatorname{Reg}(t) \leq 0 \quad\text{(a.s.)}, \tag{3.2}
$$

or, equivalently:

$$
\operatorname{Reg}(t) = o(t)\quad\text{as } t\to\infty. \tag{3.3}
$$

The notion of consistency presented above is commonly referred to as external or universal consistency and was originally introduced by Hannan (1957) in a discrete-time context (for a detailed overview, see Fudenberg and Levine (1998), Shalev-Shwartz (2011) and references therein). There, the agent is assumed to receive a payoff vector at each stage and, observing the past realizations , he chooses an action according to some probability distribution . The strategy is then said to be externally consistent if, on average, it earns more than any action (or expert suggestion) ; in other words, the no-regret criterion (3.2) is simply the continuous-time analogue of the standard definition of external consistency.
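The regret (3.1) is straightforward to approximate on a time grid. The sketch below uses a Riemann sum with a uniform step; the function name, the grid and the constant payoff stream are illustrative assumptions chosen so that the answer can be checked by hand.

```python
import numpy as np

def cumulative_regret(v, x, dt):
    """Riemann-sum approximation of Reg(t) in (3.1) on a uniform time grid.

    v[i] is the payoff vector and x[i] the mixed strategy on the i-th step
    of length dt (a discretization assumption, not part of the source).
    """
    best_fixed = np.max(np.cumsum(v, axis=0), axis=1) * dt   # max_a int_0^t v_a(s) ds
    realized = np.cumsum(np.sum(v * x, axis=1)) * dt          # int_0^t <v(s), x(s)> ds
    return best_fixed - realized

dt = 0.1
v = np.tile([1.0, 0.0], (100, 1))        # action 0 always pays 1, action 1 pays 0
x_uniform = np.full((100, 2), 0.5)       # uniform play at every step
x_best = np.tile([1.0, 0.0], (100, 1))   # always play the best action

reg_uniform = cumulative_regret(v, x_uniform, dt)
reg_best = cumulative_regret(v, x_best, dt)
```

With these inputs, uniform play accumulates regret at rate 1/2 (so it is not consistent in the sense of (3.2)), while always playing the best action incurs no regret.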

With this in mind, the main question that we seek to address here is whether the perturbed reinforcement learning process (SRL) is consistent. To make this precise, we will focus on the unilateral process

$$
\begin{aligned}
dY_{\alpha}(t) &= v_{\alpha}(t)\,dt + \sigma_{\alpha}(t)\,dW_{\alpha}(t),\\
X(t) &= Q(\eta(t)Y(t)),
\end{aligned}
\tag{SRL-U}
$$

where $v(t)$ is a locally integrable stream of payoff vectors, $W(t)$ is a Wiener process in $\mathbb{R}^{\mathcal{A}}$ and the noise coefficients $\sigma_{\alpha}(t)$ (assumed continuous and bounded) represent the error in the player’s payoff observations. In the deterministic case ($\sigma=0$), Sorin (2009) proved that the unilateral variant of (EW) is consistent, a result which was recently extended by Kwon and Mertikopoulos (2014) to the more general setting of (RL). Below, we show that this learning process remains consistent even when the player’s payoff observations are subject to arbitrarily high observation noise:

###### Theorem 3.1.

Assume that (SRL-U) is run with learning parameter satisfying (in addition to Assumption 1) and initial bias . Then, (SRL-U) is consistent and it enjoys the cumulative regret bound:

 (3.4)

where , and is the strong convexity constant of the player’s penalty function .

###### Remark.

The assumption is only made for simplicity: the policy (SRL-U) is consistent for any initial score vector but the corresponding regret guarantee is a bit more cumbersome to write down.
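To illustrate the statement numerically, the following sketch discretizes (SRL-U) with an Euler–Maruyama scheme. Several ingredients are assumptions made only for illustration: the logit map stands in for the choice map $Q$, the payoff stream is constant, the noise level is constant and common to all actions, and the learning parameter is taken as $\eta(t) = \min\{1, t^{-\gamma}\}$ to avoid the singularity at $t = 0$. None of these choices is prescribed by the text.

```python
import math
import random


def logit(y):
    """Logit choice map: an assumed instance of Q (entropic penalty)."""
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [wi / s for wi in w]


def simulate_srl_u(v, sigma, gamma=0.5, horizon=200.0, dt=0.01, seed=0):
    """Euler-Maruyama sketch of (SRL-U) with eta(t) = min(1, t^-gamma).

    v     : constant payoff vector (an illustrative payoff stream)
    sigma : constant noise level, shared by all actions
    Returns the average regret Reg(T)/T at the end of the horizon.
    """
    rng = random.Random(seed)
    y = [0.0] * len(v)          # initial score Y(0) = 0
    earned = 0.0
    steps = round(horizon / dt)
    sqrt_dt = math.sqrt(dt)
    for i in range(steps):
        t = i * dt
        eta = 1.0 if t < 1.0 else t ** (-gamma)
        x = logit([eta * yi for yi in y])
        earned += sum(vi * xi for vi, xi in zip(v, x)) * dt
        # dY = v dt + sigma dW, with independent increments dW ~ N(0, dt)
        y = [yi + vi * dt + sigma * rng.gauss(0.0, sqrt_dt)
             for yi, vi in zip(y, v)]
    total_time = steps * dt
    best = max(v) * total_time  # payoff of the best fixed action
    return (best - earned) / total_time


avg_regret = simulate_srl_u(v=[1.0, 0.0], sigma=2.0)
print(f"average regret at T=200: {avg_regret:.3f}")
```

A single simulated path cannot verify an almost sure statement; the sketch is only meant to show how the average regret can be tracked along a discretized trajectory.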

The basic idea of our proof will be to compare the cumulative payoff of the policy (SRL-U) up to time to that of an arbitrary test strategy . To that end, we will examine how far the induced trajectory of play can stray from ; however, since is defined via the cumulative payoff process , we will carry out this comparison between and directly using the so-called Fenchel coupling:

 $$F(x,y) = h(x) + h^*(y) - \langle y, x\rangle, \tag{3.5}$$

where

 $$h^*(y) = \max_{x\in\mathcal{X}} \{\langle y, x\rangle - h(x)\} \tag{3.6}$$

denotes the convex conjugate of $h$ (Rockafellar, 1970); the terminology “Fenchel coupling” is due to Mertikopoulos and Sandholm (2016) and reflects the fact that (3.5) collects all the terms of Fenchel’s inequality. By Fenchel’s inequality (Rockafellar, 1970), $F$ is non-negative in both arguments, so it provides a “congruity” measure between $x$ and $y$. With this in mind, our proof strategy will be to express the player’s regret with respect to a test strategy in terms of $F$ and then show that the latter grows sublinearly in $t$.
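For concreteness, the sketch below evaluates the Fenchel coupling (3.5) in one special case that we assume purely for illustration: the entropic penalty $h(x) = \sum_\alpha x_\alpha \log x_\alpha$ on the simplex, whose convex conjugate is the log-sum-exp function and whose choice map is the logit map. The function names are hypothetical.

```python
import math


def neg_entropy(x):
    """h(x) = sum_a x_a log x_a, with 0 log 0 = 0 (assumed penalty)."""
    return sum(xa * math.log(xa) for xa in x if xa > 0)


def log_sum_exp(y):
    """h*(y) = log sum_a exp(y_a): conjugate of h over the simplex."""
    m = max(y)
    return m + math.log(sum(math.exp(ya - m) for ya in y))


def fenchel_coupling(x, y):
    """F(x, y) = h(x) + h*(y) - <y, x>, as in (3.5)."""
    inner = sum(ya * xa for ya, xa in zip(y, x))
    return neg_entropy(x) + log_sum_exp(y) - inner


def logit(y):
    """Choice map associated with the entropic penalty."""
    m = max(y)
    w = [math.exp(ya - m) for ya in y]
    s = sum(w)
    return [wa / s for wa in w]


y = [0.3, -1.2, 2.0]
print(fenchel_coupling([0.2, 0.5, 0.3], y))  # positive: x differs from Q(y)
print(fenchel_coupling(logit(y), y))         # zero up to rounding: x = Q(y)
```

In this entropic case the coupling coincides with the Kullback–Leibler divergence of the logit image of $y$ from $x$, which is one way to see that it is non-negative and vanishes exactly when $x$ is the choice induced by $y$.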

The details of the proof are presented in Appendix A and rely on Itô’s lemma and the law of the iterated logarithm (which is used to control the impact of the noise on the agent’s learning process). For now, we focus on some aspects of Theorem 3.1 and the regret bound (3.4):

###### Remark 3.1 (The role of η).

It is important to note that the second term of (3.4) becomes linear in when the player’s learning parameter is constant, explaining in this way the requirement that as . On the other hand, this requirement can be dropped in the noiseless case: when , the guarantee (3.4) reduces to , so (SRL-U) remains consistent even for constant – a fact which was first observed by Sorin (2009) in the context of (EW) and Kwon and Mertikopoulos (2014) for (RL). In the presence of noise, we conjecture that (SRL-U) may lead to positive regret with positive probability for constant but we have not been able to prove this.

The form of the bound (3.4) also highlights a trade-off between more aggressive learning rates (slowly decaying $\eta$) and the noise affecting the player’s payoff observations. Specifically, the first (deterministic) term of (3.4) is decreasing in $\eta$ while the second one (which is due to the noise) is increasing in $\eta$ (the last term of (3.4) is also due to the noise, but it does not otherwise depend on $\eta$); as a result, in the absence of noise ($\sigma = 0$), it is better to use a large, constant $\eta$ rather than letting $\eta(t) \to 0$, but this might lead to disastrous results under uncertainty. Shalev-Shwartz (2011) draws a similar conclusion for the discrete-time analogue of (RL) known as online mirror descent (OMD); in this way, discretization and random payoff disturbances seem to have comparably adverse effects on the agent’s learning process.

The above considerations can be illustrated more explicitly by choosing for some ; in this case, Theorem 3.1 yields the regret bounds:

###### Corollary 3.2.

Assume that (SRL-U) is run with learning rate for some . Then:

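 $$\mathrm{Reg}(t) = \begin{cases} \mathcal{O}(t^{1-\gamma}) & \text{if } 0<\gamma<\tfrac{1}{2}, \\ \mathcal{O}(\sqrt{t\log\log t}) & \text{if } \gamma=\tfrac{1}{2}, \\ \mathcal{O}(t^{\gamma}) & \text{if } \tfrac{1}{2}<\gamma<1. \end{cases} \tag{3.7}$$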
###### Proof.

Simply note that is the dominant term in (3.4) for all . Otherwise, for , the first two terms of (3.4) are both and are dominated by the third. ∎
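The case distinction in (3.7) can be tabulated with a small helper. This is only a lookup of the rates stated in the corollary (the function name `regret_rate` is hypothetical); it makes visible that the exponent is symmetric around $\gamma = 1/2$, where the guarantee is best.

```python
def regret_rate(gamma):
    """Growth rate of Reg(t) stated in (3.7) for a decay exponent gamma."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if gamma == 0.5:
        return "O(sqrt(t log log t))"
    # Below 1/2 the exponent is 1 - gamma; above 1/2 it is gamma.
    exponent = 1.0 - gamma if gamma < 0.5 else gamma
    return f"O(t^{exponent:g})"


for g in (0.25, 0.5, 0.75):
    print(g, regret_rate(g))
```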

###### Remark 3.2 (Links with vanishingly smooth fictitious play).

We close this section by discussing the links of (SRL-U) with the vanishingly smooth fictitious play process that was recently examined by Benaïm and Faure (2013) in a discrete-time setting. Specifically, by interpreting as the payoff that a player obtains in a -player game against his opponent’s empirical frequency of play (cf. Section 6), the strategy

 $$x(t) = Q\left(\eta(t)\int_0^t v(s)\,ds\right) = Q\left(t\eta(t)\cdot t^{-1}\int_0^t v(s)\,ds\right) \tag{3.8}$$

can be viewed as a “vanishingly smooth” best response to the empirical distribution of play of one’s opponent: it is “smooth” because the player is employing a regularized best response map (as opposed to a hard correspondence), and it is “vanishingly smooth” because the factor hardens to as (for a more detailed discussion, see Benaïm and Faure, 2013). As such, (SRL-U) can be seen as a stochastically perturbed variant of vanishingly smooth fictitious play in continuous time, in which case Corollary 3.2 provides the continuous-time, stochastic analogue of Theorem 1.8 of Benaïm and Faure (2013).
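The hardening effect in (3.8) can be seen in a short numerical sketch. We assume, for illustration only, that $Q$ is the logit map and that $\eta(t) = t^{-\gamma}$, so that the factor $t\eta(t)$ grows like $t^{1-\gamma}$; the time-averaged payoff vector below is a hypothetical example.

```python
import math


def logit(y):
    """Logit map: an assumed instance of the regularized best response Q."""
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [wi / s for wi in w]


def smooth_best_response(avg_payoff, t, gamma=0.5):
    """Evaluate (3.8) as Q(t * eta(t) * avg_payoff) with eta(t) = t^-gamma."""
    scale = t * t ** (-gamma)  # the factor t * eta(t)
    return logit([scale * v for v in avg_payoff])


avg_payoff = [0.6, 0.5, 0.1]  # hypothetical time-averaged payoffs
for t in (1.0, 100.0, 10_000.0):
    x = smooth_best_response(avg_payoff, t)
    print(t, [round(p, 3) for p in x])
```

As $t$ grows, the output concentrates on the action with the highest average payoff, which is the sense in which the smoothing vanishes.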

## 4. Extinction of dominated strategies

A fundamental rationality requirement for any game-theoretic learning process is the elimination of suboptimal, dominated strategies. Formally, given a finite game , we say that is dominated by (and we write ) if

 $$\langle v_k(x), p_k\rangle < \langle v_k(x), p'_k\rangle \quad \text{for all } x\in\mathcal{X}. \tag{4.1}$$

Strategies that are iteratively dominated, undominated, or iteratively undominated are defined similarly; also, for pure strategies , we obviously have if and only if

 $$v_{k\alpha}(x) < v_{k\beta}(x) \quad \text{for all } x\in\mathcal{X}. \tag{4.2}$$

With this in mind, given a trajectory of play , , we will say that a pure strategy becomes extinct along if as . More generally, following Samuelson and Zhang (1992), we will say that the mixed strategy becomes extinct along if ; otherwise, is said to survive.
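For pure strategies in a two-player game, the domination test reduces to a finite check: since expected payoffs are linear in the opponent's mixed strategy, it suffices to compare payoffs against every pure action of the opponent. The sketch below assumes a bimatrix representation and uses a hypothetical helper name.

```python
def is_dominated(payoff, alpha, beta):
    """True if row action alpha is strictly dominated by row action beta.

    payoff[a][b] is the row player's payoff for playing a against column b.
    Expected payoffs are linear in the column player's mixed strategy, so
    the strict inequality only needs to be checked on pure columns.
    """
    return all(pa < pb for pa, pb in zip(payoff[alpha], payoff[beta]))


# Prisoner's dilemma, row player: action 0 = cooperate, 1 = defect.
pd_row = [[3, 0],
          [5, 1]]
print(is_dominated(pd_row, 0, 1))  # cooperate is dominated by defect
print(is_dominated(pd_row, 1, 0))  # defect is not dominated by cooperate
```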

In the case of perfect payoff observations, Mertikopoulos and Sandholm (2016) showed that dominated strategies become extinct under the general reinforcement learning dynamics (RL), thus extending earlier results for the replicator dynamics (for an overview, see Viossat, 2015). Under noise and uncertainty however, the situation is a bit more complex: in the replicator dynamics with aggregate shocks (ASRD), Cabrales (2000), Imhof (2005) and Hofbauer and Imhof (2009) provided a set of sufficient conditions on the intensity of the noise that guarantee the elimination of dominated strategies; on the other hand, Mertikopoulos and Moustakas (2010) showed that the noisy replicator dynamics (SRD) eliminate all strategies that are not iteratively undominated, with no conditions on the noise level.

As we show below, this unconditional elimination result is a consequence of the core reinforcement principle behind (RL) and it extends to the entire class of learning dynamics covered by (SRL):

###### Theorem 4.1.

Let be a solution orbit of (SRL). If is dominated (even iteratively), then it becomes extinct along almost surely.

The basic idea of our proof is to show that the Itô process has a dominant drift term that pushes it away from dominated strategies. However, given the complicated form of the (non-autonomous) dynamics (2.9), we do so by using the Fenchel coupling (3.5) to relate the evolution of the generating process to .

###### Proof of Theorem 4.1.

Suppose so that for some and for all . Then, if is a solution orbit of (SRL), we get:

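 $$\begin{aligned} d\langle Y_k, p'_k - p_k\rangle &= \langle v_k, p'_k - p_k\rangle\,dt + \sum_{\beta}^{k} (p'_{k\beta} - p_{k\beta})\,\sigma_{k\beta}\,dW_{k\beta} \\ &\geq m_k\,dt + \sum_{\beta}^{k} (p'_{k\beta} - p_{k\beta})\,\sigma_{k\beta}\,dW_{k\beta}, \end{aligned} \tag{4.3}$$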

or, equivalently:

 $$\langle Y_k(t), p'_k - p_k\rangle \geq c_k + m_k t + \xi_k(t), \tag{4.4}$$

where $c_k = \langle Y_k(0), p'_k - p_k\rangle$ and

 $$\xi_k(t) = \sum_{\beta}^{k} (p'_{k\beta} - p_{k\beta}) \int_0^t \sigma_{k\beta}(X(s))\,dW_{k\beta}(s). \tag{4.5}$$

Consider now the rate-adjusted “cross-coupling”

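 $$\begin{aligned} V_k(y_k) &= \eta_k^{-1}\left[F_k(p_k, \eta_k y_k) - F_k(p'_k, \eta_k y_k)\right] \\ &= \eta_k^{-1}\left[h_k(p_k) - h_k(p'_k)\right] - \langle y_k, p_k - p'_k\rangle, \end{aligned} \tag{4.6}$$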

with $F_k$ defined as in (3.5). Then, by substituting (4.4) in (4.6) and recalling that $F_k \geq 0$, we obtain:

 $$F_k(p_k, \eta_k Y_k) \geq h_k(p_k) - h_k(p'_k) + \eta_k\cdot[c_k + m_k t + \xi_k(t)]. \tag{4.7}$$

To proceed, Lemma B.4 shows that (a.s.), so the RHS of (4.7) tends to infinity as on account of the fact that (cf. Assumption 1). In turn, this gives , so becomes extinct along by virtue of Proposition B.1. Finally, our claim for iteratively dominated strategies follows by induction on the rounds of elimination of dominated strategies – cf. Cabrales (2000, Proposition 1A). ∎
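The mechanism behind the proof can be observed in a simple simulation. The sketch below runs an Euler–Maruyama discretization of the score dynamics in a prisoner's dilemma, under assumptions made only for illustration: logit choice maps, a constant learning rate, and a constant noise level that does not depend on the state. It is a numerical illustration, not a substitute for the argument above.

```python
import math
import random


def logit(y):
    """Logit choice map: an assumed instance of the players' choice maps."""
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [wi / s for wi in w]


def simulate_srl(payoff_row, payoff_col, sigma=1.0, eta=1.0,
                 horizon=100.0, dt=0.01, seed=1):
    """Noisy score dynamics in a bimatrix game; returns final strategies.

    payoff_row[a][b], payoff_col[a][b]: payoffs when row plays a, column b.
    """
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(dt)
    n_row, n_col = len(payoff_row), len(payoff_row[0])
    y_row, y_col = [0.0] * n_row, [0.0] * n_col
    for _ in range(round(horizon / dt)):
        x_row = logit([eta * y for y in y_row])
        x_col = logit([eta * y for y in y_col])
        # Payoff vectors: expected payoff of each pure action.
        v_row = [sum(payoff_row[a][b] * x_col[b] for b in range(n_col))
                 for a in range(n_row)]
        v_col = [sum(payoff_col[a][b] * x_row[a] for a in range(n_row))
                 for b in range(n_col)]
        # Score update: payoff drift plus independent observation noise.
        y_row = [y + v * dt + sigma * rng.gauss(0.0, sqrt_dt)
                 for y, v in zip(y_row, v_row)]
        y_col = [y + v * dt + sigma * rng.gauss(0.0, sqrt_dt)
                 for y, v in zip(y_col, v_col)]
    return logit([eta * y for y in y_row]), logit([eta * y for y in y_col])


# Prisoner's dilemma: action 0 = cooperate (dominated), 1 = defect.
pd_row = [[3.0, 0.0], [5.0, 1.0]]
pd_col = [[3.0, 5.0], [0.0, 1.0]]
x_row, x_col = simulate_srl(pd_row, pd_col)
print(f"weight on cooperate: row {x_row[0]:.2e}, column {x_col[0]:.2e}")
```

In this run, the payoff gap of at least 1 in favor of defection accumulates linearly in the scores, while the noise only contributes fluctuations of order $\sqrt{t}$, so the weight on the dominated action decays.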

Theorem 4.1 shows that dominated strategies become extinct under (SRL) but it does not give any information on the rate of extinction – or how probable it is to observe a dominated strategy above a given level at some . To address this issue, we provide two results below: Proposition 4.2 describes the long-term decay rate of dominated strategies and provides a “large deviations” bound for the probability of observing a dominated strategy above a given level at time . Subsequently, in Proposition 4.3 we estimate the expected time it takes for a dominated strategy to drop below a given level. For simplicity, we present our results in the case where the players’ choice maps are derived from separable penalty functions as in (2.8); again, proofs are relegated to Appendix A:

###### Proposition 4.2.

Let be dominated by and assume that the choice map of player is generated by a separable penalty function of the form (2.8) with . Then, for all and for all large enough , we have:

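 $$X_{k\alpha}(t) \leq \phi_k\!\left[C_k - \eta_k(t)\left(m_k t - 2(1+\varepsilon)\,\sigma_{\alpha\beta}\sqrt{t\log\log t}\right)\right] \quad \text{(a.s.)} \tag{4.8}$$

and

 $$\mathbb{P}(X_{k\alpha}(t) > \delta) \leq \frac{1}{2}\operatorname{erfc}\left[\frac{1}{2\sigma_{\alpha\beta}}\left(m_k\sqrt{t} - \frac{C_k - \theta'_k(\delta)}{\eta_k(t)\sqrt{t}}\right)\right] \tag{4.9}$$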

where:

1. $\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}}\int_x^\infty e^{-z^2}\,dz$ is the complementary error function.

2. (note that by assumption).

3. $m_k$ is the minimum payoff difference between $\alpha$ and $\beta$.

4. .

5. $C_k$ is a constant that depends only on the initial conditions of (SRL).

###### Proposition 4.3.

With notation as in Proposition 4.2, assume that (SRL) is run with constant learning rates and noisy observations with constant variance. If , then:

 $$\mathbb{E}[\tau_\delta] \leq \frac{[C_k - \theta'_k(\delta)]_+}{\eta_k m_k}. \tag{4.10}$$
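The right-hand sides of (4.9) and (4.10) are straightforward to evaluate numerically. The sketch below does so under assumptions that are ours and not the paper's: the penalty kernel is taken to be entropic, $\theta(x) = x\log x$, so that $\theta'(\delta) = 1 + \log\delta$; the learning rate is constant; and the values of $m_k$, $\sigma_{\alpha\beta}$ and $C_k$ are placeholders.

```python
import math


def theta_prime(delta):
    """theta'(delta) for theta(x) = x log x (assumed entropic kernel)."""
    return 1.0 + math.log(delta)


def survival_bound(t, delta, m, sigma, eta=1.0, c=0.0):
    """Right-hand side of (4.9) with constant learning rate eta."""
    gap = c - theta_prime(delta)
    arg = (m * math.sqrt(t) - gap / (eta * math.sqrt(t))) / (2.0 * sigma)
    return 0.5 * math.erfc(arg)


def extinction_time_bound(delta, m, eta=1.0, c=0.0):
    """Right-hand side of (4.10): positive part of the gap over eta * m."""
    return max(c - theta_prime(delta), 0.0) / (eta * m)


# Placeholder values: payoff gap m = 1, noise level sigma = 1, level 1%.
for t in (10.0, 100.0):
    print(t, survival_bound(t, delta=0.01, m=1.0, sigma=1.0))
print(extinction_time_bound(delta=0.01, m=1.0))
```

Doubling the constant learning rate halves the bound (4.10), which is consistent with the role of the learning rate discussed in the remarks that follow.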

Propositions 4.2 and 4.3 are our main results concerning the rate of elimination of dominated strategies under uncertainty, so a few remarks are in order:

###### Remark 4.1 (Asymptotics versus mean behavior).

Propositions 4.2 and 4.3 capture complementary aspects of the statistics of the extinction rate of dominated strategies under (SRL). For instance, (4.8) describes the asymptotic rate of elimination of dominated strategies but it does not provide an estimate of how much time must pass until this rate becomes relevant. On the other hand, the bound (4.10) estimates the mean time it takes for a dominated strategy to fall below a given level; however, in contrast to (4.9), it does not describe how probable it is to observe deviations from this mean.

###### Remark 4.2 (Comparison with the noiseless case).

When , the decay estimate (4.8) boils down to , a bound which recovers the results of Mertikopoulos and Sandholm (2016) for the deterministic dynamics (RL). Thus, even though the noise may initially mask the fact that a strategy is dominated, Proposition 4.2 shows that the long-run behavior of (SRL) and (RL) is the same as far as dominated strategies are concerned. This is in contrast with the aggregate-shocks dynamics (ASRD) where, even if a dominated strategy becomes extinct, its rate of elimination is different to leading order than in the noiseless case (Imhof, 2005, Theorem 3.1).

###### Remark 4.3 (The role of η).

The estimate (4.8) shows that running (SRL) with large, constant leads to a much faster rate of extinction of dominated strategies. In particular, recalling that as , the bound (4.9) becomes:

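 $$\mathbb{P}(X_{k\alpha}(t) > \delta) = \mathcal{O}\left(t^{-1/2}\exp\left(-\frac{m_k^2\,t}{2\sigma_{\alpha\beta}^2}\right)\right), \tag{4.11}$$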

up to a subleading term of the order of in the exponent. Thus, even though the leading behavior of (4.9) is not affected by the player’s choice of learning parameter, the subleading term is minimized for large, constant so the asymptotic bound (4.11) becomes tighter in that case.

We thus observe an important contrast between regret minimization and the elimination of dominated strategies: whereas the optimal regret guarantee of Theorem 3.1 is achieved for , the asymptotic extinction rate of dominated strategies is much faster for constant . The reason for this disparity is that higher values of reinforce consistent payoff differences and therefore eliminate dominated strategies faster (independently of the noise level). On the other hand, to attain a no-regret state, players must be careful not to make too many mistakes in the presence of noise, so a more conservative choice of is warranted.

## 5. Long-term stability and convergence analysis

We now turn to the long-term stability and convergence properties of the dynamics (SRL) with respect to equilibrium play. To that end, recall first that is a Nash equilibrium of if it is unilaterally stable for all players, i.e.

 $$u_k(x^*) \geq u_k(x_k, x^*_{-k}) \quad \text{for all } x_k\in\mathcal{X}_k,\ k\in\mathcal{N}, \tag{5.1}$$

or, equivalently:

 (5.2)

Strict equilibria are defined by requiring that (5.1) hold as a strict inequality for all ; obviously, such equilibria are also pure in the sense that they correspond to pure strategy profiles in (i.e. vertices of ).
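At a pure strategy profile of a two-player game, condition (5.1) and its strict version can be checked by comparing against pure deviations only, because a mixed deviation earns a convex combination of the pure deviation payoffs. The helper below is a hypothetical sketch for bimatrix games.

```python
def is_pure_nash(payoff_row, payoff_col, a, b, strict=False):
    """Check (5.1) at the pure profile (a, b) of a bimatrix game.

    payoff_row[i][j], payoff_col[i][j]: payoffs when row plays i, column j.
    With strict=True, every unilateral pure deviation must lose strictly.
    """
    def ok(current, deviation):
        return current > deviation if strict else current >= deviation

    row_ok = all(ok(payoff_row[a][b], payoff_row[i][b])
                 for i in range(len(payoff_row)) if i != a)
    col_ok = all(ok(payoff_col[a][b], payoff_col[a][j])
                 for j in range(len(payoff_row[0])) if j != b)
    return row_ok and col_ok


# Prisoner's dilemma: action 0 = cooperate, 1 = defect.
pd_row = [[3, 0], [5, 1]]
pd_col = [[3, 5], [0, 1]]
print(is_pure_nash(pd_row, pd_col, 1, 1, strict=True))  # mutual defection
print(is_pure_nash(pd_row, pd_col, 0, 0))               # mutual cooperation
```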

In the noiseless case () with constant learning rates (), Mertikopoulos and Sandholm (2016) recently showed that the deterministic dynamics (RL) exhibit the following properties with respect to Nash equilibria of :

1. If a solution orbit of (RL) converges to , then is a Nash equilibrium.

2. If is (Lyapunov) stable, then it is also a Nash equilibrium.

3. Strict Nash equilibria are asymptotically stable in (RL).

In turn, these properties are generalizations of the long-term stability and convergence properties of the (multi-population) replicator dynamics – sometimes referred to as the “folk theorem” of evolutionary game theory (Hofbauer and Sigmund, 1998, 2003). That being said, the situation is quite different in the presence of noise: for instance, interior Nash equilibria are not even traps (almost sure rest points) of the stochastic reinforcement learning dynamics (SRL), so the ordinary (deterministic) definitions of stability and convergence no longer apply. Instead, in the context of stochastic differential equations, Lyapunov and asymptotic stability are defined as follows (Khasminskii, 2012):

###### Definition 5.1.

Let . We will say that:

1. is stochastically (Lyapunov) stable under (SRL) if, for every and for every neighborhood of in , there exists a neighborhood of such that

 $$\mathbb{P}\left(X(t)\in U_0 \text{ for all } t\geq 0\right) \geq 1-\varepsilon, \tag{5.3}$$

whenever .

2. is stochastically asymptotically stable under (SRL) if it is stochastically stable and attracting: for every and for every neighborhood of in , there exists a neighborhood of such that

 $$\mathbb{P}\left(X(t)\in U_0 \text{ for all } t\geq 0 \text{ and } \lim_{t\to\infty} X(t) = x^*\right) \geq 1-\varepsilon, \tag{5.4}$$

whenever .

In the evolutionary setting of the stochastic replicator dynamics with aggregate shocks, Imhof (2005) and Hofbauer and Imhof (2009) showed that strict Nash equilibria are stochastically asymptotically stable under (ASRD) provided that the variability of the shocks across different strategies is small enough. More recently, in a learning context, Mertikopoulos and Moustakas (2010) showed that the same holds for the stochastic replicator dynamics (SRD) of exponential learning (with constant ), irrespective of the variance of the observation noise. However, this last result relies heavily on the specific properties of the logit map (2.6) and the infinitesimal generator of (SRD).

In our case, the convoluted (and non-autonomous) form of the stochastic dynamics (2.9) complicates things considerably, so such an approach is not possible – especially with regard to finding a local stochastic Lyapunov function for the dynamical system (2.9) that governs the evolution of . Nonetheless, by working directly with (SRL), we obtain the following general result:

###### Theorem 5.2.

Let be a solution orbit of (SRL) and let . Then:

1. If , is a Nash equilibrium of .

2. If is stochastically (Lyapunov) stable, it is also Nash.

3. If is a strict Nash equilibrium of , it is stochastically asymptotically stable under (SRL).
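The third statement can be probed with a Monte Carlo sketch in a $2\times 2$ coordination game, where both players earn 1 if their actions match and 0 otherwise. The setup is an assumption made for illustration: logit choice maps, constant learning rate equal to 1, constant noise level, a finite horizon as a proxy for the limit in (5.4), and a neighborhood defined by each player putting weight above 1/2 on the first action. The estimate is an empirical frequency over finitely many runs, not a proof of stability.

```python
import math
import random


def logit2(d):
    """Weight on the first of two actions when its score lead is d."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def stays_and_converges(d0, sigma, rng, radius=0.5, horizon=30.0, dt=0.01):
    """One noisy run started with score lead d0 for both players.

    Returns True if both players keep weight above 1 - radius on the first
    action throughout, and end with weight above 0.99 on it.
    """
    d = [d0, d0]
    sqrt_dt = math.sqrt(dt)
    for _ in range(round(horizon / dt)):
        x = [logit2(d[0]), logit2(d[1])]
        if min(x) <= 1.0 - radius:
            return False  # the run left the neighborhood
        for k in (0, 1):
            # Payoff gap between the two actions against the opponent.
            drift = 2.0 * x[1 - k] - 1.0
            # Difference of two independent noise increments.
            noise = sigma * (rng.gauss(0.0, sqrt_dt) - rng.gauss(0.0, sqrt_dt))
            d[k] += drift * dt + noise
    return min(logit2(d[0]), logit2(d[1])) > 0.99


rng = random.Random(0)
runs = 50
hits = sum(stays_and_converges(3.0, 0.5, rng) for _ in range(runs))
print(f"fraction of runs that stay and converge: {hits / runs:.2f}")
```

Starting closer to the equilibrium (larger `d0`) should raise the estimated frequency, mirroring the role of the neighborhood in Definition 5.1.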

The first ingredient of our proof (presented in detail in Appendix A) is to show that if the process remains in the vicinity of for all with positive probability, then must be a Nash equilibrium. This is formalized in the following proposition (which is of independent interest):

###### Proposition 5.3.

With notation as in Theorem 5.2, assume that every neighborhood of in admits with positive probability a solution orbit of (SRL) such that for all . Then, is a Nash equilibrium.

###### Proof.

If is not Nash, we must have for some player and for some , . On that account, let be a sufficiently small neighborhood of in such that for some and for all . Then, conditioning on the positive probability event that there exists an orbit of (SRL) that is contained in for all , we have:

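 $$\begin{aligned} dY_{k\alpha} - dY_{k\beta} &= \left(v_{k\alpha}(X) - v_{k\beta}(X)\right)dt + \sigma_{k\alpha}\,dW_{k\alpha} - \sigma_{k\beta}\,dW_{k\beta} \\ &\leq -m_k\,dt - d\xi_k, \end{aligned}$$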