On the robustness of learning in games with stochastically perturbed payoff observations
Abstract.
Motivated by the scarcity of accurate payoff feedback in practical applications of game theory, we examine a class of learning dynamics where players adjust their choices based on past payoff observations that are subject to noise and random disturbances. First, in the single-player case (corresponding to an agent trying to adapt to an arbitrarily changing environment), we show that the stochastic dynamics under study lead to no regret almost surely, irrespective of the noise level in the player’s observations. In the multi-player case, we find that dominated strategies become extinct and we show that strict Nash equilibria are stochastically stable and attracting; conversely, if a state is stable or attracting with positive probability, then it is a Nash equilibrium. Finally, we provide an averaging principle for 2-player games, and we show that in zero-sum games with an interior equilibrium, time averages converge to Nash equilibrium for any noise level.
Key words and phrases:
Dominated strategies; learning; Nash equilibrium; regret minimization; regularization; robustness; stochastic game dynamics; stochastic stability.
2010 Mathematics Subject Classification:
Primary 60H10, 37N40, 91A26; secondary 60H30, 60J70, 91A22.
1. Introduction
A central question in game-theoretic learning is whether the outcome of a learning process (viewed here as a plausible model for the behavior of optimizing agents) is also justifiable from the point of view of rationality – e.g. whether the process leads to a Nash equilibrium or a state where no dominated strategies are present. In that regard, one of the most widely studied learning procedures is the exponential weight (EW) algorithm that was originally introduced by Vovk (1990) and Littlestone and Warmuth (1994) in the context of multi-armed bandit problems. In a game-theoretic setting, the algorithm simply prescribes that players score their actions based on their cumulative payoffs and then assign choice probabilities proportionally to the exponential of each action’s score. As such, the EW algorithm in continuous time (Sorin, 2009) is described by the dynamics
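(EW)  $\dot y_{k\alpha} = v_{k\alpha}, \qquad x_{k\alpha} = \dfrac{\exp(y_{k\alpha})}{\sum_{\beta} \exp(y_{k\beta})},$
where, somewhat informally, $v_{k\alpha}$ denotes the payoff to the $\alpha$-th action of player $k$, $y_{k\alpha}$ is the action’s performance score (cumulative payoff) over time and $x_{k\alpha}$ is the corresponding mixed strategy weight.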
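As a minimal numerical sketch of the score-then-exponentiate rule described above (an illustration, not the authors' implementation), the following Python snippet integrates the score dynamics with a forward Euler step. The function names, the step size `dt`, the horizon and the constant payoff vector are all assumptions made for the example.

```python
import numpy as np

def logit_choice(y):
    """Choice probabilities proportional to the exponential of each score."""
    z = np.exp(y - np.max(y))  # shifting by the max avoids overflow
    return z / z.sum()

def exponential_weights(payoff_fn, n_actions, horizon=10.0, dt=0.01):
    """Forward Euler discretization: scores accumulate payoffs over time."""
    y = np.zeros(n_actions)                 # hypothetical initial scores
    for _ in range(int(round(horizon / dt))):
        x = logit_choice(y)                 # current mixed strategy
        y = y + dt * payoff_fn(x)           # cumulative payoff update
    return logit_choice(y)

# Illustrative environment: payoffs do not depend on the strategy.
x_final = exponential_weights(lambda x: np.array([1.0, 0.5, 0.0]), 3)
```

In this toy run the action with the highest constant payoff ends up with almost all of the probability mass, which is the reinforcement effect the text describes.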
Given its long history and its link with the replicator dynamics of evolutionary game theory (discussed below), the rationality properties of (EW) are also relatively well understood. To name the most important ones: a) dominated strategies become extinct in the long run; b) limits of interior trajectories and stable rest points are Nash equilibria; c) strict equilibria are stable and attracting; and d) empirical distributions of play converge to equilibrium in 2-player zero-sum games with an interior equilibrium (Hofbauer and Sigmund, 1998; Sandholm, 2010). More recently, Sorin (2009) also showed that (EW) is universally consistent, i.e. players have no regret for following (EW) instead of any other fixed strategy in a dynamically changing environment.
On the other hand, a crucial limitation in the above considerations is that players are assumed to have perfect observations of their actions’ rewards. In practical applications of game theory (e.g. in economics, finance and traffic networks), “perfect feedback” requirements are often too stringent, especially in games with massively many actions and/or players. Accordingly, an important question that arises is whether (EW) retains its rationality properties in the presence of noise and stochastically perturbed payoff observations. Somewhat surprisingly, this is indeed the case (at least for some of them): even if the payoffs in (EW) are perturbed by a Brownian noise term of arbitrarily high variance, dominated strategies become extinct and strict equilibria remain stable and attracting with high probability (Mertikopoulos and Moustakas, 2009, 2010). Thus, even when the players’ true payoffs are masked by noise and uncertainty, the reinforcement learning principle behind (EW) allows players to weed out the noise and leads to similar outcomes as in the noiseless, deterministic regime.
Motivated by the robustness of exponential learning in noisy environments, we examine here a broad class of game-theoretic learning procedures where players adjust their strategies by playing an approximate best response to the vector of their actions’ cumulative payoffs – possibly subject to random disturbances and noise. In the case of perfect payoff observations, this scheme boils down to the reinforcement learning dynamics considered by Mertikopoulos and Sandholm (2016) who showed that the properties discussed above still hold in the absence of noise. With this in mind, our main contribution is to show that feedback noise does not really matter: even if the players’ payoff observations are subject to arbitrarily high (and possibly state-dependent or correlated) noise, dominated strategies become extinct (roughly at the same rate as in deterministic environments); strict Nash equilibria remain (stochastically) stable and attracting; and, in the converse direction, if a state is (stochastically) stable or attracting with positive probability, then it is also a Nash equilibrium. Finally, if players use a decreasing learning parameter to adjust the weight of their scoring process over time, the stochastic learning dynamics under study lead to no regret (a.s.) and their time average converges to equilibrium in 2-player zero-sum games with an interior equilibrium.
Our analysis also highlights an important difference between “static” solution concepts (such as Nash equilibria and strategic dominance) and more “dynamic” notions (such as regret minimization). Whereas the speed of convergence to static target states is accelerated by the use of a large, constant learning parameter (noise and disturbances notwithstanding), the rate of regret minimization is optimized by using a learning parameter that decays proportionally to (and which guarantees an bound for the players’ cumulative regret under uncertainty). This disparity only appears in the noisy regime and is due to the fact that players need to be more conservative when facing a fluid environment that varies with time in an (a priori) unpredictable fashion. Otherwise, if players have access to noiseless payoff observations, they can be significantly more greedy and achieve lower regret faster by using a constant learning parameter.
1.1. Related work
The long-term rationality properties of exponential learning in a game-theoretic setting were first studied in conjunction with those of the replicator dynamics, one of the most widely studied dynamical systems for population evolution under natural selection. Indeed, a simple differentiation of (EW) reveals that the evolution of the players’ mixed strategies under (EW) follows the differential equation:
(RD)  $\dot x_{k\alpha} = x_{k\alpha} \Big[ v_{k\alpha}(x) - \sum_{\beta} x_{k\beta} v_{k\beta}(x) \Big],$
which is simply the (multi-population) replicator dynamics of Taylor and Jonker (1978). In this context, Akin (1980), Nachbar (1990) and Samuelson and Zhang (1992) showed that dominated strategies become extinct, while it is well known that a) stable states of (RD) are Nash equilibria; b) strict Nash equilibria are asymptotically stable; and c) time averages of replicator orbits converge to equilibrium in 2-player games provided that no strategy share becomes arbitrarily small (Hofbauer and Sigmund, 1998, 2003).
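The differentiation step that links exponential weights to the replicator equation can be checked numerically. The sketch below, under the assumption that scores move in the direction of a fixed payoff vector `v`, compares a central finite difference of the logit map with the replicator vector field; the particular values of `y`, `v` and `eps` are arbitrary illustrative choices.

```python
import numpy as np

def logit_choice(y):
    """Exponential-weights map from scores to mixed strategies."""
    z = np.exp(y - np.max(y))
    return z / z.sum()

def replicator_field(x, v):
    """Replicator right-hand side: x_a * (v_a - sum_b x_b * v_b)."""
    return x * (v - np.dot(x, v))

y = np.array([0.3, -1.2, 0.8])    # arbitrary score vector
v = np.array([1.0, 2.0, -0.5])    # payoff vector, held fixed for the check
eps = 1e-6

# Time derivative of x = logit_choice(y) when dy/dt = v, by central differences.
numeric = (logit_choice(y + eps * v) - logit_choice(y - eps * v)) / (2 * eps)
exact = replicator_field(logit_choice(y), v)
max_gap = float(np.max(np.abs(numeric - exact)))
```

The two vectors agree up to discretization error, which is what the chain rule applied to the exponential map predicts.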
This two-way relationship between exponential learning and the replicator dynamics was noted early on by Rustichini (1999) in his study of reinforcement learning models for games – i.e. learning how to react to a given situation so as to maximize a numerical reward (Sutton and Barto, 1998). From a game-theoretic point of view, the learning models of Börgers and Sarin (1997) and Erev and Roth (1998) are also closely related to the replicator dynamics – and, hence, to (EW) – while Fudenberg and Levine (1998), Hofbauer and Sandholm (2002), and Hopkins (2002) studied a smooth variant of the well-known fictitious play algorithm where the players play a perturbed best response to their opponents’ empirical frequency of play. (For a closely related model, see also Cominetti et al., 2010.) Up to mild technical differences, the correlated version of these models can be seen as a noiseless version of our reinforcement learning model, viz. the dynamics of Mertikopoulos and Sandholm (2016) with the parameter choice ; we explore this relation in more detail in Sections 3 and 6.
Of course, a crucial aspect of these considerations is whether players have accurate observations of their actions’ rewards or only a noisy estimate thereof: if the former is not the case, noise and fluctuations could potentially lead to suboptimal outcomes with high probability. In biology and evolutionary game theory (where payoffs measure the reproductive fitness of a biological species or the average payoff of populations of nonatomic players respectively), Fudenberg and Harris (1992) accounted for such fluctuations by introducing a stochastic variant of the replicator dynamics where evolution is perturbed by “aggregate shocks” that reflect the impact of weather-like effects and other perturbations. (Khasminskii and Potsepun (2006) also consider a Stratonovich-based model, while Vlasic (2012) examines the case of random jumps incurred by catastrophic, earthquake-like events.) In this framework, Cabrales (2000), Imhof (2005) and Hofbauer and Imhof (2009) showed that dominated strategies are still eliminated if the variability of the shocks across different genotypes (strategies) is not too high, while Imhof (2005) and Hofbauer and Imhof (2009) showed that strict Nash equilibria of a modified game are stochastically asymptotically stable under the replicator dynamics with aggregate shocks. (Whether the equilibria of the original game are themselves asymptotically stable depends on the intensity of the noise on different strategies and also on the exact way that the noise enters the process; see Mertikopoulos and Viossat, 2016.)
On the other hand, Mertikopoulos and Moustakas (2009, 2010) showed that the dynamics obtained by Fudenberg and Harris (1992) do not coincide with the stochastic replicator dynamics induced by (EW) in the presence of random disturbances and measurement noise – in contrast to the noiseless case where (EW) and (RD) do coincide. As we mentioned above, this learning variant of the stochastic replicator dynamics actually retains the rationality properties of the deterministic system (EW)/(RD) without any caveats on the noise: dominated strategies become extinct and strict Nash equilibria remain stochastically asymptotically stable without any conditions on the perturbations’ magnitude. This shows that the origin of the noise is crucial in the determination of the dynamics’ long-term properties: whereas the “aggregate shocks” of evolutionary environments lead to rational behavior in a modified game, no such modifications are required when learning under uncertainty.
1.2. Paper outline
In Section 2, we present our model for learning in the presence of noise and we derive the system of coupled stochastic differential equations that governs the evolution of the players’ mixed strategies. In Section 3, we show that if the players use a smoothly decreasing learning parameter, then the learning dynamics under study lead to no regret (a.s.), whatever the noise level. In Section 4, we investigate dominated strategies: we show that dominated strategies become extinct (a.s.) and we derive an explicit bound for their extinction rate. Section 5 focuses on the dynamics’ long-term stability and convergence properties: we show that a) stochastically (Lyapunov) stable states and states that attract trajectories of play with positive probability are Nash equilibria; and b) strict Nash equilibria are stochastically asymptotically stable, irrespective of the fluctuations’ magnitude. In Section 6, we provide an averaging principle for 2-player games in the spirit of Hofbauer et al. (2009); thanks to this principle, we then show that empirical distributions of play converge to Nash equilibrium in zero-sum games (again, no matter the noise level). Finally, in Section 7, we discuss the relaxation of some of the assumptions on the noise process – and, in particular, the independence of the observation noise across players and strategies.
2. The model
After a few preliminaries to set notation and terminology, this section focuses on the basic properties of a broad class of reinforcement learning dynamics under noise and uncertainty. In the noiseless, deterministic regime (Section 2.2), our model essentially boils down to the class of dynamics recently studied by Mertikopoulos and Sandholm (2016). The full stochastic framework (Section 2.3) is then obtained by positing that the agents’ observations are perturbed at each moment in time by a zero-mean stochastic process (an Itô diffusion).
2.1. Preliminaries
Notation
If is a finite set, the real vector space generated by will be denoted by and we will write for its canonical basis; for concision, we will also use to refer interchangeably to or , writing e.g. instead of . The set of probability measures on will be identified with the dimensional simplex and the relative interior of will be written .
If is a finite family of sets, we will use the shorthand for the tuple and we will write instead of . Finally, any statement of the form for a stochastic process should be interpreted in the a.s. sense, i.e. “for every , there exists a.s. some (random) such that for all ” – and likewise for the statement “”.
Definitions from game theory
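A finite game in normal form is a tuple $\mathcal{G} \equiv \mathcal{G}(\mathcal{N}, \mathcal{A}, u)$ consisting of a) a finite set of players $\mathcal{N} = \{1,\dots,N\}$; b) a finite set of actions (or pure strategies) $\mathcal{A}_k$ per player $k\in\mathcal{N}$; and c) the players’ payoff functions $u_k\colon\mathcal{A}\to\mathbb{R}$, where $\mathcal{A} \equiv \prod_{k}\mathcal{A}_k$ denotes the set of all joint action profiles $(\alpha_1,\dots,\alpha_N)$. The set of mixed strategies of player $k$ is denoted by $\mathcal{X}_k \equiv \Delta(\mathcal{A}_k)$ and the space $\mathcal{X} \equiv \prod_{k}\mathcal{X}_k$ of mixed strategy profiles $x = (x_1,\dots,x_N)$ will be called the game’s strategy space. In this mixed context, the expected payoff of player $k$ in the strategy profile $x\in\mathcal{X}$ is
(2.1)  $u_k(x) = \sum_{\alpha_1\in\mathcal{A}_1}\cdots\sum_{\alpha_N\in\mathcal{A}_N} u_k(\alpha_1,\dots,\alpha_N)\, x_{1,\alpha_1}\cdots x_{N,\alpha_N},$
where, in a slight abuse of notation, $u_k(\alpha_1,\dots,\alpha_N)$ denotes the payoff of player $k$ in the profile $(\alpha_1,\dots,\alpha_N)\in\mathcal{A}$. Accordingly, the payoff corresponding to $\alpha\in\mathcal{A}_k$ in the mixed strategy profile $x\in\mathcal{X}$ is
(2.2)  $v_{k\alpha}(x) = \sum_{\alpha_1\in\mathcal{A}_1}\cdots\sum_{\alpha_N\in\mathcal{A}_N} u_k(\alpha_1,\dots,\alpha_N)\, x_{1,\alpha_1}\cdots\delta_{\alpha_k,\alpha}\cdots x_{N,\alpha_N},$
leading to the more concise expression
(2.3)  $u_k(x) = \sum_{\alpha\in\mathcal{A}_k} x_{k\alpha}\, v_{k\alpha}(x) = \langle v_k(x) \mid x_k \rangle,$
where $v_k(x) = (v_{k\alpha}(x))_{\alpha\in\mathcal{A}_k}$ denotes the payoff vector of player $k$ at $x\in\mathcal{X}$.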
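To make the relation between expected payoffs and payoff vectors concrete, here is a small Python sketch for a two-player game. The payoff matrices `U1` and `U2` (a Prisoner's Dilemma), the mixed strategies and the helper names are illustrative assumptions, not data from the paper.

```python
import itertools
import numpy as np

# Hypothetical 2x2 game; entry [a1, a2] is the payoff at the pure profile (a1, a2).
U1 = np.array([[3.0, 0.0], [5.0, 1.0]])   # payoffs of player 1
U2 = np.array([[3.0, 5.0], [0.0, 1.0]])   # payoffs of player 2

def expected_payoff(U, x1, x2):
    """Multilinear extension: sum over all joint action profiles."""
    total = 0.0
    for a1, a2 in itertools.product(range(len(x1)), range(len(x2))):
        total += U[a1, a2] * x1[a1] * x2[a2]
    return total

def payoff_vectors(U1, U2, x1, x2):
    """Payoff of each pure action against the opponent's mixed strategy."""
    return U1 @ x2, U2.T @ x1

x1 = np.array([0.5, 0.5])
x2 = np.array([0.25, 0.75])
v1, v2 = payoff_vectors(U1, U2, x1, x2)
u1 = expected_payoff(U1, x1, x2)
u2 = expected_payoff(U2, x1, x2)
```

In both cases the expected payoff coincides with the inner product of the payoff vector and the player's own mixed strategy, which is the concise expression given above.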
2.2. The deterministic model
Our learning model will be based on the following simple idea: the game’s players “score” their actions by keeping track of their cumulative payoffs and then, at each moment $t\geq0$, they play an “approximate” best response to this vector of performance scores (thus reinforcing the probability of playing an action with a higher overall payoff). More precisely, following Mertikopoulos and Sandholm (2016), this process can be described by the dynamics
(RL)  $y_k(t) = \int_0^t v_k(x(s))\,ds, \qquad x_k(t) = Q_k(\eta_k(t)\, y_k(t)),$
where:
a) the score vector $y_k(t)$ of player $k$ ranks each strategy based on its cumulative payoff up to time $t$.
b) $Q_k\colon\mathbb{R}^{\mathcal{A}_k}\to\mathcal{X}_k$ is a choice map which reinforces the strategies with the highest scores (see below for a rigorous definition).
c) $\eta_k(t) > 0$ is a learning parameter which can be tuned freely by each player.
Clearly, a natural choice for the “scores-to-strategies” map $Q_k$ would be to take the correspondence $y_k \mapsto \arg\max_{x_k\in\mathcal{X}_k} \langle y_k \mid x_k\rangle$, i.e. to greedily assign all weight to the strategy (or strategies) with the highest score. However, since the $\arg\max$ operator is multi-valued (so each player would also need to employ some tie-breaking rule to resolve ambiguities), we will focus on single-valued choice maps of the general form
(2.4)  $Q_k(y_k) = \arg\max_{x_k\in\mathcal{X}_k} \{ \langle y_k \mid x_k\rangle - h_k(x_k) \},$
where the penalty function $h_k\colon\mathcal{X}_k\to\mathbb{R}$ satisfies the following properties:
a) $h_k$ is continuous on $\mathcal{X}_k$.
b) $h_k$ is smooth on the relative interior of every face of $\mathcal{X}_k$.
c) $h_k$ is strongly convex on $\mathcal{X}_k$, i.e. there exists some $K>0$ such that
(2.5)  $h_k(t x_k + (1-t) x_k') \leq t\, h_k(x_k) + (1-t)\, h_k(x_k') - \tfrac{1}{2} K\, t(1-t)\, \|x_k' - x_k\|^2$  for all $x_k, x_k'\in\mathcal{X}_k$ and for all $t\in[0,1]$.
This “softening” of the $\arg\max$ operator has a long history in game theory and optimization, and the induced map $Q_k$ is intimately related to the notion of softmax or perturbed/regularized best response maps; for an in-depth discussion, we refer the reader to Hofbauer and Sandholm (2002) and Mertikopoulos and Sandholm (2016). For our immediate purposes, the key observation is that (2.4) is a strictly concave problem, so it admits a unique solution for every input vector $y_k$. Therefore, when the penalty term $h_k$ is small relative to $y_k$, $Q_k$ can be seen as a single-valued approximation of the standard best response correspondence.
Regarding the learning parameter $\eta_k$, its role in (RL) is to temper the growth of the cumulative payoff vector so as to allow the player to better explore his strategies (instead of prematurely reinforcing one or another). In other words, $\eta_k$ can be interpreted as an extrinsic weight that the player assigns to cumulative observations of past payoffs. (The role of the learning parameter can also be linked to the vanishing step-size rules that are used in the theory of stochastic approximation – see e.g. Benaïm (1999), Lamberton et al. (2004), Oyarzun and Ruf (2014) and references therein. The difference between the two is that, in the theory of stochastic approximation, a variable step-size means that new information enters the algorithm with decreasing weight; on the other hand, in our context, all information is weighted evenly, but the aggregate information is weighted by $\eta_k$ to avoid extreme behaviors.) Accordingly, given that $y_k(t)$ grows (at most) as $\mathcal{O}(t)$, we will assume throughout that:
Assumption 1.
$\eta_k(t)$ is smooth, nonincreasing and $\lim_{t\to\infty} t\,\eta_k(t) = \infty$.
The decay rate of $\eta_k$ will play a crucial role when the players’ payoff observations are subject to stochastic perturbations, an issue that we will explore in detail in later sections. For now, we turn to two representative examples of (RL):
Example 2.1.
The prototype penalty function on the simplex is the Gibbs (negative) entropy $h(x) = \sum_{\alpha} x_{\alpha}\log x_{\alpha}$ which, after a standard calculation, yields the so-called logit map:
(2.6)  $Q_{\alpha}(y) = \dfrac{\exp(y_{\alpha})}{\sum_{\beta}\exp(y_{\beta})}.$
An easy differentiation then yields
(RD)  $\dot x_{k\alpha} = x_{k\alpha} \Big[ v_{k\alpha}(x) - \sum_{\beta} x_{k\beta} v_{k\beta}(x) \Big],$
which is simply the (multi-population) replicator equation of Taylor and Jonker (1978) for population evolution under natural selection. For a more thorough treatment of the links between logit choice and the replicator dynamics, see Hofbauer et al. (2009), Mertikopoulos and Moustakas (2010), and references therein.
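The "standard calculation" behind the logit map can be sanity-checked numerically. The sketch below assumes the entropic penalty sum of x log x and compares the regularized objective at the logit point against its value at randomly sampled points of the simplex; the score vector, the sample size and the helper names are illustrative choices.

```python
import numpy as np

def neg_entropy(x):
    """Gibbs (negative) entropy, with the convention 0 * log 0 = 0."""
    x = np.asarray(x, dtype=float)
    nz = x > 0
    return float(np.sum(x[nz] * np.log(x[nz])))

def objective(y, x):
    """Regularized payoff <y, x> - h(x) that the choice map maximizes."""
    return float(np.dot(y, x)) - neg_entropy(x)

def logit_choice(y):
    z = np.exp(y - np.max(y))
    return z / z.sum()

rng = np.random.default_rng(0)
y = np.array([1.0, -0.5, 2.0, 0.0])          # arbitrary score vector
best = objective(y, logit_choice(y))
# Dirichlet samples are random points of the simplex used as competitors.
others = [objective(y, p) for p in rng.dirichlet(np.ones(y.size), size=1000)]
logsumexp = float(np.log(np.sum(np.exp(y))))
```

No sampled point beats the logit point, and the optimal value equals the log-sum-exp of the scores, which is the convex conjugate of the entropy on the simplex.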
Example 2.2.
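As another example, consider the penalty function $h(x) = \frac{1}{2}\sum_{\alpha} x_{\alpha}^{2}$. This quadratic penalty function leads to the projected choice map
(2.7)  $Q(y) = \arg\max_{x\in\mathcal{X}} \{ \langle y \mid x\rangle - \tfrac{1}{2}\|x\|^{2} \} = \arg\min_{x\in\mathcal{X}} \|y - x\|^{2},$
and, as was shown by Mertikopoulos and Sandholm (2016), the induced trajectories of (RL) satisfy the so-called projection dynamics
(PD)  $\dot x_{k\alpha} = v_{k\alpha}(x) - |\mathrm{supp}(x_k)|^{-1} \sum_{\beta\in\mathrm{supp}(x_k)} v_{k\beta}(x)$ if $\alpha\in\mathrm{supp}(x_k)$, and $\dot x_{k\alpha} = 0$ otherwise,
on an open dense set of times (in particular, except when the support of $x(t)$ changes). The dynamics (PD) were introduced in game theory by Friedman (1991) as a geometric model of the evolution of play in population games; for a closely related model (but with different long-term properties), see Nagurney and Zhang (1997) and Lahkar and Sandholm (2008).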
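For the quadratic penalty, the choice map amounts to a Euclidean projection onto the simplex, since maximizing the inner product minus half the squared norm is the same as minimizing the distance to the score vector. The following sketch uses a standard sort-and-threshold projection routine; this particular algorithm and the function name `project_simplex` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the probability simplex."""
    y = np.asarray(y, dtype=float)
    u = np.sort(y)[::-1]                       # scores in decreasing order
    css = np.cumsum(u) - 1.0                   # cumulative sums minus the mass 1
    ks = np.arange(1, y.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]  # last index kept in the support
    tau = css[rho] / (rho + 1)                 # common shift applied to all scores
    return np.maximum(y - tau, 0.0)

inside = project_simplex([0.2, 0.3, 0.5])     # already a mixed strategy
corner = project_simplex([2.0, 0.0, 0.0])     # one score dominates
mixed = project_simplex([0.9, 0.8, -1.0])     # low score leaves the support
```

Unlike the logit map, the projected map can assign exactly zero weight to low-scoring actions, which is why the support of the induced strategy can change over time.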
Related models
The reinforcement mechanism of (RL) is seen quite clearly in the work of Vovk (1990) and Littlestone and Warmuth (1994) on multi-armed bandits. Therein, the agent is faced with a repeated decision process (e.g. choosing a slot machine in the eponymous problem) and, at each stage, he selects an action with probability exponentially proportional to an estimate of said action’s cumulative payoff up to that time. In this way, the mean dynamics of the agent’s learning process boil down to the exponential learning scheme (EW) of Example 2.1 – itself a special case of (RL).
Leslie and Collins (2005) and Coucheney et al. (2015) also considered a reinforcement learning process where players discount past observations by a constant multiplicative factor and then play a perturbed best response to the resulting cumulative payoff vector – or estimate thereof. The mean dynamics that describe this process in continuous time are then given by (RL) with a constant learning parameter and an additional adjustment term that accounts for the exponential discounting of past observations. For a more detailed survey of the surrounding literature, we refer the reader to Fudenberg and Levine (1998) and Mertikopoulos and Sandholm (2016).
2.3. Learning in the presence of noise
A key assumption underlying the reinforcement learning scheme (RL) is that the players’ payoff observations are impervious to any sort of exogenous random noise. However, this assumption is rarely met in practical applications of game-theoretic learning: for instance, in telecommunication networks and traffic engineering, signal strength and latency measurements are constantly subject to stochastic fluctuations which introduce noise to the input of any learning algorithm (Kelly et al., 1998; Kang et al., 2009). Thus, to account for the lack of accurate payoff information in settings where uncertainty is an issue, we will consider the stochastically perturbed reinforcement learning process:
(SRL)  $dY_{k\alpha} = v_{k\alpha}(X)\,dt + \sigma_{k\alpha}(X)\,dW_{k\alpha}, \qquad X_k = Q_k(\eta_k Y_k),$
where $W = (W_{k\alpha})$ is a family of standard Wiener processes (assumed independent across players and strategies) and the diffusion coefficients $\sigma_{k\alpha}\colon\mathcal{X}\to\mathbb{R}$ (assumed Lipschitz) measure the strength of the noise process.
Intuitively, the random component of (SRL) means that the players’ observed payoffs are only accurate up to an error term of the order of $\sigma_{k\alpha}$, possibly depending on the players’ mixed strategies but otherwise uncorrelated over players and strategies. (An alternative source of stochasticity could be any inherent randomness in the players’ payoffs. A detailed analysis of this case would require a careful reformulation of the underlying game which, for simplicity, we do not attempt here; for a related treatment in an evolutionary context, see Mertikopoulos and Viossat, 2016.) As such, (SRL) should be viewed as a special instance of the more general case where observation errors exhibit correlations between different actions: for instance, if each player’s action corresponds to a choice of route in a congestion game, any two routes that overlap will exhibit such correlations. The impact of including such correlations in our model is discussed at length in Section 7; however, for simplicity, our analysis will be stated in the uncorrelated case.
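A rough simulation can illustrate how noisy score updates of this kind behave. The sketch below applies an Euler–Maruyama discretization to a two-player game with the logit choice map, a constant learning parameter `eta` and a constant noise level `sigma`; all of these choices, together with the Prisoner's Dilemma payoffs, the seed and the step size, are assumptions made for the example rather than the setting analyzed in the paper.

```python
import numpy as np

def logit_choice(y):
    z = np.exp(y - np.max(y))
    return z / z.sum()

def simulate_srl(U1, U2, sigma=1.0, eta=1.0, horizon=100.0, dt=0.01, seed=0):
    """Euler-Maruyama sketch: scores accumulate payoffs plus Brownian noise."""
    rng = np.random.default_rng(seed)
    y1 = np.zeros(U1.shape[0])
    y2 = np.zeros(U1.shape[1])
    for _ in range(int(round(horizon / dt))):
        x1, x2 = logit_choice(eta * y1), logit_choice(eta * y2)
        v1, v2 = U1 @ x2, U2.T @ x1            # true payoff vectors
        # Observed payoff increments: drift plus independent Gaussian noise.
        y1 = y1 + v1 * dt + sigma * np.sqrt(dt) * rng.standard_normal(y1.size)
        y2 = y2 + v2 * dt + sigma * np.sqrt(dt) * rng.standard_normal(y2.size)
    return logit_choice(eta * y1), logit_choice(eta * y2)

# Hypothetical Prisoner's Dilemma: action 0 = cooperate, action 1 = defect.
U1 = np.array([[3.0, 0.0], [5.0, 1.0]])
U2 = np.array([[3.0, 5.0], [0.0, 1.0]])
x1, x2 = simulate_srl(U1, U2)
```

In this run the payoff advantage of the dominant action accumulates linearly in time while the noise only grows like the square root of time, so both players end up placing most of their weight on action 1.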
Regarding the existence and uniqueness of solutions to (SRL), Proposition B.1 in Appendix B shows that the players’ choice maps $Q_k$ are Lipschitz continuous, so (SRL) admits a unique (strong) solution $Y(t)$ for every initial condition $Y(0)$. Standard arguments can then be used to show that these solutions exist for all time (a.s.), so the players’ mixed strategy profile $X(t)$ is also fully determined for all $t\geq0$. However, since this is a somewhat indirect description of the evolution of $X(t)$, our first task will be to derive the governing dynamics of $X(t)$ in the form of a stochastic differential equation stated directly on $\mathcal{X}$.
For simplicity, we only present here the special case where each player’s penalty function is of the separable form:
(2.8)  $h_k(x_k) = \sum_{\alpha\in\mathcal{A}_k} \theta_k(x_{k\alpha})$
for some strongly convex kernel function $\theta_k\colon[0,1]\to\mathbb{R}$. We then get:
Proposition 2.1.
Let be an orbit of (SRL) in and let be a random open interval such that remains constant over . Then, for all , satisfies the stochastic differential equation
(2.9a)  
(2.9b)  
(2.9c)  
(2.9d) 
where all summations are taken over and:

, , ,

,

.
In particular, if for all , is an ordinary (strong) solution of (2.9); otherwise, satisfies (2.9) on a random open dense subset of .
The proof of Proposition 2.1 is a simple but interesting application of Itô’s formula; to streamline our presentation, we relegate it to Appendix A and we focus here on some remarks and examples (see also Fig. 1 for some sample trajectories of the process):
Remark 2.1.
Even though the dynamics (2.9) appear quite convoluted, each of the constituent terms (2.9a)–(2.9d) admits a relatively simple interpretation:

Finally, the term (2.9d) is the Itô correction induced on $X(t)$ through (SRL) due to the non-anticipative nature of the Itô integral. (If (SRL) had been formulated as a Stratonovich equation, (2.9d) would vanish; however, the future-anticipating nature of the Stratonovich integral (van Kampen, 1981) is not well-suited for our purposes.) Importantly, this term depends on but not , so it does not vanish for constant .
Example 2.3.
As we saw in Example 2.1, the replicator dynamics (RD) correspond to the entropic kernel $\theta(x) = x\log x$. In this case, (2.9) leads to the following stochastic variant of the replicator dynamics:
(SRD)  
For a constant learning parameter, (SRD) is simply the stochastic replicator dynamics of exponential learning studied by Mertikopoulos and Moustakas (2010). As such, (SRD) should be contrasted to the evolutionary replicator dynamics with aggregate shocks of Fudenberg and Harris (1992):
(ASRD)  
where denotes the population share of the th genotype of species in a multi-species environment, represents its reproductive fitness, and the noise coefficients measure the impact of random weather-like effects on population evolution. (For a comprehensive account of the literature surrounding (ASRD), see Hofbauer and Imhof (2009), Mertikopoulos and Viossat (2016), and references therein. In addition to the works mentioned above, Khasminskii and Potsepun (2006) study a Stratonovich-based formulation of (ASRD) while Vlasic (2012) also considers random jumps induced by discontinuous, earthquake-like events.)
Besides the absence of the learning rate $\eta$, the fundamental difference between (SRD) and (ASRD) is in their Itô correction: as we shall see in what follows, this term leads to a drastically different long-term behavior and highlights an important contrast between learning and evolution in the presence of noise.
Example 2.4.
In the case of the projected reinforcement learning scheme (PD), substituting the quadratic kernel $\theta(x) = \frac{1}{2}x^{2}$ in (2.9) yields the stochastic projection dynamics:
(SPD)  
There are two important qualitative differences between (SRD) and (SPD): first, (SRD) holds for all $t\geq0$ whereas (SPD) describes the solution orbits of (SRL) only on intervals over which the support of $X(t)$ remains constant. Second, the projection mapping of (2.7) is piecewise linear, so there is no Itô correction in (SPD); accordingly, the distinction between Itô and Stratonovich perturbations becomes void in the context of (SPD).
3. Regret minimization
We begin our rationality analysis in the case where there is a single player whose payoffs are determined by the state of his environment – which, in turn, may evolve arbitrarily over time (including adversarially if the player is playing against an opponent).
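More precisely, consider a decision process where, at each $t\geq0$, the player chooses an action from a finite set $\mathcal{A}$ according to some mixed strategy $x(t)\in\mathcal{X}\equiv\Delta(\mathcal{A})$ and his expected payoff is determined by the (a priori unknown) payoff vector $v(t)$ of stage $t$. In this context, the performance of a dynamic strategy $x(t)$ can be measured by comparing the player’s (expected) cumulative payoff to the payoff that he could have obtained if the state of nature were known in advance and the player had best-responded to it; specifically, the player’s cumulative regret at time $t$ is defined as
(3.1)  $\mathrm{Reg}(t) = \max_{\alpha\in\mathcal{A}} \int_0^t \big[ v_{\alpha}(s) - \langle v(s) \mid x(s)\rangle \big]\,ds,$
and we say that a strategy is consistent if it leads to no (average) regret, i.e.
(3.2)  $\limsup_{t\to\infty} \tfrac{1}{t}\,\mathrm{Reg}(t) \leq 0,$
or, equivalently:
(3.3)  $\limsup_{t\to\infty} \tfrac{1}{t} \int_0^t \big[ v_{\alpha}(s) - \langle v(s) \mid x(s)\rangle \big]\,ds \leq 0$ for all $\alpha\in\mathcal{A}$.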
The notion of consistency presented above is commonly referred to as external or universal consistency and was originally introduced by Hannan (1957) in a discrete-time context (for a detailed overview, see Fudenberg and Levine (1998), Shalev-Shwartz (2011) and references therein). There, the agent is assumed to receive a payoff vector $v_n$ at each stage $n$ and, observing the past realizations $v_1,\dots,v_{n-1}$, he chooses an action according to some probability distribution $x_n$. The strategy is then said to be externally consistent if, on average, it earns at least as much as any fixed action (or expert suggestion) $\alpha\in\mathcal{A}$; in other words, the no-regret criterion (3.2) is simply the continuous-time analogue of the standard definition of external consistency.
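The regret of a given strategy path can be approximated on a time grid. The sketch below assumes that the payoff stream and the strategy are sampled at equally spaced times with step `dt` and uses a simple Riemann sum for the integrals; the array layout and the function name are illustrative assumptions.

```python
import numpy as np

def cumulative_regret(payoffs, strategies, dt):
    """Riemann-sum approximation of the cumulative regret.

    payoffs:    array of shape (T, n) with the payoff vector at each grid time
    strategies: array of shape (T, n) with the mixed strategy at each grid time
    """
    earned = np.sum(payoffs * strategies, axis=1)   # payoff actually obtained
    gaps = payoffs - earned[:, None]                # advantage of each fixed action
    return float(np.max(np.sum(gaps, axis=0) * dt))

# Toy stream: action 0 always pays 1, action 1 always pays 0, over 10 time units.
payoffs = np.tile([1.0, 0.0], (100, 1))
uniform = np.full((100, 2), 0.5)
greedy = np.tile([1.0, 0.0], (100, 1))

regret_uniform = cumulative_regret(payoffs, uniform, dt=0.1)
regret_greedy = cumulative_regret(payoffs, greedy, dt=0.1)
```

Here the uniform strategy accumulates regret linearly in time (so it is not consistent against this stream), whereas always playing the best action yields zero regret.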
With this in mind, the main question that we seek to address here is whether the perturbed reinforcement learning process (SRL) is consistent. To make this precise, we will focus on the unilateral process
(SRLU)  $dY_{\alpha} = v_{\alpha}\,dt + \sigma_{\alpha}\,dW_{\alpha}, \qquad X = Q(\eta Y),$
where $v(t)$ is a locally integrable stream of payoff vectors, $W(t)$ is a Wiener process in $\mathbb{R}^{\mathcal{A}}$ and the noise coefficients $\sigma_{\alpha}(t)$ (assumed continuous and bounded) represent the error in the player’s payoff observations. In the deterministic case ($\sigma = 0$), Sorin (2009) proved that the unilateral variant of (EW) is consistent, a result which was recently extended by Kwon and Mertikopoulos (2014) to the more general setting of (RL). Below, we show that this learning process remains consistent even when the player’s payoff observations are subject to arbitrarily high observation noise:
Theorem 3.1.
Remark.
The assumption is only made for simplicity: the policy (SRLU) is consistent for any initial score vector but the corresponding regret guarantee is a bit more cumbersome to write down.
The basic idea of our proof will be to compare the cumulative payoff of the policy (SRLU) up to time $t$ to that of an arbitrary test strategy $p\in\mathcal{X}$. To that end, we will examine how far the induced trajectory of play $X(t)$ can stray from $p$; however, since $X(t)$ is defined via the cumulative payoff process $Y(t)$, we will carry out this comparison between $p$ and the score vector directly using the so-called Fenchel coupling:
(3.5)  $F(p, y) = h(p) + h^{*}(y) - \langle y \mid p\rangle,$
where
(3.6)  $h^{*}(y) = \max_{x\in\mathcal{X}} \{ \langle y \mid x\rangle - h(x) \}$
denotes the convex conjugate of $h$ (Rockafellar, 1970). (The terminology “Fenchel coupling” is due to Mertikopoulos and Sandholm (2016) and reflects the fact that (3.5) collects all the terms of Fenchel’s inequality.) By Fenchel’s inequality (Rockafellar, 1970), $F$ is nonnegative for all values of its arguments, so it provides a “congruity” measure between $p$ and $y$. With this in mind, our proof strategy will be to express the player’s regret with respect to a test strategy $p$ in terms of $F$ and then show that the latter grows sublinearly in $t$.
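For the entropic penalty, a coupling of this type has a familiar closed form. The sketch below assumes the coupling is the sum of the penalty at the test strategy and its convex conjugate at the score vector, minus their inner product, and checks numerically that this equals the Kullback–Leibler divergence between the test strategy and the logit response; the vectors `p` and `y` and all helper names are illustrative.

```python
import numpy as np

def neg_entropy(p):
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz])))

def log_sum_exp(y):
    """Convex conjugate of the negative entropy restricted to the simplex."""
    m = np.max(y)
    return float(m + np.log(np.sum(np.exp(y - m))))

def logit_choice(y):
    z = np.exp(y - np.max(y))
    return z / z.sum()

def fenchel_coupling(p, y):
    """h(p) + h*(y) - <y, p>, nonnegative by Fenchel's inequality."""
    return neg_entropy(p) + log_sum_exp(y) - float(np.dot(y, p))

def kl_divergence(p, q):
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

p = np.array([0.2, 0.5, 0.3])     # test strategy
y = np.array([1.0, -2.0, 0.5])    # score vector
coupling = fenchel_coupling(p, y)
kl = kl_divergence(p, logit_choice(y))
zero = fenchel_coupling(logit_choice(y), y)
```

The coupling vanishes exactly when the test strategy coincides with the strategy induced by the scores, which is what makes it usable as a measure of how far play has strayed from a target.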
The details of the proof are presented in Appendix A and rely on Itô’s lemma and the law of the iterated logarithm (which is used to control the impact of the noise on the agent’s learning process). For now, we focus on some aspects of Theorem 3.1 and the regret bound (3.4):
Remark 3.1 (The role of ).
It is important to note that the second term of (3.4) becomes linear in when the player’s learning parameter is constant, explaining in this way the requirement that as . On the other hand, this requirement can be dropped in the noiseless case: when , the guarantee (3.4) reduces to , so (SRLU) remains consistent even for constant – a fact which was first observed by Sorin (2009) in the context of (EW) and Kwon and Mertikopoulos (2014) for (RL). In the presence of noise, we conjecture that (SRLU) may lead to positive regret with positive probability for constant but we have not been able to prove this.
The form of the bound (3.4) also highlights a tradeoff between more aggressive learning rates (slowly decaying $\eta$) and the noise affecting the player’s payoff observations. Specifically, the first (deterministic) term of (3.4) is decreasing in $\eta$ while the second one (which is due to the noise) is increasing in $\eta$ (the last term of (3.4) is also due to the noise, but it does not otherwise depend on $\eta$); as a result, in the absence of noise ($\sigma = 0$), it is better to use a large, constant $\eta$ rather than letting $\eta(t)\to0$, but this might lead to disastrous results under uncertainty. Shalev-Shwartz (2011) draws a similar conclusion for the discrete-time analogue of (RL) known as online mirror descent (OMD); in this way, discretization and random payoff disturbances seem to have comparably adverse effects on the agent’s learning process.
The above considerations can be illustrated more explicitly by choosing for some ; in this case, Theorem 3.1 yields the regret bounds:
Corollary 3.2.
Assume that (SRLU) is run with learning rate for some . Then:
(3.7) 
Proof.
Remark 3.2 (Links with vanishingly smooth fictitious play).
We close this section by discussing the links of (SRLU) with the vanishingly smooth fictitious play process that was recently examined by Benaïm and Faure (2013) in a discretetime setting. Specifically, by interpreting as the payoff that a player obtains in a player game against his opponent’s empirical frequency of play (cf. Section 6), the strategy
(3.8) 
can be viewed as a “vanishingly smooth” best response to the empirical distribution of play of one’s opponent. (Footnote 10: It is “smooth” because the player is employing a regularized best response map, as opposed to a hard correspondence, and it is “vanishingly smooth” because the factor hardens to as ; for a more detailed discussion, see Benaïm and Faure, 2013.) As such, (SRLU) can be seen as a stochastically perturbed variant of vanishingly smooth fictitious play in continuous time, in which case Corollary 3.2 provides the continuous-time, stochastic analogue of Theorem 1.8 of Benaïm and Faure (2013).
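The hardening of a smooth best response can be illustrated numerically. The Python sketch below assumes an entropic penalty function, so that the regularized best response is the logit map; this is an assumption made for concreteness, and other penalty functions lead to different choice maps. The function name `smooth_best_response`, the payoff vector, and the scaling values are hypothetical.

```python
import math

def smooth_best_response(avg_payoff, scale):
    """Logit (softmax) response to a vector of average payoffs.

    `scale` plays the role of the hardening factor: scale = 0 gives the
    uniform distribution, and large values approach a hard best response.
    The entropic regularizer behind this map is an assumption.
    """
    z = [scale * u for u in avg_payoff]
    m = max(z)                              # subtract the max for numerical stability
    w = [math.exp(zi - m) for zi in z]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical average payoffs of three actions against the empirical play.
avg_payoff = [1.0, 0.5, 0.2]
responses = {scale: smooth_best_response(avg_payoff, scale)
             for scale in (0.0, 1.0, 10.0, 100.0)}
```

As `scale` grows, the weight assigned to the first action (the one with the highest average payoff) approaches 1, which is the sense in which the response becomes progressively less smooth.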
4. Extinction of dominated strategies
A fundamental rationality requirement for any game-theoretic learning process is the elimination of suboptimal, dominated strategies. Formally, given a finite game , we say that is dominated by (and we write ) if
(4.1) 
Strategies that are iteratively dominated, undominated, or iteratively undominated are defined similarly; also, for pure strategies , we obviously have if and only if
(4.2) 
With this in mind, given a trajectory of play , , we will say that a pure strategy becomes extinct along if as . More generally, following Samuelson and Zhang (1992), we will say that the mixed strategy becomes extinct along if ; otherwise, is said to survive.
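For pure strategies, the dominance condition can be checked by direct enumeration. The sketch below is restricted to the row player of a two-player game whose payoffs are stored in a matrix `U`, with `U[a][b]` the payoff of action `a` against the opponent's action `b`, and it reads dominance as strict. The function names and the example game are hypothetical, and the check only detects domination by another pure strategy; a pure strategy may also be dominated by a mixed one, which this code does not test.

```python
def is_dominated(U, a, b):
    """True if row action `a` is strictly dominated by row action `b`,
    i.e. `a` pays strictly less than `b` against every opponent action."""
    return all(ua < ub for ua, ub in zip(U[a], U[b]))

def dominated_rows(U):
    """Row actions that are strictly dominated by some other pure row action."""
    n = len(U)
    return [a for a in range(n)
            if any(is_dominated(U, a, b) for b in range(n) if b != a)]

# Prisoner's dilemma, row player's payoffs; actions are (Cooperate, Defect).
PD = [[3, 0],
      [5, 1]]
dominated = dominated_rows(PD)   # Cooperate (index 0) is dominated by Defect
```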
In the case of perfect payoff observations, Mertikopoulos and Sandholm (2016) showed that dominated strategies become extinct under the general reinforcement learning dynamics (RL), thus extending earlier results for the replicator dynamics (for an overview, see Viossat, 2015). Under noise and uncertainty, however, the situation is more complex: in the replicator dynamics with aggregate shocks (ASRD), Cabrales (2000), Imhof (2005) and Hofbauer and Imhof (2009) provided a set of sufficient conditions on the intensity of the noise that guarantee the elimination of dominated strategies; on the other hand, Mertikopoulos and Moustakas (2010) showed that the noisy replicator dynamics (SRD) eliminate all strategies that are not iteratively undominated, with no conditions on the noise level.
As we show below, this unconditional elimination result is a consequence of the core reinforcement principle behind (RL) and it extends to the entire class of learning dynamics covered by (SRL):
Theorem 4.1.
Let be a solution orbit of (SRL). If is dominated (even iteratively), then it becomes extinct along almost surely.
The basic idea of our proof is to show that the Itô process has a dominant drift term that pushes it away from dominated strategies. However, given the complicated form of the (nonautonomous) dynamics (2.9), we do so by using the Fenchel coupling (3.5) to relate the evolution of the generating process to .
Proof of Theorem 4.1.
Suppose so that for some and for all . Then, if is a solution orbit of (SRL), we get:
(4.3) 
or, equivalently:
(4.4) 
where and
(4.5) 
Consider now the rate-adjusted “cross-coupling”
(4.6) 
with defined as in (3.5). Then, by substituting (4.4) in (4.6) and recalling that , we obtain:
(4.7) 
To proceed, Lemma B.4 shows that (a.s.), so the RHS of (4.7) tends to infinity as on account of the fact that (cf. Assumption 1). In turn, this gives , so becomes extinct along by virtue of Proposition B.1. Finally, our claim for iteratively dominated strategies follows by induction on the rounds of elimination of dominated strategies – cf. Cabrales (2000, Proposition 1A). ∎
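The statement of Theorem 4.1 can be explored with a small simulation. The following Python sketch rests on several assumptions that are not part of the theorem: a single player facing a fixed payoff vector, the logit choice map, independent Brownian noise of constant intensity `sigma` on each action's score, a learning rate of the form `(1 + t) ** (-gamma)`, and an Euler–Maruyama discretization with step `dt`. All names and parameter values are hypothetical.

```python
import math
import random

def softmax(z):
    """Numerically stable logit map."""
    m = max(z)
    w = [math.exp(zi - m) for zi in z]
    total = sum(w)
    return [wi / total for wi in w]

def simulate_noisy_scores(v, sigma=1.0, horizon=200.0, dt=0.01, gamma=0.25, seed=0):
    """Euler-Maruyama sketch of noisy score accumulation.

    Each score follows dY_a = v_a dt + sigma dW_a (assumed form of the
    noisy payoff observations); the mixed strategy at the final time is
    softmax(eta(T) * Y(T)) with eta(t) = (1 + t) ** (-gamma).
    """
    rng = random.Random(seed)
    y = [0.0] * len(v)
    steps = round(horizon / dt)
    for _ in range(steps):
        for a in range(len(v)):
            # Drift from the true payoff plus a Gaussian increment of variance dt.
            y[a] += v[a] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    eta = (1.0 + steps * dt) ** (-gamma)
    return softmax([eta * ya for ya in y])

# Action 0 pays 0 and action 1 pays 1 at all times, so action 0 is dominated.
x_final = simulate_noisy_scores([0.0, 1.0])
```

With these parameters, the score difference has drift -1 per unit time while the noise grows only like the square root of time, so the weight `x_final[0]` of the dominated action is negligible at the end of the run. A single run is of course no substitute for the almost sure statement of the theorem.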
Theorem 4.1 shows that dominated strategies become extinct under (SRL) but it does not give any information on the rate of extinction – or how probable it is to observe a dominated strategy above a given level at some . To address this issue, we provide two results below: Proposition 4.2 describes the long-term decay rate of dominated strategies and provides a “large deviations” bound for the probability of observing a dominated strategy above a given level at time . Subsequently, in Proposition 4.3 we estimate the expected time it takes for a dominated strategy to drop below a given level. For simplicity, we present our results in the case where the players’ choice maps are derived from separable penalty functions as in (2.8); again, proofs are relegated to Appendix A:
Proposition 4.2.
Let be dominated by and assume that the choice map of player is generated by a separable penalty function of the form (2.8) with . Then, for all and for all large enough , we have:
(4.8)  
and  
(4.9) 
where:

is the complementary error function.

(note that by assumption).

is the minimum payoff difference between and .

.

is a constant that depends only on the initial conditions of (SRL).
Proposition 4.3.
Propositions 4.2 and 4.3 are our main results concerning the rate of elimination of dominated strategies under uncertainty, so a few remarks are in order:
Remark 4.1 (Asymptotics versus mean behavior).
Propositions 4.2 and 4.3 capture complementary aspects of the statistics of the extinction rate of dominated strategies under (SRL). For instance, (4.8) describes the asymptotic rate of elimination of dominated strategies but it does not provide an estimate of how much time must pass until this rate becomes relevant. On the other hand, the bound (4.10) estimates the mean time it takes for a dominated strategy to fall below a given level; however, in contrast to (4.9), it does not describe how probable it is to observe deviations from this mean.
Remark 4.2 (Comparison with the noiseless case).
When , the decay estimate (4.8) boils down to , a bound which recovers the results of Mertikopoulos and Sandholm (2016) for the deterministic dynamics (RL). Thus, even though the noise may initially mask the fact that a strategy is dominated, Proposition 4.2 shows that the long-run behavior of (SRL) and (RL) is the same as far as dominated strategies are concerned. This is in contrast with the aggregate-shocks dynamics (ASRD) where, even if a dominated strategy becomes extinct, its rate of elimination is different to leading order than in the noiseless case (Imhof, 2005, Theorem 3.1).
Remark 4.3 (The role of ).
The estimate (4.8) shows that running (SRL) with large, constant leads to a much faster rate of extinction of dominated strategies. In particular, recalling that as , the bound (4.9) becomes:
(4.11) 
up to a subleading term of the order of in the exponent. Thus, even though the leading behavior of (4.9) is not affected by the player’s choice of learning parameter, the subleading term is minimized for large, constant so the asymptotic bound (4.11) becomes tighter in that case.
We thus observe an important contrast between regret minimization and the elimination of dominated strategies: whereas the optimal regret guarantee of Theorem 3.1 is achieved for , the asymptotic extinction rate of dominated strategies is much faster for constant . The reason for this disparity is that higher values of reinforce consistent payoff differences and therefore eliminate dominated strategies faster (independently of the noise level). On the other hand, to attain a no-regret state, players must be careful not to make too many mistakes in the presence of noise, so a more conservative choice of is warranted.
5. Long-term stability and convergence analysis
We now turn to the long-term stability and convergence properties of the dynamics (SRL) with respect to equilibrium play. To that end, recall first that is a Nash equilibrium of if it is unilaterally stable for all players, i.e.
(5.1) 
or, equivalently:
(5.2) 
Strict equilibria are defined by requiring that (5.1) hold as a strict inequality for all ; obviously, such equilibria are also pure in the sense that they correspond to pure strategy profiles in (i.e. vertices of ).
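For pure strategy profiles, the equilibrium conditions can be verified by checking every unilateral deviation. The sketch below is restricted to two-player games given by payoff matrices `A` (row player) and `B` (column player), with `A[i][j]` and `B[i][j]` the payoffs at the pure profile `(i, j)`. The function name and the example games are hypothetical, and mixed equilibria are outside the scope of this check.

```python
def is_pure_nash(A, B, i, j, strict=False):
    """Check whether the pure profile (i, j) is a Nash equilibrium.

    With strict=True, every unilateral deviation must be strictly worse,
    which corresponds to a strict equilibrium.
    """
    row_deviations = [A[k][j] for k in range(len(A)) if k != i]
    col_deviations = [B[i][l] for l in range(len(B[i])) if l != j]
    if strict:
        return (all(A[i][j] > d for d in row_deviations)
                and all(B[i][j] > d for d in col_deviations))
    return (all(A[i][j] >= d for d in row_deviations)
            and all(B[i][j] >= d for d in col_deviations))

# Coordination game: both diagonal profiles are strict equilibria.
A_coord = [[2, 0], [0, 1]]
B_coord = [[2, 0], [0, 1]]

# Matching pennies: no pure equilibrium exists.
A_mp = [[1, -1], [-1, 1]]
B_mp = [[-1, 1], [1, -1]]
pure_equilibria_mp = [(i, j) for i in range(2) for j in range(2)
                      if is_pure_nash(A_mp, B_mp, i, j)]
```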
In the noiseless case () with constant learning rates (), Mertikopoulos and Sandholm (2016) recently showed that the deterministic dynamics (RL) exhibit the following properties with respect to Nash equilibria of :
In turn, these properties are generalizations of the long-term stability and convergence properties of the (multi-population) replicator dynamics – sometimes referred to as the “folk theorem” of evolutionary game theory (Hofbauer and Sigmund, 1998, 2003). That being said, the situation is quite different in the presence of noise: for instance, interior Nash equilibria are not even traps (almost sure rest points) of the stochastic reinforcement learning dynamics (SRL), so the ordinary (deterministic) definitions of stability and convergence no longer apply. Instead, in the context of stochastic differential equations, Lyapunov and asymptotic stability are defined as follows (Khasminskii, 2012):
Definition 5.1.
Let . We will say that:

is stochastically (Lyapunov) stable under (SRL) if, for every and for every neighborhood of in , there exists a neighborhood of such that
(5.3) whenever .

is stochastically asymptotically stable under (SRL) if it is stochastically stable and attracting: for every and for every neighborhood of in , there exists a neighborhood of such that
(5.4) whenever .
In the evolutionary setting of the stochastic replicator dynamics with aggregate shocks, Imhof (2005) and Hofbauer and Imhof (2009) showed that strict Nash equilibria are stochastically asymptotically stable under (ASRD) provided that the variability of the shocks across different strategies is small enough. More recently, in a learning context, Mertikopoulos and Moustakas (2010) showed that the same holds for the stochastic replicator dynamics (SRD) of exponential learning (with constant ), irrespective of the variance of the observation noise. However, this last result relies heavily on the specific properties of the logit map (2.6) and the infinitesimal generator of (SRD).
In our case, the convoluted (and nonautonomous) form of the stochastic dynamics (2.9) complicates things considerably, so such an approach is not possible – especially with regard to finding a local stochastic Lyapunov function for the dynamical system (2.9) that governs the evolution of . Nonetheless, by working directly with (SRL), we obtain the following general result:
Theorem 5.2.
The first ingredient of our proof (presented in detail in Appendix A) is to show that if the process remains in the vicinity of for all with positive probability, then must be a Nash equilibrium. This is formalized in the following proposition (which is of independent interest):
Proposition 5.3.
Proof.
If is not Nash, we must have for some player and for some , . On that account, let be a sufficiently small neighborhood of in such that for some and for all . Then, conditioning on the positive probability event that there exists an orbit of (SRL) that is contained in for all , we have: