Fixation and escape times in stochastic game learning
Abstract
Evolutionary dynamics in finite populations is known to fixate eventually in the absence of mutation. We show here that a similar phenomenon can be found in stochastic game-dynamical batch learning, and we investigate fixation in learning processes in a simple game, in two-player games with cyclic interaction, and in the context of the best-shot network game. The analogue of finite populations in evolution is here finite batches of observations between strategy updates. We study when and how such fixation can occur, and present results on the average time-to-fixation from numerical simulations. Simple cases are also amenable to analytical approaches, and we provide estimates of the behaviour of so-called escape times as a function of the batch size. The differences and similarities with escape and fixation in evolutionary dynamics are discussed.
I Introduction
Modern approaches to game theory have moved beyond the identification of equilibrium points of games Nash1950 (); Nash1951 (); Neumann1953 (), and instead consider dynamical processes in populations of agents MaynardSmith1998 (); nowakbook (); Sigmund2010 (), or the adaptation of a given set of agents to each other's actions fudenberg (); Young2004 (); Camerer2003 (). The study of populations of players is the focus of what is now called 'evolutionary game theory' MaynardSmith1998 (); Vega2003 (); Gintis2000 (). Within this field two approaches can broadly be distinguished. The more conventional one describes evolving populations by means of deterministic replicator equations, for textbooks see e.g. hofbauer (). The dynamical behaviour and attractors of these systems are studied with tools from the theory of nonlinear differential equations. Such formulations are formally valid only for infinite populations of agents, and systematically neglect stochastic effects in finite populations. The study of these random processes is at the centre of the second, more recent class of studies in evolutionary game theory, see for example traulsenreview () for a review. Crucial differences between the behaviour of finite and of infinite populations have been identified; for example, finite systems may fixate at pure-strategy absorbing states even when the corresponding deterministic replicator equations have their attractors at mixed equilibria. These stochastic processes are studied with a variety of different tools, including the master equation formalism, system-size expansions, backward Fokker-Planck methods, and other concepts from statistical mechanics kampen (); risken (); gardiner ().
The purpose of the present paper is to parallel existing research on stochastic effects in evolutionary systems with studies of corresponding effects in stochastic learning dynamics. Learning is here related to, but distinct from, evolution. Learning, or adaptation, is concerned with a fixed set of players who interact repeatedly in a given game, and who react to their opponents' actions by modifying their own strategic propensities fudenberg (); Young2004 (). These processes occur on much shorter time scales than evolutionary dynamics. Adaptation dynamics of the type we study here are of interest in two main contexts. First, learning models provide mathematical descriptions of human or animal decision making, and can be used to model the outcome of experiments in behavioural game theory and cognitive science Camerer2003 (). The second main area in which models of adaptation are relevant is machine learning and algorithmic game theory Nisan2007 (). Here the interest is not in the modelling of human behaviour, but instead in the properties and design of algorithms with which to identify equilibrium points or solutions of optimisation problems. Understanding the dynamics of learning is of key importance in both of these applications.
In learning there are no birth-death processes as in evolution, but instead dynamical updates of the agents' strategy profiles in time. Very little work exists on the systematic comparison of the effects of noise in evolution and in learning. Initial investigations galla () have shown that, similarly to what is seen in evolutionary processes, the dynamics and attractors of stochastic learning can be quite different from those of deterministic adaptation processes. Up to now, however, analyses of fluctuations in learning have been limited to the identification of so-called quasi-cycles, also seen in evolution bladon (); mobilia (). In the present paper we aim to establish further analogies between the two modelling approaches, and focus in particular on fixation effects antal (); altrockinger (). Fixation here refers to processes by which dynamical systems reach absorbing states. In evolution these are typically points at the boundaries of strategy space, at which only one species (pure strategy) survives, and where all other strategies are extinct. In finite populations the elimination of species may happen by random drift, and in the absence of mutation a species is never introduced again in the dynamics once all its representatives have been removed. The system thus fixates in an absorbing state.
In this paper we investigate the extent to which a similar removal of strategies may occur in multi-player learning. The analogue of extinction is here the convergence of a player's strategic propensities to a pure strategy. The question we address here is when and how stochastic learning fixates. In particular we ask (i) under what circumstances convergence to pure, rather than mixed, equilibria occurs in learning, (ii) if fixation occurs, what are the corresponding extinction times, and (iii) given that extinction phenomena are well known in evolutionary systems, what are the differences and similarities with fixation in learning? To answer these questions we consider several different types of games. After a general introduction to learning and the required definitions in Sec. II we first study simple two-person games in Sec. III. We then turn to games with cyclic interaction in Sec. IV, before we finally discuss a more intricate best-shot game galeotti (); asta1 (); asta2 () defined on regular random graphs (Sec. V). The final section summarises our results and discusses possible future work.
II Deterministic and stochastic learning
II.1 General definitions
In this paper we will consider both two-player and multi-player games. Interaction will occur in learning processes in which each player interacts only with a small number of other agents; in two-player games we will have N = 2, while for multi-player games one has N > 2. Individual players will typically be labelled by indices i = 1, \dots, N, where N stands for the total number of players in the model at hand. We will restrict the discussion to symmetric noncooperative games. The variable S will indicate the number of pure strategies available to each of the players. Following standard game-theoretic notation we will write \pi(s, s_{-i}) for the payoff player i receives when playing pure strategy s and when her opponents play the actions s_{-i}. This paper focuses only on symmetric games, so that \pi is identical for all players and carries no explicit dependence on i. We will use the notation x_i = (x_i^1, \dots, x_i^S) for player i's mixed strategy, i.e. we have \sum_s x_i^s = 1 with x_i^s \geq 0 for all s. The component x_i^s indicates the frequency with which player i plays pure strategy s.
II.2 Learning
We will here focus on a reinforcement-type learning model, and assume that each player keeps a score valuation of each of her pure strategies; these are a measure of the (perceived) relative performance of the pure actions in the past, and indicate the propensity of playing any particular pure action. Disregarding memory loss, the valuation q_i^s(t) player i has for pure strategy s is the cumulative payoff i would have received in all past rounds up to time t, given the opponents' actions, had i always played pure strategy s up to time t. This will be detailed further below. Following Camerer2003 (); Ho2007 (); satopre (); satopnas () we will assume that, given the score valuations q_i^s(t), player i chooses each of the pure strategies according to a logit rule, i.e. that the probabilities of playing the different pure strategies depend on the score valuations via the following relation:
x_i^s(t) = \frac{e^{\Gamma q_i^s(t)}}{\sum_{s'} e^{\Gamma q_i^{s'}(t)}} \qquad (1)
The variable \Gamma is here a model parameter, and describes a learning rate or intensity of selection. For \Gamma \to \infty the players strictly choose the pure action with the highest propensity. For \Gamma = 0 they play at random.
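As an illustration, the logit rule can be sketched in a few lines of Python; the function and variable names below are ours, not notation from the paper:

```python
import numpy as np

def logit_strategy(q, gamma):
    """Mixed strategy of Eq. (1): x^s is proportional to exp(gamma * q^s)."""
    w = np.exp(gamma * (q - np.max(q)))  # subtract max(q) for numerical stability
    return w / w.sum()

def choose_action(q, gamma, rng):
    """Sample a pure action from the logit rule."""
    return rng.choice(len(q), p=logit_strategy(q, gamma))

q = np.array([1.0, 0.5, 0.1])
x_random = logit_strategy(q, 0.0)    # gamma = 0: uniformly random play
x_greedy = logit_strategy(q, 50.0)   # large gamma: close to strict best response
```

Subtracting max(q) in the exponent leaves Eq. (1) unchanged but avoids floating-point overflow when the intensity of selection is large.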
A learning dynamics is then a prescription governing the evolution of the score valuations q_i^s(t) in time. We will here mostly focus on a reinforcement learning rule of the form
q_i^s(t+1) = (1 - \lambda)\, q_i^s(t) + \pi\big(s, s_{-i}(t)\big) \qquad (2)
The interpretation of these update rules is understood best by first considering the case \lambda = 0: in this case the increment of q_i^s between time steps t and t+1 is the payoff player i would have received had they played pure strategy s, given their opponents' actions. For \lambda = 0 the variable q_i^s(t) is thus the total payoff player i would have received if, given the opponents' play, i had always played action s. A nonzero value of \lambda accounts for exponential discounting over time, or equivalently for a possible memory loss. For \lambda > 0 the outcomes of the game in the distant past have a lesser effect on the valuation than the more recent rounds of the game.
The process defined by Eq. (2) is inherently stochastic, given that all players choose their pure actions according to the probabilistic rules of Eq. (1). A deterministic limit has been considered in satopre (); satopnas (); satophysica (), and can be formulated as
q_i^s(t+1) = (1 - \lambda)\, q_i^s(t) + \sum_{s_{-i}} x_{-i}(s_{-i}, t)\, \pi(s, s_{-i}) \qquad (3)
where x_{-i}(s_{-i}, t) stands for the probability of the joint action s_{-i} being played by i's opponents, i.e. we have x_{-i}(s_{-i}, t) = \prod_{j \neq i} x_j^{s_j}(t). Taking into account Eq. (1) one can then write the update rule solely in terms of the mixed strategies, and finds the following map satopre ()
x_i^s(t+1) = \frac{x_i^s(t)^{1-\lambda}\, e^{\Gamma \bar\pi_i^s(t)}}{\sum_{s'} x_i^{s'}(t)^{1-\lambda}\, e^{\Gamma \bar\pi_i^{s'}(t)}}, \qquad \bar\pi_i^s(t) = \sum_{s_{-i}} x_{-i}(s_{-i}, t)\, \pi(s, s_{-i}) \qquad (4)
To interpolate between the stochastic process defined by Eqs. (1, 2) and the deterministic limit of Eq. (3) we will consider a batch learning process, in which players update their score valuations only once every M rounds of the game, and keep them constant in between. Specifically, we will assume
q_i^s(t+1) = (1 - \lambda)\, q_i^s(t) + \frac{1}{M} \sum_{\tau=1}^{M} \pi\big(s, s_{-i}(\tau)\big) \qquad (5)
with the mixed strategies held constant during each batch of M rounds. We will refer to M as the batch size of the learning process. The batch process at large (but finite) M is here mostly a theoretical vehicle which allows one to understand the dynamics of learning. Real-world adaptation presumably operates close to the limit M = 1; nevertheless, some of the existing work has focused on deterministic learning (M \to \infty). Our work tries to address the gap between these two extreme cases, and to establish in a systematic manner the stochastic effects affecting the dynamics at finite batch sizes. The case M = 1 can be understood as a special limiting case. Previous work has shown that approaches based on a systematic expansion in 1/M can give good results even for small batch sizes galla ().
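A minimal Python sketch of one batch step for a symmetric two-player matrix game may make the procedure concrete; this is our own illustration of Eqs. (1, 2, 5), not code from the paper:

```python
import numpy as np

def batch_step(q1, q2, A, gamma, lam, M, rng):
    """One batch update: play M rounds at fixed scores, then update each
    score by the batch average of the hypothetical payoffs, as in Eq. (5).

    q1, q2 : current score valuations of the two players
    A      : payoff matrix, A[s, s'] = payoff of action s against s'
    gamma  : intensity of selection; lam : memory-loss rate; M : batch size
    """
    def strategy(q):
        w = np.exp(gamma * (q - q.max()))
        return w / w.sum()

    # actions drawn during the batch; scores and strategies stay constant
    a1 = rng.choice(len(q1), size=M, p=strategy(q1))
    a2 = rng.choice(len(q2), size=M, p=strategy(q2))
    # average payoff each pure strategy *would* have earned against the
    # opponent's realised actions in this batch
    pay1 = A[:, a2].mean(axis=1)
    pay2 = A[:, a1].mean(axis=1)
    return (1 - lam) * q1 + pay1, (1 - lam) * q2 + pay2
```

As M grows, the empirical action frequencies within a batch approach the opponent's mixed strategy, and the map approaches the deterministic limit of Eq. (3).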
II.3 Sato-Crutchfield dynamics in continuous time
In order to make contact with deterministic descriptions of evolutionary systems it is helpful to consider the continuous-time limit of the deterministic learning process, Eq. (4). Assuming the validity of such a limit for a small intensity of selection \Gamma, and following satopre (); satophysica (), one finds
\dot x_i^s = x_i^s\, \Gamma \Big[ \bar\pi_i^s(x) - \sum_{s'} x_i^{s'} \bar\pi_i^{s'}(x) \Big] + \lambda\, x_i^s \Big[ \sum_{s'} x_i^{s'} \ln x_i^{s'} - \ln x_i^s \Big] \qquad (6)
For \lambda = 0 this reduces to a set of multi-population replicator equations, a signature of the close connection between evolutionary processes and adaptive learning.
III Two-player Hawk-Dove game
III.1 Definition and replicator flow
We will first consider a symmetric game, the so-called Hawk-Dove game (also referred to as the coexistence game or the anti-coordination game), defined by the payoff matrix
(7) 
where the two payoff parameters are fixed constants. We will label the elements of the payoff matrix by a_{ss'}, where s and s' can each take one of two values representing the pure strategies of this game. In the learning process two players interact repeatedly; the strategy of each player is fully characterized by the probability of playing 'Hawk'. We will denote this probability by x for the first player, and by y for player 2. In the absence of memory loss, and taking the continuous-time limit of Eqs. (4), we obtain the two-population replicator dynamics
\dot x = \Gamma\, x(1-x)\big[(a_{HH} - a_{DH})\,y + (a_{HD} - a_{DD})(1-y)\big], \qquad \dot y = \Gamma\, y(1-y)\big[(a_{HH} - a_{DH})\,x + (a_{HD} - a_{DD})(1-x)\big] \qquad (8)
These equations are obtained by setting \lambda = 0 in the above Sato-Crutchfield equations (6) and upon using the above payoff matrix (7). It is straightforward to work out the corresponding deterministic flow; we illustrate it for completeness in Fig. 1. The replicator dynamics has one reactive fixed point at x = y = x^*, and two pure-strategy fixed points at (x, y) = (1, 0) and (x, y) = (0, 1). These fixed points at the boundary of strategy space are stable attractors; the central fixed point is a saddle with one stable and one unstable eigendirection. The stable eigenvector points along the diagonal, and restricting the dynamics to this direction (i.e. setting x = y) hence yields a stable flow towards the central fixed point. The single-population replicator equation
\dot x = \Gamma\, x(1-x)\big[(a_{HH} - a_{DH})\,x + (a_{HD} - a_{DD})(1-x)\big] \qquad (9)
therefore converges to x^*, provided non-extremal initial conditions (0 < x(0) < 1) are chosen. The two-population system will generally fixate at one of the corner attractors for generic initial conditions x(0) \neq y(0); only in the restricted case x(0) = y(0) is the symmetry between the players preserved, and the dynamics then converges to the mixed fixed point.
III.2 Fixation in stochastic learning
We will first address the learning dynamics in the absence of memory loss (\lambda = 0); the effects of exponential discounting are described in Sec. III.4. Numerical simulations show that stochastic learning without memory loss will generally fixate in one of the two corners, (x, y) = (1, 0) or (0, 1), of strategy space; a typical trajectory generated by the learning dynamics at finite batch size is shown in Fig. 1. This is further illustrated in the left panel of Fig. 2, where we show the evolution of x and y in stochastic learning at different batch sizes M. The dynamics are here started from symmetric initial conditions, and will initially follow the replicator flow closely, and approach, but not reach, the replicator fixed point at x = y = x^*. Fluctuations, which will invariably occur at any finite batch size, break the symmetry between x and y, however, and the system will generally drift off the diagonal relatively quickly. While the stable eigenvalue of the central fixed point still exerts some attraction towards the centre, the unstable direction will eventually take over and draw the learning process to (1, 0) or (0, 1). Which one of these corners is reached is purely random, and determined by the nature of sampling errors in the adaptation process. Large batch sizes here reduce the amount of noise in the dynamics, and the system hence follows the deterministic flow longer at large than at small batches, as illustrated in the left panel of Fig. 2. In the right panel of the figure we have measured the time-to-fixation more systematically. Specifically, we consider the system to be fixated once x and y have each approached the values 0 or 1 up to an accuracy set by a small threshold (see Fig. 2). Once this condition is met each player plays essentially a pure strategy, i.e. the system is close to one of the corners of strategy space up to deviations smaller than this threshold. We find logarithmic behaviour of the so-defined fixation time as a function of the batch size M.
This is consistent with observations in one-dimensional evolutionary coordination games with one central unstable fixed point nowakbook (); traulsenreview ().
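To make the measurement protocol concrete, the following sketch implements the fixation-time criterion for a generic Hawk-Dove-type game; the payoff values, threshold, and all names are illustrative choices of ours, not the paper's:

```python
import numpy as np

def strat(q, gamma):
    """Probability of the first action under the logit rule, Eq. (1)."""
    w = np.exp(gamma * (q - q.max()))
    return (w / w.sum())[0]

def fixation_time(A, gamma, M, delta, rng, max_steps=20000):
    """Batch learning (lambda = 0) until both players are within delta of a
    pure strategy; returns the number of batch steps taken."""
    q1 = rng.normal(0.0, 0.01, 2)  # small random asymmetry in initial scores
    q2 = rng.normal(0.0, 0.01, 2)
    for t in range(1, max_steps + 1):
        x1, x2 = strat(q1, gamma), strat(q2, gamma)
        if min(x1, 1 - x1) < delta and min(x2, 1 - x2) < delta:
            return t
        a1 = rng.choice(2, size=M, p=[x1, 1 - x1])
        a2 = rng.choice(2, size=M, p=[x2, 1 - x2])
        q1 = q1 + A[:, a2].mean(axis=1)  # Eq. (5) with lambda = 0
        q2 = q2 + A[:, a1].mean(axis=1)
    return max_steps

# illustrative Hawk-Dove-type payoffs (not the paper's specific values):
# the best response is always the opposite of the opponent's action
A = np.array([[-1.0, 2.0],
              [ 0.0, 1.0]])
t_fix = fixation_time(A, gamma=1.0, M=50, delta=0.05, rng=np.random.default_rng(0))
```

Averaging t_fix over many runs at several batch sizes M reproduces the qualitative logarithmic trend discussed above.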
It is generally very hard to compute the time-to-fixation of stochastic processes analytically; this applies both to learning processes and to evolutionary dynamics. In the latter, general analytical results have been obtained only for one-population models nowakbook (); traulsenreview (). One major complication here is the fact that the dynamics in most other cases has at least two degrees of freedom, impeding a full analytical solution.
Partial analytical results for game-dynamical learning can however be obtained for what we will refer to as 'escape times' in the following; see also mobilia () for studies of escape times in cyclic evolutionary games. For a given (finite) batch size M we here start the learning dynamics at the deterministic fixed point x = y = x^*, and run the stochastic dynamics until the system reaches a given distance from this fixed point. More precisely, we define the escape time as the time at which the variable |x - y| first exceeds a given threshold \hat z. This measure of distance was chosen for analytical convenience, as will become clear below. Results from simulations are shown in Fig. 3.
Analytical predictions of the escape time for small values of \hat z are possible within a linear approximation about the central fixed point. Using the methods detailed in galla (); galla2 () we find that, in this linear regime and for large but finite batch size M, the two-player learning dynamics can be described by the following Langevin dynamics
\frac{\rm d}{{\rm d}t} \begin{pmatrix} \delta x \\ \delta y \end{pmatrix} = J \begin{pmatrix} \delta x \\ \delta y \end{pmatrix} + \frac{1}{\sqrt{M}} \begin{pmatrix} \xi_x(t) \\ \xi_y(t) \end{pmatrix} \qquad (10)
where the matrix J is the Jacobian of the continuous-time learning dynamics (equivalent to the replicator equations for the case we are considering here); specifically, at vanishing memory loss we have
(11) 
Given that (1, -1)^{\rm T} is an eigenvector of the above Jacobian (with a positive eigenvalue, which we denote by a) we then have
\dot z = a\, z + b\, \xi(t) \qquad (12)
where z = \delta x - \delta y, and where \xi(t) is Gaussian white noise with \langle \xi(t) \xi(t') \rangle = \delta(t - t'), with b^2 \propto 1/M. Using results for escape times of general Langevin processes of the form \dot z = a z + b \xi(t) (see appendix) we then obtain the following prediction for the escape time
T_{\rm esc}(\hat z) = \frac{2}{b^2} \int_0^{\hat z} {\rm d}y\; e^{-a y^2 / b^2} \int_0^{y} {\rm d}x\; e^{a x^2 / b^2} \qquad (13)
with \hat z the escape threshold introduced above. In our specific example the constants a and b follow from the linearisation, Eqs. (10-12), with b^2 \propto 1/M, so that the escape time grows logarithmically in the batch size. As seen in Fig. 3 this compares well with numerical simulations.
III.3 Comparison with evolutionary dynamics in finite populations
We have already indicated that the behaviour of the stochastic learning dynamics is, to an extent, similar to evolutionary processes. To quantify this further we investigate both onepopulation and twopopulation stochastic evolutionary processes in this section.
Two-population dynamics
Specifically, we consider two populations, each composed of N players. Each of these players will either be a Hawk or a Dove; we denote the number of Hawks in the first population by n_1, and the number of Hawks in the second population by n_2 respectively. The corresponding numbers of Doves are then N - n_1 and N - n_2 in the two populations. Players in the first population only play against players of the second population, and vice versa. The fitnesses of Hawk and Dove players in the first population are then for example given by
f_H^{(1)} = a_{HH}\, \frac{n_2}{N} + a_{HD}\, \frac{N - n_2}{N}, \qquad f_D^{(1)} = a_{DH}\, \frac{n_2}{N} + a_{DD}\, \frac{N - n_2}{N} \qquad (14)
and similar definitions hold for individuals in the second population. In order to specify a microscopic dynamics we use the so-called 'local update rule', sometimes also referred to as the 'pairwise comparison process' traulsenreview (). A player of the 'Dove' type is converted into a 'Hawk' player with a rate that grows with the fitness difference f_H^{(\alpha)} - f_D^{(\alpha)}, where \alpha = 1, 2 labels the two populations. Similarly, conversions of Hawk players into Dove players occur with a rate set by f_D^{(\alpha)} - f_H^{(\alpha)}. Specifically, we will use the following transition rates
T_\alpha^{+} = \frac{n_\alpha}{N}\, \frac{N - n_\alpha}{N} \left[ \frac{1}{2} + \frac{w}{2} \big( f_H^{(\alpha)} - f_D^{(\alpha)} \big) \right], \qquad T_\alpha^{-} = \frac{n_\alpha}{N}\, \frac{N - n_\alpha}{N} \left[ \frac{1}{2} - \frac{w}{2} \big( f_H^{(\alpha)} - f_D^{(\alpha)} \big) \right] \qquad (15)
The factors of the form (n_\alpha / N)\,((N - n_\alpha)/N) here indicate that two players of different types need to be drawn from any one population in order for an interaction to occur. It is important to stress that reproduction and selection occur within the separate populations, i.e. at no point is an individual of one population converted into a member of the other. Interaction between the populations occurs via Eq. (14), i.e. the fitness of members of population one depends on the composition of population two and vice versa.
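Transition rates of this kind can be simulated directly. The sketch below is our own generic realisation of a pairwise-comparison ('local update') process in two coupled populations; the payoff values, the selection strength w, and the exact normalisation of the rates are illustrative assumptions rather than the paper's choices:

```python
import numpy as np

def two_pop_step(n1, n2, N, A, rng, w=0.1):
    """One update attempt per population of a pairwise-comparison process.

    n1, n2 : numbers of Hawks in populations 1 and 2 (each of size N)
    A      : 2x2 payoff matrix, A[a, b] = payoff of a against b (index 0 = Hawk)
    w      : selection strength
    """
    new = [n1, n2]
    for pop, (n_self, n_other) in enumerate([(n1, n2), (n2, n1)]):
        x_other = n_other / N                        # Hawk fraction faced, cf. Eq. (14)
        f_hawk, f_dove = A @ np.array([x_other, 1.0 - x_other])
        # two players of different types must be drawn for anything to happen
        if rng.random() < (n_self / N) * ((N - n_self) / N):
            p_up = np.clip(0.5 + 0.5 * w * (f_hawk - f_dove), 0.0, 1.0)
            new[pop] += 1 if rng.random() < p_up else -1
    return new[0], new[1]
```

Note that the homogeneous states n = 0 and n = N are absorbing: the pair-drawing factor vanishes there, so no further transitions occur, mirroring the fixation behaviour discussed in the text.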
In the deterministic limit (N \to \infty) one recovers the two-population replicator dynamics
\dot x = w\, x(1-x)\big[(a_{HH} - a_{DH})\,y + (a_{HD} - a_{DD})(1-y)\big], \qquad \dot y = w\, y(1-y)\big[(a_{HH} - a_{DH})\,x + (a_{HD} - a_{DD})(1-x)\big] \qquad (16)
where we have used the replacements x = n_1/N and y = n_2/N in Eqs. (15) to obtain the deterministic drift. The deterministic flow of these replicator equations is the one indicated in Fig. 1; in particular the central fixed point has one stable and one unstable eigendirection. Fixation of the stochastic evolutionary dynamics can occur at any of the four corners of strategy space. We show results for the average time-to-fixation in the inset of Fig. 4; the fixation time depends logarithmically on the system size N.
As in the learning dynamics, an analytical calculation of the fixation time is very difficult. Estimates for the escape times can however be obtained within a system-size expansion about the fixed point of the deterministic replicator equations. Following standard methods based on the so-called 'van Kampen expansion' in the inverse system size 1/N kampen () one finds
(17) 
where \delta x and \delta y denote fluctuations of x and y about the fixed point. As before, \xi_x and \xi_y describe Gaussian noise; from the van Kampen expansion one finds vanishing means as well as the correlations
(18) 
This translates into a Langevin equation
\dot z = a\, z + b\, \xi(t) \qquad (19)
for the variable z = \delta x - \delta y, with effective constants a and b determined by the expansion. A theoretical prediction for the escape time can hence be found by using these values in Eq. (13). Results are tested against simulations and confirmed successfully in Fig. 4.
One-population dynamics
In the one-population model one considers a single population of N individuals, each of whom can either be a Hawk or a Dove player. The state of the system is hence characterized by a single integer, the number n of Hawks. The transition rates of the local update process read
(20) 
The analysis of this model is not new as such; the study of single-population dynamics of games is in fact standard, see for example traulsenreview (); altrockinger (). We present results here mainly for completeness and in order to contrast them with the above two-population case.
In the deterministic limit (N \to \infty) the following replicator equation is obtained from Eq. (20):
\dot x = w\, x(1-x)\big[(a_{HH} - a_{DH})\,x + (a_{HD} - a_{DD})(1-x)\big] \qquad (21)
This corresponds to restricting the two-population replicator equations (16) to the subspace in which x = y. In order to explore stochastic corrections to this limiting behaviour at next-to-leading order we again carry out the system-size expansion. As before we do not report the detailed mathematics, which is tedious but standard. Defining the appropriate fluctuation variable z one finds
(22) 
where the constants a and b can be read off from the expansion. Inserting these values in Eq. (13) we obtain semi-analytical predictions for the escape time. These results are compared with simulations in Fig. 5. As seen in the figure, the escape time no longer scales logarithmically in the system size, as was the case in the two-population model; instead, the escape is now exponentially slow in the asymptotic limit of large N. We have also measured the actual fixation time (see inset of Fig. 5); fixation times scale exponentially in the system size. Analytical results can here be obtained based on the methods described for example in traulsenreview (). For completeness we show the results of these calculations in the inset of Fig. 5.
III.4 Effects of memory loss in two-player learning
Unlike in evolutionary dynamics, where fixation can occur in the absence of mutation purely by random drift, fixation in stochastic game learning is strictly tied to the convergence of the limiting deterministic learning process to pure-strategy equilibria. In order to demonstrate this we will extend the analysis to nonzero memory-loss rates in the following. Deterministic learning of the Hawk-Dove game in discrete time is then described by the two-dimensional map given by Eq. (4) (with the appropriate substitutions for the payoff structure). The symmetric point x = y = x^* is a fixed point, and the relevant eigenvalues are readily identified. For fixed \Gamma we find that the central fixed point is stable whenever the memory-loss rate \lambda exceeds a critical value \lambda_c. In order to characterise the outcome of learning we have to distinguish between three different types of behaviour:

For \lambda = 0 the central fixed point is not a stable attractor of the deterministic learning process. In this regime the diagonal x = y is still a stable eigendirection, so deterministic learning will converge to the mixed fixed point provided it is started from symmetric initial conditions (x(0) = y(0)). For generic initial conditions this symmetry is broken, however, and the dynamics is observed to approach either (1, 0) or (0, 1) asymptotically. Noise in learning has a similar symmetry-breaking effect, and will drive the dynamics to one of the pure-strategy attractors.

For 0 < \lambda < \lambda_c the central fixed point is again not a stable attractor of the learning dynamics, and deterministic learning will converge to it only if started from symmetric initial conditions. For asymmetric initial conditions the dynamics will approach an asymmetric fixed point (x \neq y), which is generally not a pure strategy for \lambda > 0. With noise, learning fluctuates around this symmetry-broken attractor. Memory loss in learning thus acts similarly to mutation in evolutionary dynamics, and impedes absorption at the boundaries.

For \lambda > \lambda_c deterministic learning converges to the central fixed point even for non-symmetric initial conditions. In this case there is no fixation; the dynamics of stochastic learning will fluctuate around the mixed-strategy equilibrium asymptotically.
This behaviour is illustrated further in Fig. 6.
IV Escape rates in cyclic games
We now consider a two-player discrete-time learning dynamics in the rock-paper-scissors game (RPS). Detailed analyses of evolutionary processes in this cyclic game can for example be found in mobilia (). We here focus on learning, and first concentrate on the deterministic limit. Specifically, using the deterministic limit of Eqs. (4) we have the following map
x_i^s(t+1) = \frac{x_i^s(t)^{1-\lambda}\, e^{\Gamma u_i^s(t)}}{\sum_{s'} x_i^{s'}(t)^{1-\lambda}\, e^{\Gamma u_i^{s'}(t)}}, \qquad i = 1, 2 \qquad (23)
where
u_i^s(t) = \sum_{s'} a_{ss'}\, x_j^{s'}(t), \qquad j \neq i \qquad (24)
and where (a_{ss'}) is the standard RPS payoff matrix, i.e.
\big(a_{ss'}\big) = \begin{pmatrix} 0 & -1 & 1 \\ 1 & 0 & -1 \\ -1 & 1 & 0 \end{pmatrix} \qquad (25)
Due to the overall normalisation, \sum_s x_i^s = 1, the above map defines a four-dimensional dynamical system. The mixed-strategy point x_i^s = 1/3 for all i and s is always a fixed point, and the corresponding Jacobian is easily computed. One finds the following eigenvalues
(26) 
each with degeneracy two. Thus, the central fixed point is stable if and only if these eigenvalues lie inside the unit circle. For a fixed choice of \lambda one therefore has stability for \Gamma below a critical value \Gamma_c, and an unstable fixed point otherwise.
This separation of two regimes, one with a stable fixed point and the other with a deterministic flow away from the centre of strategy space, is reflected in the escape times of stochastic learning. Results are shown in Fig. 7. In our simulations the stochastic learning dynamics is started at the fixed point at the centre of the strategy simplex and evolved at finite batch size M. The system does not fixate into one pure strategy, so the escape time is measured as the point in time at which the four-dimensional vector of strategies first crosses a sphere of a given radius around the fixed point. As seen in the figure, the escape time scales sublinearly with the batch size if the fixed point is unstable (\Gamma > \Gamma_c). For neutrally stable deterministic dynamics (\lambda = 0) algebraic scaling is found, and escape is subextensively slow in the regime of a stable fixed point (\Gamma < \Gamma_c).
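A hedged sketch of how such escape times can be measured numerically follows; the initial conditions, the Euclidean distance measure, and all parameter values are our own illustrative choices:

```python
import numpy as np

RPS = np.array([[ 0.0, -1.0,  1.0],
                [ 1.0,  0.0, -1.0],
                [-1.0,  1.0,  0.0]])   # standard zero-sum rock-paper-scissors

def strategy(q, gamma):
    """Logit rule of Eq. (1)."""
    w = np.exp(gamma * (q - q.max()))
    return w / w.sum()

def rps_escape_time(gamma, lam, M, radius, rng, max_steps=100000):
    """Batch learning started at the centre of the simplex; returns the first
    batch step at which either player's mixed strategy leaves a ball of the
    given radius around (1/3, 1/3, 1/3)."""
    q1, q2 = np.zeros(3), np.zeros(3)
    centre = np.ones(3) / 3.0
    for t in range(1, max_steps + 1):
        x1, x2 = strategy(q1, gamma), strategy(q2, gamma)
        if max(np.linalg.norm(x1 - centre), np.linalg.norm(x2 - centre)) > radius:
            return t
        a1 = rng.choice(3, size=M, p=x1)
        a2 = rng.choice(3, size=M, p=x2)
        q1 = (1 - lam) * q1 + RPS[:, a2].mean(axis=1)
        q2 = (1 - lam) * q2 + RPS[:, a1].mean(axis=1)
    return max_steps
```

Scanning the average of this quantity over the batch size M in the stable and unstable regimes reproduces the qualitative difference in scaling described above.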
V Network games
V.1 Definition of the game
We will now move to a more complex multi-player game defined on a networked structure, and consider the so-called 'best-shot game' galeotti (). Analyses of the statistics of Nash equilibria of this game on random graphs can be found in asta1 (); asta2 (). We here again focus on adaptive learning. Players are labelled by i = 1, \dots, N and arranged on an undirected graph, so that players i and j interact if and only if the link between i and j is present in the graph. In the 'best-shot' game each player has the choice between two actions, to 'contribute' or not to contribute. For simplicity we will refer to these actions as 1 and 0 respectively. The payoff any given player receives in any round of the game then depends on her action and on the actions of her neighbours on the underlying network. If we write \partial i for the set of neighbours of player i, then the best-shot game is defined by the following payoff structure for action 1
(27) 
and by the following payoffs for action 0
(28) 
The two constants appearing in these payoffs are positive. To a certain extent the game resembles the typical structure of public-goods games. In the absence of any contributors in player i's neighbourhood, player i will increase her payoff by contributing. If however at least one of her neighbours is contributing already, then player i will not want to contribute herself.
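In code, this payoff structure can be sketched as follows; the constant names b (benefit) and c (cost), their values, and the condition 0 < c < b are our own labels and assumptions standing in for the paper's two positive constants:

```python
def best_shot_payoff(action, neighbour_actions, b=1.0, c=0.5):
    """Payoff of one player in a best-shot-type game.

    action            : 1 to contribute, 0 not to contribute
    neighbour_actions : iterable with the actions of the player's neighbours
    b, c              : hypothetical benefit and cost, with 0 < c < b
    """
    # the benefit is collected if the player herself or any neighbour contributes
    covered = action == 1 or any(a == 1 for a in neighbour_actions)
    return (b if covered else 0.0) - (c if action == 1 else 0.0)
```

With 0 < c < b this reproduces the best-response structure described above: contributing is optimal when no neighbour contributes (b - c > 0), and free-riding is optimal when at least one neighbour contributes (b > b - c).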
V.2 Sato-Crutchfield equations and homogeneous fixed point
We will write x_i^s(t), with s \in \{0, 1\}, for the probability with which player i takes action s at time t. One always has x_i^0(t) + x_i^1(t) = 1. In the continuous-time limit one obtains the following deterministic equation
(29) 
Taking into account that
(30)  
(31) 
we can rewrite the equations above in terms of the variables x_i \equiv x_i^1:
(32) 
where we have introduced a convenient shorthand notation. Up to now all derivations hold for any network structure. In order to keep the analytical expressions at a manageable level we will from now on restrict the analysis to regular graphs, i.e. to graphs in which all players have the same number of neighbours. We will denote the degree of the resulting regular network by K. Looking for homogeneous fixed-point solutions of the above continuous-time dynamics, i.e. setting x_i = x^* for all players i, one finds
(33) 
Excluding trivial fixed points, i.e. assuming x^* \neq 0 and x^* \neq 1, one obtains
(34) 
The solutions to this equation give the possible fixed points of the deterministic learning dynamics. In the cases studied here we will typically have one internal fixed point x^*; its numerical value will generally depend on the model parameters \Gamma and \lambda. In the limit of vanishing memory loss we recover the homogeneous mixed Nash equilibrium of the game.
V.3 Stability analysis
Expanding Eq. (32) around the fixed point x^* to linear order one finds
(35) 
where \delta x_i = x_i - x^* denotes small deviations from the fixed point. Eq. (35) can then be written in matrix form as
(36) 
where A is the adjacency matrix of the graph and 1 the identity matrix. Diagonalizing this equation is equivalent to diagonalizing A. In particular, the critical value of \Gamma, separating the phase in which the fixed point is stable from a phase with an unstable fixed point, is given by
(37) 
where \mu_{\min} is the smallest eigenvalue of the adjacency matrix (all eigenvalues are real since A is symmetric).
In analogy with the earlier sections we expect that the escape time of learning will scale logarithmically in the batch size when the interior fixed point is unstable (\Gamma > \Gamma_c). In the stable phase (\Gamma < \Gamma_c), on the other hand, one would predict an exponential behaviour. We verify these predictions in the following section. As a final remark regarding stability it is interesting to consider the limiting case of deterministic learning started from homogeneous initial conditions x_i(0) = x_0 for all i. For regular networks of degree K one then has x_i(t) = x(t) for all i, and x(t) fulfils
(38) 
Linearising about the fixed point, and restricting the motion to the space of homogeneous configurations, one finds
(39) 
The interior fixed point is therefore stable against homogeneous perturbations irrespective of the value of the parameter \Gamma, similar to what was observed in the Hawk-Dove game. The network game considered in this section hence bears close similarity to the Hawk-Dove game discussed earlier on. In a one-population setting (equivalently, upon restricting the dynamics to the subspace x_1 = \dots = x_N) the deterministic dynamics has a stable internal fixed point for any \Gamma. In the multi-population case the fixed point remains unchanged, but unstable eigenvalues are present for \Gamma > \Gamma_c. The corresponding eigendirections break the symmetry between the different coordinates x_i, and hence the flow is away from the manifold defined by x_1 = \dots = x_N.
V.4 Test against simulations
The above theoretical predictions can be tested in several possible ways. For example one can consider the thermodynamic limit of large regular random networks of degree K, and then perform an average over multiple instances of the graph. Using results from spectral graph theory mckay (); cioaba (), the support of the eigenvalue distribution of the adjacency matrix of a large regular random graph of degree K typically has its most negative eigenvalue \mu_{\min} at
\mu_{\min} \approx -2\sqrt{K - 1} \qquad (40)
With this estimate the expected value of \Gamma_c can then be computed by means of Eq. (37). Simulations of the learning dynamics on large networks are however time-consuming, and we have therefore taken a different route. We have created one particular instance of a regular random graph with N nodes and degree K. The adjacency matrix of this particular graph has then been diagonalised and the relevant eigenvalue \mu_{\min} has been identified numerically. For convenience we have also chosen the payoff constants such that the deterministic fixed point takes a simple value.
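As a quick illustration of the spectral computation involved here, the snippet below diagonalises the adjacency matrix of a small 3-regular graph (the Petersen graph, used as a stand-in for the paper's regular random graph instance) and compares its smallest eigenvalue with the asymptotic random-regular estimate -2\sqrt{K-1}:

```python
import numpy as np

# adjacency matrix of the Petersen graph, a 3-regular graph on 10 nodes
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0),   # outer 5-cycle
         (5, 7), (7, 9), (9, 6), (6, 8), (8, 5),   # inner pentagram
         (0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]   # spokes
A = np.zeros((10, 10))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

K = int(A.sum(axis=0)[0])        # common degree, here K = 3
evals = np.linalg.eigvalsh(A)    # real spectrum in ascending order (A symmetric)
mu_min = evals[0]                # most negative eigenvalue
estimate = -2.0 * np.sqrt(K - 1) # asymptotic edge for large random regular graphs
```

For this particular (highly structured) graph mu_min = -2, which lies above the asymptotic random-graph edge of about -2.83; for a large random regular instance the two values would be close.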
Fig. 9 illustrates the dynamical evolution under either the deterministic replicator equations or the stochastic learning dynamics. It shows how the system is driven to fixation by a non-homogeneous perturbation, caused either by a slight heterogeneity in the initial conditions of the replicator equations or by the stochasticity induced by a finite batch size in the learning dynamics. Specifically one finds that (i) the fixed point is an attractor for homogeneous initial conditions, and (ii) any non-homogeneous perturbation, whether introduced through the initial conditions or generated by the noise at finite batch size, eventually drives the system away from the fixed point and towards fixation.
VI Conclusions
In summary, we have studied the fixation properties of simple reinforcement-type learning algorithms in the context of different games, and have compared them to the outcome of evolutionary dynamics. The examples we have chosen range from simple two-player two-action games to more complex multi-player games on networked structures. Our main results can be summarized as follows: (i) unlike in evolutionary dynamics, where fixation can occur purely driven by fluctuations, fixation (i.e. convergence to pure strategies) in learning of games of the type we have studied here appears to be possible only if the underlying deterministic dynamics itself converges to a pure action profile; this is typically only the case if the symmetry between players is broken, for example by an inhomogeneous initial condition. (ii) Two-player and multi-player learning in the deterministic limit can, to a good approximation, be described by equations of a multi-population replicator type. As seen for the Hawk-Dove game, and for the network game we have studied, the stability properties of multi-population replicator dynamics can differ substantially from those of the corresponding one-population model. (iii) The role of noise in fixation processes in dynamical learning is mostly limited to triggering the required breaking of symmetry, eventually leading to fixation; unlike in evolutionary processes, we have not found examples in which fixation is triggered by random drift alone. (iv) In cases for which the limiting deterministic learning converges to a symmetric fixed point in the interior of strategy space, the corresponding escape time depends on the stability of this fixed point: for stable fixed points escape is essentially exponential, while for unstable fixed points logarithmic scaling in the batch size is found. These findings are very similar to those in evolutionary systems.
While we have pointed out crucial differences between multi-player learning and evolutionary dynamics, our results mostly extend the similarities between the two approaches to dynamical aspects of games. In galla () it was pointed out that stochastic learning can exhibit persistent quasi-cycles in regimes where deterministic learning converges to fixed points. These effects are very similar to those observed in evolutionary systems. The present work shows that the analogy goes further, and that the escape and fixation properties of stochastic learning dynamics are closely related to the corresponding behaviour of population models. Our analysis shows that the analogy is particularly strong when learning is compared with evolution in multiple populations. We expect that these similarities stretch even further, including potentially pattern formation in spatially extended systems, or more complicated dynamics on adaptive networks. Future work may also include more complex learning models camerer1 (); Camerer2003 (); Ho2007 () inspired by laboratory experiments in behavioural game theory or by algorithms in machine learning. We are confident that the analogy with stochastic evolutionary systems provides a powerful perspective, and that it can contribute to accelerating the research required to analyze more general learning dynamics.
Acknowledgements.
TG and JRG acknowledge the hospitality of the Abdus Salam ICTP. TG is grateful for funding by the Research Councils UK (RCUK reference EP/E500048/1) and by EPSRC (references EP/I005765/1 and EP/I019200/1), and would like to thank A. J. Bray for useful discussions. BSZ acknowledges support by an EPSRC studentship.
Appendix: Escape time for linear Langevin dynamics
.1 Reduced problem and backward Fokker-Planck equation
Let us consider the simple 1D problem
$\dot{x}(t) = \lambda x(t) + \frac{1}{\sqrt{N}}\,\xi(t)$,   (41)
where $\xi(t)$ is standard Gaussian white noise, i.e. $\langle\xi(t)\rangle=0$ and
$\langle\xi(t)\xi(t')\rangle = \delta(t-t')$.   (42)
We consider escape times: fix a number $\delta>0$; if the process is started at $x=0$ at $t=0$, then the escape time $T$ is defined as the first time the process leaves the interval $[-\delta,\delta]$.
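As an aside, the escape time just defined can also be estimated by direct simulation of the Langevin process $\dot{x}=\lambda x + N^{-1/2}\xi(t)$. The following is a minimal sketch using an Euler-Maruyama discretisation; the time step, parameter values, and number of runs are illustrative choices, not those used elsewhere in the text:

```python
import math
import random

def escape_time(lam, N, delta, dt=1e-3, rng=random.Random(1)):
    """First time the process dx = lam*x*dt + dW/sqrt(N) leaves (-delta, delta),
    estimated with an Euler-Maruyama discretisation started at x = 0."""
    x, t = 0.0, 0.0
    amp = math.sqrt(dt / N)  # standard deviation of one noise increment
    while abs(x) < delta:
        x += lam * x * dt + amp * rng.gauss(0.0, 1.0)
        t += dt
    return t

# Sample mean over a few independent runs (illustrative parameters).
samples = [escape_time(lam=1.0, N=50, delta=0.5) for _ in range(20)]
mean_T = sum(samples) / len(samples)
```

Averaging such samples over many runs gives a Monte Carlo estimate of the mean escape time computed analytically below.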
Using standard methods gardiner (); risken () one finds the backward Fokker-Planck equation
$\lambda x\,\frac{\partial T(x)}{\partial x} + \frac{1}{2N}\,\frac{\partial^2 T(x)}{\partial x^2} = -1$,   (43)
where $T(x)$ is the mean escape time, conditioned on a starting point $x$. Boundary conditions read $T(\pm\delta)=0$ (if the process is started at the boundaries of the interval, the escape time is trivially zero). It is easy to check that the solution is given by
$T(x) = 2N\int_{|x|}^{\delta} dy\, e^{-N\lambda y^2}\int_0^y dz\, e^{N\lambda z^2}$.   (44)
Restricting to a starting point of $x=0$ one has
$T(0) = 2N\int_0^{\delta} dy\, e^{-N\lambda y^2}\int_0^y dz\, e^{N\lambda z^2}$.   (45)
By introducing the change of variables $u=N\lambda y^2$ and $v=N\lambda z^2$, introducing the appropriate Jacobian $dy\,dz = du\,dv/(4N\lambda\sqrt{uv})$, and setting $a=N\lambda\delta^2$, we can rewrite the integrals on the RHS of Eq. (45) as
$T(0) = \frac{1}{2\lambda}\int_0^{a}\frac{du}{\sqrt{u}}\,e^{-u}\int_0^{u}\frac{dv}{\sqrt{v}}\,e^{v} = N\delta^2\,{}_2F_2\!\left(1,1;\frac{3}{2},2;-a\right)$,   (46)
where
${}_2F_2\!\left(1,1;\frac{3}{2},2;z\right) = \sum_{k=0}^{\infty}\frac{(1)_k\,(1)_k}{(3/2)_k\,(2)_k}\,\frac{z^k}{k!}$   (47)
is a generalised hypergeometric function whose asymptotic behaviour is well known DLMF (). For simplicity, we will in the following give a heuristic derivation of the asymptotic leading behaviour of $T(0)$ at large values of $N$. Our starting point will be the expression given in Eq. (45).
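The closed form above can be checked numerically. The sketch below compares a trapezoidal evaluation of the double integral $T(0)=2N\int_0^\delta dy\, e^{-N\lambda y^2}\int_0^y dz\, e^{N\lambda z^2}$ with a term-by-term summation of the series $N\delta^2\,{}_2F_2(1,1;3/2,2;-N\lambda\delta^2)$; the parameter values and grid sizes are illustrative choices:

```python
import math

def T0_quadrature(lam, N, delta, n=400):
    """T(0) = 2N int_0^delta dy e^{-N lam y^2} int_0^y dz e^{N lam z^2},
    evaluated with nested composite trapezoidal rules."""
    h = delta / n
    inner = 0.0   # running value of int_0^y e^{N lam z^2} dz
    prev = 1.0    # inner integrand at z = 0
    total = 0.0
    for i in range(1, n + 1):
        y = i * h
        cur = math.exp(N * lam * y * y)
        inner += 0.5 * (prev + cur) * h
        prev = cur
        weight = 0.5 if i == n else 1.0   # outer integrand vanishes at y = 0
        total += weight * math.exp(-N * lam * y * y) * inner
    return 2.0 * N * total * h

def T0_series(lam, N, delta, kmax=80):
    """T(0) = N delta^2 * 2F2(1,1;3/2,2;-a) with a = N lam delta^2, summed
    using the ratio of consecutive series terms."""
    a = N * lam * delta ** 2
    s, term = 0.0, 1.0
    for k in range(kmax):
        s += term
        term *= (1.0 + k) * (-a) / ((1.5 + k) * (2.0 + k))
    return N * delta ** 2 * s

quad = T0_quadrature(1.0, 50, 0.3)
series = T0_series(1.0, 50, 0.3)
```

For moderate $a=N\lambda\delta^2$ the alternating series converges rapidly and the two evaluations agree to within the quadrature error.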
.2 Large-$N$ behaviour for $\lambda>0$
Setting $s=z/y$, so that $dz=y\,ds$, and changing variables in (45) to $y$ and $s$ one has
$T(0) = 2N\int_0^{\delta} dy\, y\int_0^1 ds\, e^{-N\lambda y^2(1-s^2)}$.   (48)
Executing the integration over $y$, setting $a=N\lambda\delta^2$ as before, and substituting $t=1-s$ one finds
$T(0) = \frac{1}{\lambda}\lim_{\epsilon\to 0}\left[\int_{\epsilon}^{1}\frac{dt}{t(2-t)} - \int_{\epsilon}^{1} dt\,\frac{e^{-a\,t(2-t)}}{t(2-t)}\right]$,   (49)
where we have introduced the lower integration limit $\epsilon$. This variable will be set to zero eventually; the purpose of our procedure is to keep track of singularities at small values of $t$.
When $\lambda>0$ and $N$ is very large the main contribution to the second integral in (49) comes from small values of $t$ due to the decaying exponential factor. Notice that the $t^2$ term in the exponent is only relevant when $t$ is of order one, in which case the exponent $a\,t(2-t)$ is of order $N$ and so the integrand is exponentially small in $N$. Therefore, we can neglect the $t^2$ term in the exponent relative to $2t$, and, for the same reason, replace the factor $2-t$ in the denominator of the second integral by $2$. Doing this, evaluating the first integral exactly, and introducing the variable $w=2at$ in the second, we obtain
$T(0) = \frac{1}{2\lambda}\lim_{\epsilon\to 0}\left[\ln\frac{2-\epsilon}{\epsilon} - \int_{2a\epsilon}^{2a} dw\,\frac{e^{-w}}{w}\right] + o(1)$.   (50)
The remaining integral is of the type $\int_b^c dw\, e^{-w}/w$, where the lower limit $b=2a\epsilon$ tends to zero as $\epsilon\to 0$, and the upper limit $c=2a$ is proportional to $N$. Integrals of this type can be simplified by an integration by parts, and we find
$\int_{2a\epsilon}^{2a} dw\,\frac{e^{-w}}{w} = e^{-2a}\ln(2a) - e^{-2a\epsilon}\ln(2a\epsilon) + \int_{2a\epsilon}^{2a} dw\, e^{-w}\ln w$.   (51)
The last integral in (51) converges to $\int_0^{\infty} dw\, e^{-w}\ln w = -\gamma$; the error made by truncating at the upper limit is exponentially small in $N$, as one can see from
$\int_c^{\infty} dw\, e^{-w}\ln w \le \int_c^{\infty} dw\, w\, e^{-w} = (1+c)\,e^{-c}$,   (52)
valid for $c\ge 1$. We can therefore neglect this truncation error in the limit $N\to\infty$, when we have $c=2a\gg 1$. Performing the limit $\epsilon\to 0$ (and taking into account that $e^{-2a\epsilon}\to 1$ in this limit) we find
$\int_{2a\epsilon}^{2a} dw\,\frac{e^{-w}}{w} = -\ln(2a\epsilon) - \gamma + O\!\left(e^{-2a}\ln(2a)\right)$,   (53)
where $\gamma\approx 0.5772$ is the Euler-Mascheroni constant. Using these results in Eq. (50), we finally find
$T(0) = \frac{1}{2\lambda}\left[\ln\!\left(4N\lambda\delta^2\right) + \gamma\right] + o(1)$.   (54)
Except for the additional constant term, this result coincides with the asymptotic results for escape times reported in [Mobilia 2010] for the case of a two-dimensional system.
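The logarithmic asymptotics can be tested against a direct numerical evaluation of the one-dimensional representation $T(0)=\lambda^{-1}\int_0^1 ds\,[1-e^{-a(1-s^2)}]/(1-s^2)$, $a=N\lambda\delta^2$, which follows from (45) after carrying out the $y$-integration. A sketch with illustrative parameter values:

```python
import math

GAMMA = 0.57721566490153286  # Euler-Mascheroni constant

def T0_exact(lam, N, delta, n=20000):
    """T(0) = (1/lam) int_0^1 ds (1 - exp(-a (1-s^2)))/(1-s^2), a = N lam delta^2,
    by the composite trapezoidal rule; the integrand tends to a as s -> 1."""
    a = N * lam * delta ** 2
    h = 1.0 / n
    def f(s):
        u = 1.0 - s * s
        return a if u == 0.0 else -math.expm1(-a * u) / u
    total = 0.5 * (f(0.0) + f(1.0)) + sum(f(i * h) for i in range(1, n))
    return total * h / lam

def T0_asymptotic(lam, N, delta):
    """Leading large-N behaviour, T(0) ~ [ln(4 N lam delta^2) + gamma]/(2 lam)."""
    return (math.log(4.0 * N * lam * delta ** 2) + GAMMA) / (2.0 * lam)

exact = T0_exact(1.0, 1000, 0.2)        # a = 40
approx = T0_asymptotic(1.0, 1000, 0.2)
```

For $a=40$ the two values already agree to within a few percent, consistent with corrections that vanish as $N\to\infty$.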
.3 Large-$N$ behaviour for $\lambda<0$
For $\lambda<0$ and $N$ large, we have from Eq. (45)
$T(0) = 2N\int_0^{\delta} dy\, e^{N|\lambda| y^2}\int_0^y dz\, e^{-N|\lambda| z^2} \approx N\int_0^{\delta} dy\, e^{N|\lambda| y^2}\int_{-\infty}^{\infty} dz\, e^{-N|\lambda| z^2} = \sqrt{\frac{\pi N}{|\lambda|}}\int_0^{\delta} dy\, e^{N|\lambda| y^2}$.   (55)
We have here first extended the integration range of the $z$-integration to the interval $[-y,y]$, adjusted for by an overall factor of $1/2$. Subsequently, based on the observation that the $y$-integrand assumes its maximum at $y=\delta$, we have extended the integration range of the $z$-integral to the entire real axis. This introduces an error which does not contribute to the leading exponential behaviour for large $N$. The integral over $z$ is now Gaussian, and can be evaluated straightforwardly. The exponent in the remaining $y$-integration reaches its maximum at $y=\delta$. Expanding the exponent up to first order about this value we get
$T(0) \simeq \sqrt{\frac{\pi N}{|\lambda|}}\, e^{N|\lambda|\delta^2}\int_0^{\delta} dy\, e^{-2N|\lambda|\delta(\delta-y)} \approx C\,\frac{e^{N|\lambda|\delta^2}}{\sqrt{N}}$   (56)
with
$C = \frac{\sqrt{\pi}}{2\,\delta\,|\lambda|^{3/2}}$.
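The exponential scaling can again be tested against a direct quadrature of the double integral (45) for $\lambda<0$. In the sketch below (illustrative parameters, with $N|\lambda|\delta^2=8$) the ratio of the quadrature result to the asymptotic estimate is close to one, the residual deviation being a subleading correction that vanishes as $N|\lambda|\delta^2\to\infty$:

```python
import math

def T0_quadrature(lam_abs, N, delta, n=800):
    """T(0) = 2N int_0^delta dy e^{N|lam| y^2} int_0^y dz e^{-N|lam| z^2}
    for a stable fixed point (lambda = -|lam|), nested trapezoidal rules."""
    h = delta / n
    inner = 0.0   # running value of int_0^y e^{-N|lam| z^2} dz
    prev = 1.0    # inner integrand at z = 0
    total = 0.0
    for i in range(1, n + 1):
        y = i * h
        cur = math.exp(-N * lam_abs * y * y)
        inner += 0.5 * (prev + cur) * h
        prev = cur
        weight = 0.5 if i == n else 1.0
        total += weight * math.exp(N * lam_abs * y * y) * inner
    return 2.0 * N * total * h

def T0_asymptotic(lam_abs, N, delta):
    """T(0) ~ C exp(N|lam| delta^2)/sqrt(N), C = sqrt(pi)/(2 delta |lam|^{3/2})."""
    C = math.sqrt(math.pi) / (2.0 * delta * lam_abs ** 1.5)
    return C * math.exp(N * lam_abs * delta ** 2) / math.sqrt(N)

ratio = T0_quadrature(1.0, 200, 0.2) / T0_asymptotic(1.0, 200, 0.2)
```

This confirms that for stable fixed points the escape time grows exponentially with the batch size $N$, in contrast to the logarithmic growth found for $\lambda>0$.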