Fixation and escape times in stochastic game learning


Abstract

Evolutionary dynamics in finite populations is known to fixate eventually in the absence of mutation. We here show that a similar phenomenon can be found in stochastic game-dynamical batch learning, and investigate fixation in learning processes in a simple 2 × 2 game, in two-player games with cyclic interaction, and in the context of the best-shot network game. The analogues of finite populations in evolution are here finite batches of observations between strategy updates. We study when and how such fixation can occur, and present results on the average time-to-fixation from numerical simulations. Simple cases are also amenable to analytical approaches and we provide estimates of the behaviour of so-called escape times as a function of the batch size. The differences and similarities with escape and fixation in evolutionary dynamics are discussed.

game theory, learning and adaptation, fixation and extinction, evolutionary dynamics

I Introduction

Modern approaches to game theory have moved beyond the identification of equilibrium points of games [Nash1950, Nash1951, Neumann1953], and instead consider dynamical processes in populations of agents [MaynardSmith1998, nowakbook, Sigmund2010], or the adaptation of a given set of agents to each other's actions [fudenberg, Young2004, Camerer2003]. The study of populations of players is the focus of what is now called 'evolutionary game theory' [MaynardSmith1998, Vega2003, Gintis2000]. Within this field two approaches can broadly be distinguished. The more conventional one describes evolving populations by means of deterministic replicator equations, see e.g. [hofbauer] for a textbook. The dynamical behaviour and attractors of these systems are studied with tools from the theory of nonlinear differential equations. Such formulations are formally valid only for infinite populations of agents, and systematically neglect stochastic effects in finite populations. The study of these random processes is at the centre of the second, more recent class of studies in evolutionary game theory, see for example [traulsenreview] for a review. Crucial differences between the behaviour of finite and of infinite populations have been identified; for example, finite systems may fixate at pure-strategy absorbing states even when the corresponding deterministic replicator equations have their attractors at mixed equilibria. These stochastic processes are studied with a variety of different tools, including the master-equation formalism, system-size expansions, backward Fokker-Planck methods, and other concepts from statistical mechanics [kampen, risken, gardiner].

The purpose of the present paper is to parallel existing research on stochastic effects in evolutionary systems with studies of corresponding effects in stochastic learning dynamics. Learning is here related to, but different from, evolution. Learning, or adaptation, is concerned with a fixed set of players who interact repeatedly in a given game, and who react to their opponents' actions by modifying their own strategic propensities [fudenberg, Young2004]. These processes occur on much shorter time scales than evolutionary dynamics. Adaptation dynamics of the type we study here are of interest in two main contexts. First, learning models provide mathematical descriptions of human or animal decision making, and can be used to model the outcome of experiments in behavioural game theory and cognitive science [Camerer2003]. The second main area in which models of adaptation are relevant is machine learning and algorithmic game theory [Nisan2007]. Here the interest is not in the modelling of human behaviour, but instead in the properties and design of algorithms with which to identify equilibrium points or solutions of optimisation problems. Understanding the dynamics of learning is of key importance in both of these applications.

In learning there are no birth-death processes as in evolution, but instead dynamical updates of the agents' strategy profiles in time. Very little work exists on the systematic comparison of the effects of noise in evolution and in learning. Initial investigations [galla] have shown that, similar to what is seen in evolutionary processes, the dynamics and attractors of stochastic learning can be quite different from those of deterministic adaptation processes. Up to now, however, the analyses of fluctuations in learning have been limited to the identification of so-called quasi-cycles, also seen in evolution [bladon, mobilia]. In the present paper we aim to establish further analogies between the two modelling approaches, and focus in particular on fixation effects [antal, altrockinger]. Fixation here refers to processes by which dynamical systems reach absorbing states. In evolution these are typically points at the boundaries of strategy space, at which only one species (pure strategy) survives, and where all other strategies are extinct. In finite populations the elimination of species may happen by random drift, and in the absence of mutation a species is never re-introduced into the dynamics once all its representatives have been removed. The system thus fixates in an absorbing state.

In this paper we investigate the extent to which a similar removal of strategies may occur in multi-player learning. The analogue of extinction is here the convergence of a player's strategic propensities to a pure strategy. The question we address is when and how stochastic learning fixates. In particular we ask (i) under what circumstances convergence to pure, rather than mixed, equilibria occurs in learning, (ii) if fixation occurs, what are the corresponding extinction times, and (iii) given that extinction phenomena are well known in evolutionary systems, what are the differences and similarities with fixation in learning? To answer these questions we consider several different types of games. After a general introduction to learning and the required definitions in Sec. II we first study simple two-player games in Sec. III. We then turn to games with cyclic interaction in Sec. IV, before we finally discuss a more intricate best-shot game [galeotti, asta1, asta2] defined on regular random graphs (Sec. V). The final section summarises our results and discusses possible future work.

II Deterministic and stochastic learning

II.1 General definitions

In this paper we will consider both two-player and multi-player games. Interaction will occur in learning processes in which each player interacts only with a small number of other agents; in two-player games each player has a single opponent, while in the multi-player games considered later each player interacts with several neighbours. Individual players will typically be labelled by indices μ = 1, …, M, where M stands for the total number of players in the model at hand. We will restrict the discussion to symmetric non-cooperative games. The variable S will indicate the number of pure strategies available to each of the players. Following standard game-theoretic notation we will write π_i(a^{−μ}) for the payoff player μ receives when playing pure strategy i ∈ {1, …, S} while her opponents play the actions a^{−μ}. This paper focuses only on symmetric games, so that π is identical for all players and carries no explicit dependence on μ. We will use the notation x^μ = (x_1^μ, …, x_S^μ) for player μ's mixed strategy, i.e. we have Σ_i x_i^μ = 1 with x_i^μ ≥ 0 for all i. The component x_i^μ indicates the frequency with which player μ plays pure strategy i.

II.2 Learning

We will here focus on a reinforcement-type learning model, and assume that each player keeps a score valuation of each of her pure strategies; these are a measure of the (perceived) relative performance of the pure actions in the past, and indicate the propensity of playing any particular pure action. Discarding memory loss for the moment, the valuation A_i^μ(t) player μ has for pure strategy i is the cumulative payoff μ would have received in all past rounds up to time t, given her opponents' actions, had she always played pure strategy i up to time t. This will be detailed further below. Following [Camerer2003, Ho2007, satopre, satopnas] we will assume that, given the score valuations A_i^μ(t), player μ chooses each of the pure strategies according to a logit rule, i.e. that the probabilities of playing the different pure strategies depend on the score valuations via the following relations:

x_i^μ(t) = e^{β A_i^μ(t)} / Σ_{j=1}^S e^{β A_j^μ(t)}.    (1)

The variable β is here a model parameter, and describes a learning rate or intensity of selection. For β → ∞ the players strictly choose the pure action with the highest propensity. For β = 0 they play at random.
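
For illustration, the choice rule of Eq. (1) is simply a softmax over the valuations. Below is a minimal numerical sketch (function name and parameter values are ours, chosen for illustration; the shift by the maximum is a standard trick to avoid numerical overflow):

```python
import numpy as np

def logit_strategy(valuations, beta):
    """Mixed strategy x_i from score valuations A_i via the logit
    rule of Eq. (1); beta is the intensity of selection."""
    v = beta * (valuations - np.max(valuations))  # stabilised exponents
    w = np.exp(v)
    return w / w.sum()

A = np.array([1.0, 0.5, 0.2])
print(logit_strategy(A, beta=0.0))   # beta = 0: uniform random play
print(logit_strategy(A, beta=50.0))  # large beta: almost pure best response
```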

A learning dynamics is then a prescription governing the evolution of the valuations A_i^μ(t) in time. We will here mostly focus on a reinforcement learning rule of the form

A_i^μ(t + 1) = (1 − λ) A_i^μ(t) + π_i(a^{−μ}(t)).    (2)

The interpretation of these update rules is understood best by first considering the case λ = 0: in this case the increment of A_i^μ between time-steps t and t + 1 is the payoff player μ would have received had she played pure strategy i, given her opponents' actions a^{−μ}(t). For λ = 0 the variable A_i^μ(t) is thus the total payoff player μ would have received, given her opponents' play, had μ always played action i. A non-zero value of λ accounts for exponential discounting over time, or equivalently for a possible memory loss. For λ > 0 the outcomes of the game in the distant past have a lesser effect on the valuation than the more recent rounds of the game.

The process defined by Eq. (2) is inherently stochastic, given that all players choose their pure actions according to the probabilistic rules of Eq. (1). A deterministic limit has been considered in [satopre, satopnas, satophysica], and can be formulated as

A_i^μ(t + 1) = (1 − λ) A_i^μ(t) + Σ_{a^{−μ}} x^{−μ}(a^{−μ}, t) π_i(a^{−μ}),    (3)

where x^{−μ}(a^{−μ}, t) stands for the probability of the joint action a^{−μ} being played by μ's opponents, i.e. we have x^{−μ}(a^{−μ}, t) = Π_{ν≠μ} x_{a^ν}^ν(t). Taking into account Eqs. (1) one can then write the update rule solely in terms of the mixed strategies x^μ and x^{−μ}, and finds the following map [satopre]

x_i^μ(t + 1) = x_i^μ(t)^{1−λ} e^{β π_i(x^{−μ}(t))} / Σ_j x_j^μ(t)^{1−λ} e^{β π_j(x^{−μ}(t))},   π_i(x^{−μ}) = Σ_{a^{−μ}} x^{−μ}(a^{−μ}) π_i(a^{−μ}).    (4)

To interpolate between the stochastic process defined by Eqs. (1,2) and the deterministic limit of Eq. (3) we will consider a batch learning process, in which players update their score valuations only once every N rounds of the game, and keep them constant in between. Specifically, we will assume

A_i^μ(t + 1) = (1 − λ) A_i^μ(t) + (1/N) Σ_{τ=0}^{N−1} π_i(a^{−μ}(t, τ)),    (5)

where a^{−μ}(t, τ) denotes the actions of μ's opponents in the τ-th round of batch t, and where the mixed strategies x^μ of Eq. (1) are kept fixed during the N rounds of each batch. We will refer to N as the batch size of the learning process. The batch process at large (but finite) N is here mostly a theoretical vehicle which allows one to understand the dynamics of learning. Real-world adaptation presumably operates close to the limit N = 1; nevertheless, some of the existing work has focused on deterministic learning. Our work tries to address the gap between these two extreme cases, and to establish in a systematic manner the stochastic effects affecting the dynamics at finite batch sizes. The case N = 1 can be understood as a special limiting case. Previous work has shown that approaches based on a systematic expansion in the inverse batch size can give good results even for small batch sizes [galla].
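
The batch process of Eqs. (1,2,5) is straightforward to simulate. The sketch below (our code; names, the example payoff matrix and all parameter values are illustrative, and not those used for the figures) implements one batch update for a symmetric two-player game:

```python
import numpy as np

rng = np.random.default_rng(0)

def logit(v, beta):
    w = np.exp(beta * (v - v.max()))
    return w / w.sum()

def batch_step(A1, A2, payoff, beta, lam, N):
    """One update of the batch process, Eqs. (1,2,5): strategies are held
    fixed for N rounds, then valuations are updated with the batch-averaged
    payoffs and memory-loss rate lam. `payoff` is the S x S matrix of a
    symmetric two-player game."""
    x1, x2 = logit(A1, beta), logit(A2, beta)
    S = len(A1)
    r1, r2 = np.zeros(S), np.zeros(S)
    for _ in range(N):
        a1 = rng.choice(S, p=x1)  # actions actually played in this round
        a2 = rng.choice(S, p=x2)
        r1 += payoff[:, a2]       # payoff each pure action would have earned
        r2 += payoff[:, a1]
    return (1 - lam) * A1 + r1 / N, (1 - lam) * A2 + r2 / N

# Hawk-Dove example with V = 1, C = 2 (see Sec. III below)
payoff = np.array([[-0.5, 1.0], [0.0, 0.5]])
A1, A2 = np.zeros(2), np.zeros(2)
for t in range(2000):
    A1, A2 = batch_step(A1, A2, payoff, beta=1.0, lam=0.0, N=50)
print(logit(A1, 1.0), logit(A2, 1.0))  # typically close to a pure profile
```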

II.3 Sato-Crutchfield dynamics in continuous time

In order to make contact with deterministic descriptions of evolutionary systems it is helpful to consider the continuous-time limit of the deterministic learning process, Eq. (4). Assuming the validity of such a limit for a small intensity of selection β, and following [satopre, satophysica], one finds

ẋ_i^μ = x_i^μ [ β ( π_i(x^{−μ}) − Σ_j x_j^μ π_j(x^{−μ}) ) + λ Σ_j x_j^μ ln( x_j^μ / x_i^μ ) ].    (6)

For λ = 0 this reduces to a set of multi-population replicator equations, a signature of the close connection between evolutionary processes and adaptive learning.

III Two-player Hawk-Dove game

III.1 Definition and replicator flow

Figure 1: (Colour on-line) Deterministic flow of the two-population replicator dynamics for the Hawk-Dove game (arrows). The noisy green line shows the trajectory of one single realisation of the learning dynamics at finite batch size N, started from a symmetric initial condition x(0) = y(0).

We will first consider the case of a symmetric 2 × 2 game, the so-called Hawk-Dove game (also referred to as the coexistence game or the anti-coordination game), defined by the payoff matrix

A = ( a_HH  a_HD ; a_DH  a_DD ) = ( (V − C)/2   V ; 0   V/2 ),    (7)

where we set V = 1 and C = 2, so that the mixed equilibrium of the game lies at V/C = 1/2. We will label the elements of the payoff matrix by a_ij, where i and j can each take one of the two values H and D, representing the pure strategies of this game. In the learning process two players interact repeatedly; the strategy of each player is fully characterized by the probability of playing 'Hawk'. We will denote this probability by x for the first player, and by y for player 2. In the absence of memory loss, and taking the continuous-time limit of Eqs. (4), we obtain the two-population replicator dynamics

ẋ = (β/2) x(1 − x)(V − C y),   ẏ = (β/2) y(1 − y)(V − C x).    (8)

These equations are obtained by setting λ = 0 in the above Sato-Crutchfield equations (6), and upon using π_H(y) = a_HH y + a_HD (1 − y) and π_D(y) = a_DH y + a_DD (1 − y) with the above payoff matrix (7). It is straightforward to work out the corresponding deterministic flow; we illustrate it for completeness in Fig. 1. The replicator dynamics has one reactive fixed point at (x*, y*) = (1/2, 1/2), and two pure-strategy fixed points at (x, y) = (1, 0) and (x, y) = (0, 1). These fixed points at the boundary of strategy space are stable attractors; the central fixed point is a saddle with one stable and one unstable eigendirection. The stable eigenvector points along the diagonal x = y, and restricting the dynamics to this direction (i.e. setting x = y) hence yields a stable flow towards the central fixed point. The single-population replicator equation

ẋ = (β/2) x(1 − x)(V − C x)    (9)

therefore converges to x* = 1/2, provided non-extremal initial conditions (0 < x(0) < 1) are chosen. The two-population system will generally fixate at one of the corner attractors for generic initial conditions x(0) ≠ y(0); only in the restricted case x(0) = y(0) is the symmetry between the players preserved, and the dynamics then converges to the mixed fixed point.
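
This symmetry breaking is easily reproduced numerically. The sketch below integrates Eqs. (8) with a simple Euler scheme (step size, duration and initial conditions are arbitrary illustrative choices):

```python
import numpy as np

V, C, beta = 1.0, 2.0, 1.0  # payoff parameters as chosen in the text

def flow(x, y):
    """Right-hand side of the two-population replicator equations, Eq. (8)."""
    return (0.5 * beta * x * (1 - x) * (V - C * y),
            0.5 * beta * y * (1 - y) * (V - C * x))

def integrate(x, y, dt=0.01, steps=200_000):
    for _ in range(steps):
        fx, fy = flow(x, y)
        x, y = x + dt * fx, y + dt * fy
    return x, y

print(integrate(0.3, 0.3))   # symmetric IC: converges to the centre (1/2, 1/2)
print(integrate(0.3, 0.31))  # asymmetric IC: flows to a corner, here (0, 1)
```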

III.2 Fixation in stochastic learning


Figure 2: (Colour on-line) Fixation in stochastic learning of the Hawk-Dove game (λ = 0). The left panel shows individual trajectories for different batch sizes N, started from symmetric initial conditions x(0) = y(0). Each pair of curves shows x(t) and y(t) for a single run. The black dashed line indicates the evolution of the deterministic replicator equations. The right panel shows the mean extinction time as a function of the batch size N (extinction as defined in the main text). Symbols are from simulations, averaged over independent runs. The dashed line is a fit to a logarithmic dependence T_fix ∝ ln N.

We will first address the learning dynamics in the absence of memory loss (λ = 0); the effects of exponential discounting are described in Sec. III.4. Numerical simulations show that stochastic learning without memory loss will generally fixate in one of the two corners, (x, y) = (1, 0) or (x, y) = (0, 1), of strategy space; a typical trajectory generated by the learning dynamics at finite batch size is shown in Fig. 1. This is further illustrated in the left panel of Fig. 2, where we show the evolution of x and y in stochastic learning at different batch sizes N. The dynamics are here started from symmetric initial conditions x(0) = y(0), and will initially follow the replicator flow closely, and approach, but not reach, the replicator fixed point at (1/2, 1/2). Fluctuations, which will invariably occur at any finite batch size, break the symmetry between x and y, however, and the system will generally drift off the diagonal relatively quickly. While the stable eigenvalue of the central fixed point still exerts some attraction towards the centre, the unstable direction will eventually take over, and draw the learning process to (1, 0) or (0, 1). Which one of these corners is reached is purely random, and determined by the nature of sampling errors in the adaptation process. Large batch sizes here reduce the amount of noise in the dynamics, and the system hence follows the deterministic flow longer at large than at smaller batches, as illustrated in the left panel of Fig. 2. In the right panel of the figure we have measured the time-to-fixation more systematically. Specifically, we consider the system to be fixated once x and y have each approached the values 0 or 1 up to an accuracy δ, with δ a small threshold. Once this condition is met each player plays essentially a pure strategy, i.e. the system is close to one of the corners of strategy space up to deviations smaller than δ. We find logarithmic behaviour of the so-defined fixation time, i.e. T_fix ∝ ln N. This is consistent with observations in one-dimensional evolutionary co-ordination games with one central unstable fixed point [nowakbook, traulsenreview].

It is generally very hard to compute the time-to-fixation of stochastic processes analytically; this applies both to learning processes and to evolutionary dynamics. In the latter, general analytical results have been obtained only for one-population models [nowakbook, traulsenreview]. One major complication is here the fact that the dynamics in most other cases has at least two degrees of freedom, impeding a full analytical solution.

Partial analytical results for game-dynamical learning can however be obtained for what we will refer to as 'escape times' in the following; see also [mobilia] for studies of escape times in cyclic evolutionary games. For a given (finite) batch size N we here start the learning dynamics at the deterministic fixed point (1/2, 1/2), and run the stochastic dynamics until the system reaches a given distance from this fixed point. More precisely, we define the escape time as the time at which the variable |x − y| first exceeds a value Δ. This measure of distance was chosen for analytical convenience, as will become clear below. Results from simulations are shown in Fig. 3.

Figure 3: (Colour on-line) Two-player learning dynamics in the Hawk-Dove game. The figure shows the escape time from a region about the central fixed point (see text for details). Symbols are results from numerical simulations, averaged over independent samples; solid lines show the theoretical estimates of Eq. (13), dashed lines the approximation of Eq. (54). The escape time scales logarithmically in the batch size N, in line with the existence of an unstable eigendirection of the limiting deterministic dynamics.

Analytical predictions of the escape time for small values of Δ are possible within a linear approximation about the central fixed point. Using the methods detailed in [galla, galla2] we find that, in the continuous-time limit and for large but finite batch sizes N, the two-player learning dynamics can be described by the following Langevin dynamics

d(δx)/dt = −k δy + N^{−1/2} ξ_x(t),   d(δy)/dt = −k δx + N^{−1/2} ξ_y(t),    (10)

where the drift matrix J = ( 0  −k ; −k  0 ) is the Jacobian of the continuous-time learning dynamics (equivalent to the replicator equations for the case we are considering here); specifically, we have k = β/4 at vanishing memory loss. We have here introduced δx = x − x* and δy = y − y*, and address only the stationary regime in which deterministic learning has assumed its fixed point. The variables δx and δy describe fluctuations about this fixed point, and ξ_x, ξ_y represent Gaussian white noise, with variances and correlations given by

⟨ξ_a(t) ξ_b(t′)⟩ = Σ_ab δ(t − t′),   a, b ∈ {x, y}.    (11)

Given that (1, −1) is an eigenvector of the above Jacobian (with an eigenvalue of k) we then have

dz/dt = k z + N^{−1/2} ζ(t),    (12)

where z = δx − δy, and where ζ(t) is Gaussian white noise with ⟨ζ(t)ζ(t′)⟩ = b² δ(t − t′), with b² = Σ_xx + Σ_yy − 2Σ_xy. Using results for escape times of general Langevin processes of the form dz/dt = a z + (b/√N) η(t) with ⟨η(t)η(t′)⟩ = δ(t − t′) (see appendix), we then obtain the following prediction for the escape time

T_esc = (N Δ²/b²) ₂F₂(1, 1; 3/2, 2; −a N Δ²/b²),    (13)

with ₂F₂ a generalised hypergeometric function, see Eq. (47) in the appendix. In our specific example we have a = k = β/4 and b² = Σ_xx + Σ_yy − 2Σ_xy. As seen in Fig. 3 this compares well with numerical simulations.
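
The prediction can also be checked against a direct simulation of the reduced process, Eq. (12). A minimal Euler-Maruyama sketch (our naming and parameter choices; not the production code behind Fig. 3):

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_escape_time(a, b, N, Delta, dt=1e-3, runs=200):
    """Estimate the mean escape time of dz/dt = a z + (b/sqrt(N)) zeta(t)
    from the interval (-Delta, Delta), started at z = 0."""
    amp = b * np.sqrt(dt / N)  # standard deviation of one noise increment
    times = []
    for _ in range(runs):
        z, t = 0.0, 0.0
        while abs(z) < Delta:
            z += a * z * dt + amp * rng.normal()
            t += dt
        times.append(t)
    return np.mean(times)

beta = 1.0
k = beta / 4  # unstable eigenvalue of the Jacobian
for N in (10, 100, 1000):
    print(N, mean_escape_time(a=k, b=1.0, N=N, Delta=0.2))
```

The logarithmic growth of the output with N mirrors the asymptotics of Eq. (54).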

III.3 Comparison with evolutionary dynamics in finite populations

We have already indicated that the behaviour of the stochastic learning dynamics is, to an extent, similar to evolutionary processes. To quantify this further we investigate both one-population and two-population stochastic evolutionary processes in this section.

Two-population dynamics

Specifically, we consider two populations, each composed of N players. Each of these players will either be a Hawk or a Dove; we denote the number of Hawks in the first population by i, and the number of Hawks in the second population by j. The corresponding numbers of Doves are then N − i and N − j in the two populations. Players in the first population only play against players of the second population, and vice versa. The fitnesses of Hawk and Dove players in the first population are then, for example, given by

π_H^(1) = a_HH (j/N) + a_HD (1 − j/N),   π_D^(1) = a_DH (j/N) + a_DD (1 − j/N),    (14)

and similar definitions hold for individuals in the second population. In order to specify a microscopic dynamics we use the so-called 'local update rule', sometimes also referred to as the 'pairwise comparison process' [traulsenreview]. A player of the 'Dove' type is converted into a 'Hawk' player with a rate proportional to 1/2 + (β/2)[π_H^(α) − π_D^(α)], where α ∈ {1, 2} labels the two populations. Similarly, conversions of Hawk players into Dove players occur with a rate proportional to 1/2 − (β/2)[π_H^(α) − π_D^(α)]. Specifically, we will use the following transition rates

T_1^± = (i/N) ((N − i)/N) [ 1/2 ± (β/2)(π_H^(1) − π_D^(1)) ],
T_2^± = (j/N) ((N − j)/N) [ 1/2 ± (β/2)(π_H^(2) − π_D^(2)) ].    (15)

The factors of the form (i/N)(N − i)/N here indicate that two players of different types need to be drawn from any one population in order for an interaction to occur. It is here important to stress that reproduction and selection occur within the separate populations, i.e. at no point is an individual of one population converted into a member of the other. Interaction between the populations occurs via Eq. (14), i.e. the fitness of members of population one depends on the composition of population two, and vice versa.

In the deterministic limit (N → ∞) one recovers the two-population replicator dynamics

ẋ = β x(1 − x) [π_H^(1)(y) − π_D^(1)(y)],   ẏ = β y(1 − y) [π_H^(2)(x) − π_D^(2)(x)],    (16)

where we have used the replacements x = i/N and y = j/N in Eqs. (15) to obtain the right-hand sides. The deterministic flow of these replicator equations is the one indicated in Fig. 1; in particular, the central fixed point has one stable and one unstable eigendirection. Fixation of the stochastic evolutionary dynamics can occur at any of the four corners of strategy space. We show results for the average time-to-fixation in the inset of Fig. 4; the fixation time depends logarithmically on the system size N.
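
A Monte Carlo sketch of this process is given below (our code and conventions: the payoff entries correspond to the choice V = 1, C = 2 of Eq. (7), one unit of time corresponds to N update attempts per population, and all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

aHH, aHD, aDH, aDD = -0.5, 1.0, 0.0, 0.5  # Hawk-Dove entries of Eq. (7)
beta = 1.0

def payoff_difference(opp_hawks, N):
    """pi_H - pi_D against a population containing opp_hawks Hawks, Eq. (14)."""
    y = opp_hawks / N
    return (aHH - aDH) * y + (aHD - aDD) * (1 - y)

def fixation_time(N, max_sweeps=10**7):
    """Realisation of the local update process, Eq. (15), started at the
    central fixed point; returns the time until both populations are
    absorbed at 0 or N Hawks."""
    i, j = N // 2, N // 2
    for t in range(max_sweeps):
        for pop in range(2):
            own, opp = (i, j) if pop == 0 else (j, i)
            x = own / N
            d = beta * payoff_difference(opp, N)
            p_plus = x * (1 - x) * 0.5 * (1 + d)   # Dove -> Hawk
            p_minus = x * (1 - x) * 0.5 * (1 - d)  # Hawk -> Dove
            r = rng.random()
            step = 1 if r < p_plus else (-1 if r < p_plus + p_minus else 0)
            if pop == 0:
                i += step
            else:
                j += step
        if i in (0, N) and j in (0, N):
            return t / N
    return None

print(np.mean([fixation_time(50) for _ in range(10)]))
```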

Figure 4: (Colour on-line) Two-population evolutionary dynamics in the Hawk-Dove game: the main panel shows the escape time from a region about the central fixed point (see text for details). Symbols are results from numerical simulations (averaged over independent samples), solid lines show the theoretical estimates of Eq. (13), dashed lines the approximation of Eq. (54). The escape time scales logarithmically in the system size N, in line with the existence of an unstable eigendirection of the limiting deterministic dynamics. The inset shows the fixation time as a function of the system size N (averaged over independent runs).

As in the learning dynamics, an analytical calculation of the fixation time is very difficult. Estimates for the escape times can however be obtained within a system-size expansion about the fixed point of the deterministic replicator equations. Following standard methods based on the so-called 'van Kampen expansion' in the inverse system size [kampen] one finds

d(δx)/dt = −k δy + N^{−1/2} ξ_x(t),   d(δy)/dt = −k δx + N^{−1/2} ξ_y(t),   k = β/4,    (17)

where δx = x − x* and δy = y − y*. As before, ξ_x and ξ_y describe Gaussian noise; from the van Kampen expansion one finds ⟨ξ_x(t) ξ_y(t′)⟩ = 0, as well as

⟨ξ_x(t) ξ_x(t′)⟩ = ⟨ξ_y(t) ξ_y(t′)⟩ = (1/4) δ(t − t′).    (18)

This translates into a Langevin equation

dz/dt = k z + (b/√N) ζ(t),   b² = 1/2,    (19)

for the variable z = δx − δy, with ⟨ζ(t)ζ(t′)⟩ = δ(t − t′). A theoretical prediction for the escape time can hence be found using the values a = k = β/4 and b = 1/√2 in Eq. (13). Results are tested against simulations and confirmed successfully in Fig. 4.

One-population dynamics

In the one-population model one considers a single population of N individuals, each of whom can either be a Hawk or a Dove player. The state of the system is hence characterized by a single integer, the number i of Hawks. The transition rates of the local process read

T^± = (i/N) ((N − i)/N) [ 1/2 ± (β/2)(π_H − π_D) ],   π_H = a_HH (i/N) + a_HD (1 − i/N),   π_D = a_DH (i/N) + a_DD (1 − i/N).    (20)

The analysis of this model is not new as such; the study of single-population dynamics of games is in fact standard, see for example [traulsenreview, altrockinger]. We here present results mainly for completeness, and in order to contrast them with the above two-population case.

In the deterministic limit (N → ∞) the following replicator equation is obtained from Eq. (20):

ẋ = β x(1 − x) [π_H(x) − π_D(x)] = (β/2) x(1 − x)(V − C x).    (21)

This corresponds to restricting the two-population replicator equations (16) to the subspace in which x = y. In order to explore stochastic corrections to this limiting behaviour at next-to-leading order we again carry out the system-size expansion. As before, we do not report the detailed mathematics, which is tedious, but standard. Defining δx = x − x* one finds

d(δx)/dt = −(β/4) δx + (1/(2√N)) η(t),    (22)

where ⟨η(t)η(t′)⟩ = δ(t − t′). Setting a = −β/4 and b = 1/2 in Eq. (13) we obtain semi-analytical predictions for the escape time. These results are compared with simulations in Fig. 5. As seen in the figure, the escape time no longer scales logarithmically in the system size N as was the case in the two-population model; instead, the escape is now exponentially slow in the asymptotic limit of large N. We have also measured the actual fixation time (see inset of Fig. 5). Fixation times scale exponentially in the system size. Analytical results can here be obtained based on the methods described for example in [traulsenreview]. For completeness we show the results of these calculations in the inset of Fig. 5.

Figure 5: (Colour on-line) One-population evolutionary dynamics in the Hawk-Dove game: the main panel shows the escape time from the central fixed point (defined as the point in time at which the quantity |δx|, with δx = x − x* as defined in the text, first exceeds the value Δ). Symbols are results from numerical simulations (averaged over independent samples), solid lines show the theoretical results of Eq. (13), the dashed lines the asymptotic approximation of Eq. (56). The escape time scales exponentially in the system size N, in line with the existence of a stable eigendirection of the limiting deterministic dynamics. The inset shows the fixation time as a function of the system size; symbols are from simulations (averaged over independent runs), the solid line from an analytical calculation based on the methods and expressions detailed in [traulsenreview].

III.4 Effects of memory loss in two-player learning

Unlike in evolutionary dynamics, where fixation can occur in the absence of mutation purely by random drift, fixation in stochastic game learning is strictly tied to the convergence of the limiting deterministic learning process to pure-strategy equilibria. In order to demonstrate this we will extend the analysis to non-zero memory-loss rates in the following. Deterministic learning of the Hawk-Dove game in discrete time is then described by the two-dimensional map given by Eq. (4) (with the appropriate substitutions for the payoff structure). The point (x, y) = (1/2, 1/2) is a fixed point for all λ, and the relevant eigenvalues of the map are identified as Λ± = 1 − λ ± β/4. Assuming 0 ≤ λ ≤ 1 (and β < 4) we therefore find that the central fixed point is stable whenever λ > λ_c = β/4. In order to characterise the outcome of learning we have to distinguish between three different types of behaviour:

  1. For λ = 0 the central fixed point is not a stable attractor of the deterministic learning process. In this regime the diagonal x = y is still a stable eigendirection, so deterministic learning will converge to (1/2, 1/2) provided it is started from symmetric initial conditions (x(0) = y(0)). For generic initial conditions this symmetry is broken, however, and the dynamics is observed to approach either (1, 0) or (0, 1) asymptotically. Noise in learning has a similar symmetry-breaking effect, and will drive the dynamics to one of the pure-strategy attractors.

  2. For 0 < λ < λ_c the central fixed point is again not a stable attractor of the learning dynamics, and deterministic learning will converge to (1/2, 1/2) only if started from symmetric initial conditions. For asymmetric initial conditions the dynamics will approach an asymmetric fixed point (x̂, ŷ) with x̂ ≠ ŷ, which is generally not a pure strategy for λ > 0. With noise, learning fluctuates around this symmetry-broken attractor. Memory loss in learning thus acts similarly to mutation in evolutionary dynamics, and impedes absorption at the boundaries.

  3. For λ > λ_c deterministic learning converges to (1/2, 1/2) even for non-symmetric initial conditions. In this case there is no fixation; the dynamics of stochastic learning will fluctuate around the mixed-strategy equilibrium asymptotically.

This behaviour is illustrated further in Fig. 6.

Figure 6: (Colour on-line) Effects of memory loss on deterministic and stochastic learning in the Hawk-Dove game. The upper row shows the outcome of the deterministic dynamics for different values of the memory-loss parameter λ. In each panel we show five trajectories (x(t), y(t)) obtained from five independent random initial conditions. Lower row: single runs of stochastic learning, started from random initial conditions. The critical value separating the regime of a stable central fixed point (λ > λ_c) from an unstable regime (λ < λ_c) is given by λ_c = β/4.

IV Escape rates in cyclic games

We now consider a two-player discrete-time learning dynamics in the rock-paper-scissors (RPS) game. Detailed analyses of evolutionary processes in this cyclic game can for example be found in [mobilia]. We here focus on learning, and first concentrate on the deterministic limit. Specifically, using the deterministic limit of Eqs. (4) we have the following map

x_i(t + 1) = x_i(t)^{1−λ} e^{β π_i^x(t)} / Σ_j x_j(t)^{1−λ} e^{β π_j^x(t)},   y_i(t + 1) = y_i(t)^{1−λ} e^{β π_i^y(t)} / Σ_j y_j(t)^{1−λ} e^{β π_j^y(t)},    (23)

where

π_i^x(t) = Σ_j a_ij y_j(t),   π_i^y(t) = Σ_j a_ij x_j(t),    (24)

and where a_ij (i, j ∈ {R, P, S}) is the standard RPS payoff matrix, i.e.

(a_ij) = (  0   −1    1 ;  1    0   −1 ; −1    1    0 ).    (25)

Due to the overall normalisation, Σ_i x_i(t) = Σ_i y_i(t) = 1, the above map defines a four-dimensional dynamical system. The mixed-strategy point x_i = y_i = 1/3 for all i is always a fixed point, and the corresponding Jacobian is easily computed. One finds the following eigenvalues

Λ = (1 − λ) ± i β/√3,    (26)

each with degeneracy two. Thus, the central fixed point is stable if and only if (1 − λ)² + β²/3 < 1. For a fixed choice of β < √3 one therefore has stability for λ > λ_c = 1 − √(1 − β²/3), and an unstable fixed point otherwise.

This separation of two regimes, one with a stable fixed point, and the other with a deterministic flow away from the centre of strategy space, is reflected in the escape times of stochastic learning. Results are shown in Fig. 7. In our simulations the stochastic learning dynamics is started at the fixed point at the centre of the strategy simplex and evolved at finite batch size N. The system does not fixate into one pure strategy, so the escape time is measured as the point in time when the four-dimensional vector (x(t), y(t)) first leaves a ball of radius Δ around the fixed point. As seen in the figure, the escape time scales sublinearly with the batch size if the fixed point is unstable (λ < λ_c). For neutrally stable deterministic dynamics (λ = λ_c) algebraic, linear scaling is found, and escape is superlinearly slow in the regime of a stable fixed point (λ > λ_c).
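
The threshold can be verified numerically by differentiating the map (23) at the central fixed point; a sketch (our code, with an arbitrary choice of β):

```python
import numpy as np

A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])  # RPS payoff matrix, Eq. (25)

def learning_map(x, y, beta, lam):
    """One step of the deterministic two-player map, Eqs. (23,24)."""
    xn = x**(1 - lam) * np.exp(beta * (A @ y))
    yn = y**(1 - lam) * np.exp(beta * (A @ x))
    return xn / xn.sum(), yn / yn.sum()

def map_eigenvalues(beta, lam, eps=1e-7):
    """Eigenvalues of the finite-difference Jacobian at x = y = 1/3."""
    z0 = np.full(6, 1.0 / 3.0)
    def F(z):
        x, y = learning_map(z[:3], z[3:], beta, lam)
        return np.concatenate([x, y])
    J = np.empty((6, 6))
    for k in range(6):
        dz = np.zeros(6)
        dz[k] = eps
        J[:, k] = (F(z0 + dz) - F(z0 - dz)) / (2 * eps)
    return np.linalg.eigvals(J)

beta = 0.5
lam_c = 1 - np.sqrt(1 - beta**2 / 3)  # predicted threshold, see Eq. (26)
mods = np.sort(np.abs(map_eigenvalues(beta, lam_c)))
print(mods)  # two ~0 modes from normalisation, then four moduli ~1
```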

Figure 7: (Colour on-line) Mean escape times in the learning of rock-paper-scissors at a fixed value of β as a function of the batch size. Symbols are from numerical simulations, averaged over independent runs. The mean escape time is sublinear in N if the fixed point is unstable (λ < λ_c), and superlinear for λ > λ_c (stable fixed point). If the central fixed point is neutrally stable (λ = λ_c) the escape time scales linearly in the batch size N. The dashed line is a fit to a power law of the data at λ = λ_c and reveals a linear scaling T_esc ∝ N. For the present choice of β one has λ_c = 1 − √(1 − β²/3).

V Network games

V.1 Definition of the game

We will now move to a more complex multi-player game defined on a networked structure, and consider the so-called 'best-shot game' [galeotti]. Analyses of the statistics of Nash equilibria on random graphs can be found in [asta1, asta2]. We here again focus on adaptive learning. Players are labelled by i = 1, …, M and arranged on an undirected graph, so that players i and j interact if and only if the link (ij) is present in the graph. In the 'best-shot' game each player has the choice between two actions, to 'contribute' or not to contribute. For simplicity we will refer to these actions as a_i = 1 and a_i = 0 respectively. The payoff any given player receives in any round of the game then depends on her action and on the actions of her neighbours on the underlying network. If we write ∂i for the set of neighbours of player i, then the best-shot game is defined by the following payoff structure for action a_i = 1

π_i(a_i = 1, a_∂i) = V − c,    (27)

and by payoffs for action a_i = 0

π_i(a_i = 0, a_∂i) = V if Σ_{j∈∂i} a_j ≥ 1, and 0 otherwise.    (28)

The variables V and c are positive constants, with V > c. To a certain extent the game resembles the typical structure of public-goods games. In the absence of any contributors in player i's neighbourhood (Σ_{j∈∂i} a_j = 0), player i will increase her payoff by contributing. If however at least one of her neighbours is contributing already (Σ_{j∈∂i} a_j ≥ 1), then player i will not want to contribute herself.

V.2 Sato-Crutchfield equations and homogeneous fixed point

We will write x_i(a, t), with a ∈ {0, 1} and i = 1, …, M, for the probability with which player i takes action a at time t. One always has x_i(0, t) + x_i(1, t) = 1, and we will abbreviate x_i ≡ x_i(1). In the continuous-time limit one obtains the following deterministic equation, where π_i(a) denotes the expected payoff to player i for action a, given the mixed strategies of her neighbours:

ẋ_i(a) = x_i(a) [ β ( π_i(a) − Σ_{a′} x_i(a′) π_i(a′) ) + λ Σ_{a′} x_i(a′) ln( x_i(a′)/x_i(a) ) ].    (29)

Taking into account that

π_i(1) = V − c,    (30)

π_i(0) = V [ 1 − Π_{j∈∂i} (1 − x_j) ],    (31)

we can rewrite the equations above in terms of the parameters x_i:

ẋ_i = x_i (1 − x_i) [ β (V P_i − c) + λ ln( (1 − x_i)/x_i ) ],    (32)

where we have introduced P_i = Π_{j∈∂i}(1 − x_j), the probability that none of player i's neighbours contributes. Up to now all derivations hold for any network structure. In order to keep the analytical expressions at a manageable level we will from now on restrict the analysis to regular graphs, i.e. to graphs in which all players have the same number of neighbours. We will denote the degree of the resulting regular network by K. Looking for homogeneous fixed-point solutions of the above continuous-time dynamics, i.e. setting x_i = x* for all players i, one finds

x* (1 − x*) [ β (V (1 − x*)^K − c) + λ ln( (1 − x*)/x* ) ] = 0.    (33)

Excluding trivial fixed points, i.e. assuming x* ≠ 0 and x* ≠ 1, one obtains

β (V (1 − x*)^K − c) + λ ln( (1 − x*)/x* ) = 0.    (34)

The solutions to this equation give the possible fixed points of the deterministic learning dynamics. In the cases studied here we will typically have one internal fixed point x*; its numerical value will generally depend on the model parameters β and λ. For λ → 0 we recover the homogeneous mixed Nash equilibrium x* = 1 − (c/V)^{1/K}. Irrespective of β and λ one finds x* = 1/2 for c/V = 2^{−K}.
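
For general parameters, Eq. (34) is easily solved numerically; a sketch using a standard bracketing root-finder (scipy is assumed to be available, and all parameter values are illustrative):

```python
import numpy as np
from scipy.optimize import brentq

def fixed_point(beta, lam, V, c, K):
    """Interior homogeneous fixed point x* of Eq. (34) on a K-regular graph."""
    f = lambda x: beta * (V * (1 - x)**K - c) + lam * np.log((1 - x) / x)
    return brentq(f, 1e-12, 1 - 1e-12)

V, K = 1.0, 4
c = V * 2.0**(-K)  # the choice c/V = 2**-K makes x* = 1/2 for all beta, lam
print(fixed_point(beta=1.0, lam=1e-9, V=V, c=c, K=K))   # ~0.5
print(fixed_point(beta=1.0, lam=0.5, V=V, c=c, K=K))    # ~0.5
print(fixed_point(beta=1.0, lam=0.5, V=V, c=0.2, K=K))  # shifts away from 1/2
```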

V.3 Stability analysis

Expanding Eq. (32) around the fixed point to linear order, writing x_i = x* + δx_i, one finds

d(δx_i)/dt = −λ δx_i − g Σ_{j∈∂i} δx_j,    (35)

where g = β V x* (1 − x*)^K. Eq. (35) can then be written in matrix form as

d(δx)/dt = −(λ 𝟙 + g A) δx,    (36)

where A is the adjacency matrix of the graph and 𝟙 the identity matrix. Diagonalizing this equation is equivalent to diagonalizing A. In particular, the critical value of λ, separating the phase in which the fixed point is stable from a phase with an unstable fixed point, is given by

λ_c = −g μ_min = −β V x* (1 − x*)^K μ_min,    (37)

where μ_min is the smallest eigenvalue of the adjacency matrix (all eigenvalues are real since A is symmetric).
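
The threshold (37) can be evaluated on a concrete graph. The sketch below draws a small K-regular random graph with a crude configuration-model sampler (illustrative, not efficient), diagonalises its adjacency matrix and evaluates λ_c; it also compares μ_min with the spectral-edge estimate −2√(K − 1) used in Sec. V.4 below:

```python
import numpy as np

rng = np.random.default_rng(2)

def regular_random_adjacency(M, K, tries=10_000):
    """Rejection-sample a simple K-regular graph on M nodes (M*K even)."""
    for _ in range(tries):
        stubs = np.repeat(np.arange(M), K)
        rng.shuffle(stubs)
        edges, ok = set(), True
        for u, v in stubs.reshape(-1, 2):
            u, v = int(u), int(v)
            if u == v or (u, v) in edges or (v, u) in edges:
                ok = False
                break
            edges.add((u, v))
        if ok:
            A = np.zeros((M, M))
            for u, v in edges:
                A[u, v] = A[v, u] = 1.0
            return A
    raise RuntimeError("no simple graph found; increase `tries`")

M, K, beta, V = 200, 4, 1.0, 1.0
A = regular_random_adjacency(M, K)
mu_min = np.linalg.eigvalsh(A).min()
xstar = 0.5  # holds for c/V = 2**-K, as noted above
g = beta * V * xstar * (1 - xstar)**K
print("mu_min =", mu_min, "; spectral-edge estimate:", -2 * np.sqrt(K - 1))
print("lambda_c =", -g * mu_min)  # Eq. (37)
```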

Figure 8: (Colour on-line) Escape times of learning in the networked best-shot game. Simulations are performed on a fixed regular random graph with M players of degree K, created at random. All runs are started at the homogeneous fixed point x_i = x* = 1/2; the escape time is defined as the first time in each run at which the distance from the fixed point exceeds a given threshold. Results are averaged over independent runs (all on the same realisation of the graph). The most negative eigenvalue μ_min of the adjacency matrix of this specimen graph sets the predicted stability threshold λ_c via Eq. (37). An exponent close to, but slightly different from, unity is found in a power-law fit to the escape time at λ = λ_c (dashed line).

In analogy with the earlier sections we expect that the escape time of learning will scale logarithmically in the batch size for λ < λ_c, i.e. when the interior fixed point is unstable. In the stable phase (λ > λ_c), on the other hand, one would predict an exponential behaviour. We verify these predictions in the following section. As a final remark regarding stability it is interesting to consider the limiting case of deterministic learning started from homogeneous initial conditions x_i(0) = x(0) for all i. For regular networks of degree K one then has x_i(t) = x(t) for all i, and x(t) fulfils

ẋ = x(1 − x) [ β (V (1 − x)^K − c) + λ ln( (1 − x)/x ) ].    (38)

Linearising about the fixed point, and restricting the motion to the homogeneous space x_1 = … = x_M, one finds

d(δx)/dt = −(λ + g K) δx.    (39)

Given that λ + gK > 0, the interior fixed point is therefore stable irrespective of the value of the parameter λ, similar to what was observed in the Hawk-Dove game. The network game considered in this section therefore bears close similarity to the Hawk-Dove game discussed earlier on. In a one-population setting (equivalently, upon restricting the dynamics to the subspace x_1 = … = x_M) the deterministic dynamics has a stable internal fixed point for any λ ≥ 0. In the multi-population case the fixed point remains unchanged, but unstable eigenvalues are present for λ < λ_c. The corresponding eigendirections break the symmetry between the different coordinates x_i, and hence the flow is away from the manifold defined by x_1 = … = x_M.

V.4 Test against simulations

The above theoretical predictions can be tested in several possible ways. For example one can consider the thermodynamic limit of large regular random networks of degree K, and then perform an average over multiple instances of the graph. Using results from spectral graph theory [mckay, cioaba], the support of the eigenvalue distribution of the adjacency matrix of a large regular random graph of degree K typically has its most negative eigenvalue at

μ_min = −2 √(K − 1).    (40)

With this estimate the expected value of λ_c can then be computed by means of Eq. (37). Simulations of the learning dynamics on large networks are however time consuming, and we have therefore taken a different route. We have created one particular instance of a regular random graph with M nodes and degree K. The adjacency matrix of this particular graph has then been diagonalised and the relevant eigenvalue μ_min has been identified. For convenience we have also chosen c/V = 2^{−K}, ensuring the value x* = 1/2 for the deterministic fixed point; Eq. (37) then predicts a change of stability at λ_c = −β V 2^{−(K+1)} μ_min. Measurements of the escape time of the best-shot game on this fixed sample of the graph are shown in Fig. 8. Results are consistent with an algebraic dependence of the escape time on the batch size at λ = λ_c, even though we note a slight discrepancy from the exponent of unity one would expect from the Langevin approximation. Below λ_c the fixed point is unstable, and escape times are shorter, consistent with logarithmic scaling in the batch size. At λ > λ_c the fixed point is stable and escape is slow.

Fig. 9 illustrates the dynamical evolution under either the deterministic replicator equations or the stochastic learning dynamics. It shows how the system is driven to fixation by a non-homogeneous perturbation, caused either by a slight heterogeneity in the initial conditions of the replicator equations, or by the stochasticity induced by a finite batch size in the learning dynamics. Specifically one finds that (i) the fixed point is an attractor for homogeneous initial conditions, and that (ii) inhomogeneous initial conditions lead to a flow towards the corners of strategy space.

Figure 9: (Colour on-line) Fixation in learning in the networked best-shot game. Simulations are performed on a fixed regular graph with M players of degree K, created at random. The black dashed line indicates the evolution of the deterministic replicator dynamics (RD), started with homogeneous initial conditions (hom IC) x_i(0) = x* for all players i. Symbols denote the evolution of the corresponding stochastic learning dynamics (SD) at finite batch size N; for clarity, only three players are followed, but all of them fixate. The coloured lines show the evolution, for the same three players, of the deterministic replicator equations when the homogeneous initial conditions above are slightly perturbed in the direction of the configuration to which the stochastic learning fixated (het IC). That is to say, x_i(0) = x* + ε_i, where ε_i is a random number of small magnitude which is positive if the configuration reached by the stochastic learning has x_i = 1 and negative otherwise. If the perturbation is in any other direction, the configuration to which the replicator equations fixate may be different from the one reached by stochastic learning.

VI Conclusions

In summary, we have studied the fixation properties of simple reinforcement-type learning algorithms in the context of different games, and have compared them to the outcome of evolutionary dynamics. The examples we have chosen range from simple two-player two-action games to more complex multi-player games on networked structures. Our main results can be summarized as follows: (i) unlike in evolutionary dynamics, where fixation can occur purely driven by fluctuations, fixation (i.e. convergence to pure strategies) in learning of the type of games we have studied here appears to be possible only if the underlying deterministic dynamics itself converges to a pure action profile; this is typically only the case if the symmetry between players is broken, for example by an inhomogeneous initial condition. (ii) Two-player and multi-player learning in the deterministic limit can, to a good approximation, be described by equations of a multi-population replicator type. As seen for the Hawk-Dove game, and for the network game we have studied, the stability properties of multi-population replicator dynamics can differ substantially from those of the corresponding one-population model. (iii) The role of noise in fixation processes in dynamical learning is mostly limited to triggering the required breaking of symmetry, eventually leading to fixation; unlike in evolutionary processes, we have not found examples in which fixation is triggered by random drift alone. (iv) In cases in which the limiting deterministic learning converges to a symmetric fixed point in the interior of strategy space, the corresponding escape time depends on the stability of this fixed point: for stable fixed points escape is essentially exponentially slow in the batch size, for unstable fixed points logarithmic scaling in the batch size is found. These findings are very similar to those in evolutionary systems.

While we have pointed out crucial differences between multi-player learning and evolutionary dynamics, our results mostly extend the similarities between the two approaches to dynamical aspects of games. In [galla] it was pointed out that stochastic learning can exhibit persistent quasi-cycles in regimes where deterministic learning converges to fixed points. These effects are very similar to those observed in evolutionary systems. The present work shows that the analogy goes further, and that the escape and fixation properties of stochastic learning dynamics are closely related to the corresponding behaviour of population models. Our analysis shows that the analogy is particularly strong when learning is compared with evolution in multiple populations. We expect that these similarities stretch even further, including potentially pattern formation in spatially extended systems and/or more complicated dynamics on adaptive networks. Future work may also include more complex learning models [camerer1, Camerer2003, Ho2007] inspired by laboratory experiments in behavioural game theory or by algorithms in machine learning. We are here confident that the analogy with stochastic evolutionary systems may provide a powerful perspective, and that it can contribute to accelerating the research required to analyse more general learning dynamics.

Acknowledgements.
TG and JRG acknowledge hospitality of the Abdus Salam ICTP. TG is grateful for funding by the Research Councils UK (RCUK reference EP/E500048/1) and by EPSRC (references EP/I005765/1 and EP/I019200/1), and would like to thank A. J. Bray for useful discussions. BSZ acknowledges support by an EPSRC studentship.

Appendix: Escape time for linear Langevin dynamics

.1 Reduced problem and backward Fokker-Planck equation

Let us consider the simple one-dimensional problem

ż(t) = a z(t) + (b/√N) η(t),    (41)

where η(t) is standard Gaussian white noise, i.e. ⟨η(t)⟩ = 0 and

⟨η(t) η(t′)⟩ = δ(t − t′).    (42)

We consider escape times: fix a number Δ > 0; if the process is started at z = 0 at time t = 0, then the escape time is defined as the first time at which the process leaves the interval (−Δ, Δ).

Using standard methods [gardiner, risken] one finds the backward Fokker-Planck equation

a z T′(z) + (b²/(2N)) T″(z) = −1,    (43)

where T(z) is the mean escape time, conditioned on a starting point z. The boundary conditions read T(±Δ) = 0 (if the process is started at the boundaries of the interval, the escape time is trivially zero). It is easy to check that the solution is given by

T(z) = (2N/b²) ∫_{|z|}^Δ dy e^{−aNy²/b²} ∫_0^y dz′ e^{aNz′²/b²}.    (44)

Restricting to a starting point of z = 0 one has

T(0) = (2N/b²) ∫_0^Δ dy ∫_0^y dz e^{−aN(y² − z²)/b²}.    (45)

By introducing the change of variables u = y √(aN)/b and v = z √(aN)/b, introducing the appropriate Jacobian b²/(aN), and setting U = Δ √(aN)/b, we can rewrite the integrals on the RHS of Eq. (45) as

T(0) = (2/a) ∫_0^U du e^{−u²} ∫_0^u dv e^{v²} = (N Δ²/b²) ₂F₂(1, 1; 3/2, 2; −U²),    (46)

where

₂F₂(1, 1; 3/2, 2; −U²) = Σ_{n≥0} [ (1)_n (1)_n / ( (3/2)_n (2)_n ) ] (−U²)^n / n!    (47)

is a generalised hypergeometric function whose asymptotic behaviour is well known [DLMF]. For simplicity, we will in the following give a heuristic derivation of the asymptotic leading behaviour of T(0) at large values of N. Our starting point will be the expression given in Eq. (45).

.2 Large-N behaviour for a > 0

Setting κ = aN/b², so that U² = κΔ², and changing variables in (45) to t = z/y and w = κy², one has

T(0) = (1/a) ∫_0^1 dt ∫_0^{U²} dw e^{−w(1 − t²)}.    (48)

Executing the integration over w and setting r = U²(1 − t²) in the remaining integral, one finds

T(0) = lim_{ε→0} (1/(2a)) ∫_ε^{U²} dr (1 − e^{−r}) / [ r √(1 − r/U²) ],    (49)

where we have introduced the lower integration limit ε. This variable will be set to zero eventually; the purpose of our procedure is to keep track of the logarithmic singularities at small values of r which appear when the two terms in the numerator are treated separately.

When a > 0 and N is very large we split (49) into these two contributions. The factor (1 − r/U²)^{−1/2} in the second contribution is only relevant when r = O(U²), in which case the factor e^{−r} renders the integrand exponentially small in U². Therefore, we can replace (1 − r/U²)^{−1/2} by unity in this term. Doing this, and evaluating the first contribution exactly by means of the substitution r = U²(1 − s²), we obtain

T(0) = lim_{ε→0} (1/(2a)) [ ln(4U²/ε) − ∫_ε^{U²} dr e^{−r}/r ] + corrections,    (50)

where the corrections vanish for large U. The remaining integral is an incomplete exponential integral,

∫_ε^{U²} dr e^{−r}/r = E₁(ε) − E₁(U²).    (51)

The contribution at the upper limit is exponentially small in U², as one can see from

E₁(x) ≤ e^{−x}/x,    (52)

valid for x > 0. We can therefore neglect E₁(U²) in the limit of large N, in which U² = aNΔ²/b² ≫ 1. Performing the limit ε → 0, and taking into account the expansion

E₁(ε) = −γ − ln ε + O(ε),    (53)

where γ ≈ 0.5772 is the Euler-Mascheroni constant, we note that the ln ε terms in (50) cancel, and we finally find

T(0) ≈ (1/(2a)) [ ln( 4 a N Δ²/b² ) + γ ].    (54)

Except for the additive constant, this result coincides with the asymptotic results for escape times reported in [mobilia] for the case of a two-dimensional system.

.3 Large-N behaviour for a < 0

For a < 0 and N large, we have from Eq. (45)

T(0) ≈ (2N/b²) (1/2) √(πb²/(|a|N)) ∫_0^Δ dy e^{|a|Ny²/b²} = √(πN/(|a|b²)) ∫_0^Δ dy e^{|a|Ny²/b²}.    (55)

We have here first extended the integration range of the z-integration to the interval (−y, y), adjusted for by an overall factor of 1/2. Subsequently, based on the observation that the integrand assumes its maximum at z = 0, we have extended the integration range of the z-integral to the entire real axis. This introduces an error which does not contribute to the leading exponential behaviour for large N. The integral over z is then Gaussian, and can be evaluated straightforwardly. The exponent in the remaining y-integration reaches its maximum at y = Δ. Expanding the exponent up to first order about this value we get

T(0) ≈ ( √π / (2|a|) ) U^{−1} e^{U²},    (56)

with U = Δ √(|a|N)/b.