Deterministic limit of temporal difference reinforcement learning for stochastic games
Abstract
Reinforcement learning in multiagent systems has been studied in the fields of economic game theory, artificial intelligence and statistical physics by developing an analytical understanding of the learning dynamics (often in relation to the replicator dynamics of evolutionary game theory). However, the majority of these analytical studies focus on repeated normal form games, which have only a single environmental state. Environmental dynamics, i.e. changes in the state of an environment affecting the agents’ payoffs, have received less attention, and a universal method to obtain deterministic equations from established multistate reinforcement learning algorithms is lacking. In this work we present a novel methodology to derive the deterministic limit resulting from an interaction-adaptation timescale separation of a general class of reinforcement learning algorithms, called temporal difference learning. This form of learning is equipped to function in more realistic multistate environments by using the estimated value of future environmental states to adapt the agent’s behavior. We demonstrate the potential of our method with the three well-established learning algorithms Q learning, SARSA learning and Actor-Critic learning. Illustrations of their dynamics on two multiagent, multistate environments reveal a wide range of different dynamical regimes, such as convergence to fixed points, limit cycles and even deterministic chaos.
I Introduction
Individual learning through reinforcements is a central approach in the fields of artificial intelligence Sutton and Barto (1998); Busoniu et al. (2008); Wiering and van Otterlo (2012), neuroscience Shah (2012); Hassabis et al. (2017), learning in games Fudenberg and Levine (1998) and behavioral game theory Roth and Erev (1995); Erev and Roth (1998); Camerer and Ho (1999); Camerer (2003), thereby offering a general purpose principle to either solve complex problems or explain behavior. Also in the fields of complexity economics Arthur (1991, 1999) and social science Macy and Flache (2002), reinforcement learning has been used as a model for human behavior to study social dilemmas.
However, there is a need for improved understanding and better qualitative insight into the characteristic dynamics that different learning algorithms produce. Therefore, reinforcement learning has also been studied from a dynamical systems perspective. In their seminal work, Börgers and Sarin showed that one of the most basic reinforcement learning update schemes, Cross learning Cross (1973), converges to the replicator dynamics of evolutionary game theory in the continuous time limit Börgers and Sarin (1997). This has led to at least two, presumably non-overlapping research communities, one from statistical physics Marsili et al. (2000); Sato et al. (2002); Sato and Crutchfield (2003); Sato et al. (2005); Galla (2009, 2011); Bladon and Galla (2011); Realpe-Gomez et al. (2012); Sanders et al. (2012); Galla and Farmer (2013); Aloric et al. (2016), and one from computer science (machine learning) Tuyls et al. (2003); Bloembergen et al. (2015); Tuyls and Nowé (2005); Tuyls et al. (2006); Tuyls and Parsons (2007); Kaisers and Tuyls (2010); Hennes et al. (2009); Vrancx et al. (2008); Hennes et al. (2010). As a result, Sato and Crutchfield (2003) and Tuyls et al. (2003) independently deduced identical learning equations in 2003.
The statistical physics articles usually consider the deterministic limit of the stochastic learning equations, assuming infinitely many interactions between the agents before an adaptation of behavior occurs. This limit can be taken either in continuous time with differential equations Sato et al. (2002); Sato and Crutchfield (2003); Sato et al. (2005) or in discrete time with difference equations Galla (2009, 2011); Bladon and Galla (2011). The differences between both variants can be significant Galla (2011); Realpe-Gomez et al. (2012). Deterministic chaos was found to emerge when learning simple Sato et al. (2002) as well as complicated games Galla and Farmer (2013). Relaxing the assumption of infinitely many interactions between behavior updates revealed that noise can change the attractor of the learning dynamics significantly, e.g. by noise-induced oscillations Galla (2009, 2011).
However, these statistical physics studies have so far considered only repeated normal form games. These are games where the payoff depends solely on the set of current actions, typically encoded in the entries of a payoff matrix (for the typical case of two players). Receiving a payoff and choosing another set of joint actions is performed repeatedly. This setup lacks the possibility to study dynamically changing environments and their interplay with multiple agents. In those systems, rewards depend not only on the joint action of the agents, but also on the states of the environment. Environmental state changes may occur probabilistically and depend on the joint action and the current state. Such a setting is also known as a Markov game or stochastic game Shapley (1953); Mertens and Neyman (1981). Thus, a repeated normal form game is a special case of a stochastic game with only one environmental state. Notably, Akiyama and Kaneko (2000, 2002) did emphasize the importance of a dynamically changing environment, but did not utilize a reinforcement learning update scheme.
The computer science machine learning community dealing with reinforcement learning as a dynamical system (see Bloembergen et al. (2015) for an overview) particularly emphasizes the link between evolutionary game theory and multiagent reinforcement learning as a well grounded theoretical framework for the latter Bloembergen et al. (2015); Tuyls and Nowé (2005); Tuyls et al. (2006); Tuyls and Parsons (2007). This dynamical systems perspective is proposed as a way to gain qualitative insights about the variety of multiagent reinforcement learning algorithms (see Busoniu et al. (2008) for a review). Consequently, this literature developed a focus on the translation of established reinforcement learning algorithms to a dynamical systems description, as well as the development of new algorithms based on insights of a dynamical systems perspective. While there is more work on stateless games (e.g. Q learning Tuyls et al. (2003), frequency adjusted multiagent Q learning Kaisers and Tuyls (2010)), multiagent learning dynamics for multistate environments have been developed as well, such as the piecewise replicator dynamics Vrancx et al. (2008), the statecoupled replicator dynamics Hennes et al. (2009) or the reverse engineering statecoupled replicator dynamics Hennes et al. (2010).
Both communities, statistical physics and machine learning, share the interest in better qualitative insights into multiagent learning dynamics. While the statistical physics community focuses more on dynamical properties the same set of learning equations can produce, it leaves a research gap of learning equations capable of handling multiple environmental states. The machine learning community on the other hand aims more towards algorithm development, but so far put their focus less on a dynamical systems understanding. Taken together, there is the challenge of developing a dynamical systems theory of multiagent learning dynamics in varying environmental states.
With this work, we aim to contribute to such a dynamical systems theory of multiagent learning dynamics. We present a novel methodology for obtaining the deterministic limit of multistate temporal difference reinforcement learning. In essence, it consists of formulating the temporal difference error for batch learning, and sending the batch size to infinity. We showcase our approach with the three prominent learning variants of Q learning, SARSA learning and Actor-Critic learning. Illustrations of their learning dynamics reveal multiple different dynamical regimes, such as fixed points, periodic orbits and deterministic chaos.
In Sec. II we introduce the necessary background and notation. Sec. III presents our method to obtain the deterministic limit of temporal difference reinforcement learning, and demonstrates it for multistate Q learning, SARSA learning and Actor-Critic learning. We illustrate their learning dynamics for two previously utilized two-agent, two-action, two-state environments in Sec. IV. In Sec. V we conclude with a discussion of our work.
II Preliminaries
We introduce the components (incl. notation) of our multiagent environment systems (see Fig. 1), followed by a brief introduction of temporal difference reinforcement learning.
II.1 Multiagent Markov environments
A multiagent Markov environment (also called stochastic game or Markov game) consists of $N$ agents. The environment can exist in $Z$ states $\mathcal{S} = \{S_1, \dots, S_Z\}$. In each state each agent $i$ has $M$ available actions $\mathcal{A}^i = \{A^i_1, \dots, A^i_M\}$ to choose from. Having an identical number of actions for all states and all agents is a notational convenience, not a significant restriction. A joint action of all agents is referred to by $\boldsymbol{a}$; the joint action of all agents but agent $i$ is denoted by $\boldsymbol{a}^{-i}$.
Environmental dynamics are given by the probabilities for state changes, expressed as a transition tensor $\boldsymbol{T}$. The entry $T_{s\boldsymbol{a}s'}$ denotes the probability that the environment transitions to state $s'$ given the environment was in state $s$ and the agents have chosen the joint action $\boldsymbol{a}$. Hence, for all $s$ and $\boldsymbol{a}$, $\sum_{s'} T_{s\boldsymbol{a}s'} = 1$ must hold. The assumption that the next state depends only on the current state and joint action makes our system Markovian. We here restrict ourselves to ergodic environments without absorbing states (cf. Hennes et al. (2010)).
The rewards receivable by the agents are given by the reward tensor $\boldsymbol{R}$. The entry $R^i_{s\boldsymbol{a}s'}$ denotes the reward agent $i$ receives when the environment transitions from state $s$ to state $s'$ under the joint action $\boldsymbol{a}$. Rewards are also called payoffs from a game theoretic perspective.
Agents draw their actions from their behavior profile $\boldsymbol{X}$. The entry $X^i_{sa}$ denotes the probability that agent $i$ chooses action $a$ in state $s$. Thus, for all $i$ and all $s$, $\sum_a X^i_{sa} = 1$ must hold. We here focus on the case of independent agents, able to fully observe the current state of the environment. With correlated behavior (see e.g. Busoniu et al. (2008)) and partially observable environments Spaan (2012); Oliehoek (2012) one could extend the multiagent environment systems to be even more general. Note that what we call behavior profile is usually termed policy from a machine learning perspective or behavioral strategy from a game theoretic perspective. We chose to introduce our own term because policies and strategies suggest a deliberate choice which we do not want to impose.
II.2 Averaging out behavior and environment
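We define a notational convention that allows a systematic averaging over the current behavior profile $\boldsymbol{X}$ and the environmental transitions $\boldsymbol{T}$. It will be used throughout the paper.
Averaging over the whole behavior profile yields
$$\langle \circ \rangle_{\boldsymbol{X}} := \sum_{a^1} \cdots \sum_{a^N} X^1_{sa^1} \cdots X^N_{sa^N} \, \circ \tag{1}$$
Here, $\circ$ serves as a placeholder. If the quantity to be inserted for $\circ$ depends on the summation indices, then those indices will be summed over as well. If the quantity which is averaged out is used in tensor form, it is written in bold. If not, remaining indices are added after the right angle bracket.
Averaging over the behavior profile of the other agents, keeping the action $a$ of agent $i$, yields
$$\langle \circ \rangle_{\boldsymbol{X}^{-i}} := \sum_{\boldsymbol{a}^{-i}} \prod_{j \neq i} X^j_{sa^j} \, \circ \tag{2}$$
Last, averaging over the subsequent state $s'$ yields
$$\langle \circ \rangle_{\boldsymbol{T}} := \sum_{s'} T_{s\boldsymbol{a}s'} \, \circ \tag{3}$$
Of course, these operations may also be combined as $\langle \circ \rangle_{\boldsymbol{X},\boldsymbol{T}}$ and $\langle \circ \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}$, by multiplying both summations.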
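For example, given a behavior profile $\boldsymbol{X}$, the resulting effective Markov chain transition matrix reads $\langle \boldsymbol{T} \rangle_{\boldsymbol{X}}$, which encodes the transition probabilities from state $s$ to $s'$. From $\langle \boldsymbol{T} \rangle_{\boldsymbol{X}}$ the stationary distribution of environmental states $\boldsymbol{\pi}(\boldsymbol{X})$ can be computed. $\boldsymbol{\pi}(\boldsymbol{X})$ is the left eigenvector corresponding to the eigenvalue 1 of $\langle \boldsymbol{T} \rangle_{\boldsymbol{X}}$. Its entries encode the ratios of the average durations the agents find themselves in the respective environmental states.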
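The average reward agent $i$ receives from state $s$ under action $a$, given all other agents follow the behavior profile $\boldsymbol{X}$, reads $\langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a)$. Including agent $i$’s behavior profile gives the average reward it receives from state $s$: $\langle R^i \rangle_{\boldsymbol{X},\boldsymbol{T}}(s)$. Hence, $\langle R^i \rangle_{\boldsymbol{X},\boldsymbol{T}}(s) = \sum_a X^i_{sa} \langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a)$ holds.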
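The averaging convention can be illustrated numerically. The following is a minimal sketch for two agents, assuming array layouts `X[i, s, a]`, `T[s, a1, a2, s_next]` and `R[i, s, a1, a2, s_next]`; the function names and the randomly generated environment are hypothetical and serve only as an illustration.

```python
import numpy as np

def effective_transitions(X, T):
    """Behavior-averaged transition matrix <T>_X for two agents."""
    return np.einsum("sa,sb,sabz->sz", X[0], X[1], T)

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()

def average_reward(X, T, R, i):
    """Average reward <R^i>_{X,T} of agent i for each state."""
    return np.einsum("sa,sb,sabz,sabz->s", X[0], X[1], T, R[i])

# Hypothetical random two-state, two-action environment (illustration only)
rng = np.random.default_rng(0)
T = rng.random((2, 2, 2, 2))
T /= T.sum(axis=-1, keepdims=True)   # transition probabilities sum to 1
R = rng.random((2, 2, 2, 2, 2))
X = np.full((2, 2, 2), 0.5)          # uniform random behavior profile

P = effective_transitions(X, T)
pi = stationary_distribution(P)
```

Here `pi @ P` reproduces `pi`, which is the defining property of the stationary distribution.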
II.3 Agent’s preferences and values
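Typically, agents are assumed to maximize their exponentially discounted sum of future rewards, called return $G^i(t) = (1-\gamma^i) \sum_{k=0}^{\infty} (\gamma^i)^k \, r^i(t+k)$, where $\gamma^i \in [0,1)$ is the discount factor of agent $i$ and $r^i(t+k)$ denotes the reward received by agent $i$ at time step $t+k$. Exponential discounting is most commonly used for its mathematical convenience and because it ensures consistent preferences over time. Other formulations of a return use e.g. finite time horizons, average reward settings, as well as other ways of discounting, such as hyperbolic discounting. Those other forms require their own form of reinforcement learning.
Given a behavior profile $\boldsymbol{X}$, the expected return defines the state-value function $V^i_{\boldsymbol{X}}(s) := \mathrm{E}_{\boldsymbol{X}}[\, G^i(t) \mid s(t) = s \,]$, which is independent of time $t$. Inserting the return yields the Bellman equation Bellman (1957)
$$V^i_{\boldsymbol{X}}(s) = \mathrm{E}_{\boldsymbol{X}}\left[ (1-\gamma^i)\, r^i(t) + \gamma^i V^i_{\boldsymbol{X}}(s(t+1)) \mid s(t) = s \right] \tag{4}$$
Thus, the value of a state is the discounted value of the subsequent state plus $(1-\gamma^i)$ times the reward received along the way. Evaluating the expected value over the behavior profile and writing in matrix form, we get:
$$\boldsymbol{V}^i_{\boldsymbol{X}} = (1-\gamma^i) \langle \boldsymbol{R}^i \rangle_{\boldsymbol{X},\boldsymbol{T}} + \gamma^i \langle \boldsymbol{T} \rangle_{\boldsymbol{X}} \boldsymbol{V}^i_{\boldsymbol{X}} \tag{5}$$
A solution for the state values can be obtained using matrix inversion
$$\boldsymbol{V}^i_{\boldsymbol{X}} = (1-\gamma^i) \left( \mathbb{1}_Z - \gamma^i \langle \boldsymbol{T} \rangle_{\boldsymbol{X}} \right)^{-1} \langle \boldsymbol{R}^i \rangle_{\boldsymbol{X},\boldsymbol{T}} \tag{6}$$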
The computational complexity of matrix inversion makes this solution strategy infeasible for large systems. Therefore many iterative solution methods exist Wiering and van Otterlo (2012).
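Equivalently, state-action-value functions $Q^i_{\boldsymbol{X}}(s,a)$ are defined as the expected return, given agent $i$ applied action $a$ in state $s$ and then followed $\boldsymbol{X}$ accordingly: $Q^i_{\boldsymbol{X}}(s,a) := \mathrm{E}_{\boldsymbol{X}}[\, G^i(t) \mid s(t) = s, a^i(t) = a \,]$. They can be computed via
$$Q^i_{\boldsymbol{X}}(s,a) = (1-\gamma^i) \langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a) + \gamma^i \sum_{s'} \langle T \rangle_{\boldsymbol{X}^{-i}}(s,a,s') \, V^i_{\boldsymbol{X}}(s') \tag{7}$$
One can show that $V^i_{\boldsymbol{X}}(s) = \sum_a X^i_{sa} Q^i_{\boldsymbol{X}}(s,a)$ holds for the inverse relation of state-action and state values.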
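As a numerical check of the matrix solution and of the relation between state values and state-action values, the following minimal sketch computes both for one agent in a two-agent environment. It assumes the return carries the prefactor $(1-\gamma)$; the array layouts and the random environment are hypothetical, and a linear solve replaces the explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
Z, M, gamma = 2, 2, 0.9

# Hypothetical environment: T[s, a, b, s_next] and rewards R0 of agent 0
T = rng.random((Z, M, M, Z))
T /= T.sum(axis=-1, keepdims=True)
R0 = rng.random((Z, M, M, Z))
X0 = rng.dirichlet(np.ones(M), Z)   # behavior of agent 0, X0[s, a]
X1 = rng.dirichlet(np.ones(M), Z)   # behavior of agent 1, X1[s, b]

# Full behavior averages: effective transition matrix and reward per state
P = np.einsum("sa,sb,sabz->sz", X0, X1, T)
Rs = np.einsum("sa,sb,sabz,sabz->s", X0, X1, T, R0)

# State values: V = (1 - gamma) (1 - gamma P)^(-1) <R>
V = (1 - gamma) * np.linalg.solve(np.eye(Z) - gamma * P, Rs)

# Averages over the other agent only, keeping agent 0's action
Ti = np.einsum("sb,sabz->saz", X1, T)
Ri = np.einsum("sb,sabz,sabz->sa", X1, T, R0)

# State-action values: Q(s, a) = (1 - gamma) <R>(s, a) + gamma sum_s' <T>(s, a, s') V(s')
Q = (1 - gamma) * Ri + gamma * Ti @ V
```

Averaging `Q` over agent 0's own behavior, `(X0 * Q).sum(axis=1)`, recovers `V`.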
II.4 Learning through reinforcement
In contrast to the typical game theoretic assumption of perfect information, we assume that agents know nothing about the game in advance. They can only gain information about the environment and other agents through interactions. They do not know the true reward tensor $\boldsymbol{R}$ or the true transition probabilities $\boldsymbol{T}$. They experience only reinforcements (i.e. particular rewards $r^i(t)$), while observing the current true Markov state of the environment.
II.4.1 Temporal difference learning
In essence, state-action propensities $Q^i_t(s,a)$ get iteratively updated by a temporal difference error $TD^i_t(s,a)$:
$$Q^i_{t+1}(s,a) = Q^i_t(s,a) + \alpha^i \cdot TD^i_t(s,a) \tag{8}$$
with $\alpha^i$ being the learning rate of agent $i$. These state-action propensities $Q^i_t(s,a)$ can be interpreted as estimates of the state-action values $Q^i_{\boldsymbol{X}}(s,a)$.
The temporal difference error expresses a difference in the estimation of state-action values. New experience is used to compute a new estimate of the current state-action value, which is corrected by the old estimate. The estimate from the new experience uses exactly the recursive relation of value functions from the Bellman equation (Eq. 4),
$$TD^i_t(s,a) = \delta_{s s_t} \delta_{a a_t} \left[ (1-\gamma^i)\, r^i_t + \gamma^i \sqcap^i_t(s_{t+1}) - \sqcup^i_t(s_t) \right] \tag{9}$$
Here $\sqcap^i_t(s_{t+1})$ indicates the estimate at time step $t$ of the value of the state visited at the next time step, $s_{t+1}$, and $\sqcup^i_t(s_t)$ denotes the estimate at time step $t$ of the value of the current state $s_t$. Different choices for these estimates are possible, leading to different learning variants (see below).
The Kronecker deltas indicate that the temporal difference error for the state-action pair $(s,a)$ is only nonzero when $(s,a)$ was actually visited in time step $t$. This emphasizes that agents can only learn from experience. In contrast, e.g. experience-weighted attraction learning Camerer and Ho (1999) assumes that action propensities can be updated with hypothetical rewards an agent would have received if it had played an action other than the current one. These two cases have been referred to as full vs. partial information Marsili et al. (2000). Thus, the Kronecker deltas in Eq. 9 indicate a partial information update. The agents use only information experienced through interaction.
The state-action propensities are translated to a behavior profile according to the Gibbs / Boltzmann distribution Sutton and Barto (1998) (also called softmax)
$$X^i_t(s,a) = \frac{\exp\left(\beta^i Q^i_t(s,a)\right)}{\sum_b \exp\left(\beta^i Q^i_t(s,b)\right)} \tag{10}$$
The behavior profile has become a dynamic variable as well. The parameter $\beta^i$ controls the intensity of choice or the exploitation level of agent $i$, controlling the exploration-exploitation tradeoff. For high $\beta^i$ agents tend to exploit their learned knowledge about the environment, leaning towards actions with high estimated state-action value. For low $\beta^i$ agents are more likely to deviate from these high value actions in order to explore the environment further, with the chance of finding actions which eventually lead to even higher rewards. Other behavior profile translations exist as well (e.g. $\epsilon$-greedy Sutton and Barto (1998)).
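The role of the intensity of choice can be made concrete with a short sketch of the softmax translation. The function name `behavior_profile` and the example propensities are hypothetical; the subtraction of the row maximum is a standard numerical safeguard that does not change the result.

```python
import numpy as np

def behavior_profile(Q, beta):
    """Softmax over the action axis (last axis) of the propensities Q[s, a]."""
    z = beta * (Q - Q.max(axis=-1, keepdims=True))  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

# Hypothetical propensities for two states and two actions
Q = np.array([[1.0, 2.0],
              [0.0, 0.0]])

X_explore = behavior_profile(Q, beta=0.0)    # uniform: pure exploration
X_mixed = behavior_profile(Q, beta=1.0)      # mild preference for higher propensities
X_exploit = behavior_profile(Q, beta=50.0)   # nearly greedy: strong exploitation
```

For `beta=0.0` every action is equally likely, while for large `beta` the probability mass concentrates on the action with the highest propensity.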
II.4.2 Three learning variants
The specific choices of the value estimates in the temporal difference error result in different reinforcement learning variants.
Q learning.
For the Q learning algorithm Sutton and Barto (1998); Wiering and van Otterlo (2012), $\sqcap^i_t(s_{t+1}) = \max_b Q^i_t(s_{t+1}, b)$ and $\sqcup^i_t(s_t) = Q^i_t(s_t, a_t)$. Thus, the Q learning update takes the maximum of the next state-action propensities as an estimate of the value of the next state, regardless of the actual next action the agent plays. This is reasonable, because the maximum is the highest value achievable given the current knowledge.
SARSA learning.
For SARSA learning Sutton and Barto (1998); Wiering and van Otterlo (2012), $\sqcap^i_t(s_{t+1}) = Q^i_t(s_{t+1}, a_{t+1})$ and $\sqcup^i_t(s_t) = Q^i_t(s_t, a_t)$, where $a_{t+1}$ denotes the action taken by agent $i$ at the next time step $t+1$. Thus, the SARSA algorithm uses the five ingredients of an update sequence of State, Action, Reward, next State, next Action to perform one update. In practice, the SARSA sequence has to be shifted one time step back to know what the actual "next" action of the agent was.
Actor-Critic (AC) learning.
For AC learning Sutton and Barto (1998); Wiering and van Otterlo (2012), $\sqcap^i_t(s_{t+1}) = V^i_t(s_{t+1})$ and $\sqcup^i_t(s_t) = V^i_t(s_t)$. It has an additional data structure of state-value approximations $V^i_t(s)$, which get separately updated according to $V^i_{t+1}(s) = V^i_t(s) + \alpha^i \cdot TD^i_t(s,a)$. The state-action propensities $Q^i_t(s,a)$ serve as the actor, which gets criticized by the state-value approximations $V^i_t(s)$.
Tab. 0(a) summarizes the value estimates for these three learning variants. Q and SARSA learning are structurally more similar to each other than to the Actor-Critic learner, which has an additional data structure of state-value approximations $V^i_t(s)$.
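The three variants differ only in the two value estimates entering the temporal difference error. A minimal sketch for a single visited transition follows, assuming the reward enters with the prefactor $(1-\gamma)$; the function name `td_error` and the numbers in the example are hypothetical.

```python
def td_error(variant, Q, V, s, a, r, s_next, a_next, gamma):
    """Temporal difference error for one experienced transition.

    Q[s][a] are state-action propensities; V[s] are the state-value
    approximations used only by the Actor-Critic learner.
    """
    if variant == "Q":
        next_estimate, current_estimate = max(Q[s_next]), Q[s][a]
    elif variant == "SARSA":
        next_estimate, current_estimate = Q[s_next][a_next], Q[s][a]
    elif variant == "AC":
        next_estimate, current_estimate = V[s_next], V[s]
    else:
        raise ValueError(f"unknown variant: {variant}")
    return (1 - gamma) * r + gamma * next_estimate - current_estimate

# Hypothetical propensities, state values and a single transition
Q = [[0.0, 1.0], [2.0, 0.5]]
V = [0.3, 0.7]
transition = dict(s=0, a=0, r=1.0, s_next=1, a_next=1, gamma=0.5)

td_q = td_error("Q", Q, V, **transition)          # 0.5 + 0.5 * 2.0 - 0.0
td_sarsa = td_error("SARSA", Q, V, **transition)  # 0.5 + 0.5 * 0.5 - 0.0
td_ac = td_error("AC", Q, V, **transition)        # 0.5 + 0.5 * 0.7 - 0.3
```

Q learning and SARSA share the current-state estimate and differ only in the next-state estimate, whereas the Actor-Critic learner replaces both by state-value approximations.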



III Deterministic limit
In this section we develop a novel methodology to obtain the deterministic limit of temporal difference reinforcement learning. We showcase our method for the three learning variants of Q, SARSA and Actor-Critic learning. For the statistical physics community, the novelty consists of learning equations capable of handling environmental state transitions. For the machine learning community, the novelty lies in the systematic methodology we use to obtain the deterministic learning equations. Note that these deterministic learning equations will no longer depend on the state-action propensities, being iterated maps of the behavior profile alone.
Following e.g. Sato and Crutchfield (2003); Sato et al. (2005); Bladon and Galla (2011), we first combine Eq. 10 with a propensity update and obtain
$$X^i_{t+1}(s,a) = \frac{X^i_t(s,a) \exp\left(\alpha^i \beta^i \, TD^i_t(s,a)\right)}{\sum_b X^i_t(s,b) \exp\left(\alpha^i \beta^i \, TD^i_t(s,b)\right)} \tag{11}$$
Next, we formulate the temporal difference error for batch learning.
III.1 Batch learning
By batch learning we mean that several time steps of interaction with the environment and the other agents take place before an update of the state-action propensities and the behavior profile occurs. It has also been interpreted as a form of history replay Lange et al. (2012), which is essential to stabilize the learning process when function approximation (e.g. by deep neural networks) is used Mnih et al. (2015). History (i.e. already experienced state, action, next state triples) is used again for an update of the state-action propensities.
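Imagine that the information from these interactions is stored inside a batch of size $K$. We introduce the corresponding temporal difference error of batch size $K$:
$$TD^i_t(s,a;K) := \frac{1}{K(s,a)} \sum_{k=0}^{K-1} \delta_{s s_{t+k}} \delta_{a a_{t+k}} \left[ (1-\gamma^i)\, r^i_{t+k} + \gamma^i \sqcap^i_t(s_{t+k+1}) - \sqcup^i_t(s_{t+k}) \right] \tag{12}$$
where $K(s,a)$ denotes the number of times the state-action pair $(s,a)$ was visited. If the state-action pair was never visited, $K(s,a) = 1$. The agents interact $K$ times under the same behavior and use the sample average as the new estimate for the value of the state-action pair $(s,a)$.
The state-action propensities update then follows
$$Q^i_{t+K}(s,a) = Q^i_t(s,a) + \alpha^i \cdot TD^i_t(s,a;K) \tag{13}$$
The notation $TD^i_t(s,a)$ is short for a batch update of batch size 1: $TD^i_t(s,a) = TD^i_t(s,a;1)$.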
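The batch temporal difference error amounts to a per-pair sample average of single-step errors collected under fixed propensities. A minimal sketch for the Q learning variant follows, assuming the reward prefactor $(1-\gamma)$ and a batch stored as hypothetical tuples `(s, a, r, s_next)`; the function name `batch_td_error` is likewise an assumption.

```python
import numpy as np

def batch_td_error(batch, Q, gamma):
    """Sample-average Q learning TD error per state-action pair.

    batch: iterable of (s, a, r, s_next) experienced under fixed Q.
    Pairs that were never visited keep a TD error of zero.
    """
    total = np.zeros_like(Q)
    visits = np.zeros_like(Q)
    for s, a, r, s_next in batch:
        total[s, a] += (1 - gamma) * r + gamma * Q[s_next].max() - Q[s, a]
        visits[s, a] += 1
    # Dividing by max(visits, 1) mirrors setting K(s, a) = 1 for unvisited pairs
    return total / np.maximum(visits, 1)

# Hypothetical batch of three transitions under zero-initialized propensities
Q = np.zeros((2, 2))
batch = [(0, 0, 1.0, 1), (0, 0, 0.0, 0), (1, 1, 1.0, 0)]
td = batch_td_error(batch, Q, gamma=0.5)
Q_new = Q + 0.1 * td   # one batch update with learning rate 0.1
```

As the batch grows, the sample average per pair approaches the behavior-profile average used in the deterministic limit.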
III.2 Separation of timescales
We obtain the deterministic limit of the temporal difference learning dynamics by sending the batch size to infinity, $K \to \infty$.
Equivalently, this can be regarded as a separation of timescales. Two processes can be distinguished during an update of the state-action propensities $Q^i_t(s,a)$: adaptation and interaction,
(14) 
By separating the timescales of both processes, we assume that (infinitely) many interactions happen before one step of behavior profile adaptation occurs.
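Under this assumption and because of the assumed ergodicity, one can replace the sample average, i.e. the sum over sequences of states and actions, with the behavior profile average, i.e. the sum over state-action behavior and transition probabilities, according to
$$\frac{1}{K(s,a)} \sum_{k=0}^{K-1} \delta_{s s_{t+k}} \delta_{a a_{t+k}} \;\to\; \sum_{s'} \sum_{\boldsymbol{a}^{-i}} \prod_{j \neq i} X^j_{sa^j} \, T_{s\boldsymbol{a}s'} \tag{15}$$
For example, the immediate reward $r^i_t$ in the temporal difference error becomes $\langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a)$. The time gets rescaled accordingly, as well.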
III.3 Three learning variants
Next, we present the deterministic limit of the temporal difference error of the three learning variants of Q, SARSA and Actor-Critic learning. Inserting them into Eq. 11 yields the complete description of the behavior profile update in the deterministic limit. Tab. 1 presents an overview of the resulting equations and a comparison to their batch size 1 versions.
III.3.1 Q learning
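The temporal difference error of Q learning consists of three terms: i) the immediate reward $(1-\gamma^i)\, r^i_t$, ii) the value estimate of the next state $\gamma^i \max_b Q^i_t(s_{t+1}, b)$, and iii) the value estimate of the current state $Q^i_t(s_t, a_t)$. As already stated, the first term becomes $(1-\gamma^i) \langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a)$ under $K \to \infty$. The second term becomes $\gamma^i \, {}^{\max}Q^i_{\boldsymbol{X}}(s,a)$, which is defined as
$${}^{\max}Q^i_{\boldsymbol{X}}(s,a) := \sum_{s'} \sum_{\boldsymbol{a}^{-i}} \prod_{j \neq i} X^j_{sa^j} \, T_{s\boldsymbol{a}s'} \max_b Q^i_{\boldsymbol{X}}(s', b) \tag{16}$$
using the deterministic limit conversion rule (Eq. 15) and the state-action value of the behavior profile $Q^i_{\boldsymbol{X}}(s,a)$ according to Eq. 7.
For the third term, we invert Eq. 10, yielding $Q^i_t(s,a) = \frac{1}{\beta^i} \log X^i_t(s,a) + c^i_t(s)$, where $c^i_t(s)$ is constant in actions, but may vary for each agent and state. Now, one can show that the dynamics induced by Eq. 11 are invariant against additive transformations of the temporal difference error that are constant in actions. Thus, the third term can be converted according to $Q^i_t(s,a) \to \frac{1}{\beta^i} \log X^i_{sa}$.
All together, the temporal difference error for Q learning in the deterministic limit reads
$$TD^i(s,a) = (1-\gamma^i) \langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a) + \gamma^i \, {}^{\max}Q^i_{\boldsymbol{X}}(s,a) - \frac{1}{\beta^i} \log X^i_{sa} \tag{17}$$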
III.3.2 SARSA learning
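Two of the three terms of the SARSA temporal difference error are identical to the ones of Q learning, leaving the next-state estimate $Q^i_t(s_{t+1}, a_{t+1})$, which we replace by
$${}^{\mathrm{next}}Q^i_{\boldsymbol{X}}(s,a) := \sum_{s'} \sum_{\boldsymbol{a}^{-i}} \prod_{j \neq i} X^j_{sa^j} \, T_{s\boldsymbol{a}s'} \sum_b X^i_{s'b} \, Q^i_{\boldsymbol{X}}(s', b) \tag{18}$$
using again the deterministic limit conversion rule (Eq. 15) and the state-action value of the behavior profile $Q^i_{\boldsymbol{X}}(s,a)$ according to Eq. 7.
Thus, the temporal difference error for the SARSA learning update in the deterministic limit reads
$$TD^i(s,a) = (1-\gamma^i) \langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a) + \gamma^i \, {}^{\mathrm{next}}Q^i_{\boldsymbol{X}}(s,a) - \frac{1}{\beta^i} \log X^i_{sa} \tag{19}$$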
III.3.3 Actor-Critic (AC) learning
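For the temporal difference error of AC learning we have to find replacements for i) the next-state estimate $V^i_t(s_{t+1})$ and ii) the current-state estimate $V^i_t(s_t)$. Applying again Eq. 15 to the first yields ${}^{\mathrm{next}}V^i_{\boldsymbol{X}}(s,a)$, defined as
$${}^{\mathrm{next}}V^i_{\boldsymbol{X}}(s,a) := \sum_{s'} \sum_{\boldsymbol{a}^{-i}} \prod_{j \neq i} X^j_{sa^j} \, T_{s\boldsymbol{a}s'} \, V^i_{\boldsymbol{X}}(s') \tag{20}$$
using Eq. 6 for the state value $V^i_{\boldsymbol{X}}(s')$. This is the average value of the next state, given that in the current state $s$ the agent took action $a$. One can show that ${}^{\mathrm{next}}V^i_{\boldsymbol{X}}(s,a) = {}^{\mathrm{next}}Q^i_{\boldsymbol{X}}(s,a)$ from the SARSA update.
The second remaining term belongs to the slower adaptation timescale, or in other words, it occurs outside the batch. Thus, our deterministic limit conversion rule (Eq. 15) does not apply. We could think of a conversion $V^i_t(s_t) \to V^i_{\boldsymbol{X}}(s)$. However, the remaining term is constant in actions, and therefore irrelevant for the dynamics, as we have argued above. Thus, we can simply put $V^i_t(s_t) \to 0$.
All together, the temporal difference error of the Actor-Critic learner in the deterministic limit reads
$$TD^i(s,a) = (1-\gamma^i) \langle R^i \rangle_{\boldsymbol{X}^{-i},\boldsymbol{T}}(s,a) + \gamma^i \, {}^{\mathrm{next}}V^i_{\boldsymbol{X}}(s,a) \tag{21}$$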
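The deterministic-limit temporal difference errors of the three learners and the resulting behavior profile map can be sketched as follows. This is a minimal illustration for two homogeneous agents, assuming the reward prefactor $(1-\gamma)$, a multiplicative softmax update of the behavior profile, and array layouts `X[i, s, a]`, `T[s, a1, a2, s_next]`, `R[i, s, a1, a2, s_next]`; the function names and the random environment are hypothetical.

```python
import numpy as np

def det_td_errors(X, T, R, gamma, beta, kind):
    """Deterministic-limit TD errors TD[i, s, a]; kind is 'Q', 'SARSA' or 'AC'."""
    Z = T.shape[0]
    P = np.einsum("sa,sb,sabz->sz", X[0], X[1], T)        # <T>_X
    TD = np.zeros_like(X)
    for i in range(2):
        Rs = np.einsum("sa,sb,sabz,sabz->s", X[0], X[1], T, R[i])
        V = (1 - gamma) * np.linalg.solve(np.eye(Z) - gamma * P, Rs)
        # Average over the other agent only, keeping agent i's own action
        if i == 0:
            Ti = np.einsum("sb,sabz->saz", X[1], T)
            Ri = np.einsum("sb,sabz,sabz->sa", X[1], T, R[0])
        else:
            Ti = np.einsum("sa,sabz->sbz", X[0], T)
            Ri = np.einsum("sa,sabz,sabz->sb", X[0], T, R[1])
        Q = (1 - gamma) * Ri + gamma * Ti @ V             # state-action values
        if kind == "Q":
            next_estimate = Ti @ Q.max(axis=1)            # average of max_b Q(s', b)
        else:
            next_estimate = Ti @ V                        # shared by SARSA and AC
        current = 0.0 if kind == "AC" else np.log(X[i]) / beta
        TD[i] = (1 - gamma) * Ri + gamma * next_estimate - current
    return TD

def step(X, TD, alpha, beta):
    """Multiplicative softmax update of the behavior profile."""
    W = X * np.exp(alpha * beta * TD)
    return W / W.sum(axis=-1, keepdims=True)

# Hypothetical random environment and interior behavior profile
rng = np.random.default_rng(0)
T = rng.random((2, 2, 2, 2))
T /= T.sum(axis=-1, keepdims=True)
R = rng.random((2, 2, 2, 2, 2))
X = rng.dirichlet(np.ones(2), (2, 2))

td_sarsa = det_td_errors(X, T, R, gamma=0.9, beta=5.0, kind="SARSA")
td_ac = det_td_errors(X, T, R, gamma=0.9, beta=5.0, kind="AC")
X_next = step(X, td_sarsa, alpha=0.1, beta=5.0)
```

In this sketch the SARSA and Actor-Critic errors differ exactly by the term `np.log(X) / beta`, which vanishes as `beta` grows.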
IV Application to example environments
In the following we apply the derived deterministic learning equations in two different environments. Specifically, we compare the three well-established temporal difference learning variants (Q learning, SARSA learning and Actor-Critic (AC) learning) in two different two-agent ($N=2$), two-action ($M=2$) and two-state ($Z=2$) environments. The two environments, a two-state Matching Pennies game and a two-state Prisoner’s Dilemma, have also been used in ref. Hennes et al. (2010). Note that we leave a comparison between the deterministic limit and the stochastic equations to future work, which would add a noise term to our equations following the example of ref. Galla (2009).
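To measure the performance of an agent’s behavior profile in a single scalar, we use the dot product between the stationary state distribution $\boldsymbol{\pi}(\boldsymbol{X})$ of the effective Markov chain with the transition matrix $\langle \boldsymbol{T} \rangle_{\boldsymbol{X}}$ and the behavior average reward $\langle \boldsymbol{R}^i \rangle_{\boldsymbol{X},\boldsymbol{T}}$. Interestingly, we find this quantity to be identical to the dot product of the stationary distribution and the state value $\boldsymbol{V}^i_{\boldsymbol{X}}$:
$$\boldsymbol{\pi}(\boldsymbol{X}) \cdot \langle \boldsymbol{R}^i \rangle_{\boldsymbol{X},\boldsymbol{T}} = \boldsymbol{\pi}(\boldsymbol{X}) \cdot \boldsymbol{V}^i_{\boldsymbol{X}} \tag{22}$$
This relation can be shown by using Eq. 6 and the fact that $\boldsymbol{\pi}(\boldsymbol{X})$ is a left eigenvector of $\langle \boldsymbol{T} \rangle_{\boldsymbol{X}}$.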
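The argument can be spelled out in two steps. The sketch below assumes the state values carry the prefactor $(1-\gamma^i)$, writes $\boldsymbol{\pi}$ for the stationary distribution, $\boldsymbol{P}$ for the effective transition matrix and $\bar{\boldsymbol{R}}^i$ for the behavior average reward; these shorthand symbols are introduced here only for brevity.

```latex
\begin{align}
\boldsymbol{\pi} \left( \mathbb{1} - \gamma^i \boldsymbol{P} \right)
  &= \boldsymbol{\pi} - \gamma^i \boldsymbol{\pi} \boldsymbol{P}
   = (1 - \gamma^i)\, \boldsymbol{\pi}
  && \text{since } \boldsymbol{\pi} \boldsymbol{P} = \boldsymbol{\pi} \\
\Rightarrow \quad
\boldsymbol{\pi} \left( \mathbb{1} - \gamma^i \boldsymbol{P} \right)^{-1}
  &= \frac{1}{1 - \gamma^i}\, \boldsymbol{\pi} \\
\boldsymbol{\pi} \cdot \boldsymbol{V}^i
  &= (1 - \gamma^i)\, \boldsymbol{\pi}
     \left( \mathbb{1} - \gamma^i \boldsymbol{P} \right)^{-1} \bar{\boldsymbol{R}}^i
   = \boldsymbol{\pi} \cdot \bar{\boldsymbol{R}}^i
\end{align}
```

The prefactor $(1-\gamma^i)$ in the return is what makes both scalars coincide; without it they would differ by the factor $1/(1-\gamma^i)$.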
In the following examples we will only investigate homogeneous agents, i.e. agents whose parameters do not differ from each other. We will therefore drop the agent indices from the parameters $\alpha$, $\beta$ and $\gamma$. The heterogeneous agent case is to be explored in future work.
IV.1 Two-state Matching Pennies
The single state matching pennies game is a paradigmatic twoagent twoaction game. Imagine the situation of soccer penalty kicks. The keeper (agent 1) can choose to jump either to the left or right side of the goal, the kicker (agent 2) can choose to kick the ball also either to the left or the right. If both agents choose the identical side, the keeper agent wins, otherwise the kicker agent.
In the two-state version of the game according to Hennes et al. (2010) the rules are extended as follows: In state 1 the situation is as described in the single-state version. Whenever agent 1 (the keeper) decides to jump to the left, the environment transitions to state 2 in which the agents switch roles: agent 1 now plays the kicker and agent 2 the keeper. From here, whenever agent 1 (now the kicker) decides to kick to the right side, the environment transitions again to state 1 and both agents switch their roles again.
Figure 2 illustrates this two-state Matching Pennies game. In both states the winning agent receives a reward of 1 and the losing agent a reward of 0. State transitions are governed by the action of agent 1 alone. Thus by construction, the probability of transitioning to the other state is independent of agent 2’s action. Only agent 1 has agency over the state transitions. By playing a uniform random behavior profile both agents would obtain an average reward of 0.5 per time step.
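The environment described above can be encoded as transition and reward tensors. The following sketch rests on assumptions that are not taken verbatim from the text: action index 0 stands for "left" and 1 for "right", state index 0 for state 1 and 1 for state 2, transitions are deterministic, and the winner receives 1 while the loser receives 0.

```python
import numpy as np

Z = M = 2

# T[s, a1, a2, s_next]: only the action a1 of agent 1 matters (assumed deterministic)
T = np.zeros((Z, M, M, Z))
T[0, 0, :, 1] = 1.0   # state 1: keeper jumps left  -> state 2
T[0, 1, :, 0] = 1.0   # state 1: keeper jumps right -> stay in state 1
T[1, 1, :, 0] = 1.0   # state 2: kicker kicks right -> state 1
T[1, 0, :, 1] = 1.0   # state 2: kicker kicks left  -> stay in state 2

# R[i, s, a1, a2, s_next]: the keeper wins if both choose the same side
same = np.eye(M)[:, :, None]       # 1 where a1 == a2, broadcast over s_next
R = np.zeros((2, Z, M, M, Z))
R[0, 0] = same                     # state 1: agent 1 is the keeper
R[1, 0] = 1 - same
R[0, 1] = 1 - same                 # state 2: roles are switched
R[1, 1] = same

# Uniform random behavior profile X[i, s, a]
X = np.full((2, Z, M), 0.5)
P = np.einsum("sa,sb,sabz->sz", X[0], X[1], T)
avg_reward = np.einsum("sa,sb,sabz,isabz->is", X[0], X[1], T, R)
```

Under the uniform profile every entry of `avg_reward` equals 0.5, matching the average reward per time step stated above.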
With Fig. 3 we compare the temporal difference error in the phase space sections at a comparably low discount factor $\gamma$ for each environmental state, as well as learning trajectories for an exemplary initial condition for two learning rates $\alpha$, a low one and a high one. Overall, we observe a variety of qualitatively different dynamical regimes, such as fixed points, periodic orbits and chaotic motion.
Specifically, we see that Q learners and SARSA learners behave qualitatively similarly, in contrast to the AC learners, for both learning rates $\alpha$. For the low learning rate, Q and SARSA learners reach a fixed point of playing both actions with equal probability in both states, yielding a reward of 0.5. Due to the low $\alpha$, this takes approx. 600 time steps. In contrast, the reward trajectory of the AC learner appears to be chaotic. Figure 5 confirms this observation, which we will discuss in more detail below.
For the high learning rate both Q and SARSA learners enter a periodic limit cycle. Differences in the trajectories of Q and SARSA learner are clearly visible. The time average reward of this periodic orbit appears to be approx. 0.5 for each agent, identical to the reward of the fixed point at lower $\alpha$. The AC learner, however, converges to a fixed point after oscillating near the edges of the phase space. At this fixed point agent 1 plays action 1 with probability 1. Thus, it has trapped the system in state 2. Agent 2 plays action 2 with probability 1 and consequently agent 1 receives a reward of 1, whereas agent 2 receives 0 reward. One might ask, why does agent 2 not decrease its probability for playing action 2, thereby increasing its own reward? And indeed, the arrows of the temporal difference error suggest this change of behavior profile. However, agent 2 cannot follow because its behavior is trapped on the boundary of the simplex: under the update of Eq. 11 an action probability that has reached zero remains zero, so the behavior profile cannot change anymore, regardless of the temporal difference error.
Increasing the discount factor $\gamma$, we observe the learning rate $\alpha$ to set the timescale of learning (Fig. 4). The intensity of choice $\beta$ remained the same. A high learning rate corresponds to faster learning in contrast to a low learning rate. Also the ratio of learning timescales is comparable to the inverse ratio of learning rates. For both learning rates, Q and SARSA learners reach a fixed point, whereas the AC learners seem to move chaotically (details to be investigated below). Comparing the trajectories between the learning rates, we observe a similar shape for each pair of learners. However, the similarity of the AC trajectories decreases at larger time steps.
So far, we varied two parameters: the discount factor $\gamma$ and the learning rate $\alpha$. Combining Figures 3 and 4 we investigated all four combinations of a low and a high $\gamma$ with a low and a high $\alpha$. We can summarize that Q and SARSA learners converge to a fixed point for all combinations of discount factor $\gamma$ and learning rate $\alpha$, except when $\gamma$ is low and simultaneously $\alpha$ is high. AC dynamics seem chaotic for all combinations of $\gamma$ and $\alpha$.
To investigate the relationship between the parameters more thoroughly, Figure 5 shows bifurcation diagrams with the bifurcation parameters $\alpha$ and $\gamma$. Additionally, it gives the largest Lyapunov exponents for each learner and each parameter combination. A largest Lyapunov exponent greater than zero is a key characteristic of chaotic motion. We computed the Lyapunov exponents from the analytically derived Jacobian matrix, iteratively used in a QR decomposition according to Sandri (1996). See Appendix A for details.
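The QR-based estimation of Lyapunov exponents for an iterated map can be sketched as follows. The learning dynamics and their Jacobian are not reproduced here; as a stand-in, the sketch uses the Hénon map with its standard parameters, and the function names are hypothetical.

```python
import numpy as np

def lyapunov_spectrum(step, jacobian, x0, n_steps=2000, n_transient=100):
    """Estimate Lyapunov exponents of an iterated map via repeated QR decomposition."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_transient):          # discard the transient
        x = step(x)
    Qm = np.eye(x.size)
    log_sums = np.zeros(x.size)
    for _ in range(n_steps):
        # Propagate the orthonormal frame with the Jacobian and re-orthonormalize
        Qm, Rm = np.linalg.qr(jacobian(x) @ Qm)
        log_sums += np.log(np.abs(np.diag(Rm)))
        x = step(x)
    return log_sums / n_steps

def henon(x, a=1.4, b=0.3):
    """Stand-in map (not the learning dynamics)."""
    return np.array([1.0 - a * x[0] ** 2 + x[1], b * x[0]])

def henon_jacobian(x, a=1.4, b=0.3):
    """Analytical Jacobian of the stand-in map."""
    return np.array([[-2.0 * a * x[0], 1.0],
                     [b, 0.0]])

exponents = lyapunov_spectrum(henon, henon_jacobian, [0.0, 0.0])
largest = exponents.max()
```

For this stand-in map the exponents sum to $\log 0.3$ because the Jacobian determinant is constant, which offers a simple consistency check; a positive `largest` indicates chaotic motion.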
The largest Lyapunov exponent for Q and SARSA learners align almost perfectly with each other, whereas the largest Lyapunov exponent of the AC learners behaves qualitatively different. We first describe the behavior of the Q and SARSA learner: For high learning rates and low farsightedness Figure 5 shows a periodic orbit with few (four) points in phase space. Largest Lyapunov exponents are distinctly below 0 at those regimes. Increasing the farsightedness both learners enter a regime of visiting many points in phase space around the stable fixed point . The largest Lyapuonv exponents are close to zero. With increasing the distance around this fixed point solution decreases until the dynamics converge from a farsightedness slightly greater than 0.5 on. From there the largest Lyapunov exponent decreases again for further increasing . The same observations can be made along a decreasing bifurcation parameter , except that a the end, for low the largest Lyapunov exponents do not decrease as distinctly as for high .
The behavior of the ActorCritic dynamics is qualitatively different from the one of Q and SARSA. The placement of the fixed points on the natural numbers grid suggests that the AC learner get confined on one of the 16 () corners of the phase space. No regularity to which fixed point the AC learner converges can be deduced. The largest Lyapunov exponent is always above zero and experiences an overall decreasing behavior. Similarly for a decreasing bifurcation parameter , the largest Lyapunov exponent tends to decrease as well. Different from the bifurcation diagram along , for low the system might enter a periodic motion, but only for some parameters . No regularity can be determined at which parameters the AC learners enter a periodic motion. A more thorough investigation of the nonlinear dynamics, especially those of the ActorCritic learner seems of great interest, is, however, beyond the scope of this article and leaves promising paths for future work.
Concerning the parameter , the intensity of choice, one can infer from the update equations (Eq. 11 combined with Eq. 19 and Eq. 21), that the dynamics for the AC learner are invariant for a constant product . This is because the temporal difference error of the ActorCritic learner in the deterministic limit is independent of . Further, the dynamics of the SARSA learner will converge to the dynamics of the AC learner under . Figure 6 nicely confirms these two observations. Observing Tab. 1 is another way to see this. Since the value estimate of the future state is identical for SARSA and AC learners, letting the value estimate of the current state vanish by sending makes the SARSA learners approximate the AC learners.
As mentioned before, the intensity of choice controls the exploration-exploitation tradeoff. In the temporal difference errors of the Q and SARSA learners it appears in the term indicating the value estimate of the current state. If this term dominates the temporal difference error (i.e. if the intensity of choice is small), the learners tend towards the center of behavior space, i.e. towards choosing all actions with equal probability, forgetting what they have learned about the obtainable reward. This characteristic happens to be favorable in our two-state Matching Pennies environment, which is why Q and SARSA learners perform better in finding the solution. On the other hand, if the intensity of choice is large, the temporal difference error is dominated by the current reward and the future value estimate. Not being able to forget, the learners might get trapped in unfavorable behavior, as we can see from the Actor-Critic learners. To calibrate the intensity of choice, it is useful to keep in mind that it must come in units of [log behavior] / [reward].
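The pull towards the center of behavior space can be made explicit with a short calculation. It is a sketch under two assumptions that are not spelled out in this section: that the update of Eq. 11 has the multiplicative form $X_{sa}(t+1) \propto X_{sa}(t)\exp[\alpha\beta\,TD_{sa}]$, with learning rate $\alpha$ and intensity of choice $\beta$ (agent index omitted), and that the current-state value estimate enters the temporal difference error as $-\beta^{-1}\ln X_{sa}$, which is consistent with the units of [log behavior] / [reward]. If that term dominates,

$$
TD_{sa} \approx -\frac{1}{\beta}\ln X_{sa}
\quad\Longrightarrow\quad
X_{sa}(t+1) \approx \frac{X_{sa}(t)^{\,1-\alpha}}{\sum_b X_{sb}(t)^{\,1-\alpha}},
\qquad
\frac{X_{sa}(t+1)}{X_{sb}(t+1)} \approx \left(\frac{X_{sa}(t)}{X_{sb}(t)}\right)^{1-\alpha}.
$$

For $0 < \alpha \le 1$ every probability ratio moves towards 1, so the behavior drifts towards equal action probabilities regardless of the rewards.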
IV.2 Two-state Prisoner's Dilemma
The single-state Prisoner's Dilemma is another paradigmatic two-agent, two-action game. It has been used to model social dilemmas and to study the emergence of cooperation. It describes a situation in which two prisoners are interrogated separately, leaving each of them with the choice to either cooperate with the other by not speaking to the police or defect by testifying.
The two-state version, which has also been used as a test environment in Vrancx et al. (2008); Hennes et al. (2009, 2010), extends this situation somewhat artificially by playing a Prisoner's Dilemma in each of the two states, with a transition probability of 10% from one state to the other if both agents choose the same action, and a transition probability of 90% if the agents choose opposite actions.
Figure 7 illustrates these game dynamics, with rewards given as in state and in state for , respectively. State transition probabilities are given by with . Hence, the probability of remaining in the same state is given by for both states .
A behavior profile in which one agent exploits the other in one state, while being exploited in the other state, would result in an average reward per time step of 5 for each agent, e.g. .
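The average reward of 5 can be retraced with a short calculation. The sketch below uses the transition probabilities given in the text; the payoffs of 10 for the exploiting agent and 0 for the exploited agent are assumptions chosen to be consistent with the stated average, and the variable names are hypothetical.

```python
import numpy as np

# From the text: the state switches with probability 0.1 if both agents
# choose the same action and with probability 0.9 if they choose opposite ones.
P_SWITCH_SAME, P_SWITCH_OPPOSITE = 0.1, 0.9

# Assumed payoffs (not given in this section): exploiter 10, exploited 0.
R_EXPLOITER, R_EXPLOITED = 10.0, 0.0

# Deterministic profile: agent 0 exploits agent 1 in state 0, and the roles
# are swapped in state 1. The actions are opposite in both states, so the
# environment switches state with probability 0.9 at every step.
p = P_SWITCH_OPPOSITE
T = np.array([[1 - p, p],
              [p, 1 - p]])  # T[s, s'] = probability of moving from s to s'

# Stationary state distribution: left eigenvector of T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(T.T)
stat = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
stat = stat / stat.sum()

# rewards[s, i] = reward of agent i in state s under the profile above.
rewards = np.array([[R_EXPLOITER, R_EXPLOITED],
                    [R_EXPLOITED, R_EXPLOITER]])
avg_reward = stat @ rewards  # long-run average reward per time step
```

Because the transition matrix is symmetric, both states are visited half of the time, and each agent alternates between the two assumed payoffs, which yields the average of 5.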
However, for all three learning types with a mid-range farsightedness and the intensity of choice used here, the temporal difference error arrows point on average towards the lower left defection-defection point for each state in phase space (Figure 8). To see whether the three learning types may converge to the described defect-cooperate, cooperate-defect equilibrium, individual trajectories from two exemplary initial conditions are shown for two learning rates, as before a small one and a high one.
We observe qualitatively different behavior across all three learners. The Q learners converge to equilibria with average rewards distinctly below 5, whereas the SARSA learners converge to equilibria with average rewards of almost 5 for both learning rates and both exemplary initial conditions. Both Q and SARSA learners converge to solutions of proper probabilistic behavior, i.e. they choose the actions cooperate and defect with nonvanishing probability. The Actor-Critic learners, on the other hand, converge to the deterministic defect-cooperate, cooperate-defect behavior described above for the initial condition shown with the non-dashed lines in Figure 8, for both learning rates (shown in red and blue). For the other exemplary initial condition, shown with the dashed lines, they converge to an all-defection solution in both states for both learning rates.
Interestingly, for all learners, all combinations of initial conditions and learning rates converge to a fixed point solution, except for the Q learners with the comparatively high learning rate, which enter a periodic solution for the initial condition shown with the non-dashed line. The same phenomenon also occurred in the Matching Pennies environment for low farsightedness, there, however, for both Q and SARSA learners. It seems to be caused by the comparatively high learning rate: a high learning rate overshoots the behavior update, resulting in a circling motion around the fixed point. As in Figure 3, the time-averaged reward of the periodic orbit seems to be comparable to the reward of the corresponding fixed point at the lower learning rate. Furthermore, we observe the same time rescaling effect of the learning rate in Figure 8 as in Figure 4.
To visualize the influence of the discount factor on the converged behavior, Figure 9 shows a bifurcation diagram with the discount factor as the bifurcation parameter for two initial conditions. Dots in blue result from a uniformly random initial behavior profile, whereas the dots in red started from a different initial behavior profile.
Across all learners, lower discount factors correspond to all-defect solutions, whereas for higher discount factors the solutions from the initial condition shown in red tend towards the cooperate-defect, defect-cooperate solution. For low discount factors, the agents are less aware of the presence of other states and find the all-defect equilibrium solution of the iterated normal form Prisoner's Dilemma. The state transition probabilities have less effect on the learning dynamics. Only above a certain farsightedness do the agents find the more rewarding cooperate-defect, defect-cooperate solution.
Figure 9 confirms the observation from Figure 8 that the probability to cooperate (here, of agent 1 in state 1 and of agent 2 in state 2) is lowest for the Q learner, mid-range for the SARSA learner and 1 for the Actor-Critic learner. One reason for this observation can be found in the intensity of choice parameter. It balances the reward obtainable in the current behavior space segment against the forgetting of current knowledge, which keeps the learner open to new solutions. Such forgetting expresses itself in temporal difference error components pointing towards the center of behavior space. Thus, a relatively small intensity of choice can explain why solutions at the edge of the behavior space cannot be reached by Q and SARSA learners. The AC learner lacks this forgetting term in the deterministic limit and can therefore easily enter behavior profiles at the edge of the behavior space.
Q and SARSA learners have a critical discount factor above which the high-reward cooperate-defect, defect-cooperate solution is obtained and below which the low-reward all-defect solution is selected. However, for discount factors increasing up to 1, Q and SARSA learners experience a drop in the probability of playing the cooperative action.
The Actor-Critic learners approach the cooperate-defect, defect-cooperate solution in two steps. With increasing discount factor, first the probability of agent 2 to cooperate in state 2 jumps from zero to one while agent 1 still defects in state 1. Only after a slight further increase of the discount factor does agent 1 also cooperate in state 1.
Interestingly, for the uniformly random initial behavior condition shown in blue, there is no critical discount factor and no learner comes close to the cooperate-defect, defect-cooperate solution. Here, all cooperation probabilities gradually increase only for discount factors close to 1. Furthermore, exactly at those discount factors at which the cooperate-defect, defect-cooperate solution is obtained from the initial behavior condition shown in red, the solutions from the uniformly random initial behavior condition (blue) have a largest Lyapunov exponent greater than 0. At other values of the discount factor, the largest Lyapunov exponents for the two initial conditions overlap. This suggests that a largest Lyapunov exponent greater than zero may indicate that other, perhaps more rewarding solutions exist in phase space. A more thorough investigation of this multistability is an open point for future research.
V Discussion
The main contribution of this paper is the development of a technique to obtain the deterministic limit of temporal difference reinforcement learning. With our work we have combined, for the first time, the literature on learning dynamics from statistical physics with the evolutionary-game-theory-inspired learning dynamics literature from machine learning. For the statistical physics community, the novelty consists of learning equations capable of handling environmental state transitions. For the machine learning community, the novelty lies in the systematic methodology we have used to obtain the deterministic learning equations.
We have demonstrated our approach with three prominent reinforcement learning algorithms from computer science: Q learning, SARSA learning and Actor-Critic learning. A comparison of their dynamics in two previously used two-agent, two-action, two-state environments has revealed a variety of qualitatively different dynamical regimes, such as convergence to fixed points, periodic orbits and deterministic chaos.
We have found that Q and SARSA learners tend to behave qualitatively more similarly to each other than to the Actor-Critic learning dynamics. This characteristic results at least partly from our relatively low intensity of choice parameter, which controls the exploration-exploitation tradeoff via a forgetting term in the temporal difference errors. In the limit of an infinitely large intensity of choice, the SARSA learning dynamics approach the Actor-Critic learning dynamics, as we have shown. Overall, the Actor-Critic learners have a tendency to enter confining behavior profiles because they lack a forgetting term. This characteristic leaves them trapped at the edges of the behavior space. In contrast, Q and SARSA learners do not show such learning behavior. Interestingly, this characteristic of the AC learners turns out to be favorable in the two-state Prisoner's Dilemma environment, where they find the most rewarding solution in more cases than Q and SARSA, but it hinders convergence to the fixed point solution in the two-state Matching Pennies environment.
We have demonstrated that the learning rate adjusts the speed of learning and thereby, within limits, acts as a time rescaling. A comparatively large learning rate might cause limit cycles around the fixed point, thereby hindering the convergence to that point. Nevertheless, the average reward of the limit cycling behavior was approximately equal to that of the fixed point obtained at lower learning rates, and it took fewer time steps to reach. Thus, dynamical regimes other than fixed points, such as limit cycles or strange attractors, could be of interest in some applications of reinforcement learning.
We have also shown the effect of the discount factor, which describes the farsightedness of the agents. At low discount factors, the state transition probabilities have less effect on the learning dynamics than at high discount factors.
We hope that our work will turn out to be useful for the application of reinforcement learning in various domains, with respect to parameter tuning, the design of new algorithms, and the analysis of complex strategic interactions using meta strategies, as Bloembergen et al. (2015) have pointed out. In this regard, future work could extend the presented methodology to partial observability of the Markov states of the environment Spaan (2012); Oliehoek (2012), behavior profiles with history, and other-regarding agent (i.e. joint-action) learners (cf. Busoniu et al. (2008) for an overview of other-regarding agent learning algorithms). Also, the combination of individual reinforcement learning and social learning through imitation Bandura (1977); Smolla et al. (2015); Barfuss et al. (2017); Banisch and Olbrich (2017) seems promising. Such endeavors would naturally lead to the exploration of network effects. It is important to note that only a few dynamical systems studies of reinforcement learning have begun to incorporate network structures between agents Bladon and Galla (2011); Realpe-Gomez et al. (2012).
Apart from these more technical extensions, we are confident that our learning equations will prove useful when studying the evolution of cooperation in stochastic games Hilbe et al. (2018). With stochastic games one is able to explicitly account for a changing environment. Therefore, such studies are likely to contribute to the advancement of theoretical research on the sustainability of interlinked social-ecological systems Levin (2013) in the age of the Anthropocene Steffen et al. (2007). Interactions, synergies and tradeoffs between social Dawes (1980); Macy and Flache (2002) and ecological Heitzig et al. (2016) dilemmas need to be explored. Advancing the modeling of such social-ecological systems, especially on a global scale, requires a better understanding of agency and general purpose human behavior Donges et al. (2017). Learning is a suitable candidate for a general purpose behavior in such sustainability challenges Lindkvist and Norberg (2014); Lindkvist et al. (2017). By defining a default behavior profile, fruitful connections to the topology of sustainable management framework Heitzig et al. (2016) and viability theory Aubin (2009) could be explored. Also, modern measures of facets of social-ecological resilience, such as adaptability and transformability, could be developed within our framework Donges and Barfuss (2017). Eventually, learning agents could be used to discover resilient governance strategies. With respect to modeling human behavior, it is an open question to what extent behavioral theories from social science Schlüter et al. (2017) can be operationalized, yielding a broader spectrum of behavioral update equations. Ultimately, the comparison of the dynamics of different behavioral theories from social science would be of great interest. Critical to an increased understanding of social-ecological systems across scales is the role of agency Donges et al. (2017). Very roughly, agency describes the level of deliberate influence an agent, or a group of agents, has on the system. The framework of multiagent environment systems that we work with seems a promising starting point for attempts to formalize agency and to investigate its influence in connection with certain behavioral update equations.
Acknowledgements.
This work was conducted in the framework of the project on Coevolutionary Pathways in the Earth System (COPAN) at the Potsdam Institute for Climate Impact Research (PIK). We are grateful for financial support by the Heinrich Böll Foundation, the Stordalen Foundation (via the Planetary Boundaries Research Network PB.net), the Earth League’s EarthDoc program and the Leibniz Association (project DOMINOES). We thank Jobst Heitzig for discussions and comments on the manuscript.
Appendix A Computation of Lyapunov Exponents
We compute the Lyapunov exponents using an iterative QR decomposition of the Jacobian matrix according to Sandri (1996). In the following we present the derivation of the Jacobian matrix.
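For orientation, the iterative QR procedure can be sketched in a few lines of Python. This is a generic illustration for an arbitrary discrete-time map, not the code used for the results in this article: `step` and `jacobian` are hypothetical callables, and the Hénon map merely serves as a stand-in for the learning dynamics.

```python
import numpy as np

def lyapunov_spectrum(step, jacobian, x0, n_steps=2000, n_transient=200):
    """Estimate all Lyapunov exponents of the map x -> step(x).

    step(x) returns the next state and jacobian(x) the Jacobian matrix of
    the map at x. Tangent vectors are re-orthonormalized by a QR
    decomposition at every iteration.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_transient):        # discard the transient
        x = step(x)
    q = np.eye(x.size)                  # orthonormal tangent basis
    log_sums = np.zeros(x.size)
    for _ in range(n_steps):
        q, r = np.linalg.qr(jacobian(x) @ q)
        log_sums += np.log(np.abs(np.diag(r)))  # local stretching factors
        x = step(x)
    return np.sort(log_sums / n_steps)[::-1]    # largest exponent first

# Stand-in example: the Henon map with its standard parameter values.
A_HENON, B_HENON = 1.4, 0.3

def henon_step(x):
    return np.array([1.0 - A_HENON * x[0] ** 2 + x[1], B_HENON * x[0]])

def henon_jacobian(x):
    return np.array([[-2.0 * A_HENON * x[0], 1.0],
                     [B_HENON, 0.0]])

exponents = lyapunov_spectrum(henon_step, henon_jacobian, x0=[0.0, 0.0])
```

To apply the sketch to the learning dynamics, the behavior profile would have to be flattened into a vector and the Jacobian tensor derived below reshaped into a matrix accordingly.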
Eq. 11 constitutes a map which iteratively updates the behavior profile. Consequently, we can represent its derivative as a Jacobian tensor.
Let be the numerator of Eq. 11, and its denominator, i.e. . Hence,
(23) 
or, more precisely, in components,
(24) 
and are known, and if is known, is easily obtained by . Therefore we need to compute for the three learner types Q, SARSA and ActorCritic learning.
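As a reading aid, the quotient-rule structure of this derivation can be written out under an assumed form of Eq. 11. The symbols below are introduced here for illustration only and may differ from the notation of the main text: $X^i_{sa}$ is the probability of agent $i$ to choose action $a$ in state $s$, $\alpha$ the learning rate, $\beta$ the intensity of choice, $A^i_{sa} = X^i_{sa}\exp[\alpha\beta\,TD^i_{sa}]$ the numerator and $B^i_s = \sum_c A^i_{sc}$ the denominator, so that $X^i_{sa}(t+1) = A^i_{sa}/B^i_s$. Under these assumptions,

$$
\frac{\partial X^i_{sa}(t+1)}{\partial X^j_{rb}}
= \frac{1}{B^i_s}\,\frac{\partial A^i_{sa}}{\partial X^j_{rb}}
- \frac{A^i_{sa}}{\left(B^i_s\right)^2}\,\frac{\partial B^i_s}{\partial X^j_{rb}},
\qquad
\frac{\partial B^i_s}{\partial X^j_{rb}} = \sum_c \frac{\partial A^i_{sc}}{\partial X^j_{rb}},
$$

$$
\frac{\partial A^i_{sa}}{\partial X^j_{rb}}
= \exp\!\left[\alpha\beta\,TD^i_{sa}\right]
\left(\delta_{ij}\,\delta_{sr}\,\delta_{ab}
+ \alpha\beta\,X^i_{sa}\,\frac{\partial TD^i_{sa}}{\partial X^j_{rb}}\right).
$$

In this form, the only learner-specific ingredient is the derivative of the temporal difference error with respect to the behavior profile.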
A.1 Q learning
Let us rewrite for the Q learner according to
(25) 
where we removed the estimate of the current value from the temporal difference error, leaving the truncated TD error as
(26) 
Hence, we can write the derivative of as
(27) 
Since , can be expressed as
(28) 
The derivative of the truncated temporal difference error reads