Deterministic limit of temporal difference reinforcement learning for stochastic games

Wolfram Barfuss barfuss@pik-potsdam.de Potsdam Institute for Climate Impact Research, Germany Department of Physics, Humboldt University Berlin, Germany    Jonathan F. Donges Potsdam Institute for Climate Impact Research, Germany Stockholm Resilience Centre, Stockholm University, Sweden    Jürgen Kurths Potsdam Institute for Climate Impact Research, Germany Department of Physics, Humboldt University Berlin, Germany Saratov State University, Russia
July 22, 2019
Abstract

Reinforcement learning in multi-agent systems has been studied in the fields of economic game theory, artificial intelligence and statistical physics by developing an analytical understanding of the learning dynamics (often in relation to the replicator dynamics of evolutionary game theory). However, the majority of these analytical studies focuses on repeated normal form games, which only have a single environmental state. Environmental dynamics, i.e. changes in the state of an environment affecting the agents’ payoffs, have received less attention, and a universal method to obtain deterministic equations from established multi-state reinforcement learning algorithms has been lacking. In this work we present a novel methodology to derive the deterministic limit resulting from an interaction-adaptation timescale separation of a general class of reinforcement learning algorithms, called temporal difference learning. This form of learning is equipped to function in more realistic multi-state environments by using the estimated value of future environmental states to adapt the agent’s behavior. We demonstrate the potential of our method with the three well-established learning algorithms Q learning, SARSA learning and Actor-Critic learning. Illustrations of their dynamics on two multi-agent, multi-state environments reveal a wide range of different dynamical regimes, such as convergence to fixed points, limit cycles and even deterministic chaos.

I Introduction

Individual learning through reinforcements is a central approach in the fields of artificial intelligence Sutton and Barto (1998); Busoniu et al. (2008); Wiering and van Otterlo (2012), neuroscience Shah (2012); Hassabis et al. (2017), learning in games Fudenberg and Levine (1998) and behavioral game theory Roth and Erev (1995); Erev and Roth (1998); Camerer and Ho (1999); Camerer (2003), thereby offering a general purpose principle to either solve complex problems or explain behavior. Also in the fields of complexity economics Arthur (1991, 1999) and social science Macy and Flache (2002), reinforcement learning has been used as a model for human behavior to study social dilemmas.

However, there is a need for improved understanding and better qualitative insight into the characteristic dynamics that different learning algorithms produce. Therefore, reinforcement learning has also been studied from a dynamical systems perspective. In their seminal work, Börgers and Sarin showed that one of the most basic reinforcement learning update schemes, Cross learning Cross (1973), converges to the replicator dynamics of evolutionary game theory in the continuous time limit Börgers and Sarin (1997). This has led to at least two, presumably non-overlapping research communities, one from statistical physics Marsili et al. (2000); Sato et al. (2002); Sato and Crutchfield (2003); Sato et al. (2005); Galla (2009, 2011); Bladon and Galla (2011); Realpe-Gomez et al. (2012); Sanders et al. (2012); Galla and Farmer (2013); Aloric et al. (2016), and one from machine learning in computer science Tuyls et al. (2003); Bloembergen et al. (2015); Tuyls and Nowé (2005); Tuyls et al. (2006); Tuyls and Parsons (2007); Kaisers and Tuyls (2010); Hennes et al. (2009); Vrancx et al. (2008); Hennes et al. (2010). Thus, Sato and Crutchfield Sato and Crutchfield (2003) and Tuyls et al. Tuyls et al. (2003) independently deduced identical learning equations in 2003.

The statistical physics articles usually consider the deterministic limit of the stochastic learning equations, assuming infinitely many interactions between the agents before an adaptation of behavior occurs. This limit can either be performed in continuous time with differential equations Sato et al. (2002); Sato and Crutchfield (2003); Sato et al. (2005) or discrete time with difference equations Galla (2009, 2011); Bladon and Galla (2011). The differences between both variants can be significant Galla (2011); Realpe-Gomez et al. (2012). Deterministic chaos was found to emerge when learning simple Sato et al. (2002) as well as complicated games Galla and Farmer (2013). Relaxing the assumption of infinitely many interactions between behavior updates revealed that noise can change the attractor of the learning dynamics significantly, e.g. by noise-induced oscillations Galla (2009, 2011).

However, these statistical physics studies so far considered only repeated normal form games. These are games where the payoff depends solely on the set of current actions, typically encoded in the entries of a payoff matrix (for the typical case of two players). Receiving payoff and choosing another set of joint actions is performed repeatedly. This setup lacks the possibility to study dynamically changing environments and their interplay with multiple agents. In those systems, rewards do not depend only on the joint action of agents, but also on the states of the environment. Environmental state changes may occur probabilistically and depend also on joint actions and the current state. Such a setting is also known as a Markov game or stochastic game Shapley (1953); Mertens and Neyman (1981). Thus, a repeated normal form game is a special case of a stochastic game with only one environmental state. Notably Akiyama and Kaneko Akiyama and Kaneko (2000, 2002) did emphasize the importance of a dynamically changing environment, however did not utilize a reinforcement learning update scheme.

The computer science machine learning community dealing with reinforcement learning as a dynamical system (see Bloembergen et al. (2015) for an overview) particularly emphasizes the link between evolutionary game theory and multi-agent reinforcement learning as a well grounded theoretical framework for the latter Bloembergen et al. (2015); Tuyls and Nowé (2005); Tuyls et al. (2006); Tuyls and Parsons (2007). This dynamical systems perspective is proposed as a way to gain qualitative insights about the variety of multi-agent reinforcement learning algorithms (see Busoniu et al. (2008) for a review). Consequently, this literature developed a focus on the translation of established reinforcement learning algorithms to a dynamical systems description, as well as the development of new algorithms based on insights of a dynamical systems perspective. While there is more work on stateless games (e.g. Q learning Tuyls et al. (2003), frequency adjusted multi-agent Q learning Kaisers and Tuyls (2010)), multi-agent learning dynamics for multi-state environments have been developed as well, such as the piecewise replicator dynamics Vrancx et al. (2008), the state-coupled replicator dynamics Hennes et al. (2009) or the reverse engineering state-coupled replicator dynamics Hennes et al. (2010).

Both communities, statistical physics and machine learning, share the interest in better qualitative insights into multi-agent learning dynamics. While the statistical physics community focuses more on dynamical properties the same set of learning equations can produce, it leaves a research gap of learning equations capable of handling multiple environmental states. The machine learning community on the other hand aims more towards algorithm development, but so far put their focus less on a dynamical systems understanding. Taken together, there is the challenge of developing a dynamical systems theory of multi-agent learning dynamics in varying environmental states.

With this work, we aim to contribute to such a dynamical systems theory of multi-agent learning dynamics. We present a novel methodology for obtaining the deterministic limit of multi-state temporal difference reinforcement learning. In essence, it consists of formulating the temporal difference error for batch learning, and sending the batch size to infinity. We showcase our approach with the three prominent learning variants of Q learning, SARSA learning and Actor-Critic learning. Illustrations of their learning dynamics reveal multiple different dynamical regimes, such as fixed points, periodic orbits and deterministic chaos.

In Sec. II we introduce the necessary background and notation. Sec. III presents our method to obtain the deterministic limit of temporal difference reinforcement learning, and demonstrates it for multi-state Q learning, SARSA learning and Actor-Critic learning. We illustrate their learning dynamics for two previously utilized two-agent, two-action, two-state environments in Sec. IV. In Sec. V we conclude with a discussion of our work.

II Preliminaries

We introduce the components (incl. notation) of our multi-agent environment systems (see Fig. 1), followed by a brief introduction of temporal difference reinforcement learning.

Figure 1: Multi-agent Markov environment (also known as a stochastic or Markov game). Agents choose a joint action from their action sets, based on the current state of the environment, according to their behavior profiles. This changes the state of the environment with the corresponding transition probability and provides each agent with a reward.

II.1 Multi-agent Markov environments

A multi-agent Markov environment (also called stochastic game or Markov game) consists of $N$ agents. The environment can exist in $Z$ states. In each state, each agent $i$ has the same number $M$ of available actions to choose from. Having an identical number of actions for all states and all agents is a notational convenience, not a significant restriction. A joint action of all agents is referred to by $\mathbf{a} = (a^1, \dots, a^N)$; the joint action of all agents but agent $i$ is denoted by $\mathbf{a}^{-i}$.

Environmental dynamics are given by the probabilities for state changes, expressed as a transition tensor $\mathbf{T}$. The entry $T(s, \mathbf{a}, s')$ denotes the probability that the environment transitions to state $s'$, given the environment was in state $s$ and the agents have chosen the joint action $\mathbf{a}$. Hence, $\sum_{s'} T(s, \mathbf{a}, s') = 1$ must hold for all $s$ and $\mathbf{a}$. The assumption that the next state only depends on the current state and joint action makes our system Markovian. We here restrict ourselves to ergodic environments without absorbing states (cf. Hennes et al. (2010)).

The rewards receivable by the agents are given by the reward tensor $\mathbf{R}$. The entry $R^i(s, \mathbf{a}, s')$ denotes the reward agent $i$ receives when the environment transits from state $s$ to state $s'$ under the joint action $\mathbf{a}$. Rewards are also called payoffs from a game theoretic perspective.

Agents draw their actions from their behavior profile $\mathbf{X}$. The entry $X^i_{sa} = X^i(s, a)$ denotes the probability that agent $i$ chooses action $a$ in state $s$. Thus, $\sum_a X^i(s, a) = 1$ must hold for all $i$ and all $s$. We here focus on the case of independent agents, able to fully observe the current state of the environment. With correlated behavior (see e.g. Busoniu et al. (2008)) and partially observable environments Spaan (2012); Oliehoek (2012) one could extend the multi-agent environment systems to be even more general. Note that what we call behavior profile is usually termed policy from a machine learning perspective or behavioral strategy from a game theoretic perspective. We chose to introduce our own term because policies and strategies suggest a deliberate choice which we do not want to impose.
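
To make these objects concrete, the following sketch sets up the three tensors as plain NumPy arrays for a small random example. The array names, the toy sizes and the random entries are illustrative choices made here, not notation prescribed by the model.

```python
import numpy as np

N, Z, M = 2, 2, 2   # number of agents, environmental states and actions per agent (toy sizes)
rng = np.random.default_rng(0)

# Transition tensor T[s, a1, a2, s']: probability of reaching s' from s under joint action (a1, a2)
T = rng.random((Z, M, M, Z))
T /= T.sum(axis=-1, keepdims=True)      # normalize over next states s'

# Reward tensor R[i, s, a1, a2, s']: reward of agent i for the transition s -> s' under (a1, a2)
R = rng.random((N, Z, M, M, Z))

# Behavior profile X[i, s, a]: probability that agent i chooses action a in state s
X = rng.random((N, Z, M))
X /= X.sum(axis=-1, keepdims=True)      # normalize over actions a

assert np.allclose(T.sum(axis=-1), 1.0) and np.allclose(X.sum(axis=-1), 1.0)
```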

II.2 Averaging out behavior and environment

We define a notational convention that allows a systematic averaging over the current behavior profile $\mathbf{X}$ and the environmental transitions $\mathbf{T}$. It will be used throughout the paper.

Averaging over the whole behavioral profile yields

$\langle \bullet \rangle_{X}(s) = \sum_{a^1} \cdots \sum_{a^N} \Big( \prod_{j=1}^{N} X^j(s, a^j) \Big)\, \bullet. \qquad (1)$

Here, serves as a placeholder. If the quantity to be inserted for depends on the summation indices, then those indices will be summed over as well. If the quantity, which is averaged out, is used in tensor form, it is written in bold. If not, remaining indices are added after the right angle bracket.

Averaging over the behavioral profile of the other agents, keeping the action of agent , yields

$\langle \bullet \rangle_{X^{-i}}(s, a) = \sum_{\mathbf{a}^{-i}} \Big( \prod_{j \neq i} X^j(s, a^j) \Big)\, \bullet. \qquad (2)$

Last, averaging over the subsequent state yields

$\langle \bullet \rangle_{T}(s, \mathbf{a}) = \sum_{s'} T(s, \mathbf{a}, s')\, \bullet. \qquad (3)$

Of course, these operations may also be combined, e.g. as $\langle \langle \bullet \rangle_T \rangle_X$ and $\langle \langle \bullet \rangle_T \rangle_{X^{-i}}$, by multiplying both summations.

For example, given a behavior profile $\mathbf{X}$, the resulting effective Markov chain transition matrix reads $\mathbf{T}_X$ with entries $T_X(s, s') = \langle T(s, \mathbf{a}, s') \rangle_X(s)$, which encode the transition probabilities from state $s$ to $s'$. From $\mathbf{T}_X$ the stationary distribution $\mathbf{P}_X$ of environmental states can be computed. $\mathbf{P}_X$ is the left eigenvector of $\mathbf{T}_X$ corresponding to the eigenvalue 1. Its entries encode the ratios of the average durations the agents find themselves in the respective environmental states.

The average reward agent $i$ receives from state $s$ under action $a$, given all other agents follow the behavior profile $\mathbf{X}$, reads $\langle R^i \rangle_{X^{-i}}(s, a) = \langle \langle R^i(s, \mathbf{a}, s') \rangle_T \rangle_{X^{-i}}(s, a)$. Including agent $i$’s behavior profile gives the average reward it receives from state $s$: $\langle R^i \rangle_X(s) = \langle \langle R^i(s, \mathbf{a}, s') \rangle_T \rangle_X(s)$. Hence, $\langle R^i \rangle_X(s) = \sum_a X^i(s, a)\, \langle R^i \rangle_{X^{-i}}(s, a)$ holds.
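
As a minimal sketch of these averaging operations for the two-agent case, the helper functions below compute the effective transition matrix and its stationary distribution from the arrays introduced above; function names and the array layout are assumptions carried over from the previous sketch.

```python
import numpy as np

def effective_transition_matrix(T, X):
    """T_X[s, s']: transition tensor averaged over both agents' behavior profiles (two-agent case)."""
    return np.einsum('sa,sb,sabt->st', X[0], X[1], T)

def stationary_distribution(T_X):
    """Stationary state distribution P_X: left eigenvector of T_X for eigenvalue 1, normalized."""
    vals, vecs = np.linalg.eig(T_X.T)
    p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return p / p.sum()
```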

II.3 Agent’s preferences and values

Typically, agents are assumed to maximize their exponentially discounted sum of future rewards, called the return, $G^i(t) = (1 - \gamma^i) \sum_{k=0}^{\infty} (\gamma^i)^k\, r^i(t + k)$, where $\gamma^i \in [0, 1)$ is the discount factor of agent $i$ and $r^i(t)$ denotes the reward received by agent $i$ at time step $t$; the prefactor $(1 - \gamma^i)$ normalizes the return to the scale of the rewards. Exponential discounting is most commonly used for its mathematical convenience and because it ensures consistent preferences over time. Other formulations of a return use e.g. finite time horizons, the average reward setting, as well as other ways of discounting, such as hyperbolic discounting. Those other forms require their own form of reinforcement learning.

Given a behavior profile $\mathbf{X}$, the expected return defines the state-value function $V^i_X(s) = \mathbb{E}_X[G^i(t) \,|\, s(t) = s]$, which is independent of time $t$. Inserting the return yields the Bellman equation Bellman (1957)

$V^i_X(s) = \Big\langle \big\langle (1 - \gamma^i)\, R^i(s, \mathbf{a}, s') + \gamma^i\, V^i_X(s') \big\rangle_T \Big\rangle_X(s). \qquad (4)$

Thus, the value of a state is the discounted value of the subsequent state plus $(1 - \gamma^i)$ times the reward received along the way. Evaluating the expected value of the behavior profile and writing it in matrix form, we get:

$\mathbf{V}^i_X = (1 - \gamma^i)\, \langle \mathbf{R}^i \rangle_X + \gamma^i\, \mathbf{T}_X \mathbf{V}^i_X. \qquad (5)$

A solution of the state-values can be obtained using matrix inversion

$\mathbf{V}^i_X = (1 - \gamma^i)\, \big( \mathbb{1} - \gamma^i\, \mathbf{T}_X \big)^{-1} \langle \mathbf{R}^i \rangle_X. \qquad (6)$

The computational complexity of matrix inversion makes this solution strategy infeasible for large systems. Therefore many iterative solution methods exist Wiering and van Otterlo (2012).

Equivalently, state-action-value functions $Q^i_X(s, a)$ are defined as the expected return, given agent $i$ applied action $a$ in state $s$ and then followed the behavior profile $\mathbf{X}$ accordingly. They can be computed via

$Q^i_X(s, a) = (1 - \gamma^i)\, \langle R^i \rangle_{X^{-i}}(s, a) + \gamma^i \sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, V^i_X(s'). \qquad (7)$

One can show that $V^i_X(s) = \sum_a X^i(s, a)\, Q^i_X(s, a)$ holds for the inverse relation of state-action-values and state-values.
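
The state and state-action values can be computed directly by matrix inversion. The sketch below does this for the two-agent case, using the $(1-\gamma^i)$-weighted reward convention described above and the toy arrays from the earlier sketch; it is an illustration under these assumptions, not the authors' reference implementation.

```python
import numpy as np

def state_values(T, R, X, i, gamma):
    """V^i_X(s) via Eq. 6: (1 - gamma) (1 - gamma T_X)^{-1} <R^i>_X  (two-agent case)."""
    T_X = np.einsum('sa,sb,sabt->st', X[0], X[1], T)
    R_X = np.einsum('sa,sb,sabt,sabt->s', X[0], X[1], T, R[i])   # <R^i>_X(s)
    n_states = T_X.shape[0]
    return (1.0 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * T_X, R_X)

def state_action_values(T, R, X, i, gamma):
    """Q^i_X(s, a) via Eq. 7, averaging over the other agent's behavior and the next state."""
    V = state_values(T, R, X, i, gamma)
    j = 1 - i                                                    # the other agent (two-agent case)
    if i == 0:
        T_avg = np.einsum('sb,sabt->sat', X[j], T)               # <T>_{X^-i}(s, a, s')
        R_avg = np.einsum('sb,sabt,sabt->sa', X[j], T, R[i])     # <R^i>_{X^-i}(s, a)
    else:
        T_avg = np.einsum('sa,sabt->sbt', X[j], T)
        R_avg = np.einsum('sa,sabt,sabt->sb', X[j], T, R[i])
    return (1.0 - gamma) * R_avg + gamma * np.einsum('sat,t->sa', T_avg, V)
```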

II.4 Learning through reinforcement

In contrast to the typical game theoretic assumption of perfect information we assume that agents know nothing about the game in advance. They can only gain information about the environment and other agents through interactions. They do not know the true reward tensor or the true transition probabilities . They experience only reinforcements (i.e. particular rewards ), while observing the current true Markov state of the environment.

II.4.1 Temporal difference learning

In essence, state-action-propensities $Q^i_{sa}(t)$ get iteratively updated by a temporal difference error $\delta^i_{sa}(t)$:

$Q^i_{sa}(t + 1) = Q^i_{sa}(t) + \alpha^i\, \delta^i_{sa}(t), \qquad (8)$

with $\alpha^i$ being the learning rate of agent $i$. These state-action-propensities can be interpreted as estimates of the state-action values $Q^i_X(s, a)$.

The temporal difference error expresses a difference in the estimation of state-action values. New experience is used to compute a new estimate of the current state-action value, which is corrected by the old estimate. The estimate from the new experience uses exactly the recursive relation of value functions from the Bellman equation (Eq. 4),

$\delta^i_{sa}(t) = \delta_{s\, s(t)}\, \delta_{a\, a(t)} \Big[ (1 - \gamma^i)\, r^i(t) + \gamma^i\, \mathcal{V}^i_t\big(s(t+1)\big) - \mathcal{V}^i_t\big(s(t)\big) \Big]. \qquad (9)$

Here, $\mathcal{V}^i_t\big(s(t+1)\big)$ indicates the estimate at time step $t$ of the value of the state visited at the next time step $t+1$, and $\mathcal{V}^i_t\big(s(t)\big)$ denotes the estimate at time step $t$ of the value of the current state $s(t)$. Different choices for these estimates are possible, leading to different learning variants (see below).

The Dirac deltas indicate that the temporal difference error for the state-action pair $(s, a)$ is only non-zero when $(s, a)$ was actually visited in time step $t$. This emphasizes that agents can only learn from experience. In contrast, e.g. experience-weighted-attraction learning Camerer and Ho (1999) assumes that action-propensities can be updated with hypothetical rewards an agent would have received if it had played an action different from the current one. These two cases have been referred to as full vs. partial information Marsili et al. (2000). Thus, the Dirac deltas in Eq. 9 indicate a partial information update. The agents use only information experienced through interaction.

The state-action-propensities are translated to a behavior profile according to the Gibbs/Boltzmann distribution Sutton and Barto (1998) (also called softmax)

$X^i_{sa}(t) = \frac{e^{\beta^i Q^i_{sa}(t)}}{\sum_b e^{\beta^i Q^i_{sb}(t)}}. \qquad (10)$

The behavior profile has thereby become a dynamic variable as well. The parameter $\beta^i$ controls the intensity of choice or the exploitation level of agent $i$, governing the exploration-exploitation trade-off. For high $\beta^i$, agents tend to exploit their learned knowledge about the environment, leaning towards actions with high estimated state-action value. For low $\beta^i$, agents are more likely to deviate from these high value actions in order to explore the environment further, with the chance of finding actions that eventually lead to even higher rewards. Other behavior profile translations exist as well (e.g. $\epsilon$-greedy Sutton and Barto (1998)).
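
A minimal sketch of the propensity update (Eq. 8) and its softmax translation (Eq. 10) for one agent's tables could look as follows; the max-subtraction is only for numerical stability and leaves the distribution unchanged.

```python
import numpy as np

def softmax_policy(Q_prop, beta):
    """Translate state-action propensities Q_prop[s, a] into a behavior profile (Eq. 10), per state."""
    logits = beta * Q_prop
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability only
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def propensity_update(Q_prop, td_error, alpha):
    """One propensity update step (Eq. 8)."""
    return Q_prop + alpha * td_error
```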

II.4.2 Three learning variants

The specific choices of the value estimates in the temporal difference error result in different reinforcement learning variants.

Q learning.

For the Q learning algorithm Sutton and Barto (1998); Wiering and van Otterlo (2012), $\mathcal{V}^i_t\big(s(t+1)\big) = \max_b Q^i_{s(t+1) b}(t)$ and $\mathcal{V}^i_t\big(s(t)\big) = Q^i_{s(t) a(t)}(t)$. Thus, the Q learning update takes the maximum of the next state-action-propensities as an estimate of the value for the next state, regardless of the actual next action the agent plays. This is reasonable, because the maximum is the highest value achievable given the current knowledge.

SARSA learning.

For SARSA learning Sutton and Barto (1998); Wiering and van Otterlo (2012), $\mathcal{V}^i_t\big(s(t+1)\big) = Q^i_{s(t+1) a(t+1)}(t)$ and $\mathcal{V}^i_t\big(s(t)\big) = Q^i_{s(t) a(t)}(t)$, where $a(t+1)$ denotes the action taken by agent $i$ at the next time step $t+1$. Thus, the SARSA algorithm uses the five ingredients of an update sequence of State, Action, Reward, next State, next Action to perform one update. In practice, the SARSA sequence has to be shifted one time step back to know what the actual ”next” action of the agent was.

Actor-Critic (AC) learning.

For AC learning Sutton and Barto (1998); Wiering and van Otterlo (2012), $\mathcal{V}^i_t\big(s(t+1)\big) = \bar{V}^i_{s(t+1)}(t)$ and $\mathcal{V}^i_t\big(s(t)\big) = \bar{V}^i_{s(t)}(t)$. It has an additional data structure of state-value approximations $\bar{V}^i_s(t)$, which get separately updated with the temporal difference error of the visited state. The state-action-propensities serve as the actor, which gets criticized by the state-value approximations $\bar{V}^i_s(t)$.

Tab. 1(a) summarizes the value estimates for these three learning variants. Q and SARSA learning are structurally more similar compared to the Actor-Critic learner, which has the additional data structure of state-value approximations $\bar{V}^i_s(t)$.

(a) Batch size $K = 1$ (online update)

Learning variant | value estimate of the next state $\mathcal{V}^i_t\big(s(t+1)\big)$ | value estimate of the current state $\mathcal{V}^i_t\big(s(t)\big)$
Q learning | $\max_b Q^i_{s(t+1) b}(t)$ | $Q^i_{s(t) a(t)}(t)$
SARSA learning | $Q^i_{s(t+1) a(t+1)}(t)$ | $Q^i_{s(t) a(t)}(t)$
Actor-Critic (AC) learning | $\bar{V}^i_{s(t+1)}(t)$ | $\bar{V}^i_{s(t)}(t)$

(b) Batch size $K \to \infty$ (deterministic limit)

Learning variant | value estimate of the next state | value estimate of the current state
Q learning | $\sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, \max_b Q^i_X(s', b)$ | $\frac{1}{\beta^i} \ln X^i_{sa}$
SARSA learning | $\sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, \sum_b X^i(s', b)\, Q^i_X(s', b)$ | $\frac{1}{\beta^i} \ln X^i_{sa}$
Actor-Critic (AC) learning | $\sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, V^i_X(s')$ | /

Table 1: Overview of the three reinforcement learning variants. Shown in the columns are the value estimates for the next state and the current state for both ends of the batch size spectrum: (a) $K = 1$ and (b) $K \to \infty$.
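
For the online case ($K = 1$), the three value-estimate choices of Table 1(a) amount to a single branching step when forming the temporal difference error of one experienced transition. The sketch below, written with the $(1-\gamma)$-weighted reward convention and illustrative variable names, is meant only to spell out this branching.

```python
def td_error_online(variant, Q_prop, V_bar, s, a, r, s_next, a_next, gamma):
    """Temporal difference error of one experienced transition (batch size K = 1).

    Q_prop[s, a]: state-action propensities; V_bar[s]: state-value estimates (Actor-Critic only).
    """
    if variant == "Q":
        next_estimate, current_estimate = Q_prop[s_next].max(), Q_prop[s, a]
    elif variant == "SARSA":
        next_estimate, current_estimate = Q_prop[s_next, a_next], Q_prop[s, a]
    elif variant == "AC":
        next_estimate, current_estimate = V_bar[s_next], V_bar[s]
    else:
        raise ValueError(f"unknown variant: {variant}")
    return (1.0 - gamma) * r + gamma * next_estimate - current_estimate
```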

III Deterministic limit

In this section we will derive a novel methodology to obtain the deterministic limit of temporal difference reinforcement learning. We showcase our method in the three learning variants of Q, SARSA and Actor-Critic learning. For the statistical physics community, the novelty consists of learning equations, capable of handling environmental state transitions. For the machine learning community the novelty lies in the systematic methodology we use to obtain the deterministic learning equations. Note that these deterministic learning equations will not depend on the state-action-propensities anymore, being iterated maps of the behavior profile alone.

Following e.g. Sato and Crutchfield (2003); Sato et al. (2005); Bladon and Galla (2011), we first combine Eq. 10 with a propensity update and obtain

$X^i_{sa}(t + 1) = \frac{X^i_{sa}(t)\, e^{\alpha^i \beta^i \delta^i_{sa}(t)}}{\sum_b X^i_{sb}(t)\, e^{\alpha^i \beta^i \delta^i_{sb}(t)}}. \qquad (11)$

Next, we formulate the temporal difference error for batch learning.

III.1 Batch learning

By batch learning we mean that several time steps of interaction with the environment and the other agents take place before an update of the state-action-propensities and the behavior profile occurs. It has also been interpreted as a form of history replay Lange et al. (2012), which is essential to stabilize the learning process when function approximation (e.g. by deep neural networks) is used Mnih et al. (2015). History (i.e. already experienced state, action, next state triples) is used again for an update of the state-action-propensities.

Imagine that the information from these interactions is stored inside a batch of size $K$. We introduce the corresponding temporal difference error of batch size $K$:

(12)

where $K^i_{sa}$ denotes the number of times the state-action pair $(s, a)$ was visited within the batch. If the state-action pair was never visited, the corresponding temporal difference error is set to zero. The agents interact $K$ times under the same behavior and use the sample average as the new estimate for the value of the state-action pair $(s, a)$.

The state-action-propensities update then follows

(13)

The temporal difference error of Eq. 9 is recovered as the special case of batch size $K = 1$.

III.2 Separation of timescales

We obtain the deterministic limit of the temporal difference learning dynamics by sending the batch size to infinity, $K \to \infty$.

Equivalently, this can be regarded as a separation of timescales. Two processes can be distinguished during an update of the state-action-propensities : adaptation and interaction,

(14)

By separating the timescales of both processes, we assume that (infinitely) many interactions happen before one step of behavior profile adaptation occurs.

Under this assumption, and because of the assumed ergodicity, one can replace the sample average, i.e. the sum over sequences of states and actions, with the behavior profile average, i.e. the sum over state-action behavior and transition probabilities, according to

(15)

For example, the immediate reward $r^i(t)$ in the temporal difference error becomes the behavior- and transition-averaged reward $\langle R^i \rangle_{X^{-i}}(s, a)$. The time $t$ gets rescaled accordingly as well.

Taking the limit in this way, we choose to stay in discrete time, leaving the continuous time limit following Sato and Crutchfield (2003); Sato et al. (2005); Galla and Farmer (2013) for future work.
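
The conversion rule can be checked numerically: for a growing batch, the sample average of the rewards experienced in a given state-action pair approaches the behavior- and transition-averaged reward. The sketch below, reusing the toy tensors from Sec. II, is only such an illustrative check under those assumptions.

```python
import numpy as np

def sample_vs_behavior_average(T, R, X, K=100_000, seed=1):
    """Compare agent 0's batch sample average of rewards per (s, a) with <R^0>_{X^-0}(s, a)."""
    rng = np.random.default_rng(seed)
    Z_states, M = X.shape[1], X.shape[2]
    sums = np.zeros((Z_states, M))
    counts = np.zeros((Z_states, M))
    s = 0
    for _ in range(K):                                   # K interactions under the fixed behavior X
        a0 = rng.choice(M, p=X[0, s])
        a1 = rng.choice(M, p=X[1, s])
        s_next = rng.choice(Z_states, p=T[s, a0, a1])
        sums[s, a0] += R[0, s, a0, a1, s_next]
        counts[s, a0] += 1
        s = s_next
    sample_avg = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    behavior_avg = np.einsum('sb,sabt,sabt->sa', X[1], T, R[0])   # <R^0>_{X^-0}(s, a)
    return sample_avg, behavior_avg
```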

III.3 Three learning variants

Next, we present the deterministic limit of the temporal difference error of the three learning variants of Q, SARSA and Actor-Critic learning. Inserting them into Eq. 11 yields the complete description of the behavior profile update in the deterministic limit. Tab. 1 presents an overview of the resulting equations and a comparison to their batch size $K = 1$ versions.

III.3.1 Q learning

The temporal difference error of Q learning consists of three terms: i) the immediate reward, ii) the value estimate of the next state and iii) the value estimate of the current state. As already stated, under $K \to \infty$ the immediate reward becomes the averaged reward $\langle R^i \rangle_{X^{-i}}(s, a)$. The value estimate of the next state becomes the quantity

$\sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, \max_b Q^i_X(s', b), \qquad (16)$

using the deterministic limit conversion rule (Eq. 15) and the state-action value of the behavior profile according to Eq. 7.

For the third term, we invert Eq. 10, yielding $Q^i_{sa}(t) = \frac{1}{\beta^i} \ln X^i_{sa}(t) + C^i_s$, where $C^i_s$ is constant in actions, but may vary for each agent and state. Now, one can show that the dynamics induced by Eq. 11 are invariant against additive transformations of the temporal difference error which are constant in actions. Thus, the third term can be converted according to $Q^i_{s(t) a(t)}(t) \to \frac{1}{\beta^i} \ln X^i_{sa}(t)$.

Altogether, the temporal difference error for Q learning in the deterministic limit reads

$\delta^i_{sa} = (1 - \gamma^i)\, \langle R^i \rangle_{X^{-i}}(s, a) + \gamma^i \sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, \max_b Q^i_X(s', b) - \frac{1}{\beta^i} \ln X^i_{sa}. \qquad (17)$

III.3.2 SARSA learning

Two of the three terms of the SARSA temporal difference error are identical to those of Q learning, leaving the value estimate of the next state, which we replace by

$\sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s') \sum_b X^i(s', b)\, Q^i_X(s', b), \qquad (18)$

using again the deterministic limit conversion rule (Eq. 15) and the state-action-value of the behavior profile according to Eq. 7.

Thus, the temporal difference error for the SARSA learning update in the deterministic limit reads

$\delta^i_{sa} = (1 - \gamma^i)\, \langle R^i \rangle_{X^{-i}}(s, a) + \gamma^i \sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s') \sum_b X^i(s', b)\, Q^i_X(s', b) - \frac{1}{\beta^i} \ln X^i_{sa}. \qquad (19)$

III.3.3 Actor-Critic (AC) learning

For the temporal difference error of AC learning we have to find replacements for i) the value estimate of the next state and ii) the value estimate of the current state. Applying again Eq. 15 to the first term yields the quantity

$\sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, V^i_X(s'), \qquad (20)$

using Eq. 6 for the state-values $V^i_X(s')$. This is the average value of the next state, given that in the current state $s$ the agent took action $a$. One can show that this term is identical to the corresponding term (Eq. 18) from the SARSA update.

The second remaining term belongs to the slower adaptation timescale, or in other words: it occurs outside the batch. Thus, our deterministic limit conversion rule (Eq. 15) does not apply. We could think of a conversion to the state value of the current state under the joint behavior profile. However, such a term is constant in actions, and therefore irrelevant for the dynamics, as we have argued above. Thus, we can simply set it to zero.

Altogether, the temporal difference error of the Actor-Critic learner in the deterministic limit reads

$\delta^i_{sa} = (1 - \gamma^i)\, \langle R^i \rangle_{X^{-i}}(s, a) + \gamma^i \sum_{s'} \langle T \rangle_{X^{-i}}(s, a, s')\, V^i_X(s'). \qquad (21)$
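
Putting the pieces together, the sketch below implements the deterministic temporal difference errors of Eqs. 17, 19 and 21 and the behavior update of Eq. 11 for the two-agent case. It builds on the helper functions state_values and state_action_values sketched in Sec. II and should be read as an illustrative reconstruction under the stated conventions, not as the authors' reference code.

```python
import numpy as np

def deterministic_td_error(variant, T, R, X, i, gamma, beta):
    """Deterministic-limit TD error delta^i_{sa} for agent i (variant in {"Q", "SARSA", "AC"}).

    Assumes strictly positive behavior profiles X, and the two-agent tensors of the earlier sketches.
    """
    V = state_values(T, R, X, i, gamma)            # from the sketch in Sec. II
    Q = state_action_values(T, R, X, i, gamma)
    j = 1 - i
    if i == 0:
        T_avg = np.einsum('sb,sabt->sat', X[j], T)             # <T>_{X^-i}(s, a, s')
        R_avg = np.einsum('sb,sabt,sabt->sa', X[j], T, R[i])   # <R^i>_{X^-i}(s, a)
    else:
        T_avg = np.einsum('sa,sabt->sbt', X[j], T)
        R_avg = np.einsum('sa,sabt,sabt->sb', X[j], T, R[i])
    if variant == "Q":
        next_value = np.einsum('sat,t->sa', T_avg, Q.max(axis=-1))   # Eq. 16
    else:                                          # SARSA and AC share the next-state estimate
        next_value = np.einsum('sat,t->sa', T_avg, V)                # Eqs. 18 and 20 (equivalent)
    current_value = np.log(X[i]) / beta if variant in ("Q", "SARSA") else 0.0
    return (1.0 - gamma) * R_avg + gamma * next_value - current_value

def behavior_update(T, R, X, gammas, alphas, betas, variant):
    """One step of the deterministic map, Eq. 11, applied to all agents simultaneously."""
    X_new = np.empty_like(X)
    for i in range(X.shape[0]):
        delta = deterministic_td_error(variant, T, R, X, i, gammas[i], betas[i])
        weights = X[i] * np.exp(alphas[i] * betas[i] * delta)
        X_new[i] = weights / weights.sum(axis=-1, keepdims=True)
    return X_new
```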

IV Application to example environments

In the following we apply the derived deterministic learning equations in two different environments. Specifically, we compare the three well-established temporal difference learning variants (Q learning, SARSA learning and Actor-Critic (AC) learning) in two different two-agent ($N = 2$), two-action ($M = 2$) and two-state ($Z = 2$) environments. The two environments, a two-state Matching Pennies game and a two-state Prisoners Dilemma, have also been used in ref. Hennes et al. (2010). Note that we leave a comparison between the deterministic limit and the stochastic equations to future work, which would add a noise term to our equations following the example of ref. Galla (2009).

To measure the performance of an agent’s behavior profile in a single scalar, we use the dot product between the stationary state distribution $\mathbf{P}_X$ of the effective Markov chain with transition matrix $\mathbf{T}_X$ and the behavior-averaged reward $\langle \mathbf{R}^i \rangle_X$. Interestingly, we find this relation to be identical to the dot product of the stationary distribution and the state values $\mathbf{V}^i_X$:

$\mathbf{P}_X \cdot \langle \mathbf{R}^i \rangle_X = \mathbf{P}_X \cdot \mathbf{V}^i_X. \qquad (22)$

This relation can be shown by using Eq. 6 and the fact that $\mathbf{P}_X$ is a left eigenvector of $\mathbf{T}_X$ with eigenvalue 1.
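
A small numerical sanity check of this relation, reusing the helper functions sketched in Sec. II, could look as follows; it is an illustration of Eq. 22, not part of the derivation.

```python
import numpy as np

def average_reward(T, R, X, i, gamma):
    """Scalar performance of agent i: stationary distribution dotted with the behavior-averaged reward."""
    T_X = np.einsum('sa,sb,sabt->st', X[0], X[1], T)
    R_X = np.einsum('sa,sb,sabt,sabt->s', X[0], X[1], T, R[i])   # <R^i>_X(s)
    P = stationary_distribution(T_X)                             # from the sketch in Sec. II
    V = state_values(T, R, X, i, gamma)
    assert np.isclose(P @ R_X, P @ V)                            # Eq. 22: both dot products coincide
    return P @ R_X
```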

In the following examples we will only investigate homogeneous agents, i.e. agents whose parameters do not differ from each other. We will therefore drop the agent indices from the learning parameters $\alpha$, $\beta$ and $\gamma$. The heterogeneous agent case is to be explored in future work.

Figure 2: Two-state Matching Pennies. Rewards are given in black font in the payoff tables for each state. State-transition probabilities are indicated by blue arrows.
Figure 3: Three learners in the two-state Matching Pennies environment for a low discount factor and fixed intensity of choice. At the top, the temporal difference errors for the Q learner (Eq. 17), SARSA learner (Eq. 19) and Actor-Critic (AC) learner (Eq. 21) are shown in two phase space sections, one for each state. The arrows indicate the average direction the temporal difference error drives the learners towards, averaged over all phase space points of the other state. Additionally, selected trajectories are shown in the phase space sections, as well as by reward trajectories, plotting the average reward value (Eq. 22) over time steps. Crosses in the phase space subsections indicate the initial behavior. Circles signal the arrival at a fixed point, determined by the absolute difference of behavior profiles between two subsequent time steps falling below a small threshold. Trajectories are shown for two different learning rates (red and blue). The bold reward trajectory belongs to agent 1, the thin one to agent 2. Note that the temporal difference error is independent of the learning rate. A variety of qualitatively different dynamical regimes can be observed.

IV.1 Two-state Matching Pennies

The single state matching pennies game is a paradigmatic two-agent two-action game. Imagine the situation of soccer penalty kicks. The keeper (agent 1) can choose to jump either to the left or right side of the goal, the kicker (agent 2) can choose to kick the ball also either to the left or the right. If both agents choose the identical side, the keeper agent wins, otherwise the kicker agent.

In the two-state version of the game according to Hennes et al. (2010), the rules are extended as follows: In state 1 the situation is as described in the single-state version. Whenever agent 1 (the keeper) decides to jump to the left, the environment transitions to state 2, in which the agents switch roles: agent 1 now plays the kicker and agent 2 the keeper. From here, whenever agent 1 (now the kicker) decides to kick to the right side, the environment transitions back to state 1 and both agents switch their roles again.

Figure 2 illustrates this two-state Matching Pennies game. Formally, rewards are given by the payoff tables of Fig. 2, one for each state. By construction, the probability of transitioning to the other state is independent of agent 2’s action. Only agent 1 has agency over the state transitions. By playing a uniform random behavior profile both agents would obtain an average reward of 0.5 per time step.
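
A sketch of this environment as reward and transition tensors is given below. The description above fixes the structure; the concrete action indices (0 = left, 1 = right), the win/lose rewards of 1/0 and the deterministic state switches are assumptions made here for illustration.

```python
import numpy as np

Z, M = 2, 2   # state indices 0/1 correspond to states 1/2 in the text; action 0 = left, 1 = right (assumed)

# Payoff matrices A[s][a1, a2] for agent 1; agent 2 receives 1 - A (win = 1, lose = 0, assumed values)
A = np.array([[[1, 0],          # state 0: agent 1 is the keeper and wins on matching sides
               [0, 1]],
              [[0, 1],          # state 1: roles switch, agent 1 (now kicker) wins on mismatching sides
               [1, 0]]], dtype=float)

# Reward tensor R[i, s, a1, a2, s'] (independent of the next state here)
R = np.zeros((2, Z, M, M, Z))
R[0] = A[:, :, :, None]
R[1] = (1.0 - A)[:, :, :, None]

# Transitions depend only on agent 1's action (assumed deterministic):
T = np.zeros((Z, M, M, Z))
T[0, 0, :, 1] = 1.0   # state 0, agent 1 plays left  -> switch to state 1
T[0, 1, :, 0] = 1.0   # state 0, agent 1 plays right -> stay in state 0
T[1, 1, :, 0] = 1.0   # state 1, agent 1 plays right -> switch back to state 0
T[1, 0, :, 1] = 1.0   # state 1, agent 1 plays left  -> stay in state 1
assert np.allclose(T.sum(axis=-1), 1.0)
```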

With Fig. 3 we compare the temporal difference errors in the phase space sections at a comparably low discount factor for each environmental state, as well as learning trajectories for an exemplary initial condition and two learning rates $\alpha$, a low one and a high one. Overall, we observe a variety of qualitatively different dynamical regimes, such as fixed points, periodic orbits and chaotic motion.

Specifically, we see that Q learners and SARSA learners behave qualitatively similarly, in contrast to the AC learners, for both learning rates $\alpha$. For the low learning rate, Q and SARSA learners reach a fixed point of playing both actions with equal probability in both states, yielding a reward of 0.5. Due to the low $\alpha$, this takes approx. 600 time steps. In contrast, the reward trajectory of the AC learner appears to be chaotic. Figure 5 confirms this observation, which we will discuss in more detail below.

For the high learning rate, both Q and SARSA learners enter a periodic limit cycle. Differences in the trajectories of the Q and SARSA learner are clearly visible. The time average reward of this periodic orbit appears to be approx. 0.5 for each agent, identical to the reward of the fixed point at lower $\alpha$. The AC learner, however, converges to a fixed point after oscillating near the edges of the phase space. At this fixed point agent 1 plays action 1 with probability 1. Thus, it has trapped the system in state 2. Agent 2 plays action 2 with probability 1 and consequently agent 1 receives a reward of 1, whereas agent 2 receives 0 reward. One might ask: why does agent 2 not decrease its probability for playing action 2, thereby increasing its own reward? And indeed, the arrows of the temporal difference error suggest this change of behavior profile. However, agent 2 cannot follow, because its behavior is trapped at the boundary of the behavior simplex: once an action probability equals 0 or 1, the multiplicative update of Eq. 11 cannot change it anymore, regardless of the temporal difference error.

Figure 4: Two-state Matching Pennies environment for high discount factor ; otherwise identical to Fig. 3.

Increasing the discount factor, we observe the learning rate to set the timescale of learning (Fig. 4). The intensity of choice remained unchanged. A high learning rate corresponds to faster learning, in contrast to a low learning rate. Also, the ratio of learning timescales is comparable to the inverse ratio of learning rates. For both learning rates, Q and SARSA learners reach a fixed point, whereas the AC learners seem to move chaotically (details to be investigated below). Comparing the trajectories between the two learning rates, we observe a similar shape for each pair of learners. However, the similarity of the AC trajectories decreases at larger time steps.

So far, we varied two parameters: the discount factor $\gamma$ and the learning rate $\alpha$. Combining Figures 3 and 4, we investigated all four combinations of a low and a high $\gamma$ with a low and a high $\alpha$. We can summarize that Q and SARSA learners converge to a fixed point for all combinations of discount factor and learning rate, except when $\gamma$ is low and $\alpha$ simultaneously high. AC dynamics seem chaotic for all combinations of $\gamma$ and $\alpha$.

Figure 5: Varying discount factor and learning rate in the two-state Matching Pennies environment for fixed intensity of choice. On the left, the discount factor is varied at fixed learning rate, as indicated by the gray vertical lines on the right. On the right, the learning rate is varied at fixed discount factor, as indicated by the gray vertical lines on the left. The three top panels show the visited behavior points after a transient period for the Q learner (green), the SARSA learner (blue) and the Actor-Critic (AC) learner (red). Visited points are mapped to a function on the vertical axes to give a fuller image of the visited behavior profiles. The bottom panel shows the corresponding largest Lyapunov exponents for the three learners. Overall, Q and SARSA learners behave qualitatively more similarly than the Actor-Critic learner.

To investigate the relationship between the parameters more thoroughly, Figure 5 shows bifurcation diagrams with the bifurcation parameters and . Additionally, it also gives the largest Lyapunov exponents for each learner and each parameter combination. A largest Lyapunov exponent greater than zero is a key characteristic of chaotic motion. We computed the Lyapunov exponent from the analytically derived Jacobian matrix, iteratively used in a QR decomposition according to Sandri Sandri (1996). See Appendix A for details.
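
A generic sketch of such an iterated-QR (Benettin/Sandri-type) computation for a discrete map is given below; the map and its Jacobian are assumed to be supplied by the user, e.g. from the analytic expressions of Appendix A or from numerical differentiation, and act on the flattened behavior profile.

```python
import numpy as np

def lyapunov_exponents(step, jacobian, x0, n_iter=10_000, n_transient=1_000):
    """Lyapunov spectrum of a discrete map x -> step(x) via iterated QR decomposition.

    step(x) returns the next (flattened) state, jacobian(x) the Jacobian matrix of the map at x.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_transient):          # discard the transient
        x = step(x)
    dim = x.size
    Q_ortho = np.eye(dim)
    log_r = np.zeros(dim)
    for _ in range(n_iter):
        Q_ortho, R_tri = np.linalg.qr(jacobian(x) @ Q_ortho)
        log_r += np.log(np.abs(np.diag(R_tri)))
        x = step(x)
    return np.sort(log_r / n_iter)[::-1]  # largest exponent first
```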

The largest Lyapunov exponents for Q and SARSA learners align almost perfectly with each other, whereas the largest Lyapunov exponent of the AC learners behaves qualitatively differently. We first describe the behavior of the Q and SARSA learners: For high learning rates and low farsightedness, Figure 5 shows a periodic orbit with few (four) points in phase space. Largest Lyapunov exponents are distinctly below 0 in those regimes. Increasing the farsightedness, both learners enter a regime of visiting many points in phase space around the stable fixed point. The largest Lyapunov exponents are close to zero. With increasing $\gamma$, the spread around this fixed point solution decreases until the dynamics converge, from a farsightedness slightly greater than 0.5 onward. From there the largest Lyapunov exponent decreases again for further increasing $\gamma$. The same observations can be made along a decreasing bifurcation parameter $\alpha$, except that at the end, for low $\alpha$, the largest Lyapunov exponents do not decrease as distinctly as for high $\gamma$.

The behavior of the Actor-Critic dynamics is qualitatively different from that of Q and SARSA. The placement of the fixed points on the natural numbers grid suggests that the AC learners get confined to one of the 16 ($2^4$) corners of the phase space. No regularity as to which fixed point the AC learner converges can be deduced. The largest Lyapunov exponent is always above zero and shows an overall decreasing trend. Similarly, for a decreasing bifurcation parameter $\alpha$, the largest Lyapunov exponent tends to decrease as well. Different from the bifurcation diagram along $\gamma$, for low $\alpha$ the system might enter a periodic motion, but only for some parameter values. No regularity can be determined at which parameters the AC learners enter a periodic motion. A more thorough investigation of the nonlinear dynamics, especially those of the Actor-Critic learner, seems of great interest; it is, however, beyond the scope of this article and leaves promising paths for future work.

Figure 6: Varying the intensity of choice $\beta$ under constant $\alpha\beta$ in the two-state Matching Pennies environment for a fixed discount factor. On the left, trajectories of the three learners (Q: green, SARSA: blue, Actor-Critic (AC): red) are shown in the two phase space sections, one for each state. On the right, the corresponding reward trajectories are shown. Crosses in the phase space subsections indicate the initial behavior. The bold reward trajectory belongs to agent 1, the thin one to agent 2. One observes the deterministic limit of Actor-Critic learning to be invariant under constant $\alpha\beta$ and SARSA learning to converge to AC learning for $\beta \to \infty$.

Concerning the parameter $\beta$, the intensity of choice, one can infer from the update equations (Eq. 11 combined with Eq. 19 and Eq. 21) that the dynamics of the AC learner are invariant for a constant product $\alpha\beta$. This is because the temporal difference error of the Actor-Critic learner in the deterministic limit is independent of $\beta$. Further, the dynamics of the SARSA learner converge to the dynamics of the AC learner for $\beta \to \infty$. Figure 6 nicely confirms these two observations. Observing Tab. 1 is another way to see this: since the value estimate of the future state is identical for SARSA and AC learners, letting the value estimate of the current state vanish by sending $\beta \to \infty$ makes the SARSA learners approximate the AC learners.

As mentioned before, $\beta$ controls the exploration-exploitation trade-off. In the temporal difference errors of the Q and SARSA learners it appears in the term $\frac{1}{\beta^i} \ln X^i_{sa}$, indicating the value estimate of the current state. If this term dominates the temporal difference error (i.e. if $\beta$ is small), the learners tend towards the center of behavior space, i.e. towards uniformly random behavior, forgetting what they have learned about the obtainable reward. This characteristic happens to be favorable in our two-state Matching Pennies environment, which is why Q and SARSA learners perform better in finding the solution. On the other hand, if $\beta$ is large, the temporal difference error is dominated by the current reward and the future value estimate. Not being able to forget, the learners might get trapped in unfavorable behavior, as we can see observing the Actor-Critic learners. To calibrate $\beta$, it is useful to keep in mind that it must come in units of [log behavior] / [reward].

IV.2 Two-state Prisoners Dilemma

Figure 7: Two-state Prisoners Dilemma. Rewards are given in black font in the payoff tables for each state. State-transition probabilities are indicated by blue arrows.

The single-state Prisoners Dilemma is another paradigmatic two-agent, two-action game. It has been used to model social dilemmas and to study the emergence of cooperation. It describes a situation in which two prisoners are interrogated separately, leaving them with the choice to either cooperate with each other by not speaking to the police or to defect by testifying.

The two-state version, which has been used as a test-environment also in Vrancx et al. (2008); Hennes et al. (2009, 2010), extends this situation somewhat artificially by playing a Prisoner’s Dilemma in each of the two states with a transition probability of 10% from one state to the other if both agents chose the same action, and a transition probability of 90% if both agents chose opposite actions.

Figure 7 illustrates these game dynamics, with a Prisoner’s Dilemma payoff table applying in each of the two states as shown in Fig. 7. State transition probabilities depend only on whether the agents chose the same or opposite actions: the probability of remaining in the same state is 0.9 if both agents chose the same action and 0.1 if they chose opposite actions, for both states.
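
The transition structure just described translates directly into a transition tensor; the sketch below writes it out (the payoff tables themselves are shown in Fig. 7 and are not reproduced here).

```python
import numpy as np

Z, M = 2, 2
T = np.zeros((Z, M, M, Z))
for s in range(Z):
    for a1 in range(M):
        for a2 in range(M):
            p_switch = 0.1 if a1 == a2 else 0.9   # same action: 10% switch, opposite actions: 90% switch
            T[s, a1, a2, 1 - s] = p_switch
            T[s, a1, a2, s] = 1.0 - p_switch
assert np.allclose(T.sum(axis=-1), 1.0)
```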

A behavior profile in which one agent exploits the other in one state, while being exploited in the other state, would result in an average reward per time step of 5 for each agent.

Figure 8: Two-state Prisoners Dilemma environment for discount factor ; otherwise identical to Fig. 3.

However, for all three learning types with a mid-ranged farsightedness and intensity of choice, the temporal difference error arrows point on average towards the lower left defection-defection point in each state of phase space (Figure 8). To see whether the three learning types may converge to the described defect-cooperate, cooperate-defect equilibrium, individual trajectories from two exemplary initial conditions are shown for two learning rates, as before a small one and a high one.

We observe qualitatively different behavior across all three learners. The Q learners converge to equilibria with average rewards distinctly below 5, while the SARSA learners converge to equilibria with average rewards of almost 5, for both learning rates and both exemplary initial conditions. Both Q and SARSA learners converge to solutions of proper probabilistic behavior, i.e. choosing the action cooperate and the action defect with non-vanishing probability. The Actor-Critic learners, on the other hand, converge to the deterministic defect-cooperate, cooperate-defect behavior described above for the initial condition shown with the non-dashed lines in Figure 8, for both learning rates (shown in red and blue). For the other exemplary initial condition, shown with the dashed lines, they converge to an all-defection solution in both states for both learning rates.

Interestingly, for all learners, all combinations of initial conditions and learning rates converge to a fixed point solution, except for the Q learners with the comparably high learning rate, which enter a periodic behavior for the initial condition shown with the non-dashed line. The same phenomenon occurred also in the Matching Pennies environment for low farsightedness, however there for both Q and SARSA learners. It seems to be caused by the comparably high learning rate: a high learning rate overshoots the behavior update, resulting in a circling behavior around the fixed point. As in Figure 3, the time average reward of the periodic orbit seems to be comparable to the reward of the corresponding fixed point at lower $\alpha$. Furthermore, we observe the same time re-scaling effect of the learning rate in Figure 8 as in Figure 4.

Figure 9: Varying discount factor in two-state Prisoners Dilemma environment for learning rate and intensity of choice . The four top panels show the visited behavior points during iterations after a transient period of time steps from initial behavior in blue and from initial behavior in red for the Q learner on the left, the SARSA learner in the middle and the Actor-Critic learner on the right. The bottom panel shows the corresponding largest Lyapunov exponents for the two initial conditions. Above a critical discount factor all learners find the high rewarding solution from the red initial condition, but do not do so from the blue initial condition.

To visualize the influence of the discount factor on the converged behavior, Figure 9 shows a bifurcation diagram along the bifurcation parameter $\gamma$ for two initial conditions. Dots in blue result from a uniformly random initial behavior profile, whereas the dots in red started from a different initial behavior profile.

Across all learners, lower discount factors correspond to all-defect solutions, whereas for higher $\gamma$ the solutions from the initial condition shown in red tend towards the cooperate-defect, defect-cooperate solution. For low $\gamma$, the agents are less aware of the presence of other states and find the all-defect equilibrium solution of the iterated normal form Prisoner’s Dilemma. The state transition probabilities have less effect on the learning dynamics. Only above a certain farsightedness do the agents find the more rewarding cooperate-defect, defect-cooperate solution.

The observation from Figure 8 is confirmed that the probability to cooperate is lowest for the Q learner, mid-range for the SARSA learner and 1 for the Actor-Critic learner. One reason for this observation can be found in the intensity of choice parameter $\beta$. It balances the reward obtainable in the current behavior space segment with the forgetting of current knowledge in order to be open to new solutions. Such forgetting expresses itself by temporal difference error components pointing towards the center of behavior space. Thus, a relatively small $\beta$ can explain why solutions at the edge of the behavior space cannot be reached by Q and SARSA learners. The AC learner misses this forgetting term in the deterministic limit and can therefore easily enter behavior profiles at the edge of the behavior space.

Q and SARSA learners have a critical discount factor above which the cooperate-defect, defect-cooperate high reward solution is obtained and below which the all-defect low reward solution gets selected. However, for discount factors increasing further towards 1, Q and SARSA learners experience a drop in the probability of playing the cooperative action.

The Actor-Critic learners approach the cooperate-defect, defect-cooperate solution in two steps. For increasing $\gamma$, first the probability to cooperate of agent 2 in state 2 jumps from zero to one while agent 1 still defects in state 1. Only after a slight further increase of $\gamma$ does agent 1 then also cooperate in state 1.

Interestingly, for the uniformly random initial behavior condition shown in blue, there is no critical discount factor and no learner comes close to the cooperate-defect, defect-cooperate solution. Here, only for $\gamma$ close to 1 do all cooperation probabilities gradually increase. Furthermore, exactly at those $\gamma$ where the cooperate-defect, defect-cooperate solution is obtained from the initial behavior condition shown in red, the solutions from the uniformly random initial behavior condition (blue) have a largest Lyapunov exponent greater than 0. At other values of $\gamma$, the largest Lyapunov exponents for the two initial conditions overlap. This suggests that largest Lyapunov exponents greater than zero may point to the fact that other, perhaps more rewarding solutions may exist in phase space. A more thorough investigation regarding this multi-stability is an open point for future research.

V Discussion

The main contribution of this paper is the development of a technique to obtain the deterministic limit of temporal difference reinforcement learning. With our work we have combined, for the first time, the literature on learning dynamics from statistical physics with the evolutionary game theory-inspired learning dynamics literature from machine learning. For the statistical physics community, the novelty consists of learning equations, capable of handling environmental state transitions. For the machine learning community the novelty lies in the systematic methodology we have used to obtain the deterministic learning equations.

We have demonstrated our approach with the three prominent reinforcement learning algorithms from computer science: Q learning, SARSA learning and Actor-Critic learning. A comparison of their dynamics in two previously used two-agent, two-actions, two-states environments has revealed the existence of a variety of qualitatively different dynamical regimes, such as convergence to fixed points, periodic orbits and deterministic chaos.

We have found that Q and SARSA learners tend to behave qualitatively more similarly in comparison to the Actor-Critic learning dynamics. This characteristic results at least partly from our relatively low intensity of choice parameter $\beta$, controlling the exploration-exploitation trade-off via a forgetting term in the temporal difference errors. Sending $\beta \to \infty$, the SARSA learning dynamics approach the Actor-Critic learning dynamics, as we have shown. Overall, the Actor-Critic learners have a tendency to enter confining behavior profiles, due to their missing forgetting term. This characteristic leaves them trapped at the edges of the behavior space. In contrast, Q and SARSA learners do not show such learning behavior. Interestingly, this characteristic of the AC learners turns out to be favorable in the two-state Prisoners Dilemma environment, where they find the most rewarding solution in more cases compared to Q and SARSA, but it hinders the convergence to the fixed point solution in the two-state Matching Pennies environment.

We have demonstrated the effect of the learning rate $\alpha$, which adjusts the speed of learning and, within limits, thereby acts as a time re-scaling. A comparably large learning rate might cause limit cycles around a fixed point, thereby hindering the convergence to that point. Nevertheless, the average reward of the limit cycling behavior was approximately equal to that of the fixed point obtained at lower $\alpha$, but took fewer time steps to reach. Thus, perhaps other dynamical regimes than fixed points, such as limit cycles or strange attractors, could be of interest in some applications of reinforcement learning.

We have also shown the effect of the discount factor $\gamma$, describing the farsightedness of the agents. At low $\gamma$, the state transition probabilities have less effect on the learning dynamics compared to high discount factors.

We hope that our work might turn out useful for the application of reinforcement learning in various domains, with respect to parameter tuning, the design of new algorithms, and the analysis of complex strategic interactions using meta strategies, as Bloembergen et al. Bloembergen et al. (2015) have pointed out. In this regard, future work could extend the presented methodology to partial observability of the Markov states of the environment Spaan (2012); Oliehoek (2012), behavior profiles with history, and other-regarding agent (i.e. joint-action) learners (c.f. Busoniu et al. (2008) for an overview of other-regarding agent learning algorithms). Also, the combination of individual reinforcement learning and social learning through imitation Bandura (1977); Smolla et al. (2015); Barfuss et al. (2017); Banisch and Olbrich (2017) seems promising. Such endeavors would naturally lead to the exploration of network effects. It is important to note that only a few dynamical systems reinforcement learning studies have begun to incorporate network structures between agents Bladon and Galla (2011); Realpe-Gomez et al. (2012).

Apart from these more technical extensions, we are confident that our learning equations will prove useful when studying the evolution of cooperation in stochastic games Hilbe et al. (2018). With stochastic games one is able to explicitly account for a changing environment. Therefore, such studies are likely to contribute to the advancement of theoretical research on the sustainability of interlinked social-ecological systems Levin (2013) in the age of the Anthropocene Steffen et al. (2007). Interactions, synergies and trade-offs between social Dawes (1980); Macy and Flache (2002) and ecological Heitzig et al. (2016) dilemmas need to be explored. To advance the modeling endeavors of such social-ecological systems, especially on a global scale, requires a better understanding of agency and general purpose human behavior Donges et al. (2017). Learning is a suitable candidate for a general purpose behavior in such sustainability challenges Lindkvist and Norberg (2014); Lindkvist et al. (2017). By defining a default behavior profile, fruitful connections to the topology of sustainable management framework Heitzig et al. (2016) and viability theory Aubin (2009) could be explored. Also, modern measures of facets of social-ecological resilience, such as adaptability and transformability, could be developed within our framework Donges and Barfuss (2017). Eventually, learning agents could be used to discover resilient governance strategies. With respect to modeling human behavior, it is an open question to what extent behavioral theories from social science Schlüter et al. (2017) can be operationalized, yielding a broader spectrum of behavioral update equations. Ultimately, the comparison of the dynamics of different behavioral theories from social science would be of great interest. Critical to an increased understanding of social-ecological systems across scales is the role of agency Donges et al. (2017). Very roughly, agency shall describe the level of deliberate influence an agent, or a group of agents, has on the system. The framework of multi-agent environment systems that we work with seems a promising starting point for attempts to formalize agency and to investigate its influence in connection with particular behavioral update equations.

Acknowledgements.
This work was conducted in the framework of the project on Coevolutionary Pathways in the Earth System (COPAN) at the Potsdam Institute for Climate Impact Research (PIK). We are grateful for financial support by the Heinrich Böll Foundation, the Stordalen Foundation (via the Planetary Boundaries Research Network PB.net), the Earth League’s EarthDoc program and the Leibniz Association (project DOMINOES). We thank Jobst Heitzig for discussions and comments on the manuscript.

Appendix A Computation of Lyapunov Exponents

We compute the Lyapunov exponents using an iterative QR decomposition of the Jacobian matrix according to Sandri Sandri (1996). In the following we present the derivation of the Jacobian matrix.

Eq. 11 constitutes a map, which iteratively updates the behavior profile. Consequently, we can represent its derivative as a Jacobian tensor.

Let be the numerator of Eq. 11, and its denominator, i.e. . Hence,

(23)

or, more precisely, in components,

(24)

and are known, and if is known, is easily obtained by . Therefore we need to compute for the three learner types Q, SARSA and Actor-Critic learning.

A.1 Q learning

Let us rewrite for the Q learner according to

(25)

where we removed the estimate of the current value from the temporal difference error, leaving the truncated TD error as

(26)

Hence, we can write the derivative of as

(27)

Since , can be expressed as

(28)

The derivative of the truncated temporal difference error reads