Evolution of cooperation facilitated by reinforcement learning with adaptive aspiration levels
Abstract
Repeated interaction between individuals is the main mechanism for maintaining cooperation in social dilemma situations. Variants of tit-for-tat (repeating the previous action of the opponent) and the win-stay lose-shift strategy are known as strong competitors in iterated social dilemma games. On the other hand, real repeated interaction generally allows plasticity (i.e., learning) of individuals based on the experience of the past. Although plasticity is relevant to various biological phenomena, its role in repeated social dilemma games is relatively unexplored. In particular, if experience-based learning plays a key role in promotion and maintenance of cooperation, learners should evolve in the contest with nonlearners under selection pressure. By modeling players using a simple reinforcement learning model, we numerically show that learning enables the evolution of cooperation. We also show that numerically estimated adaptive dynamics appositely predict the outcome of evolutionary simulations. The analysis of the adaptive dynamics enables us to capture the obtained results as an affirmative example of the Baldwin effect, where learning accelerates the evolution to optimality.
1 Introduction
The mechanisms of cooperation in social dilemma situations are a central topic in interdisciplinary research fields including evolutionary biology, ecology, economics, and sociology. As analyzed by the prisoner’s dilemma (PD) game and its relatives, direct reciprocity is among the main known mechanisms underlying cooperative behavior [Trivers, 1971, Axelrod, 1984]. In direct reciprocity, iterated interaction between the same individuals motivates them to continue cooperating (C) rather than to defect (D) to obtain momentarily large payoffs; defection would be negatively rewarded by the opponent player’s retaliation in later rounds. Variants of the celebrated retaliatory strategy tit-for-tat (mimicking the opponent’s action in the previous round) [Nowak & Sigmund, 1992] and a win-stay lose-shift strategy [Kraines & Kraines, 1989, Nowak & Sigmund, 1993] are recognized as strong competitors in the iterated PD game.
In the iterated games concerning direct reciprocity, it is natural to assume that players modify their strategies in response to their experiences in past rounds. The tit-for-tat, its variants, and win-stay lose-shift strategies can be interpreted as examples of such learning strategies because tit-for-tat, for example, implies that the player selects the action (i.e., C or D) depending on the result of the last round. A more sophisticated learning player of this kind exploits a longer history of the game for action selection (e.g., cooperate if the player and the opponent cooperated in the previous two rounds, and defect otherwise) [Lindgren, 1991]. Classes of other learning models include fictitious play and reinforcement learning [Camerer, 2003, Fudenberg & Levine, 1998]. Learning seems beneficial in iterated games because learning players are more flexible than nonlearning players.
If learning is a key factor in promoting cooperation in real societies, the number of learning players should increase when a population evolves under selection pressure. However, the advantage of learners over nonlearners in evolutionary dynamics is elusive because a pair of learning players often results in mutual defection [Macy, 1996, Sandholm & Crites, 1996, Posch et al., 1999, Taiji & Ikegami, 1999, Macy & Flache, 2002, Masuda & Ohtsuki, 2009] and learning may be costly.
The constructive roles of learning in the evolution of certain traits are collectively called the Baldwin effect (see Simpson, 1953; Turney et al., 1996; Weber & Depew, 2003; Crispo, 2007; Badyaev, 2009 for reviews). Although earlier examples of the Baldwin effect are not necessarily founded on firm empirical evidence [Simpson, 1953, Weber & Depew, 2003], there exists a plethora of positive evidence of the Baldwin effect. Examples include fly’s morphological developments [Waddington, 1942], colonization of house finch in North America [Badyaev, 2009], and persistence of coastal juncos [Yeh & Price, 2004]. In fact, the concept of the Baldwin effect differs among authors (see Simpson, 1953; Downes, 2003; Turney et al., 1996). Although earlier computational models suggest that learning accelerates evolution [Hinton & Nowlan, 1987, Ancel, 1999, Maynard Smith, 1987], later theoretical and numerical studies suggest that learning either accelerates or decelerates evolution toward the optimum depending on the details of the models [Ancel, 2000, Dopazo et al., 2001, Borenstein et al., 2006, Paenke et al., 2007, Paenke et al., 2009]. The advantage of learning in evolution is also nontrivial in this broader context.
We numerically investigate the effect of learning on evolution in the iterated PD game. This question was explored in previous literature (Suzuki & Arita, 2004; also see Wang et al., 2008 for discussion). Our emphasis in this study is to use a reinforcement learning model for the iterated PD game [Masuda & Nakamura, 2011] that is much simpler in terms of the number of plastic elements than the plastic lookup-table model adopted in [Suzuki & Arita, 2004]. In our model, players are satisfied with and persist in the current action when the obtained payoff is larger than a plastic threshold. Our model of players introduced in [Masuda & Nakamura, 2011] modifies those in [Karandikar et al., 1998, Posch et al., 1999, Macy & Flache, 2002]. Via the stability analysis for nonlearning players, the numerical analysis of the discretized adaptive dynamics with nonlearning and learning players, and full evolutionary simulations, we show that learning is needed for a noncooperative population to evolve to be able to engage in mutual cooperation for wide parameter ranges. We also discuss our results in the context of the Baldwin effect.
2 Model
2.1 Iterated PD game
We assume that each player plays the PD game against each of the other players in a population. In each round within a generation, a player selects C or D without knowing the action (i.e., C or D) of the opponent player. The payoff to the focal player is defined by
(1) 
where and . Equation (1) represents the row player’s payoff. The payoff to the opponent (column player) is defined likewise; the PD game is symmetric. Because and , mutual defection is the only Nash equilibrium of the single-shot PD game.
However, players may continue mutual cooperation for their own benefits in the iterated PD game [Trivers, 1971, Axelrod, 1984]. We denote the number of rounds per generation by . Technically, the Nash equilibrium of the iterated PD game is perpetual mutual defection if the players know beforehand. The number of rounds is often randomized to avoid this effect [Axelrod, 1984]. To simplify the analysis, we assume that the players are unaware of the fixed value of .
Earlier studies identified tit-for-tat, which involves imitating the previous action of the opponent, as a strong strategy in the iterated PD game when various strategies coexist in a population [Trivers, 1971, Axelrod, 1984]. However, later studies showed that tit-for-tat is not robust against error and that alternative strategies such as generous tit-for-tat [Nowak & Sigmund, 1992] and Pavlov [Kraines & Kraines, 1989, Nowak & Sigmund, 1993] are strong competitors in the iterated PD game with error. By definition, a Pavlov player receiving payoff or is satisfied and does not change the action in the next round, whereas the same player receiving payoff or is dissatisfied and flips the action. A population composed of Pavlov players, for example, realizes mutual cooperation such that a player gains approximately per round.
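As a minimal sketch of the Pavlov rule described above, the following Python snippet encodes a PD payoff table and the win-stay lose-shift update. The symbols R, T, S, and P and their numerical values (3, 5, 0, 1) are assumptions made for illustration; they follow the conventional PD notation and are not taken from the text.

```python
# Assumed notation (hypothetical, not from the text): R = payoff for mutual
# cooperation, T = payoff for unilateral defection, S = payoff for unilateral
# cooperation, P = payoff for mutual defection. Values are illustrative only.
R, T, S, P = 3.0, 5.0, 0.0, 1.0


def payoff(own: str, other: str) -> float:
    """Payoff to the focal (row) player for one round of the PD game."""
    table = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    return table[(own, other)]


def pavlov_next(own: str, other: str) -> str:
    """Win-stay lose-shift: keep the action after the two larger payoffs
    (mutual cooperation or unilateral defection), flip it otherwise."""
    satisfied = payoff(own, other) in (R, T)
    if satisfied:
        return own
    return "D" if own == "C" else "C"
```

Under these assumptions, two Pavlov players that both defected in one round both switch to cooperation in the next, which is why a Pavlov population can settle into mutual cooperation.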
2.2 Reinforcement learning
Intuitively, the ability to learn may seem to be an advantageous trait in the iterated PD game if the cost of learning is negligible. However, this is generally not the case. A pair of learning players often ends up with mutual defection unless a learning algorithm is carefully designed [Macy, 1996, Sandholm & Crites, 1996, Posch et al., 1999, Taiji & Ikegami, 1999, Macy & Flache, 2002, Masuda & Ohtsuki, 2009]. Learning requires trial and error, i.e., the exploration of unknown behavioral patterns as well as the exploitation of known advantageous behavioral patterns. Exploratory behavior of a learning player may look just random to opponents, and it is rational to defect against random-looking players.
To balance the possibility of mutual cooperation, the simplicity of the learning algorithm, and the biological plausibility of the model relative to other learning algorithms, we use a variant of the Bush–Mosteller (BM) reinforcement learning model [Masuda & Nakamura, 2011]. This model modifies the models in the previous literature [Karandikar et al., 1998, Posch et al., 1999, Macy & Flache, 2002] such that players learn to mutually cooperate for wide parameter ranges.
In round , the cooperability of the learning player is given by the probability . We update using the results of the single-shot PD game as follows:
(2) 
where
(3) 
and is the payoff to the player in round . stands for the degree of satisfaction in round . When is large, the player increases the probability of taking the current action in round . For example, the third line in Eq. (2) indicates that the player decreases the probability of cooperation because selecting D in round has yielded a satisfactory outcome. In addition, we assume that the player misimplements the action with a small probability such that the player in fact cooperates with probability in round . Equations (2) and (3) indicate that the player is satisfied with the current situation if the obtained payoff is larger than the so-called aspiration level . Otherwise, the player is motivated to flip the action. controls the sensitivity in the plasticity of . If , for any such that is constant. If , or for any such that or 0.
Unless otherwise stated, we set the initial condition to , i.e., the player defects in round 1. This value of is the most adverse to mutual cooperation. We will confirm in Secs. 3.1 and 3.4 that our main results are qualitatively the same if we set .
The dynamics of the aspiration level are given by
(4) 
where represents the learning rate. If , is constant, and the model is equivalent to the classical BM model. If , the player compares the current payoff and the payoff obtained in the last round to determine . In our previous work, we showed that mutual cooperation is established among the players only after rounds if is large and is small [Masuda & Nakamura, 2011]. In the numerical simulations, we set , which is large enough to support mutual cooperation if other conditions, such as small and small , are met.
We remark that the initial condition is a key parameter to characterize the player.
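The update rules of this section can be sketched as follows. Because Eqs. (2)–(4) are not reproduced here, the functional forms below are assumptions: a Bush–Mosteller update with a tanh-shaped satisfaction and an aspiration level that is a running average of past payoffs. The names `p`, `A`, `beta`, `h`, and `eps` are hypothetical labels for the cooperation probability, aspiration level, sensitivity, learning rate, and misimplementation probability.

```python
import math
import random


def bm_step(p, A, action, r, beta, h):
    """One update of the cooperation probability p and aspiration level A.

    action: action actually taken in the previous round ("C" or "D").
    r: payoff obtained in the previous round.
    All functional forms here are assumptions, not the authors' equations.
    """
    s = math.tanh(beta * (r - A))  # satisfaction in [-1, 1]; positive if r > A
    if action == "C":
        # Satisfied cooperation raises p; dissatisfied cooperation lowers it.
        p_new = p + (1.0 - p) * s if s >= 0 else p + p * s
    else:
        # Satisfied defection lowers p; dissatisfied defection raises it.
        p_new = p - p * s if s >= 0 else p - (1.0 - p) * s
    # Assumed aspiration dynamics: h = 0 keeps A fixed, h = 1 sets A to the
    # most recent payoff.
    A_new = (1.0 - h) * A + h * r
    return p_new, A_new


def choose_action(p, eps, rng=random):
    """Intend C with probability p, then flip the action with probability eps,
    so that C is played with probability p * (1 - eps) + (1 - p) * eps."""
    intended = "C" if rng.random() < p else "D"
    if rng.random() < eps:
        return "D" if intended == "C" else "C"
    return intended
```

With these forms, `p` always stays in [0, 1], and setting `h = 0` recovers a fixed aspiration level, in line with the classical BM model mentioned above.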
2.3 Evolutionary dynamics
We set the number of players in the population to . In a single generation, each player plays the iterated PD game with against all the other players. We always reset and to and when a player starts the iterated PD game with a new opponent. The single generation payoff is equal to the summation of the payoff obtained by playing against players, which is divided by .
After the single generation payoffs to all the players are determined, we select two players and with equal probability for strategy update. We use the Fermi rule [Szabo & Toke, 1998, Traulsen et al., 2006] in which player adopts ’s and values in the next generation with probability , and player adopts ’s parameter values, otherwise. We set . To account for mutation, we assume that after strategy update, and of the adopter are displaced by random small values obeying the uniform density on and , respectively. If the displaced exceeds 1 or is negative, we reset to 1 or 0, respectively. However, the resetting seldom occurs in our evolutionary simulations.
The phenotype of a player in round is specified by and . It should be noted that and are not inherited over generations. In other words, natural selection operates on the capacity to learn (i.e., ) but not on the acquired behavior (i.e., and ). Because we let in Eq. (3) be relatively large to realize mutual cooperation [Masuda & Nakamura, 2011], is sensitive to the excess payoff relative to in the sense that is close to or unless is close to . Therefore, is similar to the probability of cooperation conditioned on the outcome of the PD game in the previous round. When we use the term learning in the following, we exclusively refer to that induced by in the iterated PD game. A positive value of directly raises the plasticity of and indirectly controls that of .
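The strategy update of this section can be sketched as below. This is an illustration under stated assumptions: the Fermi rule is written in its usual logistic form, the mutation intervals are taken to be symmetric around zero, and the names `kappa`, `dA`, and `dh` (selection intensity and mutation half-widths) are hypothetical.

```python
import math
import random


def fermi_update(pop, payoffs, kappa, dA, dh, rng=random):
    """Pairwise-comparison update with mutation (illustrative sketch).

    pop: list of (initial aspiration level, learning rate) tuples.
    payoffs: single-generation payoff of each player.
    Returns the index of the player whose parameters were overwritten.
    """
    i, j = rng.sample(range(len(pop)), 2)  # two distinct players
    x = kappa * (payoffs[j] - payoffs[i])
    # Logistic function 1 / (1 + exp(-x)), written with tanh to avoid overflow.
    prob_i_adopts_j = 0.5 * (1.0 + math.tanh(0.5 * x))
    if rng.random() < prob_i_adopts_j:
        adopter, model = i, j
    else:
        adopter, model = j, i
    A1, h = pop[model]
    # Assumed mutation: small uniform displacements, symmetric around zero.
    A1 += rng.uniform(-dA, dA)
    h += rng.uniform(-dh, dh)
    h = min(1.0, max(0.0, h))  # reset the learning rate to [0, 1] if it leaves
    pop[adopter] = (A1, h)
    return adopter
```

Only the inherited parameters are copied; the within-generation values of the cooperation probability and the aspiration level are reset for every new opponent, so they never enter `pop`.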
3 Results
3.1 Nash equilibria when without learning
To show that learning is necessary for the emergence of cooperation, we start by analyzing the competition between players that do not learn. With learning rate equal to zero, the aspiration level is fixed over rounds (i.e., , ). For the sake of analysis, we set . Then, Eqs. (2) and (3) imply that the player persists in the current action (i.e., or ) if and flips the action (i.e., or ) otherwise. When is fixed, there are five strategies:

Strategy st1 is defined by . Except for the action misimplementation, an st1 player always cooperates or always defects, depending on the action in the first round.

Strategy st2 is defined by . An st2 player does not flip the action unless .

Strategy st3 is defined by . An st3 player does not flip the action if mutual cooperation or unilateral defection is realized. It is equivalent to Pavlov, which is a strong competitor in the iterated PD game [Kraines & Kraines, 1989, Nowak & Sigmund, 1993].

Strategy st4 is defined by . An st4 player flips the action unless .

Strategy st5 is defined by . An st5 player flips the action in every round except when the player misimplements the action.
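The five strategies can be read as a classification of the fixed aspiration level by how many of the four payoffs exceed it. The sketch below assumes the conventional payoff symbols and illustrative values (R, T, S, P) = (3, 5, 0, 1), neither of which is taken from the text, and uses the rule stated above that a player persists only if the payoff is larger than the aspiration level.

```python
# Assumed illustrative payoffs in conventional notation (hypothetical values):
# S < P < R < T.
S, P, R, T = 0.0, 1.0, 3.0, 5.0


def classify(A: float, payoffs=(S, P, R, T)) -> str:
    """Label a fixed aspiration level A as st1, ..., st5.

    A player is satisfied by a payoff r only if r > A, so the label depends
    on how many of the four payoffs are satisfying:
    4 -> st1 (never flips), 3 -> st2 (flips only after the smallest payoff),
    2 -> st3 (Pavlov), 1 -> st4 (persists only after the largest payoff),
    0 -> st5 (always flips).
    """
    n_satisfying = sum(r > A for r in payoffs)
    return f"st{5 - n_satisfying}"
```

For example, with these values any aspiration level between 1 and 3 is classified as st3, the Pavlov-like strategy.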
In Table 1, the average payoff to a nonlearning (i.e., ) player (row player) playing against another nonlearning player (column player) is shown for and . For example, st1 playing against st2 obtains per round on average. The results shown in Table 1 are a subset of those obtained in [Nowak et al., 1995] (see Appendix A for details). Table 1 indicates that st3 is a Nash equilibrium when the five strategies are considered. In particular, st3 playing against another st3 realizes mutual cooperation and obtains the largest average payoff per round . Therefore, a unanimous population composed of st3 players represents a eusocial situation. Mutual cooperation is not realized by any other combination of two players.
Table 1 indicates that st2 is also a Nash equilibrium when . In addition, although st4 is not a Nash equilibrium, a homogeneous population composed of st4 players is resistant to invasion by st3 in evolutionary situations because st4 gains a larger payoff than st3 does when playing against an st4 opponent.
To test the robustness of the results shown in Table 1, we set , and , and numerically calculate the payoff averaged over generations to different nonlearning players with different fixed aspiration levels . We also set and in this and the following numerical simulations. The average payoff to a nonlearner playing against another nonlearner is shown for and in Fig. 1. The presented values are averages over 100 trials for each pair of values. The results shown in Fig. 1 are qualitatively the same as those shown in Table 1. We also confirmed that the results hardly change for and .
3.2 Possibility of mutual cooperation via learning
If players different from st3 may adjust until is satisfied such that they learn to behave as Pavlov. Therefore, learning may play a constructive role in the evolution of mutual cooperation. In fact, this is not always the case; is a necessary condition for mutual cooperation to evolve.
To explain this point, we set and , and numerically examine the behavior of a pair of players. Typical time courses of the aspiration level for a pair of learning players over rounds without action misimplementation (i.e., ) are shown in Fig. 2(a). Each of the three pairs with close values represents a pair of st1 (thick lines), st3 (dotted lines), and st5 players (medium lines), respectively. We used different values of for each pair for the clarity of the figure; making equal for two players does not qualitatively change the results. The thick lines in Fig. 2(a) indicate that the two st1 players playing with each other are satisfied with payoff obtained by mutual defection. Therefore, their aspiration levels converge to . The results would be the same if we start from a pair of st2 players or a combination of an st1 player and an st2 player. A pair of st3 players begins mutual cooperation from the second round, and their values converge to (dotted lines). Mutual cooperation is also realized if the two players are initially either st4 or st5, although some rounds are required before the players mutually cooperate (medium lines).
Although two learning players having do not end up with mutual cooperation when , the action misimplementation (i.e., ) can trigger a shift from mutual defection to mutual cooperation. Artificially generated time courses in the presence of action misimplementation are shown in Fig. 2(b) for expository purposes. Until the intended action is misimplemented ( in Fig. 2(b)), two players starting with keep mutual defection (thick lines). When has sufficiently approached , we assume that one player misimplements the action (). Then, the values of both players cross from below within a couple of rounds such that the players start to behave as Pavlov and mutually cooperate. The possibility of mutual cooperation through this mechanism is sensitive to the value of . Two players starting with end up with owing to the action misimplementation when (see Appendix B for derivation). When , and , this condition yields for an arbitrary value of . Then, the values of the two players converge to .
The values also converge to when we start with a pair of st3 players (dotted lines in Fig. 2(b)) and a pair of st4 or st5 players (medium lines in Fig. 2(b)). This is because mutual cooperation is stable against action misimplementation; if one player turns into D by action misimplementation in round , both players defect in round and cooperate in round , if the actions are not misimplemented in rounds and . This event sequence is likely unless is large.
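The recovery sequence described above can be traced with a small deterministic sketch. It assumes the conventional payoff symbols with illustrative values (R, T, S, P) = (3, 5, 0, 1), a fixed aspiration level of 2 for both players (i.e., Pavlov-like behavior), and the persist-if-satisfied rule; all of these are assumptions for illustration, and no further misimplementation occurs after the first round.

```python
# Assumed illustrative payoffs in conventional notation (hypothetical values).
R, T, S, P = 3.0, 5.0, 0.0, 1.0
PAYOFF = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}


def flip(action: str) -> str:
    return "D" if action == "C" else "C"


def step(a1: str, a2: str, A1: float, A2: float):
    """Both players persist if their payoff exceeds their aspiration level
    and flip their action otherwise."""
    r1, r2 = PAYOFF[(a1, a2)], PAYOFF[(a2, a1)]
    n1 = a1 if r1 > A1 else flip(a1)
    n2 = a2 if r2 > A2 else flip(a2)
    return n1, n2


def recover_from_error(A: float = 2.0, rounds: int = 3):
    """Start in the round where player 1 misimplements C as D while player 2
    cooperates, then iterate without further errors."""
    a1, a2 = "D", "C"
    history = [(a1, a2)]
    for _ in range(rounds - 1):
        a1, a2 = step(a1, a2, A, A)
        history.append((a1, a2))
    return history
```

Calling `recover_from_error()` gives unilateral defection, then mutual defection, then mutual cooperation, which matches the three-round sequence in the text.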
3.3 Adaptive dynamics
In the evolutionary numerical simulations that we will describe in Sec. 3.4, we allow the initial aspiration level and learning rate to mutate (Sec. 2.3). If the distribution of and that of for an evolving population are single peaked and sufficiently localized, we can grasp the evolutionary dynamics for a population by tracking the dynamics of the population averages of and , denoted by and , respectively. In the extreme case in which all the players share identical values of and , the instantaneous dynamics of and are captured by adaptive dynamics [Metz et al., 1996, Hofbauer & Sigmund, 1998, Hofbauer & Sigmund, 2003, Doebeli et al., 2004]. Adaptive dynamics reveal the possibility for mutants with a slightly deviated parameter value to invade a homogeneous resident population. In this section, we numerically examine two-dimensional adaptive dynamics with respect to and to foresee the evolutionary simulations carried out in Sec. 3.4.
In this and the following sections, we set , and unless otherwise stated. Consider a homogeneous population of players sharing the parameter values and . A mutant player with aspiration level and learning rate can invade the population if
(5) 
or
(6) 
where and are the strategies of the resident and mutant players, respectively, and represents the average payoff of strategy when playing with strategy . is an ESS if the converse of Eq. (5) or the converse of Eq. (6) is satisfied. If Eq. (5) or (6) is satisfied, the homogeneous population comprising strategy would evolve toward . We numerically calculate , where and . We confine to the neighborhood of because the amount of mutation for and is assumed to be small. Examining corresponds to looking at the discretized adaptive dynamics, i.e., the discretized derivative of with respect to at .
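The invasion test can be sketched as a small function. Since Eqs. (5) and (6) are not reproduced here, the sketch assumes that they take the standard two-part form used in evolutionary game theory: the mutant does strictly better than the resident against the resident, or does equally well against the resident and strictly better against the mutant. The argument names are hypothetical.

```python
def can_invade(pi_mr: float, pi_rr: float, pi_mm: float, pi_rm: float,
               tol: float = 0.0) -> bool:
    """Assumed standard invasion conditions (illustrative, not the authors').

    pi_xy is the average payoff of strategy x when playing against y,
    with m = mutant and r = resident. tol is a tolerance for treating
    numerically estimated payoffs as equal.
    """
    if pi_mr > pi_rr + tol:
        return True  # first condition: mutant beats resident against resident
    if abs(pi_mr - pi_rr) <= tol and pi_mm > pi_rm + tol:
        return True  # second condition: tie broken against the mutant
    return False


def invasion_fitness(pi_mr: float, pi_rr: float) -> float:
    """Payoff difference against the resident; its sign for a nearby mutant
    plays the role of a discretized selection gradient."""
    return pi_mr - pi_rr
```

In practice the payoffs would be averages over many simulated games, so a nonzero `tol` would be needed before the second branch could ever apply.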
For various values of and , is shown in Fig. 3. The plotted values are averages over runs for any . In Fig. 3(a), obtains a larger payoff than in the red region. In this region, would invade a homogeneous resident population of such that increases. In contrast, obtains a smaller payoff than does in the blue region. Figure 3(a) indicates that, if learning is prohibited (i.e., ), the population starting from , for example, is expected to evolve such that increases, but only up to . Therefore, a population does not evolve from st2 to st3 without learning.
Figure 3(b), which reveals the possibility of invasion by mutant in the resident population of , is a sign-flipped version of Fig. 3(a) in most parameter regions. Nevertheless, neither the mutants with nor the ones with invade the resident population (i.e., parameter regions colored in blue in both Figs. 3(a) and 3(b)) for and along a bent line passing through and . These regions constitute singular points of the adaptive dynamics and serve as repellers. In other words, does not pass through for and for various values of in adaptive dynamics. The observations for that the homogeneous population of st2 is not invaded by st3 mutants, that of st3 is not invaded by st2 or st4 mutants, and that of st4 is not invaded by st3 mutants, are consistent with the results obtained in Sec. 3.1.
The possibility of invasion by mutant in the homogeneous population of is shown in Fig. 3(c). The figure suggests that would increase for a population of st1 players (i.e., ). Learning is preferred to nonlearning when for the following reason. As shown in Sec. 3.2, when and , increases until the players behave as Pavlov to mutually cooperate within a relatively small number of rounds (Fig. 2(b)). In contrast, the players do not establish mutual cooperation when or , as shown in Sec. 3.1 (Fig. 2(a)). Figure 3(c) indicates that increases up to . This value of is consistent with the upper bound of for mutual cooperation to be possible, which was derived in Sec. 3.2. Based on these results, is expected to initially increase in evolutionary dynamics starting with a population of nonlearning st1 players. We refer to the stage of evolutionary dynamics in which increases as stage 1. The existence of stage 1 is also supported by Fig. 3(d) in which the mutant has .
After has increased, Figs. 3(a) and 3(b) imply that increases to cross . When , a larger value of is beneficial because fewer rounds are required for such players to turn to Pavlov (i.e., ). Once exceeds for a majority of players, they earn a large average payoff through mutual cooperation. We refer to the transition for learning players from a small corresponding to st1 or st2 to a large corresponding to st3 as stage 2. Figures 3(a) and 3(b) indicate that the difference between and when , , is small, presumably because and are only slightly different in terms of the number of transient rounds before the entrance to . Therefore, we expect that stage 2 occurs slowly in evolutionary dynamics.
Although it is a minor phenomenon as compared to stages 1 and 2, a smaller is more beneficial on the boundary between st2 and st3 (i.e., ), as shown in Figs. 3(c) and 3(d). For expository purposes, time courses of the iterated PD game between an st2 player and an st3 player are shown in Fig. 3(e). As shown by the solid lines, the initial st3 player flips to st2 before establishing mutual cooperation if . In fact, a nonlearning st3 player (i.e., ) realizes mutual cooperation with a learning st2 player in earlier rounds (dotted lines) than a learning st3 player does (solid lines). Therefore, in evolutionary dynamics, in the vicinity of is expected to decrease. We refer to this transition as stage 3. It should be noted that stage 3 occurs in a narrow range of (i.e., and in Fig. 3(a)).
Through stages 1, 2, and 3, evolution from a defective population of nonlearning st1 players to a cooperative population of st3 players is logically possible. In contrast, the emergence of mutual cooperation is hampered if learning is prohibited.
After stage 3, would not evolve beyond ; is a line of repellers in adaptive dynamics, as already explained in Figs. 3(a) and 3(b). When (i.e., ), the mutant’s payoff is indistinguishable from the resident’s payoff unless is large (Figs. 3(a)–(d)). Therefore, and would perform approximately unbiased diffusion. This implies that that has decreased via stage 3 may increase again.
When at least one of the two players is st4 or st5, a player with a larger is more advantageous than the opponent with a smaller . This is because the former exploits the latter in early rounds. Nevertheless, these players do not obtain the average payoff as large as that for a pair of st3 players, which would start to mutually cooperate from the second round. Therefore, st3 is stable against invasion by st4 and vice versa.
We predict that the learning rate would not eventually decrease to the small value in evolutionary simulations. In other words, the disadvantage of learning is too small to be evolutionarily relevant unless the cost of learning is explicitly incorporated.
To assess the robustness of the results obtained from the adaptive dynamics, we reproduced Figs. 3(a)–(d) with and . The results for are qualitatively the same as those for (results not shown). The results for are different in some aspects from those for (Fig. 4). Most notably, when , st3 is no longer stable against invasion by st2 even without learning (i.e., ). Therefore, mutual cooperation would not be stable in evolutionary dynamics. In Figs. 5(a) and 5(b), is shown for and , respectively, for a variety of values of and . Figure 5(a) indicates that an st3 mutant does not invade the population of st2 residents for all the examined values of and . Figure 5(b) indicates that a population of st3 residents is resistant to invasion by st2 mutants when is large and is small. Nevertheless, st3 is stable for various values of and . Because stage 2 is hampered when , must take an intermediate value for the learning-mediated mutual cooperation to emerge.
3.4 Evolutionary simulations
The results in Sec. 3.3 predict the presence of a learning-mediated evolutionary route from a noncooperative population composed of st1 players to a cooperative population composed of st3 players. In this section, we carry out direct numerical simulations of the evolutionary dynamics using a population composed of players. We initially set and select for each player independently from the uniform density on . Therefore, all the players are initially nonlearning st1. Refer to Sec. 2.3 for details of the numerical setup.
The evolution of , the total amount of plasticity experienced in a generation, defined by , , and the fraction of mutual cooperation for an example run with and are shown in Fig. 6(a). The average learning rate and the total amount of plasticity rapidly increase until the th round. The payoff and the fraction of mutual cooperation also increase during this period because st1 players learn to behave as Pavlov when . This period corresponds to stage 1 described in Sec. 3.3. Then, the fraction of mutual cooperation and the total amount of plasticity gradually increase until the th round, corresponding to stage 2. In the th round, an st3 mutant emerges in the population mostly composed of st2 players and gains a larger payoff than st2 residents do. Then, st3 players rapidly replace st2 players in the population such that and the fraction of mutual cooperation suddenly increase (Fig. 6(a)). This is because stable mutual cooperation between st3 players emerges in an early round, whereas that between st2 players emerges after rounds. The learning rate decreases almost at the same time, corresponding to stage 3. The time courses of the fractions of st1, st2, st3, st4, and st5 players corresponding to the run shown in Fig. 6(a) are shown in Fig. 6(b). For example, the fraction of the st1 player is defined by the fraction of players having and any value of . Figure 6(b) indicates that the population initially composed of st1 players evolves to that of st3 players. The trajectory of and corresponding to the same run is shown in Fig. 6(c). Figure 6(c) is consistent with the scenario of the evolution of cooperation described in Sec. 3.3. The population evolves from no cooperation to mutual cooperation via the three stages involving learning. After stage 3, and diffuse without a recognizable bias, which is also consistent with the results obtained in Sec. 3.3 (white regions in Figs. 3(a)–(d)). 
However, it should be noted that the total amount of plasticity remains small after stage 3.
To examine the robustness of the results, we carry out five runs of numerical simulations for each of the different parameter sets; we could not carry out more extensive numerical simulations because of the computational cost. We measure two quantities in each run. The first quantity is the number of generations necessary for to exceed for the first time. We call this number the end of stage 1. The second quantity is the number of generations necessary for to exceed for the first time. We call this number the end of stage 2. The ends of stages 1 and 2 with , , and are equal to and , respectively, where the mean ± standard deviation over the five runs is indicated. Those with , , and are equal to and . Those with , , and are equal to and . Those with , , and are equal to and . For this parameter set, one out of the five runs did not reach the end of stage 2 within generations, such that the statistics are based on the other four runs. Those with , , and are equal to and . Mutual cooperation evolves via learning (i.e., finite value of the end of stage 2 up to our numerical efforts) in most cases. When , evolution to mutual cooperation is slower than when . This may be because learning players having different values of turn into Pavlov (i.e., ) within a small number of rounds when is relatively large. Then, the payoff to different learning players would differ relatively little to weaken the selection pressure.
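The two stage-end measurements amount to finding the first generation at which a population-averaged time series exceeds a threshold, then summarizing over runs. The sketch below is one way to do this; the function names are hypothetical, generations are counted from zero, and the sample standard deviation is an assumption because the text does not say which estimator was used.

```python
from statistics import mean, stdev


def first_exceed(series, threshold):
    """Index of the first generation at which the series exceeds the
    threshold, or None if it never does within the simulated generations."""
    for generation, value in enumerate(series):
        if value > threshold:
            return generation
    return None


def summarize(stage_ends):
    """Mean and sample standard deviation over the runs that reached the
    stage end; runs that never crossed the threshold (None) are excluded.
    Requires at least two runs that reached the stage end."""
    reached = [end for end in stage_ends if end is not None]
    return mean(reached), stdev(reached), len(reached)
```

Excluding the runs that return `None` mirrors the treatment of the run that did not reach the end of stage 2, where the statistics are based on the remaining runs.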
We perform another robustness test. For the original parameter values , , and , the trajectory of and obtained from a single run with is shown in Fig. 7. The results are qualitatively the same as those for (Fig. 6(c)) although establishment of cooperation takes a considerably larger number of generations when than when .
3.5 Baldwin effect
If we assume an explicit cost of learning, the learning rate decreases after mutual cooperation is reached. An example time course of and when a linear cost is added to the single generation payoff to each player (Suzuki & Arita 2004; see Ancel 1999, 2000 for a different implementation of the explicit learning cost), where , is shown in Fig. 8(a). The final value of is smaller than that in the case without the learning cost (Fig. 6(c)). The result shown in Fig. 6(d) is an example of the standard Baldwin effect in which the learning rate initially increases and then decreases [Ancel, 2000, Dopazo et al., 2001, Borenstein et al., 2006, Paenke et al., 2007, Paenke et al., 2009].
We examine the robustness of the observed Baldwin effect against the variation of . An example time course of and when is shown in Fig. 8(b). As compared to when (Fig. 8(a)), is smaller throughout the evolution, and increases more slowly. Nevertheless, the population mostly consists of st3 players in the end. We carried out five runs for various values of . The largest value in generations is shown in Fig. 8(c) for each run. The largest value decreases with because learning is costly for a large value of . The final value of , calculated as the average over the last generations, is shown in Fig. 8(d). The final value of is considerably smaller than the largest value (Fig. 8(c)) for each , indicative of stage 3 of the Baldwin effect. The final value of , calculated as the average over the last generations, is plotted against in Fig. 8(e). If this value is larger than and smaller than , we expect that the final population is mostly composed of st3 players and that the Baldwin effect is operative. Figure 8(e) suggests that the Baldwin effect occurs in the five runs when . When , stage 1, i.e., the initial increase in , is often too small in magnitude such that stage 2 sometimes does not occur. We conclude that the Baldwin effect occurs for a wide range of .
4 Discussion
We have shown that reinforcement learning promotes the evolution of mutual cooperation in a population of players involved in the iterated PD game. Cooperation evolves under some conditions such as , positive but not too large values of , and that is not too small. The present study is motivated by previous investigations of the Baldwin effect. Our results provide an example of the Baldwin effect in the form of a computational model of social behavior.
To understand the behavior of our model analytically, writing down the Fokker–Planck equation for the joint density of and may be useful. Starting from the singular density at a small value of and , we may be able to solve the Fokker–Planck equation numerically to track the evolution of the joint density to find the Baldwin effect. Alternatively, discretizing and and then formulating a Markov chain on the discretized states may also be useful. Nevertheless, we refrained from such analyses because we consider that they would eventually necessitate numerical simulations and would not sufficiently advance the understanding of our numerical results.
The concept of the Baldwin effect is diverse [Simpson, 1953, Downes, 2003, Turney et al., 1996]. However, arguably, the most accepted variant of the Baldwin effect is formulated as a two-stage mechanism [Simpson, 1953, Godfrey-Smith, 2003, Crispo, 2007, Turney et al., 1996]. In stage 1, plasticity increases because plastic individuals are better at finding the optimal behavior than nonplastic individuals. In stage 2, mutation makes the optimal behavior innate and decreases the plasticity of individuals. Mutants that play optimally from the outset of their life without plasticity and resident individuals that acquire the optimal behavior through plasticity are eventually equally efficient. Nevertheless, because of the cost of learning, the mutants overwhelm the residents via natural selection. Stage 2 is often called genetic assimilation.
Stage 1 in our model corresponds to stage 1 of the standard Baldwin effect outlined above. In stage 2 in our model, increases such that the optimal behavior (i.e., mutual cooperation by turning into st3) becomes innate. Nevertheless, after stage 3 in our model in which the learning rate rapidly decreases, the learning rate starts to perform a random walk because the learning cost is marginal in our model (Fig. 6(c)). Therefore, the behavior of our model in stages 2, 3, and onward does not qualify as stage 2 of the standard Baldwin effect in which the learning rate decreases. With a modified model with an explicit learning cost, we showed that the learning rate decreases after stage 3 (Fig. 6(d)). In this case, our model naturally fits the framework of the Baldwin effect.
In a previous computational model of the Baldwin effect, learning rates remain large when the optimal behavior dynamically changes owing to environmental fluctuations [Ancel, 1999]. In our model without an explicit cost of learning, the learning rate remains large for a different reason. In our model, the optimal parameter set (i.e., and ) does not fluctuate after sufficient generations. Instead, approximate optimality is realized for various parameter sets, i.e., any and . Therefore, the learning rate performs a random walk to occasionally visit large values (Fig. 6(c)).
Godfrey-Smith points out three alternative reasons why stage 1 cannot be skipped in the two-stage mechanism of the Baldwin effect [Godfrey-Smith, 2003]. First, learning may provide a breathing space by which a population can survive long enough to make the transition to stage 2. This reason is irrelevant to our model because our model is not concerned with the survival of the population. The population size is fixed in our model such that the population always survives. Second, the preferred state may be accessible for learners but not for nonlearners. Although not explicitly stated in Godfrey-Smith (2003), this mechanism seems to be relevant to cases in which the fitness landscape does not depend on the configuration of the population. In our case, however, the fitness landscape depends on the fractions of the different types of players because the payoff to a player is affected by the strategies of the other players. Third, evolution may change the “social ecology” of the population such that learners have an advantage over nonlearners, a phenomenon called niche construction in a broad sense. The social ecology implies a fitness landscape that depends on the configuration of the population. In our model, the social ecology evolves via learning of players. This third mechanism seems to be relevant to our model. Suppose a hypothetical population composed of st1 nonlearners except for two st1 learners. For a focal st1 learner, the social ecology is such that there is one st1 learner and st1 nonlearners. If , the focal st1 learner is likely to gain a payoff that is larger than that of an st1 nonlearner because the focal player learns to mutually cooperate with the other st1 learner, whereas an st1 nonlearner does not. The focal st1 learner would not overwhelm st1 nonlearners if the other st1 learner were absent from the social ecology.
The main purpose of this study is to provide an evolutionary model of concrete social behavior in which learning plays a constructive role. We are not the first to achieve this end. Suzuki and Arita observed the Baldwin effect in the iterated PD game using different learning models [Suzuki & Arita, 2004]. In their model, the learning rate is assumed to be binary, and the player’s strategy is specified by a lookup table that associates the action to take (i.e., C or D) with the actions of the previous two rounds of the two players. The entries of the lookup table dynamically change when the plasticity is in operation. They also considered the effects of meta-learning, in which the player adapts how to update each entry of the lookup table. The main contribution of the present work relative to theirs is to provide a much simpler model in terms of the number of plastic parameters. In addition, the learning rates and the range of parameters are continuous in our model, whereas they are mostly binary in their model. Our model may be applicable to real animals and facilitates a mechanistic understanding of evolutionary dynamics by the numerically calculated adaptive dynamics. Apart from the fixed parameters common to all the individuals, our players only have two parameters that are plastic within a generation, and , and two parameters inherited across generations, and . The results obtained from the adaptive dynamics predict those of direct evolutionary numerical simulations and provide an intuitive reason why learning promotes the emergence of mutual cooperation. In particular, we showed the necessity of the evolution of learning ability for cooperation by explicitly comparing the cases with and without learning. The combination of adaptive dynamics and evolutionary simulations may also be useful for analyzing the Baldwin effect in different models.
Acknowledgements
We thank Reiji Suzuki and Kohei Tamura for the valuable discussions. N.M. acknowledges the support provided through Grants-in-Aid for Scientific Research (Nos. 20760258 and 23681033, and Innovative Areas “Systems Molecular Ethology” (No. 20115009)) from MEXT, Japan.
Appendix A: Payoff to nonlearning players
Nowak et al. (1995) analyzed iterated matrix games between a pair of players that select an action (i.e., C or D) in response to the actions of the two players in the previous round. There are four combinations of the actions of the two players in the previous round, i.e., (C, C), (C, D), (D, C), and (D, D). Because a player assigns C or D to each of these possible outcomes in the previous round, there are 16 strategies . In fact, st1, st2, st3, st4, and st5 in the present study are equivalent to and in [Nowak et al., 1995], respectively.
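The count of 16 follows from assigning one of two actions to each of the four previous-round outcomes, i.e., 2^4 = 16. A minimal Python sketch of this enumeration is given below. The dictionary representation and the names `OUTCOMES`, `enumerate_strategies`, `TIT_FOR_TAT`, and `WIN_STAY_LOSE_SHIFT` are illustrative assumptions and do not reproduce the labeling used by Nowak et al. (1995) or the st1 to st5 labels of the present study.

```python
from itertools import product

# Previous-round outcomes written as (own action, opponent's action).
OUTCOMES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]


def enumerate_strategies():
    """Return every deterministic rule mapping a previous outcome to C or D."""
    strategies = []
    for actions in product("CD", repeat=len(OUTCOMES)):
        # Pair each outcome with the action the strategy prescribes after it.
        strategies.append(dict(zip(OUTCOMES, actions)))
    return strategies


STRATEGIES = enumerate_strategies()

# Two well-known members of the set, written in the same representation.
TIT_FOR_TAT = {
    ("C", "C"): "C", ("C", "D"): "D", ("D", "C"): "C", ("D", "D"): "D",
}
WIN_STAY_LOSE_SHIFT = {
    ("C", "C"): "C", ("C", "D"): "D", ("D", "C"): "D", ("D", "D"): "C",
}

if __name__ == "__main__":
    print(len(STRATEGIES))
    print(TIT_FOR_TAT in STRATEGIES, WIN_STAY_LOSE_SHIFT in STRATEGIES)
```

Tit-for-tat copies the opponent's previous action, whereas win-stay lose-shift repeats its own action after the two larger payoffs and switches otherwise.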
By calculating the steady state of the Markov chain with four states , , , and , Nowak et al. (1995) calculated the average payoff to focal player playing against the opponent under a small probability of error in action implementation. Their assumption for the action misimplementation is slightly different from ours. We assumed that is the probability that each player independently misimplements the action, whereas only one of the two players may misimplement the action in a round in their model. Nevertheless, our model is equivalent to theirs in the limit if we set , where is the probability of action misimplementation in the sense of Nowak et al. (1995). Therefore, our results shown in Table 1 are a corollary of their results.
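The steady-state calculation can be sketched numerically as follows. This is a hedged illustration rather than the procedure used to obtain Table 1: it adopts the independent-error assumption described above, the payoff values in `PAYOFF` are conventional PD values chosen for illustration and are not taken from the present study, and all function and variable names are hypothetical. The stationary distribution is approximated by repeatedly squaring the four-state transition matrix.

```python
# Illustrative PD payoffs to the focal player X (assumed values).
PAYOFF = {("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0}
# States are (X's action, Y's action) in the previous round.
STATES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]

# Strategies map (own action, opponent's action) to the intended action.
WSLS = {("C", "C"): "C", ("C", "D"): "D", ("D", "C"): "D", ("D", "D"): "C"}
TFT = {("C", "C"): "C", ("C", "D"): "D", ("D", "C"): "C", ("D", "D"): "D"}
ALLD = {state: "D" for state in STATES}


def transition_matrix(strat_x, strat_y, eps):
    """Each player independently misimplements its action with probability eps."""
    matrix = []
    for ax, ay in STATES:
        intend_x = strat_x[(ax, ay)]
        intend_y = strat_y[(ay, ax)]  # Y views the state from its own side
        row = []
        for nx, ny in STATES:
            px = 1.0 - eps if nx == intend_x else eps
            py = 1.0 - eps if ny == intend_y else eps
            row.append(px * py)
        matrix.append(row)
    return matrix


def stationary_distribution(matrix, squarings=50):
    """Approximate the stationary distribution by repeated matrix squaring."""
    n = len(matrix)
    for _ in range(squarings):
        matrix = [
            [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Renormalize rows so that rounding errors do not accumulate.
        matrix = [[value / sum(row) for value in row] for row in matrix]
    return matrix[0]


def average_payoff(strat_x, strat_y, eps):
    """Long-run average payoff to X when X plays strat_x against strat_y."""
    pi = stationary_distribution(transition_matrix(strat_x, strat_y, eps))
    return sum(p * PAYOFF[state] for p, state in zip(pi, STATES))


if __name__ == "__main__":
    for name, strat in [("WSLS", WSLS), ("TFT", TFT), ("ALLD", ALLD)]:
        print(name, average_payoff(strat, strat, eps=0.001))
```

For any positive error probability, every transition has positive probability, so the chain has a unique stationary distribution and the choice of the first row in `stationary_distribution` is immaterial.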
Appendix B: Upper bound of for st2 players to turn into Pavlov
Given , and , of the two players, denoted by X and Y, are sufficiently close to when one player, which we assume to be Y without loss of generality, misimplements the action to select C for the first time in round . Without any further action misimplementation, X keeps D and Y flips to D in round because . In round , X flips to C and Y keeps D. Therefore, we obtain , and . Combining the four equations, we obtain
(7) 
Using and , we obtain the condition for X to become Pavlov in round as , i.e.,
(8) 
The condition for Y to become Pavlov in round is given by , i.e.,
(9) 
Equation (8) implies Eq. (9) because and . Therefore, the two players become Pavlov in round if Eq. (8) holds true.
We assume that Eq. (8) is violated. If Eq. (9) is also violated, we obtain such that the two players mutually defect until the occurrence of another action misimplementation. If Eq. (9) is satisfied, the two players mutually defect in round . Because , and , we obtain . Because , and , we obtain . These two inequalities indicate that X and Y behave as st3 and st2 in round , respectively. By repeating the same procedure with X and Y swapped, we obtain
(10) 
Therefore, we obtain
(11) 
by induction. Equation (11) implies that the two players do not realize mutual cooperation if Eq. (8) is violated.
Therefore, an upper bound of for a pair of st2 players to turn into Pavlov is given by solving Eq. (8) with equality.
References
 Ancel, 1999 Ancel, L. W. 1999. A quantitative model of the Simpson–Baldwin effect. J. Theor. Biol. 196, 197–209.
 Ancel, 2000 Ancel, L. W. 2000. Undermining the Baldwin expediting effect: does phenotypic plasticity accelerate evolution? Theor. Popul. Biol. 58, 307–319.
 Axelrod, 1984 Axelrod, R. 1984. The Evolution of Cooperation. Basic Books, NY.
 Badyaev, 2009 Badyaev, A. V. 2009. Evolutionary significance of phenotypic accommodation in novel environments: an empirical test of the Baldwin effect. Philos. Trans. R. Soc. B 364, 1125–1141.
 Borenstein et al., 2006 Borenstein, E., Meilijson, I. & Ruppin, E. 2006. The effect of phenotypic plasticity on evolution in multipeaked fitness landscapes. J. Evol. Biol. 19, 1555–1570.
 Camerer, 2003 Camerer, C. F. 2003. Behavioral Game Theory. Princeton University Press, NJ.
 Crispo, 2007 Crispo, E. 2007. The Baldwin effect and genetic assimilation: revisiting two mechanisms of evolutionary change mediated by phenotypic plasticity. Evolution 61, 2469–2479.
 Doebeli et al., 2004 Doebeli, M., Hauert, C. & Killingback, T. 2004. The evolutionary origin of cooperators and defectors. Science 306, 859–862.
 Dopazo et al., 2001 Dopazo, H., Gordon, M. B., Perazzo, R. & Risau-Gusman, S. 2001. A model for the interaction of learning and evolution. Bull. Math. Biol. 63, 117–134.
 Downes, 2003 Downes, S. M. 2003. Baldwin effects and the expansion of the explanatory repertoire in evolutionary biology. In: Evolution and learning: The Baldwin effect reconsidered pp. 33–51. MIT Press, Cambridge, MA.
 Fudenberg & Levine, 1998 Fudenberg, D. & Levine, D. K. 1998. The Theory of Learning in Games. MIT Press, Cambridge, MA.
 Godfrey-Smith, 2003 Godfrey-Smith, P. 2003. Between Baldwin skepticism and Baldwin boosterism. In: Evolution and learning: The Baldwin effect reconsidered pp. 53–67. MIT Press, Cambridge, MA.
 Hinton & Nowlan, 1987 Hinton, G. E. & Nowlan, S. J. 1987. How learning can guide evolution. Complex Syst. 1, 495–502.
 Hofbauer & Sigmund, 1998 Hofbauer, J. & Sigmund, K. 1998. Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, UK.
 Hofbauer & Sigmund, 2003 Hofbauer, J. & Sigmund, K. 2003. Evolutionary game dynamics. Bull. Am. Math. Soc. 40, 479–519.
 Karandikar et al., 1998 Karandikar, R., Mookherjee, D., Ray, D. & Vega-Redondo, F. 1998. Evolving aspirations and cooperation. J. Econ. Theory 80, 292–331.
 Kraines & Kraines, 1989 Kraines, D. & Kraines, V. 1989. Pavlov and the prisoner’s dilemma. Theory Decis. 26, 47–79.
 Lindgren, 1991 Lindgren, K. 1991. Evolutionary phenomena in simple dynamics. In: Proceedings of Artificial Life II pp. 295–312.
 Macy, 1996 Macy, M. 1996. Natural selection and social learning in prisoner’s dilemma: coadaptation with genetic algorithms and artificial neural networks. Sociol. Methods Res. 25, 103–137.
 Macy & Flache, 2002 Macy, M. W. & Flache, A. 2002. Learning dynamics in social dilemmas. Proc. Natl. Acad. Sci. USA 99, 7229–7236.
 Masuda & Nakamura, 2011 Masuda, N. & Nakamura, M. 2011. Numerical analysis of a reinforcement learning model with the dynamic aspiration level in the iterated prisoner’s dilemma. J. Theor. Biol. 278, 55–62.
 Masuda & Ohtsuki, 2009 Masuda, N. & Ohtsuki, H. 2009. A theoretical analysis of temporal difference learning in the iterated prisoner’s dilemma game. Bull. Math. Biol. 71, 1818–1850.
 Maynard Smith, 1987 Maynard Smith, J. 1987. Natural selection: when learning guides evolution. Nature 329, 761–762.
 Metz et al., 1996 Metz, J. A. J., Geritz, S. A. H., Meszéna, G., Jacobs, F. J. A. & van Heerwaarden, J. S. 1996. Adaptive dynamics: a geometrical study of the consequences of nearly faithful reproduction. In: Stochastic and spatial structures of dynamical systems (S. J. van Strien and S. M. Verduyn Lunel eds.) pp. 188–231. North Holland, Amsterdam, The Netherlands.
 Nowak & Sigmund, 1993 Nowak, M. & Sigmund, K. 1993. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game. Nature 364, 56–58.
 Nowak & Sigmund, 1992 Nowak, M. A. & Sigmund, K. 1992. Tit for tat in heterogeneous populations. Nature 355, 250–253.
 Nowak et al., 1995 Nowak, M. A., Sigmund, K. & El-Sedy, E. 1995. Automata, repeated games and noise. J. Math. Biol. 33, 703–722.
 Paenke et al., 2009 Paenke, I., Kawecki, T. J. & Sendhoff, B. 2009. The influence of learning on evolution: a mathematical framework. Artif. Life 15, 227–245.
 Paenke et al., 2007 Paenke, I., Sendhoff, B. & Kawecki, T. J. 2007. Influence of plasticity and learning on evolution under directional selection. Am. Nat. 170, E47–E58.
 Posch et al., 1999 Posch, M., Pichler, A. & Sigmund, K. 1999. The efficiency of adapting aspiration levels. Proc. R. Soc. Lond. B 266, 1427–1435.
 Sandholm & Crites, 1996 Sandholm, T. W. & Crites, R. H. 1996. Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems 37, 147–166.
 Simpson, 1953 Simpson, G. G. 1953. The Baldwin effect. Evolution 7, 110–117.
 Suzuki & Arita, 2004 Suzuki, R. & Arita, T. 2004. Interactions between learning and evolution: outstanding strategy generated by the Baldwin effect. Biosystems 77, 57–71.
 Szabo & Toke, 1998 Szabo, G. & Toke, C. 1998. Evolutionary prisoner’s dilemma game on a square lattice. Phys. Rev. E 58, 69–73.
 Taiji & Ikegami, 1999 Taiji, M. & Ikegami, T. 1999. Dynamics of internal models in game players. Physica D 134, 253–266.
 Traulsen et al., 2006 Traulsen, A., Nowak, M. A. & Pacheco, J. M. 2006. Stochastic dynamics of invasion and fixation. Phys. Rev. E 74, 011909.
 Trivers, 1971 Trivers, R. L. 1971. The evolution of reciprocal altruism. Q. Rev. Biol. 46, 35–57.
 Turney et al., 1996 Turney, P., Whitley, D. & Anderson, R. W. 1996. Evolution, learning, and instinct: 100 years of the Baldwin effect. Evol. Comput. 4, 4–8.
 Waddington, 1942 Waddington, C. H. 1942. Canalization of development and the inheritance of acquired characters. Nature 150, 563–565.
 Wang et al., 2008 Wang, S., Szalay, M. S., Zhang, C. & Csermely, P. 2008. Learning and innovative elements of strategy adoption rules expand cooperative network topologies. PLoS One 3, e1997.
 Weber & Depew, 2003 Weber, B. H. & Depew, D. J. eds. 2003. Evolution and learning: The Baldwin effect reconsidered. MIT Press, Cambridge, MA.
 Yeh & Price, 2004 Yeh, P. J. & Price, T. D. 2004. Adaptive phenotypic plasticity and the successful colonization of a novel environment. Am. Nat. 164, 531–542.
Table 1: Payoffs to nonlearning players. The rows and columns of the payoff matrix are the strategies st1, st2, st3, st4, and st5.