Understanding The Impact of Partner Choice on Cooperation and Social Norms by means of Multi-agent Reinforcement Learning


The human ability to coordinate and cooperate has been vital to the development of societies for thousands of years. While it is not fully clear how this behavior arises, social norms are thought to be a key factor in this development. In contrast to laws set by authorities, norms tend to evolve in a bottom-up manner from interactions between members of a society. While much behavior can be explained through the use of social norms, it is difficult to measure the extent to which they shape society as well as how they are affected by other societal dynamics.

In this paper, we discuss the design and evaluation of a reinforcement learning model for understanding how the opportunity to choose one's interaction partners in a society affects the overall societal outcome and the strength of social norms. We first study the emergence of norms and then the emergence of cooperation in the presence of norms. In our model, agents interact with other agents in a society in the form of repeated matrix games: coordination games and cooperation games. At each stage, agents either choose a partner to interact with or are paired at random, and they learn using policy gradients.





Nicolas Anastassacos, Steve Hailes, Mirco Musolesi


Many animal populations have evolved social groups to form cooperative societies, preferring long-term over short-term gains and exhibiting pro-social behavior (Riedman, 1982; Davies et al., 2012). The social norms developed by humans, though more complex, are built on the same principles and are self-enforcing (Axelrod, 1997). Formally, norms are defined as common understandings or informal rules that govern the behavior of members of a society. In the present day, these principles might derive from a social code such as a modern-day constitution or a religious text. Norms can be as simple as dining etiquette or fashion sense, or more complex, such as those at the basis of political systems. They have been shown to have a great impact on the collective outcomes and progression of a society; indeed, while the sources and objectives of these norms can be vastly different, they promote efficient cooperation on a large scale and enable individuals to create a stable society (Elster, 1989). How and why humans tend to conform to norms is a topic of much study and dispute in the fields of psychology, sociology, and multi-agent systems (Abrams et al., 1990; Fehr & Gachter, 2000; Epstein, 2001; Boella et al., 2006). Existing studies are typically based on agents with static (i.e., rule-based) behavior grounded in game-theoretic approaches. The use of reinforcement learning instead allows agents to adapt dynamically to the interactions in society as they develop, and shows promise as a useful tool for these types of societal simulations (Sen & Airiau, 2007; Leibo et al., 2017; Perolat et al., 2017).

While it has been argued that normative behavior emerges from societal interactions, it is not clear what behavior is likely to emerge given some societal configuration. Multi-agent social dilemmas provide a powerful platform to study the emergence of behavior and the development of cooperative solutions in the presence or absence of social norms (Macy & Flache, 2002; Sen & Airiau, 2007; Izquierdo et al., 2008). Social dilemmas are useful when studying these complex interactions as they allow for interplay between the desires of the individual and the collaborative desires of the group (Gotts et al., 2003; Izquierdo et al., 2008). Often, the short-term benefits of free-riding or exploiting relationships are sufficient to have harmful consequences on the rest of society, and such behavior needs to be regulated.

In this paper, we analyze how the freedom to choose a partner for interactions can affect the outcome for society, using reinforcement learning to facilitate agent learning and social dilemmas as a platform for interaction. We first study the emergence of norms and then the emergence of cooperation in the presence of norms. Partner choice has been thought to be a driving factor in the emergence of cooperative societies and a catalyst for altruistic behavior (Barclay & Willer, 2006; Cuesta et al., 2015). Furthermore, both partner choice and reputation have been shown to have direct effects on the formation of groups in trials with human participants. Though the majority of this research has been conducted on human behavior, Fu et al. developed a model that produces comparable results using artificial agents (Fu et al., 2008); however, they utilize rules to facilitate partner switching and directly penalize agents with low reputation. In our model, agents are presented only with the previous interactions of their opponents when making decisions, are not constrained to choose partners in any predetermined way, and learn purely from interacting with others in the society.

Through our experiments, we demonstrate how choice of partner affects the emergence of norms, their robustness and the potential for influencers to impact the status quo. We further show that in the presence of an existing norm, adding this capability to agent behavior allows cooperative agents to thrive as they become heavily sought after as partners. The fact that cooperative and defecting agents alike look to avoid defecting behavior directly increases the strength of a cooperative norm and reduces the motivation to defect in future episodes. We contrast these results with the outcomes when agents interact at random to show the impact of partner choice and reputation on the development of norms and cooperation.

This work has several implications. First of all, the results presented in this paper can be used as a basis for the design of systems in which cooperation emerges without the need to implement mechanisms at the agent level, e.g., autonomous and robotic systems. Moreover, the findings also shed light on the mechanisms at the basis of human and animal cooperation and coordination, providing a novel methodological framework for researchers in a variety of disciplines, from game theory to evolutionary psychology.


In this section, we review the existing work related to the use of reinforcement learning for multi-agent systems and the emergence of norms.


While researchers can employ learning strategies in agents to simulate the emergence of norms, it is a difficult task to build simulations that encompass the natural physical and social constraints that individuals face within a society. Nevertheless, investigations into normative behavior have been revealing. Sen et al. proposed a replicable multi-agent learning framework for social norm emergence in which agents interact at random in a society (Sen & Airiau, 2007). In this simulation, agents had access to the actions taken by their opponent but not their identity. They demonstrated that agents trained with a hill-climbing technique, Win or Learn Fast (WoLF) (Bowling & Veloso, 2002), were capable of converging to a norm in a driving simulation that required all agents in a population to drive either on the left side or the right side of the road for maximum societal gain (Sen & Airiau, 2007). In another set of experiments, Sen et al. tested the outcome of this framework while constraining agents to interactions with their neighbors (Sen & Sen, 2010). This has since been extended to agent societies with a variety of network topologies (Yu et al., 2013), which have proven suitable for the emergence of norms and conventions as they provide an effective infrastructure in which interactions within a social network can take place.

These simulated societies provide a simplified version of reality. However, they can be used as a basis for understanding collective human behavior by abstracting key aspects and mechanisms. More specifically, in this paper, we explore the impact of partner selection on the emergence of cooperation and coordination by augmenting traditional simple agents with a module implementing partner choice based on the outcome of previous interactions with other agents in the system.


The traditional reinforcement learning objective is to maximize cumulative reward over a trajectory of states and actions (Sutton & Barto, 2018). Instead of modeling the environment dynamics explicitly, the agent aims to optimize its behavior by interacting directly with the environment over a large number of episodes.
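This objective can be written compactly. For a trajectory τ sampled by following policy π_θ, with per-step rewards r_t and discount factor γ:

```latex
J(\pi_\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]
```

The agent adjusts θ to increase this expectation from sampled episodes alone, without access to the environment's transition dynamics.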

An efficient solution adopted in deep reinforcement learning is the use of an experience replay buffer to store previous interactions with the environment, which is needed to derive more accurate updates (Mnih et al., 2013). Non-stationarity has been one of the most problematic issues in the design of reinforcement learning algorithms; indeed, changes might occur (and will occur in any real-world setting) in the environment, and they might affect the learning dynamics (Sandholm & Crites, 1996; Busoniu et al., 2008). Consequently, a focus of multi-agent reinforcement learning is to design mechanisms that allow informed updates to agent parameters by incorporating estimates of opponent behavior, an approach known as agent tracking (Busoniu et al., 2008). Other techniques have been based on the use of centralized controllers to share information across agents and stabilize learning (Lowe et al., 2017).
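As a concrete sketch, a minimal replay buffer of this kind might look as follows; the capacity, tuple layout, and method names here are illustrative choices, not taken from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive steps
        return random.sample(list(self.buffer), batch_size)

    def clear(self):
        # Refreshing the buffer each iteration mitigates non-stationarity
        # when other agents' policies are also changing
        self.buffer.clear()

    def __len__(self):
        return len(self.buffer)
```

Clearing the buffer between iterations, as done later in this paper, trades sample efficiency for robustness to the shifting behavior of co-learning agents.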


Sen et al. propose a method to study norm emergence using Q-learning and policy hill climbing, though they note that Q-learning converges only to deterministic behavior, which could be problematic (Sen & Airiau, 2007). This work highlighted the emergence of norms and the capability of learning agents to converge to similar patterns of behavior and achieve a social goal. Leibo et al. use Q-learning in a variety of games that can be defined as social dilemmas; they account for Q-learning being deterministic by training under different environmental settings, e.g., in the presence of many or few resources (Leibo et al., 2017). The authors measure the effect of network size and learning capability on the outcome of the society, and conclude that it is likely that, as the size of the network increases, so does the complexity of the resulting behavior. Like other social dilemmas, such as the Iterated Prisoner's Dilemma (IPD), resource appropriation games such as the Tragedy of the Commons have been further studied using reinforcement learning techniques (Perolat et al., 2017). In their investigation, Perolat et al. use a reinforcement learning model to observe the interactions of evolving agents and describe agents' behavior through intuitive societal metrics that monitor important social dynamics such as equality, peace, and sustainability (Perolat et al., 2017).


We model sequential social dilemmas as general-sum Markov games with each agent having a partial observation of the environment. Agents must learn an appropriate policy while coexisting with one another. We focus specifically on two archetypal examples of social dilemmas: a cooperation-based game and a coordination-based game. A cooperation game exposes the tension between an individual's selfish desires and the good of the group, while a coordination game examines agent conformity, which is key to the functioning of any society. To capture these phenomena, we use the following variations of social dilemmas, which are formalized by matrix payoffs and are temporally extended. This is discussed further in the following sections.


As a first step, we study the emergence of norms in a society of agents. We use a coordination game that is summarized in Table 1b. More specifically, the table displays the outcomes for a variation of a coordination game that has two Nash Equilibria. This is akin to a "Choosing Sides" game where it does not matter which action each agent takes as long as they agree to choose the same one. Example choices are eating in versus eating out, or going to a party versus staying at home. For generalizability, we will differentiate between these actions by referring to them simply as A and B. Tables 1c and 1d show outcomes for pure coordination games where players would prefer the same Nash Equilibrium outcome. In Table 1b, it does not matter which action agents pick as long as they pick the same one. In Tables 1c and 1d, the choice of a specific action matters more, as action A maximizes the joint reward. The emergence of norms is assessed by tracking the number of agents that converge to a particular equilibrium point. Over the course of the simulation, agents learn behavior dependent on their experiences with the agents that they are paired with, at random or by choice.

(a)       A      B
     A  a, a   b, b
     B  c, c   d, d

(b)       A      B
     A  5, 5   0, 0
     B  0, 0   5, 5

(c)       A      B
     A  5, 5   0, 0
     B  0, 0   4, 4

(d)       A      B
     A  5, 5   0, 0
     B  0, 0   3, 3
Table 1: Payoff Matrices for Coordination Games. Each agent in the society has an incentive to play the same strategy as the partner that they have been paired with. This is formalized by the payoff matrix in (a), where the diagonal payoffs a and d exceed the off-diagonal payoffs b and c. We test three variations of the coordination game to understand how unbalancing the reward signals affects the convergence of norms and the size of the resulting groups. In (b) the payoff matrix has two Nash Equilibria that are Pareto efficient, as there exists no outcome that increases any agent's reward without decreasing any other agent's. In (c) and (d), both (A, A) and (B, B) are Nash Equilibria, but (A, A) Pareto dominates (B, B).
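The equilibrium structure of these payoff tables can be checked mechanically. The sketch below encodes Tables 1b–1d as Python dictionaries (the encoding and function names are ours, not the paper's) and tests unilateral deviations:

```python
# Payoff tables from Table 1. In these coordination games, both players
# receive the same payoff in every cell, so one entry per cell suffices.
# Keys are (row action, column action); values are the row player's reward.
GAME_1B = {('A', 'A'): 5, ('A', 'B'): 0, ('B', 'A'): 0, ('B', 'B'): 5}
GAME_1C = {('A', 'A'): 5, ('A', 'B'): 0, ('B', 'A'): 0, ('B', 'B'): 4}
GAME_1D = {('A', 'A'): 5, ('A', 'B'): 0, ('B', 'A'): 0, ('B', 'B'): 3}

def is_nash(game, joint):
    """True if neither player gains by unilaterally deviating.

    Relies on the payoffs being symmetric (both players get the same
    reward in every cell), as they are in Table 1.
    """
    row, col = joint
    other = {'A': 'B', 'B': 'A'}
    row_deviation = game[(other[row], col)]   # row player switches action
    col_deviation = game[(row, other[col])]   # column player switches action
    return game[joint] >= row_deviation and game[joint] >= col_deviation
```

Running `is_nash` confirms that (A, A) and (B, B) are equilibria in all three games, while mismatched action pairs are not; only (A, A) maximizes joint reward in 1c and 1d.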

We next investigate the effects of partner choice on the emergence of cooperation in the presence of a developed norm. First, we study whether cooperation can develop in the absence of any norm. Next, we analyze how cooperation evolves with partner choice. Finally, we consider the same scenarios in the absence of partner choice. While, in principle, many consistencies in behavior can be explained through the use of norms (for example, compliance with some moral standard), there is a complex interplay between the dynamics of many existing norms and emergent behavior. Observing the outcomes of behavior in the presence of various norms may provide insights into how cooperation can develop and the circumstances that allow it to last. This type of modeling methodology has been termed the direct social norm approach by Fehr (Fehr & Schurtenberger, 2018).

In our model, individual agents repeatedly interact with other agents in the society in coordination and cooperation scenarios. As discussed above, our main focus is on the effect of choice on the emergence of norms and other behaviors, as well as its effect on the stability of existing behavior.

        C      D
   C  R, R   S, T
   D  T, S   P, P

        C      D
   C  3, 3   0, 5
   D  5, 0   1, 1
Table 2: Payoff Matrix for Social Dilemmas and the Iterated Prisoner's Dilemma. The motivation to defect comes from fear of an opponent defecting, or from acting greedily to gain the maximum reward when one anticipates the opponent might cooperate. The game is modeled so that T > R > P > S and 2R > T + S. From a game-theoretic perspective, the optimal strategy in a finite game is to defect. This is undesirable, as both players could achieve greater reward if they agreed to cooperate.
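The concrete payoffs in the lower matrix satisfy the standard Prisoner's Dilemma ordering, which the following snippet verifies using the conventional names T, R, P, S (temptation, reward, punishment, sucker's payoff):

```python
# Row player's payoffs from Table 2:
# T = defect vs. cooperator, R = mutual cooperation,
# P = mutual defection,     S = cooperate vs. defector.
T, R, P, S = 5, 3, 1, 0

# Defining ordering of a Prisoner's Dilemma: the temptation to defect
# beats mutual cooperation, which beats mutual defection, which beats
# being exploited.
assert T > R > P > S

# For the *iterated* game, sustained mutual cooperation must also beat
# alternating exploitation between the two players.
assert 2 * R > T + S
```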

We define a social norm via a utility function U, modeled after Fehr et al. (Fehr & Schurtenberger, 2018), where the direct reward of an action a is r(a), but the overall utility accounts for whether or not a was in compliance with a cooperative norm. If it was not, the reward is penalized by a negative term proportional to the number of other agents who have complied with the norm. When no agents comply with the norm, no disutility is generated by deviating from it. This is written formally as

U(a) = r(a) − α n 1[a does not comply with the norm]        (1)
where n is the number of individuals who have complied with the social norm in this episode and α is a hyper-parameter that governs the effect of the social norm. For simplicity, we refer to α as the social norm strength. Interpretations of this norm could be an agent's "code of ethics" or guilt associated with inequity. For clarity, we name the outcomes in Table 2 (from the first player's perspective) as follows: (C, C) is cooperation, (D, C) is exploitation, (C, D) is deception, and (D, D) is defection.
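A direct implementation of this utility might look as follows; the exact linear form of the penalty is our reading of the text above, so treat it as a sketch:

```python
def norm_utility(reward, complies, n_compliant, alpha):
    """Utility under the social norm model (after Fehr & Schurtenberger).

    reward       -- the direct payoff r(a) of the chosen action
    complies     -- whether the action complied with the cooperative norm
    n_compliant  -- number of other agents who complied this episode
    alpha        -- social norm strength (hyper-parameter)
    """
    if complies:
        return reward
    # Deviating is penalized in proportion to how widely the norm is
    # followed; when nobody complies (n_compliant == 0), deviation
    # carries no disutility at all.
    return reward - alpha * n_compliant
```

Note how the penalty grows with compliance: a lone defector in a fully cooperative society pays the largest price, which is exactly the feedback loop exploited in the cooperation experiments below.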


In order to study the effect of choice on the emergence of norms, we use a policy gradient algorithm for learning behavior, as it can learn mixed strategies. For our experiments we use a population size of 50 agents. We tested societies with varying population sizes and found that the size of the population does not influence the outcome of our experiments and that a size of 50 was representative. For fairness, we provide every agent with an equal amount of information at every time step, and each agent learns independently from the other agents. Our agents are updated using REINFORCE (Williams, 1992). Similar to (Leibo et al., 2017), we account for changes in the behavior of the agents by refreshing the replay buffer after each iteration. In a two-player cooperation game, agents trained with either policy gradients or Q-learning learn to defect, the same solution produced by the Nash Equilibrium. In a two-player coordination game, agents converge to a single strategy that maximizes their joint reward. Both games studied here were implemented in the form of two-player matrix payoff games, and the results of encounters were recorded from the perspective of the first player.
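As an illustration of the learning rule, a minimal two-action softmax policy updated with REINFORCE on one-step episodes (a single matrix-game round reduces to this case) could be sketched as below; the class structure and parameter names are ours:

```python
import math
import random

class SoftmaxPolicy:
    """Two-action softmax policy trained with REINFORCE on one-step episodes."""

    def __init__(self, lr=1e-3):
        self.prefs = [0.0, 0.0]  # action preferences (logits)
        self.lr = lr

    def probs(self):
        exps = [math.exp(p) for p in self.prefs]
        z = sum(exps)
        return [e / z for e in exps]

    def act(self):
        return 0 if random.random() < self.probs()[0] else 1

    def update(self, action, reward):
        # REINFORCE: grad of log pi(a') is (1[a' == action] - pi(a')),
        # scaled by the reward of the sampled episode.
        p = self.probs()
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - p[a]
            self.prefs[a] += self.lr * reward * grad
```

Unlike greedy Q-learning, this policy remains stochastic throughout training (the entropy regularizer used in our experiments keeps it from collapsing too quickly), which is why it can represent mixed strategies.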

1:  for each episode do
2:     for each agent i in population do
3:        select a partner j ≠ i at random
4:        i and j play the matrix game once
5:        store the experience in i's and j's replay buffers
6:     end for
7:     for each agent i in population do
8:        update the playing policy of i
9:     end for
10: end for
Algorithm 1 Without Choice of Partner
1:  for each episode do
2:     for each agent i in population do
3:        observe the selection state (last 10 interactions of all other agents)
4:        choose a partner j ≠ i using the selection policy
5:        observe the play state (own and j's last 10 interactions)
6:        i and j each choose an action and play the matrix game once
7:        store the experiences in i's and j's replay buffers
8:     end for
9:     for each agent i in population do
10:       update the selection and playing policies of i
11:    end for
12: end for
Algorithm 2 With Choice of Partner
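The two algorithms differ only in how a partner is selected. A Python sketch of a single episode makes the contrast explicit; the agent interface (`choose_partner`, `play`, `store`) is illustrative, not the paper's:

```python
import random

def run_episode(agents, payoff, with_choice):
    """One episode: every agent picks (or is assigned) a partner and plays once.

    agents      -- mapping id -> object exposing choose_partner(ids),
                   play() -> action, and store(action, reward)
    payoff      -- dict mapping (my_action, their_action) -> my reward
    with_choice -- if True, use each agent's learned selection policy;
                   otherwise match partners uniformly at random
    """
    ids = list(agents)
    for i in ids:
        others = [j for j in ids if j != i]  # agents cannot choose themselves
        if with_choice:
            j = agents[i].choose_partner(others)  # learned selection policy
        else:
            j = random.choice(others)             # forced random matching
        a_i, a_j = agents[i].play(), agents[j].play()
        agents[i].store(a_i, payoff[(a_i, a_j)])
        agents[j].store(a_j, payoff[(a_j, a_i)])
```

Because the same agent may be chosen by several others within one episode, popular (e.g., cooperative) agents accumulate more experiences per episode, which is part of what drives the group dynamics reported below.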
(a) Random Partners* (Table 1b)
(b) Random Partners and Unbalanced Reward (Table 1c)
(c) Random Partners and Unbalanced Reward (Table 1d)
(d) With Choice of Partner (Table 1b)
(e) With Choice of Partner and Unbalanced Reward (Table 1c)
(f) With Choice of Partner and Unbalanced Reward (Table 1d)
Figure 1: Results of simulations for the Coordination game over 10,000 episodes. Payoff matrices used in (a & d), (b & e), and (c & f) are shown in Tables 1b, 1c, and 1d respectively. Maximum societal gain is reached in (b) and (c) by the end of the simulation. When agents have choice (e & f), there is slight resistance to converging to a single norm. With a greater differential in reward, this resistance is significantly less pronounced. All simulations run for 10,000 episodes and agents are trained with a learning rate of 1e-4. * In (a), the two actions are equally likely to emerge as the norm; for simplicity, we display only the results for one of them, as both distributions would be the same.

In each episode, agents were able to choose one agent to partner with. The same agent could be chosen more than once and by multiple agents, but agents could not choose themselves. The agents then play a pre-selected game once and store the experience in a replay buffer. After every agent has chosen once, agents are updated on the gathered experiences. Each agent has two policies, one for partner selection and one for playing either a Coordination or Cooperation game, each parameterized separately; correspondingly, each agent observes two states, a selection state and a play state. The selection state is determined by the previous 10 interactions of all other agents in the society. The number of previous interactions provided was a determining factor in how quickly agents were able to find suitable partners; 10 previous interactions was deemed representative of an agent's behavior. The play state is determined by the agent's own previous 10 interactions and those of their chosen opponent. Upon receiving the selection state, the agent chooses a partner to play the game with. Each agent then chooses an action to play when matched with their opponent and receives the reward corresponding to their joint actions. Default neural networks had two hidden layers with 64 units and were trained using a standard policy gradient. For each agent, the distribution over partners is updated with a policy gradient using the same reward that is observed after both agents play their actions. Furthermore, to ensure that, when having choice of partner, agents had sufficient opportunity to explore and interact with numerous agents, the learning rate of the choice component was set to 1e-4 while that of the playing component was set to 1e-3; both were trained with an entropy regularizer. Societies were simulated for 10,000 episodes in the coordination setting and until behavior converged in the cooperation setting.
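The state construction described above can be sketched as follows; the zero-padding of not-yet-filled histories and the helper names are our assumptions:

```python
from collections import deque

HISTORY = 10  # number of past interactions summarizing an agent's behavior

class AgentHistory:
    """Rolling record of an agent's last HISTORY actions (encoded 0 or 1)."""

    def __init__(self):
        # Pad with zeros until HISTORY real interactions have been observed
        self.actions = deque([0] * HISTORY, maxlen=HISTORY)

    def record(self, action):
        self.actions.append(action)  # oldest entry is dropped automatically

def selection_state(histories, me):
    """State for the partner-selection policy: every *other* agent's history."""
    return [list(histories[j].actions) for j in sorted(histories) if j != me]

def play_state(histories, me, partner):
    """State for the game-playing policy: my history plus my partner's."""
    return list(histories[me].actions) + list(histories[partner].actions)
```

With a population of 50 agents, the selection state is thus a 49 × 10 observation, while the play state is a flat vector of 20 past actions.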


From a societal point of view, the goal of the game is to achieve the highest cumulative score, while each agent individually tries to maximize their own score. In a two-player game where both RL agents make their decisions simultaneously, agents have been shown to adopt a strategy where both always choose action A or both always choose action B. As more agents are introduced into the society, agents can no longer reap the rewards of choosing just one action, as they can no longer assume the behavior of their opponent. Our experiments investigate how behaviors develop and how partner choice affects this outcome.

When agents interact at random, our experiments show that they conform to choosing either action A or action B exclusively. However, in the presence of partner choice, we observe the emergence of a different type of group behavior. Instead of a single norm emerging for a single optimal social outcome, multiple norms emerge and can co-exist. In Figure 1a we display the results using a learning rate of 1e-4; contrasting this with Figure 1d, where agents can choose their partners, we can see that, although agents that interact at random conform to a single norm, this happens at a noticeably slower rate. Each agent is randomly initialized with a probabilistic policy and, when agents can select their partners, they quickly learn to interact with agents who behave similarly to them so as to maximize their individual reward. During training, the number of agents that converge to either norm increases at approximately the same rate and with the same variance, as expected. This results in the formation of two groups of similar size that have each conformed to a dominant norm. An agent's choice of partner fluctuates, but the groups are formed quickly and early on in the simulation. Figure 1 shows the number of times a type of interaction occurs in a single episode. With a population of 50 agents and 5 turns per agent in every episode, the total number of interactions per episode is 250. Nearing the end of the simulation, almost all agents interact exclusively with other agents in the same group, and very few interactions take place between agents of different groups.


In the previous experiment, we observed an outcome in which multiple social norms could coexist due to partner choice. However, in the real world it is often the case that one trend or convention suddenly comes to be seen as less attractive and is succeeded by another. This can be interpreted as a change in the perceived rewards, which can emerge naturally or be steered via interactions and influences in the society. In an agent society in which interactions are determined by a random walk, when the rewards are changed such that action B gives less reward than action A, the emerging norm is for all agents to pick action A every time (Figure 1). It is notable that just a small change in the reward results in such a significant and consistent change in the societal outcome. However, when agents are given choice of partner, though a smaller proportion of agents select action B, both actions remain viable and two groups are formed. This diversity comes at the cost of a lower cumulative reward for the society, as a portion of agents favor an action that produces less reward, yet it survives due to the grouping together of agents who favor the same action. There is also a higher variance in the number of agents that converge to each norm. Intuitively, the smaller the reward, the smaller the expected size of the group, and this is reflected in the results. Likewise, the higher the relative reward, the more quickly agents learn to select it and discard other actions. Upon further investigation, it can be seen that larger group sizes depend on whether agents that adopt the same convention are able to find each other early on in the simulation.

While randomizing interactions between individuals maximizes societal gain, a model that accounts for choice of partner demonstrates that agents prefer to be around like-minded individuals and distance themselves from those with different perspectives, despite potentially being able to achieve a higher reward elsewhere. While these results are interesting in their own right, we should be careful about how we interpret them. If we were to naively consider each reward as being related to, say, a certain measure of happiness or satisfaction, the former result suggests that forcing individuals to interact with one another regardless of perspectives is an effective way of getting people to conform to some common understanding, while the latter suggests that, given the choice, individuals self-segregate, and that this distancing only continues to grow during the society's lifespan. The resulting diversity can be a point of strength or weakness for the society, depending on whether agents have consistent opportunities to interact with a diverse range of other agents. At the end of our simulation, once the groups are formed, it becomes increasingly unlikely for agents to leave their group and join another. This raises questions about the dynamics that would be needed to facilitate bridging between multiple groups.

(a) Five influencers with random partners
(b) Five influencers with partner choice
Figure 2: Five influencers who play a single fixed action are inserted into the environment. (a) shows the outcome when agents are forced to interact at random. (b) shows the outcome when agents have choice of partner. The effect of the influencers is significantly stronger in (a), where only one norm emerges.
(a) Cooperative societal outcome with choice of partner
(b) Defective societal outcome with random partners
Figure 3: Results of experiments showing the amount of cooperation in the society at the end of every episode and the cumulative rewards for each outcome. In (a), exploitation and cooperation initially increase at the same rate as agents search for partners. The number of agents being deceived, i.e., choosing opponents who will defect against them, decreases and plateaus quickly. Cooperative behavior is reinforced, eventually resulting in the maximum cumulative gain for society with 250 cooperation encounters. (b) shows the results when agents do not have choice of partner and must interact with a random agent. Agents are rarely rewarded for cooperating, and cooperation is not seen often enough to take hold. Cooperative societies thus achieve significantly higher reward per episode; without freedom to choose a partner, reward per episode decreases as defecting increases.

We term these fixed-strategy agents influencers, as their expected role is to steer the crowd. In Figure 2 we show the results of our experiments with five influencers playing a single fixed action. In the presence of influencers, RL agents that interact with partners at random are shown to conform to the influencers' behavior even in large populations. Influencers have a strong effect on the learning agents despite interacting infrequently. These influencers cannot choose partners to interact with themselves, but they can be chosen by the other agents. Nevertheless, their effect causes an obvious shift in the norm that the agents converge to. After 4,000 episodes, the trajectories of the two norms diverge more noticeably. The more agents conform to the influencers' norm, the faster it spreads. After 10,000 episodes there are no agents left who play actions that diverge from the norm. When choice of partner is available, this is no longer the case. The influencers may play a role in shaping the behavior of some agents; however, agents who have had better experiences with other agents learn instead to simply avoid interacting with the influencers and choose partners who share similar values.

Additionally, while there is a noticeable increase in the number of agents that conform to the influencers' behavior early on, the rate at which agents conform to either norm is approximately equal once agents have adjusted and found desired partners. In other words, it becomes harder to influence or regulate societal behavior through assimilation or supervision when agents are free to choose whom they interact with in the society. In contrast, when agents are forced to interact at random, they need to conform to a societal norm that can be determined by influencers, as they never know when they will run into others with different or conflicting behavior and, therefore, receive less reward. Meanwhile, when agents have partner choice, the introduction of influencers plays a smaller role in the outcome than small variations in the reward signal (Figure 1).


We first test a society without a pre-existing norm and find that, with or without partner choice, agents quickly learn to defect against one another. The early gains from defecting are decisive in this instance and, even with partner choice, agents initialized with highly cooperative behavior learn to defect. We then analyze how giving agents the capability to choose a partner for interaction affects cooperative behavior in the presence of an existing norm, compared to the situation in which they are forced to interact randomly. With the social norm strength set to a fixed positive value, the RL agents in the society exhibit different behaviors. While agents are randomly initialized, they quickly adapt to the responses of their partners. The amount of cooperation and exploitation increases at the same rate early on in the society's lifespan, as both cooperative and defecting agents look to pair with cooperative agents to maximize their reward. Tracking the selected partners of each agent, we can see that agents who have a high probability of defecting are rarely selected after a few thousand episodes. As cooperation rises, so does the opportunity for defecting agents to exploit the good nature of others in society; however, at the same time the utility of defecting decreases as the social norm penalty increases (see Eq. (1)). The cooperative agents that find other cooperative agents early on benefit greatly and promote cooperation amongst themselves, reinforcing cooperative behavior with each episode. At approximately 2,000 episodes into the simulation, the disutility generated by the norm causes the defecting agents to adjust their behavior and start cooperating. By the end of the simulation, all the agents in the society have learned to cooperate and defecting behavior is removed altogether (Figure 3a).

In contrast, in the presence of the same social norm but without choice of partner, agents do not learn to cooperate. Not enough agents cooperate early in the society’s lifespan for the norm to take effect, and agents begin defecting. While this general trend occurs consistently, the rate at which it occurs varies, and there is often slow resistance in the society depending on whether cooperative agents are lucky enough to meet other cooperative agents as pairs are randomly assigned (Figure 3b). This adds some strength to the norm, but not enough to dissuade agents from defecting. In both of these scenarios, the fate of the society relies not only on a suitably strong social norm but also on cooperative agents finding each other early on and reinforcing one another’s behavior. If agents are not given that opportunity, defecting remains the superior strategy. Furthermore, convergence to cooperation can be significantly slower and less pronounced depending on the number of agents initialized to behave cooperatively and, therefore, on the number of cooperative interactions that happen in the early stages of the simulation. As agents are not given the option to reject being chosen, there are cases where only small groups of cooperative agents find each other, resulting in only a small amount of disutility being generated from defecting; as a result, exploiting behavior is sustained. We expect this outcome would not be observed if agents could refuse to be paired with other agents. We plan to investigate this type of dynamics in future work.
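The partner-selection dynamic that drives these results can be sketched with a simple reputation tracker: agents record partners' past actions and prefer the partner with the highest observed cooperation rate. The class below is a hypothetical illustration of this mechanism, not the paper's implementation (which learns selection via policy gradients); the optimistic prior of 0.5 for unseen agents is also an assumption.

```python
from collections import defaultdict

class ReputationTracker:
    """Hypothetical sketch: track partners' past actions and prefer the
    candidate with the highest observed cooperation rate."""

    def __init__(self):
        # agent_id -> list of observations (1 = cooperated, 0 = defected)
        self.history = defaultdict(list)

    def record(self, agent_id, cooperated):
        self.history[agent_id].append(1 if cooperated else 0)

    def reputation(self, agent_id):
        obs = self.history[agent_id]
        # Assumed optimistic prior of 0.5 for agents never seen before.
        return sum(obs) / len(obs) if obs else 0.5

    def choose_partner(self, candidates):
        return max(candidates, key=self.reputation)

tracker = ReputationTracker()
tracker.record("a", True)
tracker.record("a", True)
tracker.record("b", False)
tracker.record("b", True)
partner = tracker.choose_partner(["a", "b"])  # consistent cooperator wins
```

Under this rule, agents with a high probability of defecting stop being selected, which is exactly the exclusion effect described above; without partner choice, no such filtering occurs.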

In both the coordination and cooperation settings, the added dynamic of choosing a partner has significant consequences for the outcomes of a society. In coordination settings, choice of partner allowed multiple social norms to develop instead of all agents conforming to a single norm. Furthermore, the addition of choice also allowed agents to resist external influence, notably an outsider trying to enforce change. However, when something inherent about one norm was affected, such as a perceived reward being decreased, having choice of partner on its own was not sufficient to sustain it. The proposed methodology can be extended further to simulate the formation of groups using temporally extended social dilemmas. It can also be used to investigate concepts such as reputation and group cohesion.

While these results provide evidence that the capacity to select a partner and an awareness of the reputation of others can have a significant effect on the emergence of cooperation, we cannot perfectly simulate the way in which individuals make that decision. In order to allow agents to interact with multiple partners, we need to assign learning rates and use regularizers. Our simulation results indicate that cooperation is promoted in societies in which agents have partner choice, and that such societies are significantly more prosperous for it. As agents in our experiments learn, behaving negatively has, on the one hand, the long-term consequence of being shunned by other individuals in the society. On the other hand, behaving positively results in being chosen (and, therefore, achieving more reward) and allows cooperative behavior to be sustained and, ultimately, thrive.
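The agents' learning rule (policy gradients) can be sketched as a minimal REINFORCE-style update for a stateless two-action policy. The softmax parameterization, learning rate, and zero baseline below are illustrative assumptions; the paper's agents operate over richer inputs.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, action, reward, lr=0.1, baseline=0.0):
    """One REINFORCE step for a stateless policy: raise the
    log-probability of the taken action in proportion to the
    (baseline-subtracted) reward."""
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        l + lr * advantage * ((1.0 if a == action else 0.0) - p)
        for a, (l, p) in enumerate(zip(logits, probs))
    ]

# If cooperating (action 0) keeps being rewarded -- e.g. because the agent
# keeps being chosen as a partner -- its probability rises over time:
logits = [0.0, 0.0]
for _ in range(50):
    logits = reinforce_update(logits, action=0, reward=1.0)
p_coop = softmax(logits)[0]
```

This captures the reinforcement loop described above: positive behavior earns reward, which shifts the policy further toward that behavior.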


In this paper we have investigated the effect of partner choice on agent behavior when agents interact in a society in the presence of norms. Our experimental results show that, in coordination settings, agents that have choice of partner are able to sustain norms in a society and show resistance to change in the presence of influencing agents who play a fixed strategy. Much of this resistance is a consequence of the freedom to choose a partner, as agents can choose to avoid these influencers; however, there is a significant shift in the outcome when the rewards themselves are adjusted.

Furthermore, we have also shown that partner choice can promote cooperative behavior in the presence of norms. Using an initially weak norm, we have observed that, where agents have the freedom to choose their partners, both cooperative and defecting agents pair themselves almost exclusively with cooperative agents, whose reputation is determined by their prior behavior. This is the key factor that stabilizes cooperation: untrustworthy agents are avoided and cooperative behavior can be reinforced as the social norm is strengthened. We believe that the findings presented in this paper can be used as a basis for the design of autonomous systems and provide novel insights into the emergence of cooperation in a society.

