Self-organization in a distributed coordination game through heuristic rules
In this paper we consider a distributed coordination game played by a large number of agents with finite information sets, which characterizes the emergence of a single dominant attribute out of a large number of competitors. Formally, agents repeatedly play a coordination game which has exactly Nash equilibria, all of which are equally preferred by the agents. The problem is to select one equilibrium out of possible equilibria in the least number of attempts. We propose a number of heuristic rules based on reinforcement learning to solve the coordination problem. We see that the agents self-organize into clusters with varying intensities depending on the heuristic rule applied, although all clusters but one are transitory in most cases. Finally, we characterize a trade-off between the time required to achieve a degree of stability in strategies and the efficiency of such a solution.
Keywords: Majority games, adaptation, reinforcement learning, distributed coordination, self-organization.
JEL code: C72, C63, D61
Understanding collective behavior of large-scale multi-agent systems is an important question in the econophysics and the sociophysics literature [1, 2]. Often in social and economic worlds, we find emergence and evolution of global characteristics that cannot be explained in terms of fundamental properties. We find examples of particular social norms or technologies that become more popular than their competitors, which are not necessarily worse in terms of attributes. Similarly, norms and opinions emerge as an equilibrium through reinforcement among the social and economic agents. Leaders emerge in the political context through a complicated process of competition and interaction among millions of individuals. In this paper, we present a simple multi-agent game to study the emergence of one dominant attribute out of many potential competitors through complex and adaptive interactive processes.
We focus on two properties of large-scale interaction. First, agents can coordinate on specific choices from a number of potentially identical choices, which may also be interpreted as the emergence of cooperation. Second, such coordination may take time to arrive at but, once reached, can be quite stable. Therefore, we address the dynamic (and potentially non-equilibrium) process through which coordination takes place as well as the stability of the eventual equilibrium. We consider a prototype model to study this kind of situation. In particular, we consider a simple coordination game with agents and choices. Individual agents aim to converge to a single universally chosen outcome; i.e., the game can be thought of as a majority game.
In the language of game theory, this relates to the idea of equilibrium selection. In our game, there are possible pure strategy Nash equilibria, each of which is equally attractive to the agents. The question is how, in the absence of communication, do the agents converge to only one equilibrium? Naturally, we do not allow a central planner to dictate the solution as that would make the problem trivial as well as unrealistic.
In our model, agents play the game repeatedly and they always want to be in the majority. We first present several strategies based on naïve learning that allow the agents to solve this coordination problem in a distributed manner. We next assume that the agents want to minimize their cost of experimentation, i.e., to come up with some fixed strategy as soon as possible even if it results in not being in the absolute majority. This leads to a trade-off between the degree of stability (time to attain an approximate fixed rule of thumb) and the efficiency of the solution, i.e., the degree of coordination. We propose multiple heuristic strategies for coordination that solve the problem to different degrees. We propose a Polya scheme following the famous Polya’s urn model, which allows us to interpolate between multiple types of reinforcement learning processes.
This paper is intimately related to the literature on the minority game [10, 11, 12] and the generalization of the minority game that goes by the name of the Kolkata Paise Restaurant (KPR) problem [13, 14]. In the minority game, there are agents and 2 options to choose from. The agents’ objective is to be in the minority. The KPR problem extended this to a minority game with agents and options labeled restaurants. In the spirit of Ref. , multiple attempts were made to propose strategies that use finite information sets with bounded rationality. Interested readers can refer to Ref.  for a comprehensive review. The model we propose is the exact opposite of the multi-choice minority game. Both are examples of large-scale distributed coordination problems that study competing agents employing adaptive strategies with limited learning.
In this paper, we show that agents converge to specific choices due to reinforcement learning. In particular, depending on the degree of reinforcement, agents may get stuck at different choices, creating clusters of different sizes. Clustering behavior has been studied in the context of minority games. Here, such behavior also implies that due to reinforcement, non-equilibrium configurations may also survive and hence, it is not necessarily a ‘winner-take-all’ scenario. Finally, we show that if the agents value not only coordination but also the time required to achieve absolute coordination, then there would be a trade-off between efficiency and stability of the final solution.
2 -agent coordination game
We consider agents and options. Time is discrete and at every point of time, each agent makes a choice about which among the options to use. To fix the idea, one can imagine each option to represent one restaurant which an agent will visit in a time slice. Therefore, each agent’s strategy is to choose a restaurant to visit in each time slice. Any given restaurant can accommodate a maximum of agents in any particular time slice. The agents’ objective is to stay in the majority, i.e., the agents would like to move to a restaurant which has a higher number of agents. In principle, may not be equal to . To impose symmetry on the problem, we assume , i.e., the number of agents is equal to the number of restaurants.
We also emphasize here that the game is necessarily non-cooperative and no communication is allowed among the agents. The information set for all agents is constrained to only their own history and partial knowledge about the past evolution of the restaurant occupancies. Naturally, making the full history across all restaurants available to the agents would immediately solve the problem, as the agents could employ a strategy in which they choose randomly in time slice 1 and, in the next time slice, move to the restaurant that attracted the largest number of agents in the first time slice. To have a non-trivial solution, we allow only a partial history to be available to the agents. We elaborate on the specifics of the information sets for each type of strategy below.
Fig. 1 shows the payoff matrix for a general convergence game for two players. Both players have strategies A and B, i.e., they may choose to visit either restaurant A or restaurant B. If both of them decide to visit the same restaurant (either A or B), then the outcome for both would be better than if they chose different restaurants. A couple of points may be noted. This game is a simplified version of the famous Battle of the Sexes game (see for example,  for a textbook treatment). In the Battle of the Sexes game, two players aim to converge to a single restaurant although they differ in their preferences over the restaurants. In this paper we assume a multi-agent multi-choice scenario with agents, but assume that all agents have identical preferences over the restaurants.
The agents decide on their strategies based on the attractiveness of a restaurant. We define attractiveness () of a restaurant as the number of agents that have chosen that restaurant. Thus attractiveness depends on the information set that the agent possesses. Naturally, at any given time slice, it is not possible to know how many other agents are choosing a given option.
For the sake of completeness, we define Nash equilibria for the coordination game. A Nash equilibrium is defined as a strategy collection such that, given every other agent’s strategy, each agent is weakly better off by not switching to a different strategy. For our purpose this description suffices. For a textbook description, see . From Fig. 1 it can be verified that there are two pure-strategy Nash equilibria, viz. both go to either restaurant or both go to . In a general -agent game, there would be pure-strategy Nash equilibria.
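The two-player case can be checked mechanically. The following is a minimal sketch that enumerates pure-strategy Nash equilibria by testing unilateral deviations. The payoff values (1 for meeting at the same restaurant, 0 otherwise) are an assumption made for illustration and are not taken from Fig. 1; the function name `pure_nash_equilibria` is likewise hypothetical.

```python
from itertools import product

STRATEGIES = ("A", "B")

# Assumed payoffs (not from Fig. 1): both players get 1 if they meet, 0 otherwise.
# payoff[(s1, s2)] = (payoff to player 1, payoff to player 2)
payoff = {
    ("A", "A"): (1, 1),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (1, 1),
}


def pure_nash_equilibria(payoff, strategies):
    """Return all strategy pairs from which no player gains by deviating alone."""
    equilibria = []
    for s1, s2 in product(strategies, repeat=2):
        u1, u2 = payoff[(s1, s2)]
        # Player 1 is weakly better off staying, given s2.
        best1 = all(u1 >= payoff[(d, s2)][0] for d in strategies)
        # Player 2 is weakly better off staying, given s1.
        best2 = all(u2 >= payoff[(s1, d)][1] for d in strategies)
        if best1 and best2:
            equilibria.append((s1, s2))
    return equilibria


equilibria = pure_nash_equilibria(payoff, STRATEGIES)
print(equilibria)
```

Under these assumed payoffs the only pure-strategy equilibria are the two coordinated outcomes, which matches the description above.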
It may be noted that Nash equilibrium is an equilibrium description and a static concept. It does not explain how one equilibrium would be chosen from many candidate equilibria in reality. So the essential question is how do agents coordinate to converge on one equilibrium out of possible choices, in absence of any information about what the other agents are thinking?
We specify a set of strategies below that solve this problem using finite sets of information and, in certain cases, with no information about the other agents.
3 Heuristic Updating Strategies
In this section, we present a set of updating strategies that the agents may employ in the coordination game. These can be thought of as rule-of-thumb strategies. In particular, they do not exhaust all possible strategies, but provide a comprehensive set that is useful for solving the game.
In the following, we define a strategy of an agent as a vector of probabilities that she assigns to the restaurants, i.e., each element of the vector represents the probability with which she chooses one restaurant. Formally, we denote the -th agent’s strategy at time slice as for . Learning is introduced as updating the probability vector based on success or failure in the past.
3.1 No learning
We begin with a No Learning strategy. This entails zero probability updating and represents a baseline case.
3.1.1 Zero updating
This strategy has two parts. Consider any generic time slice . First, the -th agent () assigns the following probability to the restaurants,
Naturally, this would lead to a randomly distributed allocation of agents across restaurants. In particular, [KPR] shows that the occupancy fraction, i.e., the number of restaurants occupied as a fraction of the total number , would be about 63.2%. So the first part is far from sufficient to ensure coordination.
The second part of the strategy allows the agent at time slice , to make a comparison between the choice made at time slice and the restaurant she is at in time slice . Because attractiveness depends on the number of agents in a restaurant, we denote the -th restaurant’s attractiveness at time slice by . Therefore, the strategy of an agent who is at restaurant is to go to restaurant if
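The occupancy fraction under uniformly random choices can be checked numerically. The sketch below assumes as many agents as restaurants and compares a simulated average with the closed form 1 - (1 - 1/N)^N, which tends to 1 - 1/e ≈ 0.632 for large N. The function names and the choice of N = 1000 with 200 runs are illustrative assumptions, not values from the paper.

```python
import math
import random


def occupied_fraction(n_agents, n_restaurants, rng):
    """Fraction of restaurants chosen by at least one agent in one random draw."""
    chosen = {rng.randrange(n_restaurants) for _ in range(n_agents)}
    return len(chosen) / n_restaurants


def expected_fraction(n):
    """Closed form: a given restaurant stays empty with probability (1 - 1/n)^n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 1000                # illustrative system size (assumption)
runs = [occupied_fraction(n, n, rng) for _ in range(200)]
simulated = sum(runs) / len(runs)

print("simulated:", simulated)
print("closed form:", expected_fraction(n))
print("large-N limit:", 1.0 - math.exp(-1.0))
```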
else, the agent stays at .
The information set of the -th agent who is at restaurant at the -th time slice comprises and where is the outcome of the random selection scheme (Eqn. 1) for the -th agent. Note that it entails gathering information about the -th restaurant that the -th agent has not visited at time slice , implying that we are allowing for local information. In principle, one can imagine that the agents may have to pay a cost to gather that information. This is a point we will later take up in fuller detail.
We present simulation results in Fig. 2 and Fig. 3. Fig. 2 shows the time required for absolute convergence, i.e., the minimum number of time slices required for all agents to converge at one restaurant, as a function of the number of agents . It shows a linear trend with a coefficient of about 8 on average. In the inset, we show the ratio as a function of which fluctuates around 8 after an initial steep rise. In the main diagram, we also provide an estimation of the standard deviation across number of simulations.
Fig. 3 shows the dominance of one restaurant over others (we show the second and the third most populated ones) over time in one simulation with . The second and the third most crowded restaurants initially start attracting more agents before decaying completely in terms of the number of agents as the dominant one becomes absolutely dominant and attracts all agents.
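The two-part rule described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the authors' code: the number of agents equals the number of restaurants, all agents update synchronously using the occupancies of the current slice, and an agent also moves when the two attractiveness values are tied (this tie rule is an assumption chosen so that exact ties cannot freeze the system). The function name `simulate_zero_updating` is hypothetical.

```python
import random


def simulate_zero_updating(n, rng, max_steps=100_000):
    """Return the number of updates until all n agents share one restaurant.

    Returns None if absolute convergence is not reached within max_steps.
    """
    # Initial allocation: every agent picks a restaurant uniformly at random.
    position = [rng.randrange(n) for _ in range(n)]
    for step in range(max_steps + 1):
        occupancy = [0] * n
        for j in position:
            occupancy[j] += 1
        if max(occupancy) == n:
            return step  # absolute convergence
        new_position = []
        for j in position:
            k = rng.randrange(n)  # part 1: uniformly random candidate
            # Part 2: compare attractiveness (occupancy) of candidate and
            # current restaurant. Moving on ties is an assumption.
            if k != j and occupancy[k] >= occupancy[j]:
                new_position.append(k)
            else:
                new_position.append(j)
        position = new_position
    return None


rng = random.Random(1)
steps = simulate_zero_updating(30, rng)
print("updates until absolute convergence:", steps)
```

Repeating the run for several system sizes and seeds would give a rough picture of how the convergence time scales, although the constants need not match Fig. 2 because the tie rule and update order are assumed here.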
These results show that symmetry-breaking occurs due to stochastic choices. All restaurants start off by being equally popular. But at the end, only one of them emerges as the most popular choice and all other restaurants have no agents.
3.2 Learning Strategies
In this section, we introduce updating rules based on success and failures of the past choices.
3.3 Ex-ante knowledge
This is a direct extension of the previous strategy. At each time slice, the th agent () makes a choice of restaurants using a probability vector . Then she compares the attractivenesses of the chosen restaurant and the restaurant she is currently in, and moves to the one with higher attractiveness in the next time slice. Finally, the -th agent updates her probability vector based on the attractiveness. This last step of probability updating differentiates the strategy from the No Learning strategy.
We call this strategy ex-ante as the agents can decide whether or not to move to a chosen restaurant by gathering information about attractiveness of the current restaurant and the newly chosen one. Later in Sec. 3.5 we study a case with ex-post updating that relaxes this assumption.
We extend the strategy under consideration in multiple dimensions. In the first case, agents reward higher attractiveness and punish lower attractiveness. Formally, higher attractiveness implies that the agent would assign a higher weight in the probability vector and would reduce the weight for restaurants with lower attractiveness. We label this strategy symmetric updating.
In the second case, the agents only reward higher attractiveness. We label this strategy as asymmetric updating. Further, we consider the cases where the agents are allowed to choose more than one restaurant to pick the best option. Formally, the information set increases to choices per agent, where etc. Naturally, setting makes the problem trivial. So we concentrate on cases with sufficiently small values of .
Below we describe the strategies in detail.
3.3.1 Symmetric updating
Consider agent where , at any generic time slice . Suppose she is at restaurant and given her probability vector , she probabilistically picks restaurant . If , she stays at restaurant . Else, she moves to restaurant .
Simultaneously, the agent updates probability of restaurants and such that the one with higher attractiveness will gain in probability by a fraction () while the other will decrease by fraction (). Naturally, the resulting sum is normalized to 1. Formally, if ,
and if ,
Finally, probabilities are normalized:
The information set is identical to the No Learning strategy for . For higher values of , we allow the agents to have more information about the occupancy of the restaurants in the previous time slice to make a comparison.
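The symmetric rule can be illustrated for a single comparison as follows. This is a sketch under assumptions: the gain and loss are taken to be multiplicative factors (1 + alpha) and (1 - alpha), the parameter name `alpha` is hypothetical, and the vector is left unchanged when the two attractiveness values are equal. The exact expressions used in the paper may differ.

```python
def symmetric_update(p, j, k, a_j, a_k, alpha):
    """Return a new probability vector after one symmetric update.

    p     : list of probabilities over restaurants (sums to 1)
    j, k  : indices of the current and the probabilistically picked restaurant
    a_j, a_k : their attractiveness values
    alpha : reinforcement fraction in (0, 1) (hypothetical name)
    """
    q = list(p)
    if j == k or a_j == a_k:
        return q  # assumption: no update on ties
    winner, loser = (k, j) if a_k > a_j else (j, k)
    q[winner] *= 1.0 + alpha  # reward the more attractive restaurant
    q[loser] *= 1.0 - alpha   # punish the less attractive one
    total = sum(q)
    return [x / total for x in q]  # normalize so probabilities sum to 1


p = [0.25, 0.25, 0.25, 0.25]
q = symmetric_update(p, j=0, k=2, a_j=3, a_k=7, alpha=0.1)
print(q)
```

In this example the picked restaurant (index 2) is more attractive, so its weight rises, the weight of the current restaurant (index 0) falls, and the remaining entries change only through normalization.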
Fig. 4 shows the simulation results for this strategy with . On the -axis, we plot the restaurants and on the -axis, we plot the number of agents that go to the restaurants for all restaurants, i.e., for all . We show two snapshots, one at time slice 5000 and the other at 10000. The three rows show the distribution of agents under three different information sets, .
The first thing to notice is that the dynamics becomes considerably slow. Even after 10000 time slices (for ), attaining coordination is very difficult, as the panels on the right in Fig. 4 show very clearly. However, we note that convergence is guaranteed. As an explanation, consider a case with distribution of agents across all restaurants as (501, 499, 0, …, 0). Given the current strategy, only the first restaurant can attract agents and the second one can only lose agents, however slow the process might be.
The next important feature is that increasing the information set even by a limited amount (going from to 2 and 3) drastically improves the degree of coordination, although the dynamics becomes slow after a certain point. For example, in the bottom row, we see that the distribution changes very slowly going from to .
Therefore, we see that for a long time there are clusters of agents in different restaurants before all collapse into one giant cluster i.e. absolute convergence takes place. Such clustering behavior is transitory.
3.3.2 Asymmetric updating
Consider agent at time in restaurant , probabilistically picking another restaurant . If , she stays at restaurant . Else, she moves to restaurant . The asymmetric updating scheme differs from the symmetric scheme in the way she updates the probability vector .
If there is a difference between the attractiveness of the current restaurant and the probabilistically picked one, the agent assigns a higher weight to the more attractive option and reduces the weight for every other restaurant. Formally, if
Finally, probabilities are normalized:
This strategy requires exactly the same set of information as the symmetric updating strategy.
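For comparison with the symmetric case, a single asymmetric update can be sketched as below. The assumptions are the same kind as before: a multiplicative reward factor (1 + alpha) with a hypothetical parameter name `alpha`, no update on ties, and the reduction of all other weights arising purely from normalization.

```python
def asymmetric_update(p, j, k, a_j, a_k, alpha):
    """Return a new probability vector after one asymmetric update.

    Only the more attractive of the two compared restaurants is rewarded;
    every other entry shrinks through normalization alone.
    """
    q = list(p)
    if j == k or a_j == a_k:
        return q  # assumption: no update on ties
    winner = k if a_k > a_j else j
    q[winner] *= 1.0 + alpha  # reward only; no explicit punishment
    total = sum(q)
    return [x / total for x in q]


p = [0.25, 0.25, 0.25, 0.25]
q = asymmetric_update(p, j=0, k=2, a_j=3, a_k=7, alpha=0.1)
print(q)
```

In this sketch a very large `alpha` pushes almost all weight onto the rewarded restaurant after a single comparison, which is one way to picture strong reinforcement.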
Fig. 5 presents the simulation results with the asymmetric updating strategy for . The results are comparable to the symmetric updating scheme. We see that the dynamics becomes slow. As we expand the information set from to 2 and 3, convergence takes place much faster in the initial phase. But after a while, it becomes slow for all information sets. But again with sufficient number of iterations, absolute convergence takes place.
Therefore, we see that for a long time there are clusters. But as with the symmetric updating, this behavior is transitory. By varying the parameter we studied the dynamics before convergence. Fig. 6 presents simulation results for two different values of with multiple information sets (). In order to quantify the degree of stability before convergence, we compute the average of the maximum probabilities that the agents assign to any restaurant. With smaller values of ( = 0.1), the average probability goes up very fast compared to larger values (). Also, with bigger information sets, the average of the maximum probabilities rises more slowly than with smaller information sets. This is consistent with the finding that coordination occurs much faster with bigger information sets, as that requires multiple switching to ensure convergence. Naturally, switching at a higher frequency leads to less reinforcement of specific restaurants.
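The stability measure used here, the average over agents of the largest entry in each agent's probability vector, is simple to compute. A minimal sketch follows; the function name is hypothetical and the input is assumed to be a list of per-agent probability vectors.

```python
def average_max_probability(prob_vectors):
    """Mean, over agents, of the largest probability each agent assigns."""
    return sum(max(p) for p in prob_vectors) / len(prob_vectors)


# Three agents with uniform vectors over four restaurants: the measure is 1/4.
uniform_agents = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
print(average_max_probability(uniform_agents))

# One fully committed agent and one uniform agent.
mixed_agents = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
print(average_max_probability(mixed_agents))
```

The measure equals the reciprocal of the number of restaurants when every agent is uniform and approaches 1 as agents commit to single restaurants.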
3.4 Reinforcement learning through Polya’s urn model
We introduce a new strategy using Polya’s urn model () that effectively captures reinforcement learning. Let us define
where is a tunable parameter taking discrete values within 0 and . We denote the number of times the -th agent has visited restaurant before time slice , by . Then the probability of choosing restaurant is given by
Intuitively, this is an extension of the basic No Learning strategy (which would require ) by embedding reinforcement learning through Polya’s urn model.
The required information set for the -th agent is derived only from the full sequence of success of the agent at different restaurants. It is reasonable to assume that the agents keep track of their own visits. Also note that at any time slice, the agent does not require any information from a restaurant that she is not visiting as was required with the earlier strategies. This is possible because there is no comparison involved. The probabilistic strategies are devised based on historical success.
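To make the idea of visit-count reinforcement concrete, the sketch below uses one possible functional form. It is an assumption for illustration only and should not be read as the expression defined above: the choice weights are taken to be (1 + visits)^gamma with gamma = z / (N - z), so that z = 0 gives uniform choice and z close to N gives very strong reinforcement. The names `polya_probabilities`, `choose_restaurant`, `z`, and `gamma` are hypothetical.

```python
import math
import random


def polya_probabilities(visits, z, n):
    """Choice probabilities from an agent's own visit counts.

    Assumed form (not from the paper): weight_k = (1 + visits[k]) ** gamma,
    with gamma = z / (n - z) and 0 <= z < n.
    """
    if not 0 <= z < n:
        raise ValueError("z must satisfy 0 <= z < n")
    gamma = z / (n - z)
    # Work in log space so that large exponents do not overflow.
    logs = [gamma * math.log(1 + v) for v in visits]
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]


def choose_restaurant(visits, z, n, rng):
    """Draw one restaurant index according to the reinforced probabilities."""
    probs = polya_probabilities(visits, z, n)
    return rng.choices(range(len(visits)), weights=probs, k=1)[0]


visits = [3, 0, 1, 0]  # illustrative visit history of one agent
no_learning = polya_probabilities(visits, z=0, n=4)
strong = polya_probabilities(visits, z=3, n=4)
print(no_learning)
print(strong)
print(choose_restaurant(visits, z=3, n=4, rng=random.Random(0)))
```

With z = 0 the history is ignored and the vector is uniform; with larger z the most visited restaurant dominates. Note that this rule uses only the agent's own visit counts, in line with the information set described above.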
Fig. 7 presents numerical results for different values of (see Eqn. 3) with = 500 and = 5000. In the left panel, we show the number of restaurants occupied () with at least one agent for different values of the factor and at different time slices . Clearly when , Polya’s scheme would converge to the No Learning case and absolute convergence occurs. This implies only one restaurant would be occupied. This can be seen from the figure by looking at the bars for different time slices by fixing = 0. In the other extreme with (for simulations, we cannot set ), we see that around 318 restaurants out of 500 have been occupied. This is consistent with the notion that setting the factor very close to leads to infinite reinforcement, implying that if an agent goes to one restaurant, she would stick there for the rest of the time slices. So effectively, the choices in the first time slice itself determine the distribution of agents across restaurants, as that distribution will never change because of infinite reinforcement. It is easy to show that as the agents are starting with uniformly distributed probabilities (), in the first time slice 63.2% of the restaurants would be occupied. We are skipping the derivation of this fraction. Interested readers can refer to [KPR]. One can easily verify that 318/500 is close to 63.2% and hence this validates our results. The right panel in the same figure shows the fraction of restaurants occupied, i.e., . The results are perfectly consistent with the left panel.
We also note that having in Polya’s scheme (i.e. infinite reinforcement) is identical to assuming in the asymmetric updating strategy. Thus in the limit, these two strategies are exactly identical. This strategy allows us to interpolate between a wide spectrum of reinforcement by changing the factor . In particular, it allows us to cover the same range that is covered separately by the symmetric and asymmetric updating strategies.
3.5 Ex-post knowledge
In the case of ex-ante knowledge in Sec. 3.3, we studied strategies where the agents can obtain information about the newly chosen restaurant’s attractiveness and make a comparison between the chosen restaurant’s and the current restaurant’s attractiveness. However, it might be costly to learn the attractiveness of another restaurant before actually visiting it. In the present section, we study the same set of strategies where an agent can obtain information about attractiveness only after she moves to the chosen restaurant. An important distinction from the earlier cases is that the present strategy allows for regret. The agent learns the attractiveness of a new restaurant only after moving to it and hence cannot make a comparison prior to switching. Updating the probability vector happens the same way, depending on relative attractiveness, as was done in Sec. 3.3.
3.5.1 Symmetric updating
Consider agent where , at any generic time slice . Suppose she is at restaurant and given her probability vector , she probabilistically picks restaurant . After knowing both and , probability vector is updated exactly the same way as in Sec. 3.3.1. To avoid repetition, we are skipping the probability updating schemes.
The required information comes from the restaurants that the agent has visited. Hence, there is no external information acquired.
Fig. 8 shows the simulation results in the top panels for and . We show results for two time slices, at and . As in the earlier case, this strategy is also quite slow but eventually converges to a single restaurant in the limit. Naturally, this is slower than the ex-ante knowledge case.
3.5.2 Asymmetric updating
Similar to above, consider agent , at any generic time slice . Suppose she is at restaurant and given her probability vector , she probabilistically picks restaurant . After knowing both and , probability vector is updated exactly the same way as in Sec. 3.3.2.
The required information comes solely from the restaurants she visited and hence no external information is acquired.
Simulation results are reported in the lower panels of Fig. 8 and in Fig. 9. We see that the results for the distribution of agents across the restaurants are qualitatively similar to those in the case of symmetric updating, except that coordination is poorer as there are many restaurants with small numbers of agents.
We study the degree of stability of the transient clusters thus formed in Fig. 9. For higher values of , the average over maximum probabilities rises quite fast compared to lower values of though eventually their behavior is similar.
4 Self-organization and coordination
In the present section, we discuss the extent to which self-organization occurs in the multi-agent system that solves the coordination problem.
4.1 Emergence of coordination
We have seen that some of the strategies, especially those which require ex-ante information or knowledge, can in principle be thought of as requiring some costs to be paid in order to acquire the information. Also, realistically, the agents might face a trade-off between how quickly they can converge to a solution and the efficiency of the solution. That is, they may find it useful to be in a majority, not necessarily the absolute majority, if the solution is reached sooner.
A parallel theme is that initially all restaurants are identical. But with absolute convergence, only one of them emerges as the winner. This can be interpreted as how a specific social norm may emerge from multiple possibilities that are a priori equally likely.
Thus emergence of absolute coordination has two contributing factors that can be potentially costly. The first one is obviously the cost of lack of coordination. The second one is the cost of waiting to reach coordination. This can be most clearly seen in the clustering behavior where multiple choices survive as the agents achieve partial coordination reasonably fast.
4.2 Cluster formation
We have already seen that clustering behavior can be transient, but in almost all cases the clusters evolve very slowly. This implies that we observe clusters of agents in different restaurants for a very long time.
Fig. 11 shows four instances of the probability density function of clusters. We tracked choices of agents over time slices. We considered all four cases (ex-ante, ex-post and symmetric-asymmetric) with the previously mentioned parameter values. The resulting probability density function has been averaged over number of simulations. Both ex-ante and ex-post with symmetric updating rules (panels (a) and (c)) show strong clustering behavior whereas the other two cases show very moderately distributed clusters (panel (b): fit with exponential distribution with parameter value 4.8564; panel (d): fit with gamma distribution with parameter values 2.7103, 1.3834).
4.3 Efficiency and cost of waiting
As discussed before, the agents may incur a cost to execute the strategies, and hence if there are strategies that take a very long time to reach a state of absolute convergence, the agents may prefer a less efficient solution, i.e., smaller clusters, if that is achievable soon enough.
We study this trade-off in Fig. 12 which plots the number of time slices required by the average over maximum probabilities to reach at least 0.8 versus the percentage of restaurants occupied. The variable in the -axis represents the cost in terms of waiting time. The variable in the -axis represents the cost in terms of inefficiency of the solution (smaller percentage occupancy would be more efficient). We plot the trade-off by simulating a system of = 500 agents with the Polya updating scheme ( = 50, 75, 100, , 475, 495). The values on the -axis show the occupancy at the time slice when reaches 0.8.
The trade-off is clearly seen in terms of cost minimization. A lower waiting cost leads to higher occupancy and hence to inefficiency and vice versa. This is a very useful feature of the model to understand the trade-off between the waiting cost to arrive at an allocation and the accuracy of the allocation.
In this paper, we study a model of distributed coordination in the context of a multi-agent, multi-choice system. We consider a game with multiple Nash equilibria, all of which are equally likely. The basic problem is to find which equilibrium will materialize if the agents engage in repeated interaction and how quickly they can converge to the equilibrium. Essentially, we solve the problem of equilibrium selection through distributed coordination algorithms.
We propose a number of strategies based on different types of naïve learning. In particular, reinforcement learning via Polya’s urn model provides a very useful benchmark. We show that the system self-organizes with very slow dynamics and transient clusters. Finally, we characterize a trade-off between waiting cost to attain an allocation and the accuracy of the allocation. With lower waiting costs (stability is attained sooner), efficiency of the solution is low and the opposite is also true.
This problem sheds light on the complexity of equilibrium selection and may provide a useful model for multi-agent coordination and collective dynamics in general.
-  A. Namatame, S-H. Chen, Agent-Based Modeling and Network Dynamics, Oxford University Press, 2016.
-  P. Sen, B. K. Chakrabarti, Sociophysics: An Introduction, Oxford University Press, 2014.
-  M. J. Salganik, P. S. Dodds, D. J. Watts, Experimental study of inequality and unpredictability in an artificial cultural market, Science, 311 5762, 854-856 (2006).
-  D. J. Watts, P. S. Dodds, Influentials, networks, and public opinion formation, J. Cons. Res., 34 4, 441-458 (2007).
-  D. Stauffer, Sociophysics Simulations II: Opinion Dynamics, arxiv link: http://arxiv.org/abs/physics/0503115
-  E. Pugliese, C. Castellano, M. Marsili and L. Pietronero, Collaborate, compete and share, Eur. Phys. J. B, 67 3, 319-327 (2009).
-  D. Challet and Y.-C. Zhang, Emergence of cooperation and organization in an evolutionary game, Physica A, 246 407 (1997).
-  D. Challet, Inter-pattern speculation: Beyond minority, majority and $-games, J. Econ. Dyn. Control, 32 1, 85-100 (2008).
-  D. Sornette, Why Stock Markets Crash: Critical Events in Complex Financial Systems. Princeton Univ. Press, 2004.
-  D. Challet, M. Marsili, and Y.-C. Zhang, Minority games: interacting agents in financial markets. Oxford University Press, 2004.
-  D. B. Fogel, K. Chellapilla, and P. J. Angeline, Inductive reasoning and bounded rationality reconsidered, IEEE Trans. Evol. Comp., 3 2, 142 (1999).
-  J.-Q. Zhang, Z.-X. Huang, Z. Wu, R. Su and Y.-C. Lai, Controlling herding in minority game systems, Sci. Rep. (2016).
-  A. S. Chakrabarti, B. K. Chakrabarti, A. Chatterjee, and M. Mitra, The Kolkata Paise Restaurant problem and resource utilization, Physica A, 388, 2420–2426 (2009).
-  A. Ghosh, A. Chatterjee, M. Mitra, and B. K. Chakrabarti, Statistics of the Kolkata Paise Restaurant problem, New J. Phys., 12 7, 075033 (2010).
-  W. B. Arthur, Inductive reasoning and bounded rationality: the El Farol problem, Amer. Econ. Rev., 84, 406–411 (1994).
-  A. Chakraborti, D. Challet, A. Chatterjee, M. Marsili, Y.-C. Zhang, and B. K. Chakrabarti, Statistical mechanics of competitive resource allocation using agent-based models, Phys. Rep., 552, 1–25 (2015).
-  D. Challet, Competition between adaptive agents: learning and collective efficiency. In Collectives and the Design of Complex Systems, Eds.: K. Tumer and D. Wolpert, Springer, 2004.
-  S. Hod, E. Nakar, Self-segregation versus clustering in the evolutionary minority game, Phys. Rev. Lett. 2002 Jun 10;88(23):238702.
-  M. J. Osborne, An Introduction to Game Theory, Oxford University Press (2004).