Decision making is a fundamental capability of living organisms and has recently been gaining importance in many engineering applications. Here, we consider a simple decision-making principle to identify an optimal choice in multi-armed bandit (MAB) problems, which are fundamental in the context of reinforcement learning. We demonstrate that the identification mechanism of the method is well described by a competitive ecosystem model, i.e., the competitive Lotka-Volterra (LV) model. Based on the “winner-take-all” mechanism in the competitive LV model, we demonstrate that non-best choices are eliminated and only the best choice survives; the probability of failing to select the best choice decreases exponentially as the choice trials are repeated. Furthermore, we apply a mean-field approximation to the proposed decision-making method and show that the method has excellent scalability, with the required number of trials growing only logarithmically with the number of choices. These results allow for a new perspective on optimal search capabilities in competitive systems.
Lotka-Volterra competition mechanism embedded
in a decision-making method
Tomoaki Niiyama, Genki Furuhata, Atsushi Uchida, Makoto Naruse, Satoshi Sunada
Faculty of Mechanical Engineering, Institute of Science and Engineering, Kanazawa University,
Kakuma-machi Kanazawa, Ishikawa 920-1192, Japan
Graduate School of Natural Science and Technology, Kanazawa University,
Kakuma-machi, Kanazawa, Ishikawa 920-1192, Japan
Department of Information and Computer Sciences, Saitama University,
255 Shimo-Okubo, Sakura-ku, Saitama City, Saitama, 338-8570, Japan
Department of Information Physics and Computing, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
Recent research suggests that nature is a great source of inspiration in providing solutions for complicated problems and developing intelligent information processing. Inspired by biological functions, physical structures, and organizational principles found in nature, numerous mathematical and metaheuristic models have been developed. These include genetic algorithms, ant colony optimization, bee algorithms, and simulated annealing, which have been used for addressing various optimization problems [1, 2, 3, 4, 5, 6, 7].
Nature-inspired algorithms have also been applied to a decision-making or reinforcement learning problem, namely, the multi-armed bandit (MAB) problem. A key point of the MAB problem is to resolve the exploration-exploitation dilemma inherent in decision making under uncertainty: sufficient exploratory actions may allow us to determine the best choice, but they may be accompanied by a significant amount of loss, whereas insufficient exploration may result in missing the best choice. While there are numerous statistics-based methods for resolving this dilemma [8, 9], the method presented in Refs. [10, 11] is inspired by the spatiotemporal dynamics of micro-organisms such as amoebas. The dynamic stretching and contracting of an amoeba seeking food while keeping its volume constant generates a frustrating non-local correlation over the whole body, leading to an efficient and adaptive ability to identify an optimal solution (the best arm) in MAB problems. At present, the amoeba-inspired decision-making method has been implemented in various physical systems [10, 11, 12, 13, 14, 15]; however, a theoretical guarantee of the best choice identification has not yet been provided.
Meanwhile, a frustration similar to that in the amoeba dynamics, i.e., fluctuating dynamics under a conservative constraint, can generally be seen in a variety of competitive dynamical systems, in which each component (or state) competes for common finite resources. For instance, each species in an ecosystem attempts to grow its population while competing for limited resources. Such interspecific competition has been modeled by simple ordinary differential equations, known as the competitive Lotka-Volterra (LV) equations [16, 17]. The LV equations describe the dynamics of competitive systems, such as multi-mode lasers [18, 19], as well as ecological communities. Moreover, the LV model is closely related to the Moran process in the context of population genetics and evolutionary game theory, suggesting the applicability of the competitive mechanism to the exploration of an optimal solution adapted to a given environment.
In this study, we propose a decision-making principle in which the LV competitive mechanism is embedded. The model is a natural and simple extension of the amoeba-inspired decision-making method [10, 11] and enables the identification of the best choice in the MAB problems, based on the competitive growth under a conservation law. We theoretically ensure the validity of the best choice identification, based on the LV competition model.
First, let us consider an MAB with arms providing unknown stochastic rewards , which are assumed to be an independent realization of a random variable with mean . The mean of the reward from the -th arm is expressed as , where is the probability distribution of . We assume that is a positive value in this study, without the loss of generality. A goal of the MAB problem is to identify the best arm with the largest mean reward, , through multiple plays, in which we regard arm as the best arm in this study. Hence, the MAB problem can be considered, for example, as a decision-making problem for a gambler who plays slot machines (or a slot machine with multiple arms).
Our decision-making method for identifying the best arm is based on the dynamic behavior of an object with a total length of , which consists of segments, as schematically shown in Fig. 1(a). Let be the length of the -th segment at the -th play. At , the length of each segment is assumed to be identical, e.g., . To identify the best arm, we repeat the following three processes:
(i) Selecting an arm to play: Select an arm with the following probability defined by the ratio of each segment length and total length ,
We refer to as the selection probability of arm at -th play.
(ii) Playing the chosen arm: By playing the arm chosen in step (i), reward is received based on the reward probability distribution .
(iii) Learning and updating: The length of the -th segment is altered based on the total length of the object, , and :
where is a function of the reward and a small incremental parameter , . Although one can choose an arbitrary form of the function , we use the following function in this study;
As the aforementioned processes (i)–(iii) are repeated, the length of each segment, , is expected to increase in accordance with the reward expectation of the corresponding arm [Fig. 1(a)]. As a result, the selection probability of the best arm, , increases compared to that of the others [Fig. 1(b)].
where Eq. (3) was used to derive the right-hand sides of .
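Processes (i)–(iii) can be sketched in a few lines of Python. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: the chosen segment is assumed to grow by the total length multiplied by an increment equal to the small parameter times the reward (so the total length grows by the same amount), rewards are Bernoulli with success probability equal to the mean reward (chosen here only to keep rewards non-negative), and the function name `play_bandit`, the mean rewards, and the parameter values are hypothetical.

```python
import random


def play_bandit(means, eps=0.01, n_plays=5000, seed=0):
    """Run processes (i)-(iii) once; return the final selection probabilities."""
    rng = random.Random(seed)
    n_arms = len(means)
    total = 1.0                        # total length of the object (initial value assumed)
    x = [total / n_arms] * n_arms      # identical initial segment lengths
    for _ in range(n_plays):
        # (i) select an arm with probability x_i / L
        #     (random.choices normalizes the weights internally)
        arm = rng.choices(range(n_arms), weights=x)[0]
        # (ii) play it; Bernoulli reward with mean means[arm] (assumed reward model)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # (iii) grow the chosen segment by eps * reward * L (assumed update form);
        #       the total length grows by the same amount
        growth = eps * reward * total
        x[arm] += growth
        total += growth
    return [xi / total for xi in x]


probs = play_bandit([0.9, 0.5, 0.4])   # hypothetical mean rewards, arm 0 is best
print([round(p, 3) for p in probs])
```

Under this assumed update, playing an arm with reward r changes its selection probability from p to (p + eps r)/(1 + eps r) and scales every other probability by 1/(1 + eps r), so the total probability stays equal to one.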
The previously described decision-making method can be analogous to ideal gases bounded by movable partitions in a vessel. That is, corresponds to the volume of the -th gas bounded by the -th and -th partitions, and the change in the volume of -th gas results from the increase or decrease in the number of moles of the -th gas corresponding to [Fig. 1(b)].
The aforementioned method can also be regarded as a modified version of the tug-of-war model [10, 11] in the sense that the volume of each segment (i.e., the selection probability of each arm) grows or is suppressed while the total volume of the body is conserved. Notably, in the case of a binary bandit problem, our decision-making method is similar to the linear reward schemes known as classical schemes in learning automata [22, 23], as well as to replicator equations in the field of evolutionary game theory.
As demonstrated in the following, the LV competition principle is embedded in the proposed decision-making method; therefore, an exponential decrease in the error probability (the probability of choosing a non-best arm) is ensured.
3 Results and discussions
Before revealing the connection between the proposed decision-making method and the LV competitive dynamics, we show the results of numerical simulations for three typical examples to demonstrate that our method typically has the ability to identify the optimal arm of MAB problems. For the numerical demonstration, we set the reward of -th arm, , to follow a normal distribution with the mean and variance , in which the number of arms, , is 10 () and the best arm with the largest mean reward is set to be the arm . To quantify the capability of the best arm identification, consecutive arm playing was conducted until the cycle , and the process was repeated times. The mean of the selection probability of the best arm, , was evaluated.
As shown in Fig. 2(a), the mean selection probability of the best arm in the three typical MAB problems, depicted by the red, green, and blue lines, converges to unity. Figure 2(b) shows the error probability, i.e., the probability of taking a non-best arm, which decreases exponentially. Hence, the numerical results indicate that the proposed decision-making method works well in finding the best arm in MAB problems (a more rigorous proof of the best arm identification ability is provided in subsection 3.2).
The selection probability of the best arm is expected to depend on the number of arms as well as on the difference between the reward expectations. The average number of plays required for the decision error probability to become sufficiently small is shown in Fig. 3 as a function of the number of arms. The figure clearly shows that the average number of plays increases only logarithmically with the number of arms. This logarithmic dependence is important for the scalable solution of MAB problems with a large number of arms, and it is better than the scaling of Upper Confidence Bound Exploration (UCB-E) and of an extended tug-of-war model.
3.1 Hidden Lotka-Volterra competition dynamics
In this subsection, we show that the Lotka-Volterra type of interspecific competition dynamics lurks behind our model and yields the performance of the decision-making method illustrated in the previous subsection.
The starting point of our analysis is Eq. (4) of the selection probability . We here focus on the average dynamics of a stochastic variable . Let us consider an ensemble composed of “trajectories” of plays and the ensemble average of at the -th play; , where is the selection probability at the -th play on the -th trajectory of the ensemble. The update of is described as follows:
where is the expectation of . One can derive the expectation as the following equation (the detailed derivation is described in Appendix A):
Next, we reconfigure “time” as . When is sufficiently small, ; thus, the average dynamics of the selection probabilities are described as follows:
where and represent the population of the species and the growth rate of the species , respectively, and represents intraspecific () and interspecific interactions. By comparing Eq.(8) to Eq. (9), searching the optimal arm in our method can be interpreted as follows: attempts to “grow” according to (the reward expectation of the arm ); however, the selection probabilities of the other arms also attempt to grow, thus they compete for survival. The arm (species) that survives this competition will be considered the best arm. The origin of this competition mechanism lies in frustration, such as an ecosystem competing for limited resources, in which (or ) increases under the conservation conditions of total probability; .
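Written out in one consistent notation, the correspondence reads as follows. This is a sketch under assumptions: $p_i$ denotes the ensemble-averaged selection probability of arm $i$, $\mu_i$ its reward expectation, $\tau$ the rescaled time, and $n_i$, $g_i$, and $a_{ij}$ the population, growth rate, and interaction coefficients of the LV model; the averaged dynamics are taken in the replicator form, which conserves the total probability.

```latex
\begin{align}
\frac{dp_i}{d\tau} &= p_i\Bigl(\mu_i - \sum_{j} \mu_j p_j\Bigr), \\
\frac{dn_i}{dt}    &= n_i\Bigl(g_i - \sum_{j} a_{ij} n_j\Bigr).
\end{align}
```

With these assumptions, the two equations coincide for $n_i = p_i$, $g_i = \mu_i$, and $a_{ij} = \mu_j$. Summing the first equation over $i$ gives $\sum_i dp_i/d\tau = \bigl(\sum_j \mu_j p_j\bigr)\bigl(1 - \sum_i p_i\bigr)$, which vanishes whenever $\sum_i p_i = 1$; this is the conservation condition that produces the competition.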
As an example of such competition among arms, the selection probability of each arm obtained by the simulation solving 4-armed bandit problems using our method are shown in Fig. 4, where , , , and are given by , , , and , respectively. As a result of the competition among the arms, the selection probability (population) of the arm overwhelms that of the others. The numerical result well agrees with the black solid curve shown in Fig. 4, which was obtained by numerical integration of the competitive LV equations using the Euler method with time step .
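The Euler integration mentioned above can be sketched as follows. This is an illustrative sketch rather than the code used for Fig. 4: the averaged dynamics are assumed to take the replicator form dp_i/dτ = p_i(μ_i − Σ_j μ_j p_j), and the function name `integrate_lv`, the four mean rewards, and the step size are hypothetical choices.

```python
def integrate_lv(means, d_tau=0.01, n_steps=5000):
    """Euler integration of dp_i/dtau = p_i * (mu_i - sum_j mu_j p_j)."""
    n = len(means)
    p = [1.0 / n] * n                  # identical initial selection probabilities
    history = [p[:]]
    for _ in range(n_steps):
        # current mean reward, sum_j mu_j p_j
        mean_reward = sum(m * q for m, q in zip(means, p))
        # each arm grows at rate mu_i and is suppressed by the shared mean reward
        p = [q + d_tau * q * (m - mean_reward) for m, q in zip(means, p)]
        history.append(p[:])
    return history


trajectory = integrate_lv([0.9, 0.7, 0.5, 0.3])   # hypothetical 4-armed problem
print([round(q, 4) for q in trajectory[-1]])
```

Because the increments of all arms sum to zero whenever the probabilities sum to one, the Euler step preserves the total probability up to rounding error, so no renormalization is needed.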
3.2 Feasibility of best arm identification
The competitive LV equations have been investigated in the context of physics and mathematics for many decades. Hence, the accumulated knowledge and theorems regarding them can be applied to the decision-making problems examined in this study. Among the most significant known theorems is the condition for only a single species to survive, i.e., the global stability of a single solution of the competitive LV equations [16, 17]. This condition ensures that the present method, supported by the competitive LV dynamics, can eventually identify the best choice.
However, the theorem providing the global stability applies only to the case in which all mean rewards differ from each other. Thus, in this subsection, we briefly demonstrate that this global stability holds even when some mean rewards are identical, by analyzing the global stability of the fixed point of Eq. (8) that corresponds to the identification of the best arm with probability 1. This type of stability analysis allows us to gain insight from physics and information theory into competitive LV dynamics.
Here, we consider the stability by introducing a Lyapunov function for , satisfying and for , in a space . If , the fixed point is globally stable in .
As the Lyapunov function, we chose Kullback-Leibler (KL) divergence between the probability distribution and ,
represents the information gained from a prior distribution to a posterior distribution , and satisfies for and for . Introducing the mean reward at as and regarding only for , we obtain
where represents the maximum reward expectation and Eq. (8) was used. Obviously, in because of . Accordingly, we conclude that the decision-making method described by Eq. (8) monotonically obtains the information of the best arm and the probability to select the best arm always converges to 1. The aforementioned consideration is always valid when .
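The monotonic decrease of the Lyapunov function can be verified in a few lines. The derivation below is a sketch in assumed notation: $p_i$ are the selection probabilities, $\mu_i$ the reward expectations, arm 1 is taken to be the best arm, $p^{*} = (1, 0, \dots, 0)$ is the target distribution, and the averaged dynamics are taken in the replicator form $dp_i/d\tau = p_i(\mu_i - \langle\mu\rangle)$.

```latex
\begin{align}
D(p^{*}\,\|\,p) &= \sum_i p_i^{*} \ln\frac{p_i^{*}}{p_i} = -\ln p_1
  \qquad (\text{with } 0 \ln 0 = 0), \\
\frac{dD}{d\tau} &= -\frac{1}{p_1}\frac{dp_1}{d\tau}
  = -\bigl(\mu_1 - \langle\mu\rangle\bigr), \qquad
  \langle\mu\rangle = \sum_j \mu_j p_j , \\
\mu_1 - \langle\mu\rangle &= \sum_j (\mu_1 - \mu_j)\, p_j \;\ge\; 0 .
\end{align}
```

The last line uses $\sum_j p_j = 1$. Equality holds only when all probability weight lies on arms whose mean reward equals $\mu_1$; if the best arm is unique, this is the point $p_1 = 1$, so $D$ decreases strictly everywhere else, regardless of whether the non-best arms have identical mean rewards.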
3.3 Efficiency of best arm identification
This subsection provides further insight into the global behavior of by applying mean-field approximation to Eq. (8). Let us replace the selection probabilities, , and the reward expectations, , except for the best arm , with the mean values and , respectively: and , where is the number of arms. Applying this approximation to the second term in Eq. (8), we obtain
Equation (13) is easily solved, and and error probability, , are given as follows:
where and was used as the initial selection probability. Considering that , it is interesting that the convergence rate does not depend on the number of machines, but only on and .
Although the mean-field approximation is a bold one, it provides reasonably good predictions. Indeed, as shown by the dashed lines in Fig. 2, the time evolution given by Eqs. (14) and (15) agrees well with the simulation results.
Regarding , , and , one can evaluate the average number of plays, , required for the error probability to satisfy as follows:
Thus, the required number of plays increases at most logarithmically with the number of arms. This logarithmic scalability explains the numerical results shown in Fig. 3 well.
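The mean-field prediction can be evaluated directly. The sketch below assumes that the mean-field equation reduces to the logistic form dp/dτ = Δμ p(1 − p), where Δμ is the gap between the best mean reward and the average of the others, with initial probability 1/N and rescaled time τ = εt; the function names and the numerical values are hypothetical.

```python
import math


def best_arm_probability(tau, n_arms, delta_mu):
    """Logistic solution of dp/dtau = delta_mu * p * (1 - p) with p(0) = 1/n_arms."""
    return 1.0 / (1.0 + (n_arms - 1) * math.exp(-delta_mu * tau))


def plays_to_reach(error, n_arms, delta_mu, eps):
    """Number of plays t = tau / eps at which the error probability 1 - p equals `error`."""
    return math.log((n_arms - 1) * (1.0 - error) / error) / (eps * delta_mu)


# Hypothetical values: error threshold 0.01, reward gap 0.2, increment parameter 0.01
for n_arms in (10, 100, 1000, 10000):
    print(n_arms, round(plays_to_reach(0.01, n_arms, 0.2, 0.01)))
```

In this form, the number of arms enters the required number of plays only through the logarithm, while the exponential decay rate of the error probability is set by the product of the increment parameter and the reward gap.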
3.4 Adaptability to environmental change
One of the most important abilities of reinforcement learning is to rapidly learn and robustly adapt to a non-stationary environment that can be interpreted as a setting of MAB problems in which rewards change with time. The correspondence between our decision-making method and the competitive LV equation suggests an insight into how quickly the system adapts to environmental change as obtained from the viewpoint of an ecosystem. In Appendix B, the adaptability of our decision-making method to environmental change is discussed in terms of natural biodiversity.
In this study, we developed a decision-making principle for solving MAB problems, in which the identification of the optimal choice (arm) is theoretically guaranteed by the LV competitive mechanism. Furthermore, by applying a mean-field approximation to the competitive LV equations, we showed that the error probability decreases exponentially and that the time required for the best arm identification depends only logarithmically on the number of arms, which is an important attribute in realizing scalable decision making.
The present study of our decision-making method demonstrates the possibility of competitive system utilization for reinforcement learning. Methods harnessing nature may combine superior performances in nature, such as adaptability and robustness, and provide a means to map our knowledge of nature into reinforcement learning techniques.
This work was supported in part by the Japan Science and Technology Agency CREST Grant Number JPMJCR17N2, Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant No. JP17H01277, JP19H00868, and Murata Science Foundation.
Appendix A Derivation of
To evaluate the average amount of change in that varies according to the results of a probabilistic trial, we consider trajectories of , as described in the main text. Here, we denote the of the th trajectory at the th play as . The ensemble average of at trial is defined as in the limit of .
Recalling the update rule of our method described in Eq. (5), we can describe the average amount of change in at the -th play as follows:
where is the number of trajectories that chose arm at the th play. Note that is a reward stochastically determined according to the probability distribution of the arm with the reward expectation , and is independent of . Thus, the following relationship holds when is sufficiently large:
Because the ratio means that the arm is selected times out of trials, the ratio can be interpreted as a mean value of the selection probability:
in the limit of .
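The averaging argument can be checked numerically for a single play. The sketch below is an illustration under assumptions, not the derivation itself: the update is assumed to change the chosen arm's probability from p to (p + εr)/(1 + εr) and to scale the others by 1/(1 + εr), rewards are assumed to be Bernoulli so that the expectation can be enumerated exactly, and the function names and numerical values are hypothetical.

```python
def expected_change(p, means, eps):
    """Exact expected one-play change of each selection probability (Bernoulli rewards)."""
    n = len(p)
    change = [0.0] * n
    for k in range(n):                     # arm k is chosen with probability p[k]
        for i in range(n):
            gain = eps if i == k else 0.0
            updated = (p[i] + gain) / (1.0 + eps)   # p_i after a unit reward on arm k
            # a unit reward occurs with probability means[k];
            # a zero reward leaves every probability unchanged
            change[i] += p[k] * means[k] * (updated - p[i])
    return change


def first_order_change(p, means, eps):
    """First-order prediction eps * p_i * (mu_i - sum_j mu_j p_j)."""
    mean_reward = sum(m * q for m, q in zip(means, p))
    return [eps * q * (m - mean_reward) for m, q in zip(means, p)]


p = [0.5, 0.3, 0.2]
means = [0.9, 0.5, 0.4]
print(expected_change(p, means, 0.001))
print(first_order_change(p, means, 0.001))
```

For this reward model the exact expectation equals the first-order expression divided by (1 + ε), so the two agree up to terms of order ε squared, which is the regime in which the continuous-time description applies.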
Appendix B Simulations of adaptability
As a demonstration of the adaptability of the presented decision-making method to non-stationary “environments,” we numerically performed the simulations of the MAB problem (, ) in which the expected rewards are cyclically changed at a constant interval of steps as follows: . To achieve adaptable decision-making in the simulations, we introduced a lower bound of the selection probability of each arm, , such that no matter how much the selection probability decreases as a result of the search. In this study, is used as a control parameter regarding the decision-making adaptability and optimality.
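A simulation of this kind can be sketched as follows. The way the lower bound is enforced here is an assumption made for illustration: after each update, probabilities below the bound are raised to it, and the surplus is removed from the other arms in proportion to their margin above the bound. The Bernoulli reward model, the function names `apply_floor` and `run_switching`, and all numerical values are likewise hypothetical.

```python
import random


def apply_floor(p, p_min):
    """Raise entries below p_min to p_min and rescale the margins so that sum(p) = 1."""
    q = [max(v, p_min) for v in p]
    margin = [v - p_min for v in q]        # amount each arm holds above the floor
    excess = sum(q) - 1.0                  # surplus created by raising small entries
    scale = 1.0 - excess / sum(margin)     # positive as long as len(p) * p_min < 1
    return [p_min + m * scale for m in margin]


def run_switching(mean_sets, interval=500, eps=0.01, p_min=0.01, n_plays=2000, seed=1):
    """Track the selection probability of the current best arm under cyclic rewards."""
    rng = random.Random(seed)
    n = len(mean_sets[0])
    p = [1.0 / n] * n
    best_prob = []
    for t in range(n_plays):
        means = mean_sets[(t // interval) % len(mean_sets)]   # environment in force at play t
        arm = rng.choices(range(n), weights=p)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0    # Bernoulli reward (assumed)
        delta = eps * reward
        p = [(v + (delta if i == arm else 0.0)) / (1.0 + delta) for i, v in enumerate(p)]
        p = apply_floor(p, p_min)
        best = max(range(n), key=lambda i: means[i])
        best_prob.append(p[best])
    return best_prob


trace = run_switching([[0.9, 0.5, 0.4], [0.4, 0.5, 0.9]])
print(round(sum(trace) / len(trace), 3))   # time-averaged probability of the current best arm
```

With a floor on every arm, the probability of the best arm can never exceed one minus the floor times the number of other arms, which is the price paid for keeping every arm available after a change.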
The time evolution of the average selection probability of each arm, , when is shown in Fig. 5(a), where the best arm is switched every steps. Immediately after each environmental change, competition arises, and the system eventually finds the best arm during each term.
Fig. 5(b) shows the time evolution of the selection probability of the best arm, with different values of . The maximum value of can be approximately given as , resulting in low optimality for a too large , whereas a too small makes the adaptive best arm identification difficult. To investigate the balance between the optimality and adaptability, we calculated the mean value of over time from to :
As can be seen from Fig. 5(c), a smaller value of produces a larger , but in the range where , rapidly decreases. Thus, under this setting, has a peak value of approximately .
The simulation results in this appendix can be interpreted as follows: the adaptability is maximized by preventing extinction of all species, even though some of them are not optimal in a certain environment, from the perspective of the “winner-take-all” competition mechanism in ecosystems. This interpretation is reminiscent of the ecosystem stability exerted by natural biodiversity (species richness). Though the actual effect of biodiversity on the stability of ecosystems is more complicated, this type of analogy might provide a new insight into the field of reinforcement learning.
[1] P. Agarwal and S. Mehta, Int. J. Comput. Appl. 100, 14 (2014).
[2] J. H. Holland, R. E. Nisbett, K. J. Holyoak, and P. R. Thagard, Induction: Processes of Inference, Learning, and Discovery (MIT Press, Cambridge, 1989).
[3] M. Dorigo, V. Maniezzo, and A. Colorni, IEEE Trans. Syst. Man Cybern. B 26, 29 (1996).
[4] D. Karaboga and B. Basturk, J. Global Optim. 39, 459 (2007).
[5] D. Karaboga and B. Basturk, Appl. Soft Comput. 8, 687 (2008).
[6] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Science 220, 671 (1983).
[7] V. Černý, J. Optim. Theory Appl. 45, 41 (1985).
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, 2018) 2nd ed., Chap. 2.
[9] J.-Y. Audibert and S. Bubeck, COLT - 23rd Conference on Learning Theory - 2010, Haifa, 2010, p. 13.
[10] S.-J. Kim, M. Aono, and M. Hara, Biosystems 101, 29 (2010).
[11] S.-J. Kim, M. Aono, and E. Nameda, New J. Phys. 17, 083023 (2015).
[12] M. Naruse, M. Berthel, A. Drezet, S. Huant, M. Aono, H. Hori, and S.-J. Kim, Sci. Rep. 5, 13253 (2015).
[13] S.-J. Kim, M. Naruse, and M. Aono, Philosophies 1, 245 (2016).
[14] M. Naruse, Y. Terashima, A. Uchida, and S.-J. Kim, Sci. Rep. 7, 8772 (2017).
[15] R. Homma, S. Kochi, T. Niiyama, T. Mihana, Y. Mitsui, K. Kanno, A. Uchida, M. Naruse, and S. Sunada, Sci. Rep. 9, 9429 (2019).
[16] S. A. Baigent, Lotka-Volterra dynamics, an introduction. preprint, University College London (2010).
[17] M. L. Zeeman, Proc. Am. Math. Soc. 123, 87 (1995).
[18] M. Sargent III, M. O. Scully, and W. E. Lamb, Laser Physics (Addison-Wesley, Reading, 1993), Chap. XI.
[19] M. Sargent, Phys. Rev. A 48, 717 (1993).
[20] A. E. Noble, A. Hastings, and W. F. Fagan, Phys. Rev. Lett. 107, 228101 (2011).
[21] J. Hofbauer and K. Sigmund, Bull. Am. Math. Soc. 40, 479 (2003).
[22] N. Baba, New Topics in Learning Automata Theory and Applications (Springer-Verlag, Heidelberg, 1985).
[23] K. S. Narendra, S. Mukhopadhyay, and Y. Wang, arXiv:1510.05034.
[24] M. Naruse, T. Mihana, H. Hori, H. Saigo, K. Okamura, M. Hasegawa, and A. Uchida, Sci. Rep. 8, 10890 (2018).
[25] F. Pennekamp, M. Pontarp, A. Tabi, F. Altermatt, R. Alther, Y. Choffat, E. A. Fronhofer, P. Ganesanandamoorthy, A. Garnier, J. I. Griffiths, S. Greene, K. Horgan, T. M. Massie, E. Mächler, G. M. Palamara, M. Seymour, and O. L. Petchey, Nature 563, 109 (2018).