# Distributed Cooperative Decision-Making in Multiarmed Bandits: Frequentist and Bayesian Algorithms
This research has been supported by ONR grant N00014-14-1-0635, ARO grant W911NF-14-1-0431, and the DoD through the NDSEG Program.

###### Abstract

We study distributed cooperative decision-making under the explore-exploit tradeoff in the multiarmed bandit (MAB) problem. We extend state-of-the-art frequentist and Bayesian algorithms for single-agent MAB problems to cooperative distributed algorithms for multi-agent MAB problems in which agents communicate according to a fixed network graph. We rely on a running consensus algorithm for each agent’s estimation of mean rewards from its own rewards and the estimated rewards of its neighbors. We prove the performance of these algorithms and show that they asymptotically recover the performance of a centralized agent. Further, we rigorously characterize the influence of the communication graph structure on the decision-making performance of the group.

## I Introduction

Cooperative decision-making under uncertainty is ubiquitous in natural systems as well as in engineering networks. A fundamental feature of decision-making under uncertainty is the explore-exploit tradeoff: the decision-making agent needs to learn the unknown system parameters (exploration), while maximizing its parameter-dependent decision-making objective (exploitation).

Multiarmed bandit (MAB) problems are canonical formulations of the explore-exploit tradeoff. In a stochastic MAB problem a set of options (arms) is given. A stochastic reward with an unknown mean is associated with each option. A player can pick only one option at a time, and the player’s objective is to maximize the cumulative expected reward over a sequence of choices. In an MAB problem, the player needs to balance the tradeoff between learning the mean reward at each arm (exploration), and picking the arm with maximum mean reward (exploitation).

MAB problems are pervasive across a variety of scientific communities and have found application in diverse areas including control and robotics [1, 2], ecology [3, 4], and communications [5]. Despite the prevalence of the MAB problem, the research on MAB problems has primarily focused on policies for a single agent. The increasing importance of networked systems warrants the development of distributed algorithms for multiple communicating agents faced with MAB problems. In this paper, we build upon previous work by extending two popular single-agent algorithms for the stochastic MAB problem to the distributed multiple agent setting and analyze decision-making performance as a function of the network structure.

The MAB problem has been extensively studied (see [6] for a survey). In their seminal work, Lai and Robbins [7] established a logarithmic lower bound on the expected number of times a sub-optimal arm needs to be selected by an optimal policy in a frequentist setting. In another seminal work, Auer et al. [8] developed the upper confidence bound (UCB) algorithm for the stochastic MAB problem, which achieves the lower bound in [7] uniformly in time.

The MAB problem has also been studied in the Bayesian setting. Kaufmann et al. [9] proposed the Bayes-UCB algorithm and showed that it achieves Lai-Robbins’ lower bound for Bernoulli rewards and uninformative priors. Reverdy et al. [10] developed and analyzed the upper credible limit (UCL) algorithm for correlated multiarmed bandits by applying the approach of [9] to the case of Gaussian rewards.

The classical single-agent MAB problem was extended by Anantharam et al. [11] to the setting of multiple centralized players. Recently, researchers [12, 13, 5] have studied the decentralized multi-player MAB problem with no communication among agents. Kar et al. [14] investigated the multi-agent MAB problem in a leader-follower setting. Here, we use a running consensus algorithm [15] for assimilation of information. Running consensus has been used to study related collective decision-making models in social networks [16].

In the present paper we study the distributed cooperative MAB problem in which agents are faced with a stochastic MAB problem and communicate their information with their neighbors in an undirected and connected communication graph. We use a set of running consensus algorithms for cooperative estimation of the mean reward at each arm, and we design an arm selection heuristic that leads to an order-optimal performance for the group. The major contributions of this paper are as follows.

First, we propose and thoroughly analyze the coop-UCB2 and coop-UCL algorithms. We derive bounds on decision-making performance for the group and characterize the influence of the network structure on performance. To predict nodal performance, we propose a measure of nodal “explore-exploit centrality,” which depends on the location of the node in the graph.

Second, we demonstrate that the ordering of nodes by performance predicted by our explore-exploit centrality measure matches the order obtained using numerical simulations. We also show that incorporating priors that are informative about the correlation structure markedly improves performance.

The remainder of the paper is organized as follows. In Section II we introduce the cooperative MAB problem. In Section III we recall a cooperative estimation algorithm. In Section IV we review the coop-UCB algorithm and propose and analyze the improved coop-UCB2 algorithm. In Section V we propose and analyze the new coop-UCL algorithm. We illustrate our analytic results with numerical examples in Section VI. We conclude in Section VII.

## II Cooperative Multiarmed Bandit Problem

Consider an MAB problem with $N$ arms and $M$ decision-making agents. The reward associated with arm $i \in \{1, \dots, N\}$ is a random variable with an unknown mean $m_i$. Let the communication topology of agents be modeled by an undirected graph $\mathcal{G}$ in which each node represents an agent and edges represent the communication between agents. Let $A \in \mathbb{R}^{M \times M}$ be the adjacency matrix associated with $\mathcal{G}$ and let $L \in \mathbb{R}^{M \times M}$ be the corresponding Laplacian matrix. We assume that the graph $\mathcal{G}$ is connected, i.e., there exists a path between each pair of nodes.

Let the $k$-th agent choose arm $i^k(t)$ at time $t \in \{1, \dots, T\}$ and receive a reward $r^k(t)$. The objective of each decision-maker $k$ is to choose, using its local information, a sequence of arms $\{i^k(t)\}_{t \in \{1, \dots, T\}}$ such that the total expected cumulative reward $\sum_{k=1}^{M} \sum_{t=1}^{T} m_{i^k(t)}$ is maximized, where $T$ is the horizon length of the sequential allocation process.

For an MAB problem, the expected regret of agent $k$ at time $t$ is defined by $R^k(t) = m_{i^*} - \mathbb{E}[m_{i^k(t)}]$, where $m_{i^*} = \max\{m_i \mid i \in \{1, \dots, N\}\}$. The collective objective of the $M$ decision-makers can be equivalently defined as minimizing the expected cumulative regret defined by $\sum_{k=1}^{M} \sum_{t=1}^{T} R^k(t) = \sum_{k=1}^{M} \sum_{i=1}^{N} \Delta_i \, \mathbb{E}[n_i^k(T)]$, where $n_i^k(T)$ is the total cumulative number of times arm $i$ has been chosen by agent $k$ until time $T$ and $\Delta_i = m_{i^*} - m_i$ is the expected regret due to picking arm $i$ instead of arm $i^*$. It is known that the regret of any algorithm for an MAB problem is asymptotically lower bounded by a logarithmic function of the horizon length $T$ [7], [11], i.e., no algorithm can achieve an expected cumulative regret smaller than a logarithmic function of horizon length as $T \to \infty$.
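The regret decomposition above can be evaluated directly once the selection counts are known. The following is a minimal sketch, assuming the arm means are available (as they are in a simulation) and using hypothetical names `means` and `counts` that do not appear in the paper.

```python
def cumulative_regret(means, counts):
    """Group regret: sum over agents k and arms i of gap_i * n_i^k(T).

    means  -- list of arm means (known only in simulation)
    counts -- counts[k][i] is the number of times agent k chose arm i
    """
    best = max(means)
    gaps = [best - m for m in means]  # regret of picking arm i instead of the best arm
    return sum(gaps[i] * n for row in counts for i, n in enumerate(row))


if __name__ == "__main__":
    means = [1.0, 0.5, 0.2]          # illustrative values, not from the paper
    counts = [[8, 1, 1], [6, 3, 1]]  # two agents, three arms
    print(cumulative_regret(means, counts))
```

Only selections of suboptimal arms contribute, so bounding the regret reduces to bounding how often each suboptimal arm is chosen.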

In this paper, we focus on Gaussian rewards, i.e., the reward at arm $i$ is sampled from a Gaussian distribution with mean $m_i$ and variance $\sigma_s^2$. We assume that the variance $\sigma_s^2$ is known and is the same at each arm. In the context of Gaussian rewards, the lower bound [11] on the expected number of times a suboptimal arm $i$ is selected by a fusion center that has access to the reward for each agent is

$$\mathbb{E}\left[\sum_{k=1}^{M} n_i^k(T)\right] \ge \left(\frac{2\sigma_s^2}{\Delta_i^2} + o(1)\right) \ln T. \tag{1}$$

In the following, we will design policies that sample a suboptimal arm within a constant factor of the above bound.
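To get a feel for the scale of this bound, the sketch below evaluates only its leading term. It assumes the standard Gaussian form of the Lai-Robbins bound, in which the Kullback-Leibler divergence between two equal-variance Gaussians whose means differ by the gap is gap squared over twice the variance. The vanishing lower-order term is ignored, and the function name and numerical values are illustrative.

```python
import math


def lower_bound_pulls(sigma_s, gap, horizon):
    """Leading term (2 * sigma_s^2 / gap^2) * ln(T) of the Gaussian lower bound.

    sigma_s -- sampling standard deviation of the rewards
    gap     -- difference between the best mean and this arm's mean
    horizon -- horizon length T
    """
    return 2.0 * sigma_s ** 2 / gap ** 2 * math.log(horizon)


if __name__ == "__main__":
    # Halving the gap quadruples the number of required suboptimal selections.
    print(lower_bound_pulls(1.0, 0.5, 1000))
    print(lower_bound_pulls(1.0, 0.25, 1000))
```

The count is for the group as a whole, so with more agents each agent's share of the unavoidable suboptimal selections shrinks.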

## III Cooperative Estimation of Mean Rewards

In this section we recall the algorithm for cooperative estimation of mean rewards proposed in our earlier work [17].

### III-A Cooperative Estimation Algorithm

For distributed cooperative estimation of the mean reward at each arm $i$, we employ two running consensus algorithms to estimate (i) the total reward provided at the arm, and (ii) the total number of times the arm has been sampled.

Let $\hat{s}_i^k(t)$ and $\hat{n}_i^k(t)$ be agent $k$'s estimate of the total reward provided at arm $i$ per unit agent and the total number of times arm $i$ has been selected until time $t$ per unit agent, respectively. Using $\hat{s}_i^k(t)$ and $\hat{n}_i^k(t)$, agent $k$ can calculate $\hat{\mu}_i^k(t)$, the estimated empirical mean of arm $i$ at time $t$, as

$$\hat{\mu}_i^k(t) = \frac{\hat{s}_i^k(t)}{\hat{n}_i^k(t)}. \tag{2}$$

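Let $i^k(t)$ be the arm sampled by agent $k$ at time $t$ and let $\xi_i^k(t) = \mathbb{1}(i^k(t) = i)$, where $\mathbb{1}(\cdot)$ is the indicator function, which is equal to 1 if $i^k(t) = i$ and 0 otherwise. Define $r_i^k(t) = \mathrm{rnd} \cdot \xi_i^k(t)$, where rnd is a random variable sampled from $\mathcal{N}(m_i, \sigma_s^2)$. Let $P$ be a row stochastic matrix given by

$$P = I_M - \frac{\kappa}{d_{\max}} L, \tag{3}$$

where $I_M$ is the identity matrix of order $M$, $\kappa \in (0, 1]$ is a step size parameter [18], $d_{\max} = \max\{\deg(i) \mid i \in \{1, \dots, M\}\}$, and $\deg(i)$ is the degree of node $i$.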

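The estimates $\hat{n}_i^k(t)$ and $\hat{s}_i^k(t)$ are updated locally using running consensus [15] as follows:

$$\hat{\mathbf{n}}_i(t+1) = P \hat{\mathbf{n}}_i(t) + P \boldsymbol{\xi}_i(t), \tag{4}$$

$$\hat{\mathbf{s}}_i(t+1) = P \hat{\mathbf{s}}_i(t) + P \mathbf{r}_i(t), \tag{5}$$

where $\hat{\mathbf{n}}_i(t)$, $\hat{\mathbf{s}}_i(t)$, $\boldsymbol{\xi}_i(t)$, and $\mathbf{r}_i(t)$ are vectors of $\hat{n}_i^k(t)$, $\hat{s}_i^k(t)$, $\xi_i^k(t)$, and $r_i^k(t)$, $k \in \{1, \dots, M\}$, respectively.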
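The running consensus update can be sketched in a few lines. The code below is an illustrative sketch under two assumptions: the consensus matrix is built as the identity minus a scaled graph Laplacian, with the step size divided by the maximum degree, and each update mixes the previous estimate plus the new observation through that matrix. The function names and the three-agent path graph are hypothetical.

```python
def consensus_matrix(adj, kappa=1.0):
    """Row stochastic P = I - (kappa / d_max) * L for a symmetric 0/1 adjacency matrix."""
    m = len(adj)
    deg = [sum(row) for row in adj]
    d_max = max(deg)
    P = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            lap = (deg[a] if a == b else 0) - adj[a][b]  # Laplacian entry, L = D - A
            P[a][b] = (1.0 if a == b else 0.0) - kappa / d_max * lap
    return P


def consensus_step(P, est, new):
    """One running-consensus update for a single arm: est(t+1) = P est(t) + P new(t)."""
    m = len(P)
    mixed = [est[j] + new[j] for j in range(m)]
    return [sum(P[k][j] * mixed[j] for j in range(m)) for k in range(m)]


if __name__ == "__main__":
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph on three agents
    P = consensus_matrix(adj)
    n_hat = [0.0, 0.0, 0.0]
    # Agent 0 samples the arm at every step; the others never do.
    for _ in range(50):
        n_hat = consensus_step(P, n_hat, [1.0, 0.0, 0.0])
    print(n_hat)  # each entry tracks the per-agent average count, 50 / 3
```

Because the graph is undirected, this matrix is symmetric and hence doubly stochastic, so the sum of the agents' estimates always equals the true total. The same step is applied to the reward totals, and dividing the two estimates gives each agent's estimate of the mean.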

### III-B Analysis of the Cooperative Estimation Algorithm

We now recall the performance of the estimation algorithm defined by (2–5). Let be the total number of times arm has been selected per unit agent until time , and let be the total reward provided at arm per unit agent until time . Also, let denote the -th largest eigenvalue of , the eigenvector corresponding to , the -th entry of , and

(6) |

Note that and . Let us define

We also define

(7) |

where . Furthermore, let

We note that both and depend only on the topology of the communication graph and are measures of distributed cooperative estimation performance. We now recall the following result from [17].

## IV Frequentist Cooperative Decision-Making

In this section, we first review the coop-UCB algorithm proposed in our earlier work [17]. We then improve on this algorithm with a new algorithm: coop-UCB2. Unlike coop-UCB the improved algorithm does not require each agent to know the global graph structure. Finally, we compute bounds on the performance of the group for this algorithm as a function of the graph structure.

### IV-A The coop-UCB Algorithm

The coop-UCB algorithm is analogous to the UCB algorithm [8], and uses a modified decision-making heuristic that captures the effect of the additional information an agent receives through communication with other agents as well as the rate of information propagation through the network.

The coop-UCB algorithm is initialized by each agent sampling each arm once and proceeds as follows. At each time , each agent updates its cooperative estimate of the mean reward at each arm using the distributed cooperative estimation algorithm described in (2–5). Then, at time each agent selects the arm with maximum , where

(8) |

The coop-UCB provides a distributed, cooperative solution to the MAB problem such that every agent in the network achieves logarithmic regret. However, the heuristic may be overly restrictive in the sense that it requires the agent to know , which depends on the global graph structure. Further, agents with a relatively high are essentially forced to explore more while better positioned agents exploit, leading to wide disparities in performance across some networks. We will develop the coop-UCB2 algorithm that addresses these issues in the next section.

### IV-B The coop-UCB2 Algorithm

The coop-UCB2 algorithm is initialized by each agent sampling each arm once and proceeds as follows. At each time $t$, each agent $k$ updates its cooperative estimate of the mean reward at each arm using the distributed cooperative estimation algorithm described in (2–5). Then, at time $t$ each agent $k$ selects the arm with maximum $Q_i^k(t) = \hat{\mu}_i^k(t) + C_i^k(t)$, where

$$C_i^k(t) = \sigma_s \sqrt{2\gamma \, \frac{\hat{n}_i^k(t) + f(t)}{M \hat{n}_i^k(t)} \cdot \frac{\ln t}{\hat{n}_i^k(t)}}, \tag{9}$$

$f(t)$ is an increasing sublogarithmic function of $t$, and $\gamma > 1$. Note that the heuristic requires the agent $k$ to know the total number of agents $M$, but not the global graph structure.
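The selection step can be sketched as follows. This is a hedged illustration, not the authors' implementation: the exploration width is an assumed UCB-style form that shrinks with the agent's estimated per-agent count and with the number of agents, and the default `f` (square root of the logarithm) and `gamma=1.5` are hypothetical choices that merely satisfy "increasing and sublogarithmic" and "greater than one".

```python
import math


def coop_ucb2_width(n_hat, t, num_agents, sigma_s=1.0, gamma=1.5, f=None):
    """Assumed exploration width for one arm, given the estimated count n_hat > 0."""
    if f is None:
        f = lambda s: math.sqrt(math.log(s))  # hypothetical sublogarithmic choice
    ratio = (n_hat + f(t)) / (num_agents * n_hat)
    return sigma_s * math.sqrt(2.0 * gamma * ratio * math.log(t) / n_hat)


def select_arm(mu_hat, n_hat, t, num_agents, **kwargs):
    """Return the index of the arm maximizing estimated mean plus exploration width."""
    q = [m + coop_ucb2_width(n, t, num_agents, **kwargs)
         for m, n in zip(mu_hat, n_hat)]
    return max(range(len(q)), key=q.__getitem__)


if __name__ == "__main__":
    # Equal estimated means: the rarely sampled arm is preferred (exploration).
    print(select_arm([0.5, 0.5], [50.0, 1.0], t=10, num_agents=3))
    # Equal counts: the arm with the higher estimated mean is preferred (exploitation).
    print(select_arm([1.0, 0.0], [5.0, 5.0], t=10, num_agents=3))
```

Only quantities local to the agent, plus the number of agents, enter the index, which is the property that distinguishes this heuristic from coop-UCB.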

###### Theorem 1 (Regret of the coop-UCB2 Algorithm)

For the coop-UCB2 algorithm and the cooperative Gaussian MAB problem, the number of times a suboptimal arm is selected by all agents until time satisfies

where .

Proof: We proceed similarly to [8]. The number of selections of a suboptimal arm by all agents until time is

(10) |

where is a constant that will be chosen later.

At a given time an individual agent will choose a suboptimal arm only if . For this condition to be true at least one of the following three conditions must hold:

(11) | ||||

(12) | ||||

(13) |

We now bound the probability that (11) holds:

(14) |

where is the standard Gaussian random variable and the last inequality follows from the tail bounds for the error function [19] and the second statement of Proposition 1.

We now examine the argument of (14). Note that

Now for , , and therefore

If , then

It follows analogously that

We now examine the event (13).

(15) |

The quadratic equation (15) can be solved to find the roots, and if is greater than the larger root the inequality will never hold. Solving the quadratic equation (15), we obtain that event (13) does not hold if

Now, we set . It follows from monotonicity of and and statement (i) of Proposition 1 that event (13) does not hold if . Therefore, from (10) we get that

(16) |

The argument of the first summation of the RHS yields

Substituting into (16) then establishes the theorem.

###### Remark 1 (Asymptotic Regret for coop-UCB2)

In the limit , , and

and we recover, within a constant factor, the bound for a centralized agent given in (1).

###### Remark 2 (Performance of Individual Agents)

Theorem 1 provides bounds on the performance of the group as a function of the graph structure, and the logarithmic portion of the bound is independent of agent location. However, the constant factor is dependent on the agent’s position in the network since it depends on . In this sense, can be thought of as a measure of “explore-exploit centrality,” which indicates that agents with a higher will contribute more to the group’s regret.

## V Bayesian Cooperative Decision-Making

In this section, we extend the coop-UCB2 algorithm to a Bayesian setting and develop the coop-UCL algorithm. The Bayesian setting allows us to model correlated bandits and incorporate a priori knowledge about reward and correlation structure in the Bayesian prior. We first recall the UCL algorithm proposed in [10, 9] and extend it to the cooperative setting. We then analyze the performance of this algorithm for an uninformative prior.

### V-A The UCL Algorithm

The UCL algorithm developed in [10] applies the approach of Bayes-UCB [9] to correlated Gaussian bandits. The UCL algorithm at each time computes the posterior distribution of mean rewards at each option and then computes the upper-credible-limit for each arm, i.e., an upper bound that holds with probability . The algorithm chooses the arm with highest upper credible limit. For Gaussian rewards, the upper-credible-limit can be written as

(17) |

where is the posterior mean and the posterior standard deviation of mean reward at time . is the standard Gaussian inverse cumulative distribution function.
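For a Gaussian posterior, an upper credible limit is the posterior mean plus the posterior standard deviation scaled by a standard Gaussian quantile. The sketch below assumes the credible level is written as one minus a small tail probability, here called `alpha` (a hypothetical name), and uses the standard library's `statistics.NormalDist` for the inverse cumulative distribution function.

```python
from statistics import NormalDist


def upper_credible_limit(post_mean, post_std, alpha):
    """Upper bound on the mean reward that holds with posterior probability 1 - alpha.

    post_mean -- posterior mean of the arm's mean reward
    post_std  -- posterior standard deviation of the arm's mean reward
    alpha     -- tail probability in (0, 1); smaller values give a wider limit
    """
    return post_mean + post_std * NormalDist().inv_cdf(1.0 - alpha)


if __name__ == "__main__":
    # A less certain arm can have the higher limit despite a lower posterior mean.
    print(upper_credible_limit(1.0, 0.1, alpha=0.01))
    print(upper_credible_limit(0.8, 0.5, alpha=0.01))
```

Decreasing the tail probability over time widens the limit relative to the posterior standard deviation, which keeps exploration alive as the posterior concentrates.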

Let the prior on rewards from each arm be multivariate Gaussian with mean vector and covariance matrix . Then, the posterior mean and covariance of mean reward at time can be computed using the following recursive update rule [20]:

(18) |

where and are column vectors of and , respectively, and is the indicator function of selecting arm at time . The update equation (18) can be reduced to

(19) |

where , is a diagonal matrix with entries , and is the vector of , which is the empirical mean of rewards from arm until time . Note that diagonal entries of are .
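In the special case of uncorrelated arms, where the prior covariance is diagonal, the batch update decouples into a scalar precision-weighted average for each arm. The sketch below covers only that special case, with hypothetical argument names; the correlated case requires the full matrix form.

```python
def gaussian_posterior(prior_mean, prior_var, emp_mean, n, sigma_s):
    """Posterior mean and variance of one arm's mean reward (uncorrelated case).

    prior_mean, prior_var -- Gaussian prior on the arm's mean reward
    emp_mean              -- empirical mean of the n rewards observed at this arm
    n                     -- number of observations (may be 0)
    sigma_s               -- known sampling standard deviation
    """
    prior_prec = 1.0 / prior_var
    post_prec = prior_prec + n / sigma_s ** 2  # precisions add
    post_mean = (prior_prec * prior_mean + n * emp_mean / sigma_s ** 2) / post_prec
    return post_mean, 1.0 / post_prec


if __name__ == "__main__":
    # One observation with unit prior variance and unit sampling variance:
    # the posterior mean lies halfway between the prior mean and the observation.
    print(gaussian_posterior(0.0, 1.0, 2.0, 1, 1.0))
```

As the prior variance grows, the posterior mean approaches the empirical mean and the posterior variance approaches the sampling variance divided by the number of observations, which is the uninformative-prior limit.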

### V-B The coop-UCL Algorithm

We now extend the UCL algorithm to the distributed cooperative setting and propose the coop-UCL algorithm. In the coop-UCL algorithm, each agent first computes an approximate posterior distribution of mean rewards conditioned on rewards obtained by all the agents. To this end, each agent uses the approximate frequentist estimator from Section III in update equation (19).

Let the prior of agent be a multivariate Gaussian distribution with mean and covariance . Let and be the estimated covariance matrix and posterior mean respectively at time . Then, the coop-UCL algorithm performs cooperative approximate Bayesian estimation:

(20) |

where is a diagonal matrix with diagonal entries , and .

After computing and , the coop-UCL algorithm at time requires each agent to choose the option with maximum -upper-credible-limit given by

(21) |

where is defined such that

is the -th entry of , and is the -th diagonal entry of .

### V-C Regret of the coop-UCL Algorithm

We now derive bounds on the expected cumulative regret for each agent using the coop-UCL algorithm for uninformative priors for each agent. For an uninformative prior, , for each , and consequently, and .

###### Theorem 2 (Regret of the coop-UCL Algorithm)

For the Gaussian MAB problem and the coop-UCL algorithm with uninformative priors for each agent, the number of times a suboptimal arm is selected by all agents until time satisfies

where .

Proof: For uninformative priors, coop-UCL is analogous to coop-UCB2 with . Similar to the proof of Theorem 1, we note that

(22) |

It follows that for

Similarly, .

## VI Numerical Illustrations

In this section, we illustrate our theoretical analyses from the previous sections with numerical examples. We first demonstrate that the ordering of the performance of nodes obtained through numerical simulations is identical to the ordering predicted by the nodal explore-exploit centrality measure: the larger a node's centrality measure, the lower its performance. We then investigate the effect of the graph connectivity on the performance of agents in random graphs.

For all simulations we consider a -armed bandit problem with mean rewards drawn from a normal random distribution with mean and standard deviation . The sampling standard deviation is . These parameters were selected to give illustrative results within the displayed time horizon, but the relevant conclusions hold across a wide variation of parameters. The simulations used .

###### Example 1 (Regret on Fixed Graphs)

Consider the set of agents communicating according to the graph in Fig. 1 and using the coop-UCB2 algorithm to handle the explore-exploit tradeoff in the distributed cooperative MAB problem. The values of for nodes and are and , respectively. As noted in Remark 2, agent should have the lowest regret, agents and should have equal and intermediate regret, and agent should have the highest regret. These predictions are validated in our simulations shown in Fig. 1. The expected cumulative regret in our simulations is computed using Monte-Carlo runs.

Fig. 1 demonstrates the relative performance differences between coop-UCB, coop-UCB2, coop-UCL, and single agent UCB with the same run parameters. Here the coop-UCL algorithm is shown with an informative prior and no correlation structure. Each agent in the coop-UCL simulation shown here has and . The use of priors markedly improves performance.

We now explore the effect of explore-exploit centrality on the performance of an agent in an Erdős-Rényi (ER) random graph. ER graphs are a widely used class of random graphs where any two agents are connected with a given probability [21].
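A connected ER graph can be obtained by rejection sampling: draw each possible edge independently and redraw the whole graph until it is connected. The sketch below is one way to do this with the standard library. The function names, the seed, and the retry limit are illustrative and are not taken from the simulations reported here.

```python
import random
from collections import deque


def er_graph(m, p, rng):
    """Symmetric 0/1 adjacency matrix in which each edge is present with probability p."""
    adj = [[0] * m for _ in range(m)]
    for a in range(m):
        for b in range(a + 1, m):
            if rng.random() < p:
                adj[a][b] = adj[b][a] = 1
    return adj


def is_connected(adj):
    """Breadth-first search from node 0; connected if every node is reached."""
    seen = {0}
    queue = deque([0])
    while queue:
        a = queue.popleft()
        for b, edge in enumerate(adj[a]):
            if edge and b not in seen:
                seen.add(b)
                queue.append(b)
    return len(seen) == len(adj)


def connected_er_graph(m, p, seed=0, max_tries=1000):
    """Redraw ER graphs until one is connected (rejection sampling)."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        adj = er_graph(m, p, rng)
        if is_connected(adj):
            return adj
    raise RuntimeError("no connected graph found; try a larger p")


if __name__ == "__main__":
    for row in connected_er_graph(6, 0.5):
        print(row)
```

Conditioning on connectedness is needed because the cooperative estimation algorithm assumes a connected communication graph.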

###### Example 2 (Regret on Random Graphs)

Consider a set of agents communicating according to an ER graph and using the coop-UCB2 algorithm to handle the explore-exploit tradeoff in the aforementioned MAB problem. In our simulations, we consider connected ER graphs, and for each ER graph we compute the expected cumulative regret of agents using Monte-Carlo simulations with , as in (3), and . We show the behavior of the expected cumulative regret of each agent as a function of the normalized in Fig. 1. It is evident that increased results in a sharp decrease in performance. Conversely, low is indicative of better performance. This disparity is due to the relative scarcity of information at nodes that are in general less “central.”

## VII Final Remarks

In this paper we used the distributed multi-agent MAB problem to explore cooperative decision-making in networks. We designed the coop-UCB2 and coop-UCL algorithms, which are frequentist and Bayesian distributed algorithms, respectively, in which agents do not need to know the graph structure. We proved bounds on performance, showing order-optimal performance for the group. Additionally, we investigated the performance of individual agents in the network as a function of the graph topology, using a proposed measure of nodal explore-exploit centrality.

Future research directions include rigorously exploring other communications schemes, which may offer better performance or be more suitable for modeling certain networked systems. It will be important to consider the tradeoff between communication frequency and performance as well as the presence of noisy communications.

## References

- [1] V. Srivastava, P. Reverdy, and N. E. Leonard. Surveillance in an abruptly changing world via multiarmed bandits. In IEEE CDC, pages 692–697, 2014.
- [2] M. Y. Cheung, J. Leighton, and F. S. Hover. Autonomous mobile acoustic relay positioning as a multi-armed bandit with switching costs. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 3368–3373.
- [3] J. R. Krebs, A. Kacelnik, and P. Taylor. Test of optimal sampling by foraging great tits. Nature, 275(5675):27–31, 1978.
- [4] V. Srivastava, P. Reverdy, and N. E. Leonard. On optimal foraging and multi-armed bandits. In Allerton Conference on Comm., Control, and Computing, pages 494–499, 2013.
- [5] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.
- [6] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [7] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- [8] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
- [9] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In Int. Conf. on Artificial Intelligence and Statistics, pages 592–600, 2012.
- [10] P. B. Reverdy, V. Srivastava, and N. E. Leonard. Modeling human decision making in generalized Gaussian multiarmed bandits. Proceedings of the IEEE, 102(4):544–571, 2014.
- [11] V. Anantharam, P. Varaiya, and J. Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987.
- [12] D. Kalathil, N. Nayyar, and R. Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.
- [13] Y. Gai and B. Krishnamachari. Distributed stochastic online learning policies for opportunistic spectrum access. IEEE Transactions on Signal Processing, 62(23):6184–6193, 2014.
- [14] S. Kar, H. V. Poor, and S. Cui. Bandit problems in networks: Asymptotically efficient distributed allocation rules. In IEEE CDC and ECC, pages 1771–1778, 2011.
- [15] P. Braca, S. Marano, and V. Matta. Enforcing consensus while monitoring the environment in wireless sensor networks. IEEE Transactions on Signal Processing, 56(7):3375–3380, 2008.
- [16] V. Srivastava and N. E. Leonard. Collective decision-making in ideal networks: The speed-accuracy trade-off. IEEE Transactions on Control of Network Systems, 1(1):121–132, 2014.
- [17] P. Landgren, V. Srivastava, and N. E. Leonard. On distributed cooperative decision-making in multiarmed bandits. In European Control Conference, Aalborg, Denmark, 2016.
- [18] R. Olfati-Saber and R. M. Murray. Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on Automatic Control, 49(9):1520–1533, 2004.
- [19] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Dover Publications, 1964.
- [20] S. M. Kay. Fundamentals of Statistical Signal Processing, Volume I : Estimation Theory. Prentice Hall, 1993.
- [21] B. Bollobás. Random Graphs. Springer, 1998.