Decentralized Cooperative Stochastic Multi-Armed Bandits
Abstract
We study a decentralized cooperative stochastic multi-armed bandit problem with $K$ arms on a network of $N$ agents. In our model, the reward distribution of each arm is agent-independent. Each agent iteratively chooses one arm to play and then communicates with her neighbors. The aim is to minimize the total network regret. We design a fully decentralized algorithm that uses a running consensus procedure to compute, with some delay, accurate estimates of the average reward obtained by all the agents for each arm, and then uses an upper confidence bound algorithm that accounts for the delay and error of these estimates. We analyze the algorithm and show that, up to a constant, our regret bounds are better for all networks than those of other algorithms designed to solve the same problem. For some graphs, our regret bounds are significantly better.
1 Introduction
One of the most studied problems in online learning is the multi-armed bandit (MAB) problem. In the classical version of this problem, a learner has to choose, or pull, one among a finite set of actions, or arms, and receives a reward corresponding to this action. The learner keeps choosing actions and obtaining rewards iteratively, and the aim is to obtain a cumulative reward as close as possible to the reward that the best fixed action would have yielded. In the MAB problem, only the rewards corresponding to the chosen actions are observed. There are two main variants of this problem, namely the stochastic and the adversarial MAB problem. In the former, each action yields a reward drawn from a fixed unknown distribution. In the latter, the rewards are chosen by an adversary who is usually assumed to be aware of the learner's strategy but does not know in advance the outcome of the random choices the strategy makes. Optimal algorithms have been developed in both the stochastic and the adversarial case; see [6].
Another active area of research is the development of distributed algorithms for solving optimization and decision-making problems, which is motivated in part by the recent development of large-scale distributed systems that make it possible to speed up computations. Sometimes the distributed computation is a necessary restriction that is part of the problem, as in packet routing or in sensor networks. Gossip algorithms are a commonly used framework in this area [5, 21, 23, 9, 10, 22]. In this context, we have an iterative algorithm whose processing units are the nodes of a graph, and they can communicate information to their neighbors at each time step. In these problems, it is usual to have a value at each node that we want to average or synchronize across the network; in fact, most solutions reduce to approximate averaging or synchronization. To achieve this end, we can use the following simple and effective method: each node iteratively computes a weighted average of its own value and the ones communicated by its neighbors, ensuring that the final value at each node is the average of the initial values. Formally, this communication can be represented as multiplication by a matrix that respects the network structure and satisfies some conditions that guarantee fast averaging.
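The averaging step just described can be sketched in a few lines. The following toy example is our own illustration, with a 4-node cycle and weights chosen by us; every node converges to the average of the initial values.

```python
# Minimal sketch of gossip averaging: each node repeatedly replaces its
# value with a weighted average of its own and its neighbors' values.

def gossip_step(values, weights):
    """One synchronous round: every node averages over its neighborhood."""
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n)) for i in range(n)]

# Doubly stochastic matrix for a 4-node cycle: 1/2 self-weight, 1/4 per neighbor.
W = [
    [0.5, 0.25, 0.0, 0.25],
    [0.25, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.5, 0.25],
    [0.25, 0.0, 0.25, 0.5],
]

x = [4.0, 0.0, 8.0, 0.0]  # initial values; the true average is 3.0
for _ in range(50):
    x = gossip_step(x, W)

print(x)  # every entry is now very close to 3.0
```

Since the second largest eigenvalue of this matrix in absolute value is 1/2, the distance to the average shrinks by at least that factor per round.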
This work focuses on a distributed stochastic MAB problem, which we solve with a gossip algorithm. We consider a network of $N$ agents that play the same MAB problem, and the aim is to obtain regret close to the one incurred by an optimal centralized algorithm running for $NT$ iterations, where $T$ is the number of iterations of the decentralized algorithm. At each time step each agent simultaneously pulls an arm and obtains a reward drawn from the distribution corresponding to the pulled arm, and then agents send information to their neighbors. Rewards are drawn independently across time and across the network. Our algorithm incurs regret equal to the optimal regret in the centralized problem plus a term that depends on the spectral gap of the graph and the number of agents (cf. Theorem 1). At each iteration each agent sends values to her neighbors. This condition can be relaxed at the expense of incurring greater regret. The algorithm needs access to the total number of agents in the network and to an upper bound on the spectral gap of the communication matrix.
The MAB problem epitomizes the exploration-exploitation dilemma: in order to maximize the cumulative reward, one has to trade off exploration of the arms against exploitation of the seemingly best arm. This dilemma appears in a substantial number of applications, and the MAB problem models many of them. Applications range from advertising systems to clinical trials and queuing and scheduling. The distributed computation could be a restriction imposed by the impossibility of having a single computational unit perform the required number of actions, or it could be used to improve the total running time by a factor of $N$, since $N$ arms are pulled at each time step.
The contribution of this work lies in presenting an algorithm for the fully decentralized setting that exhibits a natural and simple dependence on the spectral gap of the communication matrix, yielding lower and easier-to-interpret asymptotic regret compared to other algorithms previously designed for the same setting. In the literature, other models for distributed MAB problems have been considered, but many of them are not fully decentralized or impose restrictions on the network topology. In this work, we focus on a fully decentralized model running on an arbitrary network.
2 Model and Problem Formulation
The model we study in this work is the following. We consider a multi-agent network with $N$ agents. The agents are represented by the nodes of an undirected and connected graph, and each agent can only communicate with her neighbors. Agents play the same $K$-armed bandit problem for $T$ time steps, send some values to their neighbors after each play, and receive the information sent by their respective neighbors to use in the next time step if they so wish. If an agent plays arm $k$, she receives a reward drawn from a fixed distribution with mean $\mu_k$ that is independent of the agent. The draw is independent of actions taken at previous time steps and of actions played by other agents. We assume that rewards come from subgaussian distributions with variance proxy $\sigma^2$.
Assume without loss of generality that $\mu_1 = \max_k \mu_k$, and let the suboptimality gap be defined as $\Delta_k = \mu_1 - \mu_k$ for any action $k$. Let $a_t^v$ be the random variable that represents the action played by agent $v$ at time $t$. Let $n_k^v(t)$ be the number of times arm $k$ is pulled by node $v$ up to time $t$, and let $n_k(t) = \sum_{v=1}^{N} n_k^v(t)$ be the number of times arm $k$ is pulled by all the nodes in the network up to time $t$. We define the regret
$$R_T = \sum_{k=1}^{K} \Delta_k\, \mathbb{E}[n_k(T)].$$
The problem is to minimize the regret while allowing each agent to send only a limited number of values to her neighbors per iteration.
3 Related Work
There are several works that study stochastic and non-stochastic distributed multi-armed bandit problems, but the precise models vary considerably.
In the stochastic case, [17] and its follow-up [18] propose three algorithms to solve the same problem that we consider in this paper: coopUCB, coopUCB2 and coopUCL. The algorithm coopUCB follows a variant of the natural approach to this problem that will be explained in Section 4. It needs to know more about the graph than just the number of nodes and the spectral gap: the algorithm uses a value per node that depends on the whole spectrum and the set of eigenvectors of the communication matrix. The algorithm coopUCB2 is a modification of coopUCB in which the only information used about the graph is the number of nodes, but its regret is greater. Finally, coopUCL is a Bayesian algorithm that also incurs greater regret than coopUCB. Our algorithm obtains a lower asymptotic regret than these algorithms while keeping the same computational complexity (cf. Remark 2).
Many other variants of the distributed stochastic MAB problem have been proposed. In [8], at each time step the agents can either broadcast the last obtained reward to the whole network or pull an arm. In [15], each agent can only send information to one agent per round, but she can send it to any other agent in the network. [25] studies the MAB problem in P2P random networks and analyzes the regret based on delayed reward estimates. Other authors do not assume independence of the reward draws across the network. [19] and [13] consider a distributed MAB problem with collisions: if two players pull the same arm, the reward is split or no reward is obtained at all. Moreover, in the latter and its follow-up [20], the act of communicating increases the regret. [1] also considers a model with collisions, in which agents have to learn from collisions of actions rather than by exchanging information.
Other authors have studied the problem of identifying an optimal arm using a distributed network. In [11], matching upper and lower bounds are provided in the case in which communication happens only once; the graph topology is restricted to be the complete graph. They provide an algorithm that achieves a speed-up given a limited number of communication steps. In [24], each agent plays a different MAB problem and the total regret is minimized in order to identify the best action when averaged across nodes. Nodes only send values to their neighbors, but at each time step the arm played by all the nodes is given by a majority vote of the agents, so the algorithm is not completely decentralized. A distributed MAB problem with global feedback, that is, with no communication involved, is studied in [27]. [14] also considers a different distributed bandit model, in which only one agent observes the rewards of her plays while the others observe nothing and have to rely on the information of the first agent, which is broadcast.
With our model, there is not much that can be said in the adversarial case. Optimal algorithms for the centralized adversarial case have been designed [2]. In the decentralized case, if agents do not communicate and just run an optimal centralized algorithm at each node, the incurred regret exceeds the known lower bound by a factor depending on $N$ only, so only the dependence on $N$ can be improved. [4] studies a distributed adversarial MAB problem with some Byzantine users, that is, users that do not follow the protocol or report fake observations as they wish. They also obtain a regret guarantee for the case in which there are no Byzantine users. To the best of our knowledge, that is the first work that considers a decentralized adversarial MAB problem. They allow for communication rounds between decision steps, so it differs from our model in terms of communication. Also in the adversarial case, [7] studies an algorithm with graph-dependent regret guarantees. The model is the same as ours, but in their algorithm each agent communicates to her neighbors all the values computed by her, or received by her, that were computed in the last few iterations. Thus, in their work the communication between two nodes at a given time step can be larger than in ours.
4 Algorithm
We propose an algorithm that is an adaptation of UCB to the problem at hand and that uses a gossip protocol. We call the algorithm Distributed Delayed Upper Confidence Bound (DDUCB). UCB is a popular algorithm for the stochastic MAB problem. At each time step, UCB computes an upper confidence bound for the mean of each arm $k$, using two values: the observed empirical mean, $\hat\mu_k(t)$, and the number of times arm $k$ was pulled, $n_k(t)$. UCB plays at time $t$ the arm that maximizes an upper confidence bound of the form
$$\hat\mu_k(t) + \sqrt{\frac{2\sigma^2 \ln t}{n_k(t)}}.$$
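As an illustration, the UCB decision rule can be sketched as follows. The constant inside the confidence radius is one common choice and varies across UCB variants, so treat the exact index as an assumption rather than this paper's formula.

```python
import math

def ucb_choice(means, counts, t, sigma2=1.0):
    """Pick the arm maximizing the UCB index: empirical mean plus a
    confidence radius that shrinks as the arm is pulled more often.
    (Illustrative form; constants differ across UCB variants.)"""
    def index(k):
        if counts[k] == 0:
            return float("inf")  # force one initial pull per arm
        return means[k] + math.sqrt(2 * sigma2 * math.log(t) / counts[k])
    return max(range(len(means)), key=index)

# Arm 1 has the better empirical mean and has been pulled often, while
# arm 0 keeps a wide confidence radius from only two pulls.
arm = ucb_choice(means=[0.4, 0.6], counts=[2, 100], t=200)
print(arm)  # 0: the rarely pulled arm still wins on optimism
```

The rule is optimistic: an arm is chosen either because it looks good or because it is still poorly explored.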
In our setting, as the pulls are distributed across the network, agents do not have direct access to these two values, namely the number of times each arm was pulled across the network and the empirical mean reward for each arm computed over all those pulls. Our algorithm maintains good approximations of these values, and it incurs a regret that is no more than the one of a centralized UCB plus a term depending on the spectral gap and the number of nodes, but independent of time. The latter term is a consequence of the approximation of the aforementioned values. Let $s_k(t)$ be the sum of rewards coming from all the pulls of arm $k$ by the entire network up to time $t$. We can use a gossip protocol, for every $k$, to obtain at each node a good approximation of $s_k(t)/N$ and of the (normalized) number of times arm $k$ was pulled, i.e. $n_k(t)/N$. Let $\hat s_k^v(t)$ and $\hat n_k^v(t)$ be the approximations of these two quantities made by node $v$ with a gossip protocol at time $t$, respectively. Having this information at hand, agents could compute the ratio $\hat s_k^v(t) / \hat n_k^v(t)$ to get an estimate of the average reward of each arm. But care needs to be taken when computing these approximations.
A classical and effective way to keep a running approximation of the average of values that are iteratively added at each node is what we will refer to as the running consensus. In this protocol, every agent stores her current approximation and performs communication and computing steps alternately: at each time step, each agent computes a weighted average of her own and her neighbors' values and adds to it the new value she has computed. We can represent this operation in the following way. Let $P$ be a matrix that respects the structure of the network, which is represented by a graph $G$: $P_{uv} = 0$ if there is no edge in $G$ that connects $u$ to $v$. We assume $P$ is doubly stochastic, that is, the sum of each row and the sum of each column is $1$. We further assume that its second largest eigenvalue in norm is strictly less than $1$. See [26] for a discussion on how to choose $P$. If we denote by $x_t$ the vector containing the current approximations of all the agents and by $\xi_{t+1}$ the vector containing the new values added by each node, then the running consensus can be represented as
$$x_{t+1} = P x_t + \xi_{t+1}. \qquad (1)$$
The conditions imposed on $P$ ensure that values are averaged. Formally, if $1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_N$ are the norms of the eigenvalues of $P$, then, for any $t$ and any $v$ in the $N$-dimensional simplex,
$$\left\lVert P^t v - \tfrac{1}{N}\mathbf{1} \right\rVert_2 \leq \lambda_2^t; \qquad (2)$$
see [12], for instance.
see [12], for instance. A natural approach is to run running consensus algorithms in order to compute approximations of and , . [17] follow this approach and use extra global information of the graph to account for the inaccuracy of the mean estimate. We can estimate average rewards by their ratio and the number of times each arm was pulled can be estimated by multiplying the latter by . The running consensus protocols would be the following. For , start with and update , where the entry of contains the reward computed by node at time if arm is pulled or otherwise. Note that the update of each entry is done by a different node. Similarly, for , start with and update , where the entry of is if at time node pulled arm and otherwise.
The problem with this approach is that, even if the values computed are being mixed at a fast pace, it takes some time for the most recently added values to be mixed, resulting in poor approximations, especially if $N$ is large. Indeed, we can rewrite (1) as $x_t = \sum_{s=1}^{t} P^{t-s} \xi_s$, assuming $x_0 = 0$. For the values of $s$ that are not too close to $t$, we have by (2) that $P^{t-s} \xi_s$ is very close to the vector whose entries all equal the average of the values in $\xi_s$. However, for values of $s$ close to $t$ this is not true, and the values of $\xi_s$ heavily influence the resulting estimate, which is especially inaccurate as an estimate of the true mean if $N$ is large. The key observations for our algorithm are that the number of these values of $s$ close to $t$ is small, and that the regret of UCB does not increase much when working with delayed rewards, so we can temporarily ignore the recently computed rewards in order to work with much more accurate approximations. In particular, ignoring rewards that are at most $2C$ steps old, for a mixing parameter $C$ chosen as a function of the spectral gap, suffices for our purposes.
We consider that any value that was computed at least $C$ iterations before the current time step, where $C$ is a mixing parameter depending on the spectral gap, is mixed enough to be used to approximate the network sums of rewards and pulls. Agents run the running consensus in stages of $C$ iterations. Let $t_0$ be the time at which a stage begins, so it ends at $t_0 + C$. At $t_0$, each agent stores mixed rewards and numbers of pulls, done by the network, in a first pair of variables. In the first stage, these variables are initialized from local pulls: each agent pulls each arm once, receives the corresponding rewards and initializes the variables accordingly. The first pair of variables is not updated again until $t_0 + C$, so it contains information that, by the end of the stage, is delayed by up to $2C$ iterations. These values are used to run the upper confidence bound algorithm, and the time step used to compute the upper confidence bound is the number of rewards that these variables aggregate. On the other hand, at $t_0$ agents store all the values at hand in a second pair of variables, which the running consensus keeps mixing for the $C$ steps of the stage, so that at the end of the stage their contents can be assigned to the first pair of variables. Agents cannot add newly computed values to this second pair, because we want every value added to it to have been mixing for at least $C$ steps by the end of the stage. Thus, a third pair of variables is needed, to store the new rewards computed and the new pulls done during the stage. Agents start mixing the values in this third pair for convenience, but it is not necessary. At the end of the stage, the values in the third pair can be added to the second pair, due to the linearity of the running consensus operation, and the third pair is reset with zeros. In this way agents compute upper confidence bounds with an accurate approximation of the values computed across the network, with a delay of at most $2C$. As we will see, the regret of UCB does not increase much when working with delayed estimates.
In particular, having a delay of this order increases the regret by at most an additive term proportional to the delay.
To sum up, agents store mixed rewards, rewards that are mixing, and new rewards in three reward variables, with three analogous counter variables. For $C$ iterations, they use the mixed variables to decide which arm to pull; while the mixing variables mix, they store new values in the new-value variables, which they also start mixing for convenience using the running consensus. At the end of the stage, the mixing values, together with the new values, are promoted to the mixed variables, the new-value variables are reset to $0$, and similarly for the counters.
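The stage bookkeeping can be sketched for a single node and a single arm as follows. The variable names are ours, not the ones in Algorithm 1, and the consensus mixing itself is omitted to isolate the buffer rotation.

```python
# Illustrative bookkeeping of the DDUCB stages for one node and one arm.
# Three buffer pairs: mixed values used for decisions, values currently
# mixing in the consensus, and brand-new values collected this stage.

class StageBuffers:
    def __init__(self):
        self.mixed_r, self.mixed_n = 0.0, 0.0    # used by the UCB rule
        self.mixing_r, self.mixing_n = 0.0, 0.0  # averaged by the consensus
        self.new_r, self.new_n = 0.0, 0.0        # collected this stage

    def record_pull(self, reward):
        """Store a new pull; it will not affect decisions this stage."""
        self.new_r += reward
        self.new_n += 1.0

    def end_stage(self):
        """Promote mixing -> mixed, then (mixed + new) -> mixing."""
        self.mixed_r, self.mixed_n = self.mixing_r, self.mixing_n
        self.mixing_r = self.mixed_r + self.new_r  # linearity of the consensus
        self.mixing_n = self.mixed_n + self.new_n
        self.new_r, self.new_n = 0.0, 0.0

b = StageBuffers()
for reward in (0.5, 0.25):
    b.record_pull(reward)
b.end_stage()  # the two pulls now start mixing
b.end_stage()  # one stage later they become usable for decisions
print(b.mixed_r, b.mixed_n)  # 0.75 2.0
```

A pull therefore waits between one and two full stages before influencing any upper confidence bound, which is the source of the bounded delay in the analysis.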
Pseudocode for DDUCB is given in Algorithm 1. As mentioned above, the variables containing the reward estimators are represented by Greek letters; the counter estimators are represented by the corresponding Latin letters.
We now present the regret bound that the DDUCB algorithm achieves.
Theorem 1 (Regret of DDUCB).
Let $P$ be a doubly stochastic matrix that respects the network structure, whose second largest eigenvalue in norm is $\lambda_2 < 1$. For the distributed multi-armed bandit problem with $N$ nodes, $K$ actions and subgaussian rewards with variance proxy $\sigma^2$, the algorithm DDUCB, using $P$ as communication matrix, satisfies the following.

The finite-time bound on the regret:
where $\zeta$ denotes the Riemann zeta function.

The corresponding asymptotic bound:

If the time step that is used to compute upper confidence bounds is updated as described in the proof, the following asymptotic regret bound holds:
Above, the asymptotic notation means that there is a universal constant $c > 0$ such that the left-hand side is at most $c$ times the right-hand side.
Note that the algorithm needs to know $\lambda_2$, the second largest eigenvalue of $P$ in norm, since it is used to compute the mixing parameter that indicates when values are mixed enough to be used. However, if we run DDUCB with this parameter set using any upper bound $\bar\lambda < 1$ on $\lambda_2$, the inequality of the finite-time analysis above still holds, substituting $\lambda_2$ by $\bar\lambda$; the same substitution applies in the asymptotic bound.
Remark 1.
In order to interpret the regret obtained in the previous theorem, it is useful to note that running the centralized UCB algorithm for $NT$ steps incurs a regret bounded above, up to a constant, by $\sum_{k : \Delta_k > 0} \sigma^2 \ln(NT) / \Delta_k$. Moreover, running $N$ separate instances of UCB at each node without allowing communication incurs a regret that is $N$ times the regret of a single instance run for $T$ steps. On the other hand, the following is an asymptotic lower bound for any consistent policy [16]:
$$\liminf_{T \to \infty} \frac{R_T}{\ln T} \geq \sum_{k : \Delta_k > 0} \frac{2\sigma^2}{\Delta_k}.$$
Thus, we see that the regret obtained in Theorem 1 significantly improves the dependence on $N$ of the regret with respect to the trivial algorithm that does not involve communication, and that it is asymptotically optimal in $T$. Since in the first iteration of this problem $N$ arms have to be pulled and there is no prior information on the arms' distributions, any algorithm that is asymptotically optimal in $T$ and $N$ must pull each arm a number of times proportional to $N$, yielding a regret lower bound that is linear in $N$, up to a constant. Hence, by the lower bound above and the latter argument, the regret obtained in Theorem 1 is asymptotically optimal up to a factor in the second summand of the regret.
We can also use the previous theorem to derive an instance-independent analysis of the regret.
Theorem 2 (Instance-Independent Regret Analysis of DDUCB).
The regret achieved by the DDUCB algorithm is
where $\bar\Delta$ is an upper bound on the gaps $\Delta_k$, for $k = 1, \ldots, K$.
Proof.
Define $A$ as the set of arms whose respective gaps are all less than a threshold $\Delta$, and $B$ as the set of arms that are not in $A$. Then we can bound the regret incurred by pulling arms in $A$ in the following way:
Using Theorem 1 we can bound the regret incurred by the pulls done to arms in $B$:
Adding the two bounds above yields the result.
∎
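The balancing behind this proof can be made explicit. Schematically, with constants and the exact logarithmic arguments omitted, and with $\Delta$ denoting the gap threshold that splits the arms into the two sets used in the proof:

```latex
% Arms with gap below \Delta contribute at most NT\Delta to the regret;
% arms with gap at least \Delta contribute O\big((K/\Delta)\log T\big)
% by the gap-dependent bound. Balancing the two terms:
NT\Delta \approx \frac{K \log T}{\Delta}
\;\Longrightarrow\;
\Delta \approx \sqrt{\frac{K \log T}{NT}},
\qquad
R_T \lesssim \sqrt{K N T \log T} \;+\; \text{(spectral-gap term)}.
```

This is the standard gap-free conversion of a UCB-type bound; the spectral-gap term is inherited unchanged from Theorem 1.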
Remark 2.
In [17], the regret of the best of the proposed algorithms is bounded by an expression in which
Here, the first quantity is an exploration parameter that the algorithm receives as input and the second is a nonnegative graph-dependent value, which is zero only when the graph is the complete graph. Hence, up to a constant, the bound of [17] is always greater than the first summand in the regret of our algorithm in Theorem 1.3. In fact, it can be significantly greater (see the example in the supplementary material). Moreover, the factor multiplying the relevant term in the second summand in Theorem 1.3 is no greater than its counterpart in [17]. Indeed
And the latter is true. Depending on the graph, the left-hand side can be much greater than the lower bound we have used. Therefore the regret upper bound of DDUCB is smaller, up to a constant, than that of any of the algorithms in [17].
As an example, if we take $P$ to be symmetric, its eigenvalues are real. Consider the graph to be a cycle with an odd number of nodes $N$ (greater than $3$) and take $P$ to be the matrix with $P_{ij} = 1/3$ if $j \in \{i-1, i, i+1\}$, indices modulo $N$, and $P_{ij} = 0$ otherwise. Then $P$ is a circulant matrix and its eigenvalues are $(1 + 2\cos(2\pi j / N))/3$, for $j = 0, \ldots, N-1$. In particular $\lambda_2 = (1 + 2\cos(2\pi/N))/3$, so the spectral gap $1 - \lambda_2$ is of order $1/N^2$. As a consequence, the first relevant summand of the bound of [17] exceeds the corresponding summand in our bound by an additive term, and the other summand exceeds its counterpart in our theorem by a multiplicative factor. The bounds above can be proven by a Taylor expansion: since $\cos x \geq 1 - x^2/2$, we get $1 - \lambda_2 = \frac{2}{3}\left(1 - \cos(2\pi/N)\right) \leq \frac{4\pi^2}{3N^2}$.
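A quick numeric check of this example (our own sketch): the eigenvalues of the 1/3-weighted cycle are given by the circulant formula, and the normalized spectral gap approaches $4\pi^2/3$.

```python
import math

# Numeric check of the cycle example: for the circulant matrix with
# weight 1/3 on each node and its two neighbors, the eigenvalues are
# (1 + 2*cos(2*pi*j/N)) / 3, so the spectral gap shrinks like 1/N^2.

def cycle_second_eigenvalue(N):
    """Second largest eigenvalue in absolute value; for odd N > 3
    this is the j = 1 term of the circulant spectrum."""
    return (1.0 + 2.0 * math.cos(2.0 * math.pi / N)) / 3.0

for N in (11, 101, 1001):
    gap = 1.0 - cycle_second_eigenvalue(N)
    # Taylor expansion: 1 - lambda_2 ~ (4*pi^2/3) / N^2
    print(N, gap * N * N)  # approaches 4*pi^2/3 ~ 13.16
```

The quadratic shrinkage of the gap is what makes the cycle a worst-case-style example when comparing spectral terms in the bounds.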
The algorithm can be modified slightly to obtain better regret. The easiest (and recommended) modification is the following. While waiting for the mixing variables to mix, each node adds to the new-value variables the information of the pulls done in the meantime. The variable accounting for the time step has to be modified accordingly: it contains the number of pulls used to obtain the current approximations, so it needs to be increased by one when adding one extra reward. This corresponds to uncommenting lines 21-23 in the pseudocode. Since the values of the new-value variables are overwritten after the for loop, the assignment after the loop remains unchanged. Note that if the lines are not uncommented, then each time the for loop is executed the pulls made at a node all go to the same arm, the one maximizing the upper confidence bound for the mixed values. Another variant that would provide even more information, and therefore better regret, while keeping the communication cost, consists of also sending the information of each new pull to the node's neighbors, receiving their respective new pulls, and adding these values, appropriately weighted, to the new-value variables. We analyze the algorithm without any modification for the sake of clarity of exposition; the same asymptotic upper bound on the regret as in Theorem 1 can be derived for these two variations. On the other hand, we can reduce communication by making agents not run the running consensus for the new-value variables, i.e. substituting lines 18-19 by a plain local update. In that case, the upper confidence bounds will be computed with worse estimates of average rewards and numbers of pulls, but the bounds in Theorem 1 would still hold. If each agent cannot communicate the required number of values per iteration, the algorithm can be slightly modified to account for it at the expense of incurring greater regret. Suppose each agent can only communicate a smaller number of values to her neighbors per iteration. Firstly, as before, we would not mix the new-value variables.
If each agent runs the algorithm in longer stages, ensuring that each element of the mixing variables is sent a sufficient number of times, then the bounds in Theorem 1 still hold, substituting the mixing parameter by the longer stage length; the same substitution applies in the asymptotic bound. In each iteration, agents have to send values corresponding to the same indices of the mixing variables. Intuitively, the factor in the second summand of the regret accounts for the number of rounds of delay between the computation of a reward and its use in the upper confidence bounds. If we decrease the communication rate and compensate with a greater delay, the mixed approximations satisfy the same properties as in the original algorithm; only the second summand in the regret increases, because of the increased delay.
5 Conclusions and Future Work
We have presented an algorithm for the fully decentralized MAB setting that, as seen in Remark 1, is close to being optimal in terms of regret. Future research could include a change in the model to allow asynchronous communication, with some frequency restrictions, and an adaptive protocol to remove the need for access to the number of nodes and the spectral gap.
Appendix A Proof of Theorem 1
The proof is along the lines of the one for the standard UCB algorithm, cf. [3], but requires a couple of key modifications. Firstly, we need to control the error due to the fact that each agent decides which arm to pull with some delay, because information is only used once it is close to being mixed. Secondly, since agents use information that is not completely mixed, we need to control the error in the approximations of the true sum of rewards and the true number of times each arm was pulled.
We present two lemmas before the proof; their proofs can be found in the supplementary material. Note that the running consensus operation is linear. This linearity, combined with the stochasticity of $P$, allows us to think of each reward as being present at each node weighted by a number. For each reward, the sum of the weights across all the nodes is $1$, and the weights quickly approach $1/N$.
Lemma 1.
Fix an arm , a node , and a time . Let be independent random variables coming from the distribution associated to arm , which we assume subgaussian with variance proxy and mean . Let be a number such that , where . Then
where the sums go from to .
Proof.
Since is subgaussian with variance proxy we have that is subgaussian with variance proxy . Therefore, using subgaussianity and the fact that the random variables , for , are independent we obtain
where . Using we obtain
and the result follows. ∎
At time and at node , we want to use the variables and defined in Algorithm 1 to decide the next arm to pull in that node. Consider the rewards computed by all the nodes until steps before the last time and were updated. Let be the number of these rewards that come from arm and let be such rewards. We can see each of the as being in node at time multiplied by a weight . Every weight we are considering corresponds to a reward that has been mixing for at least steps. This ensures by (2), so the previous lemma can be applied to these weights.
Define the empirical mean of arm at node and time as
Remember that the sum is over the weights and rewards computed up to steps before the last time and were updated. Let UCB and let be the random variable that represents the arm pulled at time by node , which is the one that maximizes UCB, for certain value .
Lemma 2.
Let and be an optimal arm and a suboptimal arm respectively. If then with probability at least or, equivalently,
Proof.
Lemma 1 directly implies two bounds: a lower bound for the upper confidence bound , since it yields that with probability at least we have
and an upper bound on . If then with probability at least :
Using these two facts we know that, with high probability, the following holds:
The probability that or do not hold is at most by the union bound.
∎
Now we proceed to prove the theorem.
Proof of Theorem 1.
For every we can write uniquely as , where and . In such a case let
Then the time step that we use to compute the upper confidence bounds at time is . It is fixed every iterations. For , the value is equal to the last time step in which the variables and were updated. Thus by definition . Remember is the number of times arm is pulled by node up to time , and . Let be the event . Let . Since it is enough to bound for every .
Fixing an arm we have
The bound of for the second summand is straightforward. For the first summand note that and that . Thus
For the bound for the first summand in , note that for can happen times only by definition but
So can be at most times. The bound for the second summand uses Lemma 2. In we substitute by its value and for we bound it by the greatest multiple of that is less than . For , note that the last summand is . Thus, we can hide it along with other constants under . Substituting the value of then yields the bound.
The result follows:
And similarly, we obtain the non-asymptotic bound from the penultimate line in the bound above. For the last part of the theorem, a different, more frequently updated time step is used to compute the upper confidence bounds.
The analysis of the regret is analogous, but splitting the expectation in the definition of depending on whether or the opposite. The summands are bounded analogously to the proof above and yield the regret claimed in the theorem.
∎
Appendix B Notation
$N$: Number of agents.
$T$: Number of time steps.
$K$: Number of actions.
$P$: Communication matrix.
$\lambda_1, \ldots, \lambda_N$: Eigenvalues of $P$.
$\mu_1, \ldots, \mu_K$: Means of the arms' distributions.
$\Delta_k = \mu_1 - \mu_k$: Gaps.
Mixed variables (Greek/Latin pair in Algorithm 1): Delayed sum of rewards and number of pulls that are mixed.
Mixing variables: Sum of rewards and number of pulls that are being mixed.
New-value variables: Sum of new rewards and new number of pulls.
Current-value vectors: Vectors with the current reward and pull done at time $t$ by each node.
$C$: Number of steps that define the stages of the algorithm.
Reward sequence: Reward obtained by pulling an arm for the $i$-th time; if an arm is played several times in one time step, lower indices are assigned to agents with lower indices.
$n_k^v(t)$: Number of times arm $k$ is pulled by node $v$ up to time $t$.
$n_k(t) = \sum_{v=1}^{N} n_k^v(t)$.
$a_t^v$: Action played by agent $v$ at time $t$.