# The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits

## Abstract

We consider a decentralized multi-agent Multi Armed Bandit (MAB) setup consisting of $N$ agents, solving the same MAB instance to minimize individual cumulative regret. In our model, agents collaborate by exchanging messages through pairwise gossip style communications. We develop two novel algorithms, where each agent only plays from a subset of all the arms. Agents use the communication medium to recommend only arm-IDs (not samples), and thus update the set of arms from which they play. We establish that, if agents communicate $\Omega(\log(T))$ times through any connected pairwise gossip mechanism, then every agent’s regret is a factor of order $N$ smaller compared to the case of no collaborations. Furthermore, we show that the communication constraints only have a second order effect on the regret of our algorithm. We then analyze this second order term of the regret to derive bounds on the regret-communication tradeoffs. Finally, we empirically evaluate our algorithm and conclude that the insights are fundamental and not artifacts of our bounds. We also show a lower bound which gives that the regret scaling obtained by our algorithm cannot be improved even in the absence of any communication constraints. Our results demonstrate that even a minimal level of collaboration among agents greatly reduces regret for all agents.

## 1 Introduction

Multi Armed Bandit (MAB) is a classical model ([23],[8]) that captures the explore-exploit trade-off in making online decisions. MAB paradigms have found applications in many large scale systems such as ranking on search engines [35], displaying advertisements on e-commerce web-sites [11], model selection for classification [25] and real-time operation of wireless networks [5]. Oftentimes in these settings, the decision making is distributed among many agents. For example, in the context of web-servers serving either search ranking or placing advertisements, due to the volume and rate of user requests, multiple servers are deployed to perform the same task [10]. Each server makes decisions (which can be modeled as a MAB [35]) on rankings or placing advertisements and also collaborates with other servers by communicating over a network [10]. In this paper, we study a multi-agent MAB model in which agents collaborate to reduce individual cumulative regret.

Model Overview - Our model generalizes the problem setting described in [31]. Concretely, our model consists of $N$ agents, each playing the same instance of a $K$ armed stochastic MAB, to minimize its cumulative regret. At each time, every agent pulls an arm and receives a stochastic reward independent of everything else (including other agents choosing the same arm at the same time). Additionally, an agent can choose, after an arm pull, to receive a message from another agent through an information pull. Agents have a communication budget, which limits how many times an agent can pull information. If any agent $i$ chooses to receive a message through an information-pull, then it will contact another agent chosen independent of everything else, at random from a distribution $P(i,\cdot)$ (unknown to the agents) over the set of agents $\{1,\dots,N\}$. The agents thus cannot actively choose from whom they receive information; rather, they receive it from another randomly chosen agent. The matrix $P$ with its $i$th row being the distribution $P(i,\cdot)$ is called the gossip matrix. Agents take actions (arm-pulls, information-pulls and messages sent) only as a function of their past history of arm-pulls, rewards and received messages from information-pulls, and the system is hence decentralized.

Model Motivations - The problem formulation and the communication constraints aim to capture key features of many settings involving multiple agents making distributed decisions. We highlight two examples in which our model is applicable. The first example is a setting consisting of $N$ computer servers (or agents), each handling requests for web searches from different users on the internet [9, 24]. For each keyword, one out of a set of M ad-words needs to be displayed, which can be viewed as choosing an arm of a MAB. Here, each server is making decisions on which ad to display (for the chosen keyword) independently of other servers. Further, the rewards obtained by different servers are independent because the search users are different at different servers. The servers can also communicate with each other over a network in order to collaborate and maximize revenue (i.e., minimize cumulative regret).

A second example is that of collaborative recommendation systems, e.g., where multiple agents (users) in a social network are jointly exploring restaurants in a city [31]. The users correspond to agents, and each restaurant can be modeled as an arm of a MAB providing stochastic feedback. The users can communicate with each other over a social network, personal contact or a messaging platform to receive recommendations of restaurants (arms) from others to minimize their cumulative regret, where regret corresponds to the loss in utility incurred by each user per restaurant visit. Furthermore, if the restaurants/customers can be categorized into a finite set of contexts (e.g., by price: low-cost/mid-price/high-end, or by type of cuisine: Italian, Asian, etc.), our model is applicable per context.

Key Contributions:

1. Gossiping Insert-Eliminate (GosInE) Algorithm - In our algorithms (Algorithm 1 and 3), agents only choose to play from among a small subset (of cardinality $\lceil K/N \rceil + 2$) of arms at each time. Agents in our algorithm accept the communication budget as an input and use the communication medium to recommend arms, i.e., agents communicate the arm-ID of their current estimated best arm. Specifically, agents do not exchange samples, but only recommend an arm index. On receiving a recommendation, an agent updates the set of arms to play from: it discards its estimated worst arm in its current set and replaces it by the recommended new arm.

Thus, our algorithm is non-monotone with respect to the set of arms an agent plays from, as agents can discard an arm in a phase and then subsequently bring the arm back and play it in a later phase, if this previously discarded arm gets recommended by another agent. This is in contrast to most other bandit algorithms in the literature. On one hand, classical regret minimization algorithms such as UCB-$\alpha$ [3] or Thompson sampling [34] allow sampling from any arm at all points in time (no arm is ever discarded). On the other hand, pure explore algorithms such as successive rejects [2] are monotone with respect to the arms, i.e., a discarded arm is never subsequently played again. The social learning algorithm in [31] is also monotone, as the subset of arms from which an agent plays at any time is non-decreasing. In contrast, in this paper we show that even if an agent (erroneously) discards the best arm from its playing set, the recommendations ensure that, with probability $1$, eventually the best arm is back in the playing set.

2. Regret of GosInE Algorithm - Despite agents playing among a time-varying set of arms of cardinality $\lceil K/N \rceil + 2$, we show that the regret of any agent is $O\left(\frac{\lceil K/N \rceil + 2}{\Delta_2}\log(T)\right)$ plus an additive constant (Theorems 1 and 3). Here, $\Delta_2$ is the difference in the mean rewards of the best and second best arm, and the additive constant depends on the communication constraints and is independent of time. We show that the regret scaling holds for any connected gossip matrix and communication budget scaling as $\Omega(\log(T))$. Thus, any agent’s asymptotic regret is independent of the gossip matrix or the communication budget (Corollary 5). If agents never collaborate (communication budget of $0$), the system is identical to each agent playing a standard $K$ arm MAB, in which case the regret scales as $O\left(\frac{K}{\Delta_2}\log(T)\right)$ [21],[3]. Thus, our algorithms reduce the regret of any agent by a factor of order $N$ from the case of no collaborations. Furthermore, a lower bound in Theorem 4 (and the discussion in Section 6) shows that this scaling with respect to $K$ and $N$ cannot be improved by any algorithm, communication budget or gossip matrix. Specifically, we show that even if an agent has knowledge of the entire system history of arms pulled and rewards obtained by other agents, the regret incurred by every agent is only a factor of order $N$ smaller than the case of no collaborations. Moreover, our regret scaling significantly improves over that of [31], which applies only to the complete graph among agents. Thus, despite communication constraints, our algorithm leverages collaboration effectively.

3. Communication/Regret Trade-Off - The second order constant term in our regret bound captures the trade-off between communications and regret. As an example, we show in Corollary 6 that, if the communication budgets scale polynomially, i.e., agents can pull information at most $T^{1/\beta}$ times over a time horizon of $T$, for some $\beta > 1$, then if the agents are connected by a ring graph (the graph with poorest connectivity), the constant term in the regret scales as $N^{\beta}$ (up to poly-logarithmic factors), whereas it scales as $(\log N)^{\beta}$ in the case when agents are connected by the complete graph. Thus, we see that there is an exponential improvement (in the additive constant) in the regret incurred when changing the network among agents from the ring graph to the complete graph. In general, we show through an explicit formula (in Corollary 6) that, if the gossip matrix has smaller conductance (i.e., a poorly connected network), then the regret incurred by any agent is higher. Similarly, we also establish that if the communication budget per agent is higher, then the regret incurred is lower (Corollary 7). We further conduct numerical studies that establish these are fundamental and not artifacts of our bounds.

## 2 Problem Setup

Our model generalizes the setting in [31]. In particular, our model imposes communication budgets and allows for general gossip matrices $P$, while the model in [31] considered only the complete graph among agents.

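Arms of the MAB - We consider $N$ agents, each playing the same instance of a $K$ armed stochastic MAB to minimize cumulative regret. The $K$ arms have unknown average rewards denoted by $\mu_1, \dots, \mu_K$, where for every $j \in \{1,\dots,K\}$, $\mu_j \in (0,1)$. Without loss of generality, we assume $\mu_1 > \mu_2 \geq \mu_3 \geq \dots \geq \mu_K$. However, the agents are not aware of this ordering. For all $j \in \{2,\dots,K\}$, denote by $\Delta_j := \mu_1 - \mu_j$. The assumption on the arm-means implies that $\Delta_j > 0$, for all $j \in \{2,\dots,K\}$.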

Network among Agents - We suppose that the agents are connected by a network denoted by an $N \times N$ gossip matrix $P$, where for each $i \in \{1,\dots,N\}$, the row $P(i,\cdot)$ is a probability distribution over $\{1,\dots,N\}$. This matrix is fixed and unknown to the agents.

Agent Actions - We assume that time is slotted (discrete), with each time slot divided into an arm-pulling phase followed by an information-pulling phase. In the arm-pulling phase, all agents pull one of the $K$ arms and observe a stochastic Bernoulli reward, independent of everything else. In the information-pulling phase, if an agent has communication budget, it can decide to receive a message from another agent through an information pull. A non-negative and non-decreasing sequence $(B_t)_{t \in \mathbb{N}}$ specifies the communication budget, where no agent can pull information more than $B_t$ times in the first $t$ time slots, for all $t \geq 1$. If any agent $i \in \{1,\dots,N\}$ chooses to pull information in the information-pulling phase of any time slot, it will contact another agent chosen independently of everything else, according to the probability distribution given by $P(i,\cdot)$. Thus, agents receive information from a randomly chosen agent according to a fixed distribution, rather than actively choosing the agents based on observed samples. When any agent $j$ is contacted by another agent in the information-pulling phase of a time slot, agent $j$ can communicate only a limited number of bits. Crucially, the message length does not depend on the arm-means or on the time index.

Decentralized System - Each action of an agent, i.e., its arm pull, decision to engage in an information pull and the message to send when requested by another agent’s information pull, can only depend on the agent’s past history of arms pulled, rewards obtained and messages received from information pulls. We allow each agent’s actions in the information pulling phase (such as whether to pull information and what message to communicate if asked for), to depend on the agent’s outcome in the arm-pulling phase of that time slot.

Performance Metric - Each agent minimizes their expected cumulative regret. For an agent $i$ and time $t$, denote by $I^{(i)}_t$ the arm pulled by agent $i$ in the arm-pulling phase of time slot $t$. The regret of agent $i$ after $T$ time slots (arm-pulls) is defined as $R^{(i)}_T := \sum_{t=1}^{T} \left(\mu_1 - \mu_{I^{(i)}_t}\right)$ and the expected cumulative regret is $\mathbb{E}[R^{(i)}_T]$.

## 3 Synchronous GosInE Algorithm

We describe the algorithm by fixing an agent $i \in \{1,\dots,N\}$.

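Input Parameters - The algorithm has three inputs: (i) a communication budget $(B_t)_{t \in \mathbb{N}}$, (ii) the UCB parameter $\alpha$ and (iii) a parameter $\varepsilon > 0$. From this communication budget, we construct a sequence $(A_x)_{x \in \mathbb{N}}$ such that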

$$A_x = \max\left(\min\{t \in \mathbb{N},\ B_t \geq x\},\ \lceil (1+x)^{1+\varepsilon} \rceil\right). \tag{1}$$

Every agent only pulls information in the time slots $(A_x)_{x \in \mathbb{N}}$. This automatically respects the communication budget constraints. Since agents engage in information pulling at common time slots, we term the algorithm synchronous. The parameter $\varepsilon$ ensures that the time intervals between the instants when agents request an arm are well separated. In particular, having $\varepsilon > 0$ ensures that the inter-communication times scale at least polynomially in time. As we shall see in the analysis, this only affects the regret scaling in the second order term.
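The construction in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function name `communication_times`, the representation of the budget as a callable `budget(t)`, and the search cap `horizon` are assumptions made for the example.

```python
import math

def communication_times(budget, eps, num_pulls, horizon):
    """Return [A_1, ..., A_num_pulls] following Equation (1).

    budget(t) is the non-decreasing number of information pulls
    allowed in the first t time slots (hypothetical interface).
    horizon caps the search for the smallest admissible t.
    """
    times = []
    t = 1
    for x in range(1, num_pulls + 1):
        # Smallest t with B_t >= x; since the budget is non-decreasing,
        # the pointer t never needs to move backwards.
        while t <= horizon and budget(t) < x:
            t += 1
        if t > horizon:
            break  # the budget never reaches x within the horizon
        # Polynomial spacing term that keeps pulls well separated.
        spaced = math.ceil((1 + x) ** (1 + eps))
        times.append(max(t, spaced))
    return times

# Budget of one pull per ten slots, with eps = 1.
print(communication_times(lambda t: t // 10, 1.0, 8, 1000))
```

With this budget the first seven communication times are dictated by the budget, while the eighth is dictated by the spacing term, which shows how the two arguments of the maximum interact.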

Initialization - Associated with each agent $i \in \{1,\dots,N\}$ is a sticky set of arms $\hat{S}^{(i)}$ -

 (2)

Notice that the cardinality . In words, we are partitioning the total set of arms, into sets of size with the property that . For instance, if , then for all , . Denote by the set and and

$$S^{(i)}_0 = \hat{S}^{(i)} \cup U^{(i)}_0 \cup L^{(i)}_0. \tag{3}$$

UCB within a phase - The algorithm proceeds in phases, with all agents starting in the same initial phase. Each subsequent phase $j$ lasts from time-slot $A_{j-1}+1$ till time-slot $A_j$, both inclusive. We shall fix a phase $j$ henceforth in the description. For any arm and any time, agent $i$ keeps track of the total number of times it has pulled that arm, up to and including that time, and of the empirical observed mean of the arm. Agent $i$ in phase $j$ chooses arms from its current playing set $S^{(i)}_j$ according to the UCB-$\alpha$ policy of [3], where the arm is selected from $S^{(i)}_j$.
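The arm selection within a phase can be sketched as follows. The text only names the UCB-$\alpha$ policy of [3]; the index used here (empirical mean plus $\sqrt{\alpha \ln(t) / n}$, with unplayed arms tried first) is the commonly used form and is an assumption of this sketch, as are the function and argument names.

```python
import math

def ucb_choose(playing_set, pulls, means, t, alpha):
    """Pick an arm from playing_set using an assumed UCB-alpha index.

    pulls[arm] : number of times the agent has pulled the arm so far
    means[arm] : empirical mean reward of the arm so far
    t          : current time slot (t >= 1)
    """
    best_arm, best_index = None, -math.inf
    for arm in sorted(playing_set):
        n = pulls.get(arm, 0)
        if n == 0:
            # An arm with no samples has no index yet, so play it first.
            return arm
        index = means[arm] + math.sqrt(alpha * math.log(t) / n)
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm
```

Note that the counts and means are accumulated over all past phases, not only the current one, which matches the remark in the proof sketch that the UCB considers all samples of an arm thus far.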

Pull Information at the end of a phase - Agent $i$ pulls information in the information-pulling phase of time slot $A_j$, i.e., at the end of phase $j$; the message it receives (an arm-ID in our algorithm) is its recommendation for that phase. Every agent, when asked for a message in the information-pulling phase of time-slot $A_j$, sends the ID of the arm it played the most in phase $j$.

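Update arms at the beginning of a phase - If the recommendation received at the end of phase $j$ is already in $S^{(i)}_j$, then $S^{(i)}_{j+1} = S^{(i)}_j$. Else, agent $i$ discards the least played arm in phase $j$ from the set $S^{(i)}_j \setminus \hat{S}^{(i)}$ and accepts the recommendation, to form the playing set $S^{(i)}_{j+1}$. Observe that the cardinality of $S^{(i)}_{j+1}$ remains unchanged. Moreover, the updating ensures that for all agents $i$ and all phases $j$, $\hat{S}^{(i)} \subseteq S^{(i)}_j$, namely agents never drop arms from the set $\hat{S}^{(i)}$. Hence, we term the set $\hat{S}^{(i)}$ sticky.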

The pseudo-code of the Algorithm described above is given in Algorithm 1.
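The recommendation and update steps described above can be sketched in Python. This is an illustrative reading of the prose, not a transcription of Algorithm 1: the function names, the use of Python sets, and the tie-breaking by smallest arm-ID are assumptions of the sketch.

```python
def most_played(playing_set, phase_pulls):
    """Arm-ID an agent recommends: the arm it played most in the phase."""
    return max(sorted(playing_set), key=lambda arm: phase_pulls.get(arm, 0))

def update_playing_set(sticky, playing_set, phase_pulls, recommendation):
    """Form the next playing set from the current one.

    sticky         : arms that are never dropped
    playing_set    : current playing set (a superset of sticky)
    phase_pulls    : pulls per arm during the phase that just ended
    recommendation : arm-ID received through the information pull
    """
    if recommendation in playing_set:
        return set(playing_set)  # nothing to insert
    # Only non-sticky arms may be eliminated (assumed non-empty here).
    removable = sorted(playing_set - sticky)
    least_played = min(removable, key=lambda arm: phase_pulls.get(arm, 0))
    return (playing_set - {least_played}) | {recommendation}

pulls = {1: 3, 2: 4, 5: 40, 7: 2}
print(update_playing_set({1, 2}, {1, 2, 5, 7}, pulls, 9))
```

In the example, arm 7 is the least played non-sticky arm, so it is replaced by the recommended arm 9 and the cardinality of the playing set is unchanged.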

### 3.1 Model Assumptions

We make two mild assumptions on the inputs (a discussion is provided in Appendix A).

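(A.1) The communication matrix $P$ is irreducible. Namely, for any two distinct agents $i \neq j$, there exists a finite sequence of agents $k_1, \dots, k_l$, with $k_1 = i$ and $k_l = j$, such that the product $P(k_1,k_2) P(k_2,k_3) \cdots P(k_{l-1},k_l)$ is strictly positive.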

(A.2) The communication budget and is such that for all , there exists such that for all , (i.e., ). Furthermore, we shall assume a convexity condition, i.e., for every and , , where the sequence is given in Equation (1). Furthermore, .

For instance, if , for all . and , then , for all . Similarly, if $B_t = t$ for all $t \geq 1$, i.e., if the budget is adequate to communicate in every time slot, then $A_x = \lceil (1+x)^{1+\varepsilon} \rceil$, for all $x$. One can check that both these examples satisfy the conditions in Assumption A.2.

### 3.2 Regret Guarantee

The regret guarantee of Algorithm 1 is given in Theorem 1, which requires a definition. Let $i \in \{1,\dots,N\}$ and let $P$ be a gossip matrix. Denote by the random variable $\tau^{(P)}_{spr}$ the spreading time of a rumor in a pull model, with the rumor initially at node $i$ (cf. [32]). Formally, consider a discrete time stochastic process where initially, node $i$ has a rumor. At each time step, each node $j$ that does not possess the rumor calls another node sampled independently of everything else from the probability distribution $P(j,\cdot)$. If node $j$ calls a node possessing the rumor, node $j$ will possess the rumor at the end of the call (at the end of the current time step). The spreading time $\tau^{(P)}_{spr}$ is the stopping time when all nodes possess the rumor for the first time.
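The pull-model rumor process defined above is easy to simulate. The sketch below is only an illustration of the definition: the function name `spreading_time`, the list-of-rows representation of the gossip matrix, and zero-based node indices are assumptions of the example.

```python
import random

def spreading_time(P, source, rng):
    """Simulate one run of the pull-model rumor process.

    P      : gossip matrix as a list of rows, each a probability vector
    source : index of the node that initially holds the rumor
    rng    : a random.Random instance
    Returns the first time step at which every node holds the rumor.
    """
    n = len(P)
    informed = {source}
    steps = 0
    while len(informed) < n:
        newly_informed = set()
        for node in range(n):
            if node in informed:
                continue
            # An uninformed node calls one node drawn from its row of P.
            callee = rng.choices(range(n), weights=P[node])[0]
            # Only nodes informed before this step can pass the rumor on.
            if callee in informed:
                newly_informed.add(node)
        informed |= newly_informed
        steps += 1
    return steps

# Node 1 always calls node 0 and node 2 always calls node 1.
chain = [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(spreading_time(chain, 0, random.Random(0)))
```

In the deterministic chain the rumor needs one step to reach node 1 and a second step to reach node 2. For an irreducible but randomized matrix, averaging the returned value over many runs gives an empirical estimate of the expected spreading time.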

###### Theorem 1.

Suppose in a system of $N$ agents connected by a communication matrix $P$ satisfying assumption (A.1) and $K$ arms, each agent runs Algorithm 1, with UCB parameter $\alpha$ and communication budget $(B_t)_{t \in \mathbb{N}}$ and $\varepsilon$ satisfying assumption (A.2). Then the regret of any agent $i$, after any time $T$, is bounded by

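$$\mathbb{E}[R^{(i)}_T] \leq \underbrace{\left(\sum_{j=2}^{\lceil K/N \rceil + 2} \frac{1}{\Delta_j}\right) 4\alpha \ln(T) + \frac{K}{4}}_{\text{Collaborative UCB Regret}} + \underbrace{g\left((A_x)_{x \in \mathbb{N}}\right) + \mathbb{E}\left[A_{2\tau^{(P)}_{spr}}\right]}_{\text{Cost of Infrequent Pairwise Communications}}, \tag{4}$$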

where is given in Equation (1) and where

where, , .

### 3.3 Discussion

In order to get some intuition from the Theorem, we consider a special case. Recall from Equation (1) that $A_x$ is the time slot when any agent pulls information for the $x$th time. Thus, if for some $\beta > 1$, the communication budget is $B_t = \lfloor t^{1/\beta} \rfloor$, then for all small $\varepsilon$ and all large $x$, the sequence $A_x = \lceil x^{\beta} \rceil$. In other words, if the communication budget scales polynomially (but sub-linearly) with time, then $A_x$ is also polynomial, but super-linear. Similarly, if the gossip matrix corresponds to the complete graph, we will show in the sequel (Corollary 6) that there exists a universal constant $C$ such that $\mathbb{E}\left[A_{2\tau^{(P)}_{spr}}\right] \leq (C \log(N))^{\beta}$. Thus, we have the following corollary.

###### Corollary 2.

Suppose the communication budget satisfies $B_t = \lfloor t^{1/\beta} \rfloor$, for all $t \geq 1$, for some $\beta > 1$. Let $\varepsilon > 0$ be sufficiently small. Then the communication sequence $(A_x)_{x \in \mathbb{N}}$ in Equation (1) is such that $A_x = \lceil x^{\beta} \rceil$ for all large $x$. If the gossip matrix connecting the agents corresponds to the complete graph, then under the conditions of Theorem 1, the regret of any agent $i$ at time $T$ satisfies

 E[R(i)T]≤⎛⎜ ⎜⎝⌈KN⌉+2∑j=21Δj⎞⎟ ⎟⎠4αln(T)+K4{Collaborative UCB Regret}+42α−3π263β+4max⎛⎜ ⎜⎝K3(2α−6),⎛⎝16α2+⌈KN⌉Δ22⎞⎠ββ−1⎞⎟ ⎟⎠+(Clog(N))β{Cost of Infrequent Pairwise Communications },

where $C$ is a universal constant given in Corollary 6.

The proof is provided in the appendix. The terms denoting the cost of pairwise communications correspond to the average amount of time any agent must wait before the best arm is in the playing set of that agent. This cost can be decomposed into the sum of two dominant terms. The first is the expected number of samples needed by any agent to identify the best arm. The second, the term $(C\log(N))^{\beta}$, is the amount of time taken by a pure gossip process to spread a message (the best arm in our case) to all agents, if the communication budget is given by $B_t = \lfloor t^{1/\beta} \rfloor$.

### 3.4 Proof Sketch

The proof of this theorem is carried out in Appendix B and we describe the main ideas here. We deduce in Proposition 2 that there exists a freezing time such that all agents have the best arm by that time and only recommend the best arm from then on, i.e., the sets of arms of the agents do not change after the freezing time. The technical novelty of our proof is in bounding the freezing time, as this leads to the final regret bound (Proposition 2).

There are two key challenges in obtaining this bound. First, the choice of arm recommendation is based on the most played arm in the current phase, while the choice of arm to pull is based on samples even from past phases, as the UCB considers all samples of an arm thus far. If the phase lengths are large (Equation (1) ensures this), Lemma 8 shows that the probability of an agent recommending a sub-optimal arm at the end of a phase is small, irrespective of the number of times it was played till the beginning of the phase. Second, the events that any agent recommends a sub-optimal arm in different phases are not independent, as the reward samples collected by this agent, leading to those decisions, are shared. We handle this in Proposition 3 by establishing that after a random, almost surely finite time (defined in Appendix B), agents never recommend incorrectly.

### 3.5 Initialization without Agent IDs

The initialization step of Algorithm 1 relies on each agent knowing its identity. However, in many settings, it may be desirable to have algorithms that do not depend on the agent’s identity. We outline a simple procedure to fix this (with guarantees) in the appendix.

## 4 Asynchronous GosInE Algorithm

A synchronous system is not desirable in many cases, as agents could get a large number of message requests during the time slots $(A_x)_{x \in \mathbb{N}}$. Consider an example where the gossip matrix is a star graph, in which every non-central agent contacts only the central node. In this situation, at the time slots $(A_x)_{x \in \mathbb{N}}$, the central node will receive a large number ($N-1$) of different requests for messages, which may be infeasible if agents are bandwidth constrained.

We present an asynchronous algorithm to alleviate this problem. This new algorithm is identical to Algorithm 1 with two main differences: (i) each agent chooses the number of time slots it stays in any phase as a random variable, independently of everything else, and (ii) when asked for a recommendation, agents recommend the most played arm in the previous phase. The first point ensures that even in the case of the star graph described above, with high probability, eventually, no two agents will pull information in the same time slot. The second point ensures that even though the phase lengths are random, the quality of recommendations is good, as they are based on a large number of samples. We give pseudo-code in Algorithm 3, which adds new lines to Algorithm 1 and modifies the lines governing the phase lengths (agents have different phase lengths) and the arm recommendation (from the previous phase).

###### Theorem 3.

Suppose in a system of $N$ agents connected by a communication matrix $P$ satisfying assumption (A.1) and $K$ arms, each agent runs Algorithm 3, with UCB parameter $\alpha$, parameter $\delta$, and communication budget $(B_t)_{t \in \mathbb{N}}$ and $\varepsilon$ satisfying assumption (A.2). Then the regret of any agent $i$, after any time $T$, is bounded by

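$$\mathbb{E}[R^{(i)}_T] \leq \underbrace{\left(\sum_{j=2}^{\lceil K/N \rceil + 2} \frac{1}{\Delta_j}\right) 4\alpha \ln(T) + \frac{K}{4}}_{\text{Collaborative UCB Regret}} + \underbrace{(1+\delta)\, \mathbb{E}\left[A_{2\lfloor 2+\delta \rfloor \tau^{(P)}_{spr}}\right] + \hat{g}\left((A_x)_{x \in \mathbb{N}}, \delta\right)}_{\text{Cost of Asynchronous Infrequent Pairwise Communications}},$$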

where , where given in Theorem 1 and is given in Equation (1).

### 4.1 Proof Sketch

The proof of this theorem is carried out in the appendix (including Appendices D and E). In order to prove this, we find it effective to give a more general algorithm (Algorithm 5 in Appendix D) where the agents choose the phase lengths as a Poisson distributed random variable. This algorithm does not satisfy the budget constraint exactly, but only in expectation, over the randomization used in the algorithm. We analyze this in Theorem 10, stated in Appendix D and proved in Appendix E. The main additional technical challenge is that the phase lengths of different agents are staggered. We crucially use the convexity of the sequence $(A_x)_{x \in \mathbb{N}}$ (Assumption A.2), along with a more involved coupling argument to a rumor spreading process (Proposition 4). Theorem 3 then follows as a corollary of the proof of Theorem 10.

## 5 Lower Bound

In order to state the lower bound, we will restrict ourselves to a class of consistent policies [21]. A policy (or algorithm) is consistent if, for any agent $i$ and any sub-optimal arm $l \in \{2,\dots,K\}$, the expected number of times agent $i$ plays arm $l$ up to time $t$ is $o(t^{a})$ for all $a > 0$.

###### Theorem 4.

The regret of any agent $i$ after playing arms for $T$ times, under any consistent policy played by the agents and any communication matrix $P$, satisfies

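$$\liminf_{T \to \infty} \frac{\mathbb{E}[R^{(i)}_T]}{\ln(NT)} \geq \left(\frac{1}{N} \sum_{j=2}^{K} \frac{\Delta_j}{\mathrm{KL}(\mu_j, \mu_1)}\right), \tag{5}$$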

where for any $x, y \in (0,1)$, $\mathrm{KL}(x,y)$ is the Kullback-Leibler distance between two Bernoulli distributions with means $x$ and $y$.

The proof of the theorem is carried out in the appendix. The proof of this lower bound is based on a system where there are no communication constraints. From standard inequalities for KL divergence, we get from Equation (5) that

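$$\liminf_{T \to \infty} \frac{\mathbb{E}[R^{(i)}_T]}{\ln(NT)} \geq \mu_1 (1 - \mu_1) \left(\frac{1}{N} \sum_{j=2}^{K} \frac{1}{\Delta_j}\right). \tag{6}$$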
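One standard inequality that yields Equation (6) from Equation (5) is the bound of the KL divergence by the chi-square divergence, which for Bernoulli distributions reads $\mathrm{KL}(x,y) \leq (x-y)^2 / (y(1-y))$; whether this is the exact inequality used in the proof is an assumption here. The sketch below evaluates both right-hand sides numerically for a hypothetical instance (the function names and the arm means are illustrative).

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lower_bound_constants(means, num_agents):
    """Right-hand sides of Equations (5) and (6).

    means is sorted in decreasing order, so means[0] is the best arm.
    """
    mu1 = means[0]
    eq5 = sum((mu1 - mu) / bernoulli_kl(mu, mu1) for mu in means[1:]) / num_agents
    eq6 = mu1 * (1 - mu1) * sum(1 / (mu1 - mu) for mu in means[1:]) / num_agents
    return eq5, eq6

eq5, eq6 = lower_bound_constants([0.9, 0.7, 0.5, 0.3], num_agents=4)
print(eq5, eq6)  # the first value is at least as large as the second
```

Since each term satisfies $\Delta_j / \mathrm{KL}(\mu_j,\mu_1) \geq \mu_1(1-\mu_1)/\Delta_j$ under the inequality above, the constant in Equation (6) is never larger than the one in Equation (5).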

## 6 Insights

1. Insensitivity to Communication Constraints - The following corollary follows directly from Theorems 1 and 3.

###### Corollary 5.

Consider a system of $N$ agents, each running Algorithm 1 or 3 with parameters satisfying the conditions in Theorems 1 and 3 respectively. Then, for every agent $i \in \{1,\dots,N\}$,

$$\limsup_{T \to \infty} \frac{\mathbb{E}[R^{(i)}_T]}{\ln(T)} \leq \left(\sum_{j=2}^{\lceil K/N \rceil + 2} \frac{4\alpha}{\Delta_j}\right).$$

Thus, as long as the gossip matrix is connected (Assumption A.1) and the communication budget over a horizon of $T$ is at least $\Omega(\log(T))$ (Assumption A.2), the asymptotic regret of any agent is insensitive to $P$ and the communication budget.
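To get a feel for the size of the leading constant in Corollary 5, the sketch below compares it with the corresponding constant when the sum runs over all $K-1$ sub-optimal arms, which is what a single agent playing all arms would face. The comparison baseline (same constant $4\alpha/\Delta_j$ per arm), the capping of the upper summation limit at $K$, and all names are assumptions of this example.

```python
import math

def collaborative_constant(gaps, num_agents, alpha):
    """Coefficient of ln(T) in Corollary 5.

    gaps[0] is Delta_2, gaps[1] is Delta_3, and so on (non-decreasing),
    so the number of arms is K = len(gaps) + 1.
    """
    num_arms = len(gaps) + 1
    # Upper limit ceil(K/N) + 2, capped at K (capping is an assumption).
    last_j = min(math.ceil(num_arms / num_agents) + 2, num_arms)
    return sum(4 * alpha / gaps[j - 2] for j in range(2, last_j + 1))

def single_agent_constant(gaps, alpha):
    """Same per-arm constant summed over all sub-optimal arms."""
    return sum(4 * alpha / gap for gap in gaps)

gaps = [0.5] * 19  # K = 20 arms, every gap equal to 0.5
print(collaborative_constant(gaps, num_agents=10, alpha=4))
print(single_agent_constant(gaps, alpha=4))
```

With 20 arms and 10 agents, only three terms of the sum survive in the collaborative constant, against nineteen in the single-agent baseline.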

2. Benefit of Collaboration - As an example, consider a system where and arm-means such that , . Let be any consistent policy for the agents in the sense of Theorem 4. Then Equation (6) and Corollary 5 imply that , where the numerator is the regret obtained by our algorithms and the denominator is that of the policy . As the ratio of asymptotic regret in our algorithm and the lower bound is a constant independent of the size of the system (it does not grow with $N$), our algorithms benefit from collaboration. Recall that the lower bound is obtained from the full interaction setting where all agents communicate with every other agent after every arm pull, while in our model, every agent pulls information a total of at most $o(T)$ times over a time horizon of $T$. Thus, we observe that, despite communication constraints, any agent in our algorithm performs nearly as well as the best possible algorithm when agents have no communication constraints, i.e., the regret ratio is a constant independent of $N$.

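3. Impact of Gossip Matrix - The second order constant term in the regret bounds in Theorems 1 and 3 provides a way of quantifying the impact of $P$, based on its conductance, which we define now. Given an undirected finite graph $G$ on vertex set $V$, denote for any vertex $u \in V$ by $\deg(u)$ the degree of vertex $u$ in $G$. For any set $H \subseteq V$, denote by $\mathrm{Vol}(H) := \sum_{u \in H} \deg(u)$. For any two sets $H_1, H_2 \subseteq V$, denote by $\mathrm{Cut}(H_1, H_2)$ the number of edges in $G$ with one end in $H_1$ and the other in $H_2$. The conductance of $G$, denoted by $\phi$, is defined as

$$\phi := \min_{H \subset V :\, 0 < \mathrm{Vol}(H) \leq \mathrm{Vol}(V)/2} \frac{\mathrm{Cut}(H, V \setminus H)}{\mathrm{Vol}(H)}.$$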
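For small graphs the conductance can be computed by brute force over vertex subsets, using the usual definition (minimum over subsets with at most half the total volume of the cut size divided by the volume). The sketch below does this for the ring and the complete graph; the function names and the adjacency-matrix representation are assumptions of the example, and the enumeration is exponential in the number of vertices, so it is only meant for illustration.

```python
from itertools import combinations

def conductance(adj):
    """Brute-force conductance of an undirected graph given as a 0/1 matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    total_vol = sum(deg)
    best = float("inf")
    for size in range(1, n):
        for subset in combinations(range(n), size):
            vol = sum(deg[v] for v in subset)
            if vol == 0 or 2 * vol > total_vol:
                continue  # only subsets with at most half the volume
            inside = set(subset)
            cut = sum(adj[u][v] for u in inside for v in range(n) if v not in inside)
            best = min(best, cut / vol)
    return best

def ring(n):
    """Adjacency matrix of the cycle on n >= 3 vertices."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][(i + 1) % n] = adj[(i + 1) % n][i] = 1
    return adj

def complete(n):
    """Adjacency matrix of the complete graph on n vertices."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]

print(conductance(ring(8)), conductance(complete(8)))
```

On 8 vertices the ring has conductance $2/8$, attained by a contiguous arc of 4 vertices, while the complete graph has conductance $4/7$, which illustrates the gap between poorly and well connected gossip matrices.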

The following corollary illustrates the intuitive fact that if the conductance of the gossip matrix is higher, then the regret (the second order constant term) is lower. For the sake of clarity, we give the corollary in the special case of polynomially scaling communication budgets and provide a general result in the appendix.

###### Corollary 6.

Suppose the agents are connected by a $d$-regular graph with adjacency matrix $A_G$ having conductance $\phi$, and the gossip matrix is $P = d^{-1} A_G$. Suppose the communication budget scales as $B_t = \lfloor t^{1/\beta} \rfloor$, for all $t \geq 1$, where $\beta > 1$ is arbitrary. If the agents are using Algorithm 1 with parameters satisfying the assumptions in Theorem 1, then for any agent $i$ and time $T$,

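$$\mathbb{E}[R^{(i)}_T] \leq \underbrace{4\alpha \ln(T) \left(\sum_{j=2}^{\lceil K/N \rceil + 2} \frac{1}{\Delta_j}\right) + \frac{K}{4}}_{\text{Collaborative UCB Regret}} + \underbrace{\left(\frac{2C\log(N)}{\phi}\right)^{\beta}}_{\text{Impact of Gossip Matrix}} + \underbrace{\frac{2 \cdot 3^{\beta}}{2\alpha - 3} \frac{\pi^2}{6} + (j^*)^{\beta} + 1}_{\text{Constant Independent of } P},$$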

where $C$ is a universal constant and $j^*$ is a constant independent of the gossip matrix $P$ (given in Theorem 1).

The proof is provided in the appendix. Notice that the only term in the regret that depends on the graph is the conductance $\phi$. In order to derive some intuition, we consider two examples: one wherein the agents are connected by a complete graph, and one wherein they are connected by the ring graph. The conductance of the complete graph is $\Theta(1)$, while that of the ring graph is $\Theta(1/N)$. Thus, the cost of communications scales as $O((\log N)^{\beta})$ for the complete graph, but as $O((N \log N)^{\beta})$ in the ring graph. This shows the reduction in regret that is possible by a ‘more’ connected gossip matrix, where the regret is reduced from order $N^{\beta}$ (up to poly-logarithmic factors) to $(\log N)^{\beta}$ on moving from the ring graph to the complete graph. This is also demonstrated empirically in Figures 1 and 2.

4. Regret/Communication Trade-off - For a fixed problem instance and gossip matrix $P$, reducing the total number of information pulls, i.e., reducing the rate of growth of $(B_t)_{t \in \mathbb{N}}$, increases the per-agent regret. This can be inferred by examining the cost of communications in Equation (4), which we state in the following corollary.

###### Corollary 7.

Suppose Algorithm 1 is run with $K$ arms and $N$ agents connected by a gossip matrix $P$, with two different communication schedules $(A^{(1)}_x)_{x \in \mathbb{N}}$ and $(A^{(2)}_x)_{x \in \mathbb{N}}$, such that . Then there exist positive constants (depending on the two communication sequences), such that for all and , and , the cost of communications in the regret bound in Equation (4) is ordered as

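$$g\left((A^{(1)}_x)\right) + \mathbb{E}\left[A^{(1)}_{2\tau^{(P)}_{spr}}\right] \geq g\left((A^{(2)}_x)\right) + \mathbb{E}\left[A^{(2)}_{2\tau^{(P)}_{spr}}\right].$$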

The proof of this corollary is provided in the appendix. In light of Equation (4) in Theorem 1, the above corollary makes precise the qualitative fact that if agents are allowed more communication budget, then they experience lower regret. We demonstrate this empirically in Figure 3.

## 7 Numerical Results

We evaluate our algorithm and the insights empirically. Each plot is the regret averaged over all agents, is produced after and random runs for Algorithms 1 and Algorithm 3 (with ) respectively, along with confidence intervals. We also plot two benchmarks: no interaction among agents (where a single agent is running the UCB-$\alpha$ algorithm of [3]), and complete interaction, where all agents play the UCB-$\alpha$ algorithm with the entire system history of all arms pulled and rewards obtained by all agents, as described in Section 5.

Synthetic Experiments - We consider a synthetic setup with , , rest of the arm means sampled uniformly in . In Figures 1 and 2, we consider the impact of gossip matrix by fixing the communication budget () and varying to be the complete and cycle graph among agents. We see that our algorithms are effective in leveraging collaboration in both settings and experiences a lower regret in the complete graph case as opposed to the cycle graph, as predicted by our insights.

In Figure 3, we compare the effect of communication budget by considering two scenarios - polynomial budget () and logarithmic budget (). We see that even under a logarithmic communication budget, our algorithms achieve significant regret reduction.

Real Data - In Figure 4, we run our Algorithms on MovieLens data [14] using the methodology in [31]. This dataset contains movies rated by users. We treat the movies as arms and estimate the arm-means from the data by averaging the ratings of a section of similar users (same age, gender and occupation and have rated at-least movies). We further select only those movies that have at least ratings by users in the chosen user category. We estimate the missing entries in the sub-matrix (of selected users and movies) using matrix completion [15] and choose a random set of and movies, in Figure 4. We compare against [31] (hyperparameter ) for the setting of complete graph among agents and communication budget . We see that in all settings, our algorithm has superior performance and strongly benefits from limited collaboration.

## 8 Related Work

The closest to our work is [31], which introduced a model similar to ours. However, the present paper improves on the algorithm in [31] in three aspects: (i) our algorithm can handle any gossip matrix $P$, while that of [31] can only handle complete graphs; (ii) the algorithm in [31] needs, as an input, a lower bound on the arm gap between the best and the second best arm, while our algorithms do not require any such knowledge; and (iii) our regret scaling is superior even on complete graphs.

The multi-agent MAB was first introduced in the non-stochastic setting in [6] and further developed in [10]. However, there was no notion of communication budgets in these models. Subsequently, [18] considered the regret/communication trade-off in the non-stochastic setting, different from our stochastic MAB model. In the stochastic setting, the papers of [12], [9], [29], [19], [22] consider a collaborative multi-agent model where agents minimize individual regret in a decentralized manner. In these models, communication is not an active decision made by agents; rather, agents can observe their neighbors’ actions. These models are therefore different from our setup, where agents actively choose to communicate depending on a budget. The papers of [16] and [33] study the benefit of collaboration in reducing simple regret, unlike the cumulative regret considered in our paper. The paper of [20] considers a distributed version of contextual bandits, in which agents could share information whose length grows with time, and is thus different from our setup. There has also been a lot of recent interest in ‘competitive’ multi-agent bandits ([1], [26], [30], [4], [17], [7], [28], [27]), where if multiple agents choose the same arm in a time slot, then they experience a ‘collision’ and receive a small reward (only a subset, possibly empty, gets a reward). This differs from our setup where, even on collision, agents receive independent rewards.

## 9 Conclusions

We introduced novel algorithms for multi-agent MAB, where agents play from a subset of arms and recommend arm-IDs. Our algorithms leverage collaboration effectively and, in particular, their performance (asymptotic regret) is insensitive to the communication constraints. Furthermore, our algorithms exhibit a regret-communication trade-off, namely, they achieve lower regret (finite time) with increased communication (budget or conductance of ), which we characterize through explicit bounds.

### Acknowledgements

This work was partially supported by ONR Grant N00014-19-1-2566, NSF Grant SATC 1704778, ARO grant W911NF-17-1-0359 and the NSA SoS Lablet H98230-18-D-0007. AS also thanks François Baccelli for the support and generous funding through the Simons Foundation Grant (#197892) awarded to the University of Texas at Austin.

## Appendix A Discussion on Technical Assumptions in Section 3.1

Assumption A.1 states that the graph of communication among agents is connected. Observe that if A.1 is not satisfied, then there exists at least one pair of agents that can never exchange information with each other, making the setup degenerate. Assumption A.2 implies that any agent, over a time interval of arm-pulls, can engage in information-pulls at least times. The convergence of the series in A.2 also holds for all ‘natural’ examples, such as exponential and polynomial. For instance, the series is convergent if for all large , either or , for all and . Thus, conditions A.1 and A.2 do not impact any practical insights we can draw from our results.

## Appendix B Proof of Theorem 1

In order to give the proof, we first set some notation and definitions. We make explicit a probability space construction from [23] that makes the proof simpler. We assume that there is a sequence of independent valued random variables , where for every , the collection is i.i.d. Bernoulli with mean . The interpretation is that if an agent pulls arm for the th time, it will receive reward . Additionally, we also have on the probability space a sequence of independent valued random variables , where for each , the sequence is i.i.d., distributed as . The interpretation is that when agent wishes to receive a recommendation at the end of phase , it will do so from agent .
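The construction above amounts to drawing all randomness up front and letting the algorithm merely read it off. A minimal sketch follows, assuming a finite truncation (`max_pulls`, `num_phases`), one reward table per agent (agents receive independent rewards), and a row-stochastic gossip matrix. All names are hypothetical and are not the paper's notation.

```python
import random


def build_sample_space(arm_means, gossip_matrix, max_pulls, num_phases, seed=0):
    """Pre-sample every reward and every communication partner.

    rewards[i][k][l]  : Bernoulli reward agent i receives when it pulls
                        arm k for the (l + 1)-th time
    partners[i][j]    : agent that agent i contacts at the end of phase j,
                        drawn from row i of the gossip matrix
    """
    rng = random.Random(seed)
    n = len(gossip_matrix)
    rewards = [[[1 if rng.random() < mu else 0 for _ in range(max_pulls)]
                for mu in arm_means]
               for _ in range(n)]
    partners = [rng.choices(range(n), weights=gossip_matrix[i], k=num_phases)
                for i in range(n)]
    return rewards, partners


class PullCounter:
    """Reads rewards for one agent from its pre-sampled table."""

    def __init__(self, agent_rewards):
        self.agent_rewards = agent_rewards
        self.counts = [0] * len(agent_rewards)

    def pull(self, arm):
        # The l-th pull of an arm returns the l-th pre-sampled reward,
        # regardless of when in time the pull happens.
        reward = self.agent_rewards[arm][self.counts[arm]]
        self.counts[arm] += 1
        return reward
```

The point of this design is that the reward an agent sees on its l-th pull of an arm does not depend on the order in which it interleaves pulls of other arms, which is what makes events about a single agent's reward table easy to isolate in the analysis.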

### B.1 Definitions and Notations

In order to analyze the algorithm, we set some definitions. Let be the best arm in , i.e., . Observe that since the set is random, is also a random variable. For every agent and phase , we denote by the arm that agent played the most in phase . Note that, from the algorithm, if any agent pulled an arm from agent at the end of phase for a recommendation, it would have received arm .
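The recommendation rule referred to above (an agent recommends the arm it played most in the phase) can be sketched as below. The function name `most_played_arm` is hypothetical, and breaking ties toward the smallest arm-id is an assumption made only for this illustration.

```python
def most_played_arm(plays_in_phase):
    """Return the arm-id with the largest play count in a phase.

    plays_in_phase : dict mapping arm-id to the number of times the
                     agent pulled that arm during the phase
    Ties are broken toward the smallest arm-id (an assumption).
    """
    if not plays_in_phase:
        raise ValueError("the agent must have played at least one arm")
    # Iterating arm-ids in increasing order makes max() return the
    # smallest arm-id among those with the maximal count.
    return max(sorted(plays_in_phase), key=lambda arm: plays_in_phase[arm])
```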

Fix an agent and phase . Let be a collection of all subsets of cardinality , such that . For any , index the elements in as in increasing order of arm-ids. Let be such that . For every agent , phase and , denote by the event as

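Denote by $\Xi^{(i)}_{j}$ the union of all such events, i.e.,

$$
\Xi^{(i)}_{j} := \bigcup_{S \in \mathcal{S}^{(i)}} \left( \bigcup_{(a_1, \cdots, a_{\lceil K/N \rceil + 2}) \in \mathbb{N}^{\lceil K/N \rceil + 2}} \xi^{(i)}_{j}\left(S; a_1, \cdots, a_{\lceil K/N \rceil + 2}\right) \right),
$$

and by $\chi^{(i)}_{j}$ its indicator random variable, i.e.,

$$
\chi^{(i)}_{j} = \mathbf{1}_{\Xi^{(i)}_{j}}. \tag{7}
$$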

In words, is the indicator variable of whether agent does not recommend the best arm at the end of phase under some sample path, i.e., we take a union over all possible sets of playing arms that contain arm (i.e., set ) and all possible numbers of plays of the various arms in until the beginning of phase (i.e., the set of histories in ). In Lemma 8, we provide an upper bound on this quantity. Notice from the construction that for each agent and phase , the random variable is measurable with respect to the reward sequence . Also, trivially by definition, observe that almost surely. This is so since is a union bound over all possible realizations of the communication sequence and reward sequences of other agents, while considers a particular realization of the communication and rewards of other agents.

We now define certain random times that will be useful in the analysis.

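$$
\begin{aligned}
\hat{\tau}^{(i)}_{\mathrm{stab}} &= \inf\{ j' \ge j^{*} : \forall j \ge j',\ \chi^{(i)}_{j} = 0 \}, \\
\hat{\tau}_{\mathrm{stab}} &= \max_{i \in [N]} \hat{\tau}^{(i)}_{\mathrm{stab}}, \\
\hat{\tau}^{(i)}_{\mathrm{spr}} &= \inf\{ j \ge \hat{\tau}_{\mathrm{stab}} : 1 \in S^{(i)}_{j} \} - \hat{\tau}_{\mathrm{stab}}, \\
\hat{\tau}_{\mathrm{spr}} &= \max_{i \in \{1, \cdots, N\}} \hat{\tau}^{(i)}_{\mathrm{spr}},
\end{aligned}
$$

 τ =ˆτstab+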