Social Learning in Multi Agent Multi Armed Bandits
Abstract
Motivated by the emerging need for learning algorithms in large-scale networked and decentralized systems, we introduce a distributed version of the classical stochastic Multi-Armed Bandit (MAB) problem. Our setting consists of a large number of agents N that collaboratively and simultaneously solve the same instance of a K-armed MAB to minimize the average cumulative regret over all agents. The agents can communicate and collaborate among each other only through a pairwise asynchronous gossip-based protocol that exchanges a limited number of bits. In our model, each agent at each point in time decides on (i) which arm to play, (ii) whether to communicate, and if so (iii) what and with whom. Agents in our model are decentralized, namely their actions depend only on their past observed history.
We develop a novel algorithm in which agents, whenever they choose to communicate, send only arm-ids and not samples, to another agent chosen uniformly and independently at random. The per-agent regret scaling achieved by our algorithm is O(((K/N) + log N) · (log T)/Δ), where Δ is the gap between the means of the best and the second-best arm. Furthermore, any agent in our algorithm communicates (an arm-id to a uniformly and independently chosen agent) only a total of O(log T) times over a time interval of T.
We compare our results to two benchmarks: one where there is no communication among agents, and one with complete interaction, where an agent has access to the entire system history of arms played and rewards obtained by all agents. We show, both theoretically and empirically, that our algorithm achieves a significant reduction in per-agent regret compared to the case where agents do not collaborate and each agent plays the standard MAB problem in isolation (where the regret scales linearly in K), and a significant reduction in communication complexity compared to the full-interaction setting, which requires Ω(T) communication attempts by an agent over T arm pulls. Our result thus demonstrates that even a minimal level of collaboration among the different agents enables a significant reduction in per-agent regret.
1 Introduction
The Multi-Armed Bandit (MAB) problem is a fundamental theoretical model for studying online learning and the exploration-exploitation trade-offs associated with it. In this paper, we introduce a collaborative multi-agent version of the classical MAB problem, which features a large number of agents playing the same instance of a MAB problem. Our work is motivated by the increasing need to design learning algorithms for several large-scale networked systems. Some common examples include (i) social and peer-to-peer recommendation services catering to a large number of users who are in turn connected by a network ([38], [55], [20], [8]), (ii) a collection of distributed sensors or Internet of Things (IoT) devices learning about the underlying environment (such as road traffic conditions) and connected with each other through some communication infrastructure such as the wireless spectrum ([6, 46]), and (iii) online marketplaces with many services catering to the same customer base, where the different services can potentially share data about the users in some privacy-compatible form ([11]) and learn in groups ([23], [34]).
A common theme in many of these applications is the presence of a single MAB instance, which many agents are simultaneously playing to minimize their own cumulative regret. Importantly, the agents can collaborate to speed up learning by interacting with each other only in some restricted form. For example, the number of bits communicated or the frequency of interactions among agents may be limited, either because the agents are geographically distributed and communications are expensive, or in an IoT network where the learning devices are energy-constrained. Our objective in this paper is to understand the benefit of collaboration in speeding up learning under such natural communication constraints.
1.1 Model overview
Our setting consists of a large number of agents N that collaboratively solve the same instance of a stochastic K-armed MAB problem ([13]), where each arm yields a binary-valued reward. Each agent is interested in playing the MAB problem to minimize its own cumulative regret. If there were just one agent, or if the agents were oblivious to each other and did not collaborate, then each agent would independently play the classical K-armed MAB problem. In our model, the agents can potentially collaborate with each other in solving the MAB problem by sending messages over a communication network connecting them. However, there are limitations on the information-sharing architecture in our model: agents are restricted to pairwise communication only.
Formally, agents are equipped with independent Poisson clocks (asynchronous system), and when an agent's clock rings, the agent takes an 'action'. Each action of an agent consists of (i) which arm to play to observe a reward, (ii) whether to communicate, and if so, (iii) what and with whom to communicate. Our model imposes three constraints on the communications among the agents. First, each agent, whenever it chooses to communicate, can do so with only one other agent; the communications are thus 'gossip style'. Second, agents can only communicate a limited number of bits each time they choose to communicate. The number of bits communicated in each communication attempt cannot scale with time or depend on the problem instance. In particular, this forbids agents from sharing either their entire sample history or estimates of arm means up to arbitrary precision. Third, each agent can access the communication medium only a limited number of times over any horizon of T arm pulls. This restriction disallows agents from communicating each time they pull an arm and observe a reward. Thus, agents must aggregate their observed history into a message whose size does not increase with time, and communicate that.
The agents are decentralized, namely the actions of each agent (which arm to play, whether to communicate, and if so with whom and what to communicate) can depend only on that agent's past history of arms played, rewards obtained, and messages received.
1.2 Model Motivations
We highlight two instances of our model to motivate our choice of
problem formulation and our restrictions on the communications among
agents.
The first example is the setting of multiple users (aka agents) on a
social network, visiting restaurants in a city. In this case, the
restaurants can be modeled as arms in a MAB providing stochastic
feedback on its quality during each visit. Each visit by an agent to a
restaurant provides a (noisy) score, using which an individual agent
can update her/his opinions of restaurants. Furthermore, the social
network platform enables users or agents to personally communicate to
one another to exchange their experiences. The constraint on the number of bits translates to only recommending a restaurant identity, as opposed to the real-valued score for that (and/or any other) restaurant. If an agent communicates her/his top-scoring restaurant (as is the case in our algorithm later), then agents are implicitly sharing rankings (their current top choice) instead of scores, which is well known to be more interpretable (different people's scores are hard to compare). Our
framework thus provides a guideline to understand good policies for
the users to explore the city that efficiently leverage the
information exchanged on the underlying social network.
A second example is from robotics, where several robot agents communicate over a wireless ad-hoc network in a cooperative foraging task [52]. The robots need to forage for a high-reward site from among several physically separated candidate locations (these sites constitute the arms of the bandit). Since the communication network is bandwidth-constrained, and the robot agents can only communicate (typically pairwise) with those within their radio range, the communication constraints we consider are appropriate in this setting. We also refer to [33] for another related robotics example involving collaborative leak detection in a pipe system.
1.3 Main Result
We consider a setting with N agents playing a K-armed MAB problem. The main result of this paper is an algorithm that leverages collaboration across agents, such that the per-agent regret after an agent has played for T epochs scales^{4} as O(((K/N) + log N) · (log T)/Δ), where Δ is the arm-gap between the best and the second-best arm. Moreover, in a time interval of T, an agent communicates only about O(log T) times, where each communication is an arm-id, i.e., uses at most ⌈log₂ K⌉ bits per communication.

^{4}All logarithms in this paper are natural logs unless otherwise specified.
The main idea in our algorithm is to use the communication medium only to recommend arms, rather than to exchange observed scores or rewards. Our policy restricts agents to only play from the set of arms they are aware of at any instant of time. Agents are initially aware of only a small set of arms, and this set grows with time as agents receive recommendations. Agents in our algorithm communicate with another agent chosen uniformly and independently at random, and thus the communications induced by our algorithm are 'gossip style' [47]. Qualitatively, our regret scaling occurs for two reasons: (i) the (local-explore + gossip) mechanism underlying our algorithm ensures that the best arm spreads quickly through the network to all agents. Notice that since agents only play from among arms they are aware of, it is not a priori clear that all agents become aware of the best arm at all. (ii) Nevertheless, our algorithm ensures that each agent in the network only ever explores a vanishingly small fraction of the arms. In other words, the suboptimal arms do not spread, and thus not all agents need to learn and discard the suboptimal arms.
Analytically, we introduce several novel coupling arguments and tail estimates to study variants of classical spreading processes on graphs (cf. Theorems 25 and 27), which can be of independent interest. Furthermore, we employ arguments based on linearity of expectation to handle the dependencies in regret among agents induced by our algorithm (cf. Propositions 11, 13 and 16), which we believe can be useful in studying other algorithms for our model.
1.4 Comparison with Benchmark Systems
Since we are interested in quantifying the effect of collaboration through limited noisy pairwise interactions among the agents, we compare our result against the two extreme scenarios of collaboration among the agents: a setting with no communication, and one with complete interaction among agents.
1. No Communication regime: If the players are unaware of each other and do not interact at all, then each player faces a standard MAB problem consisting of K arms. Thus, from well-known results (e.g. [4]), each agent, after playing the MAB problem for T time steps, must incur a regret that scales as order K log(T)/Δ.
2. Full Interaction regime: At the other extreme is the perfect-collaboration model in which every agent, whenever its clock rings, plays an arm, observes a reward, and then broadcasts both the arm played and the reward obtained to all other agents. In this case, every agent, before playing an arm, has access to the entire system history and can thus jointly simulate a single-agent optimal scheme. Hence, after a total of T clock ticks of a tagged agent, the total number of arm pulls by all agents is roughly NT. (This is not exact, since there is some randomness in the number of times an agent plays in a given time interval, determined by the randomness of the agents' clock processes.) Thus, the network as a whole incurs an average regret of order K log(NT)/Δ. As there are N agents in total, the per-agent regret in this case scales as (1/N) · K log(NT)/Δ, which is of order (K/N) log(NT)/Δ. This is the best possible per-agent regret scaling one can hope for in this networked setting, and no collaborative policy can beat it. However, to achieve this, each agent must communicate both its arm and the observed reward (which takes at least ⌈log₂ K⌉ + 1 bits) to all other agents, each time it plays an arm. In other words, an agent must communicate T times over T plays of the arm, and each communication is broadcast to all other agents.
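For concreteness, the leading-order per-agent regret in the three regimes can be compared numerically. Constants are dropped and the scalings are the ones sketched above, so the numbers are purely illustrative:

```python
import math

def per_agent_regret_scalings(K, N, T, delta):
    """Leading-order per-agent regret (constants dropped) in the three regimes."""
    no_comm = K * math.log(T) / delta                      # no communication
    full = (K / N) * math.log(N * T) / delta               # full interaction
    gossip = (K / N + math.log(N)) * math.log(T) / delta   # gossip-limited collaboration
    return no_comm, gossip, full

no_comm, gossip, full = per_agent_regret_scalings(K=100, N=100, T=10**6, delta=0.1)
print(f"no comm: {no_comm:.0f}, gossip: {gossip:.0f}, full: {full:.0f}")
```

The ordering full < gossip < no-communication holds for any large N and T, reflecting that limited gossip recovers most of the benefit of full interaction.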
In our model, on the other hand, we are restricted to pairwise random communications, and each agent participates in only O(log T) communications over the T times it pulls arms to collect rewards. Nevertheless, we show that our algorithm achieves a significant reduction in per-agent regret compared to the setting of no interaction among agents, bringing the leading coefficient in front of log(T)/Δ down from order K to order (K/N) + log N. Our algorithm is also only an additive log N term off from the setting of complete interaction among agents, which has a coefficient of order K/N in front of the log(T)/Δ term. Moreover, our algorithm achieves this reduced regret scaling with much smaller communication resources, where an agent uses the communication channel only order log T times over T plays. We plot in Figure 1 a representative situation showing the regret growth of our algorithm against that of the no-communication and full-interaction cases.
Organization of the Paper: In Section 2, we give a precise mathematical formulation of the problem. We specify the algorithm in Section 3, and the main theorem is stated in Section 4. We give an overview of the proof in Section 5. We evaluate our algorithm and benchmark its performance empirically on both synthetic and real data in Section 6. We then survey related work in Section 7, and conclude with some discussion and open problems. The full proof of our main result is carried out in Appendices A, B and C.
2 Problem Setting
We have a collection of N agents, each of whom plays the same instance of a MAB problem consisting of K arms. The arms have unknown mean rewards μ₁, …, μ_K, where each μ_a ∈ (0, 1). Without loss of generality, we assume μ₁ > μ₂ ≥ ⋯ ≥ μ_K; the agents, however, are not aware of this ordering of arm-means. Denote by Δ := μ₁ − μ₂ the arm-gap, and we shall assume that Δ > 0. If at any time, any agent plays an arm a, it receives a reward distributed as a Bernoulli random variable of mean μ_a, independent of everything else.
2.1 System Model
Clock Process: The system evolves in continuous time, where each agent is equipped with a unit-rate Poisson process on [0, ∞), which functions as a clock for that agent. Each agent takes an action only at those random time instants when its clock 'ticks'. The times at which a clock ticks are referred to as the epochs of the clock process. The clock processes of the different agents are i.i.d., and hence the actions of different agents are not synchronized.
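As an illustration of the asynchronous clock model, the epoch times of the agents' unit-rate Poisson clocks can be generated via i.i.d. Exponential(1) inter-arrival times and interleaved. This is a standard construction; the function name and parameters are illustrative only:

```python
import random

def poisson_epochs(n_agents, horizon, seed=0):
    """Epoch times of n_agents independent unit-rate Poisson clocks up to
    continuous time `horizon`, returned as a time-sorted list of
    (time, agent) pairs. Inter-arrival times are i.i.d. Exponential(1)."""
    rng = random.Random(seed)
    events = []
    for agent in range(n_agents):
        t = rng.expovariate(1.0)
        while t <= horizon:
            events.append((t, agent))
            t += rng.expovariate(1.0)
    events.sort()
    return events

events = poisson_epochs(n_agents=5, horizon=100.0)
# Each agent's expected number of epochs in [0, 100] is 100, but the
# realized counts and the interleaving order are random.
```

The merged event list is exactly the order in which agents act in the asynchronous system.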
Agent's Actions: An action by an agent (taken at the epochs of its clock process) consists of three quantities: an arm among the set of K arms to play and obtain a reward (where the observed reward is either a 0 or a 1), the choice of whether to initiate a pairwise communication, and if so, what message to send and to whom. The message communicated by any agent, each time it does so, must be a bounded number of bits long and must not scale with time or depend on problem parameters such as the arm means or the gap Δ. Moreover, over T total epochs of an agent in which it played arms and collected rewards, it must have communicated only a limited number of times. Henceforth, we use the term number of epochs to denote the number of times an agent has played arms and collected rewards, and time to refer to the continuous time during which the agents' clocks ring. Our system is decentralized, namely an agent's actions (which arm to pull, whether to communicate, and if so what and to whom) must depend only on that agent's past history of arms pulled, rewards obtained and messages received.
Technical Setup: We suppose there exists a probability space which contains N i.i.d. unit-rate marked Poisson Point Processes (PPPs), corresponding to the clocks of the agents.
Each epoch of each clock has associated with it three independent uniform [0, 1]-valued random variables. The system's sample path is then a measurable (i.e., deterministic) function of the set of marked PPPs. The interpretation of this setup is as follows. Every agent plays
an arm at the epochs of its clock process and the marks decide actions
(whether to communicate and which arm to play) and their outcomes
(observed rewards and recipients of communication if any). The action of every
agent at every epoch of its clock must be a measurable function of
only its arms played, observed rewards and received messages in
the past. In the absence of messaging, every agent is playing a
standard MAB problem, where its action, which is just which arm to
play, is a measurable function of the past arms chosen and rewards
obtained. The key new ingredient in our setup is the active
messaging, where agents can choose, based on the history of chosen
arms, observed rewards and received messages, the arm to play
and the message to communicate, if at all. Thus, our
setting is distributed since an agent is not aware of the arms played
and the rewards obtained by other agents, but only has an indirect
knowledge through the active messages received.
2.2 Performance Metric
The main performance metric of interest is the average regret incurred by all agents. For any agent u and epoch index τ ≥ 1, denote by I_u^(τ) the arm played by agent u in its τ-th epoch. The regret of agent u, after it has played for T epochs, is defined as

R_u(T) := T μ₁ − E[ Σ_{τ=1}^{T} μ_{I_u^(τ)} ].

This definition of regret is classical in the study of MABs. In this multi-agent scenario, we consider minimizing the average regret experienced by all agents, or the per-agent regret, given by (1/N) Σ_{u=1}^{N} R_u(T), where the expectation is with respect to both the observed randomness and the policy. We want to design algorithms in which agents minimize their own cumulative regret while requiring as few communication resources as possible.
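As a concrete sketch (a hypothetical helper, not part of the paper), the per-agent cumulative pseudo-regret and its network average can be computed from play histories as follows:

```python
def average_regret(plays, mu):
    """plays[u] is the sequence of arm indices agent u played, in epoch order;
    mu[a] is the true mean of arm a. Returns each agent's cumulative
    (pseudo-)regret and the per-agent (network-average) regret."""
    best = max(mu)
    per_agent = [sum(best - mu[a] for a in seq) for seq in plays]
    return per_agent, sum(per_agent) / len(per_agent)

# Two agents, arms with means 0.9 and 0.5; agent 1 played the bad arm once.
regrets, avg = average_regret([[0, 0], [1, 0]], [0.9, 0.5])
```

Averaging over agents is what makes it acceptable for a few agents to explore a bad arm, as long as most agents do not.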
2.3 Model Assumptions
Each agent can agree upon a common protocol to follow prior to execution; this protocol could potentially depend on the agents' indices. We assume all agents are aware of a common non-trivial lower bound on the arm-gap Δ, and use this information to make decisions. Nevertheless, our proposed algorithm still executes if this assumed bound is violated, and we verify through simulations in Section 6 that the degradation in performance is minimal in this case.
Such an assumption of a known lower bound on Δ, but unknown mean rewards (which is the setting in our case), is used in several MAB settings (see the book of [37]), for instance in the classical greedy algorithm [53] or the UCB-A algorithm [3]. In networked settings similar to ours, this assumption appears to be standard ([54], [27]). Certain algorithms in [27] require an input parameter that depends on the arm-gap Δ. However, it is known from [35], [13] that even if the forecaster knows the arm-gap Δ, the regret scales at least as order log(T)/Δ [35]. Thus, knowledge of the arm-gap does not affect the complexity of the problem, at least from the perspective of regret scaling in time.
3 Algorithm
The algorithm has four parameters, including the phase-length parameters and the UCB parameter. The algorithm evolves with the different agents being in different states or phases, taking values in the positive integers. At the beginning of execution, all agents start out in phase 1, and as the execution proceeds, they increment their phase by 1; in other words, the phase of every agent is non-decreasing with time. We say that an agent is in the Early Phase if its phase is at most a fixed threshold, and in the Late Phase otherwise.
3.1 Notation
For each agent i and phase j, we denote by S_i^(j) the set of arms agent i is aware of at the beginning of phase j. The algorithm is such that, in any phase, an agent only plays from among the set of arms it is aware of.

In our algorithm, every agent, if it chooses to communicate, communicates only arm-ids. Thus, during the course of execution of our algorithm, agents receive arm-ids as messages.

For an agent i and phase j, denote by U_i^(j) the set of arm-ids received by agent i while it is in phase j. At the start of phase j+1, agent i updates the set of arms it is aware of as S_i^(j+1) = S_i^(j) ∪ U_i^(j). In other words, agents update the set of arms they are aware of only at the end of a phase. Agents agree upon an initial set of arms, i.e., S_i^(1) is chosen before execution of the algorithm. Notice that the set of arms an agent is aware of is non-decreasing, i.e., for any phases j' ≥ j, any agent i and arm a, if a ∈ S_i^(j) then a ∈ S_i^(j').

For any agent i, any arm a, and any t ≥ 1, denote by n_i^(j)(a, t) the number of times arm a was played by agent i during its first t plays (epochs) in phase j. If n_i^(j)(a, t) > 0, denote by μ̂_i^(j)(a, t) the empirical estimate of the mean of arm a by agent i, using only the samples collected in the first t plays of agent i in phase j.
3.2 Algorithm Description
For any agent i, its execution is defined as follows.
Initialization: At time 0 (i.e., at the beginning of phase 1), agent i is aware only of a small initial set of arms determined by its index; observe that its cardinality is of order K/N.
Early Phase: When agent i is in an early phase, it plays the arms it is aware of in round-robin fashion. An agent stays in each early phase for a fixed number of epochs of its clock process before shifting to the next phase. At the end of an early phase, agent i chooses another agent uniformly and independently at random, and communicates to it the index (id) of the arm with the highest empirical mean among the arms it is aware of, based on the samples collected during that phase.
Late Phase: Agent i is in the late phase once its phase index exceeds the early-phase threshold. The agent stays in late phase j for a number of epochs that grows (doubly exponentially) with j before shifting to phase j+1. At any play instant of agent i in a late phase, if there is an arm it is aware of that has not yet been played sufficiently often, it plays one such arm, chosen arbitrarily. If no such arm exists, agent i plays an arm chosen according to the UCB policy [4], i.e., an arm maximizing the index

μ̂(a) + sqrt( α ln(t) / (2 n(a)) ),

where μ̂(a) is the empirical mean of arm a in the current phase, n(a) is the number of times arm a has been played in the current phase, t is the number of plays so far in the phase, and α is the UCB parameter.

Furthermore, in every late phase, agent i communicates only during the first few epochs of the phase and after that does not communicate. In late phase j, agent i communicates the arm from the previous phase that was played the most number of times, with each communication attempt directed at a uniformly random agent.
3.3 Algorithm PseudoCode
For ease of readability, we translate the above description of our algorithm into pseudocode in Algorithm 1. The pseudocode assumes access to a function Communicate, which takes an arm-id and the sending agent as input, and sends the arm-id to an agent chosen uniformly at random from the set of all agents, independently of everything else.
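Since Algorithm 1 itself is not reproduced here, the following condensed Python simulation sketches the two-stage structure under simplifying assumptions: synchronous phase boundaries, a fixed early-phase length, geometric (rather than doubly exponential) late-phase growth, and illustrative parameter and initial-set choices. It is not the paper's exact pseudocode.

```python
import math
import random

def simulate(mu, n_agents, n_phases, early_phases=3, early_len=200,
             alpha=4.0, seed=0):
    """Sketch of the two-stage policy: round-robin + recommend the
    empirically best known arm (early phases), then UCB over known arms with
    doubling phase lengths + recommend the most-played arm (late phases)."""
    rng = random.Random(seed)
    K = len(mu)
    # Each agent starts aware of a small index-based subset of arms.
    aware = [{u % K, (u + 1) % K} for u in range(n_agents)]
    inbox = [set() for _ in range(n_agents)]
    pulls = [0] * n_agents
    for phase in range(n_phases):
        late = phase >= early_phases
        length = early_len * (2 ** (phase - early_phases + 1) if late else 1)
        for u in range(n_agents):
            arms = sorted(aware[u])
            cnt = {a: 0 for a in arms}
            rew = {a: 0.0 for a in arms}
            for t in range(1, length + 1):
                if late:
                    # UCB index over the arms this agent is aware of
                    a = max(arms, key=lambda k: math.inf if cnt[k] == 0 else
                            rew[k] / cnt[k]
                            + math.sqrt(alpha * math.log(t) / (2 * cnt[k])))
                else:
                    a = arms[t % len(arms)]  # round-robin exploration
                rew[a] += 1.0 if rng.random() < mu[a] else 0.0
                cnt[a] += 1
                pulls[u] += 1
            # Recommend to one uniformly random agent: empirically best arm
            # (early) or most-played arm (late).
            if late:
                rec = max(arms, key=lambda k: cnt[k])
            else:
                rec = max(arms, key=lambda k: rew[k] / max(cnt[k], 1))
            inbox[rng.randrange(n_agents)].add(rec)
        # Received arm-ids are incorporated only at phase boundaries.
        for u in range(n_agents):
            aware[u] |= inbox[u]
            inbox[u] = set()
    return aware, pulls
```

With enough phases, the best arm typically spreads to all agents while suboptimal arms spread little; the parameters here (early_len, alpha, the initial awareness sets) are illustrative choices, not the paper's.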
3.4 Remarks on the Algorithm
The algorithm is 'fully asynchronous' in the sense that agents act independently, without keeping track of either absolute continuous time or a shared global system clock. Notice that in the early stage, every agent communicates exactly once at the end of each early phase, i.e., a constant number of times in total. In each late-stage phase, an agent communicates a bounded number of times at the beginning of the phase. Since the duration of each late-stage phase grows doubly exponentially, after T epochs of play any agent will have communicated order log T times, where each communication is of at most ⌈log₂ K⌉ bits.
One can potentially improve the algorithm by using a black-box best-arm identification algorithm in the early phases instead of playing arms in round-robin fashion. Concretely, in each early phase, each agent would run a best-arm identification algorithm on its set of known arms for at most a fixed budget of arm pulls. If a best arm is identified within the budget, that arm is communicated; if the algorithm fails to terminate within the budget, a random known arm is communicated instead, and the agent moves to the next phase. Similarly, one could use a more sophisticated variant of the UCB algorithm ([37]) in the late phases and obtain slightly better results.
4 Main Result
Theorem 1.
Consider a system with N agents and K arms, with each agent running the above algorithm with the phase-length parameters, the UCB parameter, and a lower bound on the arm-gap Δ chosen as specified. Then for every agent i and every T ≥ 1, the regret after agent i has played for T epochs is bounded by

(1)  R_i(T) ≤ C · ((K/N) + log N) · (log T)/Δ + C′,

where C and C′ are constants not depending on T. Moreover, in T epochs of play, each agent communicates at most a total of O(log T) times.
To help parse the result, the following remarks consider representative parameter regimes to understand how effectively our algorithm leverages the collaboration among agents.
Remark 2.
Theorem 1 states that the expected regret of any agent after T epochs is of order ((K/N) + log N) · (log T)/Δ. We can compare this regret scaling with the two benchmark systems of no communication and complete interaction described in Section 1.4. If agents do not interact at all, the per-agent regret is known ([35]) to scale as order K log(T)/Δ. In the setting of complete information exchange, the discussion in Section 1.4 yields a per-agent regret of order (K/N) log(NT)/Δ. Thus, our algorithm is off from full coordination only by a logarithmic (in N) factor, plus an additive constant regret term.
Remark 3.
Recall that in the fully centralized setting, the total number of times an agent communicates with the centralized server over T epochs is of order T. This follows since, for each play of the agent, the centralized entity must communicate an arm-id for the agent to play, which requires at least ⌈log₂ K⌉ bits, and the agent reports back its observed sample, which takes 1 bit. However, in our algorithm, the total number of communications initiated by an agent in T epochs is of order log T, where each communication is at most ⌈log₂ K⌉ bits, similar to the setting with complete information exchange.

Further, if the arm rewards are drawn from a more general sub-Gaussian distribution, the analysis in this paper goes through with minor modifications, and both the regret scaling and the communication scaling remain unchanged. However, this relaxation has implications on the communication complexity of a centralized algorithm. Specifically, each agent needs to encode and communicate the arm reward at sufficient resolution to distinguish between the best and second-best arm means, which takes an additional order log(1/Δ) bits (assuming Δ is known) per message.
Thus, our algorithm is able to effectively emulate the complete interaction setting using only pairwise anonymous asynchronous gossipstyle communications with much smaller communication complexity.
Remark 4.
Remark 5.
Remark 6.
The initialization step of Algorithm 1 requires agents to be aware of their index, which may not be feasible in many scenarios. The following simple modification to this step can make our algorithm fully distributed. Each agent constructs its initial set by choosing arms from the full set of K arms uniformly at random with replacement. Then, with at least a constant probability, there will exist an agent whose initial set contains the best arm, i.e., the best arm is in some agent's initial playing set. On this event, the regret of any agent after playing for T epochs satisfies a bound of the same form as Equation (1), with the corresponding constants suitably enlarged.
4.1 Discussion
The per-agent regret bound in Equation (1) implies several objectives accomplished by the algorithm.
First, it establishes that every agent eventually plays the best arm with probability 1. For if some agent never played the best arm with some constant probability, then the per-agent regret would have a lower bound scaling linearly in time, which is not logarithmic. Second, since an agent only chooses arms from the set of arms it is aware of, the regret bound also implies that, on average, a typical agent plays at most order (K/N) + log N distinct arms. These two properties, (i) every agent becoming aware of the best arm, while (ii) playing only a small number of distinct arms, illustrate the key benefit of collaborative messaging. In words, collaboration spreads the best arm to all agents while not spreading the poor arms, so that not all agents need to learn and discard the poorly performing arms.
Furthermore, observe that our regret bound has an additive term that does not scale with T. This additive term can be viewed as the cost of collaborating through the noisy gossip process. As agents only play from the set of arms they are aware of, an agent may not play the best arm until it is recommended, and until then keeps incurring regret linear in time. Moreover, from well-known results ([28]), it takes an agent at least order 1/Δ² epochs to identify the best arm with constant probability, and hence to communicate it through the gossip process. Thus, the additive term reflects the average time before a typical agent becomes aware of the best arm and starts playing it. We refer the reader to Appendix D for further discussion.
4.2 Algorithm Intuition and Challenges in Analysis
Our goal is to design an algorithm so that all agents become aware of
the best arm as quickly as possible, since agents will incur a
linearly scaling regret until they become aware of the best arm.
Thus, we conceptually divide the evolution of the algorithm into two stages: an early stage and a late stage.
The early stage: In this stage, gossip and best-arm identification dominate; the goal is to ensure that all agents identify the best arm, while simultaneously making sure that each agent is only aware of, and has explored, a small fraction of the arms.
The tension is the following: When not all agents are even aware of
the best arm, agents must aggressively spread or communicate what they
estimate as their current best arm. However, if agents communicate too
frequently, then their estimates are likely to be poor, as they will
be based on too few samples, thus leading to both increased
communications and bad recommendations (resulting in all agents being
aware of too many arms and leading to poor regret scaling).
The late stage: As time progresses and all agents are reasonably sure of being aware of the best arm, agents must shift their focus to regret minimization rather than best-arm estimation. However, since we want all agents to be aware of the best arm eventually with probability 1, agents must nevertheless keep communicating. In particular, almost surely, all agents must eventually make infinitely many recommendations as time progresses, while only making finitely many incorrect recommendations. Thus, the late stage must balance two competing objectives. In the rare case that not all agents are aware of the best arm when they shift to the late stage, they must become aware of it quickly; and in the typical case when all agents are aware of the best arm at the beginning of the late stage, the number of new arms an agent becomes aware of in the late stage must be small. The second objective is desirable because any arm an agent newly becomes aware of in the late stage, conditioned on all agents being aware of the best arm at the end of the early stage, is necessarily suboptimal.
Recommendations: In our algorithm, we decouple the samples (rewards of arm pulls) on which agents make successive recommendations, in both the early and late phases. This allows us to claim that the quality of an agent's recommendations is independent across phases, which greatly aids the analysis. This decoupling also ensures that the quality of recommendations made by an agent is independent of the regret the agent incurs on its samples. We achieve this independence by using the doubling trick [9] in the late phase, and by using an agent's performance in the preceding phase to make recommendations in the current phase. Contrary to the usual use of the doubling trick for converting a fixed-horizon algorithm into an anytime algorithm, we use it here to provide the necessary sample splitting between making recommendations and minimizing regret. This decoupling comes at a price, however, which shows up as an additive term in the regret.
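The sample-splitting role of the doubling trick can be seen from the phase schedule: if late-phase lengths grow geometrically (a simplification of the doubly exponential growth used in the algorithm), then reaching horizon T takes only O(log T) phases, so recommendations (made once per phase boundary, from the previous phase's samples) are few, and each phase's regret is incurred on samples disjoint from those that produced its recommendation. A minimal sketch:

```python
def doubling_schedule(base, horizon):
    """Late-phase lengths base, 2*base, 4*base, ... until their total
    reaches `horizon`. Returns the list of phase lengths."""
    lengths, total, j = [], 0, 0
    while total < horizon:
        lengths.append(base * 2 ** j)
        total += lengths[-1]
        j += 1
    return lengths

# The number of phases (hence recommendation rounds) grows like log2(horizon).
phases = doubling_schedule(base=100, horizon=10**6)
```

Here `base` is an illustrative parameter, not the paper's exact phase-length choice.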
The parameter in our algorithm: Observe that the algorithm needs a valid lower bound on Δ for the regret guarantees to hold. Furthermore, the closer this parameter is to the true gap Δ, the better our regret bound, as evidenced both by Theorem 1 and by the simulations in Section 6. However, we show empirically in Section 6 that even if the assumed bound exceeds the true gap, in practice our algorithm yields good performance and leverages the benefit of collaboration.

The knowledge of a lower bound on Δ is particularly helpful to agents in deciding when to make recommendations, i.e., in the choice of the phase lengths. If an agent recommends too early in the early stage, i.e., after far fewer than order 1/Δ² plays in total, then such a recommendation will likely be wrong.
One potential way to remove the required knowledge of would be for agents to run a fixed-confidence best-arm identification algorithm (e.g., see [29] and references therein) before making recommendations. However, such a modification to our algorithm is not guaranteed to work. To see this, consider a problem instance where , with all agents being aware of arms and in the beginning. In the early phase of the algorithm, when not all agents are aware of the best arm, those that are not aware of it (but have arms 2 and 3) will spend a large number of samples distinguishing between these two arms. These agents will therefore remain in the early phase for a long time and incur a large regret. However, as neither of these arms is the best, it does not matter which of the two is recommended; the agents could thus have used fewer samples and saved on regret.
We remark that this assumption appears to be common to many algorithms developed to leverage collaboration in a networked setting. As mentioned, [27] and [54], which consider the simple-regret counterpart in a networked setting to our cumulative-regret objective, both assume knowledge of for their algorithms. For instance, the algorithms of [27] require , an input parameter, to be larger than a certain function of , while [54] requires an explicit lower bound on , similar to ours.
5 Proof Sketch
We identify certain desirable behaviour that occurs with high probability (w.h.p.). Set . We call the system Good if the following events occur.

Event  All agents are aware of the best arm by time .

Event  The total number of times any agent is contacted by another agent while it is in the early phase, i.e., while in state or lower, is at most .

Event  By time , all agents are in phase or lower.
Notice that if the Good event holds, every agent will play the best arm in phases and beyond, as all agents are aware of the best arm by phase at most. We will show in Lemma 7 that the system is Good w.h.p.
5.1 LateStage Analysis
We split the regret into the sum of three terms: (i) regret in the early phase, which is linear since agents are only doing best-arm identification; (ii) regret in the late phase due to playing the UCB algorithm with the doubling trick; and (iii) linear regret in the late phase until an agent becomes aware of the best arm, if it is not aware of it at the beginning of the late phase. The first term is straightforward, as we assume that all agents incur a worst-case regret of in each of their early-phase epochs. The main challenge in computing the second term is that the number of arms an agent is aware of in any late stage is a random variable, not a fixed quantity. However, the regret of an agent conditional on the number of arms is easy to compute, as it follows directly from [4]. The key observation is that conditioning on the number of arms an agent is aware of at the beginning of a phase has no effect on the regret incurred during that phase; this holds because we do not reuse samples across phases to maintain estimates of the arm means. As the regret conditional on the number of arms scales linearly in that number, it suffices to evaluate the mean number of arms an agent is aware of at the beginning of a phase, which we do in Propositions 16 and 17. To evaluate the third term, we upper bound the time it takes an agent to learn the best arm by the time it takes agent to recommend the best arm to it. We show in Propositions 12, 13, 14 and 15 that the average number of epochs an agent must wait in the late stage before being recommended the best arm by agent is 'small'.
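Schematically, in illustrative notation of our own (the symbols below are not the paper's), the per-agent regret decomposes as:

```latex
\mathbb{E}[R_T]
\;\le\;
\underbrace{\Delta_{\max}\,\mathbb{E}[T_{\mathrm{early}}]}_{\text{(i) linear early-phase regret}}
\;+\;
\underbrace{\textstyle\sum_{j}\mathbb{E}\!\left[\mathrm{Reg}^{\mathrm{UCB}}_{j}\right]}_{\text{(ii) per-phase UCB regret}}
\;+\;
\underbrace{\Delta_{\max}\,\mathbb{E}[\tau_{\mathrm{aware}}]}_{\text{(iii) delay until the best arm is known}}
```

where \(\Delta_{\max}\) denotes the largest suboptimality gap, \(T_{\mathrm{early}}\) the length of the early phase, \(\mathrm{Reg}^{\mathrm{UCB}}_{j}\) the UCB regret incurred in late phase \(j\), and \(\tau_{\mathrm{aware}}\) the number of late-phase epochs before the agent learns the best arm.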
5.2 EarlyStage Analysis
We establish in Lemma 7, proved in Section B, that the system is Good with probability at least . The probabilities of events and are straightforward to deduce from Chernoff tail bounds, which we do in Lemmas 18 and 19 respectively. Bounding the probability of event is the key technical innovation, and the difficulty stems from the following. We must first condition on event , as this implies that all agents in the early stage make recommendations from among at most other arms.
By the choice of and known results from [14], reproduced as Lemma 20, conditional on event , agents that possess the best arm recommend it with probability at least . However, conditioning on event induces correlations among the agent ids that receive the messages, which makes the spreading process difficult to analyze directly, as the recipients are no longer independent conditional on .
We proceed by considering and analyzing a fictitious virtual system that is identical to our algorithm in the early stage, with the crucial modification that an agent in this fictitious system drops arms whenever it becomes aware of or more arms. However, an agent in this virtual system never drops the best arm once it becomes aware of it. Note that this is only a mathematical stochastic process under consideration, so we may assume that agents in the virtual system know the best arm's index. We show in Lemma 21 that, w.h.p., this virtual system has sample paths identical to those of our algorithm up to time .
We study the virtual system by a reduction to a discrete-time rumor-mongering process. Specifically, we establish in Lemma 24 that agents in this virtual system are 'in sync', i.e., for all , no agent makes its th recommendation before all other agents finish making their th recommendations (see also Figure 3). The discrete-time rumor process we obtain is a variation of the classical rumor spreading of [24] and [42], with two important distinctions. First, in our discrete-time model, an agent spreads the rumor only after a one-slot delay following its receipt of the rumor. Second, each spreading attempt of an agent in each time slot succeeds with probability , rather than always succeeding as in [24]. We show in Theorem 25 that the total spreading time for this process is of order with high probability. We provide a simple proof of the spreading time in Theorems 25 and 27, which may be of independent interest. This lets us conclude that event holds w.h.p. for the virtual system, which in turn implies that it holds w.h.p. for our algorithm, as the virtual system and our algorithm have identical sample paths up to time w.h.p.
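The modified rumor process is easy to simulate. The sketch below (our own illustration; the function name and parameters are assumptions) implements push-based spreading in which each informed agent contacts one uniformly random agent per slot, a push succeeds with probability `p`, and a newly informed agent waits one slot before it starts pushing:

```python
import random

def spread_time(n, p, seed=None):
    """Slots until all n agents know the rumor, in a push model where
    each active agent contacts one uniformly random agent per slot, a
    push succeeds with probability p, and a newly informed agent waits
    one full slot before it starts pushing."""
    rng = random.Random(seed)
    informed = {0}       # agent 0 holds the rumor initially
    active = {0}         # agents allowed to push in the current slot
    t = 0
    while len(informed) < n:
        t += 1
        newly = set()
        for _ in active:
            if rng.random() < p:              # this push attempt succeeds
                target = rng.randrange(n)     # uniform random recipient
                if target not in informed:
                    newly.add(target)
        informed |= newly
        active = informed - newly             # one-slot delay for new agents
    return t
```

Averaging `spread_time(n, p)` over many seeds lets one check empirically how the spreading time grows with `n` for a fixed success probability `p`.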
6 Numerical Results
We empirically evaluate the performance of our algorithm and, in particular, highlight the gains from collaboration in reducing per-agent regret. Throughout this section, we use and . These values differ from those in Theorem 1, as the constants there arise from certain tail-probability bounds that are not tight.
6.1 Synthetic Data
We evaluate the performance of our algorithm in Figure 4. For each choice of and , we sample the arm means uniformly in the range , with the best arm having mean . To be comprehensive, we test our algorithm on instance settings with number-of-arms and number-of-agents pairs of . We vary the input parameter of our algorithm and compare its performance against the two benchmarks stated in Section 1.4, namely a system with no interaction and a system with perfect interaction. The no-interaction system corresponds to a single agent playing the MAB with the UCB() algorithm of [4]. The perfect-interaction benchmark is one in which, whenever an agent's clock ticks, it has access to the entire system history and chooses an arm according to the UCB() algorithm on that history. In each plot, we first sample the arm means, then perform random runs and plot the average over these runs along with % confidence intervals.
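A minimal sketch of the arm choice in the perfect-interaction benchmark (our own illustration; the function name and the `alpha` exploration parameter are assumptions): at a clock tick, the agent computes a UCB index from the pooled play counts and reward sums of all agents.

```python
import math

def pooled_ucb_step(counts, sums, t_total, alpha=2.0):
    """Pick an arm by a UCB(alpha) index computed on the pooled
    counts/sums of ALL agents; t_total is the total number of pulls
    made by the whole system so far."""
    k = len(counts)
    for a in range(k):
        if counts[a] == 0:
            return a                      # play each arm once first
    return max(range(k), key=lambda a: sums[a] / counts[a]
               + math.sqrt(alpha * math.log(t_total) / counts[a]))
```

The no-interaction benchmark is the same rule applied to a single agent's private counts and sums.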
Results: We see from Figure 4 that our proposed algorithm is both practically scalable to large systems and effective in leveraging collaboration to significantly reduce per-agent regret compared to the case of no collaboration. Even with small , our algorithm eventually has much smaller regret growth than the no-collaboration setting. Moreover, there is still substantial performance gain in regret when the input parameter of our algorithm is varied. Note that the theoretical guarantees in Theorem 1 hold only if , while in practice (as seen in Figure 4) our algorithm performs well even if .
6.2 Simulations with Real Data
We use the MovieLens data [25] to run our algorithm. This dataset has users and movies. We selected a user category corresponding to the same gender, age and occupation, and ensured that there are at least users in each category. We then considered a subset of movies such that each user rated at least of those movies and each movie is rated by at least of these users. We extract this submatrix and run standard matrix completion [26] to fill in the missing entries. We then average each column and divide this average by . This yields the mean rating, normalized to , of each movie in this user group. This set of normalized scores for movies is used as the arm means, with each movie corresponding to an arm.
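The preprocessing pipeline can be sketched as follows; this is a simplified stand-in in which a truncated-SVD fill-in replaces the matrix-completion solver of [26], and `max_rating` is an assumed normalization constant:

```python
import numpy as np

def movie_arm_means(ratings, max_rating=5.0, rank=5):
    """Sketch of the arm-mean construction from a (users x movies)
    rating matrix with missing entries marked np.nan. Missing entries
    are filled by a rank-`rank` truncated-SVD approximation (a crude
    stand-in for the matrix-completion step), then each movie's column
    mean is normalized by `max_rating` to serve as an arm mean."""
    # initialize missing entries with the global mean rating
    filled = np.where(np.isnan(ratings), np.nanmean(ratings), ratings)
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # keep observed ratings; use the low-rank estimate elsewhere
    completed = np.where(np.isnan(ratings), low_rank, ratings)
    return completed.mean(axis=0) / max_rating
```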
In Figure 5, we run our algorithm with these arm means and a common parameter of . In each plot of Figure 5, we randomly sample a collection of movies satisfying the above property, then perform random runs and plot the average over these runs along with % confidence intervals. The confidence bars are smaller than the marker size in the plots.
Results: We see from Figure 5 that, even for large systems, our algorithm reaps the benefits of collaboration. In particular, since the number of arms is large ( or more), single-agent UCB incurs linear regret within the simulation window, while our algorithm enters the late phase and attains sublinear regret growth much earlier. This is because, in all experiments, our algorithm explores a much smaller number of suboptimal arms (under in all cases, as described in Figure 5) than standard UCB. Moreover, the arm gap in all plots is or smaller (note that the arms were randomly selected for each plot), yet our parameter of performs quite well, implying that our algorithm is quite robust.
7 Related Work
Our work focuses on multi-armed bandit (MAB) [56, 13] problems in a multi-agent setting, which has received increasing attention in a number of applications. The earliest work in this direction is [7], which considers an adversarial bandit model with malicious agents. This setting was further developed in [17], which adds delays in communication among agents connected by a general graph. However, there are no restrictions on the communications, and agents in these models can communicate after every arm pull. Subsequently, [31] studies the communication-versus-regret tradeoff in a distributed setting with non-stochastic bandits. However, their model does not impose pairwise communication; rather, agents communicate via a central coordinator. In the non-stochastic setting, [45] introduces interactions across agents as limited advice from experts, which differs from our setting.
In the stochastic bandit setting, the papers [19] and [15] study the tradeoff between communication cost and regret minimization among a team of bandits. However, in these models, agents can simultaneously share information with all others, unlike the pairwise communication setting of this paper. The model in [33] considers bandit optimization on a social network, where the action and reward of an agent can be observed by its neighbors on a graph. However, there is no notion of a communications-versus-regret tradeoff, as agents communicate with their neighbors at every time step. A recent work, [40], considers a multi-agent setup where agents can choose to communicate with all neighbors on an underlying unknown graph. However, agents in their algorithm communicate after each arm pull and thus also face no communications-versus-regret tradeoff.
There has also been work ([27], [54]) on understanding the communication-versus-simple-regret (pure exploration) tradeoff for best-arm identification, which differs from the cumulative-regret (explore-exploit) tradeoff considered in this paper. Moreover, information sharing in these models differs from ours: the communication model of [27] is one where every node can see every other node's message, whereas the agents in [54] communicate at each time step, so the communication per agent is linear in the number of arm pulls. However, similar to our paper, both papers require some knowledge of the arm gap . The algorithm of [27] is guaranteed to work if the time horizon , which is an input parameter, exceeds a function of , while the algorithm in [54] requires an explicit lower bound on .
The paper [36] considers a distributed bandit setting where agents communicate arm means using a consensus algorithm without any communication limitations, unlike our setting. There has also been a line of work ([39], [43], [5], [30], [10], [1]) where the agents are competitive, unlike our setting, and interact only indirectly by observing each others' rewards. The paper [49] considers a model with different arm means across agents; in each time step, a single action is taken by the network as a whole through a voting process, unlike our setting where each agent takes its own action. The paper [18] considers a single centralized learner playing multiple contextual bandit instances, where each instance corresponds to a user on a graph. The graph encodes interactions in which 'nearby' users have 'similar' contextual bandit instances, which differs from the interactions in our model. Recent works [51], [16] have considered the social learning problem where agents perform best-arm identification (simple regret). In these setups, the memory of an agent is limited, so standard bandit algorithms such as UCB are infeasible. Instead, agents resort to simpler algorithms such as replicator dynamics, and their algorithmic paradigm is therefore not applicable to our setting.
Developments in large-scale distributed computing are prompting the study of other learning questions in a decentralized setting. For instance, [22], [48], [41], [12], [44], [50] study multi-agent convex optimization with gossip-style communications. More classically, gossip-based computation models have a rich history under the names of population protocols [2] and rumor spreading ([21], [32]). We refer the reader to [47] and related references for other applications of the gossip mechanism.
8 Conclusion and Open Problems
In this paper, we study a problem of collaborative learning in which a group of agents plays the same instance of the MAB problem. We demonstrate that even with limited collaboration, the per-agent regret is much smaller than in the case when agents do not collaborate. The paper, however, motivates several open questions. An immediate question is how to design an algorithm in which the agents are not aware of the arm gap . This is particularly challenging since an agent then does not know when to make recommendations; i.e., agents must balance best-arm identification against minimizing cumulative regret. Even the state-of-the-art best-arm identification algorithms in a networked setting need knowledge of ([27], [54]).
Another question arising from our work is to understand other algorithmic paradigms for exploiting collaboration. In this paper, we considered the scenario where an agent plays only from among the arms it is aware of, so collaboration is key to expanding the set of arms an agent is aware of. Are there natural protocols where the set of arms an agent is aware of is modeled in a 'soft' fashion, with agents preferring to play arms that have been recommended to them more often over arms recommended fewer times? This is a challenging problem, both from an algorithm-design perspective and from a mathematical standpoint. Third, can Theorem 27 be tightened to obtain precise limit theorems similar to those in [24] and [42]? Such a result would help reduce the constants in the definitions of and .
Acknowledgements: This work is partially supported by NSF Grant CNS-1704778, ARO grant W911NF-17-1-0359 and the US DoT supported D-STOP Tier 1 University Transportation Center. AS acknowledges several stimulating discussions on the model with Rajat Sen, Soumya Basu and Karthik Abinav Sankararaman. AS also thanks François Baccelli for the support and generous funding through the Simons Foundation Grant (197892 to The University of Texas at Austin).
References
 [1] Animashree Anandkumar, Nithin Michael, Ao Kevin Tang, and Ananthram Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.
 [2] James Aspnes and Eric Ruppert. An introduction to population protocols. In Middleware for Network Eccentric and Mobile Applications, pages 97–120. Springer, 2009.
 [3] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
 [4] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 [5] Orly Avner and Shie Mannor. Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 66–81. Springer, 2014.
 [6] Orly Avner and Shie Mannor. Multi-user lax communications: a multi-armed bandit approach. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016.
 [7] Baruch Awerbuch and Robert D Kleinberg. Competitive collaborative learning. In International Conference on Computational Learning Theory, pages 233–248. Springer, 2005.
 [8] R. Baraglia, P. Dazzia, M. Mordacchini, and L. Riccia. A peertopeer recommender system for selfemerging user communities based on gossip overlays. Journal of Computer and System Sciences, 79:291 – 308, March 2013.
 [9] Lilian Besson and Emilie Kaufmann. What doubling tricks can and can't do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.
 [10] Ilai Bistritz and Amir Leshem. Distributed multi-player bandits - a game of thrones approach. In Advances in Neural Information Processing Systems, pages 7222–7232, 2018.
 [11] Edward Boon, Leyland Pitt, and Esmail SalehiSangari. How to manage information sharing in online marketplaces – an exploratory study. In Ideas in Marketing: Finding the New and Polishing the Old, pages 538–541. Springer International Publishing, 2015.
 [12] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
 [13] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 [14] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412:1832–1852, April 2011.
 [15] Swapna Buccapatnam, Jian Tan, and Li Zhang. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 2605–2613. IEEE, 2015.
 [16] L Elisa Celis, Peter M Krafft, and Nisheeth K Vishnoi. A distributed learning dynamics in social groups. arXiv preprint arXiv:1705.03414, 2017.
 [17] Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 49:605–622, 2016.
 [18] Nicolò Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. In Advances in Neural Information Processing Systems, pages 737–745, 2013.
 [19] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In IJCAI, pages 164–170, 2017.
 [20] Francesco Colace, Massimo De Santo, Luca Greco, Vincenzo Moscato, and Antonio Picariello. A collaborative usercentered framework for recommending items in online social networks. Computers in Human Behavior, 51:694–704, 2015.
 [21] Alan Demers, Dan Greene, Carl Houser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. Epidemic algorithms for replicated database maintenance. ACM SIGOPS Operating Systems Review, 22(1):8–32, 1988.
 [22] John C Duchi, Sorathan Chaturapruek, and Christopher Ré. Asynchronous stochastic convex optimization. arXiv preprint arXiv:1508.00882, 2015.
 [23] Glenn Ellison and Drew Fudenberg. Word-of-mouth communication and social learning. The Quarterly Journal of Economics, 110(1):93–125, 1995.
 [24] Alan M Frieze and Geoffrey R Grimmett. The shortest-path problem for graphs with random arc-lengths. Discrete Applied Mathematics, 10(1):57–77, 1985.
 [25] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
 [26] Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1):3367–3402, 2015.
 [27] Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 854–862, 2013.
 [28] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
 [29] Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–6. IEEE, 2014.
 [30] Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. Decentralized learning for multi-player multi-armed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.
 [31] Varun Kanade, Zhenming Liu, and Bozidar Radunovic. Distributed non-stochastic experts. In Advances in Neural Information Processing Systems, pages 260–268, 2012.
 [32] Richard Karp, Christian Schindelhauer, Scott Shenker, and Berthold Vöcking. Randomized rumor spreading. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 565–574. IEEE, 2000.
 [33] Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Trans. Netw., 26(4):1782–1795, August 2018.
 [34] Peter M Krafft, Julia Zheng, Wei Pan, Nicolás Della Penna, Yaniv Altshuler, Erez Shmueli, Joshua B Tenenbaum, and Alex Pentland. Human collective intelligence as distributed bayesian inference. arXiv preprint arXiv:1608.01987, 2016.
 [35] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
 [36] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed cooperative decision-making in multi-armed bandits: Frequentist and Bayesian algorithms. arXiv preprint arXiv:1606.00911, 2016.
 [37] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, 2018.
 [38] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 539–548. ACM, 2016.
 [39] Haoyang Liu, Keqin Liu, Qing Zhao, et al. Learning in a changing world: Restless multi-armed bandit with unknown dynamics. IEEE Transactions on Information Theory, 59(3):1902–1916, 2013.
 [40] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic multi-armed bandits. arXiv preprint arXiv:1810.04468, 2018.
 [41] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multiagent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
 [42] Boris Pittel. On spreading a rumor. SIAM Journal on Applied Mathematics, 47(1):213–223, 1987.
 [43] Jonathan Rosenski, Ohad Shamir, and Liran Szlak. Multi-player bandits - a musical chairs approach. In International Conference on Machine Learning, pages 155–163, 2016.
 [44] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. arXiv preprint arXiv:1702.08704, 2017.
 [45] Yevgeny Seldin, Peter L Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multi-armed bandits with paid observations. In ICML, pages 280–287, 2014.
 [46] Prabodini Semasinghe, Setareh Maghsudi, and Ekram Hossain. Game theoretic mechanisms for resource management in massive wireless iot systems. IEEE Communications Magazine, 55(2):121–127, 2017.
 [47] Devavrat Shah. Gossip algorithms. Foundations and Trends® in Networking, 3(1):1–125, 2009.
 [48] Shahin Shahrampour and Ali Jadbabaie. Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control, 63(3):714–725, 2018.
 [49] Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Multi-armed bandits in multi-agent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 2786–2790. IEEE, 2017.
 [50] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. Extra: An exact firstorder algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
 [51] Lili Su, Martin Zubeldia, and Nancy Lynch. Collaboratively learning the best option on graphs, using bounded local memory. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(1):11, 2019.
 [52] K. Sugawara, T. Kazama, and T. Watanabe. Foraging behavior of interacting robots with virtual pheromone. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3074–3079, 2004.
 [53] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [54] Balázs Szörényi, Róbert Busa-Fekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, pages 1056–1064. International Machine Learning Society, 2013.
 [55] Cem Tekin, Simpson Zhang, and Mihaela van der Schaar. Distributed online learning in social recommender systems. IEEE Journal of Selected Topics in Signal Processing, 8(4):638–652, 2014.
 [56] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Appendix A Analysis of the Algorithm
We analyze the regret experienced by a fixed but arbitrary agent . Recall the definition, given in Section 5, of when we call the early phase of our system Good. Observe that if the system is Good, then every agent will be aware of the best arm in phase . Thus, conditional on the event Good, all agents will start playing the best arm in phases and above. For ease of notation, we write for in the rest of the proof.
Lemma 7.
A sample path is Good with probability at least .
The proof of this lemma is deferred to Section B. To carry the analysis further, we need two classical results from the study of multi-armed bandits (MAB) [4, 14].
Proposition 8.
[4] Consider playing the UCB() algorithm for time steps of a armed MAB. The regret is upper bounded by .
We also need another result from the literature [14], which we reproduce here for completeness.
Proposition 9.
[14] Consider a MAB problem with arms, played with the UCB strategy. The probability that after time steps the best arm is not the most played arm is at most , for all such that .
Remark 10.
The constant is chosen such that , and hence the preceding error bounds apply to all agents in phases and beyond.
To carry out the analysis, we define a few additional random variables. Denote by the number of epochs of agent that elapse before agent becomes aware of the best arm (i.e., the arm indexed ). Recall that agent has the best arm in its set at time , i.e., . Denote by the random variable the first phase of agent in which agent communicates the best arm to the agent in consideration. In other words, is a random variable denoting the earliest late-phase state of agent such that , i.e., agent holds the best arm as its opinion and communicates this opinion to agent while in phase . Denote by the state in which agent receives the best arm for the first time as a recommendation from another agent.
Proposition 11.
Proof.
The time is clearly upper bounded by the time agent takes to spread the best arm directly to agent . By the definition of the random variable , this happens at some point while agent is in state . Conditional on , the number of epochs agent takes to reach the end of phase (which coincides with the beginning of phase ) is . Now, in phase , the average number of epochs agent takes to communicate its opinion to agent is at most . This is at most , since conditional on , agent communicates the best arm within a deterministic number of epochs, and the mean of a geometric random variable conditioned on being smaller than a fixed deterministic constant is at most its unconditional mean. Hence, within an additional average of epochs of agent in phase , it communicates the best arm to agent . ∎
Proposition 12.
For all , we have
where . Here the empty product .
Proof.
For the event to occur, in every phase we must have either had the opinion , or agent failing to communicate the best arm to agent in phase . Additionally, in phase , the opinion must correspond to the best arm and agent must have communicated it to agent in its th phase. Since we seek an upper bound on the probability, we may assume that agent is aware of all arms in all its late stages; this yields the largest probability that the opinion of agent in a late phase differs from the best arm. From Proposition 9, the probability that agent holds an opinion in phase different from the best arm is at most . Similarly, the probability that agent fails to communicate the best arm to agent in attempts is at most . Thus, the probability that agent fails to inform agent of the best arm while agent is in phase is at most . The result then follows from the independence of the opinions and the communication recipients of agent across different phases and epochs. ∎
Notice immediately that , and thus the algorithm ensures that agent (and, by symmetry, all agents) will eventually be aware of the best arm with probability . However, we also want agents to become aware of the best arm 'soon' on average, which is the subject of the following computations.
Proposition 13.
Proof.
Conditional on the system being Good, all agents are aware of the best arm before any agent moves into the late phase. Since every agent moves into the late phase after epochs, the first inequality follows.
For the second equation, we proceed as follows. We upper bound the number of epochs by the number of epochs agent takes to spread the best arm to agent during the late phase of agent . Conditional on the event not Good, we use a worst-case upper bound in which agent plays among all arms in all its late phases. Agent moves into the late phase after clock epochs. It thus remains to compute the average number of epochs agent takes before finishing phase of the late stage. For any late phase , Proposition 12 gives a bound on . On the event , agent takes a total of epochs to move from the beginning of phase to the beginning of phase . Moreover, once in phase , agent spreads its opinion to agent within at most epochs on average. This follows since each recipient of a recommendation is chosen uniformly at random, independently of everything else, so the average number of epochs required to contact agent is . Moreover, we know that agent communicates with agent within epochs; this conditioning only reduces the average number of epochs required below . Thus, the expected number of epochs for agent to get from the beginning of phase to the beginning of phase is at most
From Proposition 12, we can bound the last series sum term as
Making this sum convergent is precisely why agents communicate  times in phase  in our algorithm: it makes the error probability  decay doubly exponentially, which renders the above sum convergent. ∎
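The gossip step underlying this argument is that each recommendation goes to a recipient chosen uniformly at random, so the number of epochs before a fixed agent is contacted is geometric. The following minimal sketch (all names and the agent indexing are our own illustration, not part of the algorithm's specification) simulates this waiting time and matches the geometric mean of  contacts:

```python
import random

def epochs_to_reach(n_agents: int, target: int, trials: int = 20000, seed: int = 0) -> float:
    """Average number of epochs until agent 0's uniformly random
    recommendations first reach a fixed target agent.

    Each epoch, agent 0 picks a recipient uniformly among the other
    n_agents - 1 agents; the wait is geometric with mean n_agents - 1.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        epochs = 0
        while True:
            epochs += 1
            # recipient chosen uniformly among agents 1 .. n_agents - 1
            recipient = rng.randrange(1, n_agents)
            if recipient == target:
                break
        total += epochs
    return total / trials
```

With 10 agents the empirical mean concentrates near 9, the mean of a geometric random variable with success probability 1/9, which is the quantity used to bound the spreading time above.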
As a corollary of the above statement, we get the following.
Corollary 14.
Denote by  the number of epochs of agent  before it is aware of the best arm. Then,
Proof.
Observe that the clock processes across agents are i.i.d. The random variable  is independent of the clock process . More importantly, the random variable  is independent of the inter-epoch duration process of , and only depends on the randomness of the independent marks of . Since, for any random variable  such that  is independent of  and , the expected number of epochs in  while  epochs occur in  is , the proof follows from Proposition 13. ∎
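The independence step above can be checked numerically. In the sketch below (a toy model under our own assumptions: two i.i.d. rate-1 Poisson clocks standing in for two agents' clock processes), the expected number of B-epochs elapsing while A accumulates a fixed number of epochs equals that same number:

```python
import random

def epochs_of_b_during_a(m_epochs: int, trials: int = 5000, seed: int = 1) -> float:
    """Average number of B-epochs occurring while an independent,
    identically distributed clock A accumulates its first m_epochs epochs.

    Both clocks are rate-1 Poisson processes (exponential inter-epoch
    durations), so the answer should concentrate near m_epochs.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # time for clock A to complete m_epochs ticks
        t_a = sum(rng.expovariate(1.0) for _ in range(m_epochs))
        # count clock-B ticks occurring before time t_a
        t, count = 0.0, 0
        while True:
            t += rng.expovariate(1.0)
            if t > t_a:
                break
            count += 1
        total += count
    return total / trials
```

This mirrors the statement in the proof that  epochs of one clock correspond, on average, to  epochs of an independent identically distributed clock.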
Proposition 15.
Denote by  the random variable denoting the phase of agent  when agent  receives the best arm. Then  and .
Proof.
From the definition of , we know from Corollary 14 that . For any deterministic , denote by  the state of any agent after  epochs. From the description of the algorithm, we have
It is easy to verify that, for , . Thus, after a random number of  epochs, we have
The inequality follows from the fact that  for any nonnegative function . ∎
Proposition 16.
For all agents , we have .
Proof.
Conditional on the event Good, we know that the best arm is played by all agents in the late phase. For any agent and any phase , we can bound the error probability as
The second inequality above follows from the fact that . By setting , we get that . The result follows from a simple series bound. ∎
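The "simple series bound" invoked here, like the sum in Proposition 13, relies on the per-phase error probability decaying doubly exponentially. As an illustration (the constant  and the exact exponent are our own stand-ins, since the paper's expressions are not reproduced here), the tail of such a series can be summed directly and contrasted with a singly exponential decay:

```python
import math

def union_bound_tail(c: float, j0: int = 1, j_max: int = 60) -> float:
    """Sum of doubly-exponentially decaying error probabilities
    exp(-c * 2**j) for phases j = j0 .. j_max.

    The terms vanish so fast that the sum is dominated by its first
    term; a singly exponential decay exp(-c * j) would give a much
    larger geometric tail.
    """
    return sum(math.exp(-c * 2 ** j) for j in range(j0, j_max + 1))
```

For comparison, with c = 1 the doubly exponential tail is about 0.154, while the geometric series of singly exponential terms exp(-j) sums to about 0.582.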
As a consequence of the above proposition, we obtain the following result.
Proposition 17.
For any , we have .
Proof.
We have the basic decomposition.
where in the second step we use the bounds  and , and the result of Lemma 7 to bound . Thus it remains to compute  to complete the proof.
At the beginning of any phase,  is  (the initial number of arms per agent) plus the number of distinct arm ids received by agent  up to the end of phase . Conditional on the event Good, we know that agent  receives no more than  arms from all other agents while those agents are in phase . Furthermore, conditional on the event Good, all agents have the best arm when they move to phase . It thus remains to compute the expected number of arms received by agent  when the recommending agent is in a phase greater than or equal to . From Proposition 16, we know that with probability at least , no agent recommends an arm different from the best arm in any late phase. A total probability argument then gives
where we assume the trivial upper bound of , in the case that any agent in the late phase recommends an arm different from the best arm. ∎
Equipped with the above set of results, we are now ready to prove Theorem 1, on the regret experienced by agent .
Proof.
The regret of agent  after  epochs can be decomposed into three terms:

A regret of at most  for the  epochs in the early stage of agent .

The regret due to the UCB algorithm in the late stage of an agent. Here the number of arms played by agent  differs across the late-stage phases and is random.

An additional regret, if any, paid until agent  is aware of the best arm in the late stage.
The total regret, by linearity of expectation, is at most the sum of the above three regret terms.
Term : All agents pay a regret no larger than in their early phase.
Term : To bound this term, we need some notation. Denote by  a sequence , where  and , for . Notice that any agent plays for  durations in the phase numbered . For any , denote by  the last full phase played by agent , i.e., . It is immediate to observe that . We will thus bound the regret as the sum of the regret experienced by agent  in the first  phases of the late stage.
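The notion of the "last full phase" within a horizon of  epochs can be made concrete with a small sketch. The phase lengths below are a hypothetical stand-in (geometrically doubling, starting from a base length), since the paper's actual sequence is defined by the elided recursion; only the bookkeeping of cumulative phase lengths is illustrated:

```python
def last_full_phase(t: int, base_len: int = 1) -> int:
    """Index of the last phase fully completed within t epochs,
    assuming (hypothetically) that phase j lasts base_len * 2**j epochs.

    Returns -1 if not even phase 0 completes within t epochs.
    """
    elapsed, j = 0, 0
    while elapsed + base_len * 2 ** j <= t:
        elapsed += base_len * 2 ** j
        j += 1
    return j - 1
```

With base length 1, the cumulative phase lengths are 1, 3, 7, 15, …, so a horizon of 7 epochs completes exactly phases 0, 1, and 2, while a horizon of 6 epochs only completes phases 0 and 1; bounding the regret over the first completed phases is exactly the decomposition used for Term  above.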