Multiuser lax communications:
a multiarmed bandit approach
Abstract
Inspired by cognitive radio networks, we consider a setting where multiple users share several channels, modeled as a multiuser multiarmed bandit (MAB) problem. The characteristics of each channel are unknown and differ between users. Each user can choose between the channels, but her success depends on the particular channel chosen as well as on the selections of other users: if two users select the same channel, their messages collide and neither of them manages to send any data. Our setting is fully distributed, so there is no central control. As in many communication systems, the users cannot set up a direct communication protocol, so information exchange must be limited to a minimum. We develop an algorithm for learning a stable configuration for the multiuser MAB problem. We further offer both convergence guarantees and experiments inspired by real communication networks, including a comparison to state-of-the-art algorithms.
1 Introduction
The inspiration for this paper comes from the world of distributed multiuser communication networks, such as cognitive radio networks. These networks consist of a set of communication channels with different characteristics, and independent users whose goal is to transmit over these channels as efficiently as possible.
Modern networks, such as cognitive radio networks, must cope with several challenges. First and foremost, the networks’ distributed nature prohibits any form of central control. In addition, many users operate on an “ad hoc” basis, which prevents them from establishing inter-user communication. In fact, they probably do not even know how many users share their network.
On top of these issues of multiuser coordination, the channel characteristics may be initially unknown, and differ between users. Thus, learning must be integrated into the solution.
1.1 Cognitive radio networks
Cognitive Radio Networks (CRNs), introduced in [1], have attracted considerable attention in recent years. The idea that lies at the heart of CRNs is that advanced sensing mechanisms and increased computation power may enable radio devices to dramatically improve their performance in terms of resource utilization, resilience and more. Networks of such users are usually dynamic and stochastic, giving rise to many interesting problems [2, 3]. We focus on developing a sensing and transmission scheme that enables users to learn a stable, orthogonal configuration without communicating directly.
1.2 Multiarmed bandits
A well-known framework for learning in CRNs is the classical Multi-Armed Bandit (MAB) model. MABs offer a simple, intuitive framework for learning the characteristics of a number of unknown options in an online manner, while balancing exploration and exploitation. A MAB problem consists of a single user repeatedly choosing between arms with different characteristics, which are initially unknown. After every round, the user acquires a reward that depends on the arm she chose. Her goal in most setups is to maximize the expected sum of rewards acquired over time.
As suggested in [4], the channels of a CRN are naturally cast as the arms of a bandit, with different performance measures (bandwidth, ACK signals, bit rate) serving as the reward.
Many papers propose solutions for the stochastic MAB problem (see, e.g., [5, 6, 7]) and its adversarial version (see, e.g., [8]), but they all assume a single user is sampling the arms of the bandit.
However, this assumption does not apply in multiuser networks. In the multiuser MAB model, users compete over the arms of the same bandit. As a result, they are bound to experience collisions (i.e., multiple users sampling the same arm), unless they employ some form of collision avoidance or coordination mechanism. Collisions in communication networks result in performance degradation, corresponding to reward loss in the MAB model. In order to avoid reward loss, the presence of multiple users must be addressed. We survey several approaches to this issue in Section 1.4.
1.3 Extension of the CRNMAB setting
The novelty introduced in our paper lies in the combination of bandit learning, multiple users, different reward distributions for different users and no direct communication. The combination of the last two demands, different distributions and no direct communication, poses a real challenge.
As explained in detail in Section 2.3 and in Section 2.4, the only thing we can guarantee in terms of network behavior in this setup is stability. In a dynamic, distributed network, stability is of great value. Once a network has reached a stable configuration, users can focus on utilizing its resources, rather than engaging in coordination or learning efforts; a stable network is more robust and efficient.
Reaching stability is a nontrivial task, since users must learn their channel characteristics while coordinating their actions with the other users, based on very limited observations.
1.4 Previous work
We now present several approaches to the CRNMAB problem, coming from different areas and disciplines.
Our problem may be viewed as an assignment problem, i.e., maximum weight matching in a weighted bipartite graph. Users correspond to agents, channels to tasks, and rewards are simply the complement of the costs of graph edges. Several papers have been published on the distributed assignment problem, but to the best of our knowledge none of them offers a solution for our problem. The well-known Hungarian method [9] requires full knowledge of the graph (i.e., channel characteristics) and assumes the existence of central control. The Bertsekas auction algorithm [10] frees us from the need for central control, at the cost of direct communication between nodes. The classical Gale-Shapley algorithm [11] solves the problem of finding a stable marriage configuration, but does not take the need to learn into account. Some papers have actually applied it to CRNs, but not in the learning context [12, 13]. Another work on distributed stable marriage, which makes use of a variant of the Gale-Shapley algorithm, is [14]. While its setting is quite different from ours, the potential function defined in that paper is helpful in our analysis. Another noteworthy work in this context is [15]. The authors address the challenge of limiting communication between nodes to a minimum, and propose two communication models. Nevertheless, they allow more communication than we would like, and their formulation does not consider learning. Two additional results that deal with distributed stable marriage offer lower bounds and state that some form of information exchange is inevitable when solving such problems [16, 17].
The papers closest to ours in spirit are those dealing with multiuser MABs. There has been work on the case of reward distributions that do not vary between users, such as [18] and [19]. The latter introduces an algorithm that is able to cope with a variable number of users. Another paper that addresses different reward distributions for different users is [20]. Here, the authors employ the Bertsekas auction algorithm. This approach enables users to reach a reward-maximizing solution, at the price of direct, frequent communication between themselves. We further elaborate on the difference between our approach and the approach of [20] in Section 6.
We would also like to point out that communication between users is undesirable not only because of its price in terms of network resources and time. Once users depend on communication, they are more vulnerable to intentional attacks that may disrupt it, as well as to noise bursts, which are common in CRNs.
2 Model and formulation
We now describe the model, the assumptions accompanying it and our goal.
2.1 System and users
We model a communication network with $K$ channels, servicing $N$ independent users. Our work is based on the assumption that $K \geq N$, which is reasonable since without it, implementing a time division based mechanism is necessary. Once such a mechanism is applied, the assumption that $K \geq N$ is valid again. Time is slotted and users’ clocks are synchronized, which is also a mild assumption for modern communication systems.
The communication network consists of $K$ channels, where only one user can transmit over a certain channel during a single time slot. Each transmission yields a reward, which we assume to be stochastic.
The users are a group of independent, selfish agents. Their observations are local, consisting only of the history of their actions and rewards. In addition, they do not know the number of users they share a network with. There is no central control managing their use of the network, and they do not have direct communication with each other.
A key characteristic of our model is that the expected reward a channel yields depends not only on the identity of the channel, but also on the identity of the user. Formally, the rewards of the channels are Bernoulli random variables with expected values $\mu_{n,k}$, where $n \in \{1,\dots,N\}$ indexes the users and $k \in \{1,\dots,K\}$ indexes the channels. This property reflects the fact that in real life users may experience location-based disturbances, manifested in different reward distributions for the same channel.
We model the users’ sharing of resources by representing the communication network as a single bandit. This means that two users attempting to access the same channel at the same time will experience a collision. In our model, the result of a collision is complete loss of communication for that time slot for the colliding users, i.e., zero reward. A user $n$ that accesses a channel $k$ alone during a certain time slot will receive a reward drawn i.i.d. from a Bernoulli distribution with expected value $\mu_{n,k}$. Throughout the paper, we use the term configuration to refer to a mapping of users to channels.
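The collision model can be illustrated with a minimal sketch of a single time slot. The names `play_slot`, `actions` and `mu` are hypothetical and do not appear in the paper; `mu[n][k]` is assumed to hold the expected reward of channel `k` for user `n`.

```python
import random

def play_slot(actions, mu, rng=random):
    """Return the reward of every user for one time slot.

    actions[n] is the channel chosen by user n (hypothetical name);
    mu[n][k] is the expected reward of channel k for user n.
    """
    # Count how many users chose each channel.
    counts = {}
    for k in actions:
        counts[k] = counts.get(k, 0) + 1

    rewards = []
    for n, k in enumerate(actions):
        if counts[k] > 1:
            rewards.append(0)  # collision: zero reward for all colliding users
        else:
            # Lone user: Bernoulli reward with mean mu[n][k].
            rewards.append(1 if rng.random() < mu[n][k] else 0)
    return rewards
```

With deterministic means (0 or 1) the outcome is fully determined, which makes the collision rule easy to check by hand.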
2.2 Limited coordination
In an effort to keep our model faithful to real-world CRNs, we limit the coordination between users to a minimum. Thus, users can only transmit in a channel of their choice, or sense the spectrum range and receive binary feedback regarding all channels at time $t$. A “0” represents no transmission in a channel, while a “1” stands for the opposite.
2.3 Reward maximizing solution
We adopt a system-wide view for characterizing the optimal solution. The optimal configuration must be orthogonal (i.e., no more than one user per channel), in order to avoid collisions and the resulting reward loss. One common approach seeks to maximize the sum of rewards over all users, over time. The assignment of users to channels is chosen accordingly:
$$a^{*} = \arg\max_{a \in \mathcal{A}} \sum_{n=1}^{N} \mu_{n,a(n)},$$
where $\mu_{n,k}$ is the expected reward of user $n$ on channel $k$, $a(n)$ is the channel assigned to user $n$, and $\mathcal{A}$ is the set of all possible permutations of subsets of size $N$ chosen without replacement from the set $\{1,\dots,K\}$.
However, reaching such a solution requires frequent information exchange. Assume channel $k$ is optimal for two different users $n$ and $m$, but $\mu_{n,k} > \mu_{m,k}$. To maximize the system-wide reward, user $m$ must step down and choose a different channel. The lack of central control requires explicit information exchange regarding the values of $\mu_{n,k}$ and $\mu_{m,k}$, for $n$ and $m$ to decide which of them should step down. Since the reward estimates are updated as time goes by, such preferences must be communicated repeatedly.
Due to limited information exchange, a reward-maximizing solution cannot be guaranteed in our setup. We therefore focus on convergence to a stable, orthogonal configuration.
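For intuition, the reward-maximizing assignment can be found by brute force when the true expected rewards are known centrally, which is exactly the knowledge our users lack. The following is a minimal sketch under that assumption; the function name `best_assignment` and the matrix `mu` are hypothetical, and the enumeration is only practical for small systems.

```python
from itertools import permutations

def best_assignment(mu):
    """Exhaustively search for the orthogonal assignment with maximal
    sum of expected rewards.

    mu[n][k] is the (assumed known) expected reward of channel k for
    user n. Returns (assignment, value), where assignment[n] is the
    channel given to user n.
    """
    n_users, n_channels = len(mu), len(mu[0])
    best, best_value = None, float("-inf")
    # Every ordered choice of n_users distinct channels is one orthogonal assignment.
    for assignment in permutations(range(n_channels), n_users):
        value = sum(mu[n][k] for n, k in enumerate(assignment))
        if value > best_value:
            best, best_value = assignment, value
    return best, best_value
```

For example, with `mu = [[0.9, 0.8, 0.1], [0.9, 0.2, 0.1]]` both users prefer channel 0, and the search assigns it to the second user while the first steps down to channel 1. Deciding this in a distributed system would require the two users to compare their reward values.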
2.4 Stable marriage solution
Our goal is to develop policies that will lead users to a stable configuration. We employ the notion of stable marriage to formally define stability:
Definition 1.
A Stable Marriage Configuration (SMC) is an assignment of users to channels such that no two users would be willing to swap channels, had they known the true values of the expected rewards. Formally, for a pair of users $n$, $m$:
user $n$ would like to swap $\iff \mu_{n,a_m} > \mu_{n,a_n}$,
where $a_n$ and $a_m$ are the users’ current actions. In an SMC, there is no pair of users $n$, $m$ in which both users would like to swap.
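Definition 1 can be turned into a direct check. The sketch below assumes the mutual-consent reading of the definition (a swap is blocked unless both users strictly gain, as judged by the true expected rewards); the names `wants_swap`, `is_smc`, `mu` and `actions` are hypothetical.

```python
def wants_swap(mu, actions, n, m):
    """True if user n strictly prefers user m's channel to her own,
    judged by the true expected rewards mu[n][k]."""
    return mu[n][actions[m]] > mu[n][actions[n]]

def is_smc(mu, actions):
    """True if no pair of users would both like to swap channels.

    actions[n] is the channel currently held by user n; the
    configuration is assumed to be orthogonal.
    """
    users = range(len(actions))
    return not any(
        wants_swap(mu, actions, n, m) and wants_swap(mu, actions, m, n)
        for n in users
        for m in users
        if n < m
    )
```

With `mu = [[0.2, 0.9], [0.8, 0.3]]`, the configuration `[0, 1]` is not stable, since each user prefers the other's channel, whereas `[1, 0]` is.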
2.5 Goal
3 Coordination protocol
Our coordination protocol balances the limitations of Section 2.2 with the users’ need for information exchange by introducing a signalling mechanism between pairs of users. At predefined time slots, a user wishing to occupy a channel may transmit in that channel to express her wish. In order to ensure that this signal is received by the user currently occupying the channel, we employ a frame-based protocol. We assume users can transmit and sense at the same time, a reasonable requirement in modern communication systems.
The following explanation is best understood by observing Figure 1. Our protocol divides time into super frames of length . Each super frame begins with a pair of time slots, and , during which a single signalling user, the initiator, is coordinated for the entire super frame. The procedure is described in Algorithm 4 and in Figure 2. Next come miniframes of two time slots each, denoted by and . Each of these miniframes corresponds to one channel on the initiator’s list of preferred channels. Thus, a single super frame enables one user to go over her entire preference list and signal other users, suggesting they swap channels with her, as explained in Figure 3.
The time slots marked allow users not participating in the coordinating process during a certain miniframe to sample their current channel and proceed with the learning-while-transmitting process. Thus, all but two users (initiator and responder) gather a sample during each miniframe, resulting in at least samples for each of the users, except for the initiator, over each super frame.
4 The CSMMAB algorithm
We now turn to a full description of our algorithm, the Coordinated Stable Marriage Multi-Armed Bandit (CSMMAB) algorithm. We propose a user-level algorithm for a fully distributed system, whose goal is described in Section 2.5. When all users in the network apply CSMMAB, the assignment of users to channels is guaranteed to be orthogonal, and converges to an SMC.
Our algorithm begins with a start up phase, during which users transmit and sense to detect collisions, in order to reach an initial orthogonal configuration (line 1). This phase follows the lines of the CFL algorithm introduced in [21], and converges quickly. Once an initial orthogonal configuration has been reached, users start executing the CSMMAB algorithm, described in Figure 4.
At the beginning of each super frame, users execute the rank_channels procedure to individually create a list of channels they prefer over their current action (line 4). Channels are assigned values according to their UCB indices, calculated using the well-known formula from [5]:
$$I_{n,k}(t) = \hat{\mu}_{n,k}(t) + \sqrt{\frac{2 \ln t}{s_{n,k}(t)}} \qquad (1)$$
where $\hat{\mu}_{n,k}(t)$ is the empirical mean of the reward acquired by user $n$ on channel $k$ up till time $t$ and $s_{n,k}(t)$ is the number of times she sampled arm $k$ up till time $t$.
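As an illustration, the index can be computed as follows. This is a minimal sketch assuming the standard UCB1 form of [5] (empirical mean plus the exploration bonus sqrt(2 ln t / s)); the function name and the convention of returning infinity for an unsampled arm are assumptions made for the example.

```python
import math

def ucb_index(mean_reward, num_samples, t):
    """UCB1-style index of one channel for one user.

    mean_reward: empirical mean reward on the channel so far.
    num_samples: number of times the channel was sampled so far.
    t:           current time (t >= 1).
    """
    if num_samples == 0:
        # Assumed convention: an unsampled arm is maximally attractive.
        return math.inf
    return mean_reward + math.sqrt(2.0 * math.log(t) / num_samples)
```

The exploration bonus shrinks as a channel is sampled more often and grows slowly with time, so rarely sampled channels eventually climb the ranking.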
Next, the users coordinate an initiator according to the scheme in Figure 2. Every user who would like to improve upon her current channel presents herself as the initiator with a probability of (lines 5-11). An agreed initiator for the super frame (SF) emerges if and only if exactly one user raises her flag (the value of is chosen in order to maximize the probability of this occurring). Once a single initiator is agreed upon, all users take note of her current channel, based on their sensing. They will need this knowledge to decide whether to accept her swapping suggestion.
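To see why the flag-raising probability matters, consider the chance that exactly one of several contending users raises her flag. The sketch below is a generic calculation, not the paper's exact choice: `num_contenders` and `p` are hypothetical parameters, and the contenders are assumed to raise their flags independently.

```python
def prob_single_initiator(num_contenders, p):
    """Probability that exactly one of num_contenders independent users
    raises her flag, when each does so with probability p."""
    return num_contenders * p * (1.0 - p) ** (num_contenders - 1)

# Grid search over p for four contenders: the maximum is at p = 1/4.
best_p = max((i / 100 for i in range(1, 100)),
             key=lambda p: prob_single_initiator(4, p))
```

For a known number of contenders m, this probability is maximized at p = 1/m. Since users in our setting do not know how many others are contending, any fixed choice of the probability is necessarily a compromise.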
The initiator proceeds to signal other users, based on her ranking of channels (lines 13-21). Signalling is implemented in propose_swap by transmitting in the initiator’s channel of interest. Each responder (i.e., signalled user) checks whether swapping channels with the initiator will improve her situation, based on her own ranking. Once a responder agrees, a swap takes place. No more signalling attempts are made till the end of the SF, and users simply continue sampling their chosen channels. If the responder refuses, the initiator will approach the next-best channel on her list. She will continue the process until she (a) finds a partner that agrees to swap; or (b) exhausts her list of potential swaps. This part of the algorithm is depicted in Figure 3.
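The signalling phase can be sketched as a loop over the initiator's preference list. This is a simplified illustration rather than the paper's Algorithm 4: the function name `run_signalling`, the `index` matrix (each user's current channel indices, e.g. UCB values) and the decision to skip unoccupied channels are all assumptions made for the example.

```python
def run_signalling(initiator, actions, index):
    """Return the configuration after one super frame of signalling.

    actions[n]  : channel currently held by user n (orthogonal).
    index[n][k] : user n's current index for channel k (hypothetical).
    """
    current = actions[initiator]
    # Channels the initiator ranks above her current one, best first.
    wish_list = sorted(
        (k for k in range(len(index[initiator]))
         if index[initiator][k] > index[initiator][current]),
        key=lambda k: index[initiator][k],
        reverse=True,
    )
    occupant = {k: n for n, k in enumerate(actions)}
    for k in wish_list:
        responder = occupant.get(k)
        if responder is None:
            continue  # assumption: free channels are not handled in this sketch
        # The responder agrees only if the initiator's channel looks better to her.
        if index[responder][current] > index[responder][k]:
            new_actions = list(actions)
            new_actions[initiator], new_actions[responder] = k, current
            return new_actions  # one swap per super frame, then stop
    return list(actions)  # list exhausted without an agreeing partner
```

Because a swap requires the consent of both parties, the configuration stays orthogonal throughout.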
5 Analysis
We will now show that the CSMMAB meets the goals defined in Section 2.5. Our main theoretical result is stated in Theorem 1.
Theorem 1.
Consider a system with $K$ channels and $N$ users, with channel rewards characterized by the matrix $\mu$. Applying CSMMAB (Algorithm 4) by all users will result in convergence to an orthogonal SMC: for all $\varepsilon > 0$ there exists $T_{\varepsilon}$ such that for all time slots $t > T_{\varepsilon}$, the probability of the system’s being in an SMC is at least $1 - \varepsilon$.
The proof of Theorem 1 consists of two aspects: orthogonality and stability. The first part is easy to verify.
Proposition 1.
The actions of users applying CSMMAB are orthogonal (i.e., there is at most one user sampling each channel) for all with probability of at least .
Proof.
Based on Theorem 1 of [21], the initial configuration reached after running the CFL algorithm is orthogonal with probability 1. The authors provide an upper bound on the distribution of stopping times, :
where and are some positive constants. The expected stopping time is therefore upper bounded by . Thus, setting , the probability of not having reached an orthogonal configuration by time is at most . Once the system reaches an orthogonal configuration, a user does not switch to an occupied channel without having coordinated the switch, as defined in Algorithm 4. ∎
5.1 Stability and potential
Showing that our system converges to a stable solution is more involved. We begin by defining a potential function for the problem. For any user $n$, the potential at time $t$ is defined as follows:
$$\phi_n(t) = \sum_{k=1}^{K} \mathbb{1}\left\{\mu_{n,k} > \mu_{n,a_n(t-1)}\right\} \qquad (2)$$
where $\mu_{n,k}$ is the expected reward of user $n$ on channel $k$ and $a_n(t-1)$ is the action taken by user $n$ in the previous time step. In words, the potential is the number of channels user $n$ would prefer over her current choice, had she known their true reward distributions. The system-wide potential is the sum of potentials over all users:
$$\Phi(t) = \sum_{n=1}^{N} \phi_n(t) \qquad (3)$$
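The potential is straightforward to compute when the true expected rewards are available, as in a simulation. A minimal sketch, with hypothetical names `user_potential`, `system_potential`, `mu` and `actions`:

```python
def user_potential(mu_row, action):
    """Number of channels a user would strictly prefer over her current
    one, judged by her true expected rewards mu_row[k]."""
    return sum(1 for value in mu_row if value > mu_row[action])

def system_potential(mu, actions):
    """Sum of the individual potentials over all users.

    mu[n][k] is the true expected reward of channel k for user n;
    actions[n] is the channel currently held by user n.
    """
    return sum(user_potential(mu[n], a) for n, a in enumerate(actions))
```

For `mu = [[0.9, 0.5, 0.1], [0.2, 0.8, 0.4]]`, the configuration `[2, 0]` places both users on their worst channel and has potential 4, while `[0, 1]` places both on their best channel and has potential 0.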
An illustration of the potential appears in Tables 1 and 2.
Table 1
1  1  2  4
2  2  1  1
3  4  3  2
4  3  4  3

Table 2
3  1  0
In terms of potential, a configuration is an SMC if no two users can swap channels and decrease their potential by doing so. We note that a stable configuration does not necessarily correspond to zero systemwide potential, since not all users might be able to achieve zero potential simultaneously, depending on network parameters. Also, a system may have several stable configurations, each characterized by a different potential. Nevertheless, observing a system’s potential does provide an indication regarding stability: once a system reaches a stable configuration, its potential will no longer change.
We prove convergence to an SMC by using the potential function, considering three aspects:

The maximal potential of a system with $K$ channels and $N$ users is finite and equal to $N(K-1)$.

The potential is monotonically non-increasing with high probability.

Until an SMC is reached, changes in potential are bound to happen within finite time.
We formalize and prove these statements in the sequel.
Since users’ decisions are guided by UCB indices, while stability is examined with respect to true reward distributions, users do not always update their choice of channels in a way that matches the ground truth. Thus, the system potential may occasionally increase, due to users’ exploration or inaccurate statistics. In our proof we show that despite this, users ultimately converge to a stable configuration.
5.2 Proof of Theorem 1
We begin by ensuring the monotonicity of the potential.
Lemma 1.
For all times for which , if a change in potential occurs, it is a decrease, with probability of at least .
is a distribution dependent constant. In the appendix we derive an upper bound on the minimal time for which the condition above holds:
(4) 
where . This bound will enable us to use in the proof.
Next, we introduce a lemma that concerns the ability of a single user to reach the position of the initiator.
Lemma 2.
If for some user , then her probability of becoming the next initiator is at least .
Using Lemma 2, we show another result:
Lemma 3.
If the system is not in an SMC at some time , then a change in the potential will occur within no more than time slots with probability of at least .
The exact dependency of on appears in the appendix, as do the proofs of all lemmas.
The probability of the system’s reaching an SMC within time slots after time is at least
We model the convergence to an SMC using a Markov chain. Let denote the state of the system at time :
The following holds for the chain’s transition probability:
and also
Defining completes the proof, and inverting yields
Our next result quantifies the time devoted to signalling.
Proposition 2.
In every superframe learning samples are gathered by all users combined. During this period signalling and sensing actions are performed by all users combined, so the signalling to learning ratio is
Clearly, the effort the users put into coordination is most effective when the number of users is close to the number of channels. This is a result of the frames’ length being dictated by the number of channels rather than the number of users, in order for the user-level algorithm to be independent of the number of users.
6 Experiments
To demonstrate the merits of our algorithm, we implement a simulation of a distributed multiuser communication network. The users in our network are synchronized, and time is slotted.
In this network, users cannot communicate with each other directly. However, they can sense the entire frequency range (i.e., listen to all channels). They may also transmit over a channel of their choice, updating this choice each time slot.
A user $n$ transmitting over a channel $k$ receives a binary reward, drawn i.i.d. from a Bernoulli distribution with parameter $\mu_{n,k}$. This can be viewed as a form of the classic binary symmetric channel. We ran experiments with two different modes of setting the reward parameters:

random: the ’s are drawn uniformly and independently from the interval .

real-world: users are divided into clusters, and each cluster has a preferred group of channels. This represents a scenario in which users sharing a cluster are geographically close, and experience interference in part of the frequency range. In real-world wireless communication systems, an agent that does not belong to the network but is transmitting in its vicinity will often cause a similar phenomenon.
We present results obtained in an experiment with channels and users. The users are divided into two clusters. Users 1 to 5 belong to one cluster, and experience interference in the frequency range of channels 7 to 12. Users 6 to 10, on the other hand, experience similar performance over the entire frequency range. Experiments last time slots, and results are averaged over 50 repetitions.
We begin by examining the cumulative number of policy changes per user over time, plotted in Figure 5 and in Figure 6. Since our goal is stability, we would like the number of policy changes to be small, and indeed the rate of changes decreases significantly over time. Another observation, demonstrated by the two figures, is that different users have different patterns, depending on the realization but more importantly on the difficulty of their problem: users that have small differences between channels will need more samples in order to tell them apart, and will therefore experience more policy changes.
Our next result examines the convergence to different SMCs over several repetitions of one setup. In this case, the set of SMCs consists of 305 configurations. Naturally, the size of this set depends on the number of users, $N$, the number of channels, $K$, and also on the specific realization of the $\mu_{n,k}$’s. Figure 7 shows that the periods of time users spend in unstable configurations decrease as the experiment advances, and users move between different SMCs, depending on the realization.
To complement our proof, we provide a visualization of the system potential over time, averaged over several repetitions, in Figure 8. As shown in the proof, the potential decays on average. The shaded area around the plot represents the variance over iterations, which also decays over time. As explained in Section 5.1, the potential does not necessarily decay to zero, but rather to a constant value that represents the potential of the SMC.
Our last result examines the reward acquired by users employing the CSMMAB algorithm. While our theoretical guarantees focus on stability, the algorithm incorporates reward maximization implicitly by using UCB indices to rank channels. However, as explained in Section 2.3, reaching a reward-optimal configuration cannot be guaranteed with the limited form of communication we allow. In Figure 9 we compare the cumulative system-wide reward of two algorithms: our CSMMAB and the dUCB4 algorithm, introduced in [20]. As explained in Section 1.4, dUCB4 incorporates an auction algorithm in order to achieve an orthogonal, reward-maximizing configuration.
The price of reward maximization is, clearly, communication, which our scheme attempts to bring to a minimum. In order to implement the auction algorithm required by dUCB4, users must have distinct IDs and knowledge of the number of users. This rather technical requirement hinders the ability of the algorithm to deal with a variable number of users. Our algorithm naturally extends to a scenario in which users arrive and leave at random times, which is quite likely in the context of CRNs. In addition, auction algorithms inherently rely on the good will of users, and are therefore more vulnerable to malicious agents (e.g., agents that report false high bids for attractive channels).
The results in Figure 9 demonstrate the tradeoff between communication and reward maximization: the time dUCB4 invests in auctioning is quite dominant. The two variants of the algorithm differ in the accuracy of the auctioning algorithm. The “dUCB4” variant (dotted red) uses 32 bits to encode variables, while the “dUCB4Long” variant (dashed magenta) uses 64 bits. Because of auctioning, it takes the algorithm a long time to turn its focus to reward maximization. In the high-accuracy case, the users exhaust all their time auctioning. In the low-accuracy case, they only begin acquiring rewards towards the end of the experiment. In real-world networks, with constantly changing conditions, such a long start-up phase is difficult to overlook. For the sake of example, let us examine an average 802.11n WLAN network, with a nominal frame size of 2000 bits and a typical bit rate of 25 megabits per second. The time slots it takes dUCB4 to start acquiring rewards are translated into a period of . This start-up phase doubles to over one minute when 64-bit accuracy is used for the auction algorithm. Of course, lighter schemes than 802.11 can be used, but these numbers clearly demonstrate the potentially crippling overhead brought on by communication.
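The conversion from time slots to wall-clock time in this example is simple arithmetic. The sketch below assumes that one time slot corresponds to the transmission of one nominal frame; the function names and the slot count passed in the usage line are hypothetical and are not the figures measured in our experiment.

```python
FRAME_BITS = 2000      # nominal frame size from the example above
BIT_RATE_BPS = 25e6    # 25 megabits per second

def slot_duration_s(frame_bits=FRAME_BITS, bit_rate_bps=BIT_RATE_BPS):
    """Duration of one time slot in seconds, assuming one frame per slot."""
    return frame_bits / bit_rate_bps

def elapsed_s(num_slots, frame_bits=FRAME_BITS, bit_rate_bps=BIT_RATE_BPS):
    """Wall-clock time, in seconds, spanned by num_slots time slots."""
    return num_slots * slot_duration_s(frame_bits, bit_rate_bps)

# Hypothetical usage: how long do 125,000 slots take?
example_seconds = elapsed_s(125_000)
```

Under these assumptions a single slot lasts 80 microseconds, so a start-up phase measured in hundreds of thousands of slots corresponds to tens of seconds of lost transmission time.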
We note that when the number of users $N$ is strictly less than the number of channels $K$, our algorithm often reaches the reward-optimal configuration, or a configuration very similar in reward values. Therefore, the variance of the cumulative reward is very small. Our intuitive explanation is that when $N < K$, users have a certain degree of freedom, which increases their chances of landing in the optimal configuration.
Despite reaching a configuration that is very close to optimal in the presented simulations, our algorithm acquires reward at a slower rate than dUCB4, due to the constant ratio of coordination and exploitation. Decreasing the amount of time devoted to coordination may considerably increase the reward, at the cost of impairing the algorithm’s ability to handle a variable number of users. We plan to address this issue in detail in the future.
7 Discussion
We present an extension of the multiuser MAB problem, for the case of different reward distributions between the users, with limited information exchange. Using a specialized signalling method, our algorithm enables multiple users to learn network characteristics and converge to an orthogonal configuration that is also a stable marriage. We provide a theoretical analysis of our algorithm’s performance, based on the notion of system potential. Finally, we present the results of an experimental setup and examine different aspects of our approach’s performance, including a comparison to the dUCB4 algorithm of [20]. As explained in Section 6 in further detail, the main difference between the algorithms is the way they strike a balance between minimizing communication and maximizing the reward. We argue that our algorithm is better suited for real world problems.
In the future we intend to extend our work to a dynamic scenario, both in terms of channel characteristics and number of users. The latter should be straightforward due to the minimal interdependency of users, while the former will require some adjustment of the learning algorithm. Another interesting variant, applicable to networks with a fixed number of users, alters the ratio between coordination and exploitation as time goes by, to enable better use of network resources.
References
 [1] J. Mitola and G. Maguire, “Cognitive radio: making software radios more personal,” Personal Communications, IEEE, 1999.
 [2] I. Akyildiz, L. Won-Yeol, M. Vuran, and S. Mohanty, “A survey on spectrum management in cognitive radio networks,” Communications Magazine, IEEE, vol. 46, no. 4, pp. 40–48, April 2008.
 [3] S. Haykin, “Cognitive radio: brain-empowered wireless communications,” Selected Areas in Communications, IEEE Journal on, vol. 23, no. 2, pp. 201–220, February 2005.
 [4] W. Jouini, D. Ernst, C. Moy, and J. Palicot, “Multiarmed bandit based policies for cognitive radio’s decision making issues,” in Signals, Circuits and Systems (SCS), 2009 3rd International Conference on. IEEE, 2010, pp. 1–6.
 [5] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning, vol. 47, no. 2, 2002.
 [6] P. Auer and R. Ortner, “UCB revisited: Improved regret bounds for the stochastic multiarmed bandit problem,” Periodica Mathematica Hungarica, vol. 61, no. 1, pp. 55–65, 2010.
 [7] A. Garivier and O. Cappé, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Conference On Learning Theory, 2011, pp. 359–376.
 [8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, 2002.
 [9] H. Kuhn, “The Hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1–2, pp. 83–97, 1955.
 [10] D. Bertsekas, “The auction algorithm: A distributed relaxation method for the assignment problem,” Annals of operations research, 1988.
 [11] D. Gale and L. Shapley, “College admissions and the stability of marriage,” American mathematical monthly, pp. 9–15, 1962.
 [12] K. Cohen, A. Leshem, and E. Zehavi, “Game theoretic aspects of the multichannel aloha protocol in cognitive radio networks,” Selected Areas in Communications, IEEE Journal on, vol. 31, 2013.
 [13] A. Leshem, E. Zehavi, and Y. Yaffe, “Multichannel opportunistic carrier sensing for stable channel access control in cognitive radio systems,” Selected Areas in Communications, IEEE Journal on, vol. 30, 2012.
 [14] P. Floréen, P. Kaski, V. Polishchuk, and J. Suomela, “Almost stable matchings by truncating the Gale–Shapley algorithm,” Algorithmica, vol. 58, no. 1, pp. 102–118, 2010.
 [15] N. Amira, R. Giladi, and Z. Lotker, “Distributed weighted stable marriage problem,” in Structural Information and Communication Complexity. Springer, 2010, pp. 29–40.
 [16] Y. Gonczarowski and N. Nisan, “A stable marriage requires communication,” arXiv preprint arXiv:1405.7709, 2014.
 [17] A. Kipnis and B. Patt-Shamir, “A note on distributed stable matching,” in IEEE International Conference on Distributed Computing Systems, 2009.
 [18] A. Anandkumar, N. Michael, A. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” Selected Areas in Communications, IEEE Journal on, vol. 29, no. 4, pp. 731–745, 2011.
 [19] O. Avner and S. Mannor, “Concurrent bandits and cognitive radio networks,” in European Conference on Machine Learning, 2014.
 [20] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, April 2014.
 [21] D. Leith, P. Clifford, V. Badarla, and D. Malone, “WLAN channel selection without communication,” Computer Networks, 2012.
Appendix A Proof of Lemma 1
We would like to show that for all values of for which , the probability that the potential decreases every time it changes is at least , where .
Given that a change in potential occurs at time , it is guaranteed to result in a potential decrease if it benefits both users. This will happen if both users’ indices, which guide their decisions, are accurate with respect to the true distribution.
Since we condition on a change in potential,
Let us upper bound . For a user switching from arm to arm at time , ,
where is user ’s UCB index of arm at time , defined in (1). Following the proof of Theorem 1 of [5],
provided that
(5) 
where is the number of times user sampled arm up to time and . If (5) does not hold, then the UCB index “misleads” user , causing her to mistakenly favor arm , despite its lower expected reward. Switching from arm to arm will then result in an increase in potential. However, once she acquires another sample of arm , its index will decrease. In the meantime, the index of arm will increase with the passage of time, so the indices will ultimately reflect the correct preference, resulting in a potential decrease.
The extreme value for (5), i.e., the largest number of required samples, corresponds to the minimal value of . Let us define:
Thus, when all arms have been sampled at least
(6) 
times, the probability of an increase in potential is very small.
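The sampling condition above can be illustrated with a short Python sketch. It assumes the standard UCB1 form of the index (empirical mean plus a confidence radius of sqrt(2 ln t / s)); whether this matches the index defined in (1) and the constants in (5) and (6) cannot be confirmed from the text here. The function names and the symbols t, s and gap are hypothetical. Under that assumption, the confidence radius drops to at most half the gap between two arms once s >= 8 ln t / gap^2, which is the kind of threshold the argument relies on.

```python
import math

def ucb1_index(mean_reward, num_samples, t):
    """Standard UCB1 index (assumed form): empirical mean plus confidence radius."""
    return mean_reward + math.sqrt(2.0 * math.log(t) / num_samples)

def min_samples(t, gap):
    """Smallest integer s with sqrt(2 ln t / s) <= gap / 2, i.e. s >= 8 ln t / gap^2."""
    return math.ceil(8.0 * math.log(t) / gap ** 2)

# Hypothetical values: time horizon and smallest gap between expected rewards.
t, gap = 1000, 0.2
s = min_samples(t, gap)
radius = math.sqrt(2.0 * math.log(t) / s)
print(f"samples needed: {s}, confidence radius: {radius:.4f}")
```

The smaller the gap, the more samples are needed, which is why the extreme case in the proof corresponds to the minimal gap.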
In order to allow for the coordination protocol, users do not gather informative samples in every time slot. Instead, they gather at least samples in each super frame, whose length is .
Therefore, taking into account the fact that the sampling condition in (6) must apply for all arms, the condition on is
(7) 
For all times for which (7) holds, if a change in potential occurs, it is a decrease, with probability of at least .
When we apply this lemma, we will use a quantity , an upper bound on the minimal for which (7) holds. Introducing a well-known lower bound on the logarithmic function:
We use this lower bound together with (7):
Denoting , we continue:
Our conclusion is that . Since this expression is finite, we may now use it in our proof.
Appendix B Proof of Lemma 2
The probability of a specific user becoming the initiator when there are interested users is
The probability is minimized when all users would like to become the initiator, yielding the bound .
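To make the monotonicity argument concrete, the following Python sketch uses an assumed contention rule that is not recovered from the text: each of k interested users volunteers independently with probability eps, and a given user becomes the initiator only if she is the sole volunteer. The names eps, k and N are hypothetical. Under this rule the probability is eps * (1 - eps)^(k - 1), which decreases in k and is therefore smallest when all N users are interested.

```python
def p_specific_initiator(eps, k):
    """P(a given user is the sole volunteer among k interested users), assumed rule."""
    return eps * (1.0 - eps) ** (k - 1)

# Hypothetical number of users and volunteering probability.
N, eps = 5, 0.1
probs = [p_specific_initiator(eps, k) for k in range(1, N + 1)]
lower_bound = probs[-1]  # k = N: every user wants to initiate

for k, p in enumerate(probs, start=1):
    print(f"k = {k}: {p:.5f}")
```

If the actual protocol uses a different contention rule, the closed form changes, but the same check (evaluate the probability for every k and take the minimum) still applies.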
Appendix C Proof of Lemma 3
If the system has not reached an SMC, then according to Definition 1, the conditions , hold for at least one pair of users .
According to the definition of the CSM-MAB algorithm, if holds, then user will add the channel user is sampling to her list of preferred channels with a probability of at least . Following arguments similar to those presented in the proof of Lemma 1, . If holds, user will accept user ’s swap proposal, assuming her statistics are accurate. This, once again, happens with a probability of at least . Once users and swap channels, the potential will change.
In the worst case (i.e., largest ), user ’s channel will be the last channel on user ’s list, and all users higher on the list will decline user ’s swap proposals. If user approaches a different user (whose channel is ranked higher than ’s), and that user agrees to swap, the potential will also change.
It remains to prove that the time it takes user to receive the privilege of being the initiator is finite. Once is appointed the initiator, it will take no more than miniframes, i.e., time slots, until she approaches user and a swap takes place.
There are two different cases. If are the only unstable pair, then they will be the only ones interested in becoming the initiator. Furthermore, if only one of them is dissatisfied, then there will only be one user interested in initiating. In the notation of Lemma 2, this corresponds to or , respectively. The probability of exactly one of them becoming the initiator is .
If there are additional unstable pairs, there will be more nominees for initiating. However, not all super frames necessarily result in a decrease in potential: if the initiator only targets channels occupied by “satisfied” users, all her attempts will be rejected. Therefore, we need to address the worst-case scenario, in which all users attempt to initiate, but only one of them is in a position that will actually result in a swap. Based on Lemma 2, the probability of that user emerging as the single initiator is at least , for a single super frame. This probability is smaller than for all , and is therefore the lower bound for the probability of a single initiator with actual capacity for a decrease in potential.
The number of SFs in a time interval of length is . The probability that a single initiator with actual capacity for a decrease in potential does not emerge in a certain SF is less than , and the probability that a single initiator does not emerge in the interval is less than . As , so does , and decays to zero.
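The decay argument above can be sketched numerically. The Python snippet below assumes a per-super-frame lower bound p on the probability that a suitable single initiator emerges, a super frame length sf_len, and a target failure probability delta; all three names and their values are hypothetical stand-ins for the quantities in the text. It bounds the failure probability over an interval by (1 - p) raised to the number of whole super frames in it, and finds how long the interval must be for that bound to fall below delta.

```python
import math

def failure_bound(p, interval_len, sf_len):
    """Upper bound on P(no suitable initiator emerges) over whole super frames."""
    num_sfs = interval_len // sf_len
    return (1.0 - p) ** num_sfs

def interval_for_confidence(p, sf_len, delta):
    """Shortest interval (a multiple of sf_len) whose failure bound is <= delta."""
    num_sfs = math.ceil(math.log(delta) / math.log(1.0 - p))
    return num_sfs * sf_len

# Hypothetical values for illustration only.
p, sf_len, delta = 0.05, 100, 0.01
T = interval_for_confidence(p, sf_len, delta)
print(f"interval length: {T}, failure bound: {failure_bound(p, T, sf_len):.5f}")
```

Because the bound is geometric in the number of super frames, the required interval grows only logarithmically in 1/delta.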
Binding the two aspects of this lemma together, we have that the probability of a single initiator with actual capacity for coordinating a switch emerging in an interval of length is at least . The probability of a swap between users whose actions do not correspond to a stable configuration is at least . The combined result: if the system is not in an SMC at time , then a change in the potential will occur within no more than time slots with probability of at least , where .
Let us rewrite the result for the sake of clarity: if the system is not in an SMC at time , then a change in the potential will occur within no more than time slots with probability of at least . Developing the previous expression for the probability of a change in potential:
From now on, we denote . Using this, we can derive an expression for :