A Nonparametric Multistage Learning Framework for Cognitive Spectrum Access in IoT Networks
Abstract
With the number of devices connected to wireless networks set to grow rapidly with the advent of the Internet of Things (IoT), spectrum scarcity will present a major challenge. Applying opportunistic spectrum access mechanisms to IoT networks will become increasingly important in addressing it. In this paper, we present a cognitive radio network architecture that uses multistage online learning techniques to assign spectrum to devices, with the aim of improving the throughput and energy efficiency of the IoT devices. In the first stage, we use an AI technique to learn the quality of a user-channel pairing. The second stage uses a nonparametric Bayesian learning algorithm to estimate the Primary User OFF time in each channel. The third stage augments the Bayesian learner with implicit exploration to accelerate the learning procedure. The proposed method significantly improves the throughput and energy efficiency of the IoT devices while keeping the interference to the primary users minimal. We provide a comprehensive empirical validation of the method against other learning-based approaches.
I Introduction
With the rise of the Internet of Things (IoT), more and more devices will be connected to the network, and most of them will rely on wireless solutions for connectivity [1]. With a large number of devices in the same physical location trying to access the network over wireless channels, we need intelligent ways of reusing the available spectrum resources to cater to their needs. Cognitive Radio (CR) is now viewed as a potential solution to the problem of increasing spectrum scarcity [2, 3, 4]. By enabling the coexistence of licensed and unlicensed users in a spectrum band, CR aims to improve the overall spectrum utilization in a wireless environment where spectrum resources are scarce [5, 6, 7, 8]. The unlicensed users, commonly referred to as Secondary Users (SUs), transmit their data in the holes in the licensed spectrum that result from spectrum underutilization by Primary Users (PUs). Since PUs have the exclusive right to access the allocated spectrum band, SUs are required to maintain a low interference profile with these PUs during opportunistic spectrum access. This requires an SU to sense the channel for the presence of PU traffic whenever it wants to transmit. Each sensing operation has an associated cost in both energy and time. In an IoT ecosystem, most devices will either be battery powered or rely on energy harvesting for their power requirements. In such an energy-budgeted scenario, there is a need for smart spectrum sensing algorithms that reduce the time an IoT node spends sensing the channels and thereby increase throughput and energy efficiency [9]. Recently, there has been increasing interest in applying Cognitive Radio Network (CRN) concepts to IoT systems. The authors of [10] consider the problem of reducing the overhead of spectrum sensing and derive an optimal set of parameters for maximizing throughput in an IoT scenario. The work in [11] proposes a two-step cooperative spectrum sensing method that increases the global accuracy of sensing and improves the energy efficiency of the SUs.
In this paper, we focus on an IoT network architecture which includes a central node and a number of IoT devices and we assume a Cognitive Radio Network (CRN) for the system. The IoT devices are assumed to be the SUs in the system and rely on opportunistic spectrum access for data transmission.
Multiple approaches have been proposed for reducing the time SUs spend sensing the channels. Two popular approaches in the literature are to optimize (a) channel selection: rank the channels so that a free channel is found with high probability after only a few sensing operations [12, 13, 14], and (b) the inter-sensing interval: calculate an inter-sensing interval for each channel from the available PU traffic statistics and sense at these intervals instead of at the start of every transmission [15, 16, 17].
The channel selection problem in CRNs has been widely studied by formalizing it as a Reinforcement Learning (RL) problem, for example by posing it as a Multi-Armed Bandit (MAB) problem [18, 19, 20, 21] or by applying Q-learning [22, 23]. Another popular approach to channel selection in CRNs is combinatorial bandits [24, 25], where each combination of channel allocations is treated as an action. In [26], different MAB algorithms are compared in the context of spectrum access in IoT networks. The empirical results show that applying MAB algorithms to IoT networks improves the probability of successful transmission even under dynamically changing channel conditions. However, all these works assume that the channel has to be sensed in every frame before data transmission. They do not leverage the fact that, with multiple SUs, the system can learn about the PU traffic by combining the sensing information from all the SUs and exploit the learned information to optimize the inter-sensing interval on each channel.
Another approach to the spectrum allocation problem is to apply traditional Artificial Intelligence (AI) techniques. Evolutionary algorithms [27] such as Genetic Algorithms (GA) [28], Particle Swarm Optimization (PSO) [29] and Gravitational Search (GS) [30] have been shown to provide promising solutions to the problem. These algorithms need to calculate the quality (fitness) of a resultant channel assignment configuration from the observations, and they assume that the data for calculating the fitness is available to the algorithm. However, in this problem, we are given neither the PU traffic characteristics nor the SNR values at the SUs. This makes it difficult to apply evolutionary algorithms directly to our setting. But, as we show later in the paper, we can use the concepts behind these techniques to design algorithms in which the data needed for the fitness calculation is estimated simultaneously with the search for improved spectrum allocation strategies. However, the AI technique by itself does not optimize the inter-sensing interval.
One way to reduce the number of sensing operations required by the SUs and improve the system throughput is to optimize the inter-sensing interval by estimating the idle period and skipping the sensing phase accordingly. The work in [15] proposes a framework for calculating the optimal frame duration for SUs to maximize the throughput while keeping the collision probability with PUs within a limit, for an exponential traffic model. Later, [17] showed that PU traffic patterns are best approximated by heavy-tailed distributions and provided an optimal inter-sensing interval policy for the Hyper Exponential Distributed (HED) traffic model. However, both works were limited to developing a policy optimized for the inter-sensing interval; they did not address the ordering of channels for sensing and depended on a priori information about the channel parameters. The need for PU traffic parameters to predict the inter-sensing interval optimally severely limits the application of these algorithms to an IoT network.
Until the recent work in [31], the idea of jointly optimizing both the inter-sensing interval and channel selection without assuming any a priori knowledge of the PU channel traffic had not been exploited. In [31], a two-stage reinforcement learning method is proposed that combines residual OFF time estimation and channel ordering without knowledge of the channel parameters in a single-SU scenario. (We use the term residual OFF time to denote the time period for which the PU channel stays idle once an SU senses it to be free.) By using a parametric Bayesian learning method to estimate the residual OFF time, the authors were able to learn an inter-sensing interval policy and combine it with a channel ordering policy based on MAB concepts. However, applying this to an IoT network presents a few challenges. The channel ordering method in [31] cannot be trivially extended to the multiuser scenario. It also assumes that the SU always has data to transmit, which does not hold in an IoT network. Further, a classical parametric Bayesian approach is employed for learning the primary traffic; this limits its extension to new, unseen traffic models and also limits the performance of the method when the actual traffic model differs substantially from the assumed model.
In this paper, we introduce a multistage nonparametric learning-based approach for opportunistic spectrum access by IoT devices. It combines AI and RL techniques for channel selection with a nonparametric Bayesian method for estimating the residual OFF time of PUs in a multiuser cognitive radio environment where PU traffic information is not available. We propose a centralized solution in which a central hub is responsible for allocating resources to the devices in the network. We list the major contributions of this paper below:

In the first stage, by leveraging the information that the central node can obtain from all SU devices in the network, we propose an RL/AI-based algorithm to efficiently estimate the quality of the channels for each user and predict which channels will be idle with high probability.

In the second stage, to efficiently estimate the residual OFF time distribution of PUs by combining observations from multiple devices in the network, we introduce a nonparametric Bayesian online learning algorithm. The learned nonparametric model is used to predict how long a channel will stay idle once it is sensed to be free. This allows the devices to skip channel sensing for multiple frames.

In the third stage, we augment the output of the nonparametric Bayesian learner for residual OFF time prediction with an exploration factor and present a way to implicitly incorporate exploration into the learning agent. Based on the stochastic approximation paradigm, we introduce a method to adaptively vary the exploration factor such that the observed PU collisions remain below the allowed threshold. Typically, the use of nonparametric distribution estimation techniques is limited because they require a larger number of samples. Our method of exploration mitigates this limitation by exploiting the structure of the problem and hence can work well even with a limited number of samples.

We perform extensive empirical validation of the proposed method and provide results for different PU and SU traffic scenarios.
II System Model
We consider an IoT network where denotes the set of PUs and denotes the set of SUs, with and . Each PU has its own licensed channel; there are channels available for IoT devices in the network for opportunistic access. At any time, we have two sets of primary users, denoting the set of active PUs and denoting the set of idle PUs with and . The state transition diagram of PU is given in Figure 1.
Once the PU has data to transmit, it moves from idle to active state and directly accesses the channel without sensing for any ongoing traffic since it has the exclusive right over the use of the channel. However, some cognitive IoT device may be using the channel at that point of time, which can result in collision. Since PU is the licensed user for the channel, the PU retransmits immediately after collision. Hence PU stays in the active state until it successfully sends its data. Upon successful transmission, PU goes back to idle state, where it waits until new data needs to be transmitted.
In the network, there is a central node which takes care of channel assignment for the IoT devices, which are the SUs. The central node communicates with all the IoT devices in the network and assigns channels to devices. In the case of IoT devices, we have four disjoint sets of users: idle IoT devices denoted by , IoT devices waiting for channel access denoted by , devices in the channel sensing phase denoted by and IoT devices which are active (transmitting data) denoted by . At any point of time, for collision-free transmission we need . At every time instant , checks for IoT devices in the wait state (). If any device is in the wait state, assigns one of the channels, , to that device to sense. The device senses the channel and reports the observation back to . If the channel is not free, either because the PU is using it or because another IoT device is using it, the IoT device moves back to the wait state and waits until it is given another channel to sense. It can also encounter a collision with the PU during the transmission phase. If this happens, the IoT device moves to the wait state and the sense cycle starts again. If the channel is sensed to be free by , it can access the channel, try to send data through it and receive a throughput of on successful transmission. Upon successful transmission, the SU moves to the idle state and stays there until new data is generated. If a transmission is unsuccessful, the device goes back to the wait state with zero throughput and the central node considers it in the next channel allocation cycle. The state transition diagram of an IoT device is given in Figure 2.
For primary user traffic we consider two continuous time traffic models based on the recent empirical studies [31]: Generalized Pareto Distributed (GPD) model and Hyper Exponential Distributed (HED) model.

Generalized Pareto Model: Both the ON time and the OFF time of the PU follow a Generalized Pareto distribution. The probability density function is given by
(1) where and . Here and are shape, scale and location parameters respectively. Different traffic characteristics are captured by varying the value of parameters. For example, the percentage occupancy in a band by PU can be modelled by varying the location parameter of the ON and OFF distributions.

Hyper Exponential Model: The HED traffic model is based on the observation that PUs have long OFF periods and short ON periods. To capture this behaviour, the HED model uses an exponential distribution for the ON time and a hyper-exponential distribution for the OFF time. Thus the ON time distribution of the HED model with mean ON time as is given by
(2) and OFF time distribution with mean OFF period is given by
(3)
In the simulation, we chose the parameters of the models to closely match with the empirical observations which reflect real life PU traffic use cases. It should also be noted that Exponential traffic can be generated as a special case of HED by changing the OFF time distribution to have only one component with .
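As an illustration of how traffic of these two families could be generated in a simulation, the following minimal sketch draws samples by inverse-CDF sampling for the Generalized Pareto case and by component selection for the hyper-exponential case. The function names, argument names and the numerical parameter values are assumptions made for this example and are not the values used in our simulations.

```python
import random

def sample_gpd(rng, shape, scale, loc):
    """Draw one Generalized Pareto sample by inverting the CDF (shape != 0)."""
    u = rng.random()
    # CDF: F(x) = 1 - (1 + shape * (x - loc) / scale) ** (-1 / shape)
    return loc + scale * ((1.0 - u) ** (-shape) - 1.0) / shape

def sample_hyperexp(rng, weights, rates):
    """Draw one hyper-exponential sample: pick a component, then an exponential."""
    rate = rng.choices(rates, weights=weights, k=1)[0]
    return rng.expovariate(rate)

rng = random.Random(0)
# Illustrative (assumed) parameters only.
off_gpd = [sample_gpd(rng, shape=0.25, scale=2.0, loc=1.0) for _ in range(10000)]
off_hed = [sample_hyperexp(rng, weights=[0.7, 0.3], rates=[2.0, 0.1]) for _ in range(10000)]
# With a single component, the hyper-exponential reduces to plain exponential traffic.
off_exp = [sample_hyperexp(rng, weights=[1.0], rates=[0.5]) for _ in range(10000)]
```

The last line mirrors the remark above that exponential traffic is a special case of HED with a single component.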
For modelling IoT device traffic, we use multiple models. Incorporating observations from machine type communications (MTC) and analyzing the traffic patterns of majority of applications, [32] classifies IoT traffic into three elementary classes:

Periodic Update (PU): When the IoT device sends data at regular intervals of time, the traffic generated can be seen as Periodic Update. This type of traffic is non-real-time and usually has a fixed data size. An example is a temperature sensor on a machine shop floor that sends temperature updates to a central server at regular intervals.

Event Driven (ED): When an IoT node needs to transmit data in response to an event it has sensed, the traffic generated is classified as Event-Driven. This type of traffic is irregular and usually requires real-time servicing. An example is a fire-alarm sensor on the machine shop floor responding to a fire in one of the local stations.

Payload Exchange (PE): This traffic type comprises all the high-volume transmissions from the IoT node to the server. This could be the response to an independent request or a follow-up to one of the above-mentioned traffic events. This can also include data streaming events.
For the purpose of validating our algorithm, we use the first two traffic models for IoT devices in conjunction with the traffic models discussed for primary user traffic.
III Proposed Approach
The proposed approach comprises algorithms for (a) channel order selection for sensing and (b) residual OFF time prediction for each channel. In this section, we first present the general framework for interaction between the central hub and IoT devices and then provide the proposed algorithms.
III-A Sensing and transmitting at IoT device
To reiterate, reducing the number of sensing operations required by the IoT device improves both throughput and energy efficiency. From [31], we observe that if we can predict the time for which a channel is likely to stay free, the device can skip sensing the channel for multiple frames/packets. In a single-SU scenario, [31] proposes a parametric Bayesian method to predict how long a PU channel remains idle once it is sensed to be free and uses it to skip sensing over an appropriate number of frames. It also assumes that the SU device always has data to transmit. However, the work in [31] cannot be trivially extended to the IoT setting. Typically, in an IoT network, the number of devices is large and the SU traffic is not always ON. Hence the number of sense/send actions taken by a single SU is small, which in turn reduces the number of samples the SU sees and learns from. This is a problem for the SU learner, as it needs a long time to accumulate enough channel samples to learn an accurate model. To circumvent this problem, we exploit the fact that although each SU may see a channel only for a short period, there are usually many SUs in an IoT network, and the total number of times the channel is seen is large enough to estimate the traffic distribution on that channel.
Motivated by the fact that central node , which has access to observation from all the IoT devices, can learn about the traffic characteristics faster than individual nodes, we propose to move the learning algorithm to the central node and make the IoT device a passive node which responds to the commands from central node . This architecture also brings in the additional advantage that the IoT node does not have to be of significant compute capability, as the learning and channel allocation takes place in the central node.
The algorithm that runs at each IoT device is given in Algorithm 1. Whenever the device needs to send data, it will move to where it will wait for the central node to assign a channel for the device to sense. It will sense the channel and update the central node with the sensed traffic occupancy. If the channel is found to be idle and the predicted residual OFF time () is given by , the IoT device can occupy that band and start transmission. The transmission ends either when the payload is over, or the predicted residual OFF time is over or a collision occurs. Upon successful transmission, the IoT device will update the obtained throughput to the central node. Otherwise, it will update the central node with a transmission failure and go back to the wait state. The IoT device also communicates the number of frames sent successfully to the central node.
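The three ways in which a transmission can end can be made concrete with a small simulation sketch. This is not Algorithm 1 itself; the function name, the outcome labels and the argument frames_until_pu_returns (a quantity known only to a simulator, never to the device) are hypothetical and introduced only for illustration.

```python
def transmit(payload_frames, predicted_skip_frames, frames_until_pu_returns):
    """Simulate one channel access without re-sensing.

    Returns (frames_sent, outcome), where outcome is one of
    "payload_done", "prediction_expired" or "collision".
    """
    frames_sent = 0
    # The device may send until the payload or the predicted residual OFF time runs out.
    budget = min(payload_frames, predicted_skip_frames)
    while frames_sent < budget:
        if frames_sent >= frames_until_pu_returns:
            # The PU became active again before this frame: collision.
            return frames_sent, "collision"
        frames_sent += 1
    if frames_sent == payload_frames:
        return frames_sent, "payload_done"
    return frames_sent, "prediction_expired"
```

In each case the number of frames sent is what the device would report back to the central node.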
III-B Online learning and Resource Allocation at Central Node
In the proposed method, the central node assigns one channel at a time to each IoT device for sensing, thereby reducing the energy spent on sensing all the available channels. (The method can be modified to accommodate sensing of multiple channels by each device if required.) This approach has the added advantage that the IoT device can start sending data immediately after finding a free channel and obtain better throughput/latency. Since the central node is aware of the actions taken by each of the SUs, this also mitigates inter-SU collisions. We need the central hub to learn the channel characteristics quickly, to pair an IoT device with a channel on which it sees better throughput characteristics, and to predict how long the device can transmit on the channel without sensing it again. The main algorithm that runs on the central hub is given in Algorithm 2.
We depend on five functions in the main algorithm for assigning channels and predicting residual OFF times. We first provide a brief description of each below.

GetChannel(): This function is responsible for assigning a channel to each of the devices in wait state, .

UpdateChannel(): This function is the interface through which devices report observations to the central node. When a device returns an observation to the central node, this function updates the corresponding entry of the channel-device quality matrix (denoted by ) maintained at the central hub. This matrix maintains a relative score of how suitable each channel is for each device.

PredictResidualTime(): This function is responsible for predicting the residual OFF time of each channel once it is sensed to be free.

UpdateResidualTimePredictor(): The residual time predictor uses the observations from the IoT devices of how long each device was able to use the channel before a collision happened to build the residual OFF time distribution and to predict the number of frames for which sensing can be skipped.

UpdateExplorationFactor(): For estimating the residual time for each PU, we build a discrete distribution of quantized residual OFF time based on observed OFF times using a nonparametric Bayesian technique. This function is used to update the exploration scheme to be used by the central node.
If GetChannel() can assign a channel which is good (in terms of both occupancy and capacity) for an IoT device, the device will not have to sense multiple channels before finding a free one. Further, if PredictResidualTime() predicts the residual time accurately, the IoT device can skip sensing the channel in every frame without increasing the interference to the PU compared with a method that senses the channel in every frame. Figure 3 depicts the interactions between the central hub and the IoT nodes in the CRN. The interactions happen in the numbered order given in the figure, and the arrowheads show the direction of information flow. The variable listed alongside each arrow refers to the input/output of each module or action. We now proceed with the details of our proposed approach in the succeeding subsections.
III-C Channel selection using Learning
In order to assign channels to each requesting device, the central node needs to know the quality of a channel with respect to an IoT device. This is a function of (a) the capacity the channel can offer the device and (b) the PU traffic characteristics on the channel. However, this information is unavailable to the central node at the start of the algorithm and needs to be learned.
In the case of a single SU, channel selection using MAB is quite popular [18, 21]. However, we deal with multiple SUs that demand a channel at the same instant. If the value of each user-channel pairing were known, this problem would reduce to finding the best user-channel permutation. However, we do not have access to that value and hence must learn it from data. A similar problem is addressed in [25] using combinatorial bandits; however, that solution is restricted to the case where the number of channels is greater than the number of users and neither changes with time. In our formulation, the number of active users and the number of available channels change with time. Hence, we would need to search over all possible permutations to arrive at a channel assignment, which is computationally very demanding. For example, when we have 5 free channels and 20 SUs requesting channels, the search space is . Therefore we propose to use an AI technique called hill climbing, which has substantially lower complexity.
We employ a learning technique which combines the ideas of an AI method, hill climbing [33], and a reinforcement learning technique called ε-greedy [34]. The algorithm proceeds by estimating a value table for the channel-device pairs. The central node maintains the value table for each channel and each device . We represent each entry of this table by . Whenever feedback on throughput, , is available from the device, the corresponding entry in the value table is updated according to the update equation
(4) 
Here is a problem-dependent parameter, also known as the learning rate. When we set , the central hub gives importance only to the last observation and completely discards all past learning. Conversely, if is very close to , the central hub takes a long time to build up the value table, as it gives very little weight to new observations.
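The behaviour described above is that of an exponentially weighted running average. The sketch below assumes the convention in which a learning rate of 1 keeps only the latest observation and a learning rate near 0 changes the table slowly; the exact weighting in the update equation may be written the other way round. The function name, the dictionary representation of the value table and the initial value of 0 are assumptions made for illustration.

```python
def update_value(value_table, device, channel, throughput, learning_rate):
    """Blend a new throughput observation into the (device, channel) entry."""
    old = value_table.get((device, channel), 0.0)  # assumed initial value: 0
    new = (1.0 - learning_rate) * old + learning_rate * throughput
    value_table[(device, channel)] = new
    return new
```

For instance, starting from an empty table, an observation of 10.0 with learning rate 1.0 sets the entry to 10.0, and a subsequent observation of 0.0 with learning rate 0.5 halves it to 5.0.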
With the value table serving as a proxy for the quality of each channel for each device, we can calculate the quality of each channel assignment configuration from the individual entries in the table. Let denote a channel allocation configuration. The quality of the configuration is calculated as the sum of the individual quality values from the value table. Hill climbing then proceeds by randomly swapping some entries in the assignment and recalculating the quality of the resultant configuration. If the new configuration has a better quality value than the last configuration, we discard the last configuration and proceed with the new one. This process continues until no further swap improves the quality value of the channel assignment configuration.
Even though the above method finds a high-value channel assignment configuration with low complexity, the value table maintained at the central node needs to be estimated correctly for hill climbing to work. However, we do not assume that this knowledge is available at the start, so we need an exploration strategy to build the value table, as Multi-Armed Bandits also require. Hence we employ an ε-greedy strategy to randomly explore different configurations. By trying random configurations an ε fraction of the time, the central node can improve the accuracy of the value table over time. This, in turn, improves the results of hill climbing. The methods for channel assignment at the central node are provided in Algorithm 3. Here, the method GetChannel takes as input the set of SUs waiting for channel allocation and outputs a channel allocation configuration for them. The method UpdateChannel takes the channel at which the device has achieved a throughput and updates the value table.
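A minimal sketch of how hill climbing and ε-greedy exploration could be combined is given below. It is not a transcription of Algorithm 3: the function signature, the dictionary-based value table, the use of None for "no channel assigned" and the fixed iteration budget max_iters (used here instead of running until no improving swap remains) are all simplifying assumptions.

```python
import random

def get_channel(devices, channels, value_table, epsilon, rng, max_iters=1000):
    """Return a {device: channel} assignment for the waiting devices."""
    n = max(len(devices), len(channels))
    # One slot per position; positions beyond len(devices) hold unused channels,
    # and None means the device at that position gets no channel this round.
    slots = list(channels) + [None] * (n - len(channels))
    rng.shuffle(slots)  # random starting configuration

    def quality(s):
        # Sum of value-table entries over the assigned (device, channel) pairs.
        return sum(value_table.get((d, c), 0.0)
                   for d, c in zip(devices, s) if c is not None)

    # With probability epsilon keep the random configuration (exploration);
    # otherwise refine it by hill climbing (exploitation).
    if n >= 2 and rng.random() >= epsilon:
        best = quality(slots)
        for _ in range(max_iters):
            i, j = rng.sample(range(n), 2)
            slots[i], slots[j] = slots[j], slots[i]
            candidate = quality(slots)
            if candidate > best:
                best = candidate
            else:
                slots[i], slots[j] = slots[j], slots[i]  # undo a non-improving swap
    return {d: c for d, c in zip(devices, slots) if c is not None}
```

Because each channel occupies exactly one slot, no two devices are ever assigned the same channel, which is what prevents inter-SU collisions in this sketch.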
III-D Residual OFF Time prediction using Online Nonparametric Bayesian Learning
To accurately predict the residual OFF time of each channel, the central hub requires the traffic characteristics of each of the primary users, which we propose to learn online. To the best of our knowledge, there is no available literature that thoroughly evaluates the PU network traffic characteristics seen in an IoT system. Faced with the challenge of designing an algorithm that has to work on an as yet unseen system model, we base our algorithm design on the popular nonparametric Bayesian estimation paradigm.
One of the main differences between this work and [31] is that here we exploit the fact that one is interested only in the quantized values of the time periods that the SUs can skip and not in the continuous distribution of the residual OFF time. Since the SUs transmit only in intervals of their frame size, even with a continuous distribution estimator we would have to quantize the predicted values to work with the SUs' frame period. Hence, we can map the problem of estimating the residual OFF time to that of estimating a discrete distribution. In this discrete distribution, each point corresponds to the number of frame periods an SU can skip. However, the total number of points in this distribution is unknown to the central hub and depends on the PU traffic, since it is determined by the maximum PU OFF time, which we do not know. In an unstructured learning environment, the problem would be that of building a discrete distribution where the number of discrete values in the support is unknown, and this general problem is fairly difficult to handle [35, 36]. However, in our case, only the largest residual OFF time is needed to define the support for the Dirichlet prior: because of the structure of our problem, the occurrence of any previously unseen residual OFF time also means that any residual OFF time lower than the observed value is also possible in the network.
In our problem, we can resort to assuming a very high number as the maximum possible OFF time, for example frame periods. It should also be noted that, because of the structure of the problem, if we observe a quantized residual OFF time of , then quantized values less than are also possible candidates for the residual OFF time. Let denote the class of the highest possible quantized residual OFF time; for all practical scenarios, this number can be fixed at a high value depending on the problem. With the problem of estimating the residual OFF time reduced to estimating a discrete distribution with a known support , we can now use a nonparametric Bayesian method to estimate the underlying distribution, using the Dirichlet distribution as the prior since it is the conjugate prior for categorical distributions. (The discrete distribution being built for the PU residual OFF time is a categorical distribution with each OFF period as a class.)
The Dirichlet distribution, , is parameterized by positive scalars for , with . The support of is a dimensional simplex . The probability density function of when is given by
Let the categorical distribution of residual OFF times be denoted by and denote the observed sample of residual OFF time. After observing the samples of residual OFF time, the posterior of can be calculated as
Simplifying, we get the parameter for posterior update as
where is the indicator function. The update can also be done in an online setting by updating with one sample observation at a time. Note that this sample () is either the time taken to send the SU payload successfully, or the time the SU was able to transmit until it saw a collision, or the time duration predicted by to be skipped. Further, if the same channel is selected again by the same device or by another device within a specific time period (we used a period of two frames in our simulations), we consider this as a sample of a single residual OFF time and update the parameter corresponding to the sum of the residual OFF times observed in both samples. We call this time period the hold time. Therefore, if an update to a channel comes within the hold time after the last update, both samples are combined into one sample and the prior parameter corresponding to the sum is updated. This helps in capturing samples of long residual OFF times, which may be spread across multiple transmissions of the devices. Hence, the categorical distribution of residual OFF times, which is of interest to us, has as its posterior the Dirichlet distribution with updated parameters. Further, we augment it with an additional exploration probability to derive the final predictor for the residual OFF time. The functions of the residual time predictor are given in Algorithm 4. Here the method PredictResidualTime takes in a channel as input and returns the predicted residual OFF time () for the corresponding PU channel. The method UpdateResidualTimePredictor updates the parameters of the nonparametric model with the observed value . denotes the standard impulse function which puts a mass of at location . For a detailed explanation of the exploration strategies, please see subsections III-E and III-F.
Note that if one wants to use a continuous-time distribution estimator, a Dirichlet process with appropriate smoothing can be applied to estimate the distribution from the observations, giving a nonparametric estimate of the continuous value of the PU residual OFF time. However, the continuous case requires Markov Chain Monte Carlo (MCMC) methods, which are computationally demanding. Since our problem requires only quantized residual OFF time estimates, we can avoid the complex MCMC methods and use the simple Dirichlet-categorical conjugate prior relationship to build the predictor.
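The online Dirichlet-categorical update with a hold time can be sketched as follows. This is an illustrative reading of the description above, not Algorithm 4: the class and attribute names are hypothetical, the hold time is measured from the end of the previous access to the start of the new one, and the count added for the earlier partial sample is retracted when two samples are merged. Both of the latter choices are assumptions.

```python
class ResidualTimeModel:
    """Dirichlet-categorical model of the quantized residual OFF time of one channel."""

    def __init__(self, max_frames, prior_count=1.0, hold_time=2):
        # alpha[k] is the Dirichlet parameter for a residual OFF time of k frames,
        # k = 0 .. max_frames, where max_frames is the assumed bound on the support.
        self.alpha = [prior_count] * (max_frames + 1)
        self.hold_time = hold_time
        self.last_end = None    # time at which the previous access ended
        self.last_class = None  # class that the previous sample updated

    def update(self, observed_frames, start, end):
        """Add one observation of observed_frames frames used between start and end."""
        if self.last_end is not None and start - self.last_end <= self.hold_time:
            # Same idle period as the previous sample: merge the two into one sample.
            self.alpha[self.last_class] -= 1.0
            observed_frames += self.last_class
        k = min(observed_frames, len(self.alpha) - 1)  # clip to the assumed support
        self.alpha[k] += 1.0  # conjugate update: increment the count of class k
        self.last_end, self.last_class = end, k

    def posterior_mean(self):
        """Posterior mean of the categorical distribution over residual OFF times."""
        total = sum(self.alpha)
        return [a / total for a in self.alpha]
```

With max_frames=10 and a hold time of two frames, an observation of 3 frames followed one frame later by an observation of 4 frames ends up as a single sample in class 7.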
In our setting, the central node is responsible for predicting the residual OFF time for each of the channels. To predict the residual OFF time of a channel, we first sample a categorical distribution with parameter p from the maintained Dirichlet prior and then sample a point from p, which corresponds to the discrete quantized time to skip. This is sent to the SU to indicate the number of frames it can send without sensing. By sampling from the prior distribution and then sampling from the categorical distribution p, one ensures that, with nonzero probability, the central node will explore various skip periods. Since the central node builds the distribution while also taking actions based on the past observed values of the residual OFF time, it needs to try transmitting for longer than what it has already observed in order to build the tail of the posterior distribution. Rather than relying only on the Bayesian sampling technique to explore various residual OFF periods, we can make the central node explicitly try high values of the skip period to build the tail of the distribution. We can sample according to the maintained prior for one fraction of the time and, for the remaining fraction, transmit SU data for very high values of the skip period to explore longer OFF periods. Since we assume a high value for the support of the categorical distribution, we can modify the sampled categorical distribution p itself to achieve this. By scaling p by one minus the exploration factor and adding a mass equal to the exploration factor at the upper end of the support, the exploration is made implicit to the residual time predictor. We refer to the result as the augmented distribution. Hence, by using a nonparametric Bayesian method to estimate the residual OFF time and augmenting it by appropriately scaling and adding a mass to the tail, we arrive at a simple algorithm with implicit exploration for accelerated learning. Using the augmented distribution instead of p will cause more collisions. However, in a CRN, SUs are allowed to collide with the PU traffic as long as the fraction of such collisions is maintained under a prespecified threshold. Hence, the exploration factor can be selected such that the experienced collisions are within the allowed threshold.
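The two-step sampling and the tail augmentation described above can be sketched as follows. The function names and the symbol `eps` for the exploration factor are hypothetical, and placing the added mass on the last category of the support is an assumption drawn from the description; Algorithm 4 may differ in detail.

```python
import random
from typing import List, Sequence


def sample_dirichlet(alpha: Sequence[float], rng: random.Random) -> List[float]:
    """Draw p ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def augment_with_exploration(p: Sequence[float], eps: float) -> List[float]:
    """Scale p by (1 - eps) and add a mass of eps on the largest skip value."""
    q = [(1.0 - eps) * pi for pi in p]
    q[-1] += eps
    return q


def predict_residual_time(alpha: Sequence[float], eps: float,
                          rng: random.Random) -> int:
    """Return a skip length in frames, drawn from the augmented distribution."""
    p = sample_dirichlet(alpha, rng)
    q = augment_with_exploration(p, eps)
    skip_values = range(1, len(q) + 1)
    return rng.choices(skip_values, weights=q, k=1)[0]
```

With `eps = 0` the predictor reduces to plain posterior sampling, and with `eps = 1` it always proposes the longest skip, so the exploration factor interpolates between the two behaviours.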
III-E Various approaches for setting the exploration factor
A main research problem in reinforcement learning is how to control the exploratory behaviour of the agent without losing the ability to learn. In the main algorithm, this part is handled by the UpdateExplorationFactor() method. Below, we discuss three different ways of controlling the exploratory behaviour of the learning agent.
Even though it may appear naive, one of the most popular methods is to keep the exploration factor constant throughout, so that the learning agent always explores with a constant probability. With a user-given value, the strategy for UpdateExplorationFactor() can be
(5) 
One of the main disadvantages of constant exploration is that the cumulative penalty associated with exploratory actions increases linearly over time, an undesirable characteristic for any learning algorithm. If we appropriately decay the exploration factor over time, we can counter this linearly increasing cumulative regret. Exponentially decaying the exploration factor with time is also a popular approach [34]. With a user-provided value for the decay parameter, the strategy for UpdateExplorationFactor() can be
(6) 
A high value of the decay parameter can lead to suboptimal exploration, whereas a low value can lead to a very slow learning process. The optimal value of the decay parameter is problem dependent. This raises the question: can we adaptively calculate the exploration factor based on the observed PU traffic behaviour? Below, we provide an affirmative answer to this question by drawing insights from recent developments in stochastic optimization methods.
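The contrast between the two schedules can be illustrated with a short sketch. The exact forms of the update rules are not reproduced here; the code assumes a constant schedule and a schedule of the form `eps0 * exp(-decay * t)`, and all names are hypothetical.

```python
import math
from typing import Callable


def constant_exploration(eps0: float, t: int) -> float:
    """Explore with the same user-given probability at every step."""
    return eps0


def exponential_decay_exploration(eps0: float, decay: float, t: int) -> float:
    """Assumed form of exponential decay: eps0 * exp(-decay * t)."""
    return eps0 * math.exp(-decay * t)


def expected_exploratory_actions(schedule: Callable[[int], float],
                                 horizon: int) -> float:
    """Sum of exploration probabilities, a proxy for cumulative exploration cost."""
    return sum(schedule(t) for t in range(horizon))
```

Under these assumptions the constant schedule accumulates `eps0 * horizon` expected exploratory actions, which grows linearly, while the decaying schedule stays below the geometric-series bound `eps0 / (1 - exp(-decay))` for any horizon.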
III-F Adapting the exploration factor
In a CRN, when the PUs allow SUs to opportunistically access the spectrum, there is a need to introduce a threshold for collisions. We define the collision threshold as the maximum collisions that the SUs are collectively allowed to cause on any given channel. In a heavy PU traffic scenario, the SUs have to behave conservatively (sense more often) to keep the collisions below this threshold. However, in medium and low traffic scenarios, the SUs can forgo sensing every frame and exploit the allowed collision threshold to achieve better performance. Since we do not assume any knowledge of the PU traffic characteristics, the exploration factor needs to be learned from the observed data itself. Note that different channels may encounter different percentages of collisions; therefore, we learn the exploration factor individually for each channel.
The collisions seen by a PU in our model have two sources: (a) the traditional SU-PU collision, which can happen even if the SU senses every frame (this occurs when the PU starts transmitting after the SU’s sensing period or when the energy detector makes an error), and (b) the collision caused because the SU skipped sensing the channel. The first contributor depends on factors such as the burstiness of the PU traffic and the probability of missed detection of the energy detector, whereas the second is directly related to the nonparametric Bayesian estimator and the exploration factor. We are interested in the effect of varying the exploration factor, as it is the parameter under the control of the algorithm. We assume that the other factors remain constant while we vary it; therefore, we can treat the variation in collisions as a function of the current value of the exploration factor. The number of observed collisions is thus a function of the exploration factor and of the other above-mentioned factors, which we collect in a vector. We define a loss function that we would like to optimize so that the number of observed collisions is as close to the collision threshold as possible. As the exploration factor is the only variable parameter, we consider the loss as a function of the exploration factor alone. Since we have noisy observations of the collisions and an online learning setting, we could use Stochastic Gradient Descent (SGD) to optimize our objective. This requires us to calculate the gradient of the loss function with respect to the exploration factor as
(7) 
This presents a problem, as we do not have the functional relationship needed to calculate the gradient. However, we do have direct access to samples of the number of collisions for known values of the exploration factor. This observation enables us to use stochastic approximation techniques to estimate the gradient without knowledge of the functional relationship.
Simultaneous Perturbation Stochastic Approximation (SPSA) [37] is a stochastic approximation method that lets us perform gradient descent even when the functional relationship between the objective and the parameter to optimize is unavailable in the model. The gradient is estimated by querying the system with slightly perturbed parameters. The algorithm has four tunable parameters which determine its performance. Two of them govern the step size of the gradient descent update: one sets its initial value and the other sets the rate at which the step size decreases with each iteration. The other two govern the magnitude of the perturbation applied to the input: one sets the initial perturbation and the other controls its rate of decay. These parameters are tuned for one kind of application and need not be retuned for each instance. Interested readers are referred to [37] for a detailed explanation as well as practical tips for setting these values.
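For reference, the standard one-dimensional two-sided SPSA recursion from the literature can be written as below. The symbols are assumptions introduced for this illustration: $\epsilon_k$ for the exploration factor after $k$ updates, $L$ for the loss, $a, \alpha$ for the step-size parameters and $c, \gamma$ for the perturbation parameters. The general form in [37] also allows a stability constant in the step-size sequence, which is omitted here to match the four-parameter description.

```latex
\begin{align}
a_k &= \frac{a}{k^{\alpha}}, \qquad c_k = \frac{c}{k^{\gamma}}, \\
\hat{g}_k &= \frac{L(\epsilon_k + c_k \Delta_k) - L(\epsilon_k - c_k \Delta_k)}{2\, c_k\, \Delta_k},
\qquad \Delta_k \in \{-1, +1\} \text{ with equal probability}, \\
\epsilon_{k+1} &= \epsilon_k - a_k\, \hat{g}_k .
\end{align}
```

Each update needs two noisy evaluations of the loss, one at each perturbed value, and no analytical gradient.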
For our setting, we perturb the input parameter of the system, the exploration factor, and we then have access to the number of collisions encountered on that channel for the specified value; we wish to minimize the loss function. This loss function is chosen for its convex behaviour and its simplicity in conveying the objective. The SPSA update strategy used to vary the exploration factor for each channel is given in Algorithm 5.
Algorithm 5 tracks, for each channel, the number of updates performed and the number of times the subroutine has been called. Every time the subroutine is called, we can assign an exploration factor and observe the number of collisions caused. This in turn gives us a sample of the loss function. From the algorithm, we can see that we need two such samples to perform a single update of the exploration factor. At step 9, we use these two samples to calculate the pseudo-gradient information for the function, and at step 10, we update the exploration factor.
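A per-channel tuner in the spirit of Algorithm 5 might look like the sketch below. It is not the authors' algorithm: the class and method names are hypothetical, the loss is assumed to be the squared deviation of the observed collision fraction from the threshold, the exploration factor is clipped to [0, 1], and the default gain values are placeholders rather than the values used in the paper.

```python
import random


class SpsaExplorationTuner:
    """Two-sided SPSA on a scalar exploration factor for one channel."""

    def __init__(self, eps0: float, threshold: float, a: float = 0.1,
                 alpha: float = 0.602, c: float = 0.05, gamma: float = 0.101,
                 seed: int = 0) -> None:
        self.eps = eps0
        self.threshold = threshold
        self.a, self.alpha, self.c, self.gamma = a, alpha, c, gamma
        self.k = 1                      # index of the update being prepared
        self.rng = random.Random(seed)
        self._delta = self.rng.choice((-1.0, 1.0))
        self._loss_plus = None          # loss at eps + c_k * delta, if observed

    @staticmethod
    def _clip(x: float) -> float:
        return min(1.0, max(0.0, x))

    def _c_k(self) -> float:
        return self.c / self.k ** self.gamma

    def loss(self, collisions: float) -> float:
        # Assumed loss: squared distance to the collision threshold.
        return (collisions - self.threshold) ** 2

    def propose(self) -> float:
        """Exploration factor to use next: plus side first, then minus side."""
        sign = 1.0 if self._loss_plus is None else -1.0
        return self._clip(self.eps + sign * self._c_k() * self._delta)

    def observe(self, collisions: float) -> None:
        """Record the collisions seen with the last proposed value."""
        if self._loss_plus is None:
            self._loss_plus = self.loss(collisions)
            return
        loss_minus = self.loss(collisions)
        grad = (self._loss_plus - loss_minus) / (2.0 * self._c_k() * self._delta)
        a_k = self.a / self.k ** self.alpha
        self.eps = self._clip(self.eps - a_k * grad)
        self.k += 1
        self._loss_plus = None
        self._delta = self.rng.choice((-1.0, 1.0))
```

The caller alternates `propose()` and `observe()`; every second observation completes one update, which mirrors the requirement of two loss samples per update.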
III-G Discussion
We would like to reemphasize that we present a broad framework by which multiple cognitive users access the unlicensed channels to maximize their own throughput and, at the same time, try to reduce the number of sensing operations without causing significant interference to the PUs. The proposed multistage approach allows any of the stages to be replaced or extended without affecting the other parts of the framework. As an example, if a better algorithm is proposed for channel selection for the requesting SUs, the new algorithm can replace Algorithm 3 without disrupting the rest of the framework. The action taken at each stage and the observations from the system are fed to the next stage. Hence, if the channel selection algorithm wrongly estimates the quality of a channel, the following residual OFF time predictor stage will correct it by using the throughput seen during the skip interval. On the other hand, if the residual time predictor is in error, the channel selection stage will receive more collision updates, which will in turn reduce the probability of picking that channel. Further, if the exploration stage picks a larger exploration factor than appropriate, the penalization in the form of collisions will lead to correction in all stages. In this way, all the stages help in correcting one another and jointly improve the performance.
IV Simulation Results
In this section, we present our simulation setting and provide results for the proposed algorithm. Traditionally, to simulate multiple users in CRNs, it is assumed that the number of available primary channels is greater than the number of SUs [25, 20]. In the era of IoT, this assumption no longer holds; we are dealing with more devices than available channels. Hence, we consider a scenario in which the IoT devices competing for secondary access outnumber the primary channels (the values used are listed in Table I). It has been suggested through the study of real-life traces that heavy-tailed distributions like the GPD are suited to model the distribution of the idle times of primary traffic [38]. Also, in notable works like [15], the exponential distribution is used to model the primary traffic. Therefore, for our simulations, we show results for two different PU traffic models: GPD and exponential. We model each channel independently, where the ON times and idle times of the PU are independent and identically distributed (iid) samples from the respective distributions. The distribution for each channel is modelled with parameters randomly selected from the ranges mentioned in Table I. To make our simulations more realistic, we also account for the probability with which the SU’s transmission might fail due to channel error. This implies that failures in secondary transmission are not due to collisions with the PU alone; a fraction of the failures is also due to channel error. Note that the central node cannot distinguish between these failures and hence treats all failed transmissions as collisions with the PU.
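The PU traffic model described above can be sketched with inverse-transform sampling. The function names, the time units and the parameter values used in any call are hypothetical; only the channel error probability of 0.05 is taken from Table I.

```python
import math
import random
from typing import Callable, List, Tuple


def sample_exponential(rng: random.Random, mean: float) -> float:
    """Exponentially distributed duration with the given mean."""
    return rng.expovariate(1.0 / mean)


def sample_gpd(rng: random.Random, shape: float, scale: float,
               location: float = 0.0) -> float:
    """Generalized Pareto sample via the inverse CDF."""
    u = rng.random()                    # u lies in [0, 1)
    if shape == 0.0:
        return location - scale * math.log(1.0 - u)
    return location + scale * ((1.0 - u) ** (-shape) - 1.0) / shape


def pu_on_off_sequence(rng: random.Random, horizon: float,
                       sample_on: Callable[[random.Random], float],
                       sample_off: Callable[[random.Random], float]
                       ) -> List[Tuple[str, float]]:
    """Alternate iid OFF and ON periods until the horizon is covered."""
    periods: List[Tuple[str, float]] = []
    elapsed, state = 0.0, "OFF"
    while elapsed < horizon:
        duration = sample_off(rng) if state == "OFF" else sample_on(rng)
        periods.append((state, duration))
        elapsed += duration
        state = "ON" if state == "OFF" else "OFF"
    return periods


def su_transmission_failed(rng: random.Random, pu_is_on: bool,
                           channel_error: float = 0.05) -> bool:
    """A frame fails on a PU collision or, independently, on channel error.

    The central node cannot tell the two causes apart and counts both
    as collisions.
    """
    return pu_is_on or rng.random() < channel_error
```

A heavy-tailed idle-time channel would pass `sample_gpd` as `sample_off`, while an exponential channel would pass `sample_exponential` for both period types.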
As stated in Section II, the IoT device transmissions could be periodic updates or event-driven transmissions. For our simulations, we consider periodic SUs that transmit once every fixed number of frames for a fixed duration; both values are listed in Table I. The SUs that are event driven turn on with an alarm probability, and they remain in the transmitting state for an exponentially distributed amount of time [32]. The setting in which the device transmissions are event-driven represents a scenario where the payload of the secondary users is larger. The parameters are set such that there is a heavy demand for the primary channels in this case. Parameters used for the simulation are listed in Table I.
Our multistage learning algorithm has a set of tunable parameters. The exploration parameter of the channel selection stage corresponds to the fraction of times a channel is selected at random instead of through the hill climbing algorithm. This is done to ensure that all the channels are sampled enough while building the value table. The learning rate denotes the rate at which the value table is built, i.e., the weight given to a newly observed sample in comparison with the previously maintained estimate of the value of the device-channel pairing, as specified in Algorithm 3. The parameters for SPSA determine the convergence of the gradient descent algorithm, and their significance is described in Section III-F; they are kept the same for all the simulations. The threshold for collisions on each channel, as seen by the SU, can be chosen based on a variety of factors such as the nature of the primary traffic and the reliability and latency requirements of the SUs. In our formulation, we choose this threshold to be 0.1.
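The roles of the two channel-selection parameters can be illustrated with a small sketch. This is a simplified stand-in and not Algorithm 3: a plain greedy choice over the value table is used in place of the hill climbing step, and the function names and table layout are hypothetical.

```python
import random
from typing import List, Sequence


def update_value(table: List[List[float]], device: int, channel: int,
                 reward: float, learning_rate: float) -> None:
    """Blend a new sample into the running estimate for a device-channel pair."""
    old = table[device][channel]
    table[device][channel] = (1.0 - learning_rate) * old + learning_rate * reward


def select_channel(table: List[List[float]], device: int,
                   free_channels: Sequence[int], explore_prob: float,
                   rng: random.Random) -> int:
    """Pick a random free channel with probability explore_prob, else the best."""
    if rng.random() < explore_prob:
        return rng.choice(list(free_channels))
    return max(free_channels, key=lambda ch: table[device][ch])
```

A larger learning rate makes the table track recent observations more closely, while the random-selection probability keeps rarely chosen channels from being starved of samples.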
Parameter  Value 

Continuous Traffic Model-GPD  , , 
Exponential Traffic Model  
0.95  
0.05  
Frame duration ()  
Sensing duration ()  
SNR of PU at SU receiver  
Number of Channels  
Number of Secondary Users  
for periodic devices  frames 
for periodic devices  frames 
for event driven devices  
ON Time for event driven devices  
Channel error  0.05 
As mentioned in Section I, works like [39, 11, 21] do not exploit the distribution of the primary traffic to skip sensing. Therefore, we compare the proposed algorithm with the following:

Traditional: Here, the channel is sensed every frame before the data is transmitted. This setting also makes use of the channel selection algorithm given in Algorithm 3.

Genie: This method represents a channel skipping method which has perfect knowledge of the exact ON and OFF times once a channel is chosen for transmission by the channel selection algorithm given in Algorithm 3.

Raj2018: The two-stage algorithm presented in [31] addresses the case of a single cognitive user. However, its second stage, i.e., the parametric Bayesian learning used to estimate the residual OFF time, can be employed in our setting in place of Algorithm 4. Note that the channel selection framework is still according to Algorithm 3.
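The difference between these baselines in terms of sensing effort can be illustrated with a toy sketch of a single transmission burst. The setup is an assumption made for illustration only: sensing is taken to be perfect, time is counted in whole frames, and `run_burst` and the skip policies are hypothetical names.

```python
from typing import Callable, Tuple


def run_burst(payload: int, residual_off: int,
              skip_fn: Callable[[int], int]) -> Tuple[int, int, bool]:
    """Send up to `payload` frames on a channel idle for `residual_off` frames.

    `skip_fn` receives the true number of idle frames left and returns how
    many frames to transmit before sensing again. Returns
    (frames_sent, sensing_operations, collided).
    """
    sent, sensings, collided = 0, 0, False
    while sent < payload:
        sensings += 1
        remaining_idle = residual_off - sent
        if remaining_idle <= 0:
            break                       # sensing finds the PU ON; SU defers
        burst = min(max(1, skip_fn(remaining_idle)), payload - sent)
        if burst > remaining_idle:
            sent += remaining_idle      # PU returns in the middle of the burst
            collided = True
            break
        sent += burst
    return sent, sensings, collided


def traditional(_remaining: int) -> int:
    """Sense before every frame."""
    return 1


def genie(remaining: int) -> int:
    """Skip exactly the remaining idle time."""
    return remaining
```

For a payload of 10 frames and 6 idle frames, the traditional policy senses 7 times, while the genie senses twice; a learned predictor falls between these extremes and risks a collision whenever it overestimates the idle time.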
Comparison with [31] highlights the impact of using a nonparametric approach for the estimation of the residual OFF time as opposed to a parametric approach. We note that [31] itself outperforms [15, 17]. Hence, outperforming [31] indicates that we also outperform the other algorithms. We now discuss the metrics used to evaluate our algorithm. To quantify its performance, we consider the following metrics: throughput, the average number of frame collisions encountered, and the number of sensing operations performed. As the SUs can achieve different maximum throughput on different channels, we model the capacity as a random number which is fixed for each user-channel pairing. All simulations are performed for a fixed set of values for the capacity. For all the metrics, i.e., throughput, number of sensing operations and frame collisions, we plot the cumulative values up to a given time, normalized by the number of frames the SUs attempt to transmit until that time, per active SU. Taking into account the number of SUs that are ON at each time instant, for a given metric, we plot
(8) 
We now present the results of our simulations in the subsequent subsections for GPD and exponential traffic, for periodic as well as event-driven SUs. All the presented results are averaged over multiple independent iterations. Please note that in our figures, the legend is provided in one plot and the same markers are used in the other plots as well.
IV-A Results for Periodic Traffic SUs
In Figure 4, the results for the proposed algorithm in GPD primary traffic are shown. As the GPD model is heavy-tailed, we see long primary idle periods. This explains the low fraction of collisions observed. We can see that the SPSA method explores more often, since the total number of collisions is much below the allowed threshold. By doing so, it achieves higher throughput than the other variants of the proposed algorithm. The gain achieved when algorithms that skip sensing a channel are employed is also evident from both the throughput plot and the number-of-sensing-operations plot. We outperform the traditional algorithm in both metrics by a nontrivial margin. We can also see that we outperform the parametric Bayesian learning method proposed in [31] in terms of both throughput and number of sensing operations. The SPSA variant of the proposed method performs as well as a genie which has exact knowledge of the underlying channel characteristics in terms of throughput and the number of sensing operations, although the overall performance still depends on the channel selection algorithm.
The results for the exponential traffic model are shown in Figure 5. The advantage over a traditional algorithm which performs channel sensing for every frame is evident from these curves. In terms of the number of frame collisions, we can see that the algorithm learns over time to pick the channel that reduces collisions. This trend is observed for all the methods, as they all employ the same channel selection algorithm. The fraction of frame collisions encountered stays below the permissible threshold of 0.1 over time. We can see that the throughput is almost equal for all the variants. The number of sensing operations is the least for [31], as it assumes exponential primary traffic in its model. All the proposed variants perform equally well and are comparable to an all-knowing genie in terms of throughput and number of sensing operations.
IV-B Results for event driven traffic in SUs
We present the results for event-driven traffic in Table II, which lists the metrics obtained at the end of the simulation period. We can see that the SPSA variant of the proposed algorithm requires the fewest sensing operations among the learning-based algorithms and is comparable to the genie method in this respect. It also achieves the highest throughput among the methods that learn the primary traffic. Although the number of frame collisions for the proposed SPSA method is marginally higher than for the other methods, it remains below the allowed threshold. A similar trend is observed in the case of the GPD traffic model. If a brute-force search over all possible combinations is performed to assign the channels to the SUs, a higher throughput can be achieved. This comes at the cost of computational complexity and the corresponding latency during channel assignment at the central node.
Method  Avg Sensing per frame  Avg Throughput  Avg No. of Frame collisions
Traditional  0.78  2.82  0.058
Proposed Fixed  0.42  3.3  0.073
Proposed SPSA  0.31  3.42  0.08
Raj2018  0.49  3.3  0.069
Genie  0.28  3.5  0.03
IV-C Variation over different number of SUs
In this section, we discuss the performance of our algorithm as the number of SUs present in the network varies. In the case of periodic traffic, as the number of SUs in the network is increased from 5 to 30, more primary channels are sensed. In the case of exponential traffic, the number of sensing operations per frame increases from 0.1 to 0.5 per SU as the number of SUs increases. The throughput obtained also increases with the number of SUs, as we now have better estimates of the underlying channels due to the larger number of samples obtained. We also observe an increase in the number of frame collisions per frame per SU. All the proposed variants perform similarly.
In the case of SUs that are event-driven, a decrease in throughput and in the number of sensing operations is observed with an increase in the number of SUs. This is because the number of SUs that are ON is typically greater than the number of available channels. Therefore, when the channels are being used by other SUs or have already been sensed to be occupied by the PU, the remaining SUs do not sense those channels. This explains why the number of sensing operations of the SUs that are ON reduces as the number of devices in the network increases. We noticed that the SPSA variant of the proposed method outperforms the other variants in terms of the number of sensing operations required. A similar argument can be extended to the case of throughput: as the number of SUs increases, the throughput obtained at each SU is lower. Event-driven traffic usually has a longer payload; this helps leverage the continuous idle times in the primary traffic. Also, if our predicted residual time is longer than the payload of the transmitting SU, the channel is assigned to the next SU in line without sensing. This strategy helps reduce both sensing operations and frame collisions, especially in the event-driven scenario, as the secondary traffic is dense.
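The hand-over of a channel to the next SU in line can be sketched as follows. The function name and the queue representation are hypothetical, and the sketch assumes that an SU whose payload does not fit in the remaining window keeps its place at the head of the queue.

```python
from collections import deque
from typing import Deque, List, Tuple


def share_skip_window(predicted_skip: int,
                      queue: Deque[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Split a predicted sensing-free window among waiting SUs.

    `queue` holds (su_id, payload_frames) in service order. Returns the list
    of (su_id, frames_granted) covered by one sensing operation; unfinished
    payloads are put back at the front of the queue.
    """
    schedule: List[Tuple[str, int]] = []
    remaining = predicted_skip
    while remaining > 0 and queue:
        su_id, payload = queue.popleft()
        frames = min(payload, remaining)
        schedule.append((su_id, frames))
        remaining -= frames
        if frames < payload:
            queue.appendleft((su_id, payload - frames))
    return schedule
```

For example, with a predicted window of 5 frames and waiting payloads of 3, 4 and 2 frames, the first SU transmits all 3 frames and the second transmits 2 frames within the same window, so only one sensing operation is spent on both.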
IV-D Evolution of exploration factor in SPSA
To illustrate how the exploration factor is adapted, we plot its evolution over time for a single realization in high, medium and low primary traffic scenarios, as shown in Figure 6. First, a realization of heavy exponential traffic is considered, where the percentage of collisions on a specific channel is high. The algorithm is then expected to learn from the data that this is a busy channel and to refrain from exploring much. We can observe that, over time, the exploration factor approaches zero, as the percentage of collisions experienced on that channel is high. When a lighter exponential traffic is considered, the algorithm learns to adopt medium values for the exploration factor. When a relatively free channel, such as one experiencing GPD traffic, is considered, the percentage of observed collisions is much below the threshold. In this case, we can afford to explore more to leverage the allowed threshold for collisions. We can see that in this case, SPSA pushes the exploration factor towards higher values.
V Conclusion
In this paper, we proposed a multistage nonparametric learning method for spectrum access in a cognitive radio network for IoT devices. For assigning channels to the IoT devices, we combined a traditional AI technique, hill climbing, with an ε-greedy exploration strategy. Then, a nonparametric Bayesian learning method using the Dirichlet prior was employed to estimate the distribution of the residual primary OFF time, which in turn was used to predict the number of frames for which one can skip sensing the channel. Further, to leverage a given threshold for collisions, we adaptively trade off transmitting until collision against choosing the time to skip from the learnt OFF time distribution by employing a stochastic approximation method, SPSA. We show through extensive simulations that, compared to the traditional method, the proposed method requires significantly fewer channel sensing operations and achieves comparable throughput while adhering to the imposed collision threshold. In an energy-constrained scenario, this helps in improving the energy efficiency of the IoT devices.
As the IoT ecosystem grows, we will see more resource constrained devices getting into the network. Cognitive capabilities should be built into these networks, either into the devices themselves or as a central entity to respond to the rapidly evolving requirements of the heterogeneous collection of devices. In this paper, we presented a centralized learning algorithm where the intelligence is embedded in a central entity. Application of distributed learning techniques which are energy efficient could further help the devices to be more autonomous and the network to be more flexible. We believe that further effort towards the development of energy efficient AI/RL techniques for edge devices can substantially contribute to the improvement of IoT networks.
References
 [1] R. R. Yager and J. P. Espada, New Advances in the Internet of Things. Springer, 2018.
 [2] J. Mitola and G. Q. Maguire, “Cognitive radio: making software radios more personal,” IEEE personal communications, vol. 6, no. 4, pp. 13–18, 1999.
 [3] A. A. Khan, M. H. Rehmani, and A. Rachedi, “When cognitive radio meets the internet of things?” in Wireless Communications and Mobile Computing Conference (IWCMC), 2016 International. IEEE, 2016, pp. 469–474.
 [4] A. A. Khan, M. H. Rehmani, and A. Rachedi, “Cognitive-radio-based internet of things: Applications, architectures, spectrum related functionalities, and future research directions,” IEEE wireless communications, vol. 24, no. 3, pp. 17–25, 2017.
 [5] Y. Liao, L. Song, Z. Han, and Y. Li, “Full duplex cognitive radio: a new design paradigm for enhancing spectrum usage,” IEEE Communications Magazine, vol. 53, no. 5, pp. 138–145, 2015.
 [6] S. K. Sharma, T. E. Bogale, S. Chatzinotas, B. Ottersten, L. B. Le, and X. Wang, “Cognitive radio techniques under practical imperfections: A survey,” IEEE communications surveys and tutorials, 2015.
 [7] G. I. Tsiropoulos, O. A. Dobre, M. H. Ahmed, and K. E. Baddour, “Radio resource allocation techniques for efficient spectrum access in cognitive radio networks,” IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 824–847, 2016.
 [8] H. Ding, Y. Fang, X. Huang, M. Pan, P. Li, and S. Glisic, “Cognitive capacity harvesting networks: Architectural evolution toward future cognitive radio networks,” IEEE Communications Surveys & Tutorials, vol. 19, no. 3, pp. 1902–1923, 2017.
 [9] G. Stamatakis, E. Z. Tragos, and A. Traganitis, “Energy efficient collection of spectrum occupancy data in wireless cognitive sensor networks,” in Wireless Communications, Vehicular Technology, Information Theory and Aerospace & Electronic Systems (VITAE), 2014 4th International Conference on. IEEE, 2014, pp. 1–5.
 [10] T. Li, J. Yuan, and M. Torlak, “Network throughput optimization for random access narrowband cognitive radio internet of things (NB-CR-IoT),” IEEE Internet of Things Journal, 2018.
 [11] W. Ejaz and M. Ibnkahla, “Multiband spectrum sensing and resource allocation for IoT in cognitive 5G networks,” IEEE Internet of Things Journal, vol. 5, no. 1, pp. 150–163, 2018.
 [12] H. Jiang, L. Lai, R. Fan, and H. V. Poor, “Optimal selection of channel sensing order in cognitive radio,” IEEE Transactions on Wireless Communications, vol. 8, no. 1, pp. 297–307, 2009.
 [13] A. Canavitsas, L. S. Mello, and M. Grivet, “White space prediction technique for cognitive radio applications,” in Microwave & Optoelectronics Conference (IMOC), 2013 SBMO/IEEE MTT-S International. IEEE, 2013, pp. 1–5.
 [14] Z. Khan, J. J. Lehtomäki, L. A. DaSilva, and M. Latva-aho, “Autonomous sensing order selection strategies exploiting channel access information,” IEEE Transactions on Mobile Computing, vol. 12, no. 2, pp. 274–288, 2013.
 [15] Y. Pei, A. T. Hoang, and Y.-C. Liang, “Sensing-throughput tradeoff in cognitive radio networks: How frequently should spectrum sensing be carried out?” in Personal, Indoor and Mobile Radio Communications, 2007. PIMRC 2007. IEEE 18th International Symposium on. IEEE, 2007, pp. 1–5.
 [16] J. Oksanen and V. Koivunen, “An order optimal policy for exploiting idle spectrum in cognitive radio networks,” IEEE Transactions on Signal Processing, vol. 63, no. 5, pp. 1214–1227, 2015.
 [17] S. Senthilmurugan and T. Venkatesh, “Optimal channel sensing strategy for cognitive radio networks with heavy-tailed idle times,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 1, pp. 26–36, 2017.
 [18] W. Jouini, D. Ernst, C. Moy, and J. Palicot, “Multi-armed bandit based policies for cognitive radio’s decision making issues,” in Signals, Circuits and Systems (SCS), 2009 3rd International Conference on. IEEE, 2009, pp. 1–6.
 [19] W. Jouini, D. Ernst, C. Moy, and J. Palicot, “Upper confidence bound based decision making strategies and dynamic spectrum access,” in Communications (ICC), 2010 IEEE International Conference on. IEEE, 2010, pp. 1–5.
 [20] K. Liu and Q. Zhao, “Distributed learning in cognitive radio networks: Multi-armed bandit with distributed multiple players,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 3010–3013.
 [21] J. Zhu, Y. Song, D. Jiang, and H. Song, “Multi-armed bandit channel access scheme with cognitive radio technology in wireless sensor networks for the internet of things,” IEEE access, vol. 4, pp. 4609–4617, 2016.
 [22] O. Van Den Biggelaar, J.-M. Dricot, P. De Doncker, and F. Horlin, “Cooperative spectrum sensing for cognitive radios using distributed Q-learning,” in Vehicular Technology Conference (VTC Fall), 2011 IEEE. IEEE, 2011, pp. 1–5.
 [23] N. Hosey, S. Bergin, I. Macaluso, and D. P. O’Donoghue, “Q-learning for cognitive radios,” in Proceedings of the China-Ireland Information and Communications Technology Conference (CIICT 2009). ISBN 9780901519672. National University of Ireland Maynooth, 2009.
 [24] Y. Gai, B. Krishnamachari, and R. Jain, “Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation,” in New Frontiers in Dynamic Spectrum, 2010 IEEE Symposium on. IEEE, 2010, pp. 1–9.
 [25] Y. Gai, B. Krishnamachari, and M. Liu, “On the combinatorial multi-armed bandit problem with Markovian rewards,” in Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE. IEEE, 2011, pp. 1–6.
 [26] R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, and J. Palicot, “Multi-armed bandit learning in IoT networks: Learning helps even in non-stationary settings,” 2017.
 [27] Z. Zhao, Z. Peng, S. Zheng, and J. Shang, “Cognitive radio spectrum allocation using evolutionary algorithms,” IEEE Transactions on Wireless Communications, vol. 8, no. 9, 2009.
 [28] Y. Jiao and I. Joe, “Energy-efficient resource allocation for heterogeneous cognitive radio network based on two-tier crossover genetic algorithm,” Journal of Communications and Networks, vol. 18, no. 1, pp. 112–122, 2016.
 [29] S. Zheng, C. Lou, and X. Yang, “Cooperative spectrum sensing using particle swarm optimisation,” Electronics Letters, vol. 46, no. 22, pp. 1525–1526, 2010.
 [30] L. Guo, Z. Chen, and L. Huang, “A novel cognitive radio spectrum allocation scheme with chaotic gravitational search algorithm,” International Journal of Embedded Systems, vol. 10, no. 2, pp. 161–167, 2018.
 [31] V. Raj, I. Dias, T. Tholeti, and S. Kalyani, “Spectrum access in cognitive radio using a two-stage reinforcement learning approach,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 20–34, 2018.
 [32] N. Nikaein, M. Laner, K. Zhou, P. Svoboda, D. Drajic, M. Popovic, and S. Krco, “Simple traffic modeling framework for machine type communication,” in Wireless Communication Systems (ISWCS 2013), Proceedings of the Tenth International Symposium on. VDE, 2013, pp. 1–5.
 [33] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.
 [34] P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 397–422, 2002.
 [35] A. Agresti and D. B. Hitchcock, “Bayesian inference for categorical data analysis,” Statistical Methods and Applications, vol. 14, no. 3, pp. 297–330, 2005.
 [36] B. Nandram, D. Bhatta, D. Bhadra, and G. Shen, “Bayesian predictive inference of a finite population proportion under selection bias,” Statistical Methodology, vol. 11, pp. 1–21, 2013.
 [37] J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,” IEEE transactions on automatic control, vol. 37, no. 3, pp. 332–341, 1992.
 [38] L. Stabellini, “Quantifying and modeling spectrum opportunities in a real wireless environment,” in Wireless Communications and Networking Conference (WCNC), 2010 IEEE. IEEE, 2010, pp. 1–6.
 [39] L. Besson and E. Kaufmann, “Multiplayer bandits revisited,” in Algorithmic Learning Theory, 2018.