Multiarmed Bandits with Application to
5G Small Cells
Abstract
Due to the pervasive demand for mobile services, next generation wireless networks are expected to be able to deliver high date rates while wireless resources become more and more scarce. This requires the next generation wireless networks to move towards new networking paradigms that are able to efficiently support resourcedemanding applications such as personalized mobile services. Examples of such paradigms foreseen for the emerging fifth generation (5G) cellular networks include very densely deployed small cells and devicetodevice communications. For 5G networks, it will be imperative to search for spectrum and energyefficient solutions to the resource allocation problems that i) are amenable to distributed implementation, ii) are capable of dealing with uncertainty and lack of information, and iii) can cope with users’ selfishness. The core objective of this article is to investigate and to establish the potential of multiarmed bandit (MAB) framework to address this challenge. In particular, we provide a brief tutorial on bandit problems, including different variations and solution approaches. Furthermore, we discuss recent applications as well as future research directions. In addition, we provide a detailed example of using an MAB model for energyefficient small cell planning in 5G networks.
^{†}^{†}footnotetext: The authors are with the Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB, Canada (email: maghsudi.setareh@gmail.com, ekram.hossain@umanitoba.ca).Keywords: 5G cellular networks, small cells, singleagent and multiagent multiarmed bandit (MAB) problems, correlated equilibrium, Nash equilibrium.
I Introduction
Due to the everincreasing need for mobile services, we expect a massive growth in demand for wireless services in the years to come. As a result, future mobile networks are expected to accommodate new communications, networking, and energyefficiency concepts, for instance, small cells, devicetodevice (D2D) communications, and energy harvesting. In order to realize such novel concepts, system designers face some fundamental challenges. For instance, while in traditional cellular networks some channel state information (CSI) is assumed to be available at some base station (BS) and wireless devices, such information is not likely to be affordable in emerging hierarchical or distributed networks. Centralized resource management in such networks (especially in a very dense deployment scenario) will incur excessive complexity, and therefore, distributed approaches will need to be developed, which might trigger competitive user behavior. Moreover, while any planning and scheduling in distributed networks depends heavily on the available energy at network nodes, energy levels are not observable in general. Therefore, it is evident that in future, efficient and robust wireless communications design needs to deal with inherent uncertainty and lack of information, as well as possible competition for resources, in a distributed manner. As a result, it becomes imperative to search for new mathematical tools to deal with networking problems, which in general involve both uncertainty and conflict.
Multiarmed bandit (MAB) is a class of sequential optimization problems, where, in the most seminal form, given a set of arms (actions), a player pulls an arm at each round in order to receive some reward. The rewards are not known to the player in advance; however, upon pulling any arm, the instantaneous reward of that arm is revealed. In such an unknown setting, at each trial, the player may lose some reward (or incur some cost) due to not selecting the best arm instead of the played arm. This loss is referred to as regret. The player decides which arm to pull in a sequence of trials so that its accumulated regret over the horizon is minimized, or its discounted reward over the horizon is maximized.
MAB modeling brings a variety of advantages to the research in the area of wireless communications and networking; some of them are briefly described in the following.

As mentioned before, every MAB model includes uncertainty, resulted from lack of prior knowledge as well as strictly limited feedback. This corresponds to a primal and natural feature of future distributed wireless networks, where providing nodes with information yields excessive feedback and overhead cost.

In recent years, multiagent MAB has attracted great attention, which allows to introduce the concept of conflict to the seminal singleagent MAB. Multiagent MAB models can be beneficially applied to solve wireless networking problems such as distributed resource management or interference coordination problems in a distributed manner.

MAB benefits from many variations in the model and therefore is able to accommodate many distinct conditions in wireless networks, for instance different levels of information availability and different types of randomness.
In this article, we provide a brief tutorial overview of both singleagent and multiagent MABs, including different variations in the model. Furthermore, we study stateoftheart of applications of MAB models in wireless resource allocation, and discuss future research directions and open problems. Finally, we describe a new bandittheoretical model for energyefficient small cells planning in 5G networks.
Ii SingleAgent Multiarmed Bandit
Singleagent multiarmed bandit (SAMAB) problem was first introduced in [1]. In the most seminal setting, the problem models an agent facing the challenge of sequentially selecting an arm from a set of arms in order to receive an a priori unknown reward, drawn from some unknown reward generating process. As a result of lack of prior information, at each trial, the player may choose some inferior arm in terms of reward, yielding some regret that is quantified by the difference between the reward that would have been achieved had the player selected the best arm and the actual achieved reward. In such an unknown setting, the player decides which arm to pull in a sequence of trials so that its accumulated regret over the game horizon is minimized. This problem is an instance of explorationexploitation dilemma, i.e., the tradeoff between taking actions that yield immediate large rewards on the one hand and taking actions that might result in larger reward only in future, for instance activating an inferior arm only to acquire information, on the other hand. Therefore, SAMAB is mainly concerned with a sequential online decision making problem in an unknown environment. A solution of a bandit problem is thus a decision making strategy called policy or allocation rule, which determines which arm should be played at successive rounds. Policies are in general evaluated based on their regret or discounted reward performance.
As mentioned before, MAB benefits from a wide range of variations in the setting. In the event that arms have different states, the reward depends on arms’ states, and the bandit model is referred to as stateful (Markovian). Otherwise, the model is stateless, which itself can be divided to few subsets. As stateful and stateless bandits are inherently different, we discuss them separately in the following.
Iia Stateful (Markovian) Bandits
In a stateful bandit model, every arm is associated with some finite state space. Upon being pulled, each arm pays some positive reward that is drawn from some stationary distribution associated to the current state of that arm. Arms’ states change over time according to some stochastic model, which is considered to be a Markov process; that is, it satisfies the Markov property. Roughly speaking, a process satisfies the Markov property if its future depends solely on its present state rather than its full history. This type of MAB is also referred to as Markovian bandit. Note that in this formulation, after each round, the reward as well as the state of only the played arm is revealed, and the mechanism of state transition is unknown. A Markovian MAB model can be thus defined formally as , where , and

is the set of arms,

is the finite set of states of arm ,

is the stationary reward distribution of arm in state with mean ,

is the state transition matrix of arm ,

is the average reward of arm .
For Markovian bandits, the mean reward of arm , , is its expected reward under the rewards’ stationary distribution in different states. That is, . Define . Also, let denote the state of some arm at time . The selected arm at each time is denoted by . Then, for Markovian bandit, the cumulative regret up to time , denoted by , is defined as
(1) 
where the expectation, , is taken with respect to the random draw of both states and the random actions of the policy. The general problem is to find the optimal policy, where optimality is defined in the sense of minimum regret.
Stateful bandit models are conventionally divided into the two following groups: i) rested (frozen) bandits, where at each round, only the state of the played arm changes, and ii) restless bandits, where the state of every arm is subject to change after each round of play.
Markovian bandit problems are often solved using indexing policies. Roughly speaking, at each round, a real scalar value, referred to as index, is associated to each arm. The index of any arm is counted as a measure of the reward that can be achieved by activating that arm in the current state. The arm with the highest index is played at each round. The optimal indexing policy is not unique for different bandit models; it depends on a variety of factors such as reward generating processes and arms being restless or frozen. An example of an indexing policy can be found in [2].
Note that if the arms’ states are entirely observable and the state transition matrices are known, then the criterion for action selection is conventionally to maximize the discounted reward, instead of minimizing the regret. Such problems belong to bandit models, and are solved by using indexing policies as well, with an example being Gittins index [3]. It should be however noted that although the Gittins index has been calculated in closed form for a variety of stochastic reward processes, before using any indexing policy, the indexability of the problem under investigation has to be verified, which is in general not trivial. In addition to the Gittins indices, Whittle’s indexing policy [3] is wellknown and often used to solve restless bandits. This solution however does not hold in general, and restless bandit problems are intractable in many cases. Important approximation algorithms for some families of restless bandit problems can be found in [4] and [5].
IiB Stateless Bandits
In stateless bandit model, arms do not have any specific state. The arms’ reward generating processes, however, are not necessarily stationary. A stateless MAB model is thus formally defined as , where , and

is the set of arms, and

is the average reward of arm at time .
Based on the nature of reward generating process, stateless MABs can be divided into few categories. Before proceeding to describe these categories, however, we provide a notion of regret that is widely used to analyze almost every stateless bandit problem, regardless of the category to which it belongs. Let and denote, respectively the instantaneous reward of arm and the selected arm, both at time . Then, the regret of any policy for stateless bandits up to time , referred to as external regret, which is denoted by , is defined as
(2) 
Given the definition of external regret, the optimal solution of a stateless bandit problem is a policy that minimizes the external regret. Now we are in a position to study two important branches of stateless bandit problems, namely, stochastic and adversarial bandits.
IiB1 Stochastic Bandits
Stochastic MAB is a stateless bandit model, where arms are not associated with states, and the reward generating processes are stochastic. In other words, the series of rewards of each arm are stateindependent, drawn from some specific density function, which can be stationary or nonstationary. For the stationary case, for all . In this case, the first term on the right hand side of (2) yields where . For the nonstationary case, the first term yields , with .
In order to solve the stochastic bandit problem, many methods have been developed so far that are based on the upper confidence bound (UCB) policy [6]. The basic idea to deal with the explorationexploitation dilemma here is to estimate an upper bound of the mean reward of each arm at some fixed confidence level. The arm with the highest estimated bound is then played. Such methods, however, are only applicable when the rewards of each arm are independent and identically distributed, or in other words, the arm is stationary. In order to deal with nonstationarity, algorithms mostly use some statistical test in order to detect changes in the distribution. To elaborate, consider a sequence of independent random variables with some density that depends on some scalar parameter . The initial value of this parameter is , and after an unknown point of time, its value changes to . Changepoint detection is performed by using some statistical test, for example, generalized likelihood ratio or PageHinkly test, which not only identifies the changes, but also estimates the change time. Note that in general it is assumed that not too many of such change points exist. There are also some adapted versions of the seminal UCB algorithm that detect changes in the sample mean reward of arms in order to deal with timevariant statistical characteristics. For example, in discounted UCB, the rewards are weighted so that recent outcomes are emphasized when calculating the sample mean rewards, based on which the next arm is selected. Or, in slidingwindow UCB, the sample mean reward of each arm is calculated only over a fixedlength window of recent rewards, and not by using the entire reward history.
IiB2 Adversarial (Nonstochastic) Bandits
Adversarial bandit models are similar to stochastic ones in being stateless, with the difference that the series of rewards of each arm cannot be attributed to any specific distribution; in other words, reward are nonstochastic. Generally, in an adversarial setting, mixed strategies are used to select an arm. That is, at each time , the agent selects a probability distribution over arms, and plays arm with probability . In such a setting, in addition to external regret given by (2), we define another important notion of regret, namely internal regret, as follows:
(3) 
By definition, external regret compares the expected reward of the current mixed strategy with that of the best fixed action in the hindsight, but fails to compare the rewards achieved by changing actions in a pairwise manner. Internal regret, in contrast, compares actions in pairs. Later, we shall see that the notion of internal regret is in close relation with correlated equilibria in games.
Adversarial bandits are often solved by using potentialbased or weighted average algorithms [6]. In the most seminal form of this approach, at each trial, a mixed strategy is calculated over the set of actions. The selection probability of each action is proportional to its average regret performance in the past, possibly weighted by a specific potential function (for example the exponential function). Accordingly, the actions with better past performance (lower average regret) are more likely to become activated in the future, and vice versa.
IiC Other Important Bandit Models
IiC1 Contextual (Covariate) Bandits
In the basic bandit model, at each round, the agent only observes the reward of the played actions, and hence has to select its future actions only based on its past performance. In contextual bandits (also called bandits with side information or covariate bandits), in contrast, at each round of decision making, some side information is revealed to the learning agent. The agent therefore learns the best mapping of contexts to arms. Note that this type of bandit models is distinguished only based on information availability, and the reward process can be still Markovian, stochastic or adversarial. In any case, aforementioned algorithms are adapted to solve contextual bandit problems as well.
IiC2 Mortal Bandits
While in the basic bandit model it is assumed that all arms are infinitely available, there are also some variants that distinguish the problems where this assumption does not hold. For instance, mortal bandits assume that arms are available only for a finite time. The availability time can be either deterministic or stochastic, known or unknown. An example of solution approaches can be found in [7]. The algorithm is called stochastic with early stopping, and the main idea behind it is to abandon an arm as soon as realizing that its maximum possible future reward may not justify its retention. The analysis of such schemes is inherently different from the general bandit model, since in such problems the achievable accumulated reward from each arm is bounded even over infinite horizon.
IiC3 Sleeping Bandits
Sleeping bandits refers to bandit problems where action set is timevarying. Clearly, in such models, not only the reward process, but also the arm availability at each round can be Markovian, adversarial or stochastic. Some solution approaches for different types of this problem can be found in [8]. The core idea of these algorithms is to change the notion of regret from the basic one we have defined before. While the basic notion is based on the performance loss with respect to the best action in the hindsight, here the best ordering of actions serves as the benchmark, as the best action might not be available in some rounds.
Iii MultiAgent (GameTheoretical) Multiarmed Bandits
In multiagent multiarmed bandits (MAMAB), each player is assigned an action set . Similar to the singleagent model, each agent selects an action in successive trials to receive an initially unknown reward; if multiple agents select an arm, the achieved reward is shared in an arbitrary manner. Therefore, in multiagent setting, the reward of each agent depends on the joint action profile of all agents. Note that the action set, the played action and the reward achieved by each agent can be regarded as either private or public information, based on the specific system model and problem formulation. From the view point of each agent , an MAMAB model can be seen as a game with two players: the first player is agent itself, and the second player is the set of all other agents, whose joint action profile affects the reward achieved by agent . This game might belong to any conventional category of games; for instance, it can be a potential or zerosum game.
Given no prior information, every agent requires to interact with the random environment in order to solve the problems that arise in an unknown reactive model, for instance longterm accumulated reward maximization, or average regret minimization. As a result of competition, which is inherent in noncooperative multiagent systems, singleagent bandit models that ignore the possible conflicts among multiple agents do not yield satisfactory outcomes; in particular, equilibrium is not guaranteed to be achieved. This is when game theory and multiarmed bandits meet and complement each other. Thereby, equilibrium arises as an asymptotic outcome of repeated interactions in an unknown environment among learning agents with bounded rationality that are provided with strictly limited information and aim at achieving longterm optimality in some sense. The aforementioned problem, i.e., efficient and convergent learning in a multiagent setting, is currently under intensive investigation, mainly in computer science and mathematics. In the theory of wireless communications, there is also some ongoing research on gametheoretical bandits. In the following, we describe some important results in this area, with an emphasis on convergence to equilibrium.
Iiia Convergence to Correlated Equilibrium
Recall the definition of internal regret given by (3) in Section (IIB2). The following (simplified) theorem describes the relation between internal regret and the concept of correlated equilibrium in games.^{1}^{1}1The definition of correlated and Nash equilibrium are standard and therefore omitted here for the brevity of the article.
Theorem 1 ([9]).
Consider a multiagent bandit game with a set of players , and let denote the internal regret of player at round . If all agents play according to some policy that exhibits perround vanishing internal regret (i.e., ), then the game converges to the set of correlated equilibrium in a timeaverage sense.
This concept is used for instance in [10] to develop joint power control and channel selection strategies in distributed D2D networks.
Besides internal regret, there is another concept, namely calibrated forecasting, which is related to correlated equilibria in games. Before explaining this relation, we briefly describe calibrated forecasting in the following. Consider a random process with a set of outcomes. A forecaster predicts the outcome of the random process sequentially, and is referred to as calibrated if, in the limit, the predicted outcome is equal to the true one. Now, consider a game with players, where each player selects its action from action set (pure strategies). Then, for any player , the joint action profile of its opponent is a random process with its set of outcomes being , where denotes the Cartesian product. In such games, each player might apply a forecaster in order to predict the next joint action profile of its opponents, and uses that prediction to move strategically, for instance, by playing the best response to the predicted joint action profile. Note, however, such prediction naturally involves learning, and therefore, requires the knowledge of past actions of opponents; in other words, players should be able to observe the actions of each other. In the following theorem, we describe the relation between calibrated forecasting and correlated equilibria.
Theorem 2 ([9]).
In any game, if every player plays by best responding to a calibrated forecast of its opponents’ joint action profile, then the game converges to the set of correlated equilibrium in a timeaverage sense.
Note that Theorem 2 declares the convergence for general fullinformation games, where the game matrix is known to all players a priori. The theorem is generalized in [11] to bandit games, and is applied as a basis to develop a convergent solution approach for nonstationary stochastic bandit models. The core idea is to combine calibrated forecasting with nonparametric regression, so that after some time, each player has some accurate estimate of the reward process of actions as a function of joint action profile and the next move of opponents.
IiiB Convergence to Nash Equilibrium
While they are only few solution approaches for multiagent bandit games that guarantee a convergence to correlated equilibria, converging to Nash equilibria seems to be an even more challenging task. There are some convergent approaches for special game classes such as potential games. For instance, in [12], algorithms are proposed for Nash convergence in potential games and games with more general forms of acyclicity. The basic idea therein is to combine Qlearning with stochastic betterreply or stochastic adaptiveplay dynamics in order to develop a multiagent version of Qlearning, which estimates the reward functions using novel forms of the greedy learning policy. Details can be found in [12]. While multiagent Qlearning converges only for some specific game classes, there are also some algorithms that converge in general games. A concept based on which multiple convergent bandit algorithms are proposed for general games is regret testing [9]. Here the basic idea is to perform some sort of exhaustive search, in the sense that each player first selects a mixed strategy according to some predefined protocol, and then checks whether i) the incurred regret is less than a specific threshold; and ii) the selected mixed strategy is an approximate best response to the others’ mixed strategies. Clearly, none of these can be concluded in one round of play; but, if the same mixed strategy is played during sufficiently many rounds, then each player can simply test the corresponding hypothesis. Different variants of this concept exhibit different convergence characteristics, for instance, convergence to Nash equilibrium in a timeaverage sense or convergence to the set of Nash equilibria, both for every generic game.
In the area of wireless communications, most works consider a specific game (defined by a specific utility function) and propose a gametheoretical learning algorithm, whose convergence is established by some analysis based on the defined utility function. Such solutions, however, suffer from limited applicability, as they are tailored to converge only for the predefined games.
Iv StateoftheArt and Future Research Directions
In what follows, we first briefly review the current applications of MABs (in both singleagent and multiagent settings). Afterwards, we discuss some open problems and potential research directions. Note that our focus in this article is on the applications of MAB models in solving resource allocation problems.
Iva StateoftheArt
IvA1 Distributed Channel Selection
Distributed channel selection is a widelyconsidered application of MABs. When cast as a MAB, channels are considered as arms, and reward processes are some functions of signaltonoise ratio (SNR) or signaltointerferenceplusnoise ratio (SINR). Reward functions might evolve as Markovian, stochastic or adversarial, and any multiple access protocol, orthogonal and nonorthogonal, can be addressed by using bandit models. While singleuser problem has been investigated intensively so far, multiuser scenario is still under investigation. As described in Section III, in a multiuser setting, it is important to address the conflict among users by converging to equilibrium in some sense. This can be done by using multiagent bandit algorithms, described before. Also, a key point is that the selected channel of users might be sometimes observable. Such information can be used as side information to build contextual bandit models. See [11] for an example.
IvA2 Opportunistic Spectrum Access
Opportunistic spectrum access in cognitive radio networks can be formulated as a (rested or restless) Bernoulli bandit problem. In doing so, each channel is an arm that is either in state 0, interpreted as busy, with probability , or in state 1, interpreted as idle, with probability . Since a secondary user is only allowed to transmit through idle channels, a reward is solely associated with state 1. Channel states, as well as parameter are unknown, and a secondary user, modeled as agent, observes the state of the selected channel only. Through transmission time, a secondary user aims at maximizing its reward by selecting some channel that is available with high probability. The problem can be then solved by using indexing policies. An example can be found in [2].
IvA3 Sensor Scheduling
In sensor networks, the number of sites to be surveyed is very often larger than that of available sensors. For instance, for military surveillance and in smart grids, it is important to find the places where there is attack or failure. Such problems can be addressed by using MABs, where sites are modeled as arms with two possible states, similar to the opportunistic spectrum access problem. Furthermore, if sensors are used for transmission, and the set of residual energy of each sensor is defined on a discrete space, sensor scheduling can be formulated as a MAB problem, aiming at minimizing the energy consumption, while maximizing the transmission performance. Reference [13] is an example.
IvA4 Transmission Mode/Relay Selection
A relay selection problem arises in twohop (or multihop) transmission, where multiple relays are available to be used in order to improve the transmission performance. To formulate the relay selection problem as a MAB, the reward of each relay, modeled as an arm, is defined as the achievable transmission rate through that relay, which is the minimum achievable rate of first and second hops. Similarly, a problem of transmission mode selection is a natural challenge in distributed hierarchical networks, for example, in a D2D communications system underlaying a cellular infrastructure. In such a hierarchy, any pair (or group) of users can choose to use the traditional cellular transmission mode via a BS, or to establish direct communication link. This problem can be simply cast and solved by a twoarmed bandit model, as proposed in [14].
IvA5 Power Control
Power control is one of the lessexplored applications of MAMABs. In order to model such problem as a bandit game, a finite discrete set of power levels is considered as the set of arms, and the reward process is defined to be some function of SINR, where, as in channel selection, interference represents the mutual impact of agents. In comparison to channel selection, in a power control problem, the assumption of full or partial monitoring is not realistic. That is, agents might not be able to observe the actions, i.e., the transmission power levels, of each other. Hence, algorithms should cope with the socalled no monitoring, where agents’ actions are regarded as private. An example can be found in [10].
IvA6 Energy Harvesting
Energy harvesting is a relatively new concept attracting increasing interest. In a network where nodes are equipped with energy harvesting units, electricity is generated from surrounding environment (e.g., solar energy, ambient RF signals). Implementing this concept gives rise to some challenges, in particular with respect to scheduling, as the energy arrival is in general not deterministic. Under the assumption that the energy harvesting process is Markovian, MAB models can be used to address the scheduling problem. In the model, similar to the cases before, the reward function is the achievable transmission rate while the unknown variant is the energy harvesting process and/or the battery state. See [15] as an example. Application of MAB theory to this area is relatively new and a variety of open problems exist.
IvA7 Few Other Applications
Although in this article we restrict our attention to wireless resource management, it should be noted that MAB models find applications in a variety of other networking problems. An important application is routing, where on the network graph, every endtoend path consists of multiple edges, and the cost of each edge, sometimes timevarying, is unknown a priori. Moreover, after transmitting through each path, the only feedback is assumed to be the endtoend cost of the selected path. Such a problem can be modeled and solved using various MAB settings in accordance with the network characteristics. See [16] for an example. Another field of research benefiting from bandit theory is security. One example is to combat jamming attacks, where an adversary transmits signals to interfere with normal communications and temporarily disables the network. Under the assumption that the transmitter and the receiver do not preshare any secrets with each other, the channel selection problem can be modeled and solved by using MAB. An example can be found in [17].
IvB Future Research Directions
As described in the previous section, a variety of resource management problems have been investigated by using bandit models; nonetheless, in most research studies, the development of a model and solution is focused on using some infinitehorizon variants of bandit problems, with an emphasis on the type of reward evolution, namely, Markovian, stochastic or adversarial. In other words, models are often distinguished only based on arms’ reward generating processes, disregarding many other important and practical aspects such as budget constraints, information availability, number of arms, arms’ dependencies and number of arms to be selected at each round, to name just a few. To clarify this shortcoming, some examples are provided in the following.
IvB1 Multimedia Transmission
Most current bandittheoretical models for wireless networking problems are analyzed asymptotically, assuming that the transmission may continue for infinite time. Thereby, optimality is defined in terms of longrun discounted reward maximization and/or average regret minimization. There are, however, many scenarios such as multimedia transmission, where transmission has to be completed under strict delay and/or energy constraints, which restricts the number of possible transmission rounds. In such scenarios, algorithms that achieve optimality in some longrun sense are not applicable; indeed, here the goal is merely to perform some task successfully under some budget constraint rather than optimizing the asymptotic performance. Thus, it is imperative to develop new optimality conditions, as well as new solution approaches, for applied bandit models in finite horizon.
IvB2 Wireless Sensor Networks
To the best of our knowledge, current research studies, which apply bandit models to solve networking problems, assume that arms, representing resources such as channels or relays, are available infinitely often during networking tasks such as channel selection or routing. This assumption, however, is not always valid. For instance, in a sensor network, a sensor might become unavailable as soon as the energy supply is exhausted. In cognitive radio networks, as another example, a channel becomes busy (unavailable) when a primary user starts to transmit through that channel. In a bandit model, this effect corresponds to restricted arm availability. In other words, each arm might be available only for some limited number of trials. Therefore, new definitions of regret and new algorithms need to be developed.
IvB3 Correlated Resource Selection
Another practical issue, currently being neglected, is arms’ dependencies. While in a great majority of models it is assumed that arms are independent, it is not always the case in practice. Indeed, resources, such as relays and channels, might be correlated in many cases. For instance, in a cognitive radio with fixed number of primary users, primary channels can be considered as dependent since a channel being mostly occupied implies that other channels might be less busy. Exploiting such dependencies as a source of information about arms yields efficient algorithmic solutions to the bandit, and thereby, the resource management problem.
IvB4 MultiAgent Setting
Besides modelrelated inefficiencies to be alleviated, there are some open problems in particular in multiagent settings. Although in such settings achieving an equilibrium is beneficial for the entire network, there exist a variety of antiequilibrium reasonings that should be addressed. For instance, a general problem is to cope with slow convergence to equilibrium, in particular for large number of arms and/or players. Another important problem is equilibrium selection. The problem is concerned with guiding the learning agents to the most efficient equilibrium, when multiple equilibria exist. In cases, where such convergence is not feasible, an analysis of equilibrium efficiency is highly desired. Moreover, in some scenarios, it is desired to have a fair solution rather than an efficient one. Note that the problems mentioned above exist also in fullinformation case, but become aggravated in the bandit setting, where only the reward of the activated arm is revealed to the player.
V Efficient 5G small cells with combinatorial (multiplay) bandits
In Section IV, we briefly studied stateoftheart, thereby clarifying the application of MABs to solve a variety of problems that arise in wireless networks. Moreover, we discussed some issues and open problems. In this section, we introduce a new application of MABs in next generation wireless networks, to design energyefficient 5G small cells. Primarily, we consider a singleagent model with independent arms. In contrast to stateoftheart, however, in our bandit model, multiple arms are selected at each round. We afterwards provide some sketch to generalize the model to a multiagent scenario, and also to the case, where arms are dependent, i.e., they affect each other in some specific sense.
Facing an everincreasing influx in data traffic, 5G networks are foreseen to alleviate this problem by deploying dense small cells to underlay the legacy macorcellular networks. This takes advantage from lowpower and shortrange base stations that operate (preferably) using the same radio spectrum as the macro base stations and offload macro cell traffic [18]. In order to maintain low cost, small cells are desired to be selforganized and energyefficient. As a result, a small cell is activated only if it improves the overall network performance. Efficient small cell activation can be performed through dynamic small cell on/off by a macro cell. Making smart decisions, however, requires some information; for instance, the available energy or the number of potential users at each small cell, which are both random in nature. Such information is however notoriously difficult to acquire specifically when numerous potential small cells exist. Therefore, it is reasonable to search for a solution so that a macro cell activates small cells in an efficient manner given only limited information.
Formally, consider a macro BS operating in conjunction with a set of potential small cells. Every small cell acquires its required energy (at least partially) through energy harvesting. That is, while sleeping, it gains energy from the environment through energy harvesting units, and uses the stored energy to provide services in its active periods. The amount of energy that is gained during each sleeping period depends on environmental factors (e.g., the weather) and can change adversarially. The amount of energy spent during each servicing period depends on the number of users, which can change as well due to user mobility.^{2}^{2}2Clearly, both harvested energy and number of users can be modeled as stochastic (stationary or nonstationary) variables, as well. Activating a small cell costs energy even if it does not serve any user; thus, at each time (or for every timeperiod), a macro BS aims at activating a set of small cells. Every activated small cell should be able to afford the required energy to provide services to a large enough number of users, so that the initial fixed activation cost diminishes. Moreover, it is desired to select the subset of small cells in the sense that the overall network performance is optimized. The available energy and the number of users in small cells are however unknown to the macro BS, which complicates the decision making. We formulate this problem as an SAMAB, by modeling small cells as arms.
We consider a simple model, where any small cell requires energy units to serve a user with data rate . The cost per energy unit is denoted by . For initial activation, units of energy is necessary, even if no user is served. Let and be the available energy and the number of users in small cell at time . Define . Then the utility (reward) of selecting a small cell at time is defined as . The total utility of selecting some subset therefore yields .
Despite similarities, this problem is not identical to the adversarial bandit problem described in Section IIB2, due to the following reason. Unlike the standard adversarial bandit problem, here multiple arms, and not a single arm, are selected at each round of decision making. This type of bandit problems is referred to as combinatorial or multiple play bandits, which stands in contrast to the standard single play problems. For combinatorial bandits, the conventional definition of regret given by (2) does not hold. In fact, in order to calculate the regret where a subset of arms is selected at each round, the reward achieved by the selected subset should be compared with the best subset of arms in terms of average reward. The problem can therefore be solved by using algorithms described in Section IIB2, for instance, weightedaverage strategy, by considering each set of actions as a super action, and some other simple adaptations; Nevertheless, such approach is computationally inefficient, and therefore, new algorithms are developed that are specifically tailored for combinatorial bandit problems. An examples, EXP3.M, can be found in [19]. It should be however mentioned that most of algorithms are based on similar concepts and hence exhibit similar performances.
In order to briefly evaluate the proposed model, we consider three networks with small cells, from which a macro BS selects to activate, correspondingly. Note that, with and for example, there are 60 different 4tuples of small cells that can be selected by the macro BS, which we call action and label as . Number of users and residual energy of each small cell is selected randomly at each trial, without assuming any specific density function. Moreover, the macro BS does not have any prior information. To perform selection, the macro BS uses the bandit model described before. The average utility versus the best possible subset (in an average sense), which is selected through exhaustive search, is shown in Fig. 3. It can be seen that given enough time, the utility of the proposed model converges to that of the best selection.
For and , Fig. 4 shows the percentage of time, in which each one of the 3tuples is selected by the macro BS, in trials. From the figure, the best action (here action 19) is played almost all the time. Figures for the other two settings are similar, i.e., the best action is played very often. A large fraction of the time that is spent to play other actions is the exploration time. In general, the number of trials dedicated to exploration depends on the algorithm; nonetheless, in most algorithms, exploration time is mostly determined by an exploration parameter, say . Roughly speaking, a trial is an exploration trial with probability and an exploitation trial with probability . Therefore, a larger value of yields a larger exploration time on average.
The performance evaluation might be improved for instance by investigating the amount of saved energy (cost) in comparison with number of users not being served by small cells, both of which depend on the number of activated small cells, . Thereby, can be regarded as a variable to be optimized according to system’s statistical characteristics, instead of being fixed and predefined. The question whether such optimization can be included in a bandit model or not is an open problem to be investigated in future works.
It is worth mentioning that the complexity of the proposed small cell planning model depends on the algorithm being used to solve the formulated bandit problem. For EXP3.M [19], the complexity is of and , in time and space, respectively. The growth of complexity is shown in Fig. 5. Due to linear/logarithmic complexity, the proposed approach can be considered to be scalable with the number of small cells; nonetheless, there are scenarios in which distributed solutions to the problem of efficient small cell planning are desired. In such scenarios, each small cell decides whether to start the active mode or to remain passive. Similar to the centralized case, the decision is based on available energy and expected number of users. In case such information is not available at small cells, the problem can be modeled as a twoarmed bandit problem, with one safe arm (passive mode), and one risky arm (active mode). While the passive mode results in no reward, the reward of active mode can be modeled as an adversarial or stochastic process, which depends on the energy level as well as number of users arriving in each small cell. The problem might have different solutions depending on the type of interactions among small cells, which can be competitive or cooperative, for instance. In the event small cells are fully decoupled, the problem reduces to independent twoarm bandit problems. A complete investigation of distributed small cell on/off is out of the scope of this article and left for our future work.
In addition to considering a distributed model, another direction to improve the current model is to take the dependencies of small cells into account. In the formulated problem, such dependencies can mainly be concluded based on geographical reasons. In fact, nearby small cells are expected to experience similar energy harvesting opportunities (for example, during sunny weather in a neighborhood), as well as similar user arrival statistics. Therefore, by selecting each small cell and observing the reward, some information can be also derived about its neighboring cells. This information is beneficial to shorten the exploration period, thereby increasing the accumulated reward or achieving optimal average reward in a shorter time. Detailed study of such dependencies is left for our future work.
As the last remark, it should be mentioned that the application of MAB models to 5G small cell goes beyond the efficient small cell activation problem studied here. An instance is the user association problem. While the existence of small cells improves the coverage and capacity performances, user association becomes more challenging due to following reasons. In sharp contrast to conventional cellular networks, small cells are not always active. In addition, the size of each small cell is subject to change as a function of user density, for instance. Moreover, a user may associate to a macro BS or a nearby small cell or even to both a macro BS and a small cell (or even to multiple small cells). The user association problem in a small cell network can be also modeled as a bandit game, either in singleagent (centralized) or multiagent (distributed) settings. While a centralized approach improves the possibility of interference coordination, a distributed approach reduces the complexity and overhead. Another problem that can be approached by using MAB models is intercell interference coordination problem in small cell networks, which arises due to a dense cell deployment of small cells.
Vi Summary and Conclusion
MAB is a class of sequential decision making problems under strictly limited prior information as well as feedback. By providing an overview of MABs, we have argued that a wide range of wireless networking problems, including resource management, security, routing, scheduling and energy harvesting, can be formulated and solved as a bandit problem. We have also reviewed stateoftheart and applications of MABs, with an emphasis on wireless resource allocation. There are a number of open research directions to be explored, which have been briefly outlined. Finally, we have provided a detailed example of an application of MAB in energyefficient 5G small cells, where the problem of optimal small cell planning is cast as a multiplay (combinatorial) bandit problem. Preliminary performance evaluation results have established the effectiveness of the proposed model and approach.
References
 [1] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527–535, 1952.
 [2] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multiarmed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, March 2013.
 [3] J. Gittins, K. Glazebrook, and R. Weber, MultiArmed Bandit Allocation Indices, Wiley, 2 edition, 2011.
 [4] S. Guha, K. Munagala, and P. Shi, “Approximation algorithms for restless bandit problems,” Journal of ACM, vol. 58, no. 1, pp. 3, 2010.
 [5] D. Bertsimas and J. NinoMora, “Restless bandits, linear programming relaxations, and a primaldual index heuristic,” Operations Research, vol. 48, no. 1, pp. 2000, 2000.
 [6] S. Bubeck and N. CesaBianchi, “Regret analysis of stochastic and nonstochastic multiarmed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
 [7] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal, “Mortal multiarmed bandits,” in Neural Information Processing Systems Conference, 2008, pp. 273–280.
 [8] R.D. Kleinberg, A. Niculescumizil, and Y. Sharma, “Regret bounds for sleeping experts and bandits,” in Conference of Learning Theory, 2008, pp. 425–436.
 [9] N. CesaBianchi and G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.
 [10] S. Maghsudi and S. Stanczak, “Joint channel selection and power control in infrastructureless wireless networks: A multiplayer multiarmed bandit framework,” IEEE Transactions on Vehicular Technology, vol. PP, 2014.
 [11] S. Maghsudi and S. Stanczak, “Channel selection for networkassisted D2D communication via noregret bandit learning with calibrated forecasting,” IEEE Transactions on Wireless Communications, vol. 14, no. 3, pp. 1309–1322, March 2015.
 [12] A.C. Chapman, D.S. Leslie, A. Rogers, and N.R. Jennings, “Convergent learning algorithms for unknown reward games,” SIAM Journal on Control and Optimization, vol. 51, no. 4, pp. 3154–3180, 2013.
 [13] Y. Chen, Q. Zhao, V. Krishnamurthy, and D. Djonin, “Transmission scheduling for sensor network lifetime maximization: A shortest path bandit formulation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, May 2006, vol. 4, pp. IV–IV.
 [14] S. Maghsudi and S. Stanczak, “Transmission mode selection for networkassisted device to device communication: A levybandit approach,” in IEEE International Conference on Acoustics, Speech and Signal Processing, May 2014, pp. 7009–7013.
 [15] P. Blasco and D. Gunduz, “Multiaccess communications with energy harvesting: A multiarmed bandit model and the optimality of the myopic policy,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 3, pp. 585–597, March 2015.
 [16] R. Kleinberg B. Awerbuch, “Online linear optimization and adaptive routing,” Elsevier Journal of Computer and System Sciences, vol. 74, pp. 97–104, 2008.
 [17] Q. Wang, P. Xu, K. Ren, and X.Y. Li, “Towards optimal adaptive UFHbased antijamming wireless communication,” IEEE Journal on Selected Areas in Communications, vol. 30, no. 1, pp. 16–30, January 2012.
 [18] E. Hossain, M. Rasti, H. Tabassum, and A. Abdelnasser, “Evolution toward 5G multitier cellular wireless networks: An interference management perspective,” IEEE Wireless Communications, vol. 21, no. 3, pp. 118–127, June 2014.
 [19] T. Uchiya, A. Nakamura, and M. Kudo, “Algorithms for adversarial bandit problems with multiple plays,” in Algorithmic Learning Theory, Oct 2010, pp. 375–389.