Exploiting Correlation in Finite-Armed Structured Bandits
We consider a correlated multi-armed bandit problem in which the rewards of the arms are correlated through a hidden parameter. Our approach exploits the correlation among arms to identify some arms as sub-optimal and pulls them only O(1) times. This results in a significant reduction in cumulative regret, and in fact our algorithm achieves bounded (i.e., O(1)) regret whenever possible; explicit conditions needed for bounded regret to be possible are also provided by analyzing regret lower bounds. We propose several variants of our approach that generalize classical bandit algorithms such as UCB, Thompson sampling, and KL-UCB to the structured bandit setting, and empirically demonstrate their superiority via simulations.
The multi-armed bandit (MAB) problem falls under the umbrella of sequential decision-making problems. In the classical K-armed bandit formulation, a player is presented with K arms. At each time step t, she decides to pull an arm k and receives a random reward with unknown mean μ_k. The goal of the player is to maximize her cumulative reward. In order to do so, the player must balance exploration and exploitation of arms. This classical K-armed bandit formulation assumes independence of the rewards of different arms. However, in many online learning problems, such as dynamic pricing and drug dosage optimization, there is a correlation between the rewards of different actions.
Motivated by this, we consider a correlated multi-armed bandit problem in which the mean rewards of different arms are related through a common hidden parameter θ. Specifically, the expected reward of arm k is μ_k(θ). In the setting considered, the mean reward functions μ_k(·) are known to the player, but the true value θ* of the shared parameter is unknown. The dependence on the common parameter introduces a structure in this MAB problem. This makes the model interesting, as the rewards observed from one arm can provide information about the mean rewards of other arms. Similar models have been considered in [2, 3, 4], but as explained in Section 2.2, we consider a more general setting that subsumes the models in [2, 3, 5].
There are many applications where the structured bandit problem described above can be useful. For instance, in the dynamic pricing problem, a player needs to select the price of a product from a finite set of prices, and the average revenue in each time slot is a function of the selected price and the market size θ. These functions are typically known from the literature, but the pricing decisions need to be made without knowing the market size, such that the total revenue is maximized; hence, this problem fits perfectly in our setting. Prior work provides a similar example for the purpose of advertising, in which a company needs to decide what form of advertising to purchase so as to maximize its profit. The problem setup is also relevant in system diagnosis, where θ represents the unknown cause of a failure in the system, and the functions μ_k(θ) represent the responses of the system to different actions. Other applications of this model include cellular coverage optimization and drug dosage optimization. Our general treatment of the structured bandit setting makes our work applicable to all of these problems.
2) We develop a novel approach that exploits the structure of the bandit problem to identify sub-optimal arms. In particular, we generate an estimate of θ at each round to identify competitive and non-competitive arms. The non-competitive arms are identified as sub-optimal without having to pull them. We refer to this identification as implicit exploration. This implicit exploration is combined with traditional bandit algorithms such as UCB and Thompson sampling to design the UCB-C and TS-C algorithms.
3) Our finite-time regret analysis reveals how this idea leads to a smaller regret as compared to UCB. In fact, the proposed UCB-C algorithm ends up pulling non-competitive arms only O(1) times. As a result, only C out of the K arms are pulled O(log T) times. The value of C can be much smaller than K and can even be 1, in which case our proposed algorithm achieves bounded regret! Our analysis reveals that the proposed algorithm achieves bounded regret whenever possible.
4) The design of UCB-C makes it easy to extend other classical bandit algorithms (such as Thompson sampling and KL-UCB) to the structured bandit setting. Such an extension was deemed difficult for the previously proposed UCB-S algorithm.
5) We design two variants of UCB-C, namely UCB-int and UCB-min, and demonstrate the empirical superiority of the proposed algorithms in different scenarios.
2 Problem Formulation
2.1 System Model
We consider a K-armed bandit setting in which the rewards of the arms are correlated. As shown in Figure 1, we assume that the mean reward of each arm depends on a common hidden parameter θ. At each time step t, the player pulls an arm k_t and observes a random reward r_t.
The reward obtained at time step t is a random variable with mean μ_{k_t}(θ*), where θ* is a fixed unknown parameter lying in a known set Θ; our formulation allows the set Θ to be countable or uncountable. The functions μ_k(·) are known to the player, but the true value θ* of the parameter is unknown. The parameter θ can also be a vector.
The objective of the player is to maximize her cumulative reward in T rounds. If the player knew the true value θ*, then she would always pull the arm with the highest mean reward at θ*, as that would maximize the expected cumulative reward. Motivated by this, we call the best arm for the true parameter θ* the optimal arm, denoted k*. The sub-optimality gap of arm k, denoted Δ_k, is defined as the difference between the mean reward of the optimal arm and that of arm k, i.e., Δ_k = μ_{k*}(θ*) − μ_k(θ*). The performance of a player is evaluated by the cumulative regret, defined as Reg(T) = Σ_k Δ_k n_k(T).
Here, n_k(T) is a random variable denoting the number of times arm k is pulled in a total of T time slots. The cumulative regret quantifies the performance of the player in comparison to an oracle that pulls the optimal arm at each time slot. Thus, the smaller the regret, the better the performance of the player.
As mentioned earlier, the player only knows the mean reward functions and not the reward distributions themselves. Throughout the paper, we assume that the rewards are sub-Gaussian with variance proxy σ², and that σ² is known to the player. Both of these assumptions are common in the multi-armed bandit literature [4, 10, 11, 12, 13]. In particular, the sub-Gaussianity of the rewards enables us to apply Hoeffding's inequality, which is essential for the regret analysis.
We would like to highlight that we make no assumptions on the functions μ_k(·), unlike some previous works ([2, 3, 5]) that place restrictive assumptions on them. Due to the general nature of our setup, our model subsumes these previously studied frameworks and is applicable to much more general scenarios as well. The similarities and differences between our model and existing studies are discussed next.
2.2 Connections with Previously Studied Bandit Models
Classical MAB. Under the classical multi-armed bandit setting, the rewards obtained from different arms are independent. By taking θ to be a K-dimensional vector and setting μ_k(θ) = θ_k, our setting reduces to the classical MAB setting. Our proposed algorithm will in fact perform UCB/Thompson sampling ([1, 8]) in this special case.
Global Bandits. Prior work on global bandits studies a model in which the mean reward functions depend on a common scalar parameter. A key assumption there is that the mean reward functions are invertible and Hölder-continuous. Under these assumptions, it is shown that bounded regret is achievable through a greedy policy. In contrast, our work makes no assumptions on the nature of the functions μ_k(·). In fact, when the reward functions are invertible, our proposed algorithm also achieves bounded regret. Hence, our formulation covers the global bandits setting.
Regional Bandits. The regional bandits model considers several common unknown parameters, with the mean reward function of each arm depending on exactly one of them. These mean reward functions are assumed to be invertible and Hölder-continuous in the corresponding parameter. The regional bandit setting is captured in our formulation by taking θ to be the vector of all these parameters. In fact, our problem setup allows the mean reward of an arm to depend on any combination of these parameters, and the mean reward functions need not be invertible.
Structured bandits with linear functions. Prior work also considers a model in which the rewards of all arms depend on a common parameter, under the assumption that the mean reward functions μ_k(θ) are linear in θ. Under this assumption, a greedy policy achieves bounded regret. Our formulation places no such restriction on the reward functions, and thus is more general. In the specific case where the reward functions are linear, our proposed algorithm also achieves bounded regret.
Finite-armed generalized linear bandits. Under the finite-armed linear bandit setting, the mean reward of arm k is μ_k(θ) = x_k^T θ, where θ is the shared unknown parameter and x_k is a known feature vector. Similarly, when μ_k(θ) = g(x_k^T θ) for some known link function g, this becomes the generalized linear bandit setting. One can easily see that our setting captures both of these models.
Minimal Exploration in Structured Bandits. One line of work considers a problem formulation that is more general than the setting described in this paper. However, its focus is on obtaining asymptotically optimal regret in the regimes where regret scales as O(log T). When all arms are non-competitive, the solution to the optimization problem described in [16, Theorem 1] becomes zero, causing the algorithm to get stuck in the exploitation phase and not perform properly in such settings. Also, that work assumes knowledge of the shape of the reward distribution (for example, Gaussian with unknown mean), whereas we only assume that the rewards are sub-Gaussian. Moreover, it assumes the mean reward functions are continuous, while we make no such assumption.
Finite-armed structured bandits. The work closest to ours considers the same model that we consider and proposes the UCB-S algorithm, a UCB-style algorithm for this setting. We take a different approach to this problem and propose a novel algorithm that separates implicit and explicit exploration, which allows us to extend our UCB-style algorithm to other classical bandit algorithms such as Thompson sampling. A Thompson-sampling-style algorithm was not proposed in that work. Through simulations, we compare our proposed algorithms against the UCB-S algorithm.
2.3 Intuitions for developing an algorithm
Classic multi-armed bandit algorithms such as UCB and Thompson sampling rely on explicit exploration of empirically sub-optimal arms to learn the optimal action. In our framework, since the mean rewards of all arms depend on a common parameter, obtaining an estimate of θ from the samples observed till slot t can give us some information on the mean rewards of all arms. This additional knowledge can then be used to reduce the exploration needed when designing bandit algorithms. Identifying sub-optimal arms through this estimate of θ can be thought of as implicit exploration.
Consider the example shown in Figure 2. In this case, the true parameter θ* is equal to 3, and for this value of θ the mean reward of Arm 2 exceeds those of Arms 1 and 3. Thus, the optimal arm in this setup is Arm 2. Assume now that the player has obtained a large number of samples of Arm 2 at a given time step. Based on the samples observed from Arm 2, the player has an empirical estimate of its mean reward.
Using this empirical estimate, the player can construct a region in which the true mean reward of Arm 2 lies with high probability. Figure 3 illustrates such a region in shaded pink. This region can then be used to identify the set of θ values within which the true parameter lies with high probability; Figure 3 highlights this set. Upon identifying this set, we can now see that if θ* indeed lies in this set, then Arm 3 cannot be optimal, as it is inferior to Arm 2 for all θ in the set. However, Arm 1 may still be better than Arm 2, as it has a higher mean reward than Arm 2 for some θ in the set. This provides an example where we implicitly explore Arm 3 without pulling it. As Arm 3 cannot be optimal for any θ in this set, we refer to it as non-competitive with respect to the set. On the other hand, we call Arms 1 and 2 competitive with respect to the set, as each is optimal for at least one θ in it.
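The implicit-exploration argument above can be made concrete with a small numerical sketch. The reward functions and the confidence-interval width below are hypothetical stand-ins for the (unspecified) functions of Figure 2, chosen only so that Arm 2 is optimal at θ* = 3:

```python
import numpy as np

# Hypothetical mean-reward functions (illustrative only; not the
# actual functions from the paper's Figure 2).
mu = [lambda th: 5.2 - th,    # Arm 1
      lambda th: th,          # Arm 2
      lambda th: 1 + th / 4]  # Arm 3

# Suppose many samples of Arm 2 give empirical mean ~3.0 and a
# high-probability confidence interval of half-width 0.5 around it.
emp_mean_2, width = 3.0, 0.5

# High-confidence set of theta values: those whose predicted mean
# for Arm 2 falls inside the confidence interval.
grid = np.linspace(0, 6, 601)
theta_hat = grid[np.abs(mu[1](grid) - emp_mean_2) <= width]

# An arm is competitive w.r.t. this set if it is optimal
# for at least one theta in the set.
rewards = np.array([[f(th) for f in mu] for th in theta_hat])
competitive = set(rewards.argmax(axis=1))  # 0-indexed arm labels

print(sorted(a + 1 for a in competitive))  # -> [1, 2]
```

Arm 3 never attains the maximum for any θ in the set, so it is identified as non-competitive without ever being pulled, exactly as in the discussion above.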
We formalize this idea of identifying non-competitive arms into an online algorithm that performs both implicit and explicit exploration. The proposed algorithm, presented in the next section, successfully reduces a K-armed bandit problem to a C-armed bandit problem, where C ≤ K is the number of competitive arms, defined formally in Section 3. More interestingly, this algorithm can lead to bounded (i.e., not scaling with T) regret in certain regimes, as we show in Section 4.
3 Proposed Algorithms: UCB-C and TS-C
Classical bandit algorithms such as Thompson sampling and Upper Confidence Bound (UCB) are often termed index-based policies. At every time instant, these policies maintain an index for each arm and select the arm with the highest index in the next time slot. More specifically, at each round t, UCB selects the arm k_{t+1} = arg max_k ( μ̂_k(t) + sqrt(2σ² log t / n_k(t)) ),
where μ̂_k(t) is the empirical mean of arm k computed from the samples observed till round t, and n_k(t) is the number of times arm k has been pulled. Under Thompson sampling, we select at time step t the arm with the highest posterior sample S_k(t), where S_k(t) is drawn from the posterior distribution of the mean reward of arm k.
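As a point of reference for the two index rules just described, here is a minimal sketch of both. The exploration constant and the Gaussian posterior used for the Thompson sample are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb_index(emp_mean, pulls, t, sigma=1.0):
    # UCB index: empirical mean plus an exploration bonus that
    # shrinks as the arm accumulates pulls.
    return emp_mean + np.sqrt(2 * sigma**2 * np.log(t) / pulls)

def ts_index(emp_mean, pulls, sigma=1.0):
    # Thompson sample from a Gaussian posterior over the arm's mean
    # (a common modeling assumption; other posteriors work too).
    return rng.normal(emp_mean, sigma / np.sqrt(pulls))

emp_means = np.array([0.4, 0.6])
pulls = np.array([10, 5])
print(ucb_index(emp_means, pulls, t=15))  # the arm with the highest index is pulled next
```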
Since the mean rewards are correlated through the hidden parameter θ in the structured bandit model, obtaining an estimate of θ can help identify the optimal arm. In our approach, we identify a subset of arms, called the competitive arms, through the estimate of θ, and then perform UCB or TS over that set of arms. We now define the notions of competitive and non-competitive arms, which are a key component in the design of the UCB-C and TS-C algorithms.
3.1 Competitive and Non-Competitive Arms
From the samples observed till time step t, one can construct a confidence set C(t) ⊆ Θ. The set C(t) represents the set of values in which the true parameter θ* lies with high confidence, based on the rewards observed until time t. Next, we define the notions of C(t)-Competitive and C(t)-Non-competitive arms.
Definition 1 (C(t)-Competitive arm).
An arm k is said to be C(t)-Competitive if k = arg max_j μ_j(θ) for some θ ∈ C(t).
Intuitively, an arm is C(t)-Competitive if it is optimal for some θ in the confidence set C(t). Similarly, we define a C(t)-Non-competitive arm as follows.
Definition 2 (C(t)-Non-competitive arm).
An arm k is said to be C(t)-Non-competitive if μ_k(θ) < max_j μ_j(θ) for all θ ∈ C(t).
Intuitively, if an arm is C(t)-Non-competitive, then it cannot be optimal when the true parameter lies inside the confidence set C(t). This allows us to identify a C(t)-Non-competitive arm as sub-optimal under the assumption that the true parameter is in the set C(t).
We now introduce the notion of an ε-non-competitive arm.
Definition 3 (ε-non-competitive arm).
We call an arm ε-non-competitive if it is Θ_ε-Non-competitive, where Θ_ε = {θ ∈ Θ : |μ_{k*}(θ) − μ_{k*}(θ*)| ≤ ε}.
Informally, this means that an ε-non-competitive arm is non-competitive with respect to the set of θ values that do not change the mean reward of the optimal arm by more than ε.
Throughout, we say that an arm is competitive if there is no ε > 0 for which it is ε-non-competitive. The set of all competitive arms is denoted by 𝒞 and the number of competitive arms by C = |𝒞|.
Let Θ_0 be defined as the set {θ ∈ Θ : μ_{k*}(θ) = μ_{k*}(θ*)}. We can view Θ_0 as the confidence set obtained after the optimal arm has been sampled infinitely many times. This definition ensures that if an arm is Θ_0-Competitive, then it is competitive.
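The set of parameter values that leave the optimal arm's mean reward unchanged (denoted Θ_0 here) can be illustrated numerically: when the optimal arm's reward function is invertible it is a singleton, while a non-invertible function can make it larger. The functions below are hypothetical examples, not taken from the paper:

```python
import numpy as np

grid = np.round(np.linspace(-2, 2, 401), 10)  # grid standing in for the set of theta values
theta_star = 1.0

# Invertible reward function for the optimal arm: mu(theta) = theta.
# The matching set contains a single theta value.
mu_inv = grid
theta0_inv = grid[np.isclose(mu_inv, theta_star)]

# Non-invertible function: mu(theta) = theta^2 matches at theta = +1 and -1,
# so the matching set contains two theta values.
mu_sq = grid**2
theta0_sq = grid[np.isclose(mu_sq, theta_star**2)]

print(len(theta0_inv), len(theta0_sq))  # -> 1 2
```

In the singleton case every sub-optimal arm is eventually ruled out, which is the mechanism behind the bounded-regret regime discussed in Section 4.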
3.2 Components of Our Algorithm
Motivated by the above discussion, we propose the following algorithm. At each step t, we:
Construct a confidence set C(t) from the samples observed till time step t.
Identify the C(t)-Non-competitive arms.
Play a bandit algorithm (say, UCB or Thompson sampling) among the arms that are C(t)-Competitive, and choose the next arm accordingly.
The formal description of this algorithm, with UCB and Thompson sampling as the final step, is given in Algorithm 1 and Algorithm 2, respectively. Below, we explain the three key components of these algorithms.
Constructing a confidence set C(t). From the samples observed till time step t, we identify the arm that has been selected the maximum number of times so far, denoted k_max(t). We define the confidence set as C(t) = {θ ∈ Θ : |μ_{k_max(t)}(θ) − μ̂_{k_max(t)}(t)| ≤ B(t)}, where B(t) is a confidence width that shrinks with the number of pulls of arm k_max(t).
Here, μ̂_{k_max(t)}(t) is the empirical mean of the rewards obtained from the samples of arm k_max(t). We construct the confidence set from the samples of k_max(t) because its empirical estimate has the smallest variance among all arms. In our regret analysis, we show that using samples of just this one arm suffices to achieve the desired dimension-reduction and bounded-regret properties. In Section 5, we present and discuss the UCB-int algorithm, which constructs the confidence set using samples of all arms.
Identifying C(t)-Non-competitive arms. At each time step t, we define the set of C(t)-Competitive arms, which includes every arm that is optimal for some θ ∈ C(t). The remaining arms, termed C(t)-Non-competitive, are eliminated for round t and are not considered in the next part of the algorithm.
Play a bandit algorithm among the C(t)-Competitive arms. After identifying the C(t)-Competitive arms, we use a classical bandit algorithm such as UCB or Thompson sampling to decide which arm to play at time step t+1. For example, in the case of UCB-C, the next arm is the C(t)-Competitive arm with the largest UCB index.
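Putting the three components together, the following is a self-contained sketch of UCB-C on a finite grid of θ values. The confidence-width constant, the initialization, and the Gaussian reward noise are illustrative choices, not the paper's exact specification:

```python
import numpy as np

def ucb_c(mu_funcs, theta_grid, theta_star, T, sigma=1.0, seed=0):
    """Sketch of UCB-C: implicit elimination via a confidence set on
    theta, then UCB over the remaining (competitive) arms."""
    rng = np.random.default_rng(seed)
    K = len(mu_funcs)
    true_means = np.array([f(theta_star) for f in mu_funcs])
    table = np.array([[f(th) for th in theta_grid] for f in mu_funcs])  # K x |grid|
    pulls = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:                         # pull each arm once to initialize
            k = t - 1
        else:
            emp = sums / pulls
            k_max = int(np.argmax(pulls))  # most-pulled arm so far
            width = np.sqrt(2 * sigma**2 * np.log(t) / pulls[k_max])
            in_set = np.abs(table[k_max] - emp[k_max]) <= width   # confidence set
            if in_set.any():
                # Competitive arms: optimal for at least one theta in the set.
                competitive = np.unique(table[:, in_set].argmax(axis=0))
            else:
                competitive = np.arange(K)
            ucb = emp + np.sqrt(2 * sigma**2 * np.log(t) / pulls)
            k = int(competitive[np.argmax(ucb[competitive])])
        reward = true_means[k] + sigma * rng.standard_normal()
        pulls[k] += 1
        sums[k] += reward
        regret += true_means.max() - true_means[k]
    return regret, pulls
```

With an invertible reward function for the optimal arm, the confidence set quickly shrinks around θ* and the sub-optimal arm is eliminated on most rounds, so its pull count stays far below that of the optimal arm.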
It is important to note that the last step of our algorithm can utilize any one of the classical bandit algorithms. This allows us to easily define a Thompson sampling algorithm, which has attracted great attention [17, 18, 19, 20, 21], for the structured bandit problem considered in this paper. The ability to employ any bandit algorithm in the last step is an important advantage of our approach. For instance, such an extension to Thompson sampling was deemed infeasible for the previously proposed UCB-S algorithm.
The idea of eliminating non-competitive arms was initially proposed in earlier work studying multi-armed bandits with a latent random source. However, given the different nature of the problem studied there, an entirely different definition of arm competitiveness was used.
4 Regret Analysis and Bounds
In this section, we analyze the performance of the UCB-C algorithm through a finite-time analysis of the cumulative expected regret, defined as E[Reg(T)] = Σ_k Δ_k E[n_k(T)].
Here, Δ_k = μ_{k*}(θ*) − μ_k(θ*), and n_k(T) is the number of times arm k is pulled in a total of T time steps.
To analyze the expected regret of the proposed algorithm, we need to bound E[n_k(T)] for each sub-optimal arm k. Our first result shows that the expected number of pulls of any sub-optimal arm is O(log T).
Theorem 1 (Expected pulls for any arm).
The expected number of times any sub-optimal arm k is pulled by the UCB-C algorithm is upper bounded as E[n_k(T)] = O(log T).
Our next result shows that the expected number of pulls of an ε-non-competitive arm is bounded, i.e., O(1).
Theorem 2 (Expected pulls of Non-competitive Arms).
If an arm k is ε-non-competitive, then the expected number of times it is pulled by UCB-C is upper bounded as E[n_k(T)] = O(1).
Theorem 3 (Regret upper bound). Combining Theorems 1 and 2, the expected regret of UCB-C is upper bounded as E[Reg(T)] = O(C log T): the competitive sub-optimal arms contribute O(log T) each, while the non-competitive arms contribute only O(1).
Dimension Reduction. The classic UCB algorithm, which is agnostic to the structure of the problem, pulls each of the K − 1 sub-optimal arms O(log T) times. In contrast, our algorithm pulls only C − 1 sub-optimal arms O(log T) times, where C ≤ K. In fact, when C = 1, all sub-optimal arms are pulled only O(1) times, leading to bounded regret. Such cases can arise quite often in practical settings. For example, when the reward function of the optimal arm is invertible around θ*, the set Θ_0 becomes a singleton; i.e., there is just a single θ for which μ_{k*}(θ) = μ_{k*}(θ*). In that case, all sub-optimal arms become non-competitive, and our UCB-C algorithm achieves bounded (i.e., O(1)) regret.
We now show that the UCB-C algorithm achieves bounded regret whenever possible. We do so by analyzing a lower bound obtained in prior work.
Proposition 1 (Lower bound).
For any uniformly good algorithm and for any θ ∈ Θ, the expected regret is lower bounded by a θ-dependent constant times log T, and this constant is strictly positive whenever C > 1.
An algorithm is uniformly good if its regret satisfies E[Reg(T)] = o(T^a) for every a > 0 and every θ ∈ Θ.
The proof of this proposition, given in the Appendix, follows from a lower bound derived in prior work. This lower bound leads us to the following observation.
Remark 1 (Bounded regret whenever possible).
The lower bound in Proposition 1 shows that sub-logarithmic regret is possible only when C = 1. In that case, our proposed algorithm achieves bounded regret (see Theorem 3). This implies that the UCB-C algorithm achieves bounded regret whenever possible and provides dimensionality reduction in the cases where regret is logarithmic.
5 Variants of UCB-C
Recall that the design of the UCB-C and TS-C algorithms involved the construction of the confidence set C(t). This confidence set was constructed by selecting all θ for which μ_{k_max(t)}(θ) lies inside a ball around the empirical mean μ̂_{k_max(t)}(t). In this section, we discuss two other methods of constructing the confidence set, showing that our idea of separating implicit and explicit exploration can easily be extended to design new algorithms. These extensions can lead to lower empirical regret than the UCB-C algorithm in some cases. We now describe the two algorithms and evaluate their regret bounds.
5.1 The UCB-int Algorithm
The UCB-C algorithm constructs the confidence set using just the samples of arm k_max(t). We now present the UCB-int algorithm, which differs from UCB-C in this aspect. At an additional computational cost, it constructs the confidence set using samples of all the arms pulled so far. More specifically, UCB-int constructs C(t) as the intersection of the per-arm confidence sets, keeping only those θ for which every arm's predicted mean μ_k(θ) is close to its empirical mean μ̂_k(t).
Similar to UCB-C, the UCB-int algorithm works for any functions μ_k(·) and any set Θ. UCB-int also has performance guarantees similar to those of UCB-C: non-competitive arms are pulled only O(1) times. It also enjoys the same property of achieving bounded regret whenever possible. We present the exact regret bound of UCB-int in the Appendix. Since samples of all arms are used in constructing the confidence set, the confidence set of UCB-int is smaller than that of UCB-C. Hence, UCB-int is more aggressive in removing non-competitive arms and can obtain better empirical performance than the UCB-C algorithm.
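The intersection-based construction can be sketched as follows on a finite grid of θ values; the confidence-width form is an assumed illustration, not the paper's exact choice:

```python
import numpy as np

def conf_set_int(table, emp_means, pulls, t, sigma=1.0):
    """UCB-int style confidence set (sketch): intersect the per-arm sets
    {theta : |mu_k(theta) - emp_mean_k| <= width_k}. `table` is K x |grid|
    with table[k, j] = mu_k(theta_j)."""
    widths = np.sqrt(2 * sigma**2 * np.log(t) / pulls)  # one width per arm
    in_set = np.all(np.abs(table - emp_means[:, None]) <= widths[:, None], axis=0)
    return in_set  # boolean mask over the theta grid

# Tiny example: two arms on a grid of three theta values.
table = np.array([[0.0, 1.0, 2.0],    # mu_1 over the grid
                  [2.0, 1.0, 0.0]])   # mu_2 over the grid
mask = conf_set_int(table, np.array([1.0, 1.0]), np.array([50, 50]), t=100)
print(mask)  # only theta values consistent with BOTH arms survive
```

Because every arm contributes a constraint, the surviving set is never larger than the one built from k_max(t) alone, which is the source of the more aggressive elimination described above.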
5.2 The UCB-min Algorithm
The UCB-C algorithm was designed to accommodate all types of functions μ_k(·) and all sets Θ. In this section, we design a more aggressive algorithm, named UCB-min, for a class of structured bandit settings. The problem settings for which UCB-min is designed include all cases where Θ is a countable set.
The UCB-min algorithm has the same three components as UCB-C. It differs from UCB-C only in the first component, i.e., the construction of the confidence set. Instead of considering a ball around the empirical mean μ̂_{k_max(t)}(t), we construct C(t) directly by selecting the θ ∈ Θ whose predicted mean μ_{k_max(t)}(θ) is closest to the empirical mean.
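A minimal sketch of this closest-value selection, on a finite grid standing in for the countable set Θ:

```python
import numpy as np

def conf_set_min(table, k_max, emp_mean_kmax):
    """UCB-min style selection (sketch): keep the theta value(s) whose
    predicted mean for the most-pulled arm is closest to its empirical
    mean, instead of every theta inside a confidence ball."""
    dists = np.abs(table[k_max] - emp_mean_kmax)
    return np.flatnonzero(dists == dists.min())  # indices into the theta grid

# Tiny example: two arms on a grid of three theta values.
table = np.array([[0.0, 1.0, 2.0],    # mu_1 over the grid
                  [2.0, 1.0, 0.0]])   # mu_2 over the grid
idx = conf_set_min(table, k_max=0, emp_mean_kmax=1.2)
print(idx)  # grid index of the closest theta value
```

Since the selected set is typically a single θ value, UCB-min eliminates arms more aggressively than the ball-based construction, at the cost of requiring Assumption 1 below.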
The UCB-min algorithm is designed for structured bandit settings that satisfy the following assumption.
Assumption 1. There exists ε > 0 such that for all sub-optimal arms k, the optimal arm k* remains optimal for every θ whose distance from θ* is at most ε.
Informally, this means that the arm optimal at θ* remains Competitive as long as the selected θ lies within a distance ε of θ*. Assumption 1 is always satisfied when Θ is a countable set. Under this assumption, UCB-min enjoys regret guarantees similar to those of UCB-C: we have E[Reg(T)] = O(C log T), with C denoting the number of competitive arms. As was the case with UCB-C, it also achieves bounded regret whenever possible. The regret upper bound for UCB-min, which depends on ε, is given in the Appendix.
6 Simulation Results
We now study the empirical performance of the proposed algorithms. Rewards are drawn from a sub-Gaussian distribution, and in each result we average the regret over multiple independent experiments. We first show how UCB-C is able to achieve dimension reduction.
Dimension Reduction. In Figure 4, we compare the regret of the UCB-C algorithm with classic UCB. In the first setting, Arm 2 is optimal and Arms 1 and 3 are non-competitive. As expected from our regret analysis, in Figure 5(a) the UCB-C and UCB-int algorithms achieve bounded regret, while the regret of UCB grows logarithmically in the number of time steps T. In the second setting, Arm 3 is optimal, Arm 2 is competitive, and Arm 1 is non-competitive. In this case, we expect UCB-C and UCB-int to pull Arm 1 only O(1) times, which is why we see significantly reduced regret for UCB-C and UCB-int as compared to UCB in Figure 5(b). Figure 5(c) shows a setting in which Arm 1 is optimal and all arms are competitive. Since UCB-C performs UCB over the set of C(t)-Competitive arms at each round, its performance is similar to that of UCB in this case. The UCB-int algorithm uses samples from all arms to generate the confidence set C(t), which helps it achieve empirically smaller regret in this setting.
Comparison with UCB-S. We now compare the performance of our UCB-C and TS-C algorithms against UCB and the previously proposed UCB-S algorithm. We consider the example shown in Figure 7. For one range of the parameter θ*, Arm 2 is optimal and Arm 1 is non-competitive; for the complementary range, Arm 1 is the optimal arm while Arm 2 is competitive. We plot the cumulative regret of the UCB, UCB-S, UCB-C, and TS-C algorithms over a range of values of θ* in Figure 8. When θ* lies in the first range, UCB-S, UCB-C, and TS-C all obtain less regret than UCB, as they are able to identify the sub-optimal arm as non-competitive. In the second range, UCB-C performs similarly to UCB, as the sub-optimal arm is competitive.
We also see that, for some values of θ*, UCB-S achieves regret that is large even compared to UCB. This is caused by the way UCB-S constructs its index, which makes the algorithm prefer certain arms over others; this explains its large regret for some parameter values and relatively smaller regret for others. We notice that TS-C achieves significantly less regret than the other algorithms, as Thompson sampling can offer significant empirical improvement over UCB. This highlights the benefit of being able to incorporate Thompson sampling into our algorithm, an extension that was not possible for UCB-S.
Performance of UCB-min. We now compare the performance of the UCB-min algorithm against the UCB-C and UCB algorithms for the example considered in Figure 7. In this example, Θ is a countable set, which allows us to use the UCB-min algorithm. The aggressive nature of UCB-min in selecting C(t) helps it achieve smaller bounded regret than UCB-C, as shown in Figure 9.
7 Concluding Remarks
In this work, we studied a correlated bandit problem in which the rewards of different arms are correlated through a common shared parameter. By using reward samples of one arm, we were able to generate estimates of the mean rewards of other arms. This approach allowed us to identify some sub-optimal arms without having to explore them explicitly. The finite-time regret analysis of the proposed UCB-C algorithm reveals that it reduces a K-armed bandit problem to a C-armed bandit problem. In addition, we showed that UCB-C achieves bounded regret whenever possible. Ongoing work includes the finite-time regret analysis of the TS-C algorithm. An interesting future direction is to study this problem when the number of arms is large. We also plan to study the best-arm identification version of this problem.
Appendix A Lower bound
Theorem 4 (Lower bound; restated from prior work).
For any uniformly good algorithm and for any θ ∈ Θ, the expected regret satisfies lim inf_{T→∞} E[Reg(T)]/log T ≥ c(θ),
where the constant c(θ) is the solution of the following optimization problem:
Here, KL(·‖·) denotes the Kullback-Leibler divergence between two reward distributions. An algorithm is uniformly good if its regret satisfies E[Reg(T)] = o(T^a) for every a > 0 and every θ ∈ Θ.
We see that the solution c(θ) to the optimization problem (7) is zero only when the corresponding constraint set is empty. The constraint set being empty corresponds to the case where all sub-optimal arms are non-competitive. This implies that sub-logarithmic regret is possible only when C = 1, i.e., there is only one competitive arm (the optimal arm) and all other arms are non-competitive.
Appendix B Results for UCB-int and UCB-min
B.1 Regret bounds for UCB-int
We state the regret bounds for the UCB-int algorithm here.
Theorem 5 (Expected pulls for any arm).
The expected number of times any sub-optimal arm is pulled by the UCB-int algorithm is upper bounded as E[n_k(T)] = O(log T).
Theorem 6 (Expected pulls for non-competitive arms).
The expected number of times a non-competitive arm is pulled by the UCB-int algorithm is upper bounded as E[n_k(T)] = O(1).
Combining the above two results gives the following regret bound for the UCB-int algorithm.
Theorem 7 (Regret upper bound).
Observe that this regret bound is similar to that of the UCB-C algorithm and enjoys the same properties of achieving bounded regret whenever possible and performing dimension reduction.
B.2 Regret bounds for UCB-min
We now present the performance guarantees of the UCB-min algorithm, under Assumption 1 and with ε defined as above. We have the following results on the expected pulls of arms.
Theorem 8 (Expected pulls for any arm).
The expected number of times any sub-optimal arm is pulled by the UCB-min algorithm is upper bounded as E[n_k(T)] = O(log T).
Theorem 9 (Expected pulls for ε-non-competitive arms).
The expected number of times an ε-non-competitive arm is pulled by the UCB-min algorithm is upper bounded as E[n_k(T)] = O(1).
Combining the above two results yields the following regret bound for the UCB-min algorithm.
Theorem 10 (Regret upper bound).
Observe that this regret bound is similar to that of the UCB-C algorithm and enjoys the same properties of achieving bounded regret whenever possible and performing dimension reduction.
Appendix C Proof for the UCB-C Algorithm
Fact 1 (Hoeffding’s inequality).
Let X_1, …, X_n be i.i.d. random variables, where each X_i is σ-sub-Gaussian with mean μ. Then, for any ε > 0, Pr(|X̄_n − μ| ≥ ε) ≤ 2 exp(−nε²/(2σ²)).
Here, X̄_n = (1/n) Σ_{i=1}^n X_i is the empirical mean of the X_i.
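The bound in Fact 1 can be sanity-checked numerically for Gaussian samples (which are σ-sub-Gaussian); the parameters below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, eps, trials = 25, 1.0, 0.6, 20000

# Empirical frequency of the deviation event for Gaussian samples,
# compared against the sub-Gaussian Hoeffding bound.
samples = rng.normal(0.0, sigma, size=(trials, n))
dev_freq = np.mean(np.abs(samples.mean(axis=1)) >= eps)
bound = 2 * np.exp(-n * eps**2 / (2 * sigma**2))

print(dev_freq <= bound)  # True: the bound holds (with slack)
```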
The probability that the true mean lies outside the confidence set decays with the total number of pulls t as follows:
We have (19) from the union bound; this is a standard trick to handle the random variable n_{k_max(t)}(t), since it can take values from 1 to t. We use this trick repeatedly in the paper whenever we encounter such expressions. The true mean of arm k_max(t) is μ_{k_max(t)}(θ*). Therefore, if μ̂ denotes the empirical mean of arm k_max(t) taken over its pulls, then (20) follows from Fact 1, with the ε in Fact 1 equal to the width of the confidence set. ∎
Define E_k(t) to be the event that arm k is C(t)-Non-competitive in round t; then,
If for some constant then,
where k is a sub-optimal arm.
The probability that arm k is pulled at step t, given that it has been pulled s times, can be bounded as follows: