Training a Single Bandit Arm
Eren Ozbay
Department of Information and Decision Sciences
University of Illinois at Chicago
eozbay3@uic.edu
Vijay Kamble
Department of Information and Decision Sciences
University of Illinois at Chicago
kamble@uic.edu
The stochastic multi-armed bandit problem captures the fundamental exploration vs. exploitation tradeoff inherent in online decision-making in uncertain settings. However, in several applications, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to train novice workers with unknown trainability in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from $T$ pulls, we consider the vector of cumulative rewards earned from each of the arms at the end of $T$ pulls, and aim to maximize the expected value of the highest cumulative reward across the arms. This corresponds to the objective of grooming a single, highly skilled worker using a limited supply of training jobs.
For this new objective, we show that any policy must incur a regret of in the worst case. We design an explore-then-commit policy featuring exploration based on finely tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and guarantees a regret of in the worst case. Our numerical experiments demonstrate that this policy improves upon several natural candidate policies for this setting.
1 Introduction
The stochastic multi-armed bandit (MAB) problem (Lai and Robbins 1985, Auer et al. 2002) presents a basic formal framework to study the exploration vs. exploitation tradeoff fundamental to online decision-making in uncertain settings. Given a set of arms, each of which yields independent and identically distributed (i.i.d.) rewards over successive pulls, the goal is to adaptively choose a sequence of arms to maximize the expected value of the total reward attained at the end of $T$ pulls. The critical assumption here is that the reward distributions of the different arms are a priori unknown. Any good policy must hence, over time, optimize the tradeoff between choosing arms that are known to yield high rewards (exploitation) and choosing arms whose reward distributions are yet relatively unknown (exploration). Over several years of extensive theoretical and algorithmic analysis, this classical problem is now quite well understood (see Lattimore and Szepesvári (2018), Slivkins (2019), Bubeck and Cesa-Bianchi (2012) for surveys).
In this paper, we revisit this classical setup; however, we address a new objective. We consider the vector of cumulative rewards that have been earned from the different arms at the end of $T$ pulls, and instead of maximizing the expectation of its sum, we aim to maximize the expected value of the maximum of these cumulative rewards across the arms. This problem is motivated by several practical settings, as we discuss below.

Training workers in online labor platforms. Online labor platforms seek to develop and maintain a reliable pool of high-quality workers in steady state to satisfy the demand for jobs. This is a challenging problem since, a) workers continuously leave the platform and hence new talent must be tended and trained, and b) the number of “training” jobs available to the incoming talent is limited (this could, for instance, be because of a limit on the budget for the discounts offered to the clients for choosing novice workers). At the core of this challenging operational question is the following problem. Given the limited availability of training jobs, the platform must determine a policy to allocate these jobs to a set of novice workers to maximize some appropriate functional of the distribution of their terminal skill levels. For a platform that seeks to offer robust service guarantees to its clients, simply maximizing the sum of the terminal skill levels across all workers may not be appropriate, and a more natural functional to maximize is the percentile skill level amongst the workers ordered by their terminal skills, where is determined by the volume of demand for regular jobs.
To address this problem, we can use the MAB framework: the set of arms is the set of novice workers, the reward of an arm is the random increment in the skill level of the worker after allocation of a job, and the number of training jobs available is . Assuming the number of training jobs available per worker is not too large, the random increments may be assumed to be i.i.d. over time. The mean of these increments can be interpreted as the unknown learning rate or the “trainability” of a worker. Given workers, the goal is to adaptively allocate the jobs to these workers to maximize the smallest terminal skill level amongst the top most terminally skilled workers (where ). Our objective corresponds to the case where , and is a step towards solving this general problem.

Grooming an “attractor” product on e-commerce platforms. E-commerce platforms typically feature very similar substitutes within a product category. For instance, consider a product like a tablet cover (e.g., for an iPad). Once the utility of a new product of this type becomes established (e.g., the size specifications of a new version of the iPad become available), several brands offering close to identical products serving the same purpose proliferate in the marketplace. This proliferation is problematic to the platform for two reasons: a) customers are inundated by choices and may unnecessarily delay their purchase decision, thereby increasing the possibility of leaving the platform altogether (Settle and Golden 1974, Gourville and Soman 2005), and b) the heterogeneity in the purchase behavior resulting from the lack of a clear choice may complicate the problem of effectively managing inventory and delivery logistics. Given a budget for incentivizing customers to pick different products in the early exploratory phase where the qualities of the different products are being discovered, a natural objective for the platform is to “groom” a product to have the highest volume of positive ratings at the end of this phase. This product then becomes a clear choice for the customers. Our objective effectively captures this goal.

Training for external competitions. The objective we consider is also relevant to the problem of developing advanced talent within a region for participation in external competitions like Science Olympiads, the Olympic games, etc., with limited training resources. In these settings, only the terminal skill levels of those finally chosen to represent the region matter. The resources spent on others, despite resulting in skill advancement, are effectively wasteful. This feature is not captured by the “sum” objective, while it is effectively captured by the “max” objective, particularly in situations where one individual will finally be chosen to represent the region.
A standard approach in MAB problems is to design a policy that minimizes regret, i.e., the quantity of loss relative to the optimal decision for a given objective over time. In the classical setting with the “sum” objective, it is well known that any policy must incur a regret of in the worst case over the set of possible bandit instances (Auer et al. 2002). A key feature of our new objective is that the rewards earned from arms that do not eventually turn out to be the one yielding the highest cumulative reward are effectively a waste. Owing to this, we show that in our case, a regret of is inevitable (Theorem 3).
For the traditional objective, well-performing policies are typically based on the principle of optimism in the face of uncertainty. A popular policy class is the Upper Confidence Bound (UCB) class of policies (Agrawal 1995, Auer et al. 2002, Auer and Ortner 2010), in which a confidence interval is maintained for the mean reward of each arm and at each time, the arm with the highest upper confidence bound is chosen. For a standard tuning of these intervals, this policy, termed UCB1 in the literature due to Auer et al. (2002), guarantees a regret of in the worst case. With a more refined tuning, can be achieved (Audibert and Bubeck 2009, Lattimore 2018).
For our objective, directly using one of the above UCB policies can prove to be disastrous. To see this, suppose that all arms have an identical distribution for their rewards with bounded support. Then a UCB policy will continue to switch between the arms throughout the pulls, resulting in the highest terminal cumulative reward of ; whereas, a reward of is feasible by simply committing to an arbitrary arm from the start. Hence, the regret is in the worst case.
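The switching behavior described above can be observed in a small simulation. The following is a minimal sketch, assuming Bernoulli rewards and the standard UCB1 index (empirical mean plus $\sqrt{2\ln t / n}$) of Auer et al. (2002); the function name and the values of `K`, `T`, and `p` are illustrative choices, not taken from the paper.

```python
import math
import random

def ucb1_on_identical_arms(K, T, p, seed=0):
    """Run UCB1 for T pulls on K identical Bernoulli(p) arms.

    Returns the number of pulls and the cumulative reward of each arm.
    """
    rng = random.Random(seed)
    pulls = [0] * K
    totals = [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # initialization: pull each arm once
        else:
            # Standard UCB1 index: empirical mean + sqrt(2 ln t / n).
            arm = max(
                range(K),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < p else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls, totals

# Illustrative (hypothetical) parameters.
K, T, p = 4, 4000, 0.5
pulls, totals = ucb1_on_identical_arms(K, T, p)
print("pulls per arm:", pulls)
print("highest cumulative reward under UCB1:", max(totals))
print("expected reward from committing to one arm:", T * p)
```

Because the arms are indistinguishable, the pulls end up spread across all of them, so the highest cumulative reward falls well short of what committing to any single arm would earn in expectation.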
This observation suggests that any good policy must, at some point, stop exploring and permanently commit to a single arm. A natural candidate is the basic explore-then-commit (ETC) strategy, which uniformly explores all arms until some time that is fixed in advance, and then commits to the empirically best arm (Lattimore and Szepesvári 2018, Slivkins 2019). When each arm is chosen times in the exploration phase, this strategy can be shown to achieve a regret of relative to the traditional objective (Slivkins 2019). It is easy to argue that it achieves the same regret relative to our “max” objective. However, this policy is excessively optimized for the worst case where the means of all the arms are within of each other. When the arms are easier to distinguish, this policy’s performance is quite poor due to excessive exploration. For example, consider a two-armed bandit problem with Bernoulli rewards and means , where . For this fixed instance, ETC will pull both arms times and hence incur a regret of as (relative to our “max” objective). However, it is well known that UCB1 will not pull the suboptimal arm more than times with high probability (Auer et al. 2002) and hence for this instance, UCB1 will incur a regret of only . Thus, although the worst-case regret of UCB1 is due to perpetual exploration, for a fixed bandit instance, its asymptotic performance is significantly better than that of ETC. This observation motivates us to seek a practical policy with a graceful dependence of performance on the difficulty of the bandit instance, and which will achieve both: the worst-case bound of ETC and the instance-dependent asymptotic bound of .
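For concreteness, the vanilla ETC strategy discussed above can be sketched as follows. This is only an illustration under assumed Bernoulli rewards; the per-arm exploration budget `m` is left as a free, hypothetical parameter rather than the tuned value from the analysis, and the instance used below is made up.

```python
import random

def explore_then_commit(means, T, m, seed=0):
    """Vanilla ETC: pull every arm m times, then commit to the empirical best.

    means: Bernoulli means of the arms (unknown to the policy, used to simulate).
    Returns the committed arm, pulls per arm, and cumulative reward per arm.
    """
    K = len(means)
    if K * m > T:
        raise ValueError("exploration budget K * m exceeds the horizon T")
    rng = random.Random(seed)
    pulls = [0] * K
    totals = [0.0] * K

    def pull(i):
        totals[i] += 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1

    # Explore phase: uniform exploration, fixed in advance.
    for i in range(K):
        for _ in range(m):
            pull(i)

    # Commit phase: play the empirically best arm for the remaining pulls.
    best = max(range(K), key=lambda i: totals[i] / pulls[i])
    for _ in range(T - K * m):
        pull(best)
    return best, pulls, totals

# Hypothetical, easy-to-distinguish instance.
best, pulls, totals = explore_then_commit(means=[0.9, 0.1], T=1000, m=50)
print("committed arm:", best, "pulls:", pulls, "max cumulative:", max(totals))
```

On an easy instance like this one, the `m` pulls spent on the clearly inferior arm contribute nothing to the final "max" objective, which is the waste that an adaptive stopping rule aims to reduce.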
We propose a new policy with an explore-then-commit structure, in which appropriately defined confidence bounds on the means of the arms are utilized to guide exploration, as well as to decide when to stop exploring. We call this policy Adaptive Explore-then-Commit (ADA-ETC). Compared to the classical UCB1 way of defining the confidence intervals, our policy’s confidence bounds are finely tuned to eliminate wasteful exploration and encourage stopping early if appropriate. We derive rigorous instance-dependent as well as worst-case bounds on the regret guaranteed by this policy. Our bounds show that ADA-ETC adapts to the problem difficulty by exploring less if appropriate, while attaining the same regret guarantee of attained by vanilla ETC in the worst case (Theorem 4). In particular, ADA-ETC also guarantees an instance-dependent asymptotic regret of as . Finally, our numerical experiments demonstrate that ADA-ETC results in significant improvements over the performance of vanilla ETC in easier settings, while never performing worse in difficult ones, thus corroborating our theoretical results. Our numerical results also demonstrate that naive ways of introducing adaptive exploration based on upper confidence bounds, e.g., simply using the upper confidence bounds of UCB1, may lead to no improvement over vanilla ETC.
We finally note that buried in our objective is the goal of quickly identifying the arm with approximately the highest mean reward so that a substantial amount of time can be spent earning rewards from that arm (e.g., “training” a worker). This goal is related to the pure exploration problem in multi-armed bandits. Several variants of this problem have been studied, where the goal of the decision-maker is to either minimize the probability of misidentification of the optimal arm given a fixed budget of pulls (Audibert and Bubeck 2010, Carpentier and Locatelli 2016, Kaufmann et al. 2016); or minimize the expected number of pulls to attain a fixed probability of misidentification, possibly within an approximation error (Even-Dar et al. 2002, 2006, Mannor and Tsitsiklis 2004, Karnin et al. 2013, Jamieson et al. 2014, Vaidhiyan and Sundaresan 2017, Kaufmann et al. 2016); or to minimize the expected suboptimality (called “simple regret”) of a recommended arm after a fixed budget of pulls (Bubeck et al. 2009, 2011, Carpentier and Valko 2015). Extensions to settings where multiple good arms need to be identified have also been considered (Bubeck et al. 2013, Kalyanakrishnan et al. 2012, Zhou et al. 2014, Kaufmann and Kalyanakrishnan 2013). The critical difference from these approaches is that in our scenario, the budget of pulls must not only be spent on identifying an approximately optimal arm but also on earning rewards on that arm. Hence any choice of apportionment of the budget to the identification problem, or a choice for a target for the approximation error or probability of misidentification within a candidate policy, is a priori unclear and must arise endogenously from our primary objective.
2 Problem Setup
Consider the stochastic multi-armed bandit (MAB) problem parameterized by the number of arms, which we denote by $K$; the length of the decision-making horizon (the number of discrete times/stages), which we denote by $T$; and the probability distributions for arms , denoted by , respectively. To achieve meaningful results, we assume that the rewards are nonnegative and their distributions have a bounded support, assumed to be without loss of generality (although this latter assumption can be easily relaxed to allow, for instance, sub-Gaussian distributions with bounded ). We define to be the set of all tuples of distributions for the arms having support in . Let be the means of the distributions. Without loss of generality, we assume that for the remainder of the discussion. The distributions of the rewards from the arms are unknown to the decision-maker. We denote and .
At each time, the decision-maker chooses an arm to play and observes a reward. Let the arm played at time be denoted as and the reward be denoted as , where is drawn from the distribution , independent from the previous actions and observations. The history of actions and observations at any time is denoted as , and is defined to be the empty set . A policy of the decision-maker is a sequence of mappings , where maps every possible history to an arm to be played at time . Let denote the set of all such policies.
For an arm , we denote to be the number of times this arm is played until and including time , i.e., . We also denote to be the reward observed from the pull of arm . is thus a sequence of i.i.d. random variables, each distributed as . Note that the definition of implies that we have . We further define to be the cumulative reward obtained from arm until time .
Once a policy is fixed, then for all , , , and for all , become well-defined random variables. We consider the following notion of reward for a policy .
(1) 
In words, the objective value attained by the policy is the expected value of the largest cumulative reward across all arms at the end of the decision-making horizon. When the reward distributions are known to the decision-maker, then for a large , the best reward that the decision-maker can achieve is
A natural candidate for a “good” policy when the reward distributions are known is the one where the decision-maker exclusively plays arm (the arm with the highest mean), attaining an expected reward of . Let us denote . One can show that, in fact, this is the best reward that one can achieve in our problem.
Proposition 1
For any bandit instance , .
The proof is presented in Section 7 in the Appendix. This shows that the simple policy of always picking the arm with the highest mean is optimal for our problem. Next, we denote the regret of any policy to be
We consider the objective of finding a policy which achieves the smallest regret in the worst case over all distributions , i.e., we wish to solve the following optimization problem:
Let denote the min-max (or the best worst-case) regret, i.e.,
In the remainder of the paper, we will show that the worst-case regret is of order .
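The "max" objective and the associated regret can be estimated by simulation. The sketch below is an illustration under assumed Bernoulli rewards: it measures regret against the benchmark of always playing the arm with the highest mean, which Proposition 1 identifies as optimal and which earns the horizon length times the highest mean in expectation. The function name, the policy interface `policy(t, history)`, and the numerical values are all hypothetical.

```python
import random

def estimate_max_objective(policy, means, T, runs=200, seed=0):
    """Monte Carlo estimate of E[max_i (cumulative reward of arm i)] after T pulls."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        cumulative = [0.0] * len(means)
        history = []  # (arm, reward) pairs observed so far
        for t in range(T):
            arm = policy(t, history)
            reward = 1.0 if rng.random() < means[arm] else 0.0
            cumulative[arm] += reward
            history.append((arm, reward))
        total += max(cumulative)
    return total / runs

means = [0.8, 0.5]  # hypothetical Bernoulli means
T = 200
benchmark = T * max(means)  # expected reward of always pulling the best arm

def commit_to_best(t, history):
    return 0  # arm 0 has the highest mean in this instance

def round_robin(t, history):
    return t % len(means)

regret_commit = benchmark - estimate_max_objective(commit_to_best, means, T)
regret_round_robin = benchmark - estimate_max_objective(round_robin, means, T)
print(f"estimated regret, commit to best arm: {regret_commit:.2f}")
print(f"estimated regret, round robin:        {regret_round_robin:.2f}")
```

Round robin collects the same total reward as it would under the "sum" objective, but only the pulls on the leading arm count toward the maximum, so its estimated regret under the "max" objective is large.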
3 Lower Bound
We now show that for our objective, a regret of is inevitable in the worst case.
Theorem 3
Suppose that . Then,
The proof is presented in Section 8 in the Appendix. Informally, the argument for the case of arms is as follows. Consider two bandits with Bernoulli rewards, one with the mean rewards , and the other with mean rewards . Then until time , no algorithm can reliably distinguish between the two bandits. Hence, until this time, either pulls are spent on arm 1 irrespective of the underlying bandit, or pulls are spent on arm 2 irrespective of the underlying bandit. In both cases, the algorithm incurs a regret of , essentially because of wasting pulls on a suboptimal arm that could have been spent on earning reward on the optimal arm. This latter argument is not entirely complete, however, since it ignores the possibility of picking a suboptimal arm until time , in which case spending time on the suboptimal arm in the first time periods was not wasteful. However, even in this case, one incurs a regret of . Thus a regret of is unavoidable. Our formal proof builds on this basic argument to additionally determine the optimal dependence on .
4 Adaptive Explore-then-Commit (ADA-ETC)
We now define an algorithm that we call Adaptive Explore-then-Commit (ADA-ETC), specifically designed for our problem. It is formally defined in Algorithm 1. The algorithm can be simply described as follows. After choosing each arm once, choose the arm with the highest upper confidence bound, until there is an arm such that (a) it has been played at least times, and (b) its empirical mean is higher than the upper confidence bounds on the means of all other arms. Once such an arm is found, commit to this arm until the end of the decision horizon.
The upper confidence bound is defined in Equation 2. In contrast to its definition in UCB1, it is tuned to eliminate wasteful exploration and to allow stopping early if appropriate. We enforce the requirement that an arm is played at least times before committing to it by defining a trivial “lower confidence bound” (Equation 3), which takes value as long as the arm has been played fewer than times, after which both the upper and lower confidence bounds are defined to be the empirical mean of the arm. The stopping criterion can then be simply stated in terms of these upper and lower confidence bounds (Equation 4): stop and commit to an arm when its lower confidence bound is strictly higher than the upper confidence bounds of all other arms (this can never happen before pulls since the rewards are nonnegative).
Note that the collapse of the upper and lower confidence bounds to the empirical mean after pulls ensures that each arm is not pulled more than times during the Explore phase. This is because choosing this arm to explore after pulls would imply that its upper confidence bound = lower confidence bound is higher than the upper confidence bounds for all other arms, which means that the stopping criterion has been met and the algorithm has committed to the arm.
A heuristic rationale behind the choice of the upper confidence bound is as follows. Consider a suboptimal arm whose mean is smaller than the highest mean by . Let be the probability that this arm is misidentified and committed to in the Commit phase. Then the expected regret resulting from this misidentification is approximately . Since we want to ensure that the regret is at most in the worst case, we can tolerate a of at most . Unfortunately, is not known to the algorithm. However, a reasonable proxy for is , where is the number of times the arm has been pulled. This is because it is right around , when the distinction between this arm and the optimal arm is expected to occur. Thus a good (moving) target for the probability of misidentification is . This necessitates the scaling of the confidence interval in Equation 2. In contrast, our numerical experiments show that utilizing the traditional scaling of as in UCB1 results in significant performance deterioration. Our tuning is reminiscent of similar tuning of confidence bounds under the “sum” objective to improve the performance of UCB1; see Audibert and Bubeck (2009), Lattimore (2018), Auer and Ortner (2010).
Remark
Instead of defining the lower confidence bound to be until an arm is pulled times, one may define a nontrivial lower confidence bound to accelerate commitment, perhaps in a symmetric fashion as the upper confidence bound. However, this does not lead to an improvement in the regret bound. The reason is that if an arm looks promising during exploration, then eagerness to commit to it is imprudent, since if it is indeed optimal then it is expected to be chosen frequently during exploration anyway; whereas, if it is suboptimal then we preserve the option of eliminating it by choosing to not commit until after pulls.
Thus, to summarize, ADA-ETC eliminates wasteful exploration primarily by reducing the number of times suboptimal arms are pulled during exploration through the choice of appropriately aggressive upper confidence bounds, rather than by being hasty in commitment.
Let denote the implementation of ADA-ETC using and as the input for the number of arms and the time horizon, respectively. Also, define for . We characterize the regret guarantees achieved by in the following result.
Theorem 4 (ADA-ETC)
Let and suppose that . Then for any , the expected regret of is upper bounded as:
where . In the worst case, we have
(2)  
(3) 
Explore Phase: From time until , pull each arm once. For :
Step 1. Identify , breaking ties arbitrarily. If
(4) 
then define , break, and enter the Commit phase. Else, continue to Step 2.
Step 2. Identify , breaking ties arbitrarily. Pull arm .
Commit Phase: Pull arm until time .
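The control flow of the algorithm above can be sketched in code. This is a structural illustration only: the exact confidence radius of Equation 2 and the per-arm exploration cap are not reproduced here, so both are passed in as caller-supplied, hypothetical parameters (`radius` and `tau`), and rewards are assumed to be Bernoulli. The lower confidence bound is taken to be 0 before the cap is reached, an assumption consistent with the text's remark that commitment cannot occur earlier because rewards are nonnegative.

```python
import math
import random

def ada_etc_sketch(means, T, tau, radius, seed=0):
    """Structural sketch of an adaptive explore-then-commit loop.

    tau:    assumed per-arm exploration cap (hypothetical parameter).
    radius: assumed function n -> confidence radius after n pulls.
    Requires T >= len(means).
    """
    rng = random.Random(seed)
    K = len(means)
    pulls = [0] * K
    totals = [0.0] * K

    def pull(i):
        totals[i] += 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1

    def mean(i):
        return totals[i] / pulls[i]

    def ucb(i):
        # Collapses to the empirical mean once the arm has tau pulls.
        return mean(i) + (radius(pulls[i]) if pulls[i] < tau else 0.0)

    def lcb(i):
        # Trivial bound (assumed 0) before tau pulls; empirical mean afterwards.
        return mean(i) if pulls[i] >= tau else 0.0

    for i in range(K):  # pull each arm once
        pull(i)
    t, committed = K, None
    while t < T:
        a = max(range(K), key=lcb)
        # Stopping criterion: LCB of arm a strictly above every other UCB.
        if all(lcb(a) > ucb(i) for i in range(K) if i != a):
            committed = a
            break
        pull(max(range(K), key=ucb))  # explore the arm with the highest UCB
        t += 1
    if committed is not None:  # Commit phase
        for _ in range(T - t):
            pull(committed)
    return committed, pulls, totals

# Hypothetical instance; the radius is a UCB1-style placeholder, NOT the tuned bound.
T = 3000
committed, pulls, totals = ada_etc_sketch(
    means=[0.9, 0.2, 0.2], T=T, tau=100,
    radius=lambda n: math.sqrt(2.0 * math.log(T) / n),
)
print("committed arm:", committed, "pulls:", pulls)
```

Keeping `radius` as an argument makes it easy to compare different confidence-bound tunings within the same explore-and-stop structure, which is the kind of comparison the experiments section describes.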
The proof of Theorem 4 is presented in Section 9 in the Appendix. Theorem 4 features an instance-dependent regret bound and a worst-case bound of . The first two terms in the instance-dependent bound arise from the wasted pulls during the Explore phase. Under vanilla Explore-then-Commit, to obtain near-optimality in the worst case, every arm must be pulled times in the Explore phase (Slivkins 2019). Hence, the expected regret from the Explore phase is irrespective of the instance. On the other hand, our bound on this regret depends on the instance and can be significantly smaller than if the arms are easier to distinguish. For example, if and the instance are fixed (with ), and , then the regret from exploration (and the overall regret) is under ADA-ETC as opposed to under ETC. The next two terms in our instance-dependent bound arise from the regret incurred due to committing to a suboptimal arm, which can be shown to be in the worst case, thus matching the guarantee of ETC. The first of these terms is not problematic since it is the same as the regret arising under ETC. The second term arises due to the inevitably increased misidentifications occurring due to stopping early in adaptive versions of ETC. If the confidence bounds are aggressively small, then this term increases. In ADA-ETC, the upper confidence bounds used in exploration are tuned to be as small as possible while ensuring that this term is no larger than in the worst case. Thus, our tuning of the Explore phase ensures that the performance gains during exploration do not come at the cost of higher worst-case regret (in the leading order) due to misidentification.
5 Experiments
Benchmark Algorithms. We compare the performance of ADA-ETC with four algorithms described in Table 1. All algorithms, except UCB1 and ETC, have the same algorithmic structure as ADA-ETC: they explore based on upper confidence bounds and commit if the lower confidence bound of an arm rises above upper confidence bounds for all other arms. They differ from ADA-ETC in how the upper and lower confidence bounds are defined. These definitions are presented in Table 1. UCB1 never stops exploring and pulls the arm maximizing the upper confidence bound at each time step, while ETC commits to the arm with the highest empirical mean after each arm has been pulled times. Both NADA-ETC and UCB1-s use UCB1’s upper confidence bound, but they differ in their lower confidence bounds.
Name  

ADA-ETC  
NADA-ETC  
ETC  
UCB1  
UCB1-s 
Instances. We let , where is uniformly sampled from for each arm in each instance. We sample three sets of instances, each of size , with . The regret for an algorithm for each instance is averaged over runs to estimate the expected regret. We vary and . The average regret over the instances under different algorithms and settings is presented in Figure 1.
Discussion. ADA-ETC shows the best performance uniformly across all settings, although there are settings where its performance is similar to ETC. As anticipated, these are settings where either (a) , in which case, the arms are expected to be close to each other and hence adaptivity in exploring has little benefit, or (b) is relatively small, due to which is small. In these latter situations, the exploration budget of is expected to be exhausted for almost all arms under ADA-ETC, yielding performance similar to ETC, e.g., if and , then , i.e., a maximum of only three pulls can be used per arm for exploring. When is smaller, i.e., when arms are easier to distinguish, or when is large, the performance of ADA-ETC is significantly better than that of ETC. This illustrates the gains from the refined definition of the upper confidence bounds used to guide exploration in ADA-ETC.
Furthermore, we observe that the performances of UCB1-s and NADA-ETC are essentially the same as ETC. This is an important observation since it shows that naively adding adaptivity to exploration based on UCB1’s upper confidence bounds may not improve the performance of ETC, and appropriate refinement of the confidence bounds is crucial to the gains of ADA-ETC. Finally, we note that UCB1 performs quite poorly, thus demonstrating the importance of introducing an appropriate stopping criterion for exploration.
6 Conclusion and Future Directions
In this paper, we proposed and offered a near-tight analysis of a new objective in the classical MAB setting, of optimizing the expected value of the maximum of cumulative rewards across arms. From a theoretical perspective, although the current analysis of ADA-ETC is tight, it is unclear whether the extraneous (compared to the lower bound) factor from the upper bound can be eliminated via a more refined algorithm design. Additionally, our assumption that the rewards are i.i.d. over time, while appropriate for the application of qualifying an attractor product for e-commerce platforms, may be a limitation in the context of worker training. It would be interesting to study our objective in settings that allow rewards to decrease over time; such models, broadly termed as rotting bandits (Heidari et al. 2016, Levine et al. 2017, Seznec et al. 2019), have attracted recent focus in the literature as a part of the study of the more general class of MAB problems with nonstationary rewards (Besbes et al. 2014, 2019). This literature has so far only focused on the traditional “sum” objective.
More importantly, our paper presents the possibility of studying a wide variety of new objectives under existing online learning setups motivated by training applications, where the traditional objective of maximizing the total rewards is inappropriate. A natural generalization of our objective is the optimization of other functionals of the vector of cumulative rewards, e.g., maximizing the highest cumulative reward, which is relevant to online labor platforms as we mentioned in Section 1, or the optimization of the $\ell_p$ norm of the vector of cumulative rewards for , which has natural fairness interpretations in the context of human training (the traditional objective corresponds to the $\ell_1$ norm, while our objective corresponds to the $\ell_\infty$ norm). More generally, one may consider multiple skill dimensions, with job types that differ in their impact on these dimensions. In such settings, a similar variety of objectives may be considered driven by considerations such as fairness, diversity, and focus.
7 Proof of Proposition 1
For any policy , we have that
Here (a) is obtained due to pushing the max inside the sum; (b) is obtained because for all ; and (c) holds because the reward for an arm in a period is independent of the past history of play and observations. Thus, the reward of is the highest that one can obtain under any policy. And this reward can, in fact, be obtained by the policy of always picking arm . This shows that
8 Proof of Theorem 3
First we fix a policy . Let . We construct two bandit environments with different reward distributions for each of the arms and show that cannot perform well in both environments simultaneously.
We first specify the reward distribution for the arms in the base environment, denoted as the bandit . Assume that the rewards for all of the arms have the Bernoulli distribution, i.e., . We let , and for . We let denote the probability distribution induced over events until time under policy in this first environment, i.e., in bandit . Let denote the expectation under .
Define as the (random) number of pulls spent on arm until time (note that ) under policy . Specifically, is the total (random) number of pulls spent on the first arm under policy until time . Under policy , let denote the arm in the set that is pulled the least in expectation until time , i.e., Then clearly, we have that .
Having defined , we can now define the second environment, denoted as the bandit . Again, assume that the rewards for all of the arms have the Bernoulli distribution, i.e., . We let , for , and . We let denote the probability distribution induced over events until time under policy in this second environment, i.e., in bandit . Let denote the expectation under .
Suppose that in the first environment. Then we can argue that the regret is at least , up to an error of . To see this, note that this regret is at least the regret of a policy that maximizes the objective in environment , subject to the constraint that under this policy . This regret is at least the regret of a policy that minimizes the regret in environment , subject to the constraint that under this policy, . Now this latter regret can be shown to be at least , or at least (since ), up to an approximation error of .
Consider the $K$-armed bandit instance with Bernoulli rewards and mean vector , where . Consider a policy that satisfies . Then Hence,
The proof of Lemma 8 is presented below in this section. A similar argument shows that in the second environment, if , then , and hence the regret in the second environment is at least , again up to an approximation error of .
Consider the $K$-armed bandit instance with Bernoulli rewards and mean vector , where . Consider a policy that satisfies . Then Hence,
The proof of this second lemma is omitted since it is almost identical to that of Lemma 8. These two facts result in the following two inequalities:
(5)  
(6) 
Now, using the Bretagnolle-Huber inequality (see Thm. 14.2 in Lattimore and Szepesvári (2018)), we have,
(7)  
(8)  
(9) 
Here, () is the probability distribution induced by the policy on events until time under bandit (). The first equality then results from the fact that the two events and depend only on the play until time . In the second inequality, which results from the Bretagnolle-Huber inequality, is the relative entropy, or the Kullback-Leibler (KL) divergence between the distributions and respectively. We can upper bound as,
where () denotes the reward distribution of arm in the first (second) environment. The first equality results from the fact that no arm other than offers any distinguishability between and . The next inequality follows from the fact that , since by definition, is the arm that is pulled the least in expectation until time in bandit under . Now is simply the relative entropy between the distributions and , which, by elementary calculations, can be shown to be at most , resulting in the final inequality. Thus, we finally have,
Substituting gives
Finally, using gives the desired lower bound on the regret.
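The two ingredients of this argument, the Bretagnolle-Huber inequality and a bound on the relative entropy between two Bernoulli distributions, can be checked numerically on a toy case. The sketch below uses hypothetical parameters `p` and `q` (the specific means used in the proof are not reproduced here) and the standard fact that the KL divergence is at most the chi-square divergence, which for Bernoulli distributions equals $(p-q)^2 / (q(1-q))$.

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def bretagnolle_huber_lower_bound(kl):
    """Lower bound on P(A) + Q(A^c) for any event A, given KL(P || Q)."""
    return 0.5 * math.exp(-kl)

# Hypothetical pair of Bernoulli means, for illustration only.
p, q = 0.5, 0.6
kl = kl_bernoulli(p, q)
chi_square_bound = (p - q) ** 2 / (q * (1.0 - q))

# Single-observation check with the event A = {X = 1}:
# P(A) + Q(A^c) = p + (1 - q).
lhs = p + (1.0 - q)
rhs = bretagnolle_huber_lower_bound(kl)

# For n i.i.d. observations the divergence adds up to n * kl, so the
# lower bound decays as exp(-n * kl) in the number of informative pulls.
n = 25
rhs_n = bretagnolle_huber_lower_bound(n * kl)

print(f"KL = {kl:.4f}, chi-square upper bound = {chi_square_bound:.4f}")
print(f"P(A) + Q(A^c) = {lhs:.3f} >= {rhs:.3f}")
print(f"lower bound after {n} observations: {rhs_n:.3f}")
```

The last quantity illustrates why the proof controls the expected number of pulls of the one arm that differs between the two environments: the fewer such pulls, the smaller the divergence, and the harder the two bandits are to tell apart.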
Proof of Lemma 8. We first have that
Since , by Hoeffding’s inequality, we have that for any ,
Hence, by the union bound we have for any ,
Thus, defining , and for all , we finally have,
Here (a) follows from the fact that .
9 Proof of Theorem 4
The proof of Theorem 4 utilizes two technical lemmas. The first one is the following.
Lemma 9
Let , and , , , be a sequence of independent zero-mean 1-sub-Gaussian random variables. Let . Then for any ,
Its proof is similar to the proof of Lemma 9.3 in Lattimore and Szepesvári (2018), which we present below for completeness.
Proof of Lemma 9. We have,
(10) 
where the first inequality follows from a union bound on a geometric grid. The second inequality is used to set up the argument to apply Theorem 9.2 in Lattimore and Szepesvári (2018) and the third inequality is due to its application. The fourth inequality follows from for . Then, using the property of unimodal functions ( for such a function ), the term