Cost-aware Cascading Bandits
In this paper, we propose a cost-aware cascading bandits model, a new variant of multi-armed bandits with cascading feedback, by considering the random cost of pulling arms. In each step, the learning agent chooses an ordered list of items and examines them sequentially, until a certain stopping condition is satisfied. Our objective is then to maximize the expected net reward in each step, i.e., the reward obtained in each step minus the total cost incurred in examining the items, by deciding the ordered list of items, as well as when to stop examination. We study both the offline and online settings, depending on whether the state and cost statistics of the items are known beforehand. For the offline setting, we show that the Unit Cost Ranking with Threshold 1 (UCR-T1) policy is optimal. For the online setting, we propose a Cost-aware Cascading Upper Confidence Bound (CC-UCB) algorithm, and show that the cumulative regret scales in $O(\log T)$. We also provide a lower bound for all $\alpha$-consistent policies, which scales in $\Omega(\log T)$ and matches our upper bound. The performance of the CC-UCB algorithm is evaluated with both synthetic and real-world data.
Ruida Zhou, Chao Gan, Jing Yang, Cong Shen
University of Science and Technology of China; The Pennsylvania State University
1 Introduction
In this paper, we introduce a new cost-aware cascading bandits (CCB) model. We consider a set of $K$ items (arms) denoted as $[K] = \{1, 2, \ldots, K\}$. Each item has two possible states, 0 and 1, which evolve according to an independent and identically distributed (i.i.d.) Bernoulli random variable. The learning agent chooses an ordered list of items in each step and examines them sequentially until a certain stopping condition is met. The reward that the learning agent receives in a step equals one if one of the examined items in that step has state 1; otherwise, it equals zero. We associate a random cost with examining each item. The overall reward function, termed the net reward, is the reward obtained in each step minus the total cost incurred in examining the items before the learner stops.
The CCB model is a natural but technically non-trivial extension of cascading bandits [Kveton et al., 2015a], and is a more suitable model in many fields, including the following examples.
Opportunistic spectrum access. In cognitive radio systems, a user is able to probe multiple channels sequentially at the beginning of a transmission session before it decides to use at most one of the channels for data transmission. Assuming the channel states evolve between busy and idle in time, the user gets a reward if there exists one idle channel for transmission. The cost then corresponds to the energy and delay incurred in probing each channel.
Dynamic treatment allocation. In clinical trials, a doctor must assign one of several treatments to a patient in order to cure a disease. The doctor accrues information about the outcome of previous treatments before making the next assignment. Whether a treatment cures the disease can be modeled as a Bernoulli random variable, and the doctor gets a reward if the patient is cured. The doctor may be interested not only in the expected effect of the treatment but also in its riskiness, which can be interpreted as the cost of the treatment.
In both examples, the net reward in each step is determined not only by the subset of items included in the list, but also by the order in which they are pulled. Intuitively, if the costs of examining the items are homogeneous, we would prefer to rank the channel with a higher probability of being idle, or the more effective treatment, higher in the list. Then, the learner would find an available channel or cure the patient after a few attempts and stop examination, thus saving cost without decreasing the reward. However, for more general cases where the costs are heterogeneous or even random, the optimal solution is not immediately clear.
In this paper, we consider a general cost model where the costs of pulling arms are heterogeneous and random, and investigate the corresponding solutions. Our main contributions are three-fold:
First, we propose a novel CCB model, which has implications in many practical scenarios, such as opportunistic spectrum access and dynamic treatment allocation. The CCB model is fundamentally different from its cost-oblivious counterparts, and admits a unique structure in the corresponding learning strategy.
Second, with a priori statistical knowledge of the arm states and the costs, we explicitly identify the special structure of the optimal policy (coined the offline policy), which serves as the baseline for the online algorithm we develop. The optimal offline policy, called Unit Cost Ranking with Threshold 1 (UCR-T1), is to rank the arms based on the statistics of their states and costs, and pull those above a certain threshold sequentially until a state 1 is observed.
Third, we propose a cost-aware cascading Upper Confidence Bound (CC-UCB) algorithm for the scenario when prior arm statistics are unavailable, and show that it is order-optimal by establishing order-matching upper and lower bounds on the regret. Our analysis indicates that the UCB based algorithm performs well for ranking the arms, i.e., the cumulative regret of ranking the desired arms in a wrong order is bounded.
2 Related Literature
There have been some recent attempts to incorporate the cost of pulling arms and budget constraints into the multi-armed bandit (MAB) framework. They can be summarized in two types. In the first type [?; ?; ?], pulling each arm in the exploration phase has a unit cost and the goal is to find the best arm given the budget constraint on the total number of exploration arms. This type of problem is also referred to as “best-arm identification” or “pure exploration”. In the second type, pulling an arm is always associated with a cost and constrained by a budget, in both the exploration phase and the exploitation phase, and the objective usually is to design an arm pulling algorithm that maximizes the total reward under a given cost or budget constraint. References [?; ?; ?] consider the problem when the cost of pulling each arm is fixed and becomes known after the arm is used once. A sample-path cost constraint with known bandit dependent cost is considered in [?]. References [?; ?; ?; ?] study the budgeted bandit problems with random arm pulling costs. Reference [?] considers the knapsack problem where there can be more than one budget constraint and shows how to construct policies with sub-linear regret.
In the proposed CCB model, the net reward function is related to the cost of pulling arms and the learning agent faces a “soft constraint” on the cost instead of a fixed budget constraint. If the learner only pulls one arm in each step, the cost of pulling an arm can be absorbed into the reward of that arm (i.e., net reward). Our model then reduces to a conventional MAB model for this case. In this paper, however, we are interested in the scenario where the learner is allowed to sequentially pull a number of arms in each step, and the reward obtained in each step cannot be decomposed into the summation of the rewards from the pulled arms. Thus, the cost cannot be simply absorbed into the reward of individual arms. The intricate relationship between the cost and the reward, and its implications on the optimal policy require more sophisticated analysis, and that is the focus of this paper.
MAB with more than one arm to pull in each step has been studied in multiple-play MAB (MP-MAB) models in [?; ?; ?], cascading bandits (CB) models in [Kveton et al., 2015a; Kveton et al., 2015b; Zong et al., 2016], and ranked bandits (RB) models in [Radlinski et al., 2008; Streeter and Golovin, 2008; Slivkins et al., 2013]. Under MP-MAB, the learner is allowed to pull out of arms, and the reward equals the sum of the rewards from individual arms. Although the proposed CCB model also allows the user to pull multiple arms in each step, the reward is not additive, thus leading to different solutions.
The CCB model proposed in this paper is closely related to the CB model investigated in [Kveton et al., 2015a]. Specifically, [Kveton et al., 2015a] considers the scenario where at each step, out of items are listed by a learning agent and presented to a user. The user examines the ordered list from the first to the last, until he/she finds the first attractive item and clicks it. The system receives a reward if the user finds at least one of the items to be attractive. Our model has the same reward model as that in the cascading bandits setting. However, there are also important distinctions between them. In the CB model in [Kveton et al., 2015a], the total number of items to be examined in each step is fixed, and the cost of pulling individual arms is not considered. As a result, the same subset of items on a list will give the same expected reward, irrespective of their order on the list. However, for the proposed CCB model, the ranking of the items on a list does affect the expected net reward. Therefore, the structure of the optimal offline policy and the online algorithm we develop are fundamentally different from those in [Kveton et al., 2015a].
The proposed CCB model is also related to RB in the sense that the order of the arms to pull in each step matters. A crucial feature of RB is that the click probability for a given item may depend on the item and its position on the list, as well as the items shown above. However, in our case, we assume the state of an arm evolves in an i.i.d. fashion from step to step.
3 System Model and Problem Formulation
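Consider a $K$-armed stochastic bandit system where the state of each arm evolves independently from step to step. Let $X_{i,t}$ be the state of arm $i \in [K]$ in step $t$. Then, $X_{i,t} \in \{0, 1\}$ evolves according to an i.i.d. Bernoulli distribution with parameter $\theta_i$. Denote $Y_{i,t}$ as the cost of pulling arm $i$ in step $t$, where $Y_{i,t} \in [0, 1]$ evolves according to an i.i.d. unknown distribution with $\mathbb{E}[Y_{i,t}] = c_i$.
In step $t$, the learning agent chooses an ordered list of arms from $[K]$ and pulls the arms sequentially. Once an arm is pulled, its state and the pulling cost are revealed instantly. Denote the ordered list as $I_t := \{I_t(1), I_t(2), \ldots, I_t(|I_t|)\}$, where $I_t(i)$ is the $i$th arm to be pulled, and $|I_t|$ is the cardinality of $I_t$. Denote $\tilde{I}_t$ as the list of arms that have been actually pulled in step $t$. We have $\tilde{I}_t \subseteq I_t$.
The reward that the learning agent receives in step $t$ depends on both $\tilde{I}_t$ and $\{X_{i,t}\}_{i \in \tilde{I}_t}$. Specifically, it can be expressed as $1 - \prod_{i \in \tilde{I}_t} (1 - X_{i,t})$, i.e., the learning agent gets reward one if one of the arms that have been examined in step $t$ has state 1; otherwise, it equals zero. The cost that is incurred at step $t$ also depends on $\tilde{I}_t$, and it can be expressed as $\sum_{i \in \tilde{I}_t} Y_{i,t}$.
With a given ordered list $I_t$, $\tilde{I}_t$ is random and its realization depends on the observed $X_{i,t}$, $Y_{i,t}$ and the stopping condition in general. Denote the net reward received by the learning agent at step $t$ as
$$r_t := 1 - \prod_{i \in \tilde{I}_t} (1 - X_{i,t}) - \sum_{i \in \tilde{I}_t} Y_{i,t}.$$
Define the per-step regret $\mathrm{reg}_t := r_t^* - r_t$, where $r_t^*$ is the net reward that would be obtained at step $t$ if the statistics of $\{X_{i,t}\}$ and $\{Y_{i,t}\}$ were known beforehand and the optimal $I_t$ and stopping condition were adopted. Denote the observations up to step $t-1$ as $\mathcal{H}^{t-1}$. Then, $\mathcal{H}^{t-1} := \bigcup_{\tau=1}^{t-1} \{X_{i,\tau}, Y_{i,\tau}\}_{i \in \tilde{I}_\tau}$. Without a priori statistics about $\{X_{i,t}\}$ and $\{Y_{i,t}\}$, our goal is to design an online algorithm that decides $I_t$ based on the observations obtained in previous steps, $\mathcal{H}^{t-1}$, and $\tilde{I}_t$ based on the observed states and costs in step $t$, so as to minimize the cumulative regret
$$R(T) := \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{reg}_t\right].$$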
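To make the interaction protocol concrete, the following minimal Python sketch simulates a single step. It assumes that the learner stops at the first arm with state 1 (the rule that Section 4 argues for) and, purely for illustration, draws each cost uniformly between zero and twice the arm's mean cost so that its expectation equals the mean cost. The function name `play_step` and all variable names are hypothetical and do not come from the paper.

```python
import random


def play_step(ordered_list, theta, mean_cost, rng):
    """Pull arms in order, stopping at the first arm whose state is 1.

    Returns (net_reward, pulled), where `pulled` is the prefix of
    `ordered_list` that was actually examined in this step.
    """
    pulled = []
    total_cost = 0.0
    reward = 0
    for arm in ordered_list:
        # Bernoulli state with success probability theta[arm].
        state = 1 if rng.random() < theta[arm] else 0
        # Assumed cost law (illustration only): uniform with mean mean_cost[arm].
        cost = rng.uniform(0.0, 2.0 * mean_cost[arm])
        pulled.append(arm)
        total_cost += cost
        if state == 1:
            reward = 1
            break
    return reward - total_cost, pulled
```

The returned prefix plays the role of the list of arms actually pulled, and the first return value is the net reward of the step (reward minus accumulated cost).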
In the following, we will first identify the structure of the optimal offline policy with a priori arm statistics, and then develop an online algorithm to learn the arm statistics and track the optimal offline policy progressively.
4 Optimal Offline Policy
We first study the optimal offline policy for the non-trivial case where . We assume that the arm statistics and are known to the learning agent as prior knowledge. However, the instantaneous realizations of the rewards and costs associated with the arms are unknown to the learning agent until they are pulled. Under the assumption that the distributions of arm states and costs are i.i.d. across steps, the optimal offline policy should remain the same for different steps. Thus, in this section, we drop the step index and focus on the policy at an individual step.
According to the definition of the cascading feedback, for any ordered list, the reward in each step will not grow after the learner observes an arm with state 1. Therefore, to maximize the net reward, the learner should stop examining the rest of the list when a state 1 is observed, in order to save the cost of examination. Let the ordered list under the optimal offline policy be $I^*$. Since the $i$th arm on the list is pulled only if all arms before it have state 0, the expected net reward is
$$\mathbb{E}[r^*] = \sum_{i=1}^{|I^*|} \left(\theta_{I^*(i)} - c_{I^*(i)}\right) \prod_{j=1}^{i-1} \left(1 - \theta_{I^*(j)}\right), \tag{1}$$
where $\theta_i$ and $c_i$ denote the mean state and the mean cost of arm $i$, respectively.
We note that the expected net reward structure is more complex than the standard multi-armed bandits problem or the standard cascading model, and the optimal offline policy is not straightforward. By observing (1), we note that there are both a subtraction part and a product part inside each summation term. On one hand we should choose large to increase the value of , but on the other hand we should not choose an arm with a big gap between and , because big contributes to smaller . In the extreme case where no cost is assigned to arms, the optimal policy is to pull all arms in any order, and the problem becomes trivial.
For simplicity of the analysis, we make the following assumptions:
1) , for all . 2) There exists a constant , such that for all .
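Under Assumptions 1, we present the optimal offline policy in Theorem 1, which is called Unit Cost Ranking with Threshold 1 (UCR-T1), as it ranks the arms by the expected reward normalized by the average cost, $\theta_i / c_i$, and compares this ratio against threshold one.
Theorem 1. Arrange the arm indices such that
$$\frac{\theta_1}{c_1} \ge \frac{\theta_2}{c_2} \ge \cdots \ge \frac{\theta_L}{c_L} > 1 > \frac{\theta_{L+1}}{c_{L+1}} \ge \cdots \ge \frac{\theta_K}{c_K}.$$
Then, $I^* = \{1, 2, \ldots, L\}$, and the corresponding optimal per-step reward is
$$\mathbb{E}[r^*] = \sum_{i=1}^{L} (\theta_i - c_i) \prod_{j=1}^{i-1} (1 - \theta_j).$$
The proof of Theorem 1 is provided in Appendix A. Theorem 1 indicates that ranking the arms in descending order of $\theta_i / c_i$ and only including those with $\theta_i / c_i > 1$ in $I^*$ achieves a balanced tradeoff between the subtraction $\theta_i - c_i$ and the product $\prod_{j=1}^{i-1} (1 - \theta_j)$. This is an important observation which will also be useful in the online policy design.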
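The ranking-and-threshold rule of Theorem 1 can be illustrated numerically. The sketch below is not part of the original analysis: `theta` and `cost` are hypothetical per-arm mean states and mean costs, arms are 0-indexed, and the exhaustive search over all ordered sub-lists is only feasible for a handful of arms. It evaluates the expected net reward of a list under the stop-at-first-1 rule and compares the UCR-T1 list with the brute-force optimum.

```python
from itertools import permutations


def expected_net_reward(ordered_list, theta, cost):
    """Expected net reward of pulling `ordered_list` in order,
    stopping at the first arm whose state is 1."""
    total = 0.0
    prob_reach = 1.0  # probability that all earlier arms had state 0
    for arm in ordered_list:
        total += prob_reach * (theta[arm] - cost[arm])
        prob_reach *= 1.0 - theta[arm]
    return total


def ucr_t1(theta, cost):
    """Keep arms with theta/cost > 1, ranked by theta/cost in descending order."""
    keep = [i for i in range(len(theta)) if theta[i] / cost[i] > 1.0]
    return sorted(keep, key=lambda i: theta[i] / cost[i], reverse=True)


def brute_force_best(theta, cost):
    """Exhaustively search every ordered sub-list (small K only)."""
    arms = range(len(theta))
    candidates = [p for k in range(len(theta) + 1) for p in permutations(arms, k)]
    return max(candidates, key=lambda p: expected_net_reward(p, theta, cost))


# Hypothetical example values (not taken from the paper's experiments).
theta = [0.3, 0.6, 0.2, 0.5]
cost = [0.2, 0.2, 0.4, 0.6]

ranked = ucr_t1(theta, cost)                  # arms 1 and 0, in that order
exhaustive = brute_force_best(theta, cost)    # best list found by enumeration
```

For these illustrative values the ratios are 1.5, 3.0, 0.5 and about 0.83, so only arms 1 and 0 pass the threshold, and the exhaustive search attains the same expected net reward as the UCR-T1 list.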
5 Online Policy
With the optimal offline policy explicitly described in Theorem 1, in this section, we will develop an online algorithm to maximize the cumulative expected net rewards without a priori knowledge of and .
Unlike the previous work on MAB, the net reward structure in our setting is rather complex. One difficulty is that the learner has to rank the arms by the ratio of mean state to mean cost, $\theta_i / c_i$, and compare it with threshold 1 for exploitation. A method to deal with this difficulty is to use a UCB-type algorithm following the Optimism in the Face of Uncertainty (OFU) principle [Auer et al., 2002]. More specifically, we use a UCB-type indexing policy to rank the arms. Though the upper confidence bound is a biased estimate of the statistics, it will converge to the expectation asymptotically.
5.1 Algorithm and Upper Bound
The cost-aware cascading UCB (CC-UCB) algorithm is described in Algorithm 1. The costs are assumed to be random but the learning agent has no knowledge of their distributions. We use to track the number of steps that arm has been pulled up to step , and , to denote the sample average of and at step , respectively. The UCB padding term on the state and cost of arm at step is , where is a positive constant no less than 1.5.
CC-UCB adopts the OFU principle to construct an upper bound of the ratio between the mean state and the mean cost of each arm, $\theta_i / c_i$. The main technical difficulty, and correspondingly our novel contribution, however, lies in the theoretical analysis. This is because we have to deal with two types of regret: the regret caused by pulling “bad” arms, and that caused by pulling “good” arms in a wrong order. To the authors’ best knowledge, the latter component has not been addressed in the bandit literature before, and is technically challenging. The overall regret analysis of CC-UCB is thus complicated and non-trivial.
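Algorithm 1 is referenced above but its listing is not reproduced here, so the following Python sketch should be read as one plausible instantiation rather than the authors' exact procedure. The assumptions are: the padding term takes the common UCB form sqrt(alpha * log t / N), with N the pull count of the arm; the optimistic index is (sample mean state + padding) divided by (sample mean cost - padding), floored at a small constant `eps`; an arm is listed when its index exceeds 1; every arm is pulled once for initialization; and costs are drawn uniformly between zero and twice the mean cost. All function and variable names are hypothetical.

```python
import math
import random


def cc_ucb(theta, mean_cost, horizon, alpha=1.5, eps=0.05, seed=0):
    """Hedged sketch of a CC-UCB-style loop; returns the list chosen at each step."""
    rng = random.Random(seed)
    num_arms = len(theta)
    pulls = [0] * num_arms        # number of times each arm has been pulled
    state_sum = [0.0] * num_arms  # running sum of observed states
    cost_sum = [0.0] * num_arms   # running sum of observed costs

    def pull(arm):
        state = 1 if rng.random() < theta[arm] else 0
        # Assumed cost law (illustration only): uniform with mean mean_cost[arm].
        cost = rng.uniform(0.0, 2.0 * mean_cost[arm])
        pulls[arm] += 1
        state_sum[arm] += state
        cost_sum[arm] += cost
        return state

    for arm in range(num_arms):   # initialization: observe every arm once
        pull(arm)

    history = []
    for t in range(2, horizon + 1):
        index = {}
        for arm in range(num_arms):
            pad = math.sqrt(alpha * math.log(t) / pulls[arm])
            upper_state = state_sum[arm] / pulls[arm] + pad
            lower_cost = max(cost_sum[arm] / pulls[arm] - pad, eps)
            index[arm] = upper_state / lower_cost  # optimistic ratio estimate
        # List the arms whose optimistic ratio exceeds 1, best index first.
        chosen = sorted((a for a in range(num_arms) if index[a] > 1.0),
                        key=lambda a: index[a], reverse=True)
        for arm in chosen:        # cascade: stop at the first state 1
            if pull(arm) == 1:
                break
        history.append(chosen)
    return history
```

With one clearly good arm and one clearly bad arm, for example `cc_ucb([0.9, 0.05], [0.1, 0.5], horizon=3000)`, the good arm should end up at the head of the list in most late steps, while the bad arm is listed less and less often as its confidence interval shrinks.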
We have the following main result for the cumulative regret upper bound of Algorithm 1.
Theorem 2. Denote . The cumulative regret under Algorithm 1 is upper bounded as follows:
where includes all items in except those in .
Remark: In Theorem 2, the upper bound depends on , while conventional upper bounds for UCB algorithms usually depend on the gap between the ranking parameters. This is because the regret caused by pulling “good” arms in a wrong order is bounded, as shown in Section 5.2; thus, the regret is mainly due to pulling “bad” arms, which is determined by each such arm itself and is not related to the gap between the best arm and that arm. When the mean costs are known a priori, the upper bound can be reduced by a factor of 4.
5.2 Analysis of the Upper Bound
Define , i.e. there exists an arm whose sample average of reward or cost lies outside the corresponding confidence interval. Denote as the complement of . Then, we have the following observations.
Lemma 1. If , then, under Algorithm 1, all arms in will be included in .
Proof: According to Algorithm 1, arm will be included in if . When , we have , . Thus, , , which implies that .
Lemma 2. Under Algorithm 1, we have .
Then, define , which represents the event that arms from are not ranked in the correct order. Since those arms are pulled linearly often in order to achieve small regret, intuitively, happens with small probability.
The proof of Lemma 2 is based on Hoeffding’s inequality. The proof of Lemma 3 is derived based on the observation that the arms in are pulled linearly often if is true, and the corresponding confidence intervals shrink fast in time. As decreases below a certain threshold, happens. The detailed proofs of Lemma 2 and Lemma 3 can be found in Appendix B and Appendix C, respectively.
Lemma 4. Consider an ordered list that includes all arms from with the same relative order as in . Then, under Algorithm 1, .
Proof: First, we point out that the difference between and is that in , there may exist arms from that are inserted between the arms in . Denote such an ordered subset of arms as . Then, depending on the realization of the states of the arms on , a random subset of will be pulled (i.e., ), resulting in a different regret. Denote the index of the last pulled arm in as . Then, depending on the state of arm , there are two possible cases:
i) . For this case, the regret comes from the cost of pulling the arms in only. This is because if were the list of arms to pull, with the same realization of arm states, the learner would only pull the arms in and receive the same reward. Thus, given and , we have .
ii) . This indicates that is the last arm on due to the stopping condition. For this case, the learner incurs costs for pulling the arms in but also receives the full reward one. If were the list of arms to pull, with the same realization of arm states, the learner would first pull all arms in . Since the states of such arms should be 0, she would then continue pulling the remaining arms in if there are any. Denote the net reward obtained from the remaining pulls as . Then, . Therefore, given and , we have .
Combining both cases, we have , which completes the proof.
Next, we consider the regret resulting from including arms outside in the list . We focus on the scenario when both and are false. Notice that the condition of Lemma 4 is satisfied in this scenario. Thus,
Denote as the largest possible per-step regret, which is bounded by , corresponding to the worst case scenario that the learner pulls all arms but does not receive reward one. Then, combining the results from above, we have
which proves Theorem 2.
6 Lower Bound
Before presenting the lower bound, we first define $\alpha$-consistent policies.
Definition 1. Consider online policies that sequentially pull arms in until one arm with state 1 is observed. If for any , the policy is $\alpha$-consistent.
Lemma 5. For any ordered list , the per-step regret in step is lower bounded by .
Proof: Consider an ordered list , i.e., the sub-list of that only contains the arms from while keeping their relative order in . Denote as the reward obtained at step by pulling the arms in sequentially. We have
where inequality (6) follows from the fact that maximizes the expected reward in every step.
Similar to the proof of Lemma 4, we denote the index of the last pulled arm in as . Then, given and , we have .
If , based on the assumption that all policies should stop pulling in a step if a state 1 is observed, we infer that the arms in should have state 0. If were the list of arms to pull, with the same realization of arm states, the learner would continue pulling the remaining arms in if there are any. Denote the net reward obtained from the remaining pulls as . Then, due to the definition of , we have . Therefore, given and , we have .
Combining both cases, we have . Taking expectation with respect to , we obtain the lower bound for .
Theorem 3. Under any $\alpha$-consistent policy,
where is the KL divergence of Bernoulli distributions with means and .
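For reference, the KL divergence between two Bernoulli distributions that appears in the lower bound has a simple closed form. A minimal Python helper is sketched below; the name `kl_bernoulli` and its argument names are illustrative, and natural logarithms are assumed.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats.

    Uses the convention 0 * log(0 / x) = 0 and requires 0 < q < 1 so
    that the divergence is finite.
    """
    if not (0.0 < q < 1.0):
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total
```

The divergence is zero only when the two means coincide, so a lower bound that divides by it grows as the two means get closer.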
7 Experiments
In this section, we resort to numerical experiments to evaluate the performance of the CC-UCB algorithm in Algorithm 1. We set and . Both synthetic and real-world datasets are used.
7.1 Synthetic Data
We consider a 6-armed bandit setting with , and the mean of the cost for all arms. According to the UCR-T1 policy, we have , i.e., the first three arms should be included in . We then perform the CC-UCB algorithm under the assumption that both and are unknown to the learning agent. We run it for steps, and average the cumulative regret over 20 runs. The result is plotted in Figure 1(a). The error bar corresponds to one standard deviation of the regrets in 20 runs. We also study the case where the mean of the cost is known beforehand but the cost of each arm is still random. The result is also plotted in the same figure. As we observe, both curves increase sublinearly in , which is consistent with the bound we derive in Theorem 2. The regret with known cost statistics is significantly smaller than that of the unknown cost statistics case.
Next, we evaluate the impact of system parameters on the performance of the algorithm. We vary and , i.e., the total number of arms , and the number of arms in , respectively. We also change , i.e., , for . Specifically, we set for , for , and let be a constant across all arms. By setting , the value of can be determined. The cumulative regrets at averaged over 20 runs are reported in Table 1. We observe four major trends. First, the regret increases when the number of arms doubles. Second, the regret decreases when the number of arms in (i.e., ) increases. Third, a priori knowledge of cost statistics always improves the regret. Fourth, when is known, the regret increases as decreases. These trends are consistent with Theorem 2. However, the dependency on when is unknown is not obvious from the experimental results. This might be because the algorithm depends on the estimation of the cost as well as the arm state, and the complicated interplay between cost and the optimal arm pulling policy makes the dependency on hard to discern.
7.2 Real-world Data
In this section, we test the proposed CC-UCB algorithm using real-world data extracted from the click log dataset of Yandex Challenge [Int, 2011]. The click log contains complex user behaviors. To make it suitable for our algorithm, we extract data that contains the click history of a specific query. We choose the total number of links for the query to be and set a constant and known cost for all of them. We estimate the probability that a user would click a link based on the dataset and use it as the ground truth . We then test the CC-UCB algorithm based on the structure of the optimal offline policy. We plot the cumulative regret in Figure 1(b) with different values of . We observe that the cumulative regret grows sub-linearly in time, and monotonically increases as increases, which is consistent with Theorem 2. This indicates that the CC-UCB algorithm performs very well even when some of the assumptions (such as the i.i.d. evolution of arm states) we used to derive the performance bounds do not hold.
8 Conclusions
In this paper, we studied a CCB model by taking the cost of pulling arms into the cascading bandits framework. We first explicitly characterized the optimal offline policy UCR-T1, and then developed the CC-UCB algorithm for the online setting. We analyzed the regret behavior of the proposed CC-UCB algorithm by analyzing two different types of error events under the algorithm. We also derived an order-matching lower bound, thus proving that the proposed CC-UCB algorithm achieves order-optimal regret. Experiments using both synthetic data and real-world data were carried out to evaluate the CC-UCB algorithm.
References
- [Anantharam et al., 1987] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987.
- [Audibert and Bubeck, 2010] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In the 23rd Conference on Learning Theory, Haifa, Israel, June 2010.
- [Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- [Badanidiyuru et al., 2013] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216, 2013.
- [Bubeck et al., 2009] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, pages 23–37, 2009.
- [Burnetas and Kanavetas, 2012] Apostolos Burnetas and Odysseas Kanavetas. Adaptive policies for sequential sampling under incomplete information and a cost constraint. Applications of Mathematics and Informatics in Military Science, pages 97–112, 2012.
- [Burnetas et al., 2017] Apostolos Burnetas, Odysseas Kanavetas, and Michael N Katehakis. Asymptotically optimal multi-armed bandit policies under a cost constraint. Probability in the Engineering and Informational Sciences, 31(3):284–310, 2017.
- [Ding et al., 2013] Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 232–238, 2013.
- [Guha and Munagala, 2007] Sudipto Guha and Kamesh Munagala. Approximation algorithms for budgeted learning problems. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, pages 104–113, New York, NY, USA, 2007. ACM.
- [Int, 2011] “Internet mathematics”. https://academy.yandex.ru/events/data_analysis/relpred2011/, 2011.
- [Kalathil et al., 2014] D. Kalathil, N. Nayyar, and R. Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, April 2014.
- [Komiyama et al., 2015] Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Optimal regret analysis of thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of the 32nd International Conference on Machine Learning, pages 1152–1161, Lille, France, Jul 2015.
- [Kveton et al., 2015a] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 767–776, 2015.
- [Kveton et al., 2015b] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, pages 1450–1458, 2015.
- [Radlinski et al., 2008] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In 25th Intl. Conf. on Machine Learning (ICML), pages 784–791, 2008.
- [Slivkins et al., 2013] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Ranked bandits in metric spaces: Learning diverse rankings over large document collections. J. Mach. Learn. Res., 14(1):399–436, February 2013.
- [Streeter and Golovin, 2008] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In Advances in Neural Information Processing Systems 21, pages 1577–1584. 2008.
- [Tran-Thanh et al., 2010] Long Tran-Thanh, Archie C. Chapman, Enrique Munoz de Cote, Alex Rogers, and Nicholas R. Jennings. Epsilon-first policies for budget-limited multi-armed bandits. In AAAI, 2010.
- [Tran-Thanh et al., 2012] Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1134–1140, 2012.
- [Xia et al., 2015] Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Thompson sampling for budgeted multi-armed bandits. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3960–3966, 2015.
- [Xia et al., 2016a] Yingce Xia, Wenkui Ding, Xu-Dong Zhang, Nenghai Yu, and Tao Qin. Budgeted bandit problems with continuous random costs. In Geoffrey Holmes and Tie-Yan Liu, editors, Asian Conference on Machine Learning, volume 45, pages 317–332, Hong Kong, Nov 2016.
- [Xia et al., 2016b] Yingce Xia, Tao Qin, Weidong Ma, Nenghai Yu, and Tie-Yan Liu. Budgeted multi-armed bandits with multiple plays. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2210–2216. AAAI Press, 2016.
- [Zong et al., 2016] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 835–844, 2016.
Appendix A Proof of Theorem 1
Let be the list that is presented by the player, where . The expected net reward is
If in the policy there exists satisfying and , then we can define another list , which is only different from policy by swapping the and position: , and . Then the difference between the expected net rewards of and is
which implies that if the presented arms from are not in a descending order in , then we can always create a new list that achieves better expected net reward by swapping positions of some arms.
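The swap argument can be checked with a short computation. The following is a sketch under assumed notation, not a copy of the original equations: the swapped positions are taken to be adjacent, $a$ and $b$ denote the arms at positions $i$ and $i+1$ of the original list $I$, $\theta$ and $c$ denote mean states and mean costs, $I'$ is the list after the swap, and $P$ is the probability that the first $i-1$ arms all have state 0. Arms after position $i+1$ are reached with the same probability in both lists, so their terms cancel.

```latex
\begin{align*}
\mathbb{E}[r_{I'}] - \mathbb{E}[r_{I}]
 &= P\,\bigl[(\theta_b - c_b) + (1-\theta_b)(\theta_a - c_a)\bigr]
  - P\,\bigl[(\theta_a - c_a) + (1-\theta_a)(\theta_b - c_b)\bigr] \\
 &= P\,\bigl[\theta_a(\theta_b - c_b) - \theta_b(\theta_a - c_a)\bigr] \\
 &= P\,(\theta_b c_a - \theta_a c_b)
  = P\, c_a c_b \left(\frac{\theta_b}{c_b} - \frac{\theta_a}{c_a}\right).
\end{align*}
```

Under these assumptions the difference is positive whenever $P > 0$, the costs are positive, and the later arm has the larger ratio $\theta_b / c_b > \theta_a / c_a$, which is the situation in which the swap improves the expected net reward.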
Besides, the reward is the summation of . Then a term will be positive if . As a result, the optimal offline policy must contain all . Combining with (13) , we reach the conclusion that the reward will be maximized by presenting the top arms in a descending order based on .
Appendix B Proof of Lemma 2
We note that
where (15) follows from Hoeffding’s inequality.
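Hoeffding's inequality, in the form typically used for this kind of argument, can be stated as follows. The notation is assumed for illustration only: $Z_1, \ldots, Z_n$ are i.i.d. samples in $[0, 1]$ with mean $\mu$ and sample mean $\hat{\mu}_n$, and the padding is assumed to take the common form $u = \sqrt{\alpha \log t / n}$, which may differ from the exact definition used in Algorithm 1.

```latex
\begin{align*}
\mathbb{P}\bigl(|\hat{\mu}_n - \mu| > u\bigr) &\le 2\exp(-2 n u^2), \\
u = \sqrt{\frac{\alpha \log t}{n}} \;\Longrightarrow\;
\mathbb{P}\bigl(|\hat{\mu}_n - \mu| > u\bigr) &\le 2\exp(-2\alpha \log t) = 2\,t^{-2\alpha}.
\end{align*}
```

Under this assumed form, a union bound over the at most $t$ possible values of $n$ gives a per-arm failure probability of order $t^{1-2\alpha}$ at step $t$, which is summable over $t$, in particular for $\alpha \ge 1.5$.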
Appendix C Proof of Lemma 3
Before we proceed to prove Lemma 3, we first introduce the following definitions.
Define a random variable as follows:
Denote . We can verify that
As we will see later, we define in such a way in order to lower bound the probability of the event “ is false and arm is observed”.
For any integer , we define as the smallest step index such that . This definition implies that . Then, according to the definition of in (18), is Bernoulli( ). Since is measurable and are independent, are independent. Therefore, are i.i.d. Bernoulli random variables with parameter .
We then denote , i.e., the total number of steps up to when is false. Then, we have the following observation.
Lemma 6. For all ,
Proof: We note that
where (21) is based on the fact that arm will be pulled only when the states of all arms in listed before are , and its probability is lower bounded by that of the extreme case when the states of all arms except are . (22) follows from the definition of in (17).
Since , we have , and Lemma 6 follows.
Next, we are ready to prove Lemma 3.
Denote . Then, we have