Beyond the ClickThrough Rate: Web Link Selection with Multilevel Feedback
Abstract
The web link selection problem is to select a small subset of web links from a large web link pool, and to place the selected links on a web page that can only accommodate a limited number of links, e.g., advertisements, recommendations, or news feeds. Despite the long concerned clickthrough rate which reflects the attractiveness of the link itself, the revenue can only be obtained from user actions after clicks, e.g., purchasing after being directed to the product pages by recommendation links. Thus, the web links have an intrinsic multilevel feedback structure. With this observation, we consider the contextfree web link selection problem, where the objective is to maximize revenue while ensuring that the attractiveness is no less than a preset threshold. The key challenge of the problem is that each link’s multilevel feedbacks are stochastic, and unobservable unless the link is selected. We model this problem with a constrained stochastic multiarmed bandit formulation, and design an efficient link selection algorithm, called Constrained Upper Confidence Bound algorithm (ConUCB), and prove bounds on both the regret and the violation of the attractiveness constraint. We conduct extensive experiments on three realworld datasets, and show that ConUCB outperforms stateoftheart contextfree bandit algorithms concerning the multilevel feedback structure.
Beyond the ClickThrough Rate: Web Link Selection with Multilevel Feedback
Kun Chen^{*}, Kechao Cai^{†}, Longbo Huang^{‡}, John C.S. Lui^{†}, ^{*}^{‡}Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University ^{†}Department of Computer Science & Engineering, The Chinese University of Hong Kong ^{*}chenkun14@mails.tsinghua.edu.cn, ^{†}{kccai, cslui}@cse.cuhk.edu.hk, ^{‡}longbohuang@tsinghua.edu.cn
1 Introduction
With the rapid development of the Internet, web links are playing important roles in websites and mobile apps for attracting users and generating revenues. For example, ecommerce websites such as Amazon and Taobao show featured recommendation links on shopping pages to induce more purchase. Online social networks such as Facebook and Google+ constantly push links of trending topics and friends’ updates to users, so as to increase user engagement. Online media such as HBO and iQIYI present links to popular TV shows and movies on their homepages to attract more subscribers.
Due to the limited space of interest on a web page, only a finite number of links can be shown to a user when the page is browsed. This raises the web link selection problem, i.e., how to properly select a small subset of web links from a large link pool for a web page. Specifically, the web page on which the selected links are shown is called the target page. If clicked, each link directs the user to a subsequent page. This implies that web links provide a multilevel feedback to the web operator. The first level feedback refers to the likelihood that a user clicks a link, also known as the ClickThrough Rate (CTR) at the target page. By tracking user actions after clicking a link, e.g., purchase or subscription, we can determine the revenue collected on the associated subsequent page, which gives the second level feedback. Since purchase or subscription can only happen after the click, the compound feedback is the product of the firstlevel and the secondlevel feedbacks. Intuitively, the firstlevel feedback (CTR) indicates the attractiveness of the link at the target page, while the secondlevel feedback indicates the potential revenue that can be collected from the subsequent page. The compound feedback reveals the compound revenue a web link can bring.
There has been a lot of research concerning the CTR of web links, e.g., [?; ?]. On the other hand, what happens after clicks is also worth great attention, as it generates revenue. For instance, cost per acquisition (CPA) is regarded as the optimal way for an advertiser to buy online advertising [?]. Both the attractiveness and the profitability of a website or an app are important measures [?], because they represent the longterm and shortterm benefits, respectively. This motivates us to move beyond CTR and to pursue both attractiveness and profitability simultaneously in link selection.
In this work, we consider the problem of selecting a finite number of links from a large pool for the target page, so as to maximize the total compound revenue, while keeping the total attractiveness above a certain threshold. The constraint on attractiveness (CTR) was also adopted in the literate of online advertising [?; ?]. In addition, we also take into consideration the fact that contextual information, e.g., user preferences, is not always available, e.g., incognito visits [?], cold start [?], or cookie blocking [?]. Thus, we do not assume any prior contextual information. We refer to our problem as the contextfree web link selection problem.
Our link selection problem is challenging. First of all, the attractiveness and profitability of a link can be conflicting goals, as shown in [?]. As a result, while selecting links with high CTRs satisfies the attractiveness constraint, it does not necessarily guarantee that the target page will have a high total compound revenue, and vice versa. What further complicates the problem is that the multilevel feedbacks, i.e., the CTR (firstlevel feedback) and the potential revenue (secondlevel feedback) of each link, are stochastic and unobservable unless a link is selected and shown on the target page.
We formulate our problem as a constrained stochastic multipleplay multiarmed bandit problem with multilevel rewards. Specifically, there are multiple arms in the system. Each arm represents a link in the pool. Its firstlevel reward, secondlevel reward, and compound reward correspond to the firstlevel feedback (the CTR), the secondlevel feedback (the potential revenue) and the compound feedback (the compound revenue) of that link, respectively. The objective is to select a finite number of links at each time step to minimize the cumulative regret, as well as the cumulative violation of the constraint. We design a constrained bandit algorithm, Constrained Upper Confidence Bound algorithm (ConUCB), to simultaneously achieve sublinear regret and violation bounds.
Our main contributions are as follows. (i) We formulate the link selection problem as a constrained bandit problem with stochastic multilevel rewards (Section 3). (ii) We propose the ConUCB algorithm (Section 4) and prove that ConUCB ensures small regret and violation bounds with high probability, i.e., for any given failure probability , the regret and violation at time are bounded by with probability at least (Section 5). (iii) We conduct extensive experiments on three realworld datasets. Our results show that ConUCB outperforms three stateoftheart contextfree bandit algorithms, CUCB [?], Exp3.M [?], and LExp [?] for the constrained link selection problem (Section 6).
2 Related work
Link selection, or website optimization, has long been an important problem. One common approach for the problem is A/B testing [?; ?], which splits the traffic to two web pages with different designs, and evaluates their performances. However, the overhead of A/B testing can be high when the web link pool is large, as it needs to compare different link combinations. Moreover, A/B testing does not have any loss/regret guarantees. Another approach is to model the link selection problem as a contextual multiarmed bandit problem [?], and to incorporate the collaborative filtering method [?; ?]. However, these contextual bandit formulations neglect the multilevel feedback structures and do not consider any constraint.
The multipleplay multiarmed bandit problem, where multiple arms are selected in each round, has been studied from both theoretical and empirical perspectives, and many policies have been designed [?; ?; ?; ?]. Our constrained multipleplay bandit model differs from aforementioned models in that we consider meeting the constraint on the total firstlevel rewards in selecting multiple arms, which is important for web link selection.
Recently, bandit with budgets [?; ?; ?] and bandit with knapsacks [?; ?] have attracted much research attention. In these problems, pulling an arm costs certain resources, and each resource has a budget. Thus, resource cost is implicitly taken into consideration during the analysis of regret in the above two formulations since the arm selection process stops when resources are depleted. In contrast, since the constraint in our model is a requirement on the average performance, our arm selection procedure can last for an arbitrary length of time, and we need to consider both the regret and the violation of the constraint during the process. Thus, while our work builds upon the results in [?] and [?], the problem is different, and we study the multipleplay case rather than the singleplay case. In addition, we conduct experiments on realworld datasets, which are not included in their works. On the other hand, the thresholding bandit problem in [?] is to find the set of arms whose means are above a given threshold through pure exploration in a fixed time horizon, which is different from our model.
Our work is closest to recent work [?]. They assume the secondlevel reward is adversarial. However, it has been observed that this might not be the case in practice [?], and user behavior is likely to follow certain statistical rules when the number of users is large. So we study the stochastic case. Most importantly, our algorithm guarantees performance with high probability rather than in expectation, and the regret and violation bounds are improved significantly from and in their algorithm (LExp) to both in our algorithm (ConUCB).
3 Model
Consider the twolevel feedback contextfree web link selection problem, where one needs to select links from a pool of web links, , to display on the target page. Each link directs users to a subsequent page. If is shown on the target page, we obtain the following feedbacks when users browse the page:

the clickthrough rate (CTR), i.e., the probability that a user clicks to visit the corresponding subsequent page,

the afterclick revenue, i.e., the revenue collected from each user who clicks and then purchases products (or subscribes to programs) on the corresponding subsequent page.
In practice, the clickthrough rate and the afterclick revenue are stochastic, and we do not assume any prior knowledge about their distributions or expectations. The product of the CTR and the afterclick revenue is the compound revenue, i.e., the revenue that can bring if it is shown on the target page. The objective of the link selection problem is to maximize the total compound revenue of the selected links, subject to the constraint that the total CTR of these selected links is no less than a preset threshold ,^{1}^{1}1CTR measures the attractiveness of a link to users and is an important metric for the link selection problem. where is determined by the web operator based on service requirement. An example is that in online advertising, the constraint on CTR is usually specified in the contract between the publisher (web operator) and the advertising firm [?; ?].
To address the link selection problem, we formulate it as a constrained stochastic multiarmed bandit problem with multiple plays, where each arm has a twolevel reward structure. In this formulation, each time step is a short duration and each arm corresponds to a specific web link. Thus, the set of arms can be written as . Each arm is associated with two sequences of random variables, and , where characterizes arm ’s firstlevel reward (CTR) at time , and characterizes arm ’s secondlevel reward (afterclick revenue). We assume that for any , both and are sequences of i.i.d. random variables. The expectations of and are denoted by and , . We also assume that is independent of for , . Thus, the compound reward of arm at time is with mean . Denote and . Without loss of generality, we assume that and .
As mentioned above, the distributions or expectations of the twolevel reward for any arm are unknown beforehand. At each time step , an algorithm selects a set of arms , and observes the first level reward as well as the second level reward for each arm . The optimal policy is the one that maximizes the expected total compound reward of the selected arms, while keeping the total first level reward above the preset threshold .^{2}^{2}2If , the problem is equivalent to the classic unconstrained multipleplay multiarmed bandit problem (MPMAB) [?]. If , there is no policy that can satisfy the constraint.
The optimal policy is not limited to deterministic policies as in traditional multiarmed bandit problems [?; ?], but can be randomized, i.e., a distribution on the possible selections . In practice, the number of web links can be very large, and the number of possible selections of links at each time step can be as large as , which makes it complicated to consider randomized policies. To simplify the problem, we represent a randomized policy with a probabilistic selection vector , where is the probability of selecting arm and is the one vector.^{3}^{3}3If not specified otherwise, all vectors defined in this paper are column vectors. At each time , the selection set under a randomized policy is generated via a dependent rounding procedure [?], which guarantees the probability that is (see Section 4).
The set of randomized policies can be denoted by . Thus, the optimal stationary randomized policy is
(1) 
Our objective is to design an algorithm to decide the selection set for , such that the regret, i.e., the accumulated difference between the compound reward under and that under the optimal policy, is minimized. Specifically,
(2) 
Note that the total firstlevel reward of arms in may violate the constraint, especially when is small and we have little information about the arms. To measure the overall violation of the constraint at time , we define violation of algorithm as,
(3) 
where . Note that when designing link selection algorithms, we should take both the regret and violation into consideration, so as to achieve both sublinear regret and sublinear violation with respect to . Also, note that our model can be generalized to link selection problems with level () feedback structures, by taking a subsequent page as a new target page and select links for it with the above model, and so on.
4 Algorithm
In this section, we present our Constrained Upper Confidence Bound algorithm (ConUCB), and describe its details in Algorithm 1. Let denote the historical information of chosen actions and observations up to time . Define the empirical average firstlevel reward and compound reward for each arm as
(4)  
where is the number that arm is played before time . Define as in [?] where is a constant. In ConUCB, we use the following Upper Confidence Bounds for the unknown rewards [?]:
Denote , , and , . In the initialization step of Algorithm 1, is set to , where is an input parameter, i.e., the allowed failure probability.
Specifically, in each round, ConUCB solves the optimization problem (5) to get the probabilistic selection vector (line 5). Notice that (5) is similar to the original constrained optimization problem (1) but uses the Upper Confidence Bounds to replace the unknown rewards. Then, is generated via a dependent rounding procedure. In line 7 we receive the twolevel rewards and for arms in and update the empirical average rewards to get the Upper Confidence Bounds for the next round.
(5) 
5 Theoretical Analysis
In this section, we bound the regret and violation of Algorithm 1. We will make use of the concentration inequalities in the following lemmas.
Lemma 1 (AzumaHoeffding inequality [?]).
Suppose is a martingale and almost surely, then with probability at least , we have
Lemma 2 ([?; ?; ?]).
Consider i.i.d random variables in with expectation . Let denote their empirical average. Then, for any , with probability at least , we have
where .
The following lemma is a corollary of Lemma 2.
Lemma 3.
Define the empirical averages and as in (4). Then, for every and , with probability at least , we have
where . The same result holds between and .
Proof.
For every and , applying Lemma 2, we have that, with probability at least ,
This implies that
The last inequality holds because . ∎
Based on the above lemmas, we obtain the following properties about ConUCB.
Lemma 4.
By running ConUCB for rounds with , with probability at least , the following results hold simultaneously:
(6)  
(7)  
(8)  
(9) 
Proof.
See Appendix A. ∎
From Lemma 4, we can obtain the regret and violation bounds for ConUCB.
Theorem 1.
For all , let . By running ConUCB, we have with probability at least that,
Proof.
We bound the regret and violation using (6) to (9), which were shown to hold with probability at least in Lemma 4.
From (6) we know for all , is a feasible solution of the optimization problem (5), i.e., . Then, for all , we have,
(10) 
where the last inequality follows from (8). Combining (9) and (10), we have
On the other hand, since for all , (5) has a feasible solution , we know . Then with (7), we can get
This completes the proof. ∎
6 Experiments
We conduct experiments on three realworld datasets to evaluate the performance of ConUCB. Two datasets, CouponPurchase [?] and AdClicks [?], with coupons and ads respectively, are shown to have a twolevel feedback structure in [?]. In particular, for each coupon in CouponPurchase, a user who clicks the link to the coupon can decide whether to purchase that coupon; for each ad in AdClicks, a user who clicks the link to the ad can decide whether to request the corresponding seller’s phone number. Thus, for CouponPurchase (AdClicks), the firstlevel feedback is the CTR of each coupon (the CTR of each ad) and the secondlevel feedback is the purchase rate of each coupon (the phone request rate of each ad). The third dataset, edXCourse, is extracted from the data on Harvard and MIT edX online courses [?]. In particular, for the online courses, we obtain course participation rates by normalizing the numbers of participants using minmax scaling and treat the course participation rates as the firstlevel feedback; we calculate course certification rates by dividing the numbers of certified participants by the numbers of participants, and treat the course certification rates as the secondlevel feedback.
We treat the coupons, ads, and courses as different sets of arms. To simulate the realtime twolevel feedback of the coupons, ads, and courses, we generate the firstlevel reward of each arm (coupon, ad, and course) using a Bernoulli variable with mean taken from the firstlevel feedback (coupon CTR, ad CTR, and course participation rate) in the three datasets, and generate the secondlevel reward of each arm using another independent Bernoulli variable with mean taken from the secondlevel feedback (coupon purchase rate, ad phone request rate, and course certification rate).
For comparison purposes, we implement three stateoftheart bandit algorithms that can select multiple arms at each round as baselines, i.e., CUCB [?], Exp3.M [?] and LExp [?]. Specifically, CUCB selects the top arms with the highest UCB indices . Exp3.M selects arms using exponential weights on the compound rewards of arms, and LExp selects arms using exponential weights based on the Lagrangian function of reward and violation of arms.
For the three datasets, we run the three algorithms together with ConUCB for rounds with parameter settings as shown in Figure 1–3, respectively. In particular, the parameters of Exp3.M and LExp are set in accordance with Corollary 1 of [?] and Theorem 1 of [?], respectively. We compare the cumulative regrets of LExp and ConUCB at each round , where the optimal policy is computed from the means of the twolevel feedback taken from each datatset. (Note that the regrets of CUCB and Exp3.M are not considered since they both have an unconstrained optimal policy, and therefore have different regret definitions from LExp and ConUCB.) We also compare the cumulative violations and the cumulative compound rewards of the four algorithms. To put things into perspective, we compare the ratios between the cumulative rewards and the cumulative violations of all the algorithms. Such ratios show how much reward an algorithm can gain for each unit violation it has made.
The experiment results are averaged over runs of each algorithm and illustrated in Figure 1–3. Figure 1(a) shows that the cumulative regret of ConUCB is much lower than that of LExp on the CouponPurchase dataset. This shows that ConUCB can reduce the regret significantly by selecting arms using UCBbased optimization instead of exponential weights as in LExp. Figure 1(b) and Figure 1(c) show the cumulative violations and the cumulative rewards of the four algorithms. In particular, the Optimalreward in Figure 1(c) shows the cumulative reward of the optimal policy at each round . As shown in Figure 1(c), CUCB and Exp3.M have larger cumulative rewards than ConUCB and LExp, as both CUCB and Exp3.M neglect the threshold constraint and thereby blindly selecting arms that maximize the cumulative rewards. Therefore, both CUCB and Exp3.M incur huge cumulative violations as shown in Figure 1(b). Moreover, ConUCB has a larger cumulative reward and a lower cumulative violation than LExp. This matches our theoretical results that ConUCB has smaller regret as well as violation bounds than LExp. Figure 1(d) shows that ConUCB achieves the largest reward/violation ratios among the four algorithms. This means that ConUCB achieves the best tradeoff between rewards and violations and accumulates most reward for each unit violation it incurs.
We have similar experiment results on AdClicks and edXCourse to those on CouponPurchase. As shown in Figure 2 and Figure 3, ConUCB achieves lower cumulative regret and higher cumulative rewards than LExp, and has the lowest cumulative violations and largest reward/violation ratios among all algorithms. Due to space limit, we omit the details.
In summary, our experiment results are consistent with our theoretical analysis and demonstrate the effectiveness of our ConUCB algorithm in selecting arms with high cumulative rewards as well as low cumulative violations, thus achieving a good tradeoff between the reward and the violation.
7 Conclusion
In this paper, we consider the web link selection problem with multilevel feedback. We formulate it as a constrained multipleplay stochastic multiarmed bandit problem with multilevel reward. We design an efficient algorithm ConUCB for solving the problem, and prove that for any given allowed failure probability , with probability at least , ConUCB guarantees regret and violation bounds. We conduct extensive experiments on three realworld datasets to compare our ConUCB algorithm with stateoftheart contextfree bandit algorithms. Experiment results show that ConUCB balances regret and violation better than the other algorithms and outperforms LExp in both regret and violation.
Acknowledgment
This work is supported in part by the National Natural Science Foundation of China Grants 61672316, 61303195, the Tsinghua Initiative Research Grant, and the China Youth 1000Talent Grant.
References
 [Aggarwal et al., 2010] Gaurav Aggarwal, Elie Bursztein, Collin Jackson, and Dan Boneh. An analysis of private browsing modes in modern browsers. In Proceedings of the 19th USENIX conference on Security, 2010.
 [Agrawal and Devanur, 2014] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of ACM EC, 2014.
 [Anantharam et al., 1987] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple playspart i: Iid rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987.
 [Auer et al., 2002] Peter Auer, Nicolo CesaBianchi, and Paul Fischer. Finitetime analysis of the multiarmed bandit problem. Machine learning, 47(23):235–256, 2002.
 [Azuma, 1967] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19:357–367, 1967.
 [Badanidiyuru et al., 2013] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In Proceedings of FOCS, 2013.
 [Bresler et al., 2016] Guy Bresler, Devavrat Shah, and Luis Filipe Voloch. Collaborative filtering with low regret. In Proceedings of ACM SIGMETRICS, 2016.
 [Bubeck et al., 2012] Sébastien Bubeck, Nicolo CesaBianchi, et al. Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 [Cai et al., 2017] Kechao Cai, Kun Chen, Longbo Huang, and John C. S. Lui. Multilevel feedback web links selection problem: Learning and optimization. In Proceedings of ICDM, 2017.
 [Chen et al., 2013] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multiarmed bandit: General framework, results and applications. In Proceedings of ICML, 2013.
 [Chuang and Ho, 2016] Isaac Chuang and Andrew Dean Ho. Harvardx and mitx: Four years of open online courses–fall 2012summer 2016. SSRN, 2016.
 [Deng et al., 2017] Alex Deng, Jiannan Lu, and Jonthan Litz. Trustworthy analysis of online a/b tests: Pitfalls, challenges and solutions. In Proceedings of WSDM, 2017.
 [Ding et al., 2013] Wenkui Ding, Tao Qin, XuDong Zhang, and TieYan Liu. Multiarmed bandit with budget constraint and variable costs. In AAAI, 2013.
 [Elahi et al., 2016] Mehdi Elahi, Francesco Ricci, and Neil Rubens. A survey of active learning in collaborative filtering recommender systems. Computer Science Review, 20:29–50, 2016.
 [Gandhi et al., 2006] Rajiv Gandhi, Samir Khuller, Srinivasan Parthasarathy, and Aravind Srinivasan. Dependent rounding and its applications to approximation algorithms. Journal of the ACM (JACM), 53(3):324–360, 2006.
 [Kaggle, 2015] Kaggle. Avito context ad clicks, 2015. https://www.kaggle.com/c/avitocontextadclicks.
 [Kaggle, 2016] Kaggle. Coupon purchase data, 2016. https://www.kaggle.com/c/couponpurchaseprediction.
 [Kleinberg et al., 2008] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multiarmed bandits in metric spaces. In Proceedings of STOC, 2008.
 [Kohavi et al., 2014] Ron Kohavi, Alex Deng, Roger Longbotham, and Ya Xu. Seven rules of thumb for web site experimenters. In Proceedings of SIGKDD, 2014.
 [Komiyama et al., 2015] J. Komiyama, J. Hondaand, and H. Nakagawa. Optimal regret analysis of thompson sampling in stochastic multiarmed bandit problem with multiple plays. In Proceedings of ICML, 2015.
 [Kumar, 2015] Subodha Kumar. Optimization Issues in Web and Mobile Advertising: Past and Future Trends. Springer, 2015.
 [Lagrée et al., 2016] Paul Lagrée, Claire Vernade, and Olivier Cappe. Multipleplay bandits in the positionbased model. In Proceedings of NIPS, 2016.
 [Langheinrich et al., 1999] Marc Langheinrich, Atsuyoshi Nakamura, Naoki Abe, Tomonari Kamba, and Yoshiyuki Koseki. Unintrusive customization techniques for web advertising. Computer Networks, 31(11):1259–1272, 1999.
 [Li et al., 2010] L. Li, W. Chu, J. Langford, and R.E. Schapire. A contextualbandit approach to personalized news article recommendation. In Proceedings of WWW, 2010.
 [Li et al., 2016] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of SIGIR, 2016.
 [Locatelli et al., 2016] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of ICML, 2016.
 [Lohtia et al., 2003] Ritu Lohtia, Naveen Donthu, and Edmund K Hershberger. The impact of content and design elements on banner advertising clickthrough rates. Journal of advertising Research, 43(4):410–418, 2003.
 [Meng et al., 2016] Wei Meng, Byoungyoung Lee, Xinyu Xing, and Wenke Lee. Trackmeornot: Enabling flexible control on web tracking. In Proceedings of WWW, 2016.
 [Mookerjee et al., 2016] Radha Mookerjee, Subodha Kumar, and Vijay S Mookerjee. Optimizing performancebased internet advertisement campaigns. Operations Research, 65(1):38–54, 2016.
 [Pivazyan, 2004] Karen Arman Pivazyan. Decision making in multiagent systems. Stanford University, 2004.
 [Spooner, 2014] Jason Spooner. Why cost per acquisition is the only metric that really matters, 2014. https://socialmediaexplorer.com.
 [Uchiya et al., 2010] T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In Proceedings of ACL’10, 2010.
 [Wu et al., 2015] Huasen Wu, R Srikant, Xin Liu, and Chong Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Proceedings of NIPS, 2015.
 [Xia et al., 2016] Yingce Xia, Tao Qin, Weidong Ma, Nenghai Yu, and TieYan Liu. Budgeted multiarmed bandits with multiple plays. In Proceedings of IJCAI, 2016.
 [Xu et al., 2015] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. From infrastructure to culture: A/b testing challenges in large scale social networks. In Proceedings of SIGKDD, 2015.
Appendix A Proof of Lemma 4
Proof.
We first show that (6) and (7) hold with probability at least . Notice that . From Lemma 3, by taking a union bound over all and all , we obtain that for all and all , with probability at least ,
(11) 
which means
(12) 
To prove (7), we define a series of random variables as
We know and . Recall that denotes the historical information of chosen actions and observations up to time . Thus, by Lemma 1, we get, with probability at least ,
(13) 
Similarly, with probability at least ,
(14) 
Next we bound . Notice that (11) also implies that for all and ,
Let denote the time that arm is played for the th time. We have
(15)  
(16) 
where (15) follows from the CauchySchwarz inequality and (16) follows from the fact that . Thus, (13), (14) and (16) together give