Beyond the Click-Through Rate: Web Link Selection with Multi-level Feedback

Kun Chen*, Kechao Cai, Longbo Huang, John C.S. Lui
*Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
Department of Computer Science & Engineering, The Chinese University of Hong Kong

The web link selection problem is to select a small subset of web links from a large web link pool, and to place the selected links on a web page that can only accommodate a limited number of links, e.g., advertisements, recommendations, or news feeds. Although the long-studied click-through rate reflects the attractiveness of a link itself, revenue can only be obtained from user actions after clicks, e.g., purchasing after being directed to the product pages by recommendation links. Thus, web links have an intrinsic multi-level feedback structure. With this observation, we consider the context-free web link selection problem, where the objective is to maximize revenue while ensuring that the attractiveness is no less than a preset threshold. The key challenge of the problem is that each link’s multi-level feedback is stochastic, and unobservable unless the link is selected. We model this problem with a constrained stochastic multi-armed bandit formulation, design an efficient link selection algorithm called the Constrained Upper Confidence Bound algorithm (Con-UCB), and prove bounds on both the regret and the violation of the attractiveness constraint. We conduct extensive experiments on three real-world datasets, and show that Con-UCB outperforms state-of-the-art context-free bandit algorithms with respect to the multi-level feedback structure.

1 Introduction

With the rapid development of the Internet, web links are playing important roles in websites and mobile apps for attracting users and generating revenue. For example, e-commerce websites such as Amazon and Taobao show featured recommendation links on shopping pages to induce more purchases. Online social networks such as Facebook and Google+ constantly push links of trending topics and friends’ updates to users, so as to increase user engagement. Online media such as HBO and iQIYI present links to popular TV shows and movies on their homepages to attract more subscribers.

Due to the limited space of interest on a web page, only a finite number of links can be shown to a user when the page is browsed. This raises the web link selection problem, i.e., how to properly select a small subset of web links from a large link pool for a web page. Specifically, the web page on which the selected links are shown is called the target page. If clicked, each link directs the user to a subsequent page. This implies that web links provide multi-level feedback to the web operator. The first-level feedback refers to the likelihood that a user clicks a link, also known as the Click-Through Rate (CTR) at the target page. By tracking user actions after clicking a link, e.g., purchase or subscription, we can determine the revenue collected on the associated subsequent page, which gives the second-level feedback. Since purchase or subscription can only happen after the click, the compound feedback is the product of the first-level and the second-level feedback. Intuitively, the first-level feedback (CTR) indicates the attractiveness of the link at the target page, while the second-level feedback indicates the potential revenue that can be collected from the subsequent page. The compound feedback reveals the compound revenue a web link can bring.

There has been a lot of research concerning the CTR of web links, e.g., [??]. On the other hand, what happens after clicks is also worth great attention, as it generates revenue. For instance, cost per acquisition (CPA) is regarded as the optimal way for an advertiser to buy online advertising [?]. Both the attractiveness and the profitability of a website or an app are important measures [?], because they represent the long-term and short-term benefits, respectively. This motivates us to move beyond CTR and to pursue both attractiveness and profitability simultaneously in link selection.

In this work, we consider the problem of selecting a finite number of links from a large pool for the target page, so as to maximize the total compound revenue, while keeping the total attractiveness above a certain threshold. The constraint on attractiveness (CTR) was also adopted in the literature of online advertising [??]. In addition, we take into consideration the fact that contextual information, e.g., user preferences, is not always available, e.g., incognito visits [?], cold start [?], or cookie blocking [?]. Thus, we do not assume any prior contextual information. We refer to our problem as the context-free web link selection problem.

Our link selection problem is challenging. First of all, the attractiveness and profitability of a link can be conflicting goals, as shown in [?]. As a result, while selecting links with high CTRs satisfies the attractiveness constraint, it does not necessarily guarantee that the target page will have a high total compound revenue, and vice versa. What further complicates the problem is that the multi-level feedbacks, i.e., the CTR (first-level feedback) and the potential revenue (second-level feedback) of each link, are stochastic and unobservable unless a link is selected and shown on the target page.

We formulate our problem as a constrained stochastic multiple-play multi-armed bandit problem with multi-level rewards. Specifically, there are multiple arms in the system. Each arm represents a link in the pool. Its first-level reward, second-level reward, and compound reward correspond to the first-level feedback (the CTR), the second-level feedback (the potential revenue) and the compound feedback (the compound revenue) of that link, respectively. The objective is to select a finite number of links at each time step to minimize the cumulative regret, as well as the cumulative violation of the constraint. We design a constrained bandit algorithm, Constrained Upper Confidence Bound algorithm (Con-UCB), to simultaneously achieve sub-linear regret and violation bounds.

Our main contributions are as follows. (i) We formulate the link selection problem as a constrained bandit problem with stochastic multi-level rewards (Section 3). (ii) We propose the Con-UCB algorithm (Section 4) and prove that Con-UCB ensures small regret and violation bounds with high probability, i.e., for any given failure probability , the regret and violation at time are bounded by with probability at least (Section 5). (iii) We conduct extensive experiments on three real-world datasets. Our results show that Con-UCB outperforms three state-of-the-art context-free bandit algorithms, CUCB [?], Exp3.M [?], and LExp [?] for the constrained link selection problem (Section 6).

2 Related work

Link selection, or website optimization, has long been an important problem. One common approach for the problem is A/B testing [??], which splits the traffic between two web pages with different designs, and evaluates their performance. However, the overhead of A/B testing can be high when the web link pool is large, as it needs to compare different link combinations. Moreover, A/B testing does not have any loss/regret guarantees. Another approach is to model the link selection problem as a contextual multi-armed bandit problem [?], and to incorporate the collaborative filtering method [??]. However, these contextual bandit formulations neglect the multi-level feedback structures and do not consider any constraint.

The multiple-play multi-armed bandit problem, where multiple arms are selected in each round, has been studied from both theoretical and empirical perspectives, and many policies have been designed [????]. Our constrained multiple-play bandit model differs from the aforementioned models in that we consider meeting the constraint on the total first-level rewards in selecting multiple arms, which is important for web link selection.

Recently, bandit with budgets [???] and bandit with knapsacks [??] have attracted much research attention. In these problems, pulling an arm costs certain resources, and each resource has a budget. Thus, resource cost is implicitly taken into consideration during the analysis of regret in the above two formulations since the arm selection process stops when resources are depleted. In contrast, since the constraint in our model is a requirement on the average performance, our arm selection procedure can last for an arbitrary length of time, and we need to consider both the regret and the violation of the constraint during the process. Thus, while our work builds upon the results in [?] and [?], the problem is different, and we study the multiple-play case rather than the single-play case. In addition, we conduct experiments on real-world datasets, which are not included in their works. On the other hand, the thresholding bandit problem in [?] is to find the set of arms whose means are above a given threshold through pure exploration in a fixed time horizon, which is different from our model.

Our work is closest to recent work [?]. They assume the second-level reward is adversarial. However, it has been observed that this might not be the case in practice [?], and user behavior is likely to follow certain statistical rules when the number of users is large. So we study the stochastic case. Most importantly, our algorithm guarantees performance with high probability rather than in expectation, and the regret and violation bounds are improved significantly from and in their algorithm (LExp) to both in our algorithm (Con-UCB).

3 Model

Consider the two-level feedback context-free web link selection problem, where one needs to select links from a pool of web links, , to display on the target page. Each link directs users to a subsequent page. If is shown on the target page, we obtain the following feedbacks when users browse the page:

  1. the click-through rate (CTR), i.e., the probability that a user clicks to visit the corresponding subsequent page,

  2. the after-click revenue, i.e., the revenue collected from each user who clicks and then purchases products (or subscribes to programs) on the corresponding subsequent page.

In practice, the click-through rate and the after-click revenue are stochastic, and we do not assume any prior knowledge about their distributions or expectations. The product of the CTR and the after-click revenue is the compound revenue, i.e., the revenue that can bring if it is shown on the target page. The objective of the link selection problem is to maximize the total compound revenue of the selected links, subject to the constraint that the total CTR of these selected links is no less than a preset threshold , where is determined by the web operator based on service requirement. (Footnote 1: CTR measures the attractiveness of a link to users and is an important metric for the link selection problem.) An example is that in online advertising, the constraint on CTR is usually specified in the contract between the publisher (web operator) and the advertising firm [??].

To address the link selection problem, we formulate it as a constrained stochastic multi-armed bandit problem with multiple plays, where each arm has a two-level reward structure. In this formulation, each time step is a short duration and each arm corresponds to a specific web link. Thus, the set of arms can be written as . Each arm is associated with two sequences of random variables, and , where characterizes arm ’s first-level reward (CTR) at time , and characterizes arm ’s second-level reward (after-click revenue). We assume that for any , both and are sequences of i.i.d. random variables. The expectations of and are denoted by and , . We also assume that is independent of for , . Thus, the compound reward of arm at time is with mean . Denote and . Without loss of generality, we assume that and .

As mentioned above, the distributions or expectations of the two-level reward for any arm are unknown beforehand. At each time step , an algorithm selects a set of arms , and observes the first-level reward as well as the second-level reward for each arm . The optimal policy is the one that maximizes the expected total compound reward of the selected arms, while keeping the total first-level reward above the preset threshold . (Footnote 2: If , the problem is equivalent to the classic unconstrained multiple-play multi-armed bandit problem (MP-MAB) [?]. If , there is no policy that can satisfy the constraint.)

The optimal policy is not limited to deterministic policies as in traditional multi-armed bandit problems [??], but can be randomized, i.e., a distribution on the possible selections . In practice, the number of web links can be very large, and the number of possible selections of links at each time step can be as large as , which makes it complicated to consider randomized policies. To simplify the problem, we represent a randomized policy with a probabilistic selection vector , where is the probability of selecting arm and is the one vector. (Footnote 3: If not specified otherwise, all vectors defined in this paper are column vectors.) At each time , the selection set under a randomized policy is generated via a dependent rounding procedure [?], which guarantees the probability that is (see Section 4).
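As a hedged illustration of the probabilistic selection vector, the following Python sketch checks whether a candidate vector is a valid randomized policy and computes its expected totals. The function name, argument names, and toy numbers are assumptions made for illustration only. The expected compound reward of an arm is taken to be the product of its two mean rewards, which relies on the independence assumption stated earlier in this section.

```python
def evaluate_policy(p, mean_ctr, mean_revenue, num_links, threshold, tol=1e-9):
    """Evaluate a probabilistic selection vector p (hypothetical helper).

    Returns (is_feasible, expected_total_ctr, expected_total_compound).
    A vector is valid if every entry lies in [0, 1] and the entries sum
    to the number of links to display.
    """
    valid = (
        all(-tol <= x <= 1 + tol for x in p)
        and abs(sum(p) - num_links) <= tol
    )
    # Expected total first-level reward (CTR) of the randomized selection.
    expected_ctr = sum(x * a for x, a in zip(p, mean_ctr))
    # Expected total compound reward; mean compound reward = CTR * revenue
    # under the independence assumption.
    expected_compound = sum(
        x * a * b for x, a, b in zip(p, mean_ctr, mean_revenue)
    )
    feasible = valid and expected_ctr >= threshold - tol
    return feasible, expected_ctr, expected_compound


# Toy example: mix a low-CTR, high-revenue link with a high-CTR one.
result = evaluate_policy(
    [0.5, 0.0, 0.5], [0.1, 0.5, 0.9], [1.0, 0.8, 0.2],
    num_links=1, threshold=0.5,
)
```

In this toy instance the mixed policy meets the CTR threshold on average even though the first link alone would not.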

The set of randomized policies can be denoted by . Thus, the optimal stationary randomized policy is


Our objective is to design an algorithm to decide the selection set for , such that the regret, i.e., the accumulated difference between the compound reward under and that under the optimal policy, is minimized. Specifically,


Note that the total first-level reward of arms in may violate the constraint, especially when is small and we have little information about the arms. To measure the overall violation of the constraint at time , we define violation of algorithm as,


where . Note that when designing link selection algorithms, we should take both the regret and violation into consideration, so as to achieve both sub-linear regret and sub-linear violation with respect to . Also, note that our model can be generalized to link selection problems with -level () feedback structures, by taking a subsequent page as a new target page and selecting links for it with the above model, and so on.
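The two performance measures can be tracked with a few lines of Python. This is a minimal sketch under stated assumptions, because the exact formulas did not survive in this copy: regret is taken as the accumulated gap between the optimal expected compound reward per round and the realized total compound reward, and violation is taken as the positive part of the accumulated shortfall of the total first-level reward below the threshold. All names are hypothetical.

```python
def cumulative_regret_and_violation(opt_value, threshold,
                                    compound_rewards, first_level_rewards):
    """Track regret and violation round by round (assumed definitions).

    opt_value: expected total compound reward per round of the optimal policy.
    compound_rewards[t]: realized total compound reward of the selected arms.
    first_level_rewards[t]: realized total first-level reward of those arms.
    """
    regret, shortfall = 0.0, 0.0
    regrets, violations = [], []
    for g_t, a_t in zip(compound_rewards, first_level_rewards):
        regret += opt_value - g_t
        shortfall += threshold - a_t
        regrets.append(regret)
        # Assumption: only a positive accumulated shortfall counts.
        violations.append(max(shortfall, 0.0))
    return regrets, violations
```

With this assumed definition, a round that exceeds the threshold can offset an earlier shortfall, which matches the description of the constraint as a requirement on average performance.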

4 Algorithm

In this section, we present our Constrained Upper Confidence Bound algorithm (Con-UCB), and describe its details in Algorithm 1. Let denote the historical information of chosen actions and observations up to time . Define the empirical average first-level reward and compound reward for each arm as


where is the number of times that arm is played before time . Define as in [?] where is a constant. In Con-UCB, we use the following Upper Confidence Bounds for the unknown rewards [?]:

Denote , , and , . In the initialization step of Algorithm 1, is set to , where is an input parameter, i.e., the allowed failure probability.
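The confidence radius and the Upper Confidence Bounds are missing from this copy, so the Python sketch below is only an assumption about their shape. It uses the variance-sensitive radius common in the bandits-with-knapsacks literature that this section cites, and clips the index at 1 because rewards are assumed to be normalized. The function names, the parameter `gamma`, and the treatment of unplayed arms are illustrative choices, and the exact constants in Con-UCB may differ.

```python
import math


def rad(nu, n, gamma):
    """Assumed confidence radius: sqrt(gamma * nu / n) + gamma / n."""
    return math.sqrt(gamma * nu / n) + gamma / n


def ucb(empirical_mean, num_plays, gamma):
    """Optimistic estimate of a mean reward in [0, 1] (hypothetical form).

    Arms that have never been played get the most optimistic value 1.0,
    which forces some initial exploration.
    """
    if num_plays == 0:
        return 1.0
    return min(1.0, empirical_mean + rad(empirical_mean, num_plays, gamma))
```

The radius shrinks as an arm is played more often, so the index approaches the empirical average and the optimism fades.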

Specifically, in each round, Con-UCB solves the optimization problem (5) to get the probabilistic selection vector (line 5). Notice that (5) is similar to the original constrained optimization problem (1) but uses the Upper Confidence Bounds to replace the unknown rewards. Then, is generated via a dependent rounding procedure. In line 7 we receive the two-level rewards and for arms in and update the empirical average rewards to get the Upper Confidence Bounds for the next round.
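To make the optimization step concrete, the sketch below solves a small instance by brute force in Python. It assumes that problem (5) maximizes the sum of selection probabilities weighted by the compound-reward bounds, subject to the weighted sum of first-level bounds being at least the threshold, the probabilities summing to the number of displayed links, and each probability lying in [0, 1]. That form is inferred from the text, since the formula itself is missing here. The enumeration uses the fact that a linear program attains its optimum at a vertex, and that the vertices of this feasible set are either feasible subsets or points on a one-swap edge where the constraint is tight. It is exponential in the pool size; a practical implementation would call an LP solver instead. All names are hypothetical.

```python
from itertools import combinations


def solve_selection_lp(g, a, num_links, threshold, tol=1e-12):
    """Brute-force vertex enumeration for the assumed selection LP.

    g, a: per-arm estimates of compound and first-level rewards.
    Returns (p, value), or (None, None) if no feasible vector exists.
    """
    k = len(g)
    best_p, best_val = None, None

    def consider(p):
        nonlocal best_p, best_val
        val = sum(x * gi for x, gi in zip(p, g))
        if best_val is None or val > best_val + tol:
            best_p, best_val = p, val

    for subset in combinations(range(k), num_links):
        chosen = set(subset)
        a_s = sum(a[i] for i in subset)
        if a_s < threshold - tol:
            continue  # infeasible vertex; reached via its feasible neighbors
        p = [1.0 if i in chosen else 0.0 for i in range(k)]
        consider(p)
        # Walk each one-swap edge toward an infeasible neighbor and stop
        # where the first-level constraint becomes tight.
        for i in subset:
            for j in range(k):
                if j in chosen:
                    continue
                a_t = a_s - a[i] + a[j]
                if a_t < threshold - tol:
                    theta = max(0.0, (a_s - threshold) / (a_s - a_t))
                    q = p[:]
                    q[i], q[j] = 1.0 - theta, theta
                    consider(q)
    return best_p, best_val
```

For example, with estimates `g = [0.2, 0.15, 0.09]`, `a = [0.2, 0.5, 0.9]`, one link, and threshold 0.5, the best vector mixes the first and third arms rather than picking the second arm alone, which illustrates why randomized policies are considered.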

1:, , , .
2:Selected arm set for each round.
3:Set , and .
4:for  do
5:      Solve the following linear optimization problem:
     If (5) has no feasible solution, set arbitrarily.
6:     Set
7:      Receive and for . Update
8:end for
9:function DependentRounding()
10:     while exists such that  do
11:         Find , such that .
12:         Set , .
13:         Update and as
15:     end while
16:     return .
17:end function
Algorithm 1 Constrained Upper Confidence Bound
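The formulas inside DependentRounding are missing from this copy of Algorithm 1. The Python sketch below follows the standard pairwise dependent rounding scheme from the literature the section cites, as an assumption about what those steps contain: two fractional entries are picked, and probability mass is shifted between them at random so that at least one becomes 0 or 1, while each entry keeps its expected value and the total stays fixed. The function signature, the tolerance, and the choice of the first two fractional entries are illustrative.

```python
import random


def dependent_rounding(p, rng=random, eps=1e-9):
    """Round a probability vector with integer sum to a set of arm indices.

    Each index i ends up in the returned set with probability p[i]
    (up to floating-point error), and the set size equals sum(p).
    """
    p = list(p)

    def fractional():
        return [i for i, x in enumerate(p) if eps < x < 1.0 - eps]

    frac = fractional()
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        alpha = min(1.0 - p[i], p[j])  # mass that can move from j to i
        beta = min(p[i], 1.0 - p[j])   # mass that can move from i to j
        # Branch probabilities are chosen so that E[p[i]] and E[p[j]]
        # are unchanged by the update.
        if rng.random() < beta / (alpha + beta):
            p[i] += alpha
            p[j] -= alpha
        else:
            p[i] -= beta
            p[j] += beta
        frac = fractional()
    return {i for i, x in enumerate(p) if x > 0.5}
```

A quick usage check: `dependent_rounding([1.0, 0.3, 0.7, 0.0], random.Random(0))` always returns two indices, always includes index 0, and includes index 1 in roughly 30% of repeated calls.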

5 Theoretical Analysis

In this section, we bound the regret and violation of Algorithm 1. We will make use of the concentration inequalities in the following lemmas.

Lemma 1 (Azuma-Hoeffding inequality [?]).

Suppose is a martingale and almost surely, then with probability at least , we have

Lemma 2 ([???]).

Consider i.i.d random variables in with expectation . Let denote their empirical average. Then, for any , with probability at least , we have

where .

The following lemma is a corollary of Lemma 2.

Lemma 3.

Define the empirical averages and as in (4). Then, for every and , with probability at least , we have

where . The same result holds between and .


For every and , applying Lemma 2, we have that, with probability at least ,

This implies that

The last inequality holds because . ∎

Based on the above lemmas, we obtain the following properties about Con-UCB.

Lemma 4.

By running Con-UCB for rounds with , with probability at least , the following results hold simultaneously:


See Appendix A. ∎

From Lemma 4, we can obtain the regret and violation bounds for Con-UCB.

Theorem 1.

For all , let . By running Con-UCB, we have with probability at least that,


We bound the regret and violation using (6) to (9), which were shown to hold with probability at least in Lemma 4.

From (6) we know for all , is a feasible solution of the optimization problem (5), i.e., . Then, for all , we have,


where the last inequality follows from (8). Combining (9) and (10), we have

On the other hand, since for all , (5) has a feasible solution , we know . Then with (7), we can get

This completes the proof. ∎

Figure 1: Experiment results on the Coupon-Purchase dataset. Panels: (a) cumulative regret, (b) cumulative violation, (c) cumulative compound reward, (d) reward/violation ratio.
Figure 2: Experiment results on the Ad-Clicks dataset. Panels: (a) cumulative regret, (b) cumulative violation, (c) cumulative compound reward, (d) reward/violation ratio.
Figure 3: Experiment results on the edX-Course dataset. Panels: (a) cumulative regret, (b) cumulative violation, (c) cumulative compound reward, (d) reward/violation ratio.

6 Experiments

We conduct experiments on three real-world datasets to evaluate the performance of Con-UCB. Two datasets, Coupon-Purchase [?] and Ad-Clicks [?], with coupons and ads respectively, are shown to have a two-level feedback structure in [?]. In particular, for each coupon in Coupon-Purchase, a user who clicks the link to the coupon can decide whether to purchase that coupon; for each ad in Ad-Clicks, a user who clicks the link to the ad can decide whether to request the corresponding seller’s phone number. Thus, for Coupon-Purchase (Ad-Clicks), the first-level feedback is the CTR of each coupon (the CTR of each ad) and the second-level feedback is the purchase rate of each coupon (the phone request rate of each ad). The third dataset, edX-Course, is extracted from the data on Harvard and MIT edX online courses [?]. In particular, for the online courses, we obtain course participation rates by normalizing the numbers of participants using min-max scaling and treat the course participation rates as the first-level feedback; we calculate course certification rates by dividing the numbers of certified participants by the numbers of participants, and treat the course certification rates as the second-level feedback.
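The edX-Course preprocessing described above can be sketched in a few lines of Python. The function and argument names are hypothetical, the toy numbers are not taken from the dataset, and the handling of a constant participant count is an added assumption.

```python
def edx_feedback(participants, certified):
    """Derive two-level feedback for courses from raw counts.

    First level: min-max scaled participant counts (participation rate).
    Second level: certified participants divided by participants.
    """
    lo, hi = min(participants), max(participants)
    span = hi - lo
    # Assumption: if every course has the same count, use rate 0.0.
    participation = [(n - lo) / span if span else 0.0 for n in participants]
    certification = [c / n for c, n in zip(certified, participants)]
    return participation, certification
```

Note that min-max scaling maps the least popular course to a participation rate of exactly 0 and the most popular one to exactly 1.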

We treat the coupons, ads, and courses as different sets of arms. To simulate the real-time two-level feedback of the coupons, ads, and courses, we generate the first-level reward of each arm (coupon, ad, and course) using a Bernoulli variable with mean taken from the first-level feedback (coupon CTR, ad CTR, and course participation rate) in the three datasets, and generate the second-level reward of each arm using another independent Bernoulli variable with mean taken from the second-level feedback (coupon purchase rate, ad phone request rate, and course certification rate).
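A minimal Python sketch of this reward simulation follows. The function name and the sample means are assumptions for illustration; the two draws are independent Bernoulli variables as described above, and the compound reward is their product.

```python
import random


def draw_two_level_reward(first_mean, second_mean, rng):
    """Simulate one round of two-level feedback for a single arm.

    Returns (first_level, second_level, compound), each 0 or 1.
    """
    first = 1 if rng.random() < first_mean else 0
    second = 1 if rng.random() < second_mean else 0
    return first, second, first * second


# Example: an arm with CTR 0.3 and second-level rate 0.5.
rng = random.Random(1)
sample = draw_two_level_reward(0.3, 0.5, rng)
```

Averaged over many rounds, the compound reward of this example arm concentrates around 0.3 × 0.5 = 0.15.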

For comparison purposes, we implement three state-of-the-art bandit algorithms that can select multiple arms at each round as baselines, i.e., CUCB [?], Exp3.M [?] and LExp [?]. Specifically, CUCB selects the top- arms with the highest UCB indices . Exp3.M selects arms using exponential weights on the compound rewards of arms, and LExp selects arms using exponential weights based on the Lagrangian function of reward and violation of arms.
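The selection rule of the CUCB baseline, as described above, reduces to picking the arms with the highest indices. The Python sketch below shows only that step, with hypothetical names; how the indices themselves are computed is left to the cited work.

```python
def select_top_arms(indices, num_links):
    """Return the positions of the num_links largest index values.

    Ties are broken in favor of the lower arm position because
    Python's sort is stable.
    """
    order = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return order[:num_links]
```

Because this rule looks only at compound-reward indices, it has no mechanism for respecting the first-level threshold.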

For the three datasets, we run the three algorithms together with Con-UCB for rounds with parameter settings as shown in Figures 1 to 3, respectively. In particular, the parameters of Exp3.M and LExp are set in accordance with Corollary 1 of [?] and Theorem 1 of [?], respectively. We compare the cumulative regrets of LExp and Con-UCB at each round , where the optimal policy is computed from the means of the two-level feedback taken from each dataset. (Note that the regrets of CUCB and Exp3.M are not considered since they both have an unconstrained optimal policy, and therefore have different regret definitions from LExp and Con-UCB.) We also compare the cumulative violations and the cumulative compound rewards of the four algorithms. To put things into perspective, we compare the ratios between the cumulative rewards and the cumulative violations of all the algorithms. Such ratios show how much reward an algorithm can gain for each unit violation it has made.
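The reward/violation ratio can be computed as in the short Python sketch below. The function name is hypothetical, and returning infinity when there is no violation is an assumption, since the text does not say how that case is handled.

```python
import math


def reward_violation_ratio(cum_reward, cum_violation):
    """Reward gained per unit of constraint violation.

    Assumption: with zero violation the ratio is reported as infinity.
    """
    if cum_violation <= 0:
        return math.inf
    return cum_reward / cum_violation
```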

The experiment results are averaged over runs of each algorithm and illustrated in Figures 1 to 3. Figure 1(a) shows that the cumulative regret of Con-UCB is much lower than that of LExp on the Coupon-Purchase dataset. This shows that Con-UCB can reduce the regret significantly by selecting arms using UCB-based optimization instead of exponential weights as in LExp. Figure 1(b) and Figure 1(c) show the cumulative violations and the cumulative rewards of the four algorithms. In particular, the Optimal-reward in Figure 1(c) shows the cumulative reward of the optimal policy at each round . As shown in Figure 1(c), CUCB and Exp3.M have larger cumulative rewards than Con-UCB and LExp, as both CUCB and Exp3.M neglect the threshold constraint and thereby blindly select arms that maximize the cumulative rewards. Therefore, both CUCB and Exp3.M incur huge cumulative violations as shown in Figure 1(b). Moreover, Con-UCB has a larger cumulative reward and a lower cumulative violation than LExp. This matches our theoretical results that Con-UCB has smaller regret as well as violation bounds than LExp. Figure 1(d) shows that Con-UCB achieves the largest reward/violation ratios among the four algorithms. This means that Con-UCB achieves the best tradeoff between rewards and violations and accumulates the most reward for each unit violation it incurs.

The experiment results on Ad-Clicks and edX-Course are similar to those on Coupon-Purchase. As shown in Figure 2 and Figure 3, Con-UCB achieves lower cumulative regret and higher cumulative rewards than LExp, and has the lowest cumulative violations and largest reward/violation ratios among all algorithms. Due to space limits, we omit the details.

In summary, our experiment results are consistent with our theoretical analysis and demonstrate the effectiveness of our Con-UCB algorithm in selecting arms with high cumulative rewards as well as low cumulative violations, thus achieving a good tradeoff between the reward and the violation.

7 Conclusion

In this paper, we consider the web link selection problem with multi-level feedback. We formulate it as a constrained multiple-play stochastic multi-armed bandit problem with multi-level reward. We design an efficient algorithm Con-UCB for solving the problem, and prove that for any given allowed failure probability , with probability at least , Con-UCB guarantees regret and violation bounds. We conduct extensive experiments on three real-world datasets to compare our Con-UCB algorithm with state-of-the-art context-free bandit algorithms. Experiment results show that Con-UCB balances regret and violation better than the other algorithms and outperforms LExp in both regret and violation.


Acknowledgments

This work is supported in part by the National Natural Science Foundation of China Grants 61672316, 61303195, the Tsinghua Initiative Research Grant, and the China Youth 1000-Talent Grant.


References

  • [Aggarwal et al., 2010] Gaurav Aggarwal, Elie Bursztein, Collin Jackson, and Dan Boneh. An analysis of private browsing modes in modern browsers. In Proceedings of the 19th USENIX conference on Security, 2010.
  • [Agrawal and Devanur, 2014] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of ACM EC, 2014.
  • [Anantharam et al., 1987] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987.
  • [Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [Azuma, 1967] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19:357–367, 1967.
  • [Badanidiyuru et al., 2013] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In Proceedings of FOCS, 2013.
  • [Bresler et al., 2016] Guy Bresler, Devavrat Shah, and Luis Filipe Voloch. Collaborative filtering with low regret. In Proceedings of ACM SIGMETRICS, 2016.
  • [Bubeck et al., 2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • [Cai et al., 2017] Kechao Cai, Kun Chen, Longbo Huang, and John C. S. Lui. Multi-level feedback web links selection problem: Learning and optimization. In Proceedings of ICDM, 2017.
  • [Chen et al., 2013] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework, results and applications. In Proceedings of ICML, 2013.
  • [Chuang and Ho, 2016] Isaac Chuang and Andrew Dean Ho. HarvardX and MITx: Four years of open online courses–fall 2012-summer 2016. SSRN, 2016.
  • [Deng et al., 2017] Alex Deng, Jiannan Lu, and Jonthan Litz. Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. In Proceedings of WSDM, 2017.
  • [Ding et al., 2013] Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. In AAAI, 2013.
  • [Elahi et al., 2016] Mehdi Elahi, Francesco Ricci, and Neil Rubens. A survey of active learning in collaborative filtering recommender systems. Computer Science Review, 20:29–50, 2016.
  • [Gandhi et al., 2006] Rajiv Gandhi, Samir Khuller, Srinivasan Parthasarathy, and Aravind Srinivasan. Dependent rounding and its applications to approximation algorithms. Journal of the ACM (JACM), 53(3):324–360, 2006.
  • [Kaggle, 2015] Kaggle. Avito context ad clicks, 2015.
  • [Kaggle, 2016] Kaggle. Coupon purchase data, 2016.
  • [Kleinberg et al., 2008] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of STOC, 2008.
  • [Kohavi et al., 2014] Ron Kohavi, Alex Deng, Roger Longbotham, and Ya Xu. Seven rules of thumb for web site experimenters. In Proceedings of SIGKDD, 2014.
  • [Komiyama et al., 2015] J. Komiyama, J. Honda, and H. Nakagawa. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of ICML, 2015.
  • [Kumar, 2015] Subodha Kumar. Optimization Issues in Web and Mobile Advertising: Past and Future Trends. Springer, 2015.
  • [Lagrée et al., 2016] Paul Lagrée, Claire Vernade, and Olivier Cappe. Multiple-play bandits in the position-based model. In Proceedings of NIPS, 2016.
  • [Langheinrich et al., 1999] Marc Langheinrich, Atsuyoshi Nakamura, Naoki Abe, Tomonari Kamba, and Yoshiyuki Koseki. Unintrusive customization techniques for web advertising. Computer Networks, 31(11):1259–1272, 1999.
  • [Li et al., 2010] L. Li, W. Chu, J. Langford, and R.E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of WWW, 2010.
  • [Li et al., 2016] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of SIGIR, 2016.
  • [Locatelli et al., 2016] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of ICML, 2016.
  • [Lohtia et al., 2003] Ritu Lohtia, Naveen Donthu, and Edmund K Hershberger. The impact of content and design elements on banner advertising click-through rates. Journal of Advertising Research, 43(4):410–418, 2003.
  • [Meng et al., 2016] Wei Meng, Byoungyoung Lee, Xinyu Xing, and Wenke Lee. Trackmeornot: Enabling flexible control on web tracking. In Proceedings of WWW, 2016.
  • [Mookerjee et al., 2016] Radha Mookerjee, Subodha Kumar, and Vijay S Mookerjee. Optimizing performance-based internet advertisement campaigns. Operations Research, 65(1):38–54, 2016.
  • [Pivazyan, 2004] Karen Arman Pivazyan. Decision making in multi-agent systems. Stanford University, 2004.
  • [Spooner, 2014] Jason Spooner. Why cost per acquisition is the only metric that really matters, 2014.
  • [Uchiya et al., 2010] T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In Proceedings of ALT, 2010.
  • [Wu et al., 2015] Huasen Wu, R Srikant, Xin Liu, and Chong Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Proceedings of NIPS, 2015.
  • [Xia et al., 2016] Yingce Xia, Tao Qin, Weidong Ma, Nenghai Yu, and Tie-Yan Liu. Budgeted multi-armed bandits with multiple plays. In Proceedings of IJCAI, 2016.
  • [Xu et al., 2015] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of SIGKDD, 2015.

Appendix A Proof of Lemma 4


We first show that (6) and (7) hold with probability at least . Notice that . From Lemma 3, by taking a union bound over all and all , we obtain that for all and all , with probability at least ,


which means


Recall that . Together with (12), we see that (6) holds.

To prove (7), we define a series of random variables as

We know and . Recall that denotes the historical information of chosen actions and observations up to time . Thus, by Lemma 1, we get, with probability at least ,


Similarly, with probability at least ,


Next we bound . Notice that (11) also implies that for all and ,

Let denote the time that arm is played for the th time. We have


where (15) follows from the Cauchy-Schwarz inequality and (16) follows from the fact that . Thus, (13), (14) and (16) together give

Repeating the same analysis, we can show that (8) and (9) also hold with probability at least . Then, we can prove the lemma using the union bound. ∎
